WO2011064898A1

WO2011064898A1 - Apparatus to enable time and area efficient access to square matrices and its transposes distributed stored in internal memory of processing elements working in simd mode and method therefore

Info

Publication number: WO2011064898A1
Application number: PCT/JP2009/070272
Authority: WO
Inventors: Hanno Lieske
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-11-26
Filing date: 2009-11-26
Publication date: 2011-06-03
Anticipated expiration: 2012-05-26
Also published as: JP5532132B2; JP2013512479A

Abstract

Disclosed is an apparatus having a plurality of processing elements which work in single instruction, multiple data (SIMD) mode. The processing elements have internal memory units. Instead of a full crossbar connection, only connections from each processing element to selected internal memory units are provided. These connections are selected to enable time and area efficient access only to the processing element's own internal memory unit, for direct access to a matrix stored in a distributed manner in the internal memory units, and to the internal memory units which are needed for the generation of the transposed matrix to perform a corner turn.

Description

DESCRIPTION

APPARATUS TO ENABLE TIME AND AREA EFFICIENT ACCESS TO SQUARE MATRICES AND ITS TRANSPOSES DISTRIBUTED STORED IN INTERNAL MEMORY OF PROCESSING ELEMENTS WORKING IN SIMD MODE AND METHOD THEREFORE

TECHNICAL FIELD

[0001]

The present invention relates to a SIMD processor array. More particularly, it relates to a connection apparatus which may be suitably adapted to enable time and area efficient access from processing elements (PEs) to associated internal memory (IMEM) units, inclusive of non-neighbored IMEM units.

BACKGROUND ART

[0002]

There have been proposed many processors operating in single instruction, multiple data (SIMD) mode (see Patent Document 1). To support, in such processors, access from one processing element (PE) to data stored in the internal memory (IMEM) of another PE, different types of inter-PE data access have been introduced. Fig. 1 is a diagram showing the typical configuration of a SIMD processor array.

[0003]

One way to provide data which is stored in the IMEM 101 of a PE 100, for example PE00, to another PE, for example PE01, is to load the data from the IMEM of PE00 into PE00 and afterwards move the data from PE00 to PE01 over the inter-PE communication channels 102 (clockwise) or 103 (counter-clockwise), as disclosed in Patent Document 2. Patent Document 2 discloses a system in which a plurality of PEs are connected with each other over a ring bus system used as communication channels between the PEs. This system allows an inexpensive and efficient way to perform data exchange between two neighbored PEs, because only one ring bus, or two for bidirectional transfer, is needed, and data can be exchanged between two neighbored PEs in one clock cycle.

[0004]

An alternative way to enable access to the IMEM data of neighbored PEs is disclosed in Patent Document 3. Fig. 2 illustrates the configuration of Patent Document 3. As shown in Fig. 2 for a configuration of 8 PEs, data from the IMEMs of neighbored PEs is routed to each PE over direct connections 200. A multiplexer 201 selects the data accessed by each PE from among the IMEM data. This is an inexpensive and fast way to access neighbored IMEM data, because only the additional multiplexers 201 and the connections between the PEs and the neighbored PEs' IMEMs are needed.

[0005]

The following analysis of the related art is given by the present invention.

While implementations of both related arts work efficiently and with limited area overhead for data exchange between neighbored PEs, they both have drawbacks when data has to be exchanged between PEs which are not neighbors. In Patent Document 2, while the area does not increase, the number of clock cycles during which the data travels over the communication channels increases in proportion to the distance between the PEs involved in the data exchange, so that this solution is no longer fast for data transfers between distant PEs. For example, in a system equipped with 16 PEs which are connected with each other over two counter-rotating ring buses, up to 8 clock cycles are needed to guarantee communication between any IMEM and any PE. For a corner turn of a 16 times 16 element matrix, this results in a transfer delay of 64 clock cycles in addition to the 32 clock cycles for reading from and writing to IMEM, so 96 clock cycles overall.
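The worst-case hop count quoted above follows from the ring topology. The following is a minimal sketch, assuming one hop per clock cycle; the helper name `ring_hops` is hypothetical, and the 64-cycle and 32-cycle figures are taken from the text as given rather than derived.

```python
def ring_hops(src: int, dst: int, n_pe: int) -> int:
    """Fewest hops between two PEs on a pair of counter-rotating ring buses."""
    clockwise = (dst - src) % n_pe
    return min(clockwise, n_pe - clockwise)


N_PE = 16  # number of PEs in the example system

# Worst case over all destinations, seen from PE 0 (the ring is symmetric).
worst_case = max(ring_hops(0, dst, N_PE) for dst in range(N_PE))

TRANSFER_DELAY = 64  # clock cycles, as stated in the text
IMEM_ACCESS = 32     # clock cycles for reading and writing, as stated
total_cycles = TRANSFER_DELAY + IMEM_ACCESS
```

Under these assumptions `worst_case` evaluates to 8 and `total_cycles` to 96, in agreement with the figures in the paragraph.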

[0006]

In Patent Document 3, the delay can be held constant. However, the area of the connection crossbar increases as the neighbored PE to IMEM connections are extended, so that, in the general case in which access from every PE to every IMEM is to be supported, a full crossbar connection is required, such as that used for the inter-PE connection in the Multiprocessor System-on-Chip disclosed in Non-Patent Document 1, or that used inside the register read port connection disclosed in Patent Document 4. This is no longer an inexpensive solution.

[0007]

[Patent Document 1]

U.S. Pat. No. 3,537,074

[Patent Document 2]

International Pub. No. WO 2008/108005 A1

[Patent Document 3]

U.S. Patent Pub. No.: US2008/320273 A1

[Patent Document 4]

U.S. Patent Pub. No.: US2005/0108503 A1

[Non-Patent Document 1]

M.Z. Urfianto, T. Isshiki, A.U. Khan, D. Li, H. Kunieda, "A Multiprocessor System on Chip Architecture with Enhanced Compiler Support and Efficient Interconnect," in IP/SoC 2006, Grenoble, France, Dec. 2006

DISCLOSURE OF THE INVENTION

PROBLEMS TO BE SOLVED BY THE INVENTION

[0008]

The analysis of the related art by the present invention described above is summarized as follows.

The first related art, described with reference to Fig. 1, which accesses data from non-neighbored IMEM units for a time and area efficient generation of the transposed matrix to perform a corner turn execution, results in large transfer delays when transferring the data over the inter-PE communication channels.

[0009]

The second related art, described with reference to Fig. 2, which accesses data from non-neighbored IMEM units for a time and area efficient generation of the transposed matrix to perform a corner turn execution, results in large area needs for a full crossbar connection between every PE and every IMEM.

[0010]

Accordingly, it is an object of the present invention to provide an apparatus and a method which enable time and area efficient access from each PE to IMEMs in a SIMD processor array.

MEANS TO SOLVE THE PROBLEMS

[0011]

In accordance with one aspect of the present invention, there is provided a connection apparatus through which each PE is given access only to selected IMEM units. The selection is made so as to enable time and area efficient access only to the PE's own IMEM unit, for direct access to the IMEM units storing a matrix in a distributed manner, and to the IMEM units which are needed for the generation of the transposed matrix to perform a corner turn execution.

MERITORIOUS EFFECTS OF THE INVENTION

[0012]

According to the present invention, in place of a full crossbar connection, there are provided only connections from each PE to selected IMEM units which reduces the necessary cell and net area by around 85%. IMEM connections are selected in a way to enable time and area efficient access to the own IMEM for direct access to a matrix which is stored in a distributed manner in a plurality of IMEM units and to IMEM units which are needed for the generation of the transposed matrix to perform a corner turn execution.

Still other features and advantages of the present invention will become readily apparent to those skilled in this art from the following detailed description in conjunction with the accompanying drawings wherein only exemplary embodiments of the invention are shown and described, simply by way of illustration of the best mode contemplated of carrying out this invention. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the invention. Accordingly, the drawing and description are to be regarded as illustrative in nature, and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013]

Fig. 1 is a diagram showing a configuration of the PE to IMEM interconnection of Reference 2;

Fig. 2 is a diagram showing a configuration of the PE to IMEM interconnection of Reference 3;

Fig. 3 is a diagram showing a configuration of an exemplary embodiment of the present invention;

Figs. 4A to 4D are diagrams showing the separated execution of a 2 dimensional transform;

Fig. 5 is a diagram showing a configuration of the connection apparatus;

Fig. 6 is a diagram showing a configuration of the read control function unit for N=16, m=4;

Fig. 7 is a diagram showing the input and output connections of the read control function unit;

Fig. 8 is a diagram showing a configuration of a selector;

Fig. 9 is a diagram showing a configuration of an address generator 0;

Fig. 10 is a diagram showing a configuration of an address generator 1;

Fig. 11 is a diagram showing a configuration of an address generator 2;

Fig. 12 is a diagram showing a configuration of an address generator 3;

Fig. 13 is a diagram showing a configuration of a 4 x 16byte swap unit;

Fig. 14 is a diagram showing a configuration of a 4 x 4byte transpose unit;

Fig. 15 is a diagram showing a configuration of a write control function unit;

Fig. 16 shows the input and output connections of the write control function unit;

Fig. 17 is a flowchart of the operation of a read control function unit;

Fig. 18 shows the macro block sub division with N=16 times N matrix elements (bytes) grouped to memory elements each with m=4 matrix elements (bytes), memory element sub blocks and memory element sub block rows;

Figs. 19A to 19D show the subdivision of the matrix of level 2 into 4 matrices of level 1;

Figs. 20A to 20D show the assignment of partial address offset values to each block on each level;

Fig. 21 shows the generation of the address offset matrix with the address offset values for each memory element sub block from each level;

Fig. 22 shows the matrix loaded from IMEM with correct vertical position of each memory element sub block;

Fig. 23 shows the matrix with correct horizontal and vertical position for each memory element sub block after pair-wise swapping;

Fig. 24 shows an example memory element sub block corner turn;

Figs. 25A to 25D show the matrix stored in N/m (=4) PE registers after execution of N/m (=4) read instructions;

Fig. 26 is a flowchart of the operation of a write control function unit;

Fig. 27 shows the matrix data stored in IMEM after 4 write transfer instructions;

Fig. 28 shows the control signal setting to perform a corner turn of a matrix with N (= 16) times N pixel and a data bit width of m (= 4) pixel;

Fig. 29 shows a matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 1 in read operation;

Fig. 30 shows the matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 2 in read operation;

Fig. 31 shows the matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 3 in read operation;

Fig. 32 shows the matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 4 in read operation;

Fig. 33 shows the matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 1 in write operation;

Fig. 34 shows the matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 2 in write operation;

Fig. 35 shows the matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 3 in write operation;

Fig. 36 shows the matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 4 in write operation;

Fig. 37 shows a matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 1 in read operation;

Fig. 38 shows the matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 2 in read operation;

Fig. 39 shows the matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 1 in write operation;

Fig. 40 shows the matrix transposition process for 8x8 matrix by the connection apparatus at clock cycle 2 in write operation;

Fig. 41 shows a matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 1 in read operation;

Fig. 42 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 2 in read operation;

Fig. 43 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 3 in read operation;

Fig. 44 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 4 in read operation;

Fig. 45 shows a matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 5 in read operation;

Fig. 46 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 6 in read operation;

Fig. 47 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 7 in read operation;

Fig. 48 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 8 in read operation;

Fig. 49 shows a matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 1 in write operation;

Fig. 50 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 2 in write operation;

Fig. 51 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 3 in write operation;

Fig. 52 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 4 in write operation;

Fig. 53 shows a matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 5 in write operation;

Fig. 54 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 6 in write operation;

Fig. 55 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 7 in write operation;

Fig. 56 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 8 in write operation;

Fig. 57 shows a matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 1 in read operation;

Fig. 58 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 2 in read operation;

Fig. 59 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 3 in read operation;

Fig. 60 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 4 in read operation;

Fig. 61 shows a matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 1 in write operation;

Fig. 62 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 2 in write operation;

Fig. 63 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 3 in write operation;

Fig. 64 shows the matrix transposition process for 16x16 matrix by the connection apparatus at clock cycle 4 in write operation;

Fig. 65 shows another example of the configuration of the read control function unit of the connection apparatus;

Fig. 66 shows another example of the configuration of the write control function unit of the connection apparatus;

Fig. 67 shows another example of the configuration of the read control function unit of the connection apparatus;

Fig. 68 shows still another example of the configuration of the write control function unit of the connection apparatus;

Fig. 69 shows another example of the configuration of the read control function unit of the connection apparatus;

Fig. 70 shows another example of the configuration of the write control function unit of the connection apparatus;

Fig. 71 shows another example of the configuration of the read control function unit of the connection apparatus; and

Fig. 72 shows another example of the configuration of the write control function unit of the connection apparatus.

EXPLANATIONS OF SYMBOLS

[0014]

100 : Processor element (PE)

101 : Internal memory of each PE (IMEM)

102 : Pipelined ring bus in clockwise direction

103 : Pipelined ring bus in counter-clockwise direction

104 : Data connection between PE and own IMEM

200: Data connection between PE and neighbored IMEM

201: Multiplexer to select the connection path

300: Group of PE

301 : Group of IMEM

302: Connection apparatus

303 : Control apparatus

400: Pixel of picture

401: Sub block with 4 times 4 pixel

402: Macro block with 16 times 16 pixel

403: Filter in horizontal direction

404: Sub block with example pixel value

405: Corner turn, a transpose operation exchanging the pixel position from position (x,y) to position (y,x)

406: Sub block with transposed pixel values for the example input

407: Filter in vertical direction

408 : 2 dimensional filtered macro block

500 : Connection apparatus

501: Read control function

502: Write control function

503: Path 0 from the read control function providing a path from each PE to the own IMEM

504: Path 1 from the read control function providing a path from each PE to an IMEM specified by the unit R_TRANS 506

505: Selection means

506: Unit R_TRANS

507: Path 0 from the write control function providing a path from each PE to the own IMEM

508: Path 1 from the write control function providing a path from each PE to an IMEM specified by the unit W_TRANS

509: Selection means

510: Unit W_TRANS

600: 3bit counter used as control signal and for address offset generation, lower 2 bits specifying the currently processed memory element sub block row

601: Selector

602: Group of 4 IMEM units

603: IMEM address generator

604: Inverter

605: 4x16byte swap unit

606: 4x4byte transpose unit

607: PE register file address bits

608: Group of 4 PE register files inside 4 PE

700: Address for a read access to IMEM

701: Address for a write access to IMEM

702: Counter bit 2 used for correct IMEM address selection

703: Selection means

800: Address generator 0

900: Address generator 1

901: Selection means to select between the counter bit 0 and the inverted counter bit 0 by evaluating the PATH information

1000: Address generator 2 generating IMEM address by appending counter bit 0 and counter bit 1 or the inverted counter bit 1 to the memory base address

1001: Selection means to select between the counter bit 1 and the inverted counter bit 1 by evaluating the PATH information

1100: Address generator 3 generating IMEM address by appending counter bit 0 or the inverted counter bit 0 and counter bit 1 or the inverted counter bit 1 to the memory base address

1101: Selection means to select between the counter bit 1 and the inverted counter bit 1 by evaluating the PATH information

1102: Selection means to select between the counter bit 0 and the inverted counter bit 0 by evaluating the PATH information

1200: 4x16byte data swap unit used for swapping the horizontal position of the memory element sub blocks in a memory element sub block row

1201: First multiplexer stage, responsible, in case the counter bit 0 is 1, for swapping the sub block data of 2 sub blocks where only the horizontal address bit 0 is different

1202: Second multiplexer stage, responsible, in case the counter bit 1 is 1, for swapping the sub block data of 2 sub blocks where only the horizontal address bit 1 is different

1300: 4x4byte transpose unit performing a memory element sub block corner turn

1301: 16byte input from the 4x16byte swap unit arranged as a memory element sub block with 4 horizontal words of 4 bytes each in vertical direction

1302: Memory element sub block with 4 horizontal words of 4 bytes each in vertical direction arranged as 16byte output vector for transfer to the PE register file

1303: Output multiplexer to select between the direct output of the 16byte input vector or the transposed memory element sub block arranged as a 16byte output vector

1400: 3bit counter used as control signal and for address offset generation, lower 2 bits specifying the currently processed memory element sub block row

1401: Selector selecting whether the read or write address is sent to the IMEM units

1402: Group of 4 IMEM units receiving the same address to write one memory element sub block

1403: IMEM address generator for each memory element sub block of the memory element sub block row

1404: Inverter used for generation of inverted counter bits to generate the correct IMEM address

1405 : PE register file address bits

1406 : Group of 4 PE register files inside 4 PE

1500: Flowchart of the read control function unit: Distinguish which data is passed to the PE register file by evaluating the variable PATH.

1501: Flowchart of the read control function unit: Calculating (N/m) IMEM addresses by combining IMEM base address and the lower log2(N/m) counter bits.

1502: Flowchart of the read control function unit: Sending addresses to IMEM and receiving N/m times m² bytes per memory element sub block row.

1503: Flowchart of the read control function unit: Calculating (N/m) IMEM addresses by combining IMEM base address and the lower log2(N/m) counter and inverted counter bits.

1504: Flowchart of the read control function unit: Send (N/m) addresses to IMEM, m neighbored IMEM receive 1 address.

1505: Flowchart of the read control function unit: Receive (N/m) times m² bytes.

1506: Flowchart of the read control function unit: Set index i, which is running over the address bit positions, to zero.

1507: Flowchart of the read control function unit: Test counter bit at position i. If bit at position i is zero continue with 1509. Otherwise continue with 1508.

1508: Flowchart of the read control function unit: Swap pair-wise the position in the sub block row, a pair of 2 sub blocks is formed by the sub blocks where only bit i differs in the index.

1509: Flowchart of the read control function unit: Increase index i.

1510: Flowchart of the read control function unit: As long as i is lower than log2(N/m) jump to 1507. Otherwise continue with 1511.

1511: Flowchart of the read control function unit: Rearrangement of m² bytes as m horizontal neighbored elements with m bytes in vertical direction.

1512: Flowchart of the read control function unit: Transpose m times m matrix by changing each byte from position (x, y) to position (y, x).

1513: Flowchart of the read control function unit: Store in all N PEs 1 memory element with m bytes at address specified by lower log2(N/m) CNT bits.

1600 : One Matrix element equal to one byte

1601 : Macro block with 16 times 16 matrix elements

1602: IMEM number

1603: Memory element with m = 4 vertical matrix elements

1604: Memory element sub block with 4 horizontal neighbored memory elements

1605: Memory element sub block row with 4 horizontal neighbored memory element sub blocks

1700 : Level 1 matrix with 8 times 8 matrix elements

1800: Diagonal sub block of level 1 matrix

1801 : Anti-diagonal sub block of level 1 matrix

1802 : Anti-diagonal block of level 2 matrix

1900 : Level 1 partial address offset matrix

1901 : Level 2 partial address offset matrix

1902: Address offset matrix

2400 : Flowchart of the write control function unit: Read from all N PEs 1 memory element with m bytes from address specified by lower log2(N/m) CNT bits.

2401 : Flowchart of the write control function unit: Distinguish the way the address calculation is done by evaluating the variable PATH.

2402: Flowchart of the write control function unit: Calculating (N/m) IMEM addresses by combining IMEM base address and the lower log2(N/m) counter bits.

2403 : Flowchart of the write control function unit: Calculating (N/m) IMEM addresses by combining IMEM base address and the lower log2(N/m) counter and inverted counter bits.

2404: Flowchart of the write control function unit: Sending (N/m) addresses and N/m times m² bytes per memory element sub block row to IMEM

PREFERRED MODES OF THE PRESENT INVENTION

[0015]

The preferred modes of the present invention will now be described with reference to the drawings. Compared to the first related art described with reference to Fig. 1, the way proposed by the present invention of accessing data from non-neighbored IMEM units, for a time and area efficient generation of the transposed matrix to perform a fast corner turn, significantly reduces the time required for generating the transposed matrix while only slightly increasing the area requirements.

[0016]

Compared to the second related art described with reference to Fig. 2, the way proposed by the present invention of accessing data from non-neighbored IMEM units, for a time and area efficient generation of the transposed matrix to perform a fast corner turn, significantly reduces the area required for generating the transposed matrix while only slightly increasing the required time.

[0017]

By enabling, over the connection apparatus, access between a PE and selected IMEM units, a reduction of the necessary clock cycles as compared to the first related art and a reduction of the necessary area as compared to the second related art can be achieved.

[0018]

The connection apparatus according to an exemplary embodiment of the present invention enables time and area efficient access to IMEM units, providing direct access to an N times N pixel matrix stored in a distributed manner in the IMEM units, and access to the IMEM units which are needed to generate the transposed N times N pixel matrix for a corner turn execution.

[0019]

In the following, the connection apparatus and a method for a specific algorithm used in an exemplary embodiment will be described.

[0020]

Fig. 3 is a diagram showing the configuration of a first exemplary embodiment of the present invention. In Fig. 3, there is shown a PE to IMEM interconnection. Referring to Fig. 3, there are provided 16 PEs 300, 16 IMEMs 301, a connection apparatus 302 arranged between the array of PEs 300 and the array of IMEMs 301, and a control apparatus 303 controlling the connection apparatus 302. As an example of an algorithm, a corner turn, which changes the position (x, y) of each element of a matrix to (y, x), is executed on a matrix of N = 16 times N = 16 bytes with a data width of m = 4 bytes.
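The corner turn itself is an ordinary matrix transposition. As a purely functional reference (not a model of the connection apparatus), it can be sketched as follows; the function name `corner_turn` and the test pattern are illustrative assumptions.

```python
N = 16  # matrix dimension used in the exemplary embodiment


def corner_turn(mat):
    """Return the transposed matrix: the element at (x, y) moves to (y, x)."""
    n = len(mat)
    return [[mat[x][y] for x in range(n)] for y in range(n)]


# Illustrative 16 x 16 byte matrix; every value fits in one byte (0..255).
matrix = [[16 * y + x for x in range(N)] for y in range(N)]
turned = corner_turn(matrix)
```

Applying `corner_turn` twice returns the original matrix, which is a convenient check for any hardware model of the operation.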

[0021]

The corner turn of such a matrix is executed very often in image and video processing algorithms, such as JPEG, MPEG1, MPEG2, MPEG4, H.261, H.263, H.264 and so forth. The corner turn is executed, for example, when a two-dimensional Fourier transform (FFT) or discrete cosine transform (DCT) is processed on a macro block in a separated way, by performing a transpose operation on the output of the first transform step in the horizontal direction, as shown in Fig. 4A to Fig. 4D. Fig. 4A to Fig. 4D each show a macro block 402, which consists of 16 times 16 pixel data 400 and is divided into 16 sub blocks 401. After the horizontal filter 403 is performed on the vertical edges, the corner turn 405 is performed, shown in Fig. 4C for the upper left input sub block 404, which is turned into the output sub block 406. Finally, the vertical filter 407 is performed on the horizontal edges, which delivers the output pixel values 408 of the 2-dimensional filter execution, as shown in Fig. 4D. In the following, the connection apparatus equipped with the function of transposition for a corner turn will be described in detail.
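The separated execution can be illustrated with a small sketch. It assumes a placeholder 1-D filter (each sample plus its left neighbour), which stands in for the DCT, FFT or filter steps named above and is not taken from the source; the final corner turn back to the original orientation is likewise an assumption made so that the result can be compared against a direct column-wise filter.

```python
def corner_turn(mat):
    """Transpose: the element at (x, y) moves to (y, x)."""
    return [list(col) for col in zip(*mat)]


def filter_row(row):
    """Placeholder 1-D filter (assumption): each sample plus its left neighbour."""
    return [row[i] + (row[i - 1] if i > 0 else 0) for i in range(len(row))]


def filter_rows(mat):
    """Apply the 1-D filter in the horizontal direction to every row."""
    return [filter_row(row) for row in mat]


def filter_2d_separated(mat):
    """Horizontal pass, corner turn, second horizontal pass, corner turn back."""
    horizontal = filter_rows(mat)
    turned = corner_turn(horizontal)
    vertical = filter_rows(turned)  # acts on the original columns
    return corner_turn(vertical)


def filter_cols_direct(mat):
    """Reference: the same 1-D filter applied directly in the vertical direction."""
    return [[mat[y][x] + (mat[y - 1][x] if y > 0 else 0)
             for x in range(len(mat[0]))] for y in range(len(mat))]


block = [[4 * y + x for x in range(4)] for y in range(4)]
result = filter_2d_separated(block)
```

The point of the sketch is that only a row-wise filter is needed: the corner turn converts the vertical pass into a second horizontal pass.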

[0022]

Fig. 5 is a diagram showing the configuration of the connection apparatus according to the first exemplary embodiment. A connection apparatus 500 in Fig. 5 corresponds to the connection apparatus 302 in Fig. 3. Referring to Fig. 5, the connection apparatus 500 includes a read control function unit 501 and a write control function unit 502. The input and output data signals are connected on one side to the IMEM group and on the other side to the PE group. In order to select and execute the function appropriately, a number of additional control signals are needed. In the read control function unit 501, the data path from the IMEM group is divided into 2 paths, that is, path 0 and path 1. The first path (path 0) is connected to a first input of a selector 505. The second path (path 1) includes a read transformation unit (R_TRANS) 506, an output of which is connected to a second input of the selector 505. Data on the path selected by the selector 505 is supplied to the PE group.

[0023]

In the write control function unit 502, the data path from the PE group is divided into 2 paths, that is, path 0 and path 1. The first path (path 0) is connected to a first input of a selector 509. The second path (path 1) includes a write transformation unit (W_TRANS) 510, an output of which is connected to a second input of the selector 509. Data on the path selected by the selector 509 is supplied to the IMEM group. The first path (path 0) in the read control function unit 501 and the first path (path 0) in the write control function unit 502 provide a direct connection between the input side and the output side to enable a connection between each PE and its own IMEM. The read transformation unit (R_TRANS) 506 provided on the second path (path 1) in the read control function unit 501 and the write transformation unit (W_TRANS) 510 provided on the second path (path 1) in the write control function unit 502 are in charge of changing the data positions as necessary to create the transposed matrix, as will be shown in detail in the following.

[0024]

Fig. 6 is a block diagram showing the configuration of the read control function unit with the number of matrix elements N equal to 16 and with the number of matrix elements stored in one IMEM memory element equal to 4. The read control function unit has the input and output connections shown in Fig. 7.

[0025]

An input memory read base address MEM_Base_Address is used to calculate the IMEM read address. A two bit counter value, that is, the lower two bits generated in the counter unit (3bit counter) 600, and its inverted value generated by an inverter 604 are used as offset for the IMEM read address calculation inside the address generator 603 and for internal control. The lower two bits of the 3bit counter 600 are also referred to by the signal name CNT[1:0] (see Fig. 7). The output address from the address generator 603 to IMEM, Read_address, is used to read out the IMEM data. Read_data from IMEM is, after passing the processing stages inside the read control function unit, transferred as Write_data to the PE register file PE-RF 608 and stored there at the address 607 specified by the lower two bits of the 3bit counter 600. Note that, for simplification, the IMEM units in this exemplary embodiment do not introduce additional delay when being read, so the data is read in the same clock cycle in which the address is provided.
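The address calculation can be sketched as follows, based on the descriptions of address generators 0 to 3 in the explanation of symbols: generator 0 uses the plain counter bits, generator 1 may invert counter bit 0, generator 2 may invert counter bit 1, and generator 3 may invert both, depending on PATH. The sketch assumes that "appending" means placing the two offset bits below the base address, and that the inversion is active when PATH is 1; the function name `imem_address` is hypothetical.

```python
def imem_address(base: int, cnt: int, generator: int, path: int) -> int:
    """IMEM address for address generator 0..3 (sketch under stated assumptions).

    base      -- memory base address (MEM_Base_Address)
    cnt       -- counter value; only the lower two bits CNT[1:0] are used
    generator -- index of the address generator, 0 to 3
    path      -- 0: own-IMEM access, 1: corner turn access (assumed polarity)
    """
    offset = cnt & 0b11
    if path == 1:
        # Inverting counter bit i is the same as XOR with a 1 at bit i;
        # generator g inverts exactly the bits that are set in g.
        offset ^= generator
    # Assumption: the two offset bits are appended as the least significant bits.
    return (base << 2) | offset


# Offsets seen by the four generators for CNT[1:0] = 1 on both paths.
path0 = [imem_address(5, 1, g, 0) & 0b11 for g in range(4)]
path1 = [imem_address(5, 1, g, 1) & 0b11 for g in range(4)]
```

With these assumptions, all four generators produce the same offset on path 0, while on path 1 each group of 4 IMEM units is addressed with a different offset in every clock cycle.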

[0026]

The signal PATH supplied from the control apparatus 303 in Fig. 3 is used to control the processing inside the read control function unit. After calculation of the 4 addresses for the 16 internal memories (IMEM 0 to IMEM F), the addresses are sent over selectors 601 to the internal memories (IMEM 0 to IMEM F) 602, and 4 times 16byte (4 times 4x4 byte) data words are received from the 16 internal memories (IMEM 0 to IMEM F). The 4 times 16byte data words are passed to the 4x16byte swap unit 605, where the positions of the memory element sub blocks are horizontally interchanged with each other. In this exemplary embodiment, one sub block corresponds to a 4x4 sub-matrix, with each matrix element having a size of 1 byte.

[0027]

More specifically, the 4x4byte read data from IMEM 0 to 3 correspond to a first 4x4 sub-matrix, the 4x4byte read data from IMEM 4 to 7 correspond to a second 4x4 sub-matrix, the 4x4byte read data from IMEM 8 to B correspond to a third 4x4 sub-matrix, and the 4x4byte read data from IMEM C to F correspond to a fourth 4x4 sub-matrix. The first to fourth 4x4 sub-matrices are supplied to the 4x16 byte swap unit 605, which performs swapping between each preset pair of the 4x4 sub-matrices, except for the case wherein the first to fourth 4x4 sub-matrices each constitute diagonal elements. That is, the 4x16 byte swap unit 605 causes the first to fourth 4x4 sub-matrices which constitute diagonal elements to pass therethrough without swapping.

[0028]

The four 4x4 sub-matrices output from the 4x16byte swap unit 605 are respectively supplied to four 4x4byte transpose units 606. Each 4x4byte transpose unit 606 performs the transposition of one 4x4 sub-matrix. In the 4x4byte transpose unit 606, after each sub-matrix, that is, each 16 byte data word, is arranged as 4 horizontal memory elements of 4 byte vertical width, the position of each byte is transposed from position (x, y) to position (y, x). The above described swap and transpose operations performed by the 4x16byte swap unit 605 and the 4x4byte transpose units 606 in the read control function unit will be described in more detail on a clock cycle basis, with reference to Fig.57 to Fig.61.

[0029]

Fig. 8 shows the block diagram of the selector for the IMEM input address. The selector in Fig.8 constitutes the selector 601 in Fig.6 and a selector 1401 which will be described later with reference to Fig. 15. Each of the selectors 601 in Fig. 6 receives Read_address from the read control function unit and Write_address from the write control function unit, which is not shown in Fig.6 and will be described later with reference to Fig. 15, and selects one of Read_address and Write_address to provide the selected address and the Read/Write control signal R/W to the IMEMs of a corresponding group. Referring to Fig.8, the selection by a selector 703 depends on the transfer direction of the data defined by the upper bit CNT[2] 702 of the 3 bit counter unit, which is "0" for read and "1" for write. In the case of read, the address 700 from the read control function unit is used; otherwise the address 701 from the write control function unit is used.
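
The selection described for Fig. 8 can be sketched in a few lines of Python. This is only an illustrative model, not the disclosed circuit: the function name `select_imem_address` and the returned "R"/"W" strings are hypothetical, and the 3 bit counter is assumed to be passed as an integer whose bit 2 is CNT[2].

```python
def select_imem_address(cnt3, read_address, write_address):
    """Illustrative model of the IMEM address selector (hypothetical helper).

    cnt3 is the value of the 3 bit counter; its upper bit CNT[2] gives the
    transfer direction: 0 selects the read address, 1 the write address.
    """
    transfer_is_write = (cnt3 >> 2) & 1
    if transfer_is_write:
        return write_address, "W"
    return read_address, "R"


# Counter value 0b001 is a read cycle, 0b101 is a write cycle.
print(select_imem_address(0b001, 0x10, 0x20))
print(select_imem_address(0b101, 0x10, 0x20))
```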

[0030]

Fig. 9 shows the configuration of the address generator 0 (800), which corresponds to the address generator 0 (603) in Fig. 6. Referring to Fig.9, in the address generator 0, the output address is generated by combining the memory base address ((N-2) bit) and the lower 2 counter bits CNT[1] and CNT[0].

[0031]

Fig. 10 shows the configuration of the address generator 1 (900), which corresponds to the address generator 1 in Fig. 6. Referring to Fig. 10, in the address generator 1, the output address (N bit) is generated by combining the memory base address ((N-2) bit), bit 1 (CNT[1]), and one of bit 0 (CNT[0]) and inverted bit 0 (Inv(CNT[0])) selected by a selector 901 depending on the signal PATH.

[0032]

Fig. 11 shows the configuration of the address generator 2 (1000), which corresponds to the address generator 2 (603) in Fig. 6. Referring to Fig. 11, in the address generator 2, the output address (N bit) is generated by combining the memory base address ((N-2) bit), one of bit 1 (CNT[1]) and inverted bit 1 (Inv(CNT[1])) selected by a selector 1001 depending on the signal PATH, and bit 0 (CNT[0]).

[0033]

Fig. 12 shows the configuration of the address generator 3 (1100), which corresponds to the address generator 3 (603) in Fig. 6. Referring to Fig. 12, in the address generator 3, the output address (N bit) is generated by combining the memory base address ((N-2) bit), one of bit 1 (CNT[1]) and inverted bit 1 (Inv(CNT[1])) selected by a selector 1101 depending on the signal PATH, and one of bit 0 (CNT[0]) and inverted bit 0 (Inv(CNT[0])) selected by a selector 1102 depending on the signal PATH.
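
The four address generators differ only in which counter bits they may invert, which can be summarized as an exclusive OR of CNT[1:0] with the generator index. The following Python sketch illustrates this under two assumptions that are not stated in this passage: a non-zero PATH selects the inverted bits (consistent with the flowchart, where PATH equal to zero is the non-transposed branch), and the base address is concatenated as the upper bits. The name `generator_address` is hypothetical.

```python
def generator_address(base, cnt, generator, path, offset_bits=2):
    """Illustrative address for address generator `generator` (0..3).

    base      : memory base address (upper bits)
    cnt       : lower counter bits CNT[1:0]
    path      : assumed 0 = non-transposed path, non-zero = transposed path
    Generator 1 may invert CNT[0], generator 2 CNT[1], generator 3 both,
    generator 0 neither; this equals cnt XOR generator on the transposed path.
    """
    mask = (1 << offset_bits) - 1
    low = cnt & mask
    if path:
        low ^= generator & mask
    return (base << offset_bits) | low


# Lower address bits per generator for CNT[1:0] = "01":
print([generator_address(0, 0b01, g, path=1) for g in range(4)])
print([generator_address(0, 0b01, g, path=0) for g in range(4)])
```

With the transposed path and CNT[1:0] = "01", the lower bits come out as 01, 00, 11, 10 for generators 0 to 3, which is the pattern the description gives for the second read clock cycle.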

[0034]

Fig. 13 shows the configuration of the 4x16byte swap unit 1200, which receives on its input side the 4 times 16byte read data from the IMEMs as well as the lower 2 counter value bits CNT[0] and CNT[1]. The 4 times 16byte output data of the swap unit is generated by passing the data through two stages of multiplexers 1201 and 1202. In the first multiplexer stage 1201, depending on bit 0 (CNT[0]) of the 2bit counter value CNT[1:0], the first and second 16byte data signals are either passed through to the output side or interchanged in position with each other. The same is done for the third and fourth 16byte data signals.

[0035]

In the second multiplexer stage 1202, depending on bit 1 (CNT[1]) of the 2 bit counter value CNT[1:0], the first and third 16byte data signals are either passed through to the output side or interchanged in position with each other. The same is done for the second and fourth 16byte data signals.

[0036]

More specifically, the 4x4byte data from IMEM 0-3 and the 4x4byte data from IMEM 4-7 are supplied respectively to the 0-input and 1-input of a first selector, and respectively to the 1-input and 0-input of a second selector, of the multiplexer stage 1201. The 4x4byte data from IMEM 8-B and the 4x4byte data from IMEM C-F are supplied respectively to the 0-input and 1-input of a third selector, and respectively to the 1-input and 0-input of a fourth selector, of the multiplexer stage 1201. The first to fourth selectors of the multiplexer stage 1201 select and output the data supplied to their 0-input when CNT[0] is "0", and select and output the data supplied to their 1-input when CNT[0] is "1".

[0037]

The outputs of the first and third selectors of the multiplexer stage 1201 are supplied respectively to the 0-input and 1-input of a first selector of the multiplexer stage 1202. The outputs of the second and fourth selectors of the multiplexer stage 1201 are supplied respectively to the 0-input and 1-input of a second selector of the multiplexer stage 1202. The outputs of the third and first selectors of the multiplexer stage 1201 are supplied respectively to the 0-input and 1-input of a third selector of the multiplexer stage 1202. The outputs of the fourth and second selectors of the multiplexer stage 1201 are supplied respectively to the 0-input and 1-input of a fourth selector of the multiplexer stage 1202. The first to fourth selectors of the multiplexer stage 1202 select and output the data supplied to their 0-input when CNT[1] is "0", and select and output the data supplied to their 1-input when CNT[1] is "1". The 4x4byte data output from the first selector of the multiplexer stage 1202 are supplied to the 4x4byte transpose units for PE 0-3. The 4x4byte data output from the second selector of the multiplexer stage 1202 are supplied to the 4x4byte transpose units for PE 4-7. The 4x4byte data output from the third selector of the multiplexer stage 1202 are supplied to the 4x4byte transpose units for PE 8-B. The 4x4byte data output from the fourth selector of the multiplexer stage 1202 are supplied to the 4x4byte transpose units for PE C-F.

[0038]

When CNT[1:0] = "00", the 4x16byte swap unit 1200 outputs the 4x4byte data from IMEM 0-3, the 4x4byte data from IMEM 4-7, the 4x4byte data from IMEM 8-B and the 4x4byte data from IMEM C-F in this order without swapping.

[0039]

When CNT[1:0] = "01", the 4x16byte swap unit 1200 outputs the 4x4byte data from IMEM 4-7, the 4x4byte data from IMEM 0-3, the 4x4byte data from IMEM C-F and the 4x4byte data from IMEM 8-B in this order. In this case, the 4x4byte data from IMEM 0-3 and from IMEM 4-7 are swapped with each other, and the 4x4byte data from IMEM 8-B and from IMEM C-F are swapped with each other.

[0040]

When CNT[1:0] = "10", the 4x16byte swap unit 1200 outputs the 4x4byte data from IMEM 8-B, the 4x4byte data from IMEM C-F, the 4x4byte data from IMEM 0-3, and the 4x4byte data from IMEM 4-7 in this order. In this case, the 4x4byte data from IMEM 0-3 and from IMEM 8-B are swapped with each other, and the 4x4byte data from IMEM 4-7 and from IMEM C-F are swapped with each other.

[0041]

When CNT[1:0] = "11", the 4x16byte swap unit 1200 outputs the 4x4byte data from IMEM C-F, the 4x4byte data from IMEM 8-B, the 4x4byte data from IMEM 4-7 and the 4x4byte data from IMEM 0-3 in this order. In this case, the 4x4byte data from IMEM 0-3 and from IMEM C-F are swapped with each other, and the 4x4byte data from IMEM 4-7 and from IMEM 8-B are swapped with each other.
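
The two multiplexer stages and the four resulting output orders can be checked with a short Python model. It is a behavioral sketch only; the function name `swap_unit` and the use of strings to stand for the 16 byte data words are illustrative assumptions.

```python
def swap_unit(blocks, cnt):
    """Behavioral sketch of the two-stage swap for four sub-matrices.

    blocks : the four 16 byte data words in input order
    cnt    : the 2 bit counter value CNT[1:0]
    """
    first, second, third, fourth = blocks
    # Stage 1201: CNT[0] interchanges first/second and third/fourth.
    if cnt & 0b01:
        first, second, third, fourth = second, first, fourth, third
    # Stage 1202: CNT[1] interchanges first/third and second/fourth.
    if cnt & 0b10:
        first, second, third, fourth = third, fourth, first, second
    return [first, second, third, fourth]


names = ["IMEM 0-3", "IMEM 4-7", "IMEM 8-B", "IMEM C-F"]
for cnt in range(4):
    print(format(cnt, "02b"), swap_unit(names, cnt))
```

In this model the block at output position j is always the input block at position j XOR CNT[1:0], which is why the two stages together cover all four cases listed above.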

[0042]

Fig. 14 shows the configuration of a 4 times 4 byte transpose unit 1300. This 4 times 4 byte transpose unit 1300 corresponds to the 4x4 byte transpose unit 606 in Fig.6. The output data from the 4x16 byte swap unit in Fig. 13 (or 605 in Fig. 6) forms the input to four of these transpose units 1300. The 16 byte data is organized as a matrix of 4 columns of 4 bytes each, both on the input side 1301 and on the output side 1302. Then, depending on the signal PATH, the data is either sent directly to the output on one path, for access through the non-transposed path, or transposed on the second path by exchanging the position (x, y) of each data signal with the position (y, x), for access through the transposed path. At the output side, a multiplexer 1303 selects which of the two paths is output.

[0043]

As shown in Fig. 14, the 4x4 sub-matrix is represented in linear order as (Column 0, byte 0), (Column 0, byte 1), (Column 0, byte 2), (Column 0, byte 3), (Column 1, byte 0), (Column 1, byte 1), ..., and (Column 3, byte 3). The transposed 4x4 sub-matrix is represented as (Column 0, byte 0), (Column 1, byte 0), (Column 2, byte 0), (Column 3, byte 0), (Column 0, byte 1), (Column 1, byte 1), ..., and (Column 3, byte 3).
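
The linear orderings given above translate directly into an index mapping. The sketch below is an illustrative Python model in which the input byte of (Column c, byte b) sits at list index 4c+b; the name `transpose_unit` is hypothetical, and treating PATH equal to zero as the non-transposed path is an assumption taken from the flowchart description.

```python
def transpose_unit(data, path, m=4):
    """Illustrative model of one m x m byte transpose unit.

    data : m*m values, (column c, byte b) stored at index m*c + b
    path : assumed 0 = pass through, non-zero = transpose
    """
    data = list(data)
    if not path:
        return data
    # Output (column c, byte b) takes the input value of (column b, byte c).
    return [data[m * b + c] for c in range(m) for b in range(m)]


print(transpose_unit(range(16), path=1))
```

Applying the transposed path twice returns the original order, as expected of a transposition.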

[0044]

Fig. 15 is a block diagram showing the configuration of the write control function unit for the example with the number of matrix elements N equal to 16 and the number of matrix elements m stored in one memory element equal to 4. The write control function unit has the input and output connections shown in Fig. 16. An input memory base address MEM_BASE_ADDRESS is used to calculate the IMEM write address. The 2bit counter value CNT[1:0] generated in a counter unit 1400 (a 3bit counter identical to the counter unit 600 in Fig. 6) and its inverted value generated by an inverter 1404 (identical to the inverter 604 in Fig. 6) are used as the offset for the IMEM write address calculation inside an address generator 1403 (identical to the address generator 603 in Fig.6).

[0045]

The Write_address to IMEM 1402 (identical to IMEM 602 in Fig.6) generated by the address generator 1403 is, after passing a selector 1401 (identical to the selector 601 in Fig. 6), used to store the data from a PE-RF 1406 (identical to PE-RF 608 in Fig.6), which is accessed at the address 1405 (identical to the address 607 in Fig.6) specified by the lower two counter bits CNT[1:0]. The signal PATH is used to control the address generation inside the write control function.

[0046]

The write control function unit has the input and output connections as shown in Fig. 16.

[0047]

Fig. 17 shows a flowchart of the read control function. Fig. 18 shows an example of the macro block with 16 times 16 matrix elements (bytes). In the macro block 1601, the elements of the matrix are grouped into memory elements 1603, each with 4 vertical matrix elements (bytes) 1600. The memory elements 1603 are grouped into memory element sub blocks 1604 of 4 horizontally neighboring memory elements, and the memory element sub blocks 1604 are grouped into memory element sub block rows 1605 of 4 horizontally neighboring memory element sub blocks. With reference to Figs. 17 and 18, the operation of the read control function will now be described.

[0048]

<Step 1> Evaluate the signal PATH. If PATH is equal to zero, execute the "Yes"-branch explained in steps 2 to 3; otherwise execute the "No"-branch explained in steps 4 to 14 (1500).

[0049]

<Step 2> Calculate N/m (=4) IMEM addresses by combining the IMEM base address and the log2(N/m) (=2) counter bits (1501).

[0050]

<Step 3> Send the addresses to the IMEM and receive N/m (=4) times m² (=16) bytes per memory element sub block row 1605, which consists of a row of memory element sub blocks 1604, each with m (=4) memory elements 1603, each holding m (=4) matrix elements 1600 (1502).

[0051]

<Step 4> Calculate N/m (=4) IMEM addresses by combining the IMEM base address and the log2(N/m) (=2) counter and inverted counter bits (1503). The decision whether to use the counter bits or the inverted counter bits is made by generating the address offset matrix.

[0052] The address offset matrix is generated in the following way:

a) Form matrices of different levels. As a starting point, the memory element sub blocks with m (=4) times m matrix elements form the matrices of level 1. Then 4 neighboring matrices of level 1 are grouped into a new matrix on level 2. This is repeated recursively until only 4 matrices are left. Fig. 19 shows the output of this step for the initial values N=16 and m=4.

[0053]

b) Assign partial address offsets to each memory element sub block, as shown in Fig. 20. For each diagonal memory element sub block 1800 of each matrix on each level, a value of 0 is assigned, while for each anti-diagonal memory element sub block, a value equal to 2^(level-1) is assigned. Thus, for the anti-diagonal memory element sub blocks of level 1, a value of 1 is assigned (1801), and for the anti-diagonal memory element sub blocks of level 2, a value of 2 is assigned (1802).

[0054]

c) Sum up the partial address offsets from all levels (here from level 1 1900 and level 2 1901) for each memory element sub block to obtain the address offset matrix 1902 with the address offsets for each memory element sub block, as shown in Fig. 21.

[0055]

The correct address offset for each memory element sub block in a memory element sub block row is generated by using the counter bits CNT[0] and CNT[1] as a vertical index into the address offset matrix to select the address matrix row currently being processed, and by forming the address matrix entry for each memory element sub block of the selected row out of counter and inverted counter bits.
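
Steps a) to c) can be illustrated with a small Python sketch. It rests on one reading of the text that is an assumption here, since Figs. 19 to 21 are not reproduced: a sub block counts as anti-diagonal on a given level when its row and column indices differ in the bit belonging to that level. The function name `address_offset_matrix` is hypothetical.

```python
def address_offset_matrix(n, m):
    """Build the (N/m) x (N/m) address offset matrix level by level.

    Assumed reading: on level L, a sub block is anti-diagonal when bit L-1
    of its row index differs from bit L-1 of its column index, and it then
    contributes the partial offset 2**(L-1).
    """
    size = n // m
    levels = size.bit_length() - 1          # log2(N/m) levels
    offsets = [[0] * size for _ in range(size)]
    for level in range(1, levels + 1):
        bit = level - 1
        for row in range(size):
            for col in range(size):
                if ((row >> bit) & 1) != ((col >> bit) & 1):
                    offsets[row][col] += 1 << bit
    return offsets


for row in address_offset_matrix(16, 4):
    print(row)
```

Under this reading the entry in row r and column c equals r XOR c, so selecting the row with CNT[1:0] yields, for each group of IMEMs, the counter value with some of its bits inverted.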

[0056]

<Step 5> Send the N/m (=4) addresses to the IMEMs; each group of m (=4) neighboring IMEMs receives one address (1504).

[0057]

<Step 6> Receive N/m (=4) times m² (=16) bytes from the IMEM. Fig. 22 shows the received data for the N/m (=4) memory element sub block row requests with correct vertical memory element sub block position (1505).

[0058]

<Step 7> Set index i equal to zero (1506).

[0059]

<Step 8> Evaluate bit i of the counter CNT. If the value is equal to zero, go to step 10; otherwise continue with step 9 (1507).

[0060]

<Step 9> Build pairs of 2 sub blocks. The pairs are formed out of the sub blocks whose memory element sub block indices inside one memory element sub block row differ only in bit i. Swap the positions of these sub blocks. Fig. 23 shows the output of this step for the macro block shown in Fig.23 after finishing the loop execution over all values of index i and counter CNT (1508).

[0061]

<Step 10> Increase index i (1509).

[0062]

<Step 11> Compare index i against log2(N/m) (=2). If the values are not equal, go to step 8; otherwise continue with step 12 (1510).

[0063]

<Step 12> Rearrange the m² (=16) bytes of a memory element sub block as m (=4) horizontally neighboring memory elements, each with m (=4) bytes in the vertical direction (1511).

[0064]

<Step 13> For each memory element sub block, transpose the m (=4) times m (=4) bytes by moving each byte from position (x,y) to position (y,x), as shown by way of example for the upper left memory element sub block in Fig. 24. Fig. 25 shows the output of this step after finishing the loop execution over all values of counter CNT (1512).

[0065]

<Step 14> Store in all N (=16) PEs one memory element at the address specified by the lower log2(N/m) (=2) CNT bits (1513).

[0066]

The resulting matrix as shown in Fig. 25 is stored in N/m (=4) registers of N (=16) PEs. Registers 0, 1, 2, and 3 store 4x16byte data corresponding, respectively, to rows 0 to 3, rows 4 to 7, rows 8 to B, and rows C to F of the transposed 16 times 16 matrix, as shown in Figs.25A, 25B, 25C and 25D. PE-RF 0 to PE-RF F in Fig.6 and Fig. 15 correspond to Registers 0 to 3. For example, PE-RF 0 stores the elements of the first to fourth rows in the first columns of Registers 0 to 3 shown respectively in Figs.25A to 25D, PE-RF 1 stores the elements of the first to fourth rows in the second columns of Registers 0 to 3 shown respectively in Figs.25A to 25D, ..., and PE-RF F stores the elements of the first to fourth rows in the sixteenth columns of Registers 0 to 3 shown respectively in Figs.25A to 25D.

[0067]

Fig. 26 shows a flowchart for explaining the write control function. The data stored in registers PE-RF 0-3, PE-RF 4-7, PE-RF 8-B and PE-RF C-F are read and written, at the addresses specified by the address generators, into IMEM 0-3, IMEM 4-7, IMEM 8-B and IMEM C-F, respectively.

[0068]

<Step 1> Read from N (=16) PEs one memory element with m (=4) bytes from the address specified by the lower log2(N/m) (=2) CNT bits (2400).

[0069] <Step 2>

Evaluate the signal PATH (2401). If PATH is equal to zero, execute the "Yes"-branch explained in Step 3; if not, execute the "No"-branch explained in Step 4.

[0070]

<Step 3> Calculate N/m (=4) IMEM addresses by combining the IMEM base address and the lower log2(N/m) (=2) counter bits (2402). Continue with step 5.

[0071]

<Step 4> Calculate N/m (=4) IMEM addresses by combining the IMEM base address and the lower log2(N/m) (=2) counter and inverted counter bits. The decision whether to use the counter bits or the inverted counter bits is made by generating the address offset matrix as described for the read control function. The correct address offset for each memory element sub block in a memory element sub block row is then again generated by using the counter bits CNT[0] and CNT[1] as a vertical index into the address offset matrix to select the address matrix row currently being processed, and by forming the address matrix entry for each memory element sub block of the selected row out of counter and inverted counter bits (2403).

[0072]

<Step 5> Send N/m (=16/4=4) addresses and N/m (=4) times m² (=16) bytes per memory element sub block row to the IMEM (2404).

[0073]

The resulting matrix stored in the IMEM after the transfer is shown in Fig. 27. In Fig. 27, the position (x, y) of each matrix element of Fig. 18 is changed to (y, x).

[0074]

In addition to the data signals, control signals are needed to specify the correct function to be executed. The following three pieces of information are encoded in the control signals:

1.) Which function is executed (read or write) (1 bit)

2.) Which path is executed (path 0 or path 1) (1 bit)

3.) Which memory element sub block row is executed (log2(N/m) = 2 bit)

[0075]

A corner turn of the N (=16) x N matrix can then be executed by using the transposed paths of the read and write control function units in the connection apparatus 302 in 2x(N/m) = 2x(16/4) = 8 clock cycles, with the control signal setting in each clock cycle as specified in Fig. 28.
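
The complete corner turn, N/m read cycles followed by N/m write cycles, can be simulated in software to check that the address inversion, swap and transposition fit together. The following Python sketch is a behavioral model under stated assumptions, not the disclosed hardware: IMEM k is assumed to hold column k of the matrix with m vertically aligned elements per word, the address offset used by IMEM group g in cycle CNT is taken as CNT XOR g, and the names `corner_turn`, `imem` and `regs` are hypothetical.

```python
def corner_turn(matrix, m):
    """Simulate the 2*(N/m)-cycle corner turn of an N x N matrix (list of rows)."""
    n = len(matrix)
    groups = n // m  # number of sub-matrices per row; assumed a power of two
    # imem[k][a]: word at address offset a of IMEM k (column k, rows a*m .. a*m+m-1)
    imem = [[[matrix[a * m + b][k] for b in range(m)] for a in range(groups)]
            for k in range(n)]
    regs = [[None] * groups for _ in range(n)]  # regs[pe][register]

    # Read phase, transposed path: one sub block row per clock cycle.
    for cnt in range(groups):
        # Group g reads address offset cnt XOR g from its m IMEMs.
        blocks = [[imem[g * m + j][cnt ^ g] for j in range(m)] for g in range(groups)]
        # Swap unit: output position g takes the block from position g XOR cnt.
        swapped = [blocks[g ^ cnt] for g in range(groups)]
        # Transpose unit: output column j, byte b takes input column b, byte j.
        for g, block in enumerate(swapped):
            for j in range(m):
                regs[g * m + j][cnt] = [block[b][j] for b in range(m)]

    # Write phase, transposed path: same address pattern, no swap.
    for cnt in range(groups):
        for g in range(groups):
            for j in range(m):
                imem[g * m + j][cnt ^ g] = regs[g * m + j][cnt]

    # Read the matrix back out of the IMEM model as a list of rows.
    return [[imem[k][r // m][r % m] for k in range(n)] for r in range(n)]


matrix = [[16 * x + y for y in range(16)] for x in range(16)]
turned = corner_turn(matrix, 4)
print(turned == [list(col) for col in zip(*matrix)])
```

In this model each IMEM group only ever exchanges data with the PE groups reachable through the swap stages, which reflects the idea of avoiding a full crossbar.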

[0076]

Specific examples of matrix swap and transposition processing by the connection apparatus 302 according to the present exemplary embodiment will now be described.

[0077]

<Example 1> In this example, it is assumed that the matrix to be processed is an NxN (where N=8) matrix with a matrix element size of 1 byte, and that the data of each column, namely N (=8) elements, are stored in IMEMs 0 to N-1 (=7), as one word, respectively. In this example, the NxN (N=8) matrix is divided into (N/m)x(N/m) (=4x4) mxm (=2x2) sub-matrices, each of which has m (=2) rows x m columns.

[0078]

One word of each of the IMEMs and Registers stores m (=2) vertically aligned elements. In this example, N/m (=4) 2x2 sub-matrices of the 8x8 matrix stored at specified addresses of the IMEMs 0 to 7 are read, subjected to swapping and transposition, and then stored in the respective registers of PEs 0 to 7; this is done N/m times (cycles). After the read operation, the N/m (=4) 2x2 sub-matrices of the 8x8 matrix stored in the registers of PEs 0 to 7 are read and stored back at specified addresses of the IMEMs 0 to 7, again N/m times (cycles). The 8x8 matrix stored back in the IMEMs 0 to 7 is a corner turned version of the 8x8 matrix originally held in the IMEMs 0 to 7. That is, the position of every matrix element of the 8x8 matrix stored back in the IMEMs 0 to 7 is changed from the original position (x, y) to (y, x).

[0079]

The read operation will now be described.

a) Read addresses for IMEMs 0-1, IMEMs 2-3, IMEMs 4-5 and IMEMs 6-7 are generated by modifying the lower two bits of MEM_Base_Address in accordance with the lower two bit value of the 3bit counter 600 (or

b) Swap the horizontal positions of the 2x2 sub-matrices;

c) Corner turn for each 2x2 sub-matrix is performed; and

d) Store to Registers 0-3 in each PE.

[0080]

Following figures provide further details of the above described read process.

[0081]

In Fig. 29, the memory contents, that is, the matrix elements of the 8x8 matrix stored in IMEMs 0 to 7, are each expressed with a row index number x and a column index number y. In Fig. 29, "00" denotes the matrix element at the position of the first row and first column (0,0), and "77" denotes the matrix element at the position of the eighth row and eighth column (7,7). The 8x8 matrix is divided into 4 times 4 2x2 sub-matrices.

[0082]

Two vertically neighboring matrix elements (0,0) and (1,0) of the 8x8 matrix stored at the address MEM_Base_Address & "00" in IMEM 0, and two vertically neighboring matrix elements (0,1) and (1,1) of the 8x8 matrix stored at the address MEM_Base_Address & "00" in IMEM 1, compose a first 2x2 sub-matrix at the upper left position (0,0). In the same way, the four matrix elements (2,2), (3,2), (2,3) and (3,3) of the 8x8 matrix stored at the address MEM_Base_Address & "01" in IMEMs 2-3 compose a second 2x2 sub-matrix at position (1,1), the four matrix elements (4,4), (5,4), (4,5) and (5,5) of the 8x8 matrix stored at the address MEM_Base_Address & "10" in IMEMs 4-5 compose a third 2x2 sub-matrix at position (2,2), and the four matrix elements (6,6), (7,6), (6,7) and (7,7) of the 8x8 matrix stored at the address MEM_Base_Address & "11" in IMEMs 6-7 compose a fourth 2x2 sub-matrix at the lower right position (3,3). Regarding the operator "&", it is noted that MEM_Base_Address & "11", for example, indicates the address obtained by concatenating MEM_Base_Address (upper (N-2) bits) with the lower two bits "11", such as by adding "11" modulo 2.
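
The storage layout and the "&" concatenation described above can be written down as two small helper functions. This is an illustrative Python sketch; the names `locate` and `concat_address` are hypothetical, and (x, y) is taken as (row, column) as in Fig. 29.

```python
def locate(x, y, m=2):
    """Return (IMEM index, lower address bits, byte position in word) for element (x, y).

    Column y lives in IMEM y; each word holds m vertically aligned elements,
    so row x falls into word x // m at byte position x % m.
    """
    return y, x // m, x % m


def concat_address(base, low_bits, width=2):
    """Model of the '&' operator: base address as upper bits, low_bits as lower bits."""
    return (base << width) | low_bits


print(locate(2, 3))                                  # element (2,3) of the second sub-matrix
print(format(concat_address(0b101, 0b11), "b"))      # base "101" & "11"
```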

[0083]

Referring to Fig. 29, in clock cycle 1, with CNT[0] = 0 and CNT[1] = 0 of the 3bit counter 600 in Fig.6, the address generator 0 (Fig.6 and Fig.9) supplies MEM_Base_Address & "00" to IMEMs 0-1, the address generator 1 (Fig.6 and Fig.10) supplies MEM_Base_Address & "01" (CNT[1] and Inv(CNT[0]) selected by the selector 901) to IMEMs 2-3, the address generator 2 (Fig.6 and Fig.11) supplies MEM_Base_Address & "10" (Inv(CNT[1]) selected by the selector 1001 and CNT[0]) to IMEMs 4-5, and the address generator 3 (Fig.6 and Fig.12) supplies MEM_Base_Address & "11" (Inv(CNT[1]) selected by the selector 1101 and Inv(CNT[0]) selected by the selector 1102) to IMEMs 6-7.

[0084]

The four 2x2 sub-matrices forming the diagonal sub-matrix elements, which are located at (0,0), (1,1), (2,2) and (3,3) in the 4 rows and 4 columns of 2x2 sub-matrices, are read and transferred respectively to the corresponding transpose units without swapping. That is, the four diagonally located 2x2 sub-matrices are not swapped and only the transposition of each sub-matrix is performed. Each transpose unit generates a transposed 2x2 sub-matrix, wherein the position (x,y) of each 2x2 sub-matrix element is changed to the position (y,x), where x=0 or 1, y=0 or 1 and x≠y. The four transposed 2x2 sub-matrices are then stored in Register 0 of PEs 0-7. Register 0 of each of PEs 0-7 stores two vertically neighboring elements of the associated 2x2 sub-matrix.

[0085]

Referring to Fig. 30, in clock cycle 2, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 are supplied with MEM_Base_Address & "01", MEM_Base_Address & "00", MEM_Base_Address & "11", and MEM_Base_Address & "10", respectively. In clock cycle 2, the four 2x2 sub-matrices respectively located at (1,0), (0,1), (3,2) and (2,3) in the 4 rows and 4 columns of 2x2 sub-matrices are accessed by the address generators. The four 2x2 sub-matrices (16byte read data) are subjected to a swap operation in which the horizontal positions of two 2x2 sub-matrices are swapped with each other. Here, the first and second 2x2 sub-matrices are swapped as one pair, and the third and fourth sub-matrices are swapped as another pair. The four swapped 2x2 sub-matrices are each subjected to a corner turn and then stored in Register 1 of PEs 0-7. Register 1 of each of PEs 0-7 stores two vertically neighboring elements of the associated 2x2 sub-matrix.

[0086] Referring to Fig.31, in clock cycle 3, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 are supplied with MEM_Base_Address & "10", MEM_Base_Address & "11", MEM_Base_Address & "00", and MEM_Base_Address & "01", respectively. In clock cycle 3, the four 2x2 sub-matrices respectively located at (2,0), (3,1), (0,2) and (1,3) in the 4 rows and 4 columns of 2x2 sub-matrices are accessed by the address generators. The four 2x2 sub-matrices are subjected to a swap operation in which the first and third 2x2 sub-matrices are swapped as one pair and the second and fourth 2x2 sub-matrices are swapped as another pair. The swapped 2x2 sub-matrices are each subjected to a corner turn and then stored in Register 2 of PEs 0-7. Register 2 of each of PEs 0-7 stores two vertically neighboring elements of the associated 2x2 sub-matrix.

[0087]

Referring to Fig.32, in clock cycle 4, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 are supplied with MEM_Base_Address & "11", MEM_Base_Address & "10", MEM_Base_Address & "01", and MEM_Base_Address & "00", respectively. In clock cycle 4, the four 2x2 sub-matrices respectively located at (3,0), (2,1), (1,2) and (0,3) in the 4 rows and 4 columns of 2x2 sub-matrices are accessed by the address generators. The four 2x2 sub-matrices are subjected to a swap operation in which the first and fourth 2x2 sub-matrices are swapped as one pair and the second and third 2x2 sub-matrices are swapped as another pair. The swapped 2x2 sub-matrices are each subjected to a corner turn and then stored in Register 3 of PEs 0-7. Register 3 of each of PEs 0-7 stores two vertically neighboring elements of the associated 2x2 sub-matrix. As a result of the 4 clock cycle read operation on the 8x8 matrix stored in the IMEMs 0-7, the 8x8 matrix is stored in Registers 0 to 3 of PEs 0-7 as shown in Fig. 33.
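
The four read clock cycles of Example 1 follow one pattern, which the Python sketch below tabulates: the sub-matrix positions accessed by each IMEM group and the pairs exchanged by the swap operation. It is an illustrative model that takes the lower address bits of group g in cycle CNT as CNT XOR g; the function name `read_schedule` and zero-based pair indices are assumptions.

```python
def read_schedule(groups):
    """List, per read cycle, the accessed sub-matrix positions and the swapped pairs.

    groups : number of sub-matrices per row (N/m), assumed a power of two
    Returns tuples (cnt, accessed, swap_pairs) where accessed[g] is the
    (row, column) sub-matrix position read by IMEM group g.
    """
    schedule = []
    for cnt in range(groups):
        accessed = [(cnt ^ g, g) for g in range(groups)]
        # Position g is exchanged with position g XOR cnt; no swap when cnt is 0.
        swap_pairs = sorted({tuple(sorted((g, g ^ cnt))) for g in range(groups) if cnt})
        schedule.append((cnt, accessed, swap_pairs))
    return schedule


for cnt, accessed, swap_pairs in read_schedule(4):
    print("clock cycle", cnt + 1, accessed, swap_pairs)
```

For four groups this reproduces the positions named for clock cycles 1 to 4, for example (3,0), (2,1), (1,2) and (0,3) with the first/fourth and second/third sub-matrices exchanged in clock cycle 4.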

[0088]

In the write operation, one word at a time is read from Registers 0 to 3 in numerical order and stored into the IMEMs. The addresses for IMEMs 0-7 are generated by modifying the lower two bits of MEM_Address according to the lower two bit counter value of the 3 bit counter 1400.

[0089]

More specifically, referring to Fig.33, in write clock cycle 1, four 2x2 sub-matrices (16byte data) are read from Register 0 of PEs 0-7 and then stored at the addresses MEM_Base_Address & "00", MEM_Base_Address & "01", MEM_Base_Address & "10", and MEM_Base_Address & "11" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7, respectively. It should be noted that the combination of the lower two bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 in write clock cycle 1 is the same as that of the lower two bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 in read clock cycle 1 shown in Fig.29. IMEMs 0-7 each store two vertically neighboring elements of the associated 2x2 sub-matrix as one word at the specified address.

[0090]

Referring to Fig.34, in write clock cycle 2, 16 byte data are read from Register 1 of PEs 0-7 and then stored at addresses of MEM_Base_Address & "01", MEM_Base_Address & "00", MEM_Base_Address & "11", and MEM_Base_Address & "10" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7, respectively. The combination of the lower two bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 in write clock cycle 2 is the same as that of the lower two bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 in read clock cycle 2 shown in Fig.30.

[0091]

Referring to Fig.35, in write clock cycle 3, 16 byte data are read from Register 2 of PEs 0-7 and then stored at the addresses MEM_Base_Address & "10", MEM_Base_Address & "11", MEM_Base_Address & "00", and MEM_Base_Address & "01" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7, respectively. The combination of the lower two bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 in write clock cycle 3 is the same as that of the lower two bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 in read clock cycle 3 shown in Fig.31.

[0092]

Referring to Fig.36, in write clock cycle 4, 16 byte data are read from Register 3 of PEs 0-7 and then stored at the addresses MEM_Base_Address & "11", MEM_Base_Address & "10", MEM_Base_Address & "01", and MEM_Base_Address & "00" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7, respectively. The combination of the lower two bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 in write clock cycle 4 is the same as that of the lower two bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, and IMEMs 6-7 in read clock cycle 4 shown in Fig.32. The 8x8 matrix stored in IMEMs 0-7 by the above write operation is a corner turned version of the matrix shown in Fig.29. That is, the position (x,y) of a matrix element in the 8x8 matrix stored in IMEMs 0-7 of Fig.29 is changed to (y,x) in the 8x8 matrix stored in IMEMs 0-7 of Fig.36.

[0093]

<Example 2> In this example, it is assumed that the matrix to be processed is an 8x8 matrix (N=8) with a matrix element size of 1 byte, that the data of each column are stored in IMEM 0 to IMEM 7, respectively, and that one word of the IMEMs and Registers stores four vertically aligned elements (m=4). In this example, the 8x8 matrix is divided into (8/4)x(8/4) = four sub-matrices, each of which has 4 rows and 4 columns.

[0094]

The read operation will now be described.

a) Read addresses are generated by modifying the lower one bit of MEM_Address in accordance with a 1 bit counter value (LSB) of the 3bit counter 1400;

b) Swap horizontal positions of 4x4 sub-matrices;

c) Corner turn for each 4x4 sub-matrix; and

d) Store to Registers 0-1 in each PE.

The following figures provide further details of the above described read process.

[0095]

Referring to Fig.37, in clock cycle 1, IMEMs 0-3 and IMEMs 4-7 are supplied with MEM_Base_Address & "0" and MEM_Base_Address & "1", respectively. The two 4x4 sub-matrices (32byte data) diagonally located at the positions (0,0) and (1,1) in the 2 rows and 2 columns of 4x4 sub-matrices are read from IMEMs 0-3 and 4-7 and transferred to the corresponding transpose units without swapping, and the two corner turned 4x4 sub-matrices are then stored in Register 0 of PEs 0-7. Register 0 of each of PEs 0-7 stores four vertically neighboring elements.

[0096]

Referring to Fig.38, in clock cycle 2, IMEMs 0-3 and IMEMs 4-7 are supplied with MEM_Base_Address & "1" and MEM_Base_Address & "0", respectively. The two 4x4 sub-matrices located respectively at the positions (1,0) and (0,1) in the 2 rows and 2 columns of 4x4 sub-matrices are read from IMEMs 0-3 and 4-7 and subjected to a swap operation in which the two 4x4 sub-matrices are horizontally swapped. The two swapped 4x4 sub-matrices are each subjected to a corner turn and then stored in Register 1 of PEs 0-7. Register 1 of each of PEs 0-7 stores four vertically neighboring elements.

[0097]

In the write operation, one word is read from each of Registers 0 to 1 of PEs 0-7 in number order and stored at specified addresses in IMEMs 0-7. The addresses for IMEMs 0-7 are generated by modifying the lower one bit of MEM_Address according to a 1 bit counter value (LSB) of the 3bit counter 1400.

[0098]

More specifically, referring to Fig.39, in write clock cycle 1, two 4x4 sub-matrices are read from Register 0 of PEs 0-7 and then stored at addresses of MEM_Base_Address & "0" and MEM_Base_Address & "1" of IMEMs 0-3 and IMEMs 4-7, respectively. The combination of the lower one bit values in the write addresses respectively supplied to IMEMs 0-3 and IMEMs 4-7 in write clock cycle 1 is the same as that of the lower one bit values in the read addresses respectively supplied to IMEMs 0-3 and IMEMs 4-7 in read clock cycle 1 shown in Fig.37. IMEMs 0-7 each store four vertically neighboring elements of the associated 4x4 sub-matrix as one word at a specified address.

[0099]

The two 4x4 sub-matrices which were originally located at positions (0, 0) and (1, 1) in 2 rows and 2 columns of 4x4 sub-matrices, each subjected to corner turn and then stored in Register 0 of PEs 0-7 as shown in Fig.37, are now stored back at the positions (0, 0) and (1, 1) in 2 rows and 2 columns of 4x4 sub-matrices, respectively, as shown in Fig. 39.

[0100]

Referring to Fig.40, in write clock cycle 2, 32 byte data are read from Register 1 of PEs 0-7 and then stored at addresses of MEM_Base_Address & "1" and MEM_Base_Address & "0" of IMEMs 0-3 and IMEMs 4-7, respectively. The combination of the lowest one bit values in the write addresses respectively supplied to IMEMs 0-3 and IMEMs 4-7 in write clock cycle 2 is the same as that of the lowest one bit values in the read addresses respectively supplied to IMEMs 0-3 and IMEMs 4-7 in read clock cycle 2 shown in Fig.38.

[0101 ]

The two 4x4 sub-matrices which were originally located at positions (1, 0) and (0, 1) in 2 rows and 2 columns of 4x4 sub-matrices, subjected to pair-wise swapping and corner turn and then stored in Register 1 of PEs 0-7 as shown in Fig.38, are now stored at the transposed positions (0, 1) and (1, 0) in 2 rows and 2 columns of 4x4 sub-matrices, respectively, as shown in Fig. 40.

[0102]

The 8x8 matrix composed of four 4x4 sub-matrices, each of which has been corner turned in the read operation and written at the respective addresses of IMEMs 0-7 by the above write operation, is a corner turned version of the 8x8 matrix of Fig.37. That is, the position (x, y) of a matrix element in the 8x8 matrix stored in IMEMs 0-7 of Fig. 37 is changed to (y, x) in the 8x8 matrix stored in IMEMs 0-7 of Fig. 40.
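The movement of the sub-matrices through the read and write cycles of this example can be tabulated with a few lines of code. The following is a minimal sketch, assuming that positions are given as (row, column) of the sub-matrix grid and that the address suffix of IMEM group g in a cycle is g XOR the counter value; the function name `block_moves` is hypothetical.

```python
def block_moves(groups):
    """Map cycle number -> list of (source, destination) sub-matrix positions (row, column)."""
    moves = {}
    for k in range(groups):                 # k is the counter value, the cycle number is k + 1
        # group g reads block row g XOR k of its own block column g; after the swap
        # the block sits in group g XOR k and is written back at block row g
        moves[k + 1] = [((g ^ k, g), (g, g ^ k)) for g in range(groups)]
    return moves

for cycle, pairs in block_moves(2).items():
    for src, dst in pairs:
        print(f"cycle {cycle}: sub-matrix {src} -> {dst}")
```

For two groups this lists the diagonal sub-matrices (0, 0) and (1, 1) as staying in place in cycle 1, and the sub-matrices (1, 0) and (0, 1) as moving to (0, 1) and (1, 0) in cycle 2, which agrees with the positions described above.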

[0103] <Example 3>

In this example, it is assumed that the matrix to be processed is a 16x16 matrix (N=16) with a matrix element size of 1 byte, that each column of data, namely 16 elements, is stored in IMEM 0 to IMEM F, respectively, and that one word of the IMEMs and Registers stores two vertically aligned elements (m=2). In this example, the 16x16 matrix is divided into (16/2)x(16/2) = 64 sub-matrices, each of which has 2 rows x 2 columns.

[0104]

The read operation will now be described.

a) Read one word from each IMEM, wherein read addresses are generated by modifying the lower three bits of MEM_Address in accordance with a 3 bit counter value (low order 3 bits) of the 4bit counter 1400;

b) Swap horizontal positions of 2x2 sub-matrices;

c) Corner turn for each 2x2 sub-matrix; and

d) Store to Registers 0-7 in each PE.
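The address generation of step a) can be illustrated as follows. This is a sketch under two assumptions that are not spelled out in the description: the "&" in "MEM_Base_Address & ..." is read as bit-string concatenation (as in VHDL), and the three modified bits for IMEM pair g equal g XOR the low three counter bits, which is consistent with the suffixes listed for clock cycles 1 to 8. The names `suffixes` and `full_address` are hypothetical.

```python
GROUPS = 8   # 16 IMEMs taken in pairs: 0-1, 2-3, ..., E-F
BITS = 3     # number of low-order address bits that are modified

def suffixes(cycle):
    """Low-order address bits supplied to each IMEM pair in the given clock cycle (1-based)."""
    k = cycle - 1                                   # low 3 bits of the counter
    return [format(g ^ k, f"0{BITS}b") for g in range(GROUPS)]

def full_address(base_bits, suffix):
    """'MEM_Base_Address & suffix' read as bit-string concatenation."""
    return base_bits + suffix

for cycle in range(1, GROUPS + 1):
    print(cycle, suffixes(cycle))
```

The same function serves for the write addresses, since each write clock cycle reuses the suffix combination of the read clock cycle with the same number.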

[0105]

Following figures provide further details of the above described read process.

Referring to Fig.41, in clock cycle 1, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F are supplied with MEM_Base_Address & "000", MEM_Base_Address & "001", MEM_Base_Address & "010", MEM_Base_Address & "011", MEM_Base_Address & "100", MEM_Base_Address & "101", MEM_Base_Address & "110", and MEM_Base_Address & "111", respectively. Eight 2x2 sub-matrices (32byte data) which are diagonally located in 8 rows and 8 columns of 2x2 sub-matrices are read and transferred to the corresponding transpose units without swapping, and the eight corner turned 2x2 sub-matrices are then stored in Register 0 of PEs 0-F. Register 0 of each of PEs 0-F stores two vertically neighboring elements of the associated 2x2 sub-matrix.

[0106]

Referring to Fig.42, in clock cycle 2, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F are supplied with MEM_Base_Address & "001", MEM_Base_Address & "000", MEM_Base_Address & "011", MEM_Base_Address & "010", MEM_Base_Address & "101", MEM_Base_Address & "100", MEM_Base_Address & "111", and MEM_Base_Address & "110", respectively. Eight 2x2 sub-matrices (32byte data) read from IMEMs 0-F are subjected to a swap operation in which the horizontal positions of each adjacent pair of 2x2 sub-matrices are swapped. The eight swapped 2x2 sub-matrices are subjected to corner turn and then stored in Register 1 of PEs 0-F. Register 1 of each of PEs 0-F stores two vertically neighboring elements of the associated 2x2 sub-matrix.

[0107]

Referring to Fig.43, in clock cycle 3, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F are supplied with MEM_Base_Address & "010", MEM_Base_Address & "011", MEM_Base_Address & "000", MEM_Base_Address & "001", MEM_Base_Address & "110", MEM_Base_Address & "111", MEM_Base_Address & "100", and MEM_Base_Address & "101", respectively. Eight 2x2 sub-matrices (32byte data) read from IMEMs 0-F are subjected to a swap operation in which a pair of first and third 2x2 sub-matrices, a pair of second and fourth 2x2 sub-matrices, a pair of fifth and seventh 2x2 sub-matrices, and a pair of sixth and eighth 2x2 sub-matrices are each swapped in horizontal position. Each of the eight swapped 2x2 sub-matrices is subjected to corner turn and then stored in Register 2 of PEs 0-F. Register 2 of each of PEs 0-F stores two vertically neighboring elements of the associated 2x2 sub-matrix.

[0108]

Referring to Fig.44, in clock cycle 4, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F are supplied with MEM_Base_Address & "011", MEM_Base_Address & "010", MEM_Base_Address & "001", MEM_Base_Address & "000", MEM_Base_Address & "111", MEM_Base_Address & "110", MEM_Base_Address & "101", and MEM_Base_Address & "100", respectively. Eight 2x2 sub-matrices (32byte data) read from IMEMs 0-F are subjected to a swap operation in which a pair of first and fourth 2x2 sub-matrices, a pair of second and third 2x2 sub-matrices, a pair of fifth and eighth 2x2 sub-matrices, and a pair of sixth and seventh 2x2 sub-matrices are each swapped in horizontal position. Each of the eight swapped 2x2 sub-matrices is subjected to corner turn and then stored in Register 3 of PEs 0-F. Register 3 of each of PEs 0-F stores two vertically neighboring elements of the associated 2x2 sub-matrix.

[0109]

Referring to Fig.45, in clock cycle 5, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F are supplied with MEM_Base_Address & "100", MEM_Base_Address & "101", MEM_Base_Address & "110", MEM_Base_Address & "111", MEM_Base_Address & "000", MEM_Base_Address & "001", MEM_Base_Address & "010", and MEM_Base_Address & "011", respectively. Eight 2x2 sub-matrices (32byte data) read from IMEMs 0-F are subjected to a swap operation in which a pair of first and fifth 2x2 sub-matrices, a pair of second and sixth 2x2 sub-matrices, a pair of third and seventh 2x2 sub-matrices, and a pair of fourth and eighth 2x2 sub-matrices are each swapped in horizontal position. Each of the eight swapped 2x2 sub-matrices is subjected to corner turn and then stored in Register 4 of PEs 0-F. Register 4 of each of PEs 0-F stores two vertically neighboring elements of the associated 2x2 sub-matrix.

[0110]

Referring to Fig.46, in clock cycle 6, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F are supplied with MEM_Base_Address & "101", MEM_Base_Address & "100", MEM_Base_Address & "111", MEM_Base_Address & "110", MEM_Base_Address & "001", MEM_Base_Address & "000", MEM_Base_Address & "011", and MEM_Base_Address & "010", respectively. Eight 2x2 sub-matrices (32byte data) read from IMEMs 0-F are subjected to a swap operation in which a pair of first and sixth 2x2 sub-matrices, a pair of second and fifth 2x2 sub-matrices, a pair of third and eighth 2x2 sub-matrices, and a pair of fourth and seventh 2x2 sub-matrices are each swapped in horizontal position. Each of the eight swapped 2x2 sub-matrices is subjected to corner turn and then stored in Register 5 of PEs 0-F. Register 5 of each of PEs 0-F stores two vertically neighboring elements of the associated 2x2 sub-matrix.

[0111]

Referring to Fig.47, in clock cycle 7, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F are supplied with MEM_Base_Address & "110", MEM_Base_Address & "111", MEM_Base_Address & "100", MEM_Base_Address & "101", MEM_Base_Address & "010", MEM_Base_Address & "011", MEM_Base_Address & "000", and MEM_Base_Address & "001", respectively. Eight 2x2 sub-matrices (32byte data) read from IMEMs 0-F are subjected to a swap operation in which a pair of first and seventh 2x2 sub-matrices, a pair of second and eighth 2x2 sub-matrices, a pair of third and fifth 2x2 sub-matrices, and a pair of fourth and sixth 2x2 sub-matrices are each swapped in horizontal position. Each of the eight swapped 2x2 sub-matrices is subjected to corner turn and then stored in Register 6 of PEs 0-F. Register 6 of each of PEs 0-F stores two vertically neighboring elements of the associated 2x2 sub-matrix.

[0112]

Referring to Fig.48, in clock cycle 8, IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F are supplied with MEM_Base_Address & "111", MEM_Base_Address & "110", MEM_Base_Address & "101", MEM_Base_Address & "100", MEM_Base_Address & "011", MEM_Base_Address & "010", MEM_Base_Address & "001", and MEM_Base_Address & "000", respectively. Eight 2x2 sub-matrices (32byte data) read from IMEMs 0-F are subjected to a swap operation in which a pair of first and eighth 2x2 sub-matrices, a pair of second and seventh 2x2 sub-matrices, a pair of third and sixth 2x2 sub-matrices, and a pair of fourth and fifth 2x2 sub-matrices are each swapped in horizontal position. Each of the eight swapped 2x2 sub-matrices is subjected to corner turn and then stored in Register 7 of PEs 0-F. Register 7 of each of PEs 0-F stores two vertically neighboring elements of the associated 2x2 sub-matrix.
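The swap pairings listed for clock cycles 2 to 8 follow one pattern, which may be sketched as below. The sketch assumes that the sub-matrix in horizontal position g (0-based) is exchanged with the one in position g XOR the counter value; this rule is inferred from the pairings given in the text, and the name `swap_pairs` is hypothetical.

```python
def swap_pairs(cycle, groups=8):
    """1-based pairs of sub-matrix positions exchanged by the swap unit in a clock cycle."""
    k = cycle - 1                     # counter value for this cycle
    # list each pair once, from its lower-numbered member
    return [(g + 1, (g ^ k) + 1) for g in range(groups) if g < (g ^ k)]

def swap(lane, k):
    """Exchange horizontal positions of the sub-matrices held in `lane`."""
    return [lane[g ^ k] for g in range(len(lane))]

for cycle in range(1, 9):
    print(cycle, swap_pairs(cycle))
```

Because every position has exactly one partner in a cycle, applying the same swap twice restores the original order, so one swap unit design serves all clock cycles with only the counter value changing.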

[0113]

In the write operation, one word is read from each of Registers 0 to 7 in number order and stored to the IMEMs. The addresses for IMEMs 0-F are generated by modifying the lower three bits of MEM_Address according to a three bit counter value of the 4bit counter 1400.

[0114]

More specifically, referring to Fig.49, in write clock cycle 1, eight 2x2 sub-matrices (32byte data) are read from Register 0 of PEs 0-F and then stored at addresses of MEM_Base_Address & "000", MEM_Base_Address & "001", MEM_Base_Address & "010", MEM_Base_Address & "011", MEM_Base_Address & "100", MEM_Base_Address & "101", MEM_Base_Address & "110", and MEM_Base_Address & "111" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F, respectively. The combination of the lower three bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in write clock cycle 1 is the same as that of the lower three bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in read clock cycle 1 shown in Fig.41. Each of IMEMs 0-F stores two vertically neighboring elements of the associated 2x2 sub-matrix as one word at a specified address.

[0115]

Referring to Fig.50, in write clock cycle 2, eight 2x2 sub-matrices (32byte data) are read from Register 1 of PEs 0-F and then stored at addresses of MEM_Base_Address & "001", MEM_Base_Address & "000", MEM_Base_Address & "011", MEM_Base_Address & "010", MEM_Base_Address & "101", MEM_Base_Address & "100", MEM_Base_Address & "111", and MEM_Base_Address & "110" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F, respectively. The combination of the lower three bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in write clock cycle 2 is the same as that of the lower three bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in read clock cycle 2 shown in Fig.42.

[0116]

Referring to Fig.51, in write clock cycle 3, eight 2x2 sub-matrices (32byte data) are read from Register 2 of PEs 0-F and then stored at addresses of MEM_Base_Address & "010", MEM_Base_Address & "011", MEM_Base_Address & "000", MEM_Base_Address & "001", MEM_Base_Address & "110", MEM_Base_Address & "111", MEM_Base_Address & "100", and MEM_Base_Address & "101" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F, respectively. The combination of the lower three bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in write clock cycle 3 is the same as that of the lower three bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in read clock cycle 3 shown in Fig.43.

[0117]

Referring to Fig.52, in write clock cycle 4, eight 2x2 sub-matrices (32byte data) are read from Register 3 of PEs 0-F and then stored at addresses of MEM_Base_Address & "011", MEM_Base_Address & "010", MEM_Base_Address & "001", MEM_Base_Address & "000", MEM_Base_Address & "111", MEM_Base_Address & "110", MEM_Base_Address & "101", and MEM_Base_Address & "100" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F, respectively. The combination of the lower three bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in write clock cycle 4 is the same as that of the lower three bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in read clock cycle 4 shown in Fig.44.

[0118]

Referring to Fig.53, in write clock cycle 5, eight 2x2 sub-matrices (32byte data) are read from Register 4 of PEs 0-F and then stored at addresses of MEM_Base_Address & "100", MEM_Base_Address & "101", MEM_Base_Address & "110", MEM_Base_Address & "111", MEM_Base_Address & "000", MEM_Base_Address & "001", MEM_Base_Address & "010", and MEM_Base_Address & "011" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F, respectively. The combination of the lower three bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in write clock cycle 5 is the same as that of the lower three bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in read clock cycle 5 shown in Fig.45.

[0119]

Referring to Fig.54, in write clock cycle 6, eight 2x2 sub-matrices (32byte data) are read from Register 5 of PEs 0-F and then stored at addresses of MEM_Base_Address & "101", MEM_Base_Address & "100", MEM_Base_Address & "111", MEM_Base_Address & "110", MEM_Base_Address & "001", MEM_Base_Address & "000", MEM_Base_Address & "011", and MEM_Base_Address & "010" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F, respectively. The combination of the lower three bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in write clock cycle 6 is the same as that of the lower three bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in read clock cycle 6 shown in Fig.46.

[0120]

Referring to Fig.55, in write clock cycle 7, eight 2x2 sub-matrices (32byte data) are read from Register 6 of PEs 0-F and then stored at addresses of MEM_Base_Address & "110", MEM_Base_Address & "111", MEM_Base_Address & "100", MEM_Base_Address & "101", MEM_Base_Address & "010", MEM_Base_Address & "011", MEM_Base_Address & "000", and MEM_Base_Address & "001" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F, respectively. The combination of the lower three bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in write clock cycle 7 is the same as that of the lower three bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in read clock cycle 7 shown in Fig.47.

[0121]

Referring to Fig.56, in write clock cycle 8, eight 2x2 sub-matrices (32byte data) are read from Register 7 of PEs 0-F and then stored at addresses of MEM_Base_Address & "111", MEM_Base_Address & "110", MEM_Base_Address & "101", MEM_Base_Address & "100", MEM_Base_Address & "011", MEM_Base_Address & "010", MEM_Base_Address & "001", and MEM_Base_Address & "000" of IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F, respectively. The combination of the lower three bit values in the write addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in write clock cycle 8 is the same as that of the lower three bit values in the read addresses respectively supplied to IMEMs 0-1, IMEMs 2-3, IMEMs 4-5, IMEMs 6-7, IMEMs 8-9, IMEMs A-B, IMEMs C-D, and IMEMs E-F in read clock cycle 8 shown in Fig.48.

[0122]

The 16x16 matrix stored in IMEMs 0-F by the above write operation is a corner turned version of the matrix shown in Fig.41. That is, the position (x, y) of a matrix element in the 16x16 matrix stored in IMEMs 0-F of Fig. 41 is changed to (y, x) in the 16x16 matrix stored in IMEMs 0-F of Fig. 56.
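That eight read cycles and eight write cycles suffice for all 64 sub-matrices can be checked with a small bookkeeping sketch. It assumes, as the listed address suffixes suggest, that IMEM pair g uses suffix g XOR the counter value in both the read and the write cycle; the variable names are hypothetical.

```python
from collections import Counter

N, M = 16, 2
GROUPS = N // M      # 8 IMEM pairs and 8 address suffixes

# (address suffix, IMEM pair) identifies one 2x2 sub-matrix as (block row, block column);
# count how often each one is read over the 8 cycles
reads = Counter((g ^ k, g) for k in range(GROUPS) for g in range(GROUPS))

# after the swap the sub-matrix sits in pair g XOR k and is written at suffix g
dest = {(g ^ k, g): (g, g ^ k) for k in range(GROUPS) for g in range(GROUPS)}

all_read_once = len(reads) == GROUPS * GROUPS and set(reads.values()) == {1}
all_mirrored = all(d == (s[1], s[0]) for s, d in dest.items())
print(all_read_once, all_mirrored)
```

Under these assumptions every sub-matrix is read exactly once and lands at the mirrored grid position, and no IMEM is addressed twice in the same cycle, since each pair only ever accesses its own word.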

[0123]

In this example, it is assumed that the matrix to be processed is a 16x16 matrix (N=16) with a matrix element size of 1 byte, that each column of data, namely 16 elements, is stored in IMEM 0 to IMEM F, respectively, and that one word of the IMEMs and Registers stores four vertically aligned elements (m=4). In this example, the 16x16 matrix is divided into (16/4)x(16/4) = 16 sub-matrices, each of which has 4 rows x 4 columns.

[0124]

The read operation will now be described.

a) Read one word from each IMEM, wherein read addresses are generated by modifying the lower two bits of MEM_Address in accordance with a 2 bit counter value of the 3bit counter 1400;

b) Swap horizontal positions of 4x4 sub-matrices;

c) Corner turn for each 4x4 sub-matrix; and

d) Store to Registers 0-3 in each PE.

[0125]

Following figures provide further details of the above described read process.

Referring to Fig.57, in clock cycle 1, IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F are supplied with MEM_Base_Address & "00", MEM_Base_Address & "01", MEM_Base_Address & "10", and MEM_Base_Address & "11", respectively. Four 4x4 sub-matrices (64byte data) which are diagonally located in 4 rows and 4 columns of 4x4 sub-matrices are read and transferred to the corresponding transpose units without swapping, and the four corner turned 4x4 sub-matrices are then stored in Register 0 of PEs 0-F. Register 0 of each of PEs 0-F stores four vertically neighboring matrix elements of the associated 4x4 sub-matrix.

[0126]

Referring to Fig.58, in clock cycle 2, IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F are supplied with MEM_Base_Address & "01", MEM_Base_Address & "00", MEM_Base_Address & "11", and MEM_Base_Address & "10", respectively. Four 4x4 sub-matrices (64byte data) read from IMEMs 0-F are subjected to a swap operation in which each pair of adjacent 4x4 sub-matrices is swapped in horizontal position. The four swapped 4x4 sub-matrices are subjected to corner turn and then stored in Register 1 of PEs 0-F. Register 1 of each of PEs 0-F stores four vertically neighboring matrix elements of the associated 4x4 sub-matrix.

[0127]

Referring to Fig.59, in clock cycle 3, IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F are supplied with MEM_Base_Address & "10", MEM_Base_Address & "11", MEM_Base_Address & "00", and MEM_Base_Address & "01", respectively. Four 4x4 sub-matrices (64byte data) read from IMEMs 0-F are subjected to a swap operation in which a pair of first and third 4x4 sub-matrices are swapped in horizontal position and a pair of second and fourth 4x4 sub-matrices are swapped in horizontal position. The four swapped 4x4 sub-matrices are subjected to corner turn and then stored in Register 2 of PEs 0-F. Register 2 of each of PEs 0-F stores four vertically neighboring matrix elements of the associated 4x4 sub-matrix.

[0128]

Referring to Fig.60, in clock cycle 4, IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F are supplied with MEM_Base_Address & "11", MEM_Base_Address & "10", MEM_Base_Address & "01", and MEM_Base_Address & "00", respectively. Four 4x4 sub-matrices (64byte data) read from IMEMs 0-F are subjected to a swap operation in which a pair of first and fourth 4x4 sub-matrices are swapped in horizontal position and a pair of second and third 4x4 sub-matrices are swapped in horizontal position. The four swapped 4x4 sub-matrices are subjected to corner turn and then stored in Register 3 of PEs 0-F. Register 3 of each of PEs 0-F stores four vertically neighboring matrix elements of the associated 4x4 sub-matrix.

[0129]

In the write operation, one word is read from each of Registers 0 to 3 in number order and stored to the IMEMs. The addresses for IMEMs 0-F are generated by modifying the lower two bits of MEM_Address according to a 2 bit counter value of the 3bit counter 1400.

[0130]

More specifically, referring to Fig.61, in write clock cycle 1, four 4x4 sub-matrices (64 byte data) are read from Register 0 of PEs 0-F and then stored at addresses of MEM_Base_Address & "00", MEM_Base_Address & "01", MEM_Base_Address & "10", and MEM_Base_Address & "11" of IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F, respectively. The combination of the lower two bit values in the write addresses respectively supplied to IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F in write clock cycle 1 is the same as that of the lower two bit values in the read addresses respectively supplied to IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F in read clock cycle 1 shown in Fig. 57. IMEMs 0-F each store four vertically neighboring matrix elements as one word at a specified address.

[0131]

Referring to Fig.62, in write clock cycle 2, 64 byte data are read from Register 1 of PEs 0-F and then stored at addresses of MEM_Base_Address & "01", MEM_Base_Address & "00", MEM_Base_Address & "11", and MEM_Base_Address & "10" of IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F, respectively. The combination of the lower two bit values in the write addresses respectively supplied to IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F in write clock cycle 2 is the same as that of the lower two bit values in the read addresses respectively supplied to IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F in read clock cycle 2 shown in Fig.58.

[0132]

Referring to Fig.63, in write clock cycle 3, 64 byte data are read from Register 2 of PEs 0-F and then stored at addresses of MEM_Base_Address & "10", MEM_Base_Address & "11", MEM_Base_Address & "00", and MEM_Base_Address & "01" of IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F, respectively. The combination of the lower two bit values in the write addresses respectively supplied to IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F in write clock cycle 3 is the same as that of the lower two bit values in the read addresses respectively supplied to IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F in read clock cycle 3 shown in Fig. 59.

[0133]

Referring to Fig.64, in write clock cycle 4, 64 byte data are read from Register 3 of PEs 0-F and then stored at addresses of MEM_Base_Address & "11", MEM_Base_Address & "10", MEM_Base_Address & "01", and MEM_Base_Address & "00" of IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F, respectively. The combination of the lower two bit values in the write addresses respectively supplied to IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F in write clock cycle 4 is the same as that of the lower two bit values in the read addresses respectively supplied to IMEMs 0-3, IMEMs 4-7, IMEMs 8-B, and IMEMs C-F in read clock cycle 4 shown in Fig. 60.

[0134]

The 16x16 matrix stored in IMEMs 0-F by the above write operation is a corner turned version of the matrix shown in Fig.57. That is, the position (x, y) of a matrix element in the 16x16 matrix stored in IMEMs 0-F of Fig. 57 is changed to (y, x) in the 16x16 matrix stored in IMEMs 0-F of Fig. 64.
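The figures quoted in the examples (number of sub-matrices, clock cycles, modified address bits, bytes moved per cycle) all follow from N and m. The helper below is a minimal sketch of that arithmetic, assuming 1-byte elements and N/m being a power of two as in all examples; the function name `schedule` and its dictionary keys are hypothetical.

```python
def schedule(n, m):
    """Derived figures for an n x n matrix of 1-byte elements with m elements per word."""
    groups = n // m                      # sub-matrices per row and per column
    return {
        "sub_matrices": groups * groups,
        "cycles_per_phase": groups,      # read cycles, and the same number of write cycles
        "modified_address_bits": (groups - 1).bit_length(),
        "bytes_per_cycle": n * m,        # one m-element word from each of the n IMEMs
    }

for n, m in [(8, 2), (8, 4), (16, 2), (16, 4)]:
    print((n, m), schedule(n, m))
```

For N=16 and m=4, for instance, this gives 16 sub-matrices, four read and four write clock cycles, two modified address bits, and 64 bytes per cycle, in line with the values stated for this example.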

[0135]

In the read control function unit according to the above embodiment shown in Fig. 6, the sub-matrices read from IMEMs 0-F are supplied to the swap unit and then to the corresponding transpose units. The sub-matrices which have undergone corner turn are stored in the Register Files (PE-RF) of the PEs as shown in Fig.6. However, the placement of the swap unit and the transpose units may be interchanged. For example, as shown in Fig.65, 4x4 byte data (4x4 sub-matrices) read from four IMEMs are first supplied to a corresponding 4x4 byte transpose unit and then to the 4x16 byte swap unit. The 4x4 sub-matrices which have undergone swapping in horizontal positions are stored in the register files of the PEs (PE-RF 0 to PE-RF F). In the present exemplary embodiment, the write control function unit has the same configuration as that of Fig. 15.
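The swap unit and the transpose units can be interchanged because the swap only moves whole sub-matrices between horizontal positions while the corner turn only rearranges elements inside each sub-matrix. A small check of this, with hypothetical helper names and with the swap modelled as exchanging position g with g XOR the counter value (an inferred rule), might look as follows.

```python
def transpose_block(block):
    """Corner turn of one sub-matrix given as a list of rows."""
    return [list(col) for col in zip(*block)]

def swap(lane, k):
    """Exchange horizontal positions of whole sub-matrices for counter value k."""
    return [lane[g ^ k] for g in range(len(lane))]

def corner_turn_all(lane):
    return [transpose_block(b) for b in lane]

# four distinct 4x4 blocks as they might arrive from IMEMs 0-3, 4-7, 8-B and C-F
lane = [[[100 * g + 4 * r + c for c in range(4)] for r in range(4)] for g in range(4)]

orders_agree = all(
    corner_turn_all(swap(lane, k)) == swap(corner_turn_all(lane), k)
    for k in range(4)
)
print(orders_agree)
```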

[0136] <Exemplary Embodiment 3>

In the above exemplary embodiments, the read data are subjected to swapping and corner turn in the transfer paths of the data read from the IMEMs to the Register Files of the PEs. However, it is a matter of course that the present invention is not limited to such a configuration.

[0137]

Fig. 66 is a block diagram showing the configuration of the write control function unit according to the exemplary embodiment 3. In the present exemplary embodiment shown in Fig.66, the write control function unit comprises the 4x16 byte swap unit 1405 and four 4x4 byte transpose units 1406.

[0138]

In the present exemplary embodiment wherein the write control function unit has the configuration shown in Fig. 66, the read control function unit is configured as shown in Fig.67. Referring to Fig.67, the read data (four 4x4 sub-matrix data) which have been read in parallel from IMEMs 0-3, IMEMs 4-7, IMEMs 8-B and IMEMs C-F at respective read addresses specified by the address generators 601 are directly written to the corresponding Registers 0 to 3 of PEs 0-3, PEs 4-7, PEs 8-B and PEs C-F specified by a two bit counter value of the 3bit counter 1400.

[0139]

In the write control function unit according to the present exemplary embodiment, the four 4x4 sub-matrix data which have been stored in Registers 0-3 of PEs 0-3, PEs 4-7, PEs 8-B and PEs C-F by the read control function unit are read in parallel from the corresponding Registers 0 to 3 of PEs 0-3, PEs 4-7, PEs 8-B and PEs C-F specified by a two bit counter value of the 3bit counter 1400 and supplied to the corresponding 4x4 byte transpose units 1406. The 4x4 byte transpose units 1406 supply the four transposed 4x4 sub-matrices to the 4x16 byte swap unit 1405. Four 4x4 byte data (four 4x4 sub-matrix data) from the 4x16 byte swap unit 1405 are written to IMEMs 0-3, IMEMs 4-7, IMEMs 8-B and IMEMs C-F at respective addresses specified by the address generators 1403.

[0140]

In the present exemplary embodiment, the entire 16x16 matrix data stored in IMEMs 0-F are read and stored into Registers 0-3 of PEs 0-F by the read control function unit in four clock cycles, and then the 16x16 matrix data stored in Registers 0-3 of PEs 0-F are read and written to IMEMs 0-F by the write control function unit in four clock cycles. IMEMs 0-F now hold the corner turned version of the 16x16 matrix originally stored in IMEMs 0-F.

[0141]

Fig. 68 is a block diagram showing the configuration of the write control function unit according to the exemplary embodiment 4. In the present exemplary embodiment, the write control function unit comprises the 4x16 byte swap unit 1405 and four 4x4 byte transpose units 1406.

[0142]

In the present exemplary embodiment wherein the write control function unit has the configuration shown in Fig. 68, the read control function unit is configured as shown in Fig.67. As described above, the read data (four 4x4 sub-matrix data) which have been read in parallel from IMEMs 0-3, IMEMs 4-7, IMEMs 8-B and IMEMs C-F at respective read addresses specified by the address generators 601 are directly written to the corresponding Registers 0 to 3 of PEs 0-3, PEs 4-7, PEs 8-B and PEs C-F specified by a two bit counter value of the 3bit counter 1400.

[0143]

In the write control function unit according to the present exemplary embodiment, the four 4x4 sub-matrix data read from Registers 0-3 of PEs 0-3, PEs 4-7, PEs 8-B and PEs C-F at addresses specified by a two bit counter value of the 3bit counter 1400 are supplied to the 4x16 byte swap unit 1405. The four 4x4 sub-matrix data from the 4x16 byte swap unit 1405 are then supplied to the 4x4 transpose units 1406. The four 4x4 sub-matrix data output from the four 4x4 transpose units 1406 are written to IMEMs 0-3, IMEMs 4-7, IMEMs 8-B and IMEMs C-F at respective addresses specified by the address generators 1403.

[0144]

In the present exemplary embodiment, the entire 16x16 matrix data stored in IMEMs 0-F are read and stored into Registers 0-3 of PEs 0-F by the read control function unit in four clock cycles, and then the 16x16 matrix data stored in Registers 0-3 of PEs 0-F are read and written to IMEMs 0-F by the write control function unit in four clock cycles. IMEMs 0-F now hold the corner turned version of the 16x16 matrix originally stored in IMEMs 0-F.

[0145]

Fig. 69 is a block diagram showing the configuration of the read control function unit according to the exemplary embodiment 5. In the present exemplary embodiment, as shown in Fig.69, the read control function unit is provided with a 4x16 byte swap unit 605 which interchanges horizontal positions of 4x4 sub-matrices, and the 4x4 transpose units 606 are omitted. Four 4x4 sub-matrices (64byte data) read from IMEMs 0-3, IMEMs 4-7, IMEMs 8-B and IMEMs C-F at respective read addresses specified by the address generators 601 are supplied to the 4x16 byte swap unit 605 and then stored in the corresponding Registers 0 to 3 of PEs 0-3, PEs 4-7, PEs 8-B and PEs C-F sequentially specified by a two bit counter value of the 3bit counter 600.

[0146]

Fig. 70 is a block diagram showing the configuration of the write control function unit according to the exemplary embodiment 5. As shown in Fig.70, the write control function unit is provided with four 4x4 transpose units 1406. Four 4x4 sub-matrices (64byte data) read sequentially from Registers 0 to 3 of PEs 0-3, PEs 4-7, PEs 8-B and PEs C-F are supplied to the four 4x4 transpose units 1406. The four 4x4 sub-matrices respectively output from the four 4x4 transpose units 1406 are stored in the corresponding IMEMs 0-3, IMEMs 4-7, IMEMs 8-B and IMEMs C-F at write addresses generated by the address generator 1403.

[0147]

The 16x16 matrix which has been written in IMEMs 0-F and which has the sub-matrices, each of which has undergone swapping in the read control function unit and transposition in the write control function unit, is a corner turned version of the 16x16 matrix data originally stored in IMEMs 0-F.

[0148]

Fig. 71 is a block diagram showing the configuration of the read control function unit according to the exemplary embodiment 6. In the exemplary embodiment, as shown in Fig. 71, in the read control function unit, there are provided four 4x4 transpose units 606, while a 4x16 byte swap unit 605 is omitted. Four 4x4 sub-matrices (64 byte data) read from IMEMs 0-3, IMEMs 4-7, IMEMs 8-B and IMEMs C-F at respective read addresses specified by the address generators 601 are respectively supplied to the four 4x4 transpose units 606 and then are stored in corresponding Registers 0 to 3 of PEs 0-3, PEs 4-7, PEs 8-B and PEs C-F sequentially specified by a two bit counter value of the 3 bit counter 600.

[0149]

Fig. 72 is a block diagram showing the configuration of the write control function unit according to the exemplary embodiment 6. As shown in Fig. 72, in the write control function unit, there is provided a 4x16 byte swap unit 1405 which interchanges horizontal positions of 4x4 sub-matrices. Four 4x4 sub-matrices (64 byte data), each transposed, are read sequentially from Registers 0 to 3 of PEs 0-3, PEs 4-7, PEs 8-B and PEs C-F and supplied to the 4x16 byte swap unit 1405. The four 4x4 sub-matrices respectively output from the 4x16 byte swap unit 1405 are stored in corresponding IMEMs 0-3, IMEMs 4-7, IMEMs 8-B and IMEMs C-F at write addresses generated by the address generators 1403.

[0150]

The 16x16 matrix which has been written in IMEMs 0-F and which has the sub-matrices, each of which has undergone transposition in the read control function unit and swapping in the write control function unit, is a corner turned version of the 16x16 matrix data originally stored in IMEMs 0-F.
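Exemplary embodiments 5 and 6 apply the same two operations in opposite order, and both yield the corner turned matrix. The following Python sketch illustrates why the order does not matter. It is a simplified model under stated assumptions: the function names `swap_blocks` and `transpose_within_blocks` are hypothetical, and `swap_blocks` models the net repositioning of a sub-matrix (the horizontal interchange by the swap unit together with the address and register selection), not the swap unit alone.

```python
def swap_blocks(a, m):
    """Move each m x m sub-matrix from block position (r, c) to (c, r).

    The contents of each sub-matrix are left unchanged.
    """
    n = len(a)
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r, y = divmod(i, m)      # block row, row inside the block
            c, x = divmod(j, m)      # block column, column inside the block
            out[c * m + y][r * m + x] = a[i][j]
    return out


def transpose_within_blocks(a, m):
    """Transpose every m x m sub-matrix; block positions are left unchanged."""
    n = len(a)
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r, y = divmod(i, m)
            c, x = divmod(j, m)
            out[r * m + x][c * m + y] = a[i][j]
    return out


a = [[16 * i + j for j in range(16)] for i in range(16)]
expected = [list(row) for row in zip(*a)]        # full 16x16 transpose

# Order of embodiment 5: swap first, then transpose.
swap_then_transpose = transpose_within_blocks(swap_blocks(a, 4), 4)
# Order of embodiment 6: transpose first, then swap.
transpose_then_swap = swap_blocks(transpose_within_blocks(a, 4), 4)
```

Both `swap_then_transpose` and `transpose_then_swap` equal `expected`, because one operation changes only the position of a sub-matrix and the other changes only the element positions inside it.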

[0151]

INDUSTRIAL APPLICABILITY

The present invention can be used to achieve time and area efficient access to square matrices and their transposes stored in a distributed manner in the internal memory of processing elements working in SIMD mode. Instead of a full crossbar connection, the apparatus provides only connections from each PE to selected IMEM units. The selection is made so as to enable time and area efficient access only to the PE's own IMEM for direct access and to the IMEM units which are needed for the generation of the transposed matrix, so that a corner turn can be executed fast.

[0152]

The exemplified embodiments and the examples may be changed and adjusted in the scope of all disclosures (including claims) of the present invention and based on the basic technological concept thereof. In the scope of the claims of the present invention, various disclosed elements may be combined and selected in a variety of ways. That is, it is to be understood that modifications and changes that may be made by those skilled in the art according to all disclosures, including the claims, and technological concepts are included.

Claims

CLAIMS :

1. An apparatus making access to a square matrix of N times N (where N is a preset positive even integer larger than 2) matrix elements and transposes, and working in single instruction, multiple data mode, the apparatus comprising:

N pieces of processing elements, each of the processing elements having a data width equal to m (where m is a preset positive integer and divisor of N) matrix elements, the N pieces of processing elements being grouped into (N/m) groups, each of the groups having m processing elements;

N pieces of single ported internal memories, each of the internal memories having a data width equal to m matrix elements, the N pieces of the single ported internal memories being grouped into (N/m) groups, each of the groups having m internal memories; and

a connection apparatus that handles connection between the processing elements and the internal memories,

the connection apparatus comprising a read control function unit and a write control function unit to enable execution of matrix copy as well as matrix corner turn in 2x(N/m) clock cycles.

2. The apparatus as claimed in claim 1, wherein the read control function unit includes two paths, either one of the two paths to be selected from input to output, the two paths including:

a first path providing a direct connection between input data and output data signals for access from each processing element of the group of processing elements to an own internal memory element out of the group of internal memory elements; and

a second path providing access for a matrix transposition generation.

3. The apparatus as claimed in claim 1, wherein the write control function unit includes two paths, either one of the two paths to be selected from input to output, the two paths including:

a first path providing a direct connection between input side and output side for access from each processing element of the group of processing elements to an own internal memory element of the group of memory elements; and

a second path providing access for a matrix transposition generation.

4. The apparatus as claimed in claim 1, wherein the read control function unit comprises

a selector that selects either one of a first path and a second path, from input to output, the first path providing a direct connection between input side and output side for access from each processing element of the group of processing elements to an own internal memory element out of the group of internal memory elements, while the second path providing access for a matrix transposition generation,

the read control function unit further comprising:

(N/m) pieces of address generators that calculate in parallel vertical memory element sub block row addresses,

the memory element sub block including m horizontal neighbored memory elements, each having m vertically neighbored matrix elements,

the vertical memory element sub block row address being used to access the memory element sub block at a desired vertical position;

an (N/m) times m² element swap unit that moves in parallel the (N/m) pieces of the memory element sub blocks belonging to one memory element sub block row selected to respective target horizontal positions; and

(N/m) pieces of m times m element transpose units that operate in parallel, each moving each element of the m times m element matrix from position (x, y) to transposed position (y, x).

5. The apparatus as claimed in claim 1, wherein the write control function unit comprises

a selector that selects either one of a first path and a second path, from input to output, the first path providing a direct connection between input side and output side for access from each processing element of the group of processing elements to an own internal memory element of the group of memory elements, while the second path providing access for a matrix transposition generation,

the write control function unit further comprising:

(N/m) pieces of address generators that calculate in parallel vertical memory element sub block row addresses of respective memory element sub block,

the vertical memory element sub block row address being used to transfer in parallel each of the processor element group received memory elements to a target memory element sub block row position of the group of internal memories to enable execution of a matrix copy as well as matrix transposition in 2x(N/m) clock cycles.

6. A method for making access to a square matrix of N times N (where N is a preset positive even integer larger than 2) matrix elements and transposes, and working in single instruction, multiple data mode, the method comprising:

grouping N pieces of processing elements, each of the processing elements having a data width equal to m (where m is a preset positive integer and divisor of N) matrix elements, into (N/m) groups, each of the groups having m processing elements;

grouping N pieces of single ported internal memories, each of the internal memories having a data width equal to m matrix elements, into (N/m) groups, each of the groups having m internal memories; and

handling connection between the processing elements and the internal memories, with read control function and write control function to enable execution of matrix copy as well as matrix transposition in 2x(N/m) clock cycles.

7. The method as claimed in claim 6, comprising:

selecting, with the read control function, either one of two paths, the two paths including:

a first path providing a direct connection between input side and output side for access from each processing element of the group of processing elements to an own internal memory element out of the group of internal memory elements; and

a second path providing access for a matrix transposition generation.

8. The method as claimed in claim 6, comprising:

selecting, with the write control function, either one of two paths, the two paths including:

a first path providing a direct connection between input side and output side for access from each processing element of the group of processing elements to an own internal memory element of the group of internal memory elements; and

a second path providing access for a matrix transposition generation.

9. The method as claimed in claim 6, comprising:

with the read control function,

selecting, by a selector, first and second paths from input to output, the first path providing a direct connection between input side and output side for access from each processing element of the group of processing elements to an own internal memory element out of the group of internal memory elements, while the second path providing access for a matrix transposition generation;

calculating, by (N/m) pieces of address generators, in parallel vertical memory element sub block row addresses,

the memory element sub block including m horizontal neighbored memory elements, each having m vertically neighbored matrix elements, the vertical memory element sub block row address being used to access the memory element sub block at a desired vertical position;

moving, by an (N/m) times m² element swap unit, in parallel the (N/m) pieces of the memory element sub blocks belonging to one memory element sub block row selected to respective target horizontal positions; and

moving, by (N/m) pieces of m times m element transpose units, each element of the m times m element matrix from position (x, y) to transposed position (y, x).

10. The method as claimed in claim 6, comprising:

with the write control function,

selecting either one of a first path and a second path from input to output,

the first path providing a direct connection between input side and output side for access from each processing element of the group of processing elements to an own internal memory element of the group of memory elements, while the second path providing access for a matrix transposition generation, the method further comprising:

calculating, by (N/m) pieces of address generators, in parallel vertical memory element sub block row addresses of respective memory element sub block, the vertical memory element sub block row address being used to transfer in parallel each of the processor element group received memory elements to a target memory element sub block row position of the group of internal memories to enable execution of a matrix copy as well as matrix transposition in 2x(N/m) clock cycles.

11. An apparatus working in single instruction, multiple data mode and accessing an NxN square matrix, comprising:

N pieces of internal memories, each of the internal memories inputting and outputting data with a data width equal to m matrix elements, where m is a preset positive integer and divisor of N, the N pieces of the internal memories being grouped into (N/m) groups, each of the groups having m pieces of the internal memories,

the NxN square matrix stored in the N pieces of the internal memories, being handled as divided into (N/m)x(N/m) pieces of mxm sub-matrices, each of the mxm sub-matrices having m rows and m columns of matrix elements and being stored in m pieces of the internal memories composing one group, each internal memory of the group storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word;

N pieces of processing elements, each of the processing elements having a data width equal to m matrix elements, the N pieces of the processing elements being grouped into (N/m) groups, each of the groups having m pieces of the processing elements, in association with the groups of the internal memories, each of the processing elements including a register file having (N/m) pieces of registers, each of the registers storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word; and

a connection apparatus provided between the N pieces of the processing elements and the N pieces of the internal memories, the connection apparatus including a read control function unit and a write control function unit,

the read control function unit comprising:

a read address generation means;

a swap unit; and

(N/m) pieces of transpose units, and

the write control function unit comprising

a write address generation means, wherein

when reading the NxN square matrix stored in N pieces of the internal memories and storing the NxN square matrix into the registers of N pieces of the processing elements,

the read control function unit carries out the following operations, in one cycle:

(N/m) pieces of the mxm sub-matrices being read respectively from the (N/m) groups of the internal memories which are respectively addressed by the read address generation means, and being supplied to the swap unit;

the swap unit effecting swapping between each of a preset (N/m)/2 pairs out of (N/m) pieces of the mxm sub-matrices, the swap unit effecting no swapping for diagonally located (N/m) pieces of the mxm sub-matrices;

the (N/m) pieces of the mxm sub-matrices from the swap unit being supplied to the (N/m) pieces of the transpose units;

each of the (N/m) pieces of the transpose units, generating a transposed version of the mxm sub-matrix supplied from the swap unit; and

(N/m) pieces of the transposed mxm sub-matrices from the (N/m) pieces of the transpose units being stored respectively in the registers of the (N/m) groups of the processing elements,

such that copying with transposition of the NxN square matrix from the N pieces of the internal memories to the registers of the N pieces of the processing elements takes (N/m) cycles, and wherein

when writing the NxN square matrix stored in registers of the N pieces of the processing elements into the N pieces of the internal memories,

the write control function unit carries out the following operations, in one cycle:

(N/m) pieces of the mxm sub-matrices being read respectively from the registers of the (N/m) groups of the processing elements and being written respectively into the (N/m) groups of the internal memories which are respectively addressed by the write address generation means,

such that copying of the NxN square matrix from the registers of the N pieces of the processing elements to the N pieces of the internal memories takes (N/m) cycles,

the NxN square matrix written by the write control function unit into the N pieces of the internal memories being a corner turned version of the NxN square matrix originally stored in the N pieces of the internal memories.

12. An apparatus working in single instruction, multiple data mode and accessing an NxN square matrix, comprising:

N pieces of internal memories, each of the internal memories inputting and outputting data with a data width equal to m matrix elements, where m is a preset positive integer and divisor of N,

the N pieces of the internal memories being grouped into (N/m) groups, each of the groups having m pieces of the internal memories, the NxN square matrix stored in the N pieces of the internal memories, being handled as divided into (N/m)x(N/m) pieces of mxm sub-matrices, each of the mxm sub-matrices having m rows and m columns of matrix elements and being stored in m pieces of the internal memories composing one group, each internal memory of the group storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word;

N pieces of processing elements, each of the processing elements having a data width equal to m matrix elements, the N pieces of the processing elements being grouped into (N/m) groups, each of the groups having m pieces of the processing elements, in association with the groups of the internal memories, each of the processing elements including a register file having (N/m) pieces of registers, each of the registers storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word; and

the read control function unit comprising:

a read address generation means;

a swap unit; and

(N/m) pieces of transpose units, and

the write control function unit comprising:

a write address generation means, wherein

when reading the NxN square matrix stored in the N pieces of internal memories and storing the NxN square matrix into the registers of N pieces of processing elements,

(N/m) pieces of the mxm sub-matrices being read respectively from the (N/m) groups of the internal memories which are respectively addressed by the read address generation means, and being supplied respectively to the (N/m) pieces of the transpose units; each of the (N/m) pieces of the transpose units generating a transposed version of the mxm sub-matrix supplied thereto;

(N/m) pieces of the transposed mxm sub-matrices respectively output from the (N/m) pieces of the transpose units being supplied to the swap unit;

the swap unit effecting swapping between each of a preset (N/m)/2 pairs out of (N/m) pieces of the mxm sub-matrices, the swap unit effecting no swapping for diagonally located (N/m) pieces of the mxm sub-matrices; and

(N/m) pieces of the mxm sub-matrices output from the swap unit being stored respectively in the registers of the (N/m) groups of the processing elements,

such that copying the NxN square matrix from the N pieces of the internal memories to the registers of the N pieces of the processing elements takes (N/m) cycles, and wherein

the write control function unit carries out the following operations, in one cycle:

(N/m) pieces of the mxm sub-matrices being read respectively from the registers of the (N/m) groups of the processing elements and being written respectively into the (N/m) groups of the internal memories, which are respectively addressed by the write address generation means, such that copying of the NxN square matrix from the registers of the N pieces of the processing elements to the N pieces of the internal memories takes (N/m) cycles,

13. An apparatus working in single instruction, multiple data mode and accessing an NxN square matrix, comprising:

N pieces of internal memories, each of the internal memories inputting and outputting data with a data width equal to m matrix elements, where m is a preset positive integer and divisor of N,

the N pieces of the internal memories being grouped into (N/m) groups, each of the groups having m pieces of the internal memories, the NxN square matrix stored in the N pieces of the internal memories, being handled as divided into (N/m)x(N/m) pieces of mxm sub-matrices, each of the mxm sub-matrices having m rows and m columns of matrix elements and being stored in m pieces of internal memories composing one group, each internal memory of the group storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word;

N pieces of processing elements, each of the processing elements having a data width equal to m matrix elements, the N pieces of the processing elements being grouped into (N/m) groups, each of the groups having m pieces of the processing elements, in association with the groups of the internal memories, each of the processing elements including a register file having (N/m) pieces of registers, each of the registers storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word; and

a connection apparatus provided between the N pieces of the processing elements and the N pieces of the internal memories, the connection apparatus including a read control function unit and a write control function unit,

the read control function unit comprising

a read address generation means, and

the write control function unit comprising:

a write address generation means;

a swap unit; and

(N/m) pieces of transpose units, wherein

the read control function unit carries out the following operations, in one cycle:

(N/m) pieces of the mxm sub-matrices being read respectively from the (N/m) groups of the internal memories which are respectively addressed by the read address generation means, and the (N/m) pieces of the mxm sub-matrices being stored respectively in the registers of the (N/m) groups of the processing elements, such that copying the NxN square matrix from the N pieces of the internal memories to the registers of the N pieces of the processing elements takes (N/m) cycles, and wherein

(N/m) pieces of the mxm sub-matrices being read respectively from registers of the (N/m) groups of the processing elements and being supplied respectively to the (N/m) pieces of the transpose units; each of the (N/m) pieces of the transpose units generating a transposed version of the mxm sub-matrix supplied thereto from the registers;

(N/m) pieces of the transposed mxm sub-matrices from the (N/m) pieces of the transpose units being supplied to the swap unit;

the swap unit effecting swapping between each of a preset (N/m)/2 pairs out of (N/m) pieces of the mxm sub-matrices, the swap unit effecting no swapping for diagonally located (N/m) pieces of the mxm sub-matrices; and

(N/m) pieces of the mxm sub-matrices from the swap unit being written respectively into the (N/m) groups of the internal memories which are respectively addressed by the write address generation means,

such that copying the NxN square matrix from the registers of the N pieces of the processing elements to the N pieces of the internal memories takes (N/m) cycles,

the NxN square matrix written by the write control function unit into the N pieces of the internal memories being a corner turned version of the NxN square matrix originally stored in the N pieces of the internal memories.

14. An apparatus working in single instruction, multiple data mode and accessing an NxN square matrix, comprising:

N pieces of internal memories, each of the internal memories inputting and outputting data with a data width equal to m matrix elements, where m is a preset positive integer and divisor of N,

the N pieces of the internal memories being grouped into (N/m) groups, each of the groups having m pieces of the internal memories, the NxN square matrix stored in the N pieces of the internal memories, being handled as divided into (N/m)x(N/m) pieces of mxm sub-matrices, each of the mxm sub-matrices having m rows and m columns of matrix elements and stored in m pieces of internal memories composing one group, each internal memory of the group storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word;

N pieces of processing elements, each of the processing elements having a data width equal to m matrix elements, the N pieces of the processing elements being grouped into (N/m) groups, each group having m pieces of the processing elements, in association with the groups of the internal memories, each of the processing elements including a register file having (N/m) pieces of registers, each of the registers storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word; and

a connection apparatus provided between the N pieces of the processing elements and the N pieces of the internal memories, the connection apparatus including a read control function unit and a write control function unit,

the read control function unit comprising

a read address generation means, and

the write control function unit comprising:

a write address generation means;

a swap unit; and

(N/m) pieces of transpose units, wherein

(N/m) pieces of the mxm sub-matrices being read respectively from the (N/m) groups of the internal memories which are respectively addressed by the read address generation means, and the (N/m) pieces of the mxm sub-matrices being stored respectively in the registers of the (N/m) groups of the processing elements;

such that copying of the NxN square matrix from the N pieces of the internal memories to the registers of the N pieces of the processing elements takes (N/m) cycles, and wherein

when writing the NxN square matrix stored in registers of the N pieces of the processing elements into the N pieces of the internal memories,

(N/m) pieces of the mxm sub-matrices being read respectively from registers of the (N/m) groups of the processing elements and being supplied to the swap unit;

(N/m) pieces of the mxm sub-matrices from the swap unit being supplied respectively to the (N/m) pieces of the transpose units;

each of the (N/m) pieces of the transpose units generating a transposed version of the mxm sub-matrix supplied thereto from the swap unit; and

(N/m) pieces of the transposed mxm sub-matrices from the (N/m) pieces of the transpose units being written respectively into the (N/m) groups of the internal memories which are respectively addressed by the write address generation means,

such that copying of the NxN square matrix from the registers of the N pieces of the processing elements to the N pieces of the internal memories takes (N/m) cycles,

the NxN square matrix written by the write control function unit into the N pieces of the internal memories being a corner turned version of the NxN square matrix originally stored in the N pieces of the internal memories.

15. An apparatus working in single instruction, multiple data mode and accessing an NxN square matrix, comprising:

the N pieces of the internal memories being grouped into (N/m) groups, each of the groups having m pieces of the internal memories, the NxN square matrix stored in the N pieces of the internal memories, being handled as divided into (N/m)x(N/m) pieces of mxm sub-matrices, each of the mxm sub-matrices having m rows and m columns of matrix elements and being stored in m pieces of internal memories composing one group, each internal memory of the group storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word;

N pieces of processing elements, each of the processing elements having a data width equal to m matrix elements, the N pieces of the processing elements being grouped into (N/m) groups, each group having m pieces of the processing elements, in association with the groups of the internal memories, each of the processing elements including a register file having (N/m) pieces of registers, each of the registers storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word; and

the read control function unit comprising:

a read address generation means; and

a swap unit; and

the write control function unit comprising:

a write address generation means; and

(N/m) pieces of transpose units, wherein

when reading the NxN square matrix stored in the N pieces of internal memories and storing the NxN square matrix into the registers of N pieces of processing elements,

(N/m) pieces of the mxm sub-matrices being read respectively from the (N/m) groups of the internal memories which are respectively addressed by the read address generation means, and being supplied to the swap unit;

(N/m) pieces of the mxm sub-matrices from the swap unit being stored respectively in the registers of the (N/m) groups of the processing elements,

such that copying of the NxN square matrix from the N pieces of the internal memories to the registers of the N pieces of the processing elements takes (N/m) cycles, and wherein

when writing the NxN square matrix stored in the registers of the N pieces of the processing elements into the N pieces of the internal memories,

the write control function unit carries out the following operations, in one cycle:

(N/m) pieces of the mxm sub-matrices being read respectively from the registers of the (N/m) groups of the processing elements and being supplied to the (N/m) pieces of the transpose units;

each of the (N/m) pieces of the transpose units generating a transposed version of the mxm sub-matrix supplied thereto from the registers;

(N/m) pieces of the transposed mxm sub-matrices from the (N/m) pieces of the transpose units being written respectively into the (N/m) groups of the internal memories, which are respectively addressed by the write address generation means,

such that copying with transposition of the NxN square matrix from the registers of the N pieces of the processing elements to the N pieces of the internal memories takes (N/m) cycles, the NxN square matrix written by the write control function unit into the N pieces of the internal memories being a corner turned version of the NxN square matrix originally stored in the N pieces of the internal memories.

16. An apparatus working in single instruction, multiple data mode and accessing an NxN square matrix, comprising:

the N pieces of the internal memories being grouped into (N/m) groups, each of the groups having m pieces of the internal memories, the NxN square matrix stored in the N pieces of the internal memories, being handled as divided into (N/m)x(N/m) pieces of mxm sub-matrices, each of the mxm sub-matrices having m rows and m columns of matrix elements and being stored in m pieces of internal memories composing one group, each internal memory of the group storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word;

N pieces of processing elements, each of the processing elements having a data width equal to m matrix elements, the N pieces of the processing elements being grouped into (N/m) groups, each group having m pieces of the processing elements, in association with the groups of the internal memories, each of the processing elements including a register file having (N/m) pieces of registers, each of the registers storing m rows of matrix elements of an associated column of the mxm sub-matrix as one word; and

the read control function unit comprising:

a read address generation means; and

(N/m) pieces of transpose units, and

the write control function unit comprising:

a write address generation means; and

a swap unit, wherein

when reading the NxN square matrix stored in the N pieces of the internal memories and storing the NxN square matrix into the registers of N pieces of the processing elements,

(N/m) pieces of the mxm sub-matrices being read respectively from the (N/m) groups of the internal memories which are respectively addressed by the read address generation means, and being supplied respectively to the (N/m) pieces of the transpose units;

each of the (N/m) pieces of the transpose units generating a transposed version of the mxm sub-matrix supplied thereto; and

(N/m) pieces of the transposed mxm sub-matrices from the (N/m) pieces of the transpose units being stored respectively in the registers of the (N/m) groups of the processing elements,

(N/m) pieces of the mxm sub-matrices being read respectively from the registers of the (N/m) groups of the processing elements and being supplied to the swap unit;

such that copying of the NxN square matrix from the registers of the N pieces of the processing elements to the N pieces of the internal memories takes (N/m) cycles,

the NxN square matrix written by the write control function unit into the N pieces of the internal memories being a corner turned version of the NxN square matrix originally stored in the N pieces of the internal memories.