CN116776058A - Matrix transpose method

Info

Publication number: CN116776058A
Application number: CN202310518621.8A
Authority: CN
Inventors: 裴京; 王松; 马骋; 李博文; 徐海峥
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2023-05-09
Filing date: 2023-05-09
Publication date: 2023-09-19

Abstract

The application relates to a matrix transposition method. The method comprises: cyclically performing a data read-write operation on a target matrix until all data in the target matrix has been read, thereby obtaining the transposed matrix of the target matrix. The data read-write operation comprises: controlling, in clock order, a plurality of shift registers in a chip to read in turn a preset quantity of data from the target matrix, where the preset quantity equals the number of shift registers and, during reading, the data already stored in each shift register is shifted one bit toward the high-order end on every clock; and, once every shift register has completed one data read, controlling the data stored in the highest-order bit of each shift register to be written into the transposed matrix. The method increases the speed at which the chip computes matrix transposes and, in turn, the running speed of the computer equipment that carries the chip.

Description

Matrix transpose method

Technical field

The present application relates to the field of information processing technology, and in particular to a matrix transposition method.

Background

When an artificial intelligence chip runs a neural network model, it usually needs to arrange the data being processed into matrices and then operate on those matrices. Transposition is one of the most basic matrix operations.

In the related art, transposing a matrix usually requires reading the matrix data in address order, performing addressing, and then converting the address to output the transposed data; this process is repeated until the matrix has been transposed.

However, this approach transposes matrices inefficiently, which slows the artificial intelligence chip's computation and lowers its computing efficiency.

Summary of the invention

In view of the above technical problem, it is necessary to provide a matrix transposition method that can increase the computing speed, and thereby the computing efficiency, of an artificial intelligence chip.

In a first aspect, the present application provides a matrix transposition method, the method including:

Cyclically performing a data read-write operation on a target matrix until all data in the target matrix has been read, to obtain the transposed matrix of the target matrix, where the data read-write operation includes:

Controlling, in clock order, a plurality of shift registers in a chip to read in turn a preset quantity of data from the target matrix, where the preset quantity equals the number of shift registers and, during reading, the data already stored in each shift register is shifted one bit toward the high-order end on every clock;

Once every shift register has completed one data read, controlling the data stored in the highest-order bit of each shift register to be written into the transposed matrix.

In one embodiment, controlling the plurality of shift registers in clock order to read data from the target matrix in turn includes:

Obtaining the read order of the plurality of shift registers;

Obtaining, according to the clock order and the read order of the shift registers, the read address of each shift register in the target matrix;

Controlling each shift register to read data from the target matrix according to its read address in the target matrix.

In one embodiment, obtaining the read order of the plurality of shift registers includes:

Determining the lengths of the shift registers according to the bandwidth of the storage unit in the chip, where no two shift registers have the same length;

Taking the descending order of the shift registers' lengths as the ordering of the shift registers.

In one embodiment, the number of shift registers equals the bandwidth, and the lengths of the shift registers decrease successively; the minimum length equals the bandwidth plus one, and the maximum length equals twice the bandwidth.
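As an illustration of the length rule in this embodiment, the sketch below lists the register lengths for a given bandwidth. The function name `register_lengths` is hypothetical. The step of one between neighbouring lengths is not stated explicitly in the text; it follows from requiring as many distinct lengths as the bandwidth between bandwidth plus one and twice the bandwidth.

```python
def register_lengths(bandwidth: int) -> list[int]:
    """Lengths of `bandwidth` shift registers, longest first.

    The longest register has 2 * bandwidth positions and the shortest has
    bandwidth + 1, so neighbouring registers differ by exactly one position.
    """
    return [2 * bandwidth - i for i in range(bandwidth)]


lengths = register_lengths(8)
print(lengths)  # [16, 15, 14, 13, 12, 11, 10, 9]
```

Listing the lengths longest first also gives the ordering described in the preceding embodiment.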

In one embodiment, obtaining the read address of each shift register in the target matrix according to the clock order and the read order of the shift registers includes:

Obtaining the shift register corresponding to each clock according to the clock order and the read order of the shift registers, where one shift register performs a data read on each clock;

Taking the address of the first row of data in the target matrix as the read address of the shift register corresponding to the first clock, the address of the second row as the read address of the shift register corresponding to the second clock, and so on, to obtain the read address of the shift register corresponding to each clock.
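The mapping from clocks to registers and row addresses can be sketched as follows. This is a minimal illustration under two assumptions that the text does not spell out: registers are numbered from 1 in their read order, and after the last register has read, the next clock returns to the first register. The name `read_schedule` is hypothetical.

```python
def read_schedule(num_registers: int, num_rows: int) -> list[tuple[int, int, int]]:
    """Return (clock, register, row_address) triples, all counted from 1."""
    schedule = []
    for clock in range(1, num_rows + 1):
        # Registers take turns in read order; wrapping around is an assumption.
        register = (clock - 1) % num_registers + 1
        # Clock k uses the address of row k of the target matrix.
        row_address = clock
        schedule.append((clock, register, row_address))
    return schedule


for clock, register, row in read_schedule(num_registers=4, num_rows=6):
    print(f"clock {clock}: register {register} reads row {row}")
```

The sketch covers only a matrix whose rows fit in one read; wider matrices would also need the offset-address jumps of the later embodiment.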

In one embodiment, controlling each shift register to read data from the target matrix according to its read address in the target matrix includes:

For any shift register, determining the target data corresponding to that shift register according to its read address in the target matrix;

Controlling the shift register to read the target data.

In one embodiment, controlling the data stored in the highest-order bit of each shift register to be written into the transposed matrix includes:

Obtaining the data stored in the highest-order bit of each shift register and combining these data to obtain a transposed vector;

Controlling the transposed vector to be written into the transposed matrix.

In one embodiment, controlling the transposed vector to be written into the transposed matrix includes:

Obtaining the write address of the transposed vector in the transposed matrix;

Controlling the transposed vector to be written into the transposed matrix according to the write address.

In one embodiment, obtaining the write address of the transposed vector in the transposed matrix includes:

Obtaining the number of times, so far during the cyclic data read-write operations on the target matrix, that every shift register has completed one data read;

Determining the column address of the transposed vector in the transposed matrix according to that number of times;

Taking the column address of the transposed vector in the transposed matrix as the write address of the transposed vector in the transposed matrix.

In one embodiment, the matrix transposition method further includes:

Determining the offset-address jump step of each clock according to the size of the target matrix and the bandwidth of the storage unit in the chip;

During the cyclic data read-write operations on the target matrix, for any clock, performing offset-address jumps in the row direction according to that clock's offset-address jump step and, once the offset-address jumps are complete, advancing that clock's base address by one address.

In a second aspect, the present application further provides a matrix transposition device. The device includes an acquisition module, and the acquisition module includes a reading unit and a writing unit, where:

The acquisition module is configured to cyclically perform a data read-write operation on a target matrix until all data in the target matrix has been read, to obtain the transposed matrix of the target matrix;

The reading unit is configured to control, in clock order, a plurality of shift registers in a chip to read in turn a preset quantity of data from the target matrix, where the preset quantity equals the number of shift registers and, during reading, the data already stored in each shift register is shifted one bit toward the high-order end on every clock;

The writing unit is configured to control, once every shift register has completed one data read, the data stored in the highest-order bit of each shift register to be written into the transposed matrix.

In a third aspect, the present application further provides a computer device. The computer device includes a memory and a processor; the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the method in any embodiment of the first aspect.

In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the steps of the method in any embodiment of the first aspect.

In a fifth aspect, the present application further provides a computer program product that includes a computer program. When executed by a processor, the computer program implements the steps of the method in any embodiment of the first aspect.

In the above matrix transposition method, a data read-write operation is performed cyclically on the target matrix until all data in the target matrix has been read, and the transposed matrix of the target matrix is obtained. The data read-write operation includes: controlling, in clock order, a plurality of shift registers in a chip to read in turn a preset quantity of data from the target matrix, where the preset quantity equals the number of shift registers and, during reading, the data already stored in each shift register is shifted one bit toward the high-order end on every clock; and, once every shift register has completed one data read, controlling the data stored in the highest-order bit of each shift register to be written into the transposed matrix. On each clock, the present application controls a shift register to read the preset quantity of data from the target matrix while also shifting the data already stored in each shift register one bit toward the high-order end, and, once every shift register has completed one data read, it writes the data stored in the highest-order bit of each shift register into the transposed matrix. In effect, the shift registers read, shift, and write at the same time during the transposition. This pipelined read-write mode keeps the shift registers from sitting idle, matches the pipelined operation of the chip, and shortens the time needed for the transposition. Moreover, in the matrix transposition method provided by the present application, the quantity of data that a shift register reads from the target matrix equals the number of shift registers. Therefore, once every shift register has completed one data read, the bandwidth of the data read by the shift register corresponding to each clock matches the bandwidth of the highest-order data output by the shift registers on that clock. Reading and writing with matching input and output bandwidth lets the chip jump addresses in a regular pattern while writing the transposed matrix, which improves the write efficiency of the transposition. In addition, chips usually run inside computer equipment. When matrices are transposed within the chip using the matrix transposition method provided by the embodiments of the present application, improving the transposition efficiency of those matrices therefore amounts to increasing the chip's running speed, and in turn the running speed of the computer equipment that carries the chip.

Brief description of the drawings

Figure 1 is a diagram of the application environment of the matrix transposition method in one embodiment;

Figure 2 is a schematic flowchart of the matrix transposition method in one embodiment;

Figure 3 is a schematic flowchart of the data reading steps in one embodiment;

Figure 4 is a schematic flowchart of the data reading steps in another embodiment;

Figure 5 is a schematic flowchart of the data reading steps in another embodiment;

Figure 6 is a schematic flowchart of the data reading steps in another embodiment;

Figure 7 is a schematic flowchart of the data writing steps in one embodiment;

Figure 8 is a schematic flowchart of the data writing steps in another embodiment;

Figure 9 is a schematic flowchart of the data writing steps in another embodiment;

Figure 10 is a schematic flowchart of address jumping in another embodiment;

Figure 11 is a schematic flowchart of a target matrix in one embodiment;

Figure 12 is a schematic flowchart of the data read-write operation in one embodiment;

Figure 13 is a schematic flowchart of address jumping in another embodiment;

Figure 14 is a structural block diagram of a matrix transposition device in one embodiment.

Detailed description

To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and do not limit it.

The matrix operation method provided by the embodiments of the present application can be applied to a chip. The chip may be an artificial intelligence chip such as a voice processing chip, a video processing chip, or an image processing chip, and its internal structure may be as shown in Figure 1. The chip includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the chip provides computing and control capabilities. The memory of the chip includes a non-volatile storage medium and a shift memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The shift memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the chip stores thread stack processing data. The input/output interface of the chip is used to exchange information between the processor and external devices. The communication interface of the chip is used to communicate with external terminals over a network connection. When executed by the processor, the computer program implements a matrix processing method. Those skilled in the art will understand that the structure shown in Figure 1 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the chip to which the solution is applied; a specific chip may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.

With the rapid development of artificial intelligence technology, artificial intelligence chips are widely used in many fields. Because of their powerful data computing capabilities, the computing tasks they perform are becoming increasingly complex. When an artificial intelligence chip runs a neural network model, it needs to arrange the data being processed into matrices and then operate on those matrices. Transposition is one of the most basic matrix operations.

In artificial intelligence chips, matrix transposition is usually implemented with shift registers using contiguous module-address addressing. The related art uses a blocking matrix transposition method: according to the bit width, the shift register is controlled to read data in address order on the current clock and, on the next clock, to convert the address and output the data. This process repeats until the matrix has been transposed.

The blocking matrix transposition method is illustrated with a shift register whose read bandwidth is 8 bits and a matrix A of size 8×16. First, the shift register is controlled to read 8 bits of data from matrix A row by row. Next, the transposed matrix B (16×8) corresponding to matrix A is addressed, and the data that was read is written into B column by column until all of it has been written. The shift register is then controlled to read the 8 bits of data in the next row of matrix A, and the process repeats until all the data of matrix A has been written into B, completing the transposition of matrix A.
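For comparison with the pipelined method described later, the blocking scheme can be modelled roughly as below. This is a software sketch, not the related-art circuit. It assumes that each of the 8 "bits" read per clock is one matrix element, that A has 8 rows of 16 elements, and that one read and the following write each cost one clock. The names `blocking_transpose` and `clocks` are hypothetical.

```python
def blocking_transpose(a, bandwidth=8):
    """Transpose `a` by strictly alternating a read clock and a write clock."""
    rows, cols = len(a), len(a[0])
    b = [[None] * rows for _ in range(cols)]
    clocks = 0
    for r in range(rows):
        for start in range(0, cols, bandwidth):
            # Read clock: fetch up to `bandwidth` items of one row.
            chunk = a[r][start:start + bandwidth]
            clocks += 1
            # Write clock: each item is addressed separately, because the
            # items of one row land in different rows of b.
            for offset, value in enumerate(chunk):
                b[start + offset][r] = value
            clocks += 1
    return b, clocks


a = [[r * 16 + c for c in range(16)] for r in range(8)]
b, clocks = blocking_transpose(a)
print(len(b), len(b[0]), clocks)  # 16 8 32
```

In this model no write can begin until its read has finished, so half of the clocks carry no read and half carry no write.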

However, in the blocking method the shift register cannot read and write at the same time: reading data and outputting data take place on two separate clocks, so the method is incompatible with the pipelined operation inside the chip. Furthermore, when outputting data, the blocking method has to address the column-wise output repeatedly, and this cumbersome addressing makes the transposition take too long.

For this reason, the present application provides a non-blocking matrix transposition method. By setting the length of each shift register and the way data is read and written, the shift registers in the chip read and output data in a pipelined manner, which transposes matrices quickly, makes full use of the chip's computing capability, and in turn reduces the chip's power consumption. The matrix transposition method is described below through an embodiment.

In one embodiment, as shown in Figure 2, a matrix transposition method is provided. Taking its application to the chip in Figure 1 as an example, the method includes the following steps:

Cyclically performing a data read-write operation on the target matrix until all data in the target matrix has been read, to obtain the transposed matrix of the target matrix.

The chip in the embodiments of the present application can be used in many scenarios according to actual needs, such as voice processing, video processing, and image processing. Depending on the application scenario, the chip may therefore be a voice processing chip, a video processing chip, an image processing chip, and so on. Correspondingly, if the chip is a voice processing chip, the target matrix is a voice information matrix obtained from voice information; if the chip is a video processing chip, the target matrix is a video information matrix obtained from video information; if the chip is an image processing chip, the target matrix is an image information matrix obtained from image information. The embodiments of the present application do not limit the type of chip.

It should be noted that, when transposing a matrix, the chip in the embodiments of the present application does not use the random addressing of a computer; instead, it uses contiguous module addresses to read the data in the matrix directly. The data of adjacent address units is placed in different memories, such as shift registers, and the shift registers work in parallel to perform the transposition.

Once the target matrix has been obtained, the transposition is performed on it: data is read from the target matrix and written at the same time and, combined with the chip's contiguous module addressing, the data of the target matrix is read and written cyclically. When all the data in the target matrix has been read, the transposed matrix of the target matrix has therefore been obtained.

The data read-write operation includes:

S201: controlling, in clock order, a plurality of shift registers in the chip to read in turn a preset quantity of data from the target matrix, where the preset quantity equals the number of shift registers and, during reading, the data already stored in each shift register is shifted one bit toward the high-order end on every clock.

In the embodiments of the present application, a plurality of shift registers in the chip form a group that performs the transposition of the target matrix, and one shift register is allowed to read data from the target matrix on each clock. In clock order, one of the shift registers in the group at a time is controlled to read data from the target matrix and to store the data it has read in its lowest-order bit.

Every shift register reads the same quantity of data from the target matrix, namely the number of shift registers in the group. For example, if eight shift registers in the chip transpose the target matrix, the shift register corresponding to each clock reads 8 matrix elements from the target matrix.

Meanwhile, while the shift register corresponding to the current clock is reading data, any other shift register that read data before the current clock, and therefore holds stored data, is controlled to shift its stored data one bit toward the high-order end.

Again taking eight shift registers that transpose the target matrix as an example: if none of the eight shift registers has yet read data from the target matrix, then when the shift register corresponding to the first clock reads data, the other seven hold no stored data, so no data is shifted toward the high-order end. If all eight shift registers have already read data from the target matrix, then when the shift register corresponding to the current clock reads data, the other seven all hold stored data, so on the current clock the stored data in each shift register is also shifted one bit toward the high-order end.

The reading process is described below for a group of eight shift registers. On the first clock, the first register reads eight bits of data from the target matrix and stores them in its low-order bits. On the second clock, the second register reads eight bits of data from the target matrix and stores them in its low-order bits, while the stored data of the first register is shifted one bit toward the high-order end. On the third clock, the third register reads eight bits of data from the target matrix and stores them in its low-order bits, while the stored data of the first register is shifted a further bit (two bits in total) and the stored data of the second register is shifted one bit. On the fourth clock, the fourth register reads eight bits of data from the target matrix and stores them in its low-order bits, while the stored data of the first register is shifted a further bit (three bits in total), the stored data of the second register is shifted a further bit (two bits in total), and the stored data of the third register is shifted one bit. The pattern continues in the same way. On the eighth clock, the eighth register reads eight bits of data from the target matrix, and the stored data of the first to seventh registers has been shifted toward the high-order end by a total of seven, six, five, four, three, two, and one bits respectively.
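The cumulative shifts in this walk-through follow a simple rule: a register that read on clock i has been shifted t - i times by clock t. The sketch below tabulates that rule. It assumes that registers and clocks are both counted from 1 and that a register is first shifted on the clock after its own read; the name `cumulative_shifts` is hypothetical.

```python
def cumulative_shifts(num_registers: int, clock: int) -> dict[int, int]:
    """Map each register that has already read to its total shift count.

    Register i reads on clock i and is then shifted once on every later
    clock, so by clock t it has been shifted t - i times.
    """
    return {
        register: clock - register
        for register in range(1, num_registers + 1)
        if register <= clock
    }


print(cumulative_shifts(num_registers=8, clock=8))
# {1: 7, 2: 6, 3: 5, 4: 4, 5: 3, 6: 2, 7: 1, 8: 0}
```

The result for the eighth clock matches the shift counts listed for the first to seventh registers, with the eighth register not yet shifted.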

S202: once every shift register has completed one data read, controlling the data stored in the highest-order bit of each shift register to be written into the transposed matrix.

Once every shift register has completed one data read, every shift register holds data from the target matrix. Because the stored data in each shift register has been shifted one bit toward the high-order end on every clock during reading, the stored data in each shift register has by then moved to the highest-order bit. At this point, the data stored in the highest-order bit of each shift register is output and written into the transposed matrix.
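Steps S201 and S202 can be explored with the behavioural sketch below. It is one possible software reading of the description, not the patented circuit. It assumes that the matrix width equals the number of registers n, that the row count is a multiple of n, that each register slot holds one matrix element, that each clock shifts before it loads, and that a row is loaded in reverse so that column 0 leaves first. The register lengths 2n down to n + 1 come from the embodiment on register lengths; the names `pipelined_transpose`, `regs`, and `out` are hypothetical.

```python
def pipelined_transpose(a):
    """Transpose `a` with n staggered shift registers of lengths 2n .. n+1."""
    n = len(a[0])            # bandwidth = number of registers (assumption)
    rows = len(a)
    if rows % n != 0:
        raise ValueError("row count must be a multiple of the bandwidth")
    regs = [[None] * (2 * n - i) for i in range(n)]
    out = [[None] * rows for _ in range(n)]
    for clock in range(rows + n):
        # Shift: every register moves its contents one slot toward the top.
        for reg in regs:
            reg.insert(0, None)
            reg.pop()
        # Read: one register per clock loads one row into its low slots.
        if clock < rows:
            regs[clock % n][:n] = a[clock][::-1]
        # Write: once n rows are in, the top slots form one transposed vector.
        if clock >= n:
            block, col = divmod(clock - n, n)
            for i, reg in enumerate(regs):
                out[col][block * n + i] = reg[-1]
    return out


a = [[r * 4 + c for c in range(4)] for r in range(8)]
print(pipelined_transpose(a) == [list(col) for col in zip(*a)])  # True
```

Under these assumptions the graded lengths make the oldest item of every register reach its top slot on the same clock. After the first n clocks, each clock therefore reads one row and writes one transposed vector, and the whole matrix takes rows + n clocks.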

In the embodiment of the present application, the data read-write operation is executed cyclically on the target matrix until all data in the target matrix has been read, yielding the transposed matrix of the target matrix. The data read-write operation comprises: controlling, in clock order, a plurality of shift registers in the chip to read a preset number of data items from the target matrix in turn, where the preset number equals the number of shift registers and the stored data in each shift register moves one position toward the high end at every clock during reading; and, once every shift register has completed one data read, writing the data stored in the highest position of each shift register into the transposed matrix.

At each clock, one shift register reads the preset number of data items while the stored data in every shift register moves one position toward the high end, and once every shift register has completed one data read, the highest-position data of each shift register is written into the transposed matrix. The shift registers therefore read, shift, and write at the same time during the transposition. This pipelined read-write mode keeps the shift registers from idling, matches the pipelined operation of the chip, and shortens the matrix transposition.

Moreover, the number of data items a shift register reads from the target matrix equals the number of shift registers. Once every shift register has completed one data read, the bandwidth with which one shift register reads data at each clock therefore equals the bandwidth of the highest-position data output by all shift registers at that clock. With input and output bandwidths matched, the chip can step through addresses in a regular pattern while writing the transposed matrix, which improves the write efficiency of the transposition.

Taking the voice processing, video processing, and image processing chips listed above as examples, the chip in each scenario runs inside a computer device. When the matrix transposition method of this embodiment is used on the chip to transpose a voice information matrix, video information matrix, or image information matrix, the higher transposition efficiency raises the running speed of the chip and, in turn, of the computer device that carries it.

While the chip performs the matrix transposition, data must be written to the transposed matrix at the same time as data is read from the target matrix, which speeds up the transposition. On this basis, the following embodiment describes the steps for reading data from the target matrix.

In one embodiment, as shown in Figure 3, controlling the plurality of shift registers in turn, in clock order, to read data from the target matrix includes:

S301: obtain the read order of the plurality of shift registers.

The shift registers in the chip are grouped and then ordered according to the rule that one shift register reads data per clock, which gives the read order of the shift registers in the chip.

S302: obtain the read address of each shift register in the target matrix according to the clock order and the read order of the shift registers.

The read address is the basis on which each shift register performs a data read. Each read address corresponds to one group of data; the shift register reads the data at its read address and stores it.

In this embodiment, different shift registers are controlled, according to the clock order and the read order, to read data from different rows of the target matrix, and exactly one shift register is given a read address at each clock to perform a read. Once the shift register that reads at each clock has been determined, its read address in the target matrix is determined from the target-matrix data that must be read at that clock.

S303: control each shift register to read data from the target matrix according to its read address in the target matrix.

After the read address of each shift register in the target matrix has been obtained, each shift register is controlled to read data from the target matrix at its read address and to store the data, first in first out, in the low positions of the shift register.

In this embodiment, at each clock a shift register reads data from the target matrix at its read address. Reading therefore takes full advantage of the chip's addressing scheme based on contiguous module addresses, lets the shift registers read data without interruption, and supports the pipelined operating mode of the chip.
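The register behaviour described here (load into the low positions, shift toward the high end, read out the highest position) can be modelled in a few lines of Python. This is a minimal software sketch, not the hardware design: the class name `ShiftRegister` and its methods are hypothetical, and placing the first item read at the top of the low block is an assumption chosen so that data leaves the register in first-in, first-out order.

```python
class ShiftRegister:
    """Software model of one shift register; index 0 is the lowest position."""

    def __init__(self, length):
        self.cells = [None] * length

    def shift(self):
        # Move every stored item one position toward the high end;
        # the item in the highest position drops out.
        self.cells = [None] + self.cells[:-1]

    def load(self, items):
        # Store the items in the low positions so that items[0] sits highest
        # among them and therefore reaches the top first (FIFO order).
        for offset, item in enumerate(items):
            self.cells[len(items) - 1 - offset] = item

    def top(self):
        # Item currently held in the highest position (None if empty).
        return self.cells[-1]


reg = ShiftRegister(length=4)
reg.load(["a", "b"])   # cells: ['b', 'a', None, None]
reg.shift()            # cells: [None, 'b', 'a', None]
reg.shift()            # cells: [None, None, 'b', 'a']
first_out = reg.top()  # the item read first is the first to reach the top
```

A further `shift()` drops `"a"` and brings `"b"` to the top, which is the order in which the items were read.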

Reading data involves multiple shift registers in the chip. Because this embodiment allows one shift register to read at each clock, the shift registers must be ordered to determine which one reads at which clock. On this basis, the following embodiment describes how the read order of the shift registers is obtained.

In one embodiment, as shown in Figure 4, obtaining the read order of the plurality of shift registers includes:

S401: determine the lengths of the shift registers according to the bandwidth of the storage unit in the chip, the lengths all being different. The number of shift registers equals the bandwidth, and the lengths decrease in sequence; the minimum length equals the bandwidth plus one, and the maximum length equals twice the bandwidth.

According to the bandwidth of the storage unit in the chip, and on the principle that the read and write bandwidths are equal, the number of shift registers is set and their lengths are determined.

Setting the number of shift registers equal to the bandwidth of the storage unit ensures that, at any one clock, the data the shift registers write to the storage unit matches the bandwidth of the storage unit.

Determining shift registers of different lengths from the bandwidth of the storage unit, with a minimum length of the bandwidth plus one and a maximum length of twice the bandwidth, ensures that once every shift register holds stored data, the data in the highest positions of the shift registers corresponds to one column of the target matrix.

Optionally, if the bandwidth of the storage unit is 8 B, eight shift registers are provided, with lengths running from 9 to 16. Assume each data item is an int8 value, that is, 1 B. With this arrangement, once the highest position of every shift register holds data, one register reads data at each clock, so the read bandwidth is 8 B, and the highest-position data of the eight shift registers is written back simultaneously, which is also 8 B.

S402: take the order of the shift-register lengths from largest to smallest as the ordering of the shift registers.

The shift registers in the chip are sorted by length from largest to smallest, which gives their read order.

Optionally, if the bandwidth of the storage unit is 8 B, eight shift registers are provided, with lengths from 9 to 16, each data item being an int8 value (1 B). The registers then read in the following order: the shift register of length 16, followed by those of length 15, 14, 13, 12, 11, 10, and 9.

In this embodiment, the number of shift registers is set equal to the bandwidth of the storage unit, so that during writing the data the shift registers write to the storage unit matches its bandwidth. In addition, the minimum register length is the bandwidth plus one and the maximum is twice the bandwidth, which gives the other shift registers the storage space they need to keep shifting while one shift register reads data from the target matrix. Setting the number and lengths of the shift registers in this way keeps the read bandwidth equal to the write bandwidth throughout the read-write process and makes full use of the chip's computing capability.
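The length rule (as many registers as the bandwidth, lengths from twice the bandwidth down to the bandwidth plus one, longest first) is easy to check numerically. The sketch below is illustrative only; the function names are hypothetical. The second function rests on one assumption: each register places the first item it reads at the top of its low block, so that item needs `length - bandwidth` shifts to reach the highest position.

```python
def register_lengths(bandwidth):
    """Shift-register lengths in read order: 2 * bandwidth down to bandwidth + 1."""
    return list(range(2 * bandwidth, bandwidth, -1))


def clock_when_first_item_reaches_top(index, bandwidth):
    """1-based clock at which the first item of register `index` (0-based) is on top."""
    length = register_lengths(bandwidth)[index]
    load_clock = index + 1              # registers load one after another, one per clock
    shifts_needed = length - bandwidth  # from the top of the low block to the highest position
    return load_clock + shifts_needed


lengths = register_lengths(8)
arrival = [clock_when_first_item_reaches_top(k, 8) for k in range(8)]
```

Under that assumption every register delivers its first item at the same clock (the ninth, for a bandwidth of 8): a register that starts one clock later is one position shorter, so the staggered start and the decreasing lengths cancel out.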

After the read order of the shift registers in the chip has been obtained, a read address is assigned to each shift register so that it can read data. The following embodiment describes how the read address of each shift register is obtained.

In one embodiment, as shown in Figure 5, obtaining the read address of each shift register in the target matrix according to the clock order and the read order of the shift registers includes:

S501: obtain the shift register corresponding to each clock according to the clock order and the read order of the shift registers; one shift register reads data at each clock.

Following the rule that one shift register reads data at each clock, the clock order is matched against the read order of the shift registers to obtain the shift register that reads at each clock.

Again take a storage-unit bandwidth of 8 B as an example, with eight shift registers of different lengths from 9 to 16 that read in order of decreasing length. The shift register for the first clock then has length 16, for the second clock 15, for the third 14, for the fourth 13, for the fifth 12, for the sixth 11, for the seventh 10, and for the eighth 9; at the ninth clock the shift register of length 16 reads again.
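The clock-to-register assignment in this example is a round-robin over the registers in decreasing-length order. A minimal sketch (the function name `reading_register_length` is hypothetical, and clocks are counted from 1 as in the text):

```python
def reading_register_length(clock, bandwidth=8):
    """Length of the shift register that reads at a 1-based clock."""
    lengths = list(range(2 * bandwidth, bandwidth, -1))  # longest register reads first
    return lengths[(clock - 1) % bandwidth]              # registers take turns


schedule = [reading_register_length(clock) for clock in range(1, 10)]
```

For clocks 1 through 9 this gives lengths 16 down to 9 and then 16 again, matching the sequence listed in the example.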

S502: take the address of the first row of data in the target matrix as the read address of the shift register for the first clock, the address of the second row as the read address of the shift register for the second clock, and so on, to obtain the read address of the shift register for each clock.

Because one shift register reads data at each clock and different shift registers read different rows of the target matrix, the row address of the data in the target matrix is the read address of the shift register. The row order of the target matrix is matched against the clock order, and the row address in the target matrix is used as the read address of the shift register at the corresponding clock.

Optionally, at the first clock the address of the first row of data in the target matrix is the read address of the shift register for that clock; at the second clock, the address of the second row; at the third clock, the address of the third row; and so on, giving the read address of the shift register for each clock.

In this embodiment, once the shift register for each clock has been determined, its read address is determined from the row order of the target matrix. Because both the row order and the clock order are consecutive, the read addresses determined in row order are consecutive as well. Determining read addresses in this way exploits the chip's contiguous-module-address scheme and makes the most of its computing capability.
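The pairing of clocks, registers, and rows can be sketched as follows. The sketch covers a single block of `bandwidth` rows and assumes, as in the worked example of Figure 12, that after the last register has read, the first register returns to the first row and reads its next group of data. The function name and the `(register, row, group)` tuple are hypothetical.

```python
def read_target(clock, bandwidth=8):
    """Return (register, row, group) for a 1-based clock; all three are 0-based."""
    step = clock - 1
    register = step % bandwidth  # registers take turns, one per clock
    row = step % bandwidth       # register k keeps reading row k of the block
    group = step // bandwidth    # which run of `bandwidth` items within that row
    return register, row, group


first_nine = [read_target(clock) for clock in range(1, 10)]
```

With a bandwidth of 8, clocks 1 to 8 read the first group of rows 0 to 7, and clock 9 returns to register 0 and row 0 for the second group.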

A shift register reads one row of the target matrix. When the read bandwidth of the shift register is smaller than the number of elements in a row, however, the shift register must read several times to obtain the complete row. On this basis, the following embodiment describes how a shift register reads data from the target matrix according to its read address.

In one embodiment, as shown in Figure 6, controlling each shift register to read data from the target matrix according to its read address in the target matrix includes:

S601: for any shift register, determine the target data corresponding to the shift register according to its read address in the target matrix.

Note that the data bandwidth of one read address equals the read bandwidth of the shift register; in other words, one read address corresponds to several data items in one row of the target matrix. For example, if the read bandwidth is 8 B and each data item in the target matrix is 1 B, one read address corresponds to eight consecutive data items in one row of the target matrix.

For any shift register, the group of data in the target matrix at the shift register's read address is determined to be the target data read by that shift register.

S602: control the shift register to read the target data.

Once the target data to be read by the shift register has been determined, the read address in the target matrix is used as the read address of the target data, and the shift register is controlled to read the target data from the target matrix at that address.

In this embodiment, the target data of a shift register is determined from its read address in the target matrix, and the shift register reads the target data from the target matrix at that address. The design logic is clear and easy to implement.
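To make the link between a read address and its target data concrete, the sketch below maps a linear address to a run of consecutive items in one row. It assumes row-major storage in which every row occupies `ceil(W / bandwidth)` address units and the matrix starts at address 0; the function name `data_at_address` is hypothetical.

```python
import math


def data_at_address(matrix, address, bandwidth=8):
    """Return the items stored at one read address (one address unit of one row)."""
    width = len(matrix[0])
    units_per_row = math.ceil(width / bandwidth)  # address units occupied by one row
    row, unit = divmod(address, units_per_row)
    start = unit * bandwidth
    return matrix[row][start:start + bandwidth]


# Hypothetical 2 x 16 matrix of 1-byte values.
matrix = [[r * 16 + c for c in range(16)] for r in range(2)]
group = data_at_address(matrix, 1)  # second unit of the first row
```

If the row width is not a multiple of the bandwidth, the last unit of each row simply returns fewer items in this sketch; how real hardware pads that unit is not specified in the text.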

While the chip performs the matrix transposition, data must be written to the transposed matrix at the same time as data is read from the target matrix. The preceding embodiments described reading data from the target matrix; the following embodiment describes writing data to the transposed matrix.

In one embodiment, as shown in Figure 7, controlling the highest-position stored data of each shift register to be written into the transposed matrix includes:

S701: obtain the highest-position stored data of each shift register and combine these data to obtain a transposed vector.

Once every shift register has completed one data read, the highest position of each shift register holds data of the target matrix, and these highest-position data all belong to the same column of the target matrix. They are combined as a row, and the combination is the transposed vector output by the shift registers at the current clock.

S702: write the transposed vector into the transposed matrix.

Because the transposed vector is stored as a column in the target matrix, transposing the target matrix requires writing the transposed vector into the transposed matrix as a row.

In this embodiment, the highest-position stored data of the shift registers are combined into a transposed vector, which is stored as a row in the transposed matrix, thereby transposing the target matrix.
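Steps S701 and S702 amount to gathering one item per register and storing the result as a row. A minimal sketch, with each register modelled as a Python list whose last element is the highest position; the function name is hypothetical, and the example assumes that a row of the transposed matrix is exactly as wide as the number of registers:

```python
def write_transposed_vector(registers, transposed, row):
    """Combine the highest-position items into a vector and store it as one row."""
    vector = [cells[-1] for cells in registers]  # one item per shift register
    transposed[row] = vector
    return vector


# Three hypothetical registers whose top items 1, 2, 3 form one column of the target matrix.
registers = [[5, 1], [6, 2], [7, 3]]
transposed = [None, None]
vector = write_transposed_vector(registers, transposed, 0)
```

The column `1, 2, 3` of the target matrix ends up as row 0 of `transposed`, which is the transposition itself.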

Just as reading data from the target matrix requires a read address, writing data to the transposed matrix is based on a write address. On this basis, the following embodiment describes how data is written to the transposed matrix.

In one embodiment, as shown in Figure 8, writing the transposed vector into the transposed matrix includes:

S801: obtain the write address of the transposed vector in the transposed matrix.

The write address is the basis for writing the highest-position stored data of the shift registers into the transposed matrix; each write address corresponds to several consecutive data items in one row of the transposed matrix. Optionally, the write address of the transposed vector is determined from the position of the transposed vector in the transposed matrix.

S802: write the transposed vector into the transposed matrix according to the write address.

The shift registers are controlled to write the transposed vector into the transposed matrix at the write address.

In this embodiment, the write address of the transposed vector is obtained from its position in the transposed matrix, and the highest-position stored data of the shift registers is written into the transposed matrix at that address, thereby transposing the target matrix.

The following embodiment describes one possible way of "obtaining the write address of the transposed vector in the transposed matrix" from the preceding embodiment. As shown in Figure 9, obtaining the write address of the transposed vector in the transposed matrix includes:

S901: obtain the number of times, so far, that every shift register has completed one data read during the cyclic read-write operation on the target matrix.

During the cyclic read-write operation on the target matrix, the number of times every shift register has completed one data read is monitored. The row position of the transposed vector in the transposed matrix is determined from that count, and the row address of the transposed vector in the transposed matrix is used as its write address.

S902: determine the row address of the transposed vector in the transposed matrix according to the count.

For example, when the shift registers have all completed one data read for the first time, the transposed vector is the first column of the target matrix, so the first row address of the transposed matrix is the row address of the transposed vector in the transposed matrix.

S903: use the row address of the transposed vector in the transposed matrix as the write address of the transposed vector in the transposed matrix.

In this embodiment, the row address of the transposed vector in the transposed matrix is determined from the number of times the shift registers have all completed one data read, and the write address follows from it. Because both the target matrix and the transposed matrix are stored by rows in the chip, this way of determining the write address needs only one addressing operation to write the transposed vector into the transposed matrix, which improves the chip's computing efficiency.
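One way to read steps S901 to S903 is as a mapping from the running count to a position in the transposed matrix. The sketch below is an interpretation rather than the patented logic. It assumes that the count starts at 1 for the first output, that the transposed matrix has as many rows as the target matrix has columns, and that once every row has received one vector the count wraps to the first row and moves on to the next unit within each row. The function name is hypothetical.

```python
def write_position_for_count(count, target_width):
    """Return (row, unit) in the transposed matrix for the count-th output (count >= 1)."""
    unit, row = divmod(count - 1, target_width)
    return row, unit


first = write_position_for_count(1, target_width=64)
```

For a target matrix 64 columns wide, the first output lands in row 0, the 64th in row 63, and the 65th returns to row 0 in the next unit.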

During matrix transposition, the read-write operation reads data from the target matrix at the read address and writes it into the transposed matrix at the transposed write address. In this embodiment, the read and write addresses obtained by the chip take the form of a base address plus an offset address, and the target matrix is transposed through regular address jumps. On this basis, the following embodiment describes how addresses are generated in the matrix transposition method.

In one embodiment, as shown in Figure 10, the matrix transposition method further includes:

S1001: determine the offset-address jump step for each clock according to the size of the target matrix and the bandwidth of the storage unit in the chip.

During reading, the read address consists of a base address and an offset address. The width of the target matrix and the bandwidth of the storage unit in the chip are obtained, and the ratio of the width to the bandwidth is used as the jump step of the offset address in the read address at each clock.

During writing, the write address consists of a base address and an offset address. The height of the target matrix and the bandwidth of the storage unit in the chip are obtained, and the ratio of the height to the bandwidth is used as the jump step of the offset address in the write address at each clock.

For example, for a target matrix of size H (height) × W (width) stored by rows, if the bandwidth of the storage unit is 8 B, that is, every 8 B of the storage unit forms one consecutive address unit, then the offset-address jump step of the read address at each clock is ceil(W/8) and that of the write address is ceil(H/8), where ceil() denotes rounding up.
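The two jump steps follow directly from the row-major layout: a row of W one-byte items occupies ceil(W/8) address units, so the same unit in the next row lies ceil(W/8) addresses further on, and likewise ceil(H/8) for the transposed matrix, whose rows are H items long. A small sketch, assuming 1-byte data items (the function name `offset_steps` is hypothetical):

```python
import math


def offset_steps(height, width, bandwidth=8):
    """Return (read_step, write_step) in address units for an H x W target matrix."""
    read_step = math.ceil(width / bandwidth)    # units per row of the target matrix
    write_step = math.ceil(height / bandwidth)  # units per row of the transposed matrix
    return read_step, write_step


steps = offset_steps(height=16, width=64)
```

For H = 16 and W = 64 the steps are 8 and 2. For sizes that are not multiples of the bandwidth, such as H = 10 and W = 20, rounding up gives 3 and 2.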

S1002: during the cyclic read-write operation on the target matrix, for any clock, jump the offset address in the row direction by the offset-address jump step of that clock and, when the offset-address jumps are complete, advance the base address by one address.

During reading, for any clock, the offset address of the read address jumps in the row direction of the target matrix by the jump step of the read-address offset and, when the offset-address jumps are complete, the base address of the read address advances by one address.

During writing, for any clock, the offset address of the write address jumps in the row direction of the transposed matrix by the jump step of the write-address offset and, when the offset-address jumps are complete, the base address of the write address advances by one address.

In this embodiment, both the read addresses and the write addresses are consecutive during the cyclic read-write operation on the target matrix, so each clock can perform its address jump according to the address storage pattern.
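The base-plus-offset scheme for write addresses can be sketched as a generator. Several details here are assumptions rather than statements from the text: the transposed matrix is taken to start at linear address 0, data items are 1 byte, and "the offset-address jumps are complete" is read as one pass over all W rows of the transposed matrix. The function name is hypothetical.

```python
import math


def write_addresses(height, width, bandwidth=8):
    """Yield one linear write address per output clock for an H x W target matrix."""
    step = math.ceil(height / bandwidth)  # offset jump step, units per transposed row
    for base in range(step):              # base advances after each full pass of offsets
        for row in range(width):          # the transposed matrix has `width` rows
            yield base + row * step       # offset grows by `step` at every clock


addresses = list(write_addresses(height=16, width=64))
```

For the 16 × 64 example the sequence starts 0, 2, 4, ..., reaches 126 after 64 clocks, and then restarts from base 1; each of the 128 address units is written exactly once.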

In one embodiment, the matrix transposition method is described using the target matrix shown in Figure 11 as an example. The target matrix in Figure 11 has size (W×H) 64×16, and each data item is 1 B. If the bandwidth of the storage unit is 8 B, eight shift registers are provided whose lengths decrease from 16 to 9, as shown in Figure 12. Figure 12 is a schematic diagram of the read-write process in which the eight shift registers transpose the target matrix.

As Figure 12 shows, at the first clock the first register reads eight data items from the first row of the target matrix and stores them in its low eight positions. At the second clock, the second register reads eight data items from the second row and stores them in its low eight positions, while the stored data of the first register moves one position to the left (toward the high end). At the third clock, the third register reads eight data items from the third row and stores them in its low eight positions; the stored data of the first register moves one more position to the left (two positions in total), and the stored data of the second register moves one position to the left.

Continuing in the same way, at the eighth clock the eighth register reads eight data items from the eighth row and stores them in its low eight positions. By then the stored data of the first register has moved seven positions toward the high end in total, that of the second register six, the third five, the fourth four, the fifth three, the sixth two, and the seventh one.

Because the register lengths decrease from 16 for the first register to 9 for the eighth, at the ninth clock the first register reads the eight data items in the ninth through sixteenth columns of the first row, the stored data of every register has reached the highest position, and the highest-position data of the shift registers, that is, eight data items of the first column of the target matrix, is output. At the tenth clock, the second register reads the eight data items in the ninth through sixteenth columns of the second row, and the highest-position data of the shift registers, that is, eight data items of the second column of the target matrix, is output.

In other words, after eight clocks the data of the target matrix can be written at the same time as it is read, which gives a pipelined (non-blocking) read-write operation. Moreover, each read and each write has a bandwidth of 8 B, so reads and writes use the same address unit.
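The whole read, shift, and write pipeline of this example can be simulated in software to check that it produces the transpose. The sketch below is an illustration under stated assumptions, not the chip's implementation: the matrix dimensions are multiples of the bandwidth; within one clock the order is shift, then load, then output; the first item of each group is loaded at the top of the low block; and after one block of rows is finished the registers move on to the next block of rows. The function name `transpose_with_shift_registers` is hypothetical.

```python
def transpose_with_shift_registers(matrix, bandwidth=8):
    """Simulate the pipelined shift-register transposition clock by clock."""
    height, width = len(matrix), len(matrix[0])
    assert height % bandwidth == 0 and width % bandwidth == 0

    # Register k has length 2 * bandwidth - k; index 0 is the lowest position.
    regs = [[None] * (2 * bandwidth - k) for k in range(bandwidth)]

    # One read per clock: (row, first column) of the group handed to register clock % bandwidth.
    reads = [(block * bandwidth + k, group * bandwidth)
             for block in range(height // bandwidth)
             for group in range(width // bandwidth)
             for k in range(bandwidth)]

    transposed = [[None] * height for _ in range(width)]
    emitted = 0
    for clock in range(len(reads) + bandwidth):  # extra clocks drain the registers
        # 1. Every register shifts one position toward the high end.
        regs = [[None] + cells[:-1] for cells in regs]

        # 2. One register loads a new group into its low positions.
        if clock < len(reads):
            row, col = reads[clock]
            cells = regs[clock % bandwidth]
            assert all(item is None for item in cells[:bandwidth])  # low block is free
            for j in range(bandwidth):
                cells[bandwidth - 1 - j] = matrix[row][col + j]

        # 3. From the (bandwidth + 1)-th clock on, the top items form one transposed vector.
        if clock >= bandwidth:
            block, col = divmod(emitted, width)
            for k in range(bandwidth):
                transposed[col][block * bandwidth + k] = regs[k][-1]
            emitted += 1
    return transposed


# The 16-row, 64-column example with hypothetical 1-byte-style values.
target = [[(r * 64 + c) % 256 for c in range(64)] for r in range(16)]
result = transpose_with_shift_registers(target)
```

In this model the first output appears at the ninth clock (loop index 8), and from then on one group is read and one vector is written at every clock, which is the pipelined behaviour the example describes.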

Corresponding to the read and write operations shown in Figure 12, the present application also provides an address generation system that uses a base address plus an offset address to provide a basis for reading data. Figure 13 is a schematic diagram of how the data write address jumps after the eighth clock. At the ninth clock, the highest-bit stored data of each shift register is written into the transposed matrix as a row; once the offset address of the write address has finished jumping along the row direction of the transposed matrix, the base address of the write address jumps by one address.
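One way to picture the base-plus-offset scheme is the sketch below. It is an interpretation under stated assumptions, not the address generator of Figure 13: it assumes the transposed matrix is stored row-major in words of `bandwidth` data, that the offset advances by one transposed row per clock, and that the base selects the word within that row. The function name and the step formula `rows // bandwidth` are hypothetical.

```python
def write_addresses(rows, cols, bandwidth):
    """Yield (base, offset, address) for each written vector (illustrative only).

    rows, cols: size of the target matrix; the transposed matrix is cols x rows.
    """
    step = rows // bandwidth  # assumed offset jump step: words per transposed row
    for base in range(rows // bandwidth):   # base jumps by one address ...
        for t_row in range(cols):           # ... after the offset sweep finishes
            offset = t_row * step
            yield base, offset, base + offset
```

For a target matrix with exactly `bandwidth` rows, as in the example above, the step is 1 and the generated write addresses are contiguous.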

In the embodiments of the present application, the transposition of the target matrix has been described with reference to Figures 11 to 13. Based on this process, the technical solution provided by the embodiments has the following three advantages:

(1) With the data read bandwidth unchanged, using shift registers for matrix transposition allows reading and output to be pipelined, which matches the pipelined nature of chip operation. The transposition therefore completes faster, making full use of the chip's computing efficiency and capability while saving power.

(2) In terms of addressing, the input and output bandwidths of the matrix transposition method provided by the embodiments match, and the output data addresses are contiguous, which makes it easy to implement address jumps that follow the address storage pattern.

(3) The embodiments of the present application can perform matrix transposition quickly, which provides a basis for the chip to subsequently implement other matrix operations (such as matrix multiplication).

It should be understood that although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed sequentially, and they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiments of the present application also provide a matrix transposition device for implementing the matrix transposition method described above. The solution provided by the device is similar to that described for the method, so for the specific limitations in the one or more matrix transposition device embodiments below, refer to the limitations on the matrix transposition method above; they are not repeated here.

In one embodiment, as shown in Figure 14, a matrix transposition device 1400 is provided, comprising an acquisition module 1401, the acquisition module 1401 comprising a reading unit and a writing unit, wherein:

The acquisition module 1401 is configured to cyclically perform a data read-write operation on a target matrix until all data in the target matrix has been read, to obtain the transposed matrix of the target matrix;

The reading unit is configured to control, in clock order, a plurality of shift registers in a chip in turn to read a preset number of data from the target matrix; the preset number is the same as the number of the shift registers, and during reading the stored data in each shift register shifts toward the high end by one bit at every clock;

The writing unit is configured to, once every shift register has completed one data read, control the highest-bit stored data of each shift register to be written into the transposed matrix.

In one embodiment, the reading unit comprises a sequential reading subunit, an address acquisition subunit and a data reading subunit, wherein:

The sequential reading subunit is configured to obtain the reading order of the plurality of shift registers;

The address acquisition subunit is configured to obtain the read address of each shift register in the target matrix according to the clock order and the reading order of the plurality of shift registers;

The data reading subunit is configured to control each shift register to read data from the target matrix according to the read address of that shift register in the target matrix.

In one embodiment, the sequential reading subunit is further configured to determine the lengths of the plurality of shift registers according to the bandwidth of the storage unit in the chip, the lengths of the shift registers all being different; and to take the order of the shift registers from longest to shortest as the ordering of the plurality of shift registers.

In one embodiment, the number of shift registers is the same as the bandwidth, and the lengths of the shift registers decrease in sequence. The minimum length among the shift registers is equal to the bandwidth plus one, and the maximum length is equal to twice the bandwidth.
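The length rule can be written out directly. The helper below is a small sketch with a hypothetical name; it only encodes what the paragraph states, namely as many registers as the bandwidth, with lengths running from twice the bandwidth down to the bandwidth plus one.

```python
def shift_register_lengths(bandwidth):
    """Lengths of the shift registers, longest first (illustrative helper)."""
    # Register i (0-based) has length 2 * bandwidth - i.
    return [2 * bandwidth - i for i in range(bandwidth)]
```

With a bandwidth of 8 this gives eight registers of lengths 16 down to 9, matching the example discussed with Figure 12.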

In one embodiment, the address acquisition subunit is further configured to obtain the shift register corresponding to each clock according to the clock order and the reading order of the plurality of shift registers, one shift register performing a data read at each clock; and to take the first row address of the data in the target matrix as the read address of the shift register corresponding to the first clock, the second row address as the read address of the shift register corresponding to the second clock, and so on, to obtain the read address of the shift register corresponding to each clock.
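The mapping from clocks to registers and read addresses can be sketched as a simple schedule. This is an illustration under an assumption that goes beyond the paragraph above: the target matrix is taken to have exactly `bandwidth` rows, as in the Figure 12 example, so that after the last register has read, the first register continues with the next group of columns. The function name and the 1-based tuple layout are hypothetical.

```python
def read_schedule(num_clocks, bandwidth):
    """Yield (clock, register, row, first_column), all 1-based (illustrative only)."""
    for clock in range(1, num_clocks + 1):
        register = (clock - 1) % bandwidth        # one register reads per clock
        column_block = (clock - 1) // bandwidth   # advances once all registers have read
        # Assumption: register k always reads row k of a matrix with `bandwidth` rows.
        yield clock, register + 1, register + 1, column_block * bandwidth + 1
```

For a bandwidth of 8, the ninth clock maps to the first register reading the first row from the ninth column onward, and the tenth clock maps to the second register reading the second row from the ninth column onward.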

In one embodiment, the data reading subunit is further configured to, for any shift register, determine the target data corresponding to the shift register according to the read address of the shift register in the target matrix, and control the shift register to read the target data.

In one embodiment, the writing unit comprises a vector acquisition subunit and a vector writing subunit, wherein:

The vector acquisition subunit is configured to obtain the highest-bit stored data of each shift register and combine the highest-bit stored data of the shift registers to obtain a transposed vector;

The vector writing subunit is configured to control the transposed vector to be written into the transposed matrix.

In one embodiment, the vector writing subunit is further configured to obtain the write address of the transposed vector in the transposed matrix, and to control the transposed vector to be written into the transposed matrix according to the write address.

In one embodiment, the vector writing subunit is further configured to obtain the number of times that, during the cyclic data read-write operations on the target matrix, all shift registers have so far each completed one data read; determine the column address of the transposed vector in the transposed matrix according to that number; and use the column address of the transposed vector in the transposed matrix as the write address of the transposed vector in the transposed matrix.

In one embodiment, the matrix transposition device 1400 further comprises a determination module and a jump module, wherein:

The determination module is configured to determine the offset address jump step of each clock according to the size of the target matrix and the bandwidth of the storage unit in the chip;

The jump module is configured to, during the cyclic data read-write operations on the target matrix, for any clock, perform offset address jumps in the row direction according to the offset address jump step of that clock and, once the offset address jumps are complete, jump the base address of that clock by one address.

Each module in the above matrix transposition device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program:

In one embodiment, the processor further implements the following steps when executing the computer program:

In one embodiment, the number of shift registers is the same as the bandwidth, and the lengths of the shift registers decrease in sequence; the minimum length among the shift registers is equal to the bandwidth plus one, and the maximum length is equal to twice the bandwidth.

determining the offset address jump step of each clock according to the width of the target matrix and the size of the storage unit in the chip;

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, the computer program implementing the following steps when executed by a processor:

In one embodiment, the computer program further implements the following steps when executed by the processor:

In one embodiment, a computer program product is provided, comprising a computer program that implements the following steps when executed by a processor:

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, a database or another medium in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.

The embodiments described above represent only several implementations of the present application. Although they are described specifically and in detail, they should not be construed as limiting the patent scope of the present application. It should be noted that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be determined by the appended claims.

Claims

1. A matrix transposition method, the method comprising:

cyclically performing a data read-write operation on a target matrix until all data in the target matrix has been read, to obtain a transposed matrix of the target matrix; the data read-write operation comprising:

controlling, in clock order, a plurality of shift registers in a chip in turn to read a preset number of data from the target matrix; wherein the preset number is the same as the number of the plurality of shift registers, and during reading the stored data in each shift register shifts toward the high end by one bit at each clock;

and, once every shift register has completed one data read, controlling the highest-bit stored data of each shift register to be written into the transposed matrix.

2. The method of claim 1, wherein sequentially controlling the plurality of shift registers to read data from the target matrix in clock order comprises:

acquiring the reading sequence of the plurality of shift registers;

acquiring a reading address of each shift register in the target matrix according to the clock sequence and the reading sequence of the shift registers;

and controlling each shift register to read data from the target matrix according to the read address of each shift register in the target matrix.

3. The method of claim 2, wherein the obtaining the read order of the plurality of shift registers comprises:

determining the lengths of the plurality of shift registers according to the bandwidth of the storage unit in the chip, the lengths of the shift registers all being different;

and determining the order of the shift registers from longest to shortest as the reading order of the plurality of shift registers.

4. A method according to claim 3, wherein the number of shift registers is the same as the bandwidth, and the length of each shift register is successively decreased; wherein the minimum length in each shift register is equal to the bandwidth plus one, and the maximum length is equal to twice the bandwidth.

5. The method according to any one of claims 2-4, wherein said obtaining a read address of each shift register in the target matrix in the clock order and the read order of the plurality of shift registers comprises:

acquiring the shift register corresponding to each clock according to the clock order and the reading order of the shift registers, wherein one shift register performs a data read at each clock;

and taking the first row address of the data in the target matrix as the read address of the shift register corresponding to the first clock, taking the second row address as the read address of the shift register corresponding to the second clock, and so on, to obtain the read address of the shift register corresponding to each clock.

6. The method of any one of claims 2-4, wherein controlling each shift register to read data from the target matrix according to the read address of each shift register in the target matrix comprises:

for any shift register, determining target data corresponding to the shift register according to a reading address of the shift register in the target matrix;

and controlling the shift register to read the target data.

7. The method of any of claims 1-4, wherein controlling the writing of the most significant stored data of each of the shift registers into the transpose matrix comprises:

acquiring the highest-order storage data of each shift register, and combining the highest-order storage data of each shift register to acquire a transpose vector;

and controlling the transposed vector to be written into the transposed matrix.

8. The method of claim 7, wherein the controlling the transpose vector to be written into the transpose matrix comprises:

acquiring a writing address of the transposed vector in the transposed matrix;

and controlling the transpose vector to be written into the transpose matrix according to the writing address.

9. The method of claim 8, wherein the obtaining the write address of the transpose vector in the transpose matrix comprises:

acquiring the number of times that, in the process of cyclically performing the data read-write operation on the target matrix, all shift registers have so far each completed one data read;

determining a column address of the transposed vector in the transposed matrix according to the number of times;

and taking the column address of the transposed vector in the transposed matrix as the write address of the transposed vector in the transposed matrix.

10. The method according to any one of claims 1-4, further comprising:

determining the offset address jump step length of each clock according to the size of the target matrix and the bandwidth of a storage unit in the chip;

and, in the process of cyclically performing the data read-write operation on the target matrix, for any clock, performing offset address jumps in the row direction according to the offset address jump step of the clock, and jumping the base address of the clock by one address once the offset address jumps are complete.