WO2025055512A1

WO2025055512A1 - Matrix transposition apparatus and method, ai processor, and computer device

Info

Publication number: WO2025055512A1
Application number: PCT/CN2024/104422
Authority: WO
Inventors: 黄彬
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2023-09-12
Filing date: 2024-07-09
Publication date: 2025-03-20
Anticipated expiration: 2026-03-12
Also published as: CN116910437A; CN116910437B

Abstract

The present application belongs to the technical field of artificial intelligence. Disclosed in the embodiments of the present application are a matrix transposition apparatus and method, an AI processor, and a computer device. The matrix transposition apparatus comprises a matrix reader, a shifter, a matrix buffer, a transposer, a matrix writer and a controller. The controller is used for: controlling the matrix reader to read from an upper-level memory matrix elements of a matrix to be transposed, the matrix reader reading, for each time, at least one row of matrix elements of the matrix to be transposed; controlling the shifter to execute data shift on the matrix elements read by the matrix reader, and controlling the shifter to write the shifted matrix elements into the matrix buffer; controlling the transposer to read sub-matrixes from the matrix buffer, and controlling the transposer to execute transposition processing on the read sub-matrixes to obtain transposed sub-matrixes, each sub-matrix comprising at least one column of matrix elements of the matrix to be transposed; and controlling the matrix writer to write into a lower-level memory the transposed sub-matrixes output by the transposer.

Description

Matrix transposition device, method, AI processor and computer equipment

本申请要求于2023年09月12日提交的申请号为202311171547.3、发明名称为“矩阵转置装置、方法、AI处理器及计算机设备”的中国专利申请的优先权，其全部内容通过引用结合在本申请中。This application claims priority to Chinese patent application No. 202311171547.3 filed on September 12, 2023, with invention name “Matrix transposition device, method, AI processor and computer device”, the entire contents of which are incorporated by reference into this application.

Technical Field

本申请实施例涉及人工智能技术领域，特别涉及一种矩阵转置装置、方法、AI处理器及计算机设备。The embodiments of the present application relate to the field of artificial intelligence technology, and in particular to a matrix transposition device, method, AI processor and computer equipment.

Background Art

随着人工智能技术的发展，越来越多的AI技术被用于生活和生产中。With the development of artificial intelligence technology, more and more AI technologies are being used in life and production.

矩阵运算作为人工智能技术的基础被大量运用，而矩阵转置则是一种常见的矩阵运算。比如，在进行图像处理时，可以通过矩阵转置实现对图像的翻转操作；又比如，在进行信号处理时，可以通过矩阵转置实现快速傅里叶变换。Matrix operations are widely used as the basis of artificial intelligence technology, and matrix transposition is a common matrix operation. For example, when performing image processing, matrix transposition can be used to flip the image; for another example, when performing signal processing, matrix transposition can be used to implement fast Fourier transform.

为了实现矩阵转置，需要在AI处理器中设置专门的矩阵转置装置，因此如何降低矩阵转置装置的设计复杂度，并减小片上电路面积成为提高AI处理器性能的关键。In order to realize matrix transposition, a special matrix transposition device needs to be set up in the AI processor. Therefore, how to reduce the design complexity of the matrix transposition device and reduce the on-chip circuit area becomes the key to improving the performance of the AI processor.

发明内容Summary of the invention

本申请实施例提供了一种矩阵转置装置、方法、AI处理器及计算机设备，能够降低矩阵转置运算的处理延迟。所述技术方案如下：The embodiments of the present application provide a matrix transposition device, method, AI processor and computer equipment, which can reduce the processing delay of matrix transposition operation. The technical solution is as follows:

一方面，本申请实施例提供了一种矩阵转置装置，所述装置包括：On the one hand, an embodiment of the present application provides a matrix transposition device, the device comprising:

矩阵读取器、移位器、矩阵缓存器、转置器、矩阵写入器以及控制器；Matrix readers, shifters, matrix buffers, transposers, matrix writers, and controllers;

所述控制器，用于控制所述矩阵读取器从上级存储器中读取待转置矩阵的矩阵元素，其中，所述矩阵读取器每次读取所述待转置矩阵中的至少一行矩阵元素；The controller is used to control the matrix reader to read the matrix elements of the matrix to be transposed from the upper-level memory, wherein the matrix reader reads at least one row of matrix elements in the matrix to be transposed each time;

所述控制器，用于控制所述移位器对所述矩阵读取器读取到的矩阵元素执行数据移位，以及控制所述移位器将移位后的矩阵元素写入所述矩阵缓存器；The controller is used to control the shifter to perform data shift on the matrix elements read by the matrix reader, and control the shifter to write the shifted matrix elements into the matrix buffer;

所述控制器，用于控制所述转置器从所述矩阵缓存器中读取子矩阵，以及控制所述转置器对读取到的子矩阵执行转置处理，得到转置子矩阵，其中，所述子矩阵包含所述待转置矩阵中的至少一列矩阵元素；The controller is used to control the transposer to read a submatrix from the matrix buffer, and control the transposer to perform a transposition process on the read submatrix to obtain a transposed submatrix, wherein the submatrix includes at least one column of matrix elements in the matrix to be transposed;

所述控制器，用于控制所述矩阵写入器将所述转置器输出的所述转置子矩阵写入下级存储器。The controller is used to control the matrix writer to write the transposed sub-matrix output by the transposer into a lower-level memory.

另一方面，本申请实施例提供了一种矩阵转置方法，所述方法用于如上述方面所述的矩阵转置装置，所述方法包括：On the other hand, an embodiment of the present application provides a matrix transposition method, which is used in the matrix transposition device as described in the above aspect, and the method includes:

控制器控制矩阵读取器从上级存储器中读取待转置矩阵的矩阵元素，其中，所述矩阵读取器每次读取所述待转置矩阵中的至少一行矩阵元素；The controller controls the matrix reader to read the matrix elements of the matrix to be transposed from the upper-level memory, wherein the matrix reader reads at least one row of matrix elements in the matrix to be transposed each time;

所述控制器控制移位器对所述矩阵读取器读取到的矩阵元素执行数据移位，以及控制所述移位器将移位后的矩阵元素写入矩阵缓存器；The controller controls the shifter to perform data shift on the matrix elements read by the matrix reader, and controls the shifter to write the shifted matrix elements into the matrix buffer;

所述控制器控制转置器从所述矩阵缓存器中读取子矩阵，以及控制所述转置器对读取到的子矩阵执行转置处理，得到转置子矩阵，其中，所述子矩阵包含所述待转置矩阵中的至少一列矩阵元素；The controller controls the transposer to read a submatrix from the matrix buffer, and controls the transposer to perform a transposition process on the read submatrix to obtain a transposed submatrix, wherein the submatrix includes at least one column of matrix elements in the matrix to be transposed;

所述控制器控制矩阵写入器将所述转置器输出的所述转置子矩阵写入下级存储器。The controller controls the matrix writer to write the transposed submatrix output by the transposer into a lower-level memory.

另一方面，本申请实施例提供了一种AI处理器，所述AI处理器包括如上述方面所述的矩阵转置装置。On the other hand, an embodiment of the present application provides an AI processor, which includes the matrix transposition device as described in the above aspects.

另一方面，本申请实施例提供了一种计算机设备，所述计算机设备包括如上述方面所述的AI处理器以及存储器，所述AI处理器与所述存储器通过总线相连。On the other hand, an embodiment of the present application provides a computer device, which includes an AI processor and a memory as described in the above aspects, and the AI processor is connected to the memory via a bus.

本申请实施例中，矩阵转置装置通过对矩阵读取器读取到的矩阵元素执行数据移位处理，对移位后存储在矩阵缓存器中的矩阵执行分块读取，得到包含待转置矩阵中至少一列矩阵元素的子矩阵，并对子矩阵执行转置，进而基于各个子矩阵对应的转置子矩阵得到待转置矩阵的转置结果，实现了沿矩阵列方向的矩阵转置，相较于沿矩阵对角线方向执行矩阵转置能够降低矩阵转置的处理延迟；此外，沿矩阵列方向执行矩阵转置时，由于每次从矩阵缓存器中提取出的矩阵元素的数量一致，因此同一矩阵只需要采用一种转置逻辑，有助于降低矩阵转置装置的设计复杂度，减小矩阵转置装置占用的片上面积。In the embodiment of the present application, the matrix transposition device performs data shift processing on the matrix elements read by the matrix reader. The method comprises the following steps: performing block reading on the matrix stored in the matrix buffer after the shift, obtaining a submatrix including at least one column of matrix elements in the matrix to be transposed, and performing transposition on the submatrix, and then obtaining the transposed result of the matrix to be transposed based on the transposed submatrix corresponding to each submatrix, thereby realizing the matrix transposition along the matrix column direction, which can reduce the processing delay of the matrix transposition compared with performing the matrix transposition along the matrix diagonal direction; in addition, when performing the matrix transposition along the matrix column direction, since the number of matrix elements extracted from the matrix buffer each time is the same, only one transposition logic is needed for the same matrix, which helps to reduce the design complexity of the matrix transposition device and reduce the on-chip area occupied by the matrix transposition device.

BRIEF DESCRIPTION OF THE DRAWINGS

图1是本申请一个示例性实施例示出的沿矩阵对角线方向执行矩阵转置过程的实施示意图；FIG1 is a schematic diagram of an exemplary embodiment of the present application showing a process of performing matrix transposition along a matrix diagonal direction;

图2示出了本申请一个示例性实施例提供的矩阵转置装置的结构示意图；FIG2 shows a schematic structural diagram of a matrix transposition device provided by an exemplary embodiment of the present application;

图3是本申请一个示例性实施例示出的矩阵填充以及填充去除过程的实施示意图；FIG3 is a schematic diagram of an implementation of a matrix filling and filling removal process shown in an exemplary embodiment of the present application;

图4是本申请一个示例性实施例示出的矩阵元素右移过程的实施示意图；FIG4 is a schematic diagram of an implementation of a matrix element right shift process shown in an exemplary embodiment of the present application;

图5是本申请一个示例性实施例示出的矩阵缓存器的结构示意图；FIG5 is a schematic diagram of the structure of a matrix buffer shown in an exemplary embodiment of the present application;

图6是本申请一个示例性实施例提供的矩阵转置过程的原理示意图；FIG6 is a schematic diagram showing a matrix transposition process according to an exemplary embodiment of the present application;

图7是本申请一个示例性实施例示出的矩阵转置过程的实施示意图；FIG7 is a schematic diagram of an implementation of a matrix transposition process shown in an exemplary embodiment of the present application;

图8是本申请另一个示例性实施例示出的矩阵转置过程的实施示意图；FIG8 is a schematic diagram of an implementation of a matrix transposition process shown in another exemplary embodiment of the present application;

图9是本申请再一个示例性实施例示出的矩阵转置过程的实施示意图；FIG9 is a schematic diagram of an implementation of a matrix transposition process shown in yet another exemplary embodiment of the present application;

图10示出了本申请一个示例性实施例提供的矩阵转置方法的流程图；FIG10 shows a flow chart of a matrix transposition method provided by an exemplary embodiment of the present application;

图11示出了本申请一个示例性实施例提供的计算机设备的结构框图。FIG. 11 shows a structural block diagram of a computer device provided by an exemplary embodiment of the present application.

DETAILED DESCRIPTION

人工智能(Artificial Intelligence，AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能，感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说，人工智能是计算机科学的一个综合技术，它企图了解智能的实质，并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法，使机器具有感知、推理与决策的功能。Artificial Intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology in computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines so that machines have the functions of perception, reasoning and decision-making.

人工智能技术是一门综合学科，涉及领域广泛，既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The basic technologies of artificial intelligence generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

由于神经网络中存储和处理是一体化的，而经典计算机结构中存储和处理时分离的(分别由存储器和运算器实现)，因此采用经典计算机运行神经网络时，将不可避免的受到存储和处理分离式结构的制约，影响处理效率。为了提高神经网络的处理效率，神经网络处理器(Neural-Network Processing Units，NPU)(又被称为AI处理器)应运而生。其中，AI处理器通过电路来模拟神经元和突触结构，每个神经元被抽象为一个激励函数，该函数的输入由与其相连的神经元的输出以及连接神经元的突触共同决定。为了表达特定的知识，需要调整人工神经网络中突触的取值、网络的拓扑结构等，该过程称为“学习”，在学习之后，人工神经网络可通过习得的知识来解决特定的问题。Since storage and processing are integrated in neural networks, while storage and processing are separated in classical computer structures (realized by memory and arithmetic units, respectively), when using classical computers to run neural networks, it will inevitably be restricted by the separate storage and processing structure, affecting processing efficiency. In order to improve the processing efficiency of neural networks, neural network processors (Neural-Network Processing Units, NPU) (also known as AI processors) came into being. Among them, AI processors simulate neurons and synaptic structures through circuits. Each neuron is abstracted as an excitation function, and the input of the function is determined by the output of the neurons connected to it and the synapses connecting the neurons. In order to express specific knowledge, it is necessary to adjust the values of synapses in artificial neural networks, the topological structure of the network, etc. This process is called "learning". After learning, artificial neural networks can solve specific problems through the acquired knowledge.

AI处理器在运行中需要执行大量矩阵运算，包括矩阵乘法、矩阵求逆、矩阵转置(transposition)等等。为了使AI处理器能够实现矩阵转置，相关技术中，AI处理器中会专门设置矩阵转置装置。The AI processor needs to perform a large number of matrix operations during operation, including matrix multiplication, matrix inversion, matrix transposition, etc. In order to enable the AI processor to implement matrix transposition, in the related art, a matrix transposition device is specially provided in the AI processor.

在一种可能矩阵转置方式中，矩阵转置装置沿矩阵对角线方向执行矩阵转置。如图1所示，在执行矩阵转置时，矩阵转置装置沿待转置矩阵的对角线方向，每次提取一组与对角线垂直的向量(由至少一个矩阵元素构成)。示意性的，图1所示的(a)中，提取到的向量中包含矩阵元素(0,0)，图1所示的(b)中，提取到的向量中包含矩阵元素(0,1)和(1,0)，图1所示的(c)中，提取到的向量中包含矩阵元素(0,2)、(1,1)和(2,0)，图1 所示的(d)中，提取到的向量中包含矩阵元素(0,3)、(1,2)、(2,1)和(3,0)，以此类推。In a possible matrix transposition method, the matrix transposition device performs matrix transposition along the diagonal direction of the matrix. As shown in FIG1 , when performing matrix transposition, the matrix transposition device extracts a set of vectors (composed of at least one matrix element) perpendicular to the diagonal direction of the matrix to be transposed each time. Schematically, in (a) shown in FIG1 , the extracted vector includes the matrix element (0,0), in (b) shown in FIG1 , the extracted vector includes the matrix elements (0,1) and (1,0), in (c) shown in FIG1 , the extracted vector includes the matrix elements (0,2), (1,1) and (2,0), and in FIG1 In (d) shown, the extracted vector contains matrix elements (0,3), (1,2), (2,1) and (3,0), and so on.

进一步的，矩阵转置装置对读取到向量执行镜像反转，并将反转后的向量以读取时相同的方式写入存储器中。示意性的，图1所示的(a)中，矩阵转置装置将矩阵元素(0,0)写入存储器，图1所示的(b)中，矩阵转置装置将矩阵元素(1,0)和(0,1)写入存储器，图1所示的(c)中，矩阵转置装置将矩阵元素(2,0)、(1,1)和(0,2)写入存储器，图1所示的(d)中，矩阵转置装置将矩阵元素(3,0)、(2,1)(1,2)和(0,3)写入存储器，以此类推。Furthermore, the matrix transposition device performs a mirror inversion on the read vector, and writes the inverted vector into the memory in the same manner as when it was read. Schematically, in (a) shown in FIG1 , the matrix transposition device writes the matrix element (0,0) into the memory, in (b) shown in FIG1 , the matrix transposition device writes the matrix elements (1,0) and (0,1) into the memory, in (c) shown in FIG1 , the matrix transposition device writes the matrix elements (2,0), (1,1) and (0,2) into the memory, in (d) shown in FIG1 , the matrix transposition device writes the matrix elements (3,0), (2,1) (1,2) and (0,3) into the memory, and so on.

然而，采用上述方式执行矩阵转置时，矩阵处理的延迟大于矩阵写入的延迟，导致性能不佳。比如，对于图1所示的4×4矩阵，写入存储器需要4个时钟周期(逐行写入，共计写入四行)，而沿对角线读取向量执行翻转则需要7个时钟周期才能够完成矩阵处理。因此，在执行连续多矩阵转置运算的场景中，上述方案的启动延迟大于矩阵存取延迟；而在执行矩阵逐一转置运算的场景中，采用上述方案会引起矩阵存取断流，导致数据带宽降低。However, when performing matrix transposition in the above manner, the delay of matrix processing is greater than the delay of matrix writing, resulting in poor performance. For example, for the 4×4 matrix shown in Figure 1, writing to the memory requires 4 clock cycles (writing row by row, a total of four rows), while reading the vector along the diagonal to perform the flip requires 7 clock cycles to complete the matrix processing. Therefore, in the scenario of performing continuous multi-matrix transposition operations, the startup delay of the above scheme is greater than the matrix access delay; and in the scenario of performing matrix transposition operations one by one, the use of the above scheme will cause matrix access interruption, resulting in reduced data bandwidth.

此外，由于每次沿对角线取出的矩阵元素的数量不同，因此同一尺寸的矩阵转置需要配置多种向量反转逻辑，增加了反转电路的设计复杂度，进而导致矩阵转置装置占用的片上面积增加。In addition, since the number of matrix elements taken out along the diagonal each time is different, the matrix transposition of the same size requires the configuration of multiple vector inversion logics, which increases the design complexity of the inversion circuit and further increases the on-chip area occupied by the matrix transposition device.

不同于上述方案中沿矩阵对角线方向执行矩阵转置，本申请实施例提供了一种沿矩阵列方向执行转置的矩阵转置装置。通过对矩阵元素执行移位和分块处理，使矩阵转置装置每次对包含待转置矩阵中至少一列矩阵元素的子矩阵执行转置，进而基于各个子矩阵对应的转置子矩阵得到待转置矩阵的转置结果。相较于沿矩阵对角线方向执行矩阵转置，采用本申请实施例提供的矩阵转置装置能够达到与矩阵存取一致的延迟，即降低了矩阵转置的处理延迟；并且，由于每次沿矩阵列方向从矩阵缓存器中提取出的矩阵元素的数量一致，因此同一种长宽尺寸的矩阵在整个转置过程中只需要采用一种转置逻辑，有助于降低矩阵转置装置的设计复杂度，减小矩阵转置装置占用的片上面积。Different from the above scheme in which matrix transposition is performed along the matrix diagonal direction, the embodiment of the present application provides a matrix transposition device that performs transposition along the matrix column direction. By performing shifting and block processing on the matrix elements, the matrix transposition device performs transposition on the submatrix containing at least one column of matrix elements in the matrix to be transposed each time, and then obtains the transposition result of the matrix to be transposed based on the transposed submatrix corresponding to each submatrix. Compared with performing matrix transposition along the matrix diagonal direction, the matrix transposition device provided by the embodiment of the present application can achieve a delay consistent with matrix access, that is, reduce the processing delay of matrix transposition; and, since the number of matrix elements extracted from the matrix buffer along the matrix column direction each time is consistent, the matrix of the same length and width size only needs to use one transposition logic in the entire transposition process, which helps to reduce the design complexity of the matrix transposition device and reduce the on-chip area occupied by the matrix transposition device.

本申请实施例提供的矩阵转置装置可以应用于如下场景：The matrix transposition device provided in the embodiment of the present application can be applied to the following scenarios:

1、人工智能芯片中神经网络的训练和推理过程，包括训练神经网络时的反向传播梯度的计算，神经网络推理过程的特定层矩阵运算等等。1. The training and reasoning process of the neural network in the artificial intelligence chip, including the calculation of the back-propagation gradient when training the neural network, the matrix operations of specific layers in the neural network reasoning process, etc.

2、图像处理过程。在图像处理应用中执行矩阵转置计算，比如通过转置实现图像翻转或轮换操作时，被表示为矩阵的数字图像可以通过转置运算生成数字图像的轮换或镜像图像。2. Image processing process: In image processing applications, when performing matrix transposition calculations, such as image flipping or rotation operations, a digital image represented as a matrix can generate a rotated or mirrored image of the digital image through a transposition operation.

3、信号处理过程。使用矩阵转置实现快速傅里叶变换(Fast Fourier Transform，FFT)算法。3. Signal processing: Use matrix transposition to implement the Fast Fourier Transform (FFT) algorithm.

4、图网络中的关联性分析。社交网络或其它网络分析可以利用矩阵转置计算来确定网络中的节点之间的关系源，或者确定网络中的节点之间的关系模式。4. Relationship analysis in graph networks. Social network or other network analysis can use matrix transposition calculation to determine the source of relationships between nodes in the network, or determine the relationship pattern between nodes in the network.

请参考图2，其示出了本申请一个示例性实施例提供的矩阵转置装置的结构示意图。该矩阵转置装置包括：矩阵读取器210、移位器220、矩阵缓存器230、转置器240、矩阵写入器250以及控制器260。Please refer to FIG2 , which shows a schematic diagram of a matrix transposition device provided by an exemplary embodiment of the present application. The matrix transposition device includes: a matrix reader 210 , a shifter 220 , a matrix buffer 230 , a transposer 240 , a matrix writer 250 and a controller 260 .

其中，矩阵读取器210的输入端与上级存储器相连，矩阵读取器210的输出端与移位器220的输入端相连，移位器220的输出端与矩阵缓存器230的写入端相连，矩阵缓存器230的读取端与转置器240的输入端相连，转置器240的输出端与矩阵写入器250的输入端相连，矩阵写入器250的输出端与下级存储器相连。控制器260作为矩阵转置过程的控制端，分别与矩阵读取器210、移位器220、矩阵缓存器230、转置器240以及矩阵写入器250相连。The input end of the matrix reader 210 is connected to the upper memory, the output end of the matrix reader 210 is connected to the input end of the shifter 220, the output end of the shifter 220 is connected to the write end of the matrix buffer 230, the read end of the matrix buffer 230 is connected to the input end of the transposer 240, the output end of the transposer 240 is connected to the input end of the matrix writer 250, and the output end of the matrix writer 250 is connected to the lower memory. The controller 260, as the control end of the matrix transposition process, is respectively connected to the matrix reader 210, the shifter 220, the matrix buffer 230, the transposer 240 and the matrix writer 250.

控制器260用于控制矩阵读取器210从上级存储器中读取待转置矩阵的矩阵元素，其中，矩阵读取器每次读取待转置矩阵中的至少一行矩阵元素。The controller 260 is used to control the matrix reader 210 to read the matrix elements of the matrix to be transposed from the upper-level memory, wherein the matrix reader reads at least one row of matrix elements in the matrix to be transposed each time.

其中，不同应用场景下，待转置矩阵中矩阵元素的含义可能不同。比如，待转置矩阵中的矩阵元素可以是图像中像素点在某一颜色通道下的颜色值，或者，可以是图网络中某一图节点的特征值，本申请实施例并不对待转置矩阵中矩阵元素的具体含义进行限定。In different application scenarios, the meanings of matrix elements in the matrix to be transposed may be different. The matrix elements may be the color values of pixels in an image under a certain color channel, or may be the eigenvalues of a certain graph node in a graph network. The embodiments of the present application do not limit the specific meanings of the matrix elements in the transposed matrix.

在一些实施例中，矩阵读取器210基于上级存储器的读接口位宽，每次读取上级存储器的一行数据(即一个burst)。由于待转置矩阵中每一行矩阵元素的总位宽可能小于或等于读接口位宽，因此矩阵读取器210每次读取到的一行数据中，包含待转置矩阵中的至少一行矩阵元素。比如，当待转置矩阵中每一行矩阵元素的总位宽等于读接口位宽时，每次读取到待转置矩阵中的一行矩阵元素；当待转置矩阵中每一行矩阵元素的总位宽为读接口位宽的1/2时，每次读取到待转置矩阵中的2行矩阵元素。In some embodiments, the matrix reader 210 reads one row of data (i.e., one burst) from the upper memory each time based on the read interface bit width of the upper memory. Since the total bit width of each row of matrix elements in the matrix to be transposed may be less than or equal to the read interface bit width, each row of data read by the matrix reader 210 contains at least one row of matrix elements in the matrix to be transposed. For example, when the total bit width of each row of matrix elements in the matrix to be transposed is equal to the read interface bit width, one row of matrix elements in the matrix to be transposed is read each time; when the total bit width of each row of matrix elements in the matrix to be transposed is 1/2 of the read interface bit width, two rows of matrix elements in the matrix to be transposed are read each time.

为了避免后续读取矩阵元素时出现bank(内存块)读冲突(即避免多个线程尝试同时访问同一个bank中的不同位置)，在将读取到的矩阵元素写入矩阵缓存器230之前，控制器260用于控制移位器220对矩阵读取器210读取到的矩阵元素执行数据移位，并控制移位器220将移位后的矩阵元素写入矩阵缓存器230。In order to avoid bank (memory block) read conflicts when subsequently reading matrix elements (i.e., to avoid multiple threads attempting to access different locations in the same bank at the same time), before writing the read matrix elements into the matrix buffer 230, the controller 260 is used to control the shifter 220 to perform data shifting on the matrix elements read by the matrix reader 210, and control the shifter 220 to write the shifted matrix elements into the matrix buffer 230.

可选的，该移位操作可以是向右执行的循环移位操作。比如，当读取到的矩阵元素为00,01,02,03,04,05时，向右执行1位循环移位操作后，得到的矩阵元素序列为05,00,01,02,03,04。Optionally, the shift operation may be a circular shift operation performed to the right. For example, when the read matrix elements are 00, 01, 02, 03, 04, 05, after performing a 1-bit circular shift operation to the right, the obtained matrix element sequence is 05, 00, 01, 02, 03, 04.

将移位后的矩阵元素写入矩阵缓存器后，原先在待转置矩阵中处于同一列的矩阵元素被移动至不同列，也即原先在待转置矩阵中处于同一列的矩阵元素位于不同的bank中，因此后续读取待转置矩阵中一列或多列矩阵元素时，不会出现bank读冲突。After the shifted matrix elements are written into the matrix buffer, the matrix elements that were originally in the same column in the matrix to be transposed are moved to different columns, that is, the matrix elements that were originally in the same column in the matrix to be transposed are located in different banks, so there will be no bank read conflict when one or more columns of matrix elements in the matrix to be transposed are subsequently read.

在一些实施例中，每一行读取到的矩阵元素的数据移位长度不同，且相邻两行矩阵元素各自对应数据移位长度的差值相同。比如，4行矩阵元素各自的数据移位长度分别为0,1,2,3。In some embodiments, the data shift lengths of the matrix elements read in each row are different, and the difference between the data shift lengths of the matrix elements in two adjacent rows is the same. For example, the data shift lengths of the matrix elements in four rows are 0, 1, 2, and 3 respectively.

在一些实施例中，矩阵缓存器230可以实现为随机存取存储器(Static Random-Access Memory，SRAM)或者寄存器，本申请实施例对矩阵缓存器的具体类型并不进行限定。In some embodiments, the matrix buffer 230 can be implemented as a static random-access memory (SRAM) or a register. The embodiments of the present application do not limit the specific type of the matrix buffer.

当待转置矩阵完全写入矩阵缓存器230时，控制器260控制转置器240执行矩阵转置。When the matrix to be transposed is completely written into the matrix buffer 230 , the controller 260 controls the transposer 240 to perform matrix transposition.

控制器260，用于控制转置器240从矩阵缓存器230中读取子矩阵，并控制转置器240对读取到的子矩阵执行转置处理，得到转置子矩阵，其中，子矩阵包含待转置矩阵中的至少一列矩阵元素。The controller 260 is used to control the transposer 240 to read a submatrix from the matrix buffer 230, and control the transposer 240 to perform a transposition process on the read submatrix to obtain a transposed submatrix, wherein the submatrix includes at least one column of matrix elements in the matrix to be transposed.

在一些实施例中，待转置矩阵可以在列方向上被划分为若干尺寸一致的子矩阵(可能会涉及到矩阵填充，将在下文中以单独实施例介绍)，控制器260控制转置器240从矩阵缓存器230中读取子矩阵时，每次读取即读取到若干子矩阵中的一个子矩阵。In some embodiments, the matrix to be transposed can be divided into several sub-matrices of uniform size in the column direction (may involve matrix filling, which will be introduced in a separate embodiment below). When the controller 260 controls the transposer 240 to read a sub-matrix from the matrix buffer 230, one sub-matrix among the several sub-matrices is read each time.

由于每次读取到的子矩阵的尺寸一致，因此转置器240可以采用一致的转置逻辑对各个子矩阵执行转置处理，得到转置子矩阵。Since the size of the sub-matrices read each time is consistent, the transposer 240 can use consistent transposition logic to perform transposition processing on each sub-matrix to obtain a transposed sub-matrix.

在一个示意性的例子中，当待转置矩阵的尺寸为8×8时，待转置矩阵可以沿列方向被划分为8个8×1的子矩阵，控制器260每次控制转置器240读取一个子矩阵，共读取到8个子矩阵。转置器240对各个子矩阵执行转置处理，得到尺寸为1×8的转置子矩阵。In an illustrative example, when the size of the matrix to be transposed is 8×8, the matrix to be transposed can be divided into 8 8×1 sub-matrices along the column direction, and the controller 260 controls the transposer 240 to read one sub-matrix each time, and a total of 8 sub-matrices are read. The transposer 240 performs a transposition process on each sub-matrix to obtain a transposed sub-matrix with a size of 1×8.

进一步的，控制器260，用于控制矩阵写入器250将转置器240输出的转置子矩阵写入下级存储器。Furthermore, the controller 260 is used to control the matrix writer 250 to write the transposed sub-matrix output by the transposer 240 into the lower-level memory.

在一些实施例中，转置器240每完成一次转置处理后，矩阵写入器250即将转置得到的转置子矩阵的矩阵元素写入下级存储器。矩阵写入器250将所有转置子矩阵写入下级存储器后，下级存储器中即存储待转置矩阵对应的转置矩阵。In some embodiments, after the transposer 240 completes a transposition process, the matrix writer 250 writes the matrix elements of the transposed submatrix obtained by transposition into the lower-level memory. After the matrix writer 250 writes all the transposed submatrices into the lower-level memory, the lower-level memory stores the transposed matrix corresponding to the matrix to be transposed.

以待转置矩阵的尺寸为8×8，且存储器读写该矩阵需要8个时钟周期为例，在相关技术中沿对角线方向执行矩阵转置时，矩阵转置需要经过15个时钟周期；而采用本申请实施例提供的矩阵转置装置沿列方向执行矩阵转置时，矩阵转置仅需要经过8个时钟周期，矩阵转置的效率更高，且矩阵转置仅需要的时钟周期与存储器的读写延迟一致，有助于提高性能，提高传输带宽。 Taking the example that the size of the matrix to be transposed is 8×8 and the memory requires 8 clock cycles to read and write the matrix, in the related art, when the matrix transposition is performed along the diagonal direction, the matrix transposition requires 15 clock cycles; while when the matrix transposition device provided in the embodiment of the present application is used to perform the matrix transposition along the column direction, the matrix transposition only requires 8 clock cycles. The matrix transposition is more efficient, and the clock cycle required for the matrix transposition is consistent with the read and write delay of the memory, which helps to improve performance and increase transmission bandwidth.

此外，在相关技术中沿对角线方向执行矩阵转置时，由于每次读取到的矩阵元素数量为1至8个，因此需要设置8种反转逻辑；而采用本申请实施例提供的方案，由于每次读取到的子矩阵中矩阵元素的数量均为8个，因此只需要为转置器设置一种转置逻辑，无需处理不同数量的矩阵元素，降低了矩阵转置装置的设计复杂度，有助于降低矩阵转置装置占用的片上面积。In addition, in the related art, when performing matrix transposition along the diagonal direction, since the number of matrix elements read each time is 1 to 8, 8 types of inversion logic need to be set; while with the solution provided in the embodiment of the present application, since the number of matrix elements in the sub-matrix read each time is 8, only one transposition logic needs to be set for the transposer, and there is no need to process different numbers of matrix elements, which reduces the design complexity of the matrix transposition device and helps to reduce the on-chip area occupied by the matrix transposition device.

综上所述，本申请实施例中，矩阵转置装置通过对矩阵读取器读取到的矩阵元素执行数据移位处理，对移位后存储在矩阵缓存器中的矩阵执行分块读取，得到包含待转置矩阵中至少一列矩阵元素的子矩阵，并对子矩阵执行转置，进而基于各个子矩阵对应的转置子矩阵得到待转置矩阵的转置结果，实现了沿矩阵列方向的矩阵转置，相较于沿矩阵对角线方向执行矩阵转置能够降低矩阵转置的处理延迟；此外，沿矩阵列方向执行矩阵转置时，由于每次从矩阵缓存器中提取出的矩阵元素的数量一致，因此同一矩阵只需要采用一种转置逻辑，有助于降低矩阵转置装置的设计复杂度，减小矩阵转置装置占用的片上面积。To summarize, in the embodiment of the present application, the matrix transposition device performs data shift processing on the matrix elements read by the matrix reader, performs block reading on the matrix stored in the matrix buffer after the shift, obtains a sub-matrix containing at least one column of matrix elements in the matrix to be transposed, and performs transposition on the sub-matrix, and then obtains the transposition result of the matrix to be transposed based on the transposed sub-matrix corresponding to each sub-matrix, thereby realizing matrix transposition along the matrix column direction, which can reduce the processing delay of the matrix transposition compared to performing matrix transposition along the matrix diagonal direction; in addition, when performing matrix transposition along the matrix column direction, since the number of matrix elements extracted from the matrix buffer each time is the same, the same matrix only needs to use one transposition logic, which helps to reduce the design complexity of the matrix transposition device and reduce the on-chip area occupied by the matrix transposition device.

在一种将待转置矩阵划分为若干子矩阵的方式中，对于尺寸为N×M的待转置矩阵，该待转置矩阵被划为k个子矩阵，各个子矩阵中矩阵元素的总位宽大于等于读接口位宽，即单次转置处理的子矩阵中矩阵元素的总位宽大于等于读接口位宽，从而保证矩阵转置过程中无传输带宽损失。In a method of dividing a matrix to be transposed into several sub-matrices, for a matrix to be transposed of size N×M, the matrix to be transposed is divided into k sub-matrices, and the total bit width of matrix elements in each sub-matrix is greater than or equal to the read interface bit width, that is, the total bit width of matrix elements in the sub-matrix processed by a single transposition is greater than or equal to the read interface bit width, thereby ensuring that there is no transmission bandwidth loss during the matrix transposition process.

关于子矩阵列数的确定方式，在一种可能的实施方式中，控制器基于上级存储器的读接口位宽，以及待转置矩阵的矩阵参数，确定子矩阵的子矩阵列数。Regarding the manner of determining the number of submatrix columns, in a possible implementation, the controller determines the number of submatrix columns of the submatrix based on the read interface bit width of the upper-level memory and the matrix parameters of the matrix to be transposed.

其中，待转置矩阵的矩阵参数包括待转置矩阵的行数、列数以及各个矩阵元素的比特数。The matrix parameters of the matrix to be transposed include the number of rows and columns of the matrix to be transposed and the number of bits of each matrix element.

由于矩阵中的矩阵元素通常按照先行后列的方式存储在存储器中，因此当待转置矩阵的行矩阵元素位宽大于等于读接口位宽时，即可保证每次矩阵处理的矩阵元素的总位宽大于等于读接口位宽。Since the matrix elements in the matrix are usually stored in the memory in the order of rows first and columns later, when the bit width of the row matrix elements of the matrix to be transposed is greater than or equal to the read interface bit width, it can be ensured that the total bit width of the matrix elements in each matrix processing is greater than or equal to the read interface bit width.

在一种可能的实施方式中，在读接口位宽等于待转置矩阵的行矩阵元素位宽的情况下，将待转置矩阵的列行比向上取整结果确定为子矩阵列数。In a possible implementation, when the read interface bit width is equal to the row matrix element bit width of the matrix to be transposed, the column-row ratio of the matrix to be transposed is rounded up to an integer and determined as the number of submatrix columns.

其中，行矩阵元素位宽基于待转置矩阵的列数以及各个矩阵元素的比特数确定得到。示意性的，对于N×M的待转置矩阵，行矩阵元素位宽＝M(待转置矩阵的列数)×Element_size(矩阵元素的比特数)。The row matrix element bit width is determined based on the number of columns of the matrix to be transposed and the number of bits of each matrix element. Schematically, for an N×M matrix to be transposed, the row matrix element bit width=M(the number of columns of the matrix to be transposed)×Element_size(the number of bits of the matrix element).

比如，当待转置矩阵的尺寸为8×8，且矩阵元素的比特数为128bit时，该待转置矩阵的行矩阵元素位宽即为1024bit。For example, when the size of the matrix to be transposed is 8×8 and the number of bits of the matrix elements is 128 bits, the bit width of the row matrix elements of the matrix to be transposed is 1024 bits.

当行矩阵元素位宽等于读接口位宽时，控制器确定k＝ceiling[M/N]，其中，ceilling为向上取整操作，即将待转置矩阵划分为N个子矩阵，各个子矩阵的尺寸为N×k。When the row matrix element bit width is equal to the read interface bit width, the controller determines k=ceiling[M/N], where ceilling is a rounding-up operation, that is, the matrix to be transposed is divided into N sub-matrices, and the size of each sub-matrix is N×k.

相应的，划分得到的各个子矩阵中矩阵元素的总位宽即为N×k×Element_size，由于k≥M/N，且M×Element_size＝read_port_width(读接口位宽)，因此N×k×Element_size≥read_port_width。Correspondingly, the total bit width of the matrix elements in each sub-matrix obtained by division is N×k×Element_size. Since k≥M/N, and M×Element_size＝read_port_width (read interface bit width), N×k×Element_size≥read_port_width.

在一个示意性的例子中，当待转置矩阵的行矩阵元素位宽为2048bit，且读接口位宽为2048bit时，若待转置矩阵的尺寸为8×8，则确定子矩阵列数为1，即各个子矩阵的尺寸为8×1；若待转置矩阵的尺寸为4×8，则确定子矩阵列数为2，即各个子矩阵的尺寸为4×2；若待转置矩阵的尺寸为6×8，则确定子矩阵列数为2，即各个子矩阵的尺寸为6×2。In an illustrative example, when the row matrix element bit width of the matrix to be transposed is 2048 bits and the read interface bit width is 2048 bits, if the size of the matrix to be transposed is 8×8, the number of submatrix columns is determined to be 1, that is, the size of each submatrix is 8×1; if the size of the matrix to be transposed is 4×8, the number of submatrix columns is determined to be 2, that is, the size of each submatrix is 4×2; if the size of the matrix to be transposed is 6×8, the number of submatrix columns is determined to be 2, that is, the size of each submatrix is 6×2.

在另一种可能的实施方式中，在读接口位宽大于待转置矩阵的行矩阵元素位宽的情况下，将读接口位宽与待转置矩阵的列矩阵元素位宽的比值的向上取整结果确定为子矩阵列数，列矩阵元素位宽基于待转置矩阵的行数以及各个矩阵元素的比特数确定得到。In another possible implementation, when the read interface bit width is greater than the row matrix element bit width of the matrix to be transposed, the result of rounding up the ratio of the read interface bit width to the column matrix element bit width of the matrix to be transposed is determined as the number of submatrix columns, and the column matrix element bit width is determined based on the number of rows of the matrix to be transposed and the number of bits of each matrix element.

其中，列矩阵元素位宽基于待转置矩阵的行数以及各个矩阵元素的比特数确定得到。示意性的，对于N×M的待转置矩阵，列矩阵元素位宽＝N(待转置矩阵的列数)×Element_size(矩阵元素的比特数)。 The column matrix element bit width is determined based on the number of rows of the matrix to be transposed and the number of bits of each matrix element. Schematically, for an N×M matrix to be transposed, the column matrix element bit width=N(the number of columns of the matrix to be transposed)×Element_size(the number of bits of the matrix element).

比如，当待转置矩阵的尺寸为8×8，且矩阵元素的比特数为128bit时，该待转置矩阵的列矩阵元素位宽即为1024bit。For example, when the size of the matrix to be transposed is 8×8 and the number of bits of the matrix elements is 128 bits, the bit width of the column matrix elements of the matrix to be transposed is 1024 bits.

当行矩阵元素位宽等于读接口位宽时，控制器确定k＝ceilling[read_port_width/(N*Element_size)]，即将待转置矩阵划分为N个子矩阵，各个子矩阵的尺寸为N×k。When the row matrix element width is equal to the read interface width, the controller determines k=ceilling[read_port_width/(N*Element_size)], that is, the matrix to be transposed is divided into N sub-matrices, and the size of each sub-matrix is N×k.

相应的，划分得到的各个子矩阵中矩阵元素的总位宽即为N×k×Element_size，由于k≥read_port_width/N*Element_size，因此N×k×Element_size≥read_port_width。Correspondingly, the total bit width of the matrix elements in each sub-matrix obtained by division is N×k×Element_size. Since k≥read_port_width/N*Element_size, N×k×Element_size≥read_port_width.

在一个示意性的例子中，当待转置矩阵的行矩阵元素位宽为1024bit，且读接口位宽为2048bit时，若待转置矩阵的尺寸为8×8(矩阵元素为128bit)，则确定子矩阵列数为2，即各个子矩阵的尺寸为8×2。In an illustrative example, when the row matrix element bit width of the matrix to be transposed is 1024 bits and the read interface bit width is 2048 bits, if the size of the matrix to be transposed is 8×8 (the matrix element is 128 bits), the number of submatrix columns is determined to be 2, that is, the size of each submatrix is 8×2.

为了保证各个子矩阵的尺寸一致，当N无法被M整除时，需要对待转置矩阵执行填充(padding)处理，相应的，后续向下级存储器中写入转置子矩阵时，需要执行填充去除。In order to ensure that the sizes of the sub-matrices are consistent, when N cannot be divided by M, padding processing needs to be performed on the transposed matrix. Accordingly, when the transposed sub-matrix is subsequently written into the lower-level memory, padding removal needs to be performed.

示意性的，如图3所示，对于N×M的待转置矩阵，由于M无法整除N，因此矩阵读取器在读取到的矩阵元素后填充M’-M个矩阵元素，其中，M’＝k*N。由于经过转置后得到的尺寸为M’×N，因此需要将转置后矩阵的最后M’-M行矩阵元素去除，还原得到M×N的矩阵。Schematically, as shown in FIG3, for an N×M matrix to be transposed, since M cannot divide N evenly, the matrix reader fills M'-M matrix elements after the read matrix elements, where M'=k*N. Since the size obtained after transposition is M'×N, it is necessary to remove the last M'-M rows of matrix elements of the transposed matrix to restore the M×N matrix.

在一种可能的实施方式中，在读接口位宽等于待转置矩阵的行矩阵元素位宽，且列行比为大于1的非整数的情况下，控制器控制矩阵读取器在读取到的矩阵元素后执行填充，以实现填充后的待转置矩阵的列行比为整数。In a possible implementation, when the read interface bit width is equal to the row matrix element bit width of the matrix to be transposed, and the column-to-row ratio is a non-integer greater than 1, the controller controls the matrix reader to perform padding after the read matrix elements, so that the column-to-row ratio of the matrix to be transposed after padding is an integer.

可选的，矩阵读取器在读取到的矩阵元素后填充0或者其他预设数据。Optionally, the matrix reader fills the read matrix elements with 0 or other preset data.

在一种可能的实施方式中，控制器基于待转置矩阵的行数和列数确定填充列数，从而基于填充列数控制矩阵读取器在读取到的矩阵元素后执行填充。In a possible implementation, the controller determines the number of padding columns based on the number of rows and columns of the matrix to be transposed, and controls the matrix reader to perform padding after the read matrix elements based on the number of padding columns.

其中，填充列数为k×N-M。比如，当待转置矩阵的尺寸为6×8时，矩阵读取器每次读取到矩阵元素后，需要填充4列矩阵元素。The number of filled columns is k × N - M. For example, when the size of the matrix to be transposed is 6 × 8, the matrix reader needs to fill 4 columns of matrix elements each time it reads a matrix element.

需要说明的是，矩阵读取器在读取矩阵元素的过程中随路填充，无需额外占用时长。It should be noted that the matrix reader fills in the matrix elements as it reads them, without taking up any additional time.

相应的，控制器控制矩阵写入器对转置器输出的至少一个转置子矩阵执行填充去除，并将至少一个填充去除后的转置子矩阵中的矩阵元素写入下级存储器。需要说明的是，执行填充去除的至少一个转置子矩阵，位于末位位置的。也即是说，上述至少一个转置子矩阵中的数据位置位于填充后的待转置矩阵的转置结果的末尾一行或多行。需要说明的是，在一些示例中，对转置器输出的转置子矩阵执行填充去除可以实现为去除转置子矩阵中的部分或全部数据。Correspondingly, the controller controls the matrix writer to perform padding removal on at least one transposed submatrix output by the transposer, and writes the matrix elements in at least one transposed submatrix after padding removal into the lower-level memory. It should be noted that at least one transposed submatrix after padding removal is located at the last position. In other words, the data position in the at least one transposed submatrix is located at the end of one or more rows of the transposed result of the matrix to be transposed after padding. It should be noted that, in some examples, performing padding removal on the transposed submatrix output by the transposer can be implemented by removing part or all of the data in the transposed submatrix.

由于子矩阵经过转置后，填充的列变为了行，因此矩阵写入器需要对至少一个转置子矩阵执行去除填充，以去除填充的列经过转置后变成的行。需要说明的是，此处的去除并不是指特定去除操作，而是指矩阵写入器不将转置子矩阵中的部分或全部行写入下级存储器。Since the filled columns become rows after the submatrix is transposed, the matrix writer needs to perform a de-filling operation on at least one transposed submatrix to remove the rows that the filled columns become after the transposition. It should be noted that the de-filling here does not refer to a specific de-filling operation, but means that the matrix writer does not write some or all of the rows in the transposed submatrix to the lower-level memory.

在一种可能的实施方式中，控制器基于填充列数控制矩阵写入器对转置器输出的至少一个转置子矩阵执行填充去除，其中，从至少一个转置子矩阵中需要去除的矩阵元素的行数总和与填充列数相同。In one possible implementation, the controller controls the matrix writer to perform padding removal on at least one transposed submatrix output by the transposer based on the number of padding columns, wherein the sum of the number of rows of matrix elements to be removed from at least one transposed submatrix is the same as the number of padding columns.

去除的矩阵元素的行数与填充列数一致，即k×N-M。比如，当待转置矩阵的尺寸为6×8时，为了保证填充后的待转置矩阵的列行比为整数，需要填充4列矩阵元素，填充后的待转置矩阵的尺寸为6×12。如上介绍，每一个子矩阵的尺寸为6×2，对应的转置子矩阵的尺寸为2×6。需要去除最后4行矩阵元素，以去除填充的4列矩阵元素经过转置后变成的行。每个转置子矩阵的尺寸为2×6，即包括两行六列，需要对处于末位位置的两个转置子矩阵执行填充去除，去除转置子矩阵中的全部数据，以实现去除最后4行矩阵元素。The number of rows of matrix elements to be removed is consistent with the number of padded columns, that is, k×N-M. For example, when the size of the matrix to be transposed is 6×8, in order to ensure that the column-row ratio of the padded matrix to be transposed is an integer, 4 columns of matrix elements need to be padded, and the size of the padded matrix to be transposed is 6×12. As introduced above, the size of each submatrix is 6×2, and the size of the corresponding transposed submatrix is 2×6. The last 4 rows of matrix elements need to be removed to remove the rows that the padded 4 columns of matrix elements become after transposition. The size of each transposed submatrix is 2×6, that is, it includes two rows and six columns. It is necessary to perform padding removal on the two transposed submatrices in the last position to remove all data in the transposed submatrix to achieve the removal of the last 4 rows of matrix elements.

为了避免从矩阵缓存器中读取子矩阵时出现bank读冲突，在一种可能的实施方式中，控制器基于子矩阵的子矩阵列数，确定矩阵读取器各次读取到的矩阵元素对应的移位步长，从而基于移位步长，控制移位器对矩阵读取器读取到的矩阵元素执行数据右移。 In order to avoid bank read conflicts when reading a sub-matrix from a matrix buffer, in one possible implementation, the controller determines a shift step corresponding to each matrix element read by the matrix reader based on the number of sub-matrix columns of the sub-matrix, and thereby controls the shifter to perform data right shift on the matrix elements read by the matrix reader based on the shift step.

在一种可能的实施方式中，对于矩阵读取器第n次读取到的矩阵元素，控制器确定该矩阵元素的移位步长为(n-1)×k，即第一次读取到的矩阵元素的移位步长为0，第二次读取到的矩阵元素的移位步长为k，第三次读取到的矩阵元素的移位步长为2k，以此类推。In one possible implementation, for the matrix element read by the matrix reader for the nth time, the controller determines that the shift step length of the matrix element is (n-1)×k, that is, the shift step length of the matrix element read for the first time is 0, the shift step length of the matrix element read for the second time is k, the shift step length of the matrix element read for the third time is 2k, and so on.

通过上述移位处理后，原先位于待转置矩阵中同一列的矩阵元素被移动至不同列，以此避免从矩阵缓存器中读取子矩阵时出现bank读冲突。After the above shifting process, the matrix elements originally located in the same column of the matrix to be transposed are moved to different columns, thereby avoiding bank read conflicts when reading sub-matrices from the matrix buffer.

示意性的，如图4所示，当待转置矩阵的行数为5，且子矩阵的子矩阵列数为k时，第一行矩阵元素右移0步长后写入了矩阵缓存器，第二行矩阵元素右移k步长后写入了矩阵缓存器，第三行矩阵元素右移2k步长后写入了矩阵缓存器，第四行矩阵元素右移3k步长后写入了矩阵缓存器，第五行矩阵元素右移4k步长后写入了矩阵缓存器。Schematically, as shown in Figure 4, when the number of rows of the matrix to be transposed is 5 and the number of submatrix columns of the submatrix is k, the matrix elements in the first row are shifted right by 0 steps and written into the matrix buffer, the matrix elements in the second row are shifted right by k steps and written into the matrix buffer, the matrix elements in the third row are shifted right by 2k steps and written into the matrix buffer, the matrix elements in the fourth row are shifted right by 3k steps and written into the matrix buffer, and the matrix elements in the fifth row are shifted right by 4k steps and written into the matrix buffer.

在一种可能的设计中，矩阵缓存器有若干内存bank构成。In one possible design, the matrix buffer is composed of several memory banks.

在一个示意性的例子中，如图5所示，矩阵缓存器中设置有2n个内存bank，且各个内存bank分别与写总线与写数据接口相连，通过读总线与读数据接口相连，通过地址总线与地址接口相连。因此，移位器可以通过写数据接口向不同内存bank中并行写入数据，转置器可以通过读数据接口从不同内存bank中并行读取数据。In an illustrative example, as shown in FIG5 , 2n memory banks are provided in the matrix buffer, and each memory bank is respectively connected to the write bus and the write data interface, connected to the read data interface through the read bus, and connected to the address interface through the address bus. Therefore, the shifter can write data to different memory banks in parallel through the write data interface, and the transposer can read data from different memory banks in parallel through the read data interface.

关于将移位后矩阵元素写入矩阵缓存器的方式，在一种可能的实施方式中，在执行矩阵元素写入时，控制器用于控制移位器将移位后的矩阵元素写入矩阵缓存器中的多个内存bank；而在执行矩阵元素读取时，控制器用于基于子矩阵列数，控制转置器从矩阵缓存器的多个内存bank中并行读取子矩阵的矩阵元素。Regarding the manner of writing shifted matrix elements into a matrix buffer, in one possible implementation, when executing matrix element writing, the controller is used to control the shifter to write the shifted matrix elements into a plurality of memory banks in the matrix buffer; and when executing matrix element reading, the controller is used to control the transposer to read the matrix elements of the sub-matrix in parallel from the plurality of memory banks in the matrix buffer based on the number of sub-matrix columns.

在一些实施例中，控制器控制移位器将移位后不同的矩阵元素写入矩阵缓存器中不同内存bank。比如，当读取到一行8个矩阵元素时，移位器将8个矩阵元素分别写入8个内存bank中。In some embodiments, the controller controls the shifter to write different matrix elements after shifting into different memory banks in the matrix buffer. For example, when a row of 8 matrix elements is read, the shifter writes the 8 matrix elements into 8 memory banks respectively.

相应的，后续转置器从矩阵缓存器中读取一列矩阵元素时，转置器从8个内存bank中分别读取8个矩阵元素。Correspondingly, when the subsequent transposer reads a column of matrix elements from the matrix buffer, the transposer reads 8 matrix elements from 8 memory banks respectively.

在一种可能的实施方式中，从不同内存bank中读取同一子矩阵中的矩阵元素时，控制器基于子矩阵列数确定各个内存bank的读取地址，其中，该内存bank的读取地址递增，且地址递增的递增量为k个矩阵元素的位宽。In one possible implementation, when reading matrix elements in the same submatrix from different memory banks, the controller determines a read address of each memory bank based on the number of columns of the submatrix, wherein the read address of the memory bank is incremented, and the increment of the address increment is the bit width of k matrix elements.

示意性的，在读取第1个子矩阵时，转置器基于起始地址0从第1个内存bank读取矩阵元素，基于起始地址k×Element_size从第2个内存bank读取矩阵元素，基于起始地址2k×Element_size从第3个内存bank读取矩阵元素，以此类推。在读取第i个子矩阵时，转置器基于起始地址0从第i个内存bank读取矩阵元素，基于起始地址k×Element_size从第i+1个内存bank读取矩阵元素，基于起始地址2k×Element_size从第i+2个内存bank读取矩阵元素，以此类推。需要说明的是，当读取到存储矩阵元素的最后一个内存bank后，将会从第一个内存bank开始读取矩阵元素。Schematically, when reading the first sub-matrix, the transposer reads the matrix elements from the first memory bank based on the starting address 0, reads the matrix elements from the second memory bank based on the starting address k×Element_size, reads the matrix elements from the third memory bank based on the starting address 2k×Element_size, and so on. When reading the i-th sub-matrix, the transposer reads the matrix elements from the i-th memory bank based on the starting address 0, reads the matrix elements from the i+1-th memory bank based on the starting address k×Element_size, reads the matrix elements from the i+2-th memory bank based on the starting address 2k×Element_size, and so on. It should be noted that after reading the last memory bank storing the matrix elements, the matrix elements will be read starting from the first memory bank.

由于在向矩阵缓存器中写入矩阵元素时执行了数据移位，因此为了保证子矩阵中矩阵元素的顺序正确性，在一种可能的实施方式中，控制器还用于基于子矩阵列数，控制转置器对子矩阵的矩阵元素执行数据左移处理。Since data shifting is performed when writing matrix elements into the matrix buffer, in order to ensure the correctness of the order of matrix elements in the sub-matrix, in a possible implementation, the controller is also used to control the transposer to perform data left shifting processing on the matrix elements of the sub-matrix based on the number of columns of the sub-matrix.

可选的，该矩阵元素左移的移位步长为(n-1)×k，即第一次读取到的矩阵元素左移0，第二次读取到的矩阵元素左移k，第三次读取到的矩阵元素的左移2k，以此类推。Optionally, the shift step length of the matrix element left shift is (n-1)×k, that is, the matrix element read for the first time is shifted left by 0, the matrix element read for the second time is shifted left by k, the matrix element read for the third time is shifted left by 2k, and so on.

示意性的，如图6所示，待转置矩阵被划分为成5×k子矩阵，将右移处理后的待转置矩阵完整写入矩阵缓存器后，第一步，转置器从矩阵缓存器中读取第一子矩阵中的矩阵元素，并对矩阵元素左移0步长，进而对第一子矩阵执行转置，得到由待转置矩阵中第1至第k列矩阵元素转置得到的k×5矩阵，并将第一个k×5矩阵的矩阵元素写入下级存储器。Schematically, as shown in FIG6 , the matrix to be transposed is divided into 5×k sub-matrices. After the matrix to be transposed after the right shift processing is completely written into the matrix buffer, in the first step, the transposer reads the matrix elements in the first sub-matrix from the matrix buffer, and shifts the matrix elements left by 0 steps, and then performs transposition on the first sub-matrix to obtain a k×5 matrix obtained by transposing the matrix elements from the 1st to the kth columns in the matrix to be transposed, and writes the matrix elements of the first k×5 matrix into the lower-level memory.

第二步，转置器从矩阵缓存器中读取第二子矩阵中的矩阵元素，并对矩阵元素左移k步长，进而对第二子矩阵执行转置，得到由待转置矩阵中第k+1至第2k列矩阵元素转置得到的k×5矩阵，并将第二个k×5矩阵的矩阵元素写入下级存储器。 In the second step, the transposer reads the matrix elements in the second submatrix from the matrix buffer, shifts the matrix elements left by k steps, and then performs transposition on the second submatrix to obtain a k×5 matrix obtained by transposing the matrix elements in the k+1th to 2kth columns in the matrix to be transposed, and writes the matrix elements of the second k×5 matrix into the lower-level memory.

第三和第四步的过程与第二步类似(左移步长分别为2k和3k)，本实施例不再赘述。The processes of the third and fourth steps are similar to the second step (the left shift step lengths are 2k and 3k respectively), and will not be repeated in this embodiment.

第五步，转置器从矩阵缓存器中读取第五子矩阵中的矩阵元素，并对矩阵元素左移4k步长，进而对第五子矩阵执行转置，得到由待转置矩阵中第4k+1至第5k列矩阵元素转置得到的k×5矩阵。由于k×5矩阵中存在填充数据，因此向下级存储器中写入前，需要基于读取时的填充列数，将第五个k×5矩阵的最后若干列去除，并将剩余矩阵元素写入下级存储器。In the fifth step, the transposer reads the matrix elements in the fifth submatrix from the matrix buffer, shifts the matrix elements to the left by 4k steps, and then transposes the fifth submatrix to obtain a k×5 matrix obtained by transposing the matrix elements in the 4k+1th to 5kth columns in the matrix to be transposed. Since there is padding data in the k×5 matrix, before writing to the lower-level memory, it is necessary to remove the last several columns of the fifth k×5 matrix based on the number of padding columns when reading, and write the remaining matrix elements into the lower-level memory.

由于每次从矩阵缓存器中读取的矩阵元素的数量一致，因此同一待转置矩阵可以采用统一转置逻辑的转置器。在一种可能的设计中，转置器可以由多个多路选择器(multiplexer)构成。在对矩阵元素执行转置处理时，各个多路选择器只需要根据配置的数据输出端执行选通输出即可。Since the number of matrix elements read from the matrix buffer each time is the same, the same matrix to be transposed can use a transposer with unified transposition logic. In a possible design, the transposer can be composed of multiple multiplexers. When performing transposition processing on the matrix elements, each multiplexer only needs to perform a gated output according to the configured data output terminal.

在一种可能的实施方式中，控制器用于基于待转置矩阵的矩阵尺寸，配置转置器中的多路选择器的数据输出端，而转置器则用于通过经过配置的多路选择器对子矩阵的矩阵元素执行选通输出，得到转置子矩阵。In one possible implementation, the controller is used to configure the data output end of the multiplexer in the transposer based on the matrix size of the matrix to be transposed, and the transposer is used to perform a gated output on the matrix elements of the submatrix through the configured multiplexer to obtain a transposed submatrix.

其中，控制器为不同矩阵尺寸的待转置矩阵所配置的多路选择器选通逻辑不同。The controller configures different multiplexer gating logics for matrices to be transposed of different matrix sizes.

比如，对于采用A类矩阵尺寸的待转置矩阵，控制器为其配置的多路选择器选通逻辑为数据输入端1选通数据输出端1，数据输入端2选通数据输出端2，数据输入端3选通数据输出端3；对于采用B类矩阵尺寸的待转置矩阵，控制器为其配置的多路选择器选通逻辑为数据输入端1选通数据输出端1，数据输入端2选通数据输出端3，数据输入端3选通数据输出端5，数据输入端4选通数据输出端7，数据输入端5选通数据输出端2，数据输入端6选通数据输出端4，数据输入端7选通数据输出端6，数据输入端8选通数据输出端8。For example, for a matrix to be transposed of type A matrix size, the controller configures the multiplexer selection logic for it so that data input terminal 1 selects data output terminal 1, data input terminal 2 selects data output terminal 2, and data input terminal 3 selects data output terminal 3; for a matrix to be transposed of type B matrix size, the controller configures the multiplexer selection logic for it so that data input terminal 1 selects data output terminal 1, data input terminal 2 selects data output terminal 3, data input terminal 3 selects data output terminal 5, data input terminal 4 selects data output terminal 7, data input terminal 5 selects data output terminal 2, data input terminal 6 selects data output terminal 4, data input terminal 7 selects data output terminal 6, and data input terminal 8 selects data output terminal 8.

示例性的，由于每次执行转置子矩阵中矩阵元素的数量与上级存储器的读接口位宽相关，因此转置器中多路选择器的数量与上级存储器的读接口位宽相匹配。Exemplarily, since the number of matrix elements in each transposition sub-matrix is related to the read interface bit width of the upper-level memory, the number of multiplexers in the transposer matches the read interface bit width of the upper-level memory.

进一步的，在一些实施例中，多路选择器的数量与读接口位宽呈正相关关系。比如，当矩阵元素的比特数为32bit时，若读接口位宽为1024bit，则转置器中配置32个多路选择器，若读接口位宽为2048bit，则转置器中配置64个多路选择器。Furthermore, in some embodiments, the number of multiplexers is positively correlated with the read interface bit width. For example, when the number of bits of the matrix element is 32 bits, if the read interface bit width is 1024 bits, 32 multiplexers are configured in the transposer, and if the read interface bit width is 2048 bits, 64 multiplexers are configured in the transposer.

结合上述实施例，下面采用示例性的例子对几种典型矩阵转置的过程执行说明。In combination with the above embodiments, several typical matrix transposition processes are described below using illustrative examples.

1、待转置矩阵的行列数相等，且行矩阵元素位宽与上级存储器的读接口位宽相同。1. The number of rows and columns of the matrix to be transposed is equal, and the bit width of the row matrix element is the same as the read interface bit width of the upper-level memory.

以8×8矩阵为例，如图7所示，矩阵读取器每次读取待转置矩阵中的一行矩阵元素，以k＝1为单位步长，对读取到的矩阵元素执行右移，即第一行矩阵元素右移0，第二行矩阵元素右移1，第三行矩阵元素右移2，以此类推，从而将数据移位后的待转置矩阵写入矩阵缓存器。Taking an 8×8 matrix as an example, as shown in FIG7 , the matrix reader reads a row of matrix elements in the matrix to be transposed each time, and right-shifts the read matrix elements with a unit step size of k=1, that is, the matrix elements in the first row are right-shifted by 0, the matrix elements in the second row are right-shifted by 1, the matrix elements in the third row are right-shifted by 2, and so on, thereby writing the matrix to be transposed after the data shift is written into the matrix buffer.

转置器从矩阵缓存器中读取第一个子矩阵中的矩阵元素，并对矩阵元素执行左移0处理，进而通过配置的转置逻辑，对矩阵元素执行转置处理，最终将转置得到的转置子矩阵的矩阵元素写入下级存储器。The transposer reads the matrix elements in the first submatrix from the matrix buffer, performs a left shift of 0 on the matrix elements, and then performs a transposition process on the matrix elements through the configured transposition logic, and finally writes the matrix elements of the transposed submatrix obtained by transposition into the lower-level memory.

类似的，转置器从矩阵缓存器中读取第二个子矩阵中的矩阵元素，并对矩阵元素执行左移1处理，进而通过配置的转置逻辑，对矩阵元素执行转置处理，最终将转置得到的转置子矩阵的矩阵元素写入下级存储器。Similarly, the transposer reads the matrix elements in the second submatrix from the matrix buffer, performs a left shift of 1 on the matrix elements, and then performs a transposition process on the matrix elements through the configured transposition logic, and finally writes the matrix elements of the transposed submatrix obtained by transposition into the lower-level memory.

以此类推，转置器完成剩余6个子矩阵的转置，最终将转置后的8×8矩阵写入下级缓存器中。Similarly, the transposer completes the transposition of the remaining 6 sub-matrices and finally writes the transposed 8×8 matrix into the lower-level cache.

2、待转置矩阵的行列数相等，但行矩阵元素位宽小于上级存储器的读接口位宽。2. The number of rows and columns of the matrix to be transposed is equal, but the bit width of the row matrix element is smaller than the read interface bit width of the upper-level memory.

以8×8矩阵，且行矩阵元素位宽为1024bit，读接口位宽为2048bit为例，如图8所示，矩阵读取器每次读取待转置矩阵中的2行矩阵元素，以k＝2为单位步长，对读取到的矩阵元素执行右移，即第一行矩阵元素右移0，第二行矩阵元素右移2，第三行矩阵元素右移4，以此类推，从而将数据移位后的待转置矩阵写入矩阵缓存器。Taking an 8×8 matrix with a row matrix element width of 1024 bits and a read interface width of 2048 bits as an example, as shown in FIG8 , the matrix reader reads 2 rows of matrix elements in the matrix to be transposed each time, with a unit step of k=2, and performs a right shift on the read matrix elements, that is, the first row of matrix elements is right shifted by 0, the second row of matrix elements is right shifted by 2, the third row of matrix elements is right shifted by 4, and so on, thereby writing the matrix to be transposed after the data shift is written into the matrix buffer.

转置器从矩阵缓存器中读取第一个子矩阵中的矩阵元素，并对矩阵元素执行左移0处理，进而通过配置的转置逻辑，对矩阵元素执行转置处理，最终将转置得到的转置子矩阵的矩阵元素写入下级存储器。The transposer reads the matrix elements of the first submatrix from the matrix buffer and performs a left shift of 0 on the matrix elements. The matrix elements are transposed by the configured transposition logic, and finally the matrix elements of the transposed submatrix obtained by transposition are written into the lower-level memory.

类似的，转置器从矩阵缓存器中读取第二个子矩阵中的矩阵元素，并对矩阵元素执行左移2处理，进而通过配置的转置逻辑，对矩阵元素执行转置处理，最终将转置得到的转置子矩阵的矩阵元素写入下级存储器。Similarly, the transposer reads the matrix elements in the second sub-matrix from the matrix buffer, performs a left shift of 2 on the matrix elements, and then performs a transposition process on the matrix elements through the configured transposition logic, and finally writes the matrix elements of the transposed sub-matrix obtained by transposition into the lower-level memory.

以此类推，转置器完成剩余2个子矩阵的转置，最终将转置后的8×8矩阵写入下级缓存器中。Similarly, the transposer completes the transposition of the remaining two sub-matrices and finally writes the transposed 8×8 matrix into the lower-level cache.

3、待转置矩阵的行列数不相等(M为N整数倍)，且行矩阵元素位宽等于上级存储器的读接口位宽。3. The number of rows and columns of the matrix to be transposed is not equal (M is an integer multiple of N), and the bit width of the row matrix element is equal to the read interface bit width of the upper-level memory.

以4×8矩阵为例，如图9所示，矩阵读取器每次读取待转置矩阵中的一行矩阵元素，以k＝2为单位步长，对读取到的矩阵元素执行右移，即第一行矩阵元素右移0，第二行矩阵元素右移2，第三行矩阵元素右移4，以此类推，从而将数据移位后的待转置矩阵写入矩阵缓存器。Taking a 4×8 matrix as an example, as shown in FIG9 , the matrix reader reads a row of matrix elements in the matrix to be transposed each time, and right-shifts the read matrix elements with a unit step size of k=2, that is, the matrix elements in the first row are right-shifted by 0, the matrix elements in the second row are right-shifted by 2, the matrix elements in the third row are right-shifted by 4, and so on, thereby writing the matrix to be transposed after the data shift is written into the matrix buffer.

以此类推，转置器完成剩余2个子矩阵的转置，最终将转置后的8×4矩阵写入下级缓存器中。Similarly, the transposer completes the transposition of the remaining two sub-matrices and finally writes the transposed 8×4 matrix into the lower-level cache.

需要说明的是，当M无法整除N时，在执行数据移位前需要执行额外的填充操作，且在执行最后一次转置子矩阵写入前，需要执行填充去除。比如，以4×7矩阵为例，执行数据移位前需要额外填充一列矩阵元素，在执行最后一次转置子矩阵写入前，需要将转置子矩阵的最后一行。It should be noted that when M cannot divide N, additional padding operations need to be performed before data shifting, and padding removal needs to be performed before the last transposed submatrix is written. For example, taking a 4×7 matrix as an example, an additional column of matrix elements needs to be filled before data shifting, and the last row of the transposed submatrix needs to be removed before the last transposed submatrix is written.

在本申请的一种应用场景中，矩阵转置装置用于处理神经网络的训练和推理过程，矩阵转置装置称为神经网络的参数矩阵转置装置；该装置包括：矩阵读取器、移位器、矩阵缓存器、转置器、矩阵写入器以及控制器；In an application scenario of the present application, a matrix transposition device is used to process the training and reasoning process of a neural network, and the matrix transposition device is called a parameter matrix transposition device of a neural network; the device includes: a matrix reader, a shifter, a matrix buffer, a transposition device, a matrix writer, and a controller;

控制器用于控制矩阵读取器从上级存储器中读取待转置的神经网络参数矩阵的参数元素，其中，矩阵读取器每次读取神经网络参数矩阵中的至少一行参数元素；The controller is used to control the matrix reader to read parameter elements of the neural network parameter matrix to be transposed from the upper-level memory, wherein the matrix reader reads at least one row of parameter elements in the neural network parameter matrix each time;

控制器用于控制移位器对矩阵读取器读取到的参数元素执行数据移位，以及控制移位器将移位后的参数元素写入矩阵缓存器；The controller is used to control the shifter to perform data shift on the parameter elements read by the matrix reader, and to control the shifter to write the shifted parameter elements into the matrix buffer;

可选的，控制器还用于基于上级存储器的读接口位宽，以及神经网络参数矩阵的矩阵参数，确定子矩阵的子矩阵列数；Optionally, the controller is further used to determine the number of submatrix columns of the submatrix based on the read interface bit width of the upper memory and the matrix parameters of the neural network parameter matrix;

基于子矩阵列数，确定矩阵读取器各次读取到的参数元素对应的移位步长；Based on the number of submatrix columns, determining the shift step corresponding to the parameter elements read by the matrix reader each time;

基于移位步长，控制移位器对矩阵读取器读取到的参数元素执行数据右移。Based on the shift step size, the control shifter performs data right shift on the parameter elements read by the matrix reader.

可选的，矩阵缓存器由若干内存bank构成；Optionally, the matrix buffer is composed of a number of memory banks;

控制器用于控制移位器将移位后的参数元素写入矩阵缓存器中的多个内存bank；The controller is used for controlling the shifter to write the shifted parameter elements into a plurality of memory banks in the matrix buffer;

控制器还用于基于子矩阵列数，控制转置器从矩阵缓存器的多个内存bank中并行读取子矩阵的参数元素。The controller is further used to control the transposer to read the parameter elements of the sub-matrix in parallel from the plurality of memory banks of the matrix buffer based on the number of columns of the sub-matrix.

控制器用于控制转置器从矩阵缓存器中读取子矩阵，以及控制转置器对读取到的子矩阵执行转置处理，得到转置子矩阵，其中，子矩阵包含神经网络参数矩阵中的至少一列参数元素；The controller is used to control the transposer to read a submatrix from the matrix buffer, and to control the transposer to perform a transposition process on the read submatrix to obtain a transposed submatrix, wherein the submatrix includes at least one column of parameter elements in the neural network parameter matrix;

可选的，转置器由多个多路选择器构成；Optionally, the transposer is composed of a plurality of multiplexers;

控制器还用于基于神经网络参数矩阵的矩阵尺寸，配置转置器中的多路选择器的数据输出端；The controller is also used to configure the data input of the multiplexer in the transposer based on the matrix size of the neural network parameter matrix. outgoing end;

转置器用于通过经过配置的多路选择器对子矩阵的矩阵元素执行选通输出，得到转置子矩阵。The transposer is used for performing gated output on matrix elements of the sub-matrix through the configured multiplexer to obtain a transposed sub-matrix.

控制器用于控制矩阵写入器将转置器输出的转置子矩阵写入下级存储器。The controller is used for controlling the matrix writer to write the transposed sub-matrix output by the transposer into the lower-level memory.

其中，下级存储器用于存储神经网络参数矩阵的转置结果，和/或，向执行神经网络的训练和推理过程的处理器(或处理装置)提供神经网络参数矩阵的转置结果。The lower-level memory is used to store the transposed result of the neural network parameter matrix, and/or to provide the transposed result of the neural network parameter matrix to a processor (or processing device) that performs the training and reasoning process of the neural network.

在本申请的另一种应用场景中，矩阵转置装置用于执行图像处理，比如通过矩阵转置实现图像翻转或轮换，矩阵转置装置称为图像转置装置；该装置包括：矩阵读取器、移位器、矩阵缓存器、转置器、矩阵写入器以及控制器；In another application scenario of the present application, a matrix transposition device is used to perform image processing, such as realizing image flipping or rotation through matrix transposition, and the matrix transposition device is called an image transposition device; the device includes: a matrix reader, a shifter, a matrix buffer, a transposer, a matrix writer, and a controller;

控制器用于控制矩阵读取器从上级存储器中读取待转置的图像像素矩阵的像素元素，其中，矩阵读取器每次读取图像像素矩阵中的至少一行像素元素；The controller is used to control the matrix reader to read the pixel elements of the image pixel matrix to be transposed from the upper memory, wherein the matrix reader reads at least one row of pixel elements in the image pixel matrix each time;

控制器用于控制移位器对矩阵读取器读取到的像素元素执行数据移位，以及控制移位器将移位后的像素元素写入矩阵缓存器；The controller is used to control the shifter to perform data shift on the pixel elements read by the matrix reader, and to control the shifter to write the shifted pixel elements into the matrix buffer;

可选的，控制器还用于基于上级存储器的读接口位宽，以及图像像素矩阵的矩阵参数，确定子矩阵的子矩阵列数；Optionally, the controller is further used to determine the number of submatrix columns of the submatrix based on the read interface bit width of the upper memory and the matrix parameters of the image pixel matrix;

基于子矩阵列数，确定矩阵读取器各次读取到的像素元素对应的移位步长；Based on the number of submatrix columns, determining the shift step length corresponding to the pixel elements read by the matrix reader each time;

基于移位步长，控制移位器对矩阵读取器读取到的像素元素执行数据右移。Based on the shift step size, the shifter is controlled to perform data right shift on the pixel elements read by the matrix reader.

控制器用于控制移位器将移位后的像素元素写入矩阵缓存器中的多个内存bank；The controller is used for controlling the shifter to write the shifted pixel elements into a plurality of memory banks in the matrix buffer;

控制器还用于基于子矩阵列数，控制转置器从矩阵缓存器的多个内存bank中并行读取子矩阵的像素元素。The controller is further used for controlling the transposer to read pixel elements of the submatrix in parallel from a plurality of memory banks of the matrix buffer based on the number of columns of the submatrix.

控制器用于控制转置器从矩阵缓存器中读取子矩阵，以及控制转置器对读取到的子矩阵执行转置处理，得到转置子矩阵，其中，子矩阵包含图像像素矩阵中的至少一列像素元素；The controller is used to control the transposer to read a submatrix from the matrix buffer, and to control the transposer to perform a transposition process on the read submatrix to obtain a transposed submatrix, wherein the submatrix includes at least one column of pixel elements in the image pixel matrix;

控制器还用于基于图像像素矩阵的矩阵尺寸，配置转置器中的多路选择器的数据输出端；The controller is also used to configure the data output terminal of the multiplexer in the transposer based on the matrix size of the image pixel matrix;

转置器用于通过经过配置的多路选择器对子矩阵的像素元素执行选通输出，得到转置子矩阵。The transposer is used for performing gate output on the pixel elements of the sub-matrix through the configured multiplexer to obtain a transposed sub-matrix.

其中，下级存储器用于存储图像像素矩阵的转置结果，和/或，向执行神经网络的训练和推理过程的处理器(或处理装置)提供图像像素矩阵的转置结果。The lower-level memory is used to store the transposed result of the image pixel matrix, and/or to provide the transposed result of the image pixel matrix to a processor (or processing device) that performs the training and reasoning process of the neural network.

在本申请的又一种应用场景中，矩阵转置装置用于执行信号处理，比如通过矩阵转置实现信号的傅里叶变换(FFT)或傅里叶变换的一个子过程；矩阵转置装置称为信号转置装置；该装置包括：矩阵读取器、移位器、矩阵缓存器、转置器、矩阵写入器以及控制器；In another application scenario of the present application, a matrix transposition device is used to perform signal processing, such as realizing a Fourier transform (FFT) of a signal or a sub-process of the Fourier transform through matrix transposition; the matrix transposition device is called a signal transposition device; the device includes: a matrix reader, a shifter, a matrix buffer, a transposer, a matrix writer, and a controller;

控制器用于控制矩阵读取器从上级存储器中读取待转置的信号矩阵的信号元素，其中，矩阵读取器每次读取信号矩阵中的至少一行信号元素；The controller is used to control the matrix reader to read the signal elements of the signal matrix to be transposed from the upper-level memory, wherein the matrix reader reads at least one row of signal elements in the signal matrix each time;

控制器用于控制移位器对矩阵读取器读取到的信号元素执行数据移位，以及控制移位器将移位后的信号元素写入矩阵缓存器；The controller is used to control the shifter to perform data shift on the signal elements read by the matrix reader, and to control the shifter to write the shifted signal elements into the matrix buffer;

可选的，控制器还用于基于上级存储器的读接口位宽，以及信号矩阵的矩阵参数，确定子矩阵的子矩阵列数；Optionally, the controller is further used to determine the number of sub-matrix columns of the sub-matrix based on the read interface bit width of the upper-level memory and the matrix parameters of the signal matrix;

基于子矩阵列数，确定矩阵读取器各次读取到的信号元素对应的移位步长；Based on the number of submatrix columns, determining the shift step length corresponding to the signal element read by the matrix reader each time;

基于移位步长，控制移位器对矩阵读取器读取到的信号元素执行数据右移。Based on the shift step size, the shifter is controlled to perform data right shift on the signal elements read by the matrix reader.

控制器用于控制移位器将移位后的信号元素写入矩阵缓存器中的多个内存bank；The controller is used for controlling the shifter to write the shifted signal elements into a plurality of memory banks in the matrix buffer;

控制器还用于基于子矩阵列数，控制转置器从矩阵缓存器的多个内存bank中并行读取子矩阵的信号元素。The controller is also used to control the transposer to read in parallel from multiple memory banks of the matrix buffer based on the number of sub-matrix columns. The signal element of the submatrix.

控制器用于控制转置器从矩阵缓存器中读取子矩阵，以及控制转置器对读取到的子矩阵执行转置处理，得到转置子矩阵，其中，子矩阵包含信号矩阵中的至少一列信号元素；The controller is used to control the transposer to read a sub-matrix from the matrix buffer, and to control the transposer to perform a transposition process on the read sub-matrix to obtain a transposed sub-matrix, wherein the sub-matrix includes at least one column of signal elements in the signal matrix;

控制器还用于基于信号矩阵的矩阵尺寸，配置转置器中的多路选择器的数据输出端；The controller is further used to configure the data output terminals of the multiplexer in the transposer based on the matrix size of the signal matrix;

转置器用于通过经过配置的多路选择器对子矩阵的信号元素执行选通输出，得到转置子矩阵。The transposer is used to perform gating output on the signal elements of the sub-matrix through the configured multiplexer to obtain a transposed sub-matrix.

其中，下级存储器用于存储信号矩阵的转置结果，和/或，向执行神经网络的训练和推理过程的处理器(或处理装置)提供信号矩阵的转置结果。The lower-level memory is used to store the transposed result of the signal matrix, and/or to provide the transposed result of the signal matrix to a processor (or processing device) that performs the training and reasoning process of the neural network.

在本申请的再一种应用场景中，矩阵转置装置用于执行图网络的关联性分析，比如通过矩阵转置实现计算网络中的节点之间的关系源，或者确定网络中的节点之间的关系模式；矩阵转置装置称为信息转置装置；该装置包括：矩阵读取器、移位器、矩阵缓存器、转置器、矩阵写入器以及控制器；In another application scenario of the present application, a matrix transposition device is used to perform correlation analysis of a graph network, such as realizing the relationship source between nodes in a calculation network or determining the relationship pattern between nodes in a network through matrix transposition; the matrix transposition device is called an information transposition device; the device includes: a matrix reader, a shifter, a matrix buffer, a transposer, a matrix writer, and a controller;

控制器用于控制矩阵读取器从上级存储器中读取待转置的图网络信息矩阵的信息元素，其中，矩阵读取器每次读取图网络信息矩阵中的至少一行信息元素；The controller is used to control the matrix reader to read the information elements of the graph network information matrix to be transposed from the upper-level memory, wherein the matrix reader reads at least one row of information elements in the graph network information matrix each time;

控制器用于控制移位器对矩阵读取器读取到的信息元素执行数据移位，以及控制移位器将移位后的信息元素写入矩阵缓存器；The controller is used to control the shifter to perform data shift on the information elements read by the matrix reader, and to control the shifter to write the shifted information elements into the matrix buffer;

可选的，控制器还用于基于上级存储器的读接口位宽，以及图网络信息矩阵的矩阵参数，确定子矩阵的子矩阵列数；Optionally, the controller is further used to determine the number of sub-matrix columns of the sub-matrix based on the read interface bit width of the upper-level memory and the matrix parameters of the graph network information matrix;

基于子矩阵列数，确定矩阵读取器各次读取到的信息元素对应的移位步长；Based on the number of submatrix columns, determining the shift step length corresponding to the information elements read by the matrix reader each time;

基于移位步长，控制移位器对矩阵读取器读取到的信息元素执行数据右移。Based on the shift step length, the shifter is controlled to perform data right shift on the information elements read by the matrix reader.

控制器用于控制移位器将移位后的信息元素写入矩阵缓存器中的多个内存bank；The controller is used for controlling the shifter to write the shifted information elements into a plurality of memory banks in the matrix buffer;

控制器还用于基于子矩阵列数，控制转置器从矩阵缓存器的多个内存bank中并行读取子矩阵的信息元素。The controller is further used to control the transposer to read information elements of the submatrix in parallel from a plurality of memory banks of the matrix buffer based on the number of columns of the submatrix.

控制器用于控制转置器从矩阵缓存器中读取子矩阵，以及控制转置器对读取到的子矩阵执行转置处理，得到转置子矩阵，其中，子矩阵包含图网络信息矩阵中的至少一列信息元素；The controller is used to control the transposer to read a submatrix from the matrix buffer, and to control the transposer to perform a transposition process on the read submatrix to obtain a transposed submatrix, wherein the submatrix includes at least one column of information elements in the graph network information matrix;

控制器还用于基于图网络信息矩阵的矩阵尺寸，配置转置器中的多路选择器的数据输出端；The controller is also used to configure the data output terminal of the multiplexer in the transposer based on the matrix size of the graph network information matrix;

转置器用于通过经过配置的多路选择器对子矩阵的信息元素执行选通输出，得到转置子矩阵。The transposer is used to perform gating output on the information elements of the sub-matrix through the configured multiplexer to obtain a transposed sub-matrix.

其中，下级存储器用于存储图网络信息矩阵的转置结果，和/或，向执行神经网络的训练和推理过程的处理器(或处理装置)提供图网络信息矩阵的转置结果。Among them, the lower-level memory is used to store the transposed result of the graph network information matrix, and/or provide the transposed result of the graph network information matrix to the processor (or processing device) that executes the training and reasoning process of the neural network.

请参考图10，其示出了本申请一个示例性实施例提供的矩阵转置方法的流程图，该方法应用于矩阵转置装置，矩阵转置装置包括矩阵读取器、移位器、矩阵缓存器、转置器、矩阵写入器以及控制器，该方法包括：Please refer to FIG. 10 , which shows a flow chart of a matrix transposition method provided by an exemplary embodiment of the present application. The method is applied to a matrix transposition device, and the matrix transposition device includes a matrix reader, a shifter, a matrix buffer, a transposer, a matrix writer, and a controller. The method includes:

步骤1001，控制器控制矩阵读取器从上级存储器中读取待转置矩阵的矩阵元素，其中，矩阵读取器每次读取待转置矩阵中的至少一行矩阵元素。Step 1001: A controller controls a matrix reader to read matrix elements of a matrix to be transposed from a superior memory, wherein the matrix reader reads at least one row of matrix elements in the matrix to be transposed each time.

步骤1002，控制器控制移位器对矩阵读取器读取到的矩阵元素执行数据移位，以及控制移位器将移位后的矩阵元素写入矩阵缓存器。In step 1002 , the controller controls the shifter to perform data shift on the matrix elements read by the matrix reader, and controls the shifter to write the shifted matrix elements into the matrix buffer.

步骤1003，控制器控制转置器从矩阵缓存器中读取子矩阵，以及控制转置器对读取到的子矩阵执行转置处理，得到转置子矩阵，其中，子矩阵包含待转置矩阵中的至少一列矩阵元素。Step 1003: The controller controls the transposer to read the submatrix from the matrix buffer, and controls the transposer to read the submatrix from the matrix buffer. The submatrix of the matrix to be transposed is transposed to obtain a transposed submatrix, wherein the submatrix contains at least one column of matrix elements in the matrix to be transposed.

步骤1004，控制器控制矩阵写入器将转置器输出的转置子矩阵写入下级存储器。Step 1004: the controller controls the matrix writer to write the transposed sub-matrix output by the transposer into the lower-level memory.

在一些实施例中，控制器控制移位器对矩阵读取器读取到的矩阵元素执行数据移位，包括：In some embodiments, the controller controls the shifter to perform data shift on the matrix elements read by the matrix reader, including:

基于上级存储器的读接口位宽，以及待转置矩阵的矩阵参数，确定子矩阵的子矩阵列数；Determine the number of submatrix columns of the submatrix based on the read interface bit width of the upper memory and the matrix parameters of the matrix to be transposed;

基于子矩阵列数，确定矩阵读取器各次读取到的矩阵元素对应的移位步长；Based on the number of submatrix columns, determining the shift step length corresponding to the matrix elements read by the matrix reader each time;

基于移位步长，控制移位器对矩阵读取器读取到的矩阵元素执行数据右移。Based on the shift step size, the shifter is controlled to perform data right shift on the matrix elements read by the matrix reader.

在一些实施例中，基于所述上级存储器的读接口位宽，以及待转置矩阵的矩阵参数，确定子矩阵的子矩阵列数，包括：In some embodiments, determining the number of submatrix columns of the submatrix based on the read interface bit width of the upper-level memory and the matrix parameters of the matrix to be transposed includes:

在读接口位宽等于待转置矩阵的行矩阵元素位宽的情况下，将待转置矩阵的列行比向上取整结果确定为子矩阵列数，行矩阵元素位宽基于待转置矩阵的列数以及各个矩阵元素的比特数确定得到；When the read interface bit width is equal to the row matrix element bit width of the matrix to be transposed, the column-row ratio of the matrix to be transposed is rounded up to determine the number of submatrix columns, and the row matrix element bit width is determined based on the number of columns of the matrix to be transposed and the number of bits of each matrix element;

或，在读接口位宽大于待转置矩阵的行矩阵元素位宽的情况下，将读接口位宽与待转置矩阵的列矩阵元素位宽的比值的向上取整结果确定为所述子矩阵列数，列矩阵元素位宽基于待转置矩阵的行数以及各个矩阵元素的比特数确定得到。Or, when the read interface bit width is greater than the row matrix element bit width of the matrix to be transposed, the result of rounding up the ratio of the read interface bit width to the column matrix element bit width of the matrix to be transposed is determined as the number of columns of the submatrix, and the column matrix element bit width is determined based on the number of rows of the matrix to be transposed and the number of bits of each matrix element.

在一些实施例中，该方法还包括：In some embodiments, the method further comprises:

在读接口位宽等于所述待转置矩阵的行矩阵元素位宽，且列行比为大于1的非整数的情况下，控制器控制矩阵读取器在读取到的矩阵元素后执行填充；When the read interface bit width is equal to the row matrix element bit width of the matrix to be transposed, and the column-row ratio is a non-integer greater than 1, the controller controls the matrix reader to perform padding after the read matrix element;

所述控制器控制所述矩阵写入器对所述转置器输出的至少一个转置子矩阵执行填充去除，并将至少一个填充去除后的转置子矩阵中的矩阵元素写入所述下级存储器。The controller controls the matrix writer to perform padding removal on at least one transposed submatrix output by the transposer, and writes matrix elements in at least one transposed submatrix after padding removal into the lower-level memory.

在一些实施例中，控制器控制矩阵读取器在读取到的矩阵元素后执行填充，包括：In some embodiments, the controller controls the matrix reader to perform padding after the read matrix element, including:

基于待转置矩阵的行数和列数确定填充列数；Determine the number of padding columns based on the number of rows and columns of the matrix to be transposed;

基于填充列数控制矩阵读取器在读取到的矩阵元素后执行填充；Control the matrix reader to perform padding after the read matrix elements based on the number of padding columns;

所述控制器控制所述矩阵写入器对所述转置器输出的至少一个转置子矩阵执行填充去除，包括：The controller controls the matrix writer to perform padding removal on at least one transposed sub-matrix output by the transposer, including:

所述控制器基于所述填充列数控制所述矩阵写入器对所述转置器输出的所述至少一个转置子矩阵执行填充去除，其中，从所述至少一个转置子矩阵中需要去除的矩阵元素总和的行数与所述填充列数相同。The controller controls the matrix writer to perform padding removal on the at least one transposed submatrix output by the transposer based on the number of padding columns, wherein the number of rows of the sum of matrix elements to be removed from the at least one transposed submatrix is the same as the number of padding columns.

在一些实施例中，矩阵缓存器由若干内存bank构成；In some embodiments, the matrix buffer is composed of a number of memory banks;

控制器控制转置器从矩阵缓存器中读取子矩阵，包括：The controller controls the transposer to read the sub-matrix from the matrix buffer, including:

控制移位器将移位后的矩阵元素写入矩阵缓存器中的多个内存bank；Control the shifter to write the shifted matrix elements into multiple memory banks in the matrix buffer;

基于子矩阵列数，控制转置器从矩阵缓存器的多个内存bank中并行读取子矩阵的矩阵元素。Based on the number of sub-matrix columns, the transposer is controlled to read the matrix elements of the sub-matrix in parallel from multiple memory banks of the matrix buffer.

在一些实施例中，控制转置器从矩阵缓存器的多个内存bank中并行读取子矩阵的矩阵元素，包括：In some embodiments, controlling the transposer to read matrix elements of a sub-matrix in parallel from a plurality of memory banks of a matrix buffer comprises:

基于子矩阵列数确定各个内存bank的读取地址，内存bank的读取地址递增；Determine the read address of each memory bank based on the number of sub-matrix columns, and the read address of the memory bank is incremented;

基于读取地址控制转置器从矩阵缓存器的多个内存bank中并行读取子矩阵的矩阵元素。The transposer is controlled to read the matrix elements of the sub-matrix in parallel from the plurality of memory banks of the matrix buffer based on the read address.

在一些实施例中，控制器控制转置器从矩阵缓存器中读取子矩阵之后，该方法还包括：In some embodiments, after the controller controls the transposer to read the sub-matrix from the matrix buffer, the method further includes:

基于子矩阵列数，控制转置器对子矩阵的矩阵元素执行数据左移。Based on the number of columns of the submatrix, the control transposer performs a data left shift on the matrix elements of the submatrix.

在一些实施例中，转置器由多个多路选择器构成；In some embodiments, the transposer is comprised of a plurality of multiplexers;

该方法还包括：The method further includes:

控制器基于待转置矩阵的矩阵尺寸，配置转置器中的多路选择器的数据输出端； The controller configures the data output terminal of the multiplexer in the transposer based on the matrix size of the matrix to be transposed;

转置器通过经过配置的多路选择器对子矩阵的矩阵元素执行选通输出，得到转置子矩阵。The transposer performs gated output on matrix elements of the sub-matrix through a configured multiplexer to obtain a transposed sub-matrix.

在一些实施例中，转置器中多路选择器的数量与上级存储器的读接口位宽相匹配。In some embodiments, the number of multiplexers in the transposer matches the read interface bit width of the upper level memory.

矩阵转置装置执行矩阵转置的详细过程可以参考上述装置实施例，本实施例在此不作赘述。The detailed process of the matrix transposition device performing matrix transposition can refer to the above-mentioned device embodiment, and this embodiment will not be described in detail here.

在一些实施例中，本申请实施例中的矩阵转置装置可以集成在AI处理器中。In some embodiments, the matrix transposition device in the embodiments of the present application can be integrated into an AI processor.

请参考图11，其示出了本申请一个示例性实施例提供的计算机设备1100的结构框图。其中，该计算机设备1100可以是便携式移动终端，比如：智能手机、平板电脑、动态影像专家压缩标准音频层面3(Moving Picture Experts Group Audio Layer III，MP3)播放器、动态影像专家压缩标准音频层面4(Moving Picture Experts Group Audio Layer IV，MP4)播放器。计算机设备1100还可能被称为用户设备、便携式终端、工作站、服务器等其他名称。Please refer to FIG. 11 , which shows a block diagram of a computer device 1100 provided by an exemplary embodiment of the present application. The computer device 1100 may be a portable mobile terminal, such as a smart phone, a tablet computer, a Moving Picture Experts Group Audio Layer III (MP3) player, or a Moving Picture Experts Group Audio Layer IV (MP4) player. The computer device 1100 may also be called a user device, a portable terminal, a workstation, a server, or other names.

通常，计算机设备1100包括有：处理器1101和存储器1102。Typically, the computer device 1100 includes a processor 1101 and a memory 1102 .

处理器1101可以包括一个或多个处理核心，比如4核心处理器、8核心处理器等。处理器1101可以采用数字信号处理(Digital Signal Processing，DSP)、现场可编程门阵列(Field－Programmable Gate Array，FPGA)、可编程逻辑阵列(Programmable Logic Array，PLA)中的至少一种硬件形式来实现。处理器1101也可以包括主处理器和协处理器，主处理器是用于对在唤醒状态下的数据执行处理的处理器，也称中央处理器(Central Processing Unit，CPU)；协处理器是用于对在待机状态下的数据执行处理的低功耗处理器。在一些实施例中，处理器1101可以在集成有图像处理器(Graphics Processing Unit，GPU)，GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中，处理器1101还可以包括人工智能(Artificial Intelligence，AI)处理器，该AI处理器用于处理有关机器学习的计算操作。The processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The processor 1101 may be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA). The processor 1101 may also include a main processor and a coprocessor. The main processor is a processor for performing processing on data in the awake state, also known as a central processing unit (CPU); the coprocessor is a low-power processor for performing processing on data in the standby state. In some embodiments, the processor 1101 may be integrated with a graphics processing unit (GPU), and the GPU is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1101 may also include an artificial intelligence (AI) processor, which is used to process computing operations related to machine learning.

在一些实施例中，处理器1101可以集成有上述实施例提供的矩阵转置装置。当处理器1101存在矩阵转置需求时，即可通过该矩阵转置装置执行转置运算。In some embodiments, the processor 1101 may be integrated with the matrix transposition device provided in the above embodiments. When the processor 1101 has a matrix transposition requirement, the matrix transposition device may be used to perform the transposition operation.

存储器1102可以包括一个或多个计算机可读存储介质，该计算机可读存储介质可以是有形的和非暂态的。存储器1102还可包括高速随机存取存储器，以及非易失性存储器，比如一个或多个磁盘存储设备、闪存存储设备。The memory 1102 may include one or more computer-readable storage media, which may be tangible and non-transitory. The memory 1102 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices, flash memory storage devices.

在一些实施例中，计算机设备1100还可选包括有外围设备接口1103和至少一个外围设备。In some embodiments, the computer device 1100 may also optionally include a peripheral device interface 1103 and at least one peripheral device.

本领域技术人员可以理解，图11中示出的结构并不构成对计算机设备1100的限定，可以包括比图示更多或更少的组件，或者组合某些组件，或者采用不同的组件布置。Those skilled in the art will appreciate that the structure shown in FIG. 11 does not limit the computer device 1100 , and may include more or fewer components than shown in the figure, or combine certain components, or adopt a different component arrangement.

本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成，也可以通过程序来指令相关的硬件完成，所述的程序可以存储于一种计算机可读存储介质中，上述提到的存储介质可以是只读存储器，磁盘或光盘等。A person skilled in the art will understand that all or part of the steps to implement the above embodiments may be accomplished by hardware or by instructing related hardware through a program, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a disk or an optical disk, etc.

以上所述仅为本申请的可选的实施例，并不用以限制本申请，凡在本申请的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本申请的保护范围之内。 The above description is only an optional embodiment of the present application and is not intended to limit the present application. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present application shall be included in the protection scope of the present application.

Claims

A matrix transposition device, characterized in that the device comprises:

Matrix readers, shifters, matrix buffers, transposers, matrix writers, and controllers;

The controller is used to control the matrix reader to read the matrix elements of the matrix to be transposed from the upper-level memory, wherein the matrix reader reads at least one row of matrix elements in the matrix to be transposed each time;

The controller is used to control the shifter to perform data shift on the matrix elements read by the matrix reader, and control the shifter to write the shifted matrix elements into the matrix buffer;

The controller is used to control the transposer to read a submatrix from the matrix buffer, and control the transposer to perform a transposition process on the read submatrix to obtain a transposed submatrix, wherein the submatrix includes at least one column of matrix elements in the matrix to be transposed;

The controller is used to control the matrix writer to write the transposed sub-matrix output by the transposer into a lower-level memory.

The device according to claim 1, characterized in that the controller is used to:

Determine the number of submatrix columns of the submatrix based on the read interface bit width of the upper-level memory and the matrix parameters of the matrix to be transposed;

Determining the shift step length corresponding to the matrix elements read by the matrix reader each time based on the number of columns of the submatrix;

Based on the shift step length, the shifter is controlled to perform data right shift on the matrix elements read by the matrix reader.

The device according to claim 2, characterized in that the controller is used to:

In a case where the read interface bit width is equal to the row matrix element bit width of the matrix to be transposed, a result of rounding up the column-row ratio of the matrix to be transposed is determined as the number of columns of the submatrix, and the row matrix element bit width is determined based on the number of columns of the matrix to be transposed and the number of bits of each matrix element;

Or, when the read interface bit width is greater than the row matrix element bit width of the matrix to be transposed, the result of rounding up the ratio of the read interface bit width to the column matrix element bit width of the matrix to be transposed is determined as the number of columns of the submatrix, and the column matrix element bit width is determined based on the number of rows of the matrix to be transposed and the number of bits of each matrix element.

The device according to claim 3, characterized in that the controller is further used for:

When the read interface bit width is equal to the row matrix element bit width of the matrix to be transposed, and the column-to-row ratio is a non-integer greater than 1, controlling the matrix reader to perform padding after the read matrix element;

The matrix writer is controlled to perform padding removal on at least one transposed submatrix output by the transposer, and matrix elements in at least one transposed submatrix after padding removal are written into the lower-level memory.

The device according to claim 4, characterized in that the controller is used to:

Determine the number of filled columns based on the number of rows and columns of the matrix to be transposed;

Controlling the matrix reader to perform padding after the read matrix elements based on the number of padding columns;

The controller is further used for:

The matrix writer is controlled to perform padding removal on the at least one transposed submatrix output by the transposer based on the padding column number, wherein the sum of the number of rows of matrix elements to be removed from the at least one transposed submatrix is the same as the padding column number.

The device according to any one of claims 2 to 5, characterized in that the matrix buffer is composed of a plurality of memory banks;

The controller is used to control the shifter to write the shifted matrix elements into a plurality of memory banks in the matrix buffer;

The controller is further configured to control the transposer to read matrix elements of the submatrix in parallel from a plurality of memory banks of the matrix buffer based on the number of columns of the submatrix.

The device according to claim 6, characterized in that the controller is used to:

Determining a read address of each memory bank based on the number of columns of the sub-matrix, the read address of the memory bank being incremented;

The transposer is controlled based on the read address to read matrix elements of the sub-matrix in parallel from a plurality of memory banks of the matrix buffer.

The device according to any one of claims 2 to 7, characterized in that the controller is further used for:

Based on the number of columns of the sub-matrix, the transposer is controlled to perform data left shift on the matrix elements of the sub-matrix.

The device according to any one of claims 1 to 8, characterized in that the transposer is composed of a plurality of multiplexers;

The controller is further used to configure the data output end of the multiplexer in the transposer based on the matrix size of the matrix to be transposed;

The transposer is used to perform gating output on the matrix elements of the sub-matrix through the configured multiplexer to obtain the transposed sub-matrix.

The device according to claim 9 is characterized in that the number of multiplexers in the transposer matches the read interface bit width of the upper-level memory.

A matrix transposition method, characterized in that the method is applied to a matrix transposition device, the matrix transposition device includes a matrix reader, a shifter, a matrix buffer, a transposer, a matrix writer and a controller, and the method includes:

The controller controls the matrix reader to read matrix elements of the matrix to be transposed from the upper-level memory, wherein the matrix reader reads at least one row of matrix elements in the matrix to be transposed each time;

The controller controls the shifter to perform data shift on the matrix elements read by the matrix reader, and controls the shifter to write the shifted matrix elements into the matrix buffer;

The controller controls the transposer to read a submatrix from the matrix buffer, and controls the transposer to perform a transposition process on the read submatrix to obtain a transposed submatrix, wherein the submatrix includes at least one column of matrix elements in the matrix to be transposed;

The controller controls the matrix writer to write the transposed submatrix output by the transposer into a lower-level memory.

The method according to claim 11, characterized in that the controller controls the shifter to perform data shift on the matrix elements read by the matrix reader, comprising:

The controller determines the number of submatrix columns of the submatrix based on the read interface bit width of the upper-level memory and the matrix parameters of the matrix to be transposed;

The controller determines the shift step length corresponding to the matrix elements read by the matrix reader each time based on the number of columns of the submatrix;

The controller controls the shifter to perform data right shift on the matrix elements read by the matrix reader based on the shift step size.

The method according to claim 12, characterized in that the controller determines the number of submatrix columns of the submatrix based on the read interface bit width of the upper-level memory and the matrix parameters of the matrix to be transposed, comprising:

In a case where the read interface bit width is equal to the row matrix element bit width of the matrix to be transposed, the controller determines a result of rounding up the column-row ratio of the matrix to be transposed as the number of columns of the submatrix, and the row matrix element bit width is determined based on the number of columns of the matrix to be transposed and the number of bits of each matrix element;

Or, when the read interface bit width is greater than the row matrix element bit width of the matrix to be transposed, the controller determines the number of columns of the submatrix by rounding up the ratio of the read interface bit width to the column matrix element bit width of the matrix to be transposed, and the column matrix element bit width is determined based on the number of rows of the matrix to be transposed and the number of bits of each matrix element.

The method according to claim 13, characterized in that the method further comprises:

In the case where the read interface bit width is equal to the row matrix element bit width of the matrix to be transposed, and the column-to-row ratio is a non-integer greater than 1, the controller controls the matrix reader to perform padding after the read matrix element;

The controller controls the matrix writer to perform padding removal on at least one transposed submatrix output by the transposer, and writes matrix elements in at least one transposed submatrix after padding removal into the lower-level memory.

The method according to claim 14, characterized in that the controller controls the matrix reader to perform padding after the read matrix elements, comprising:

The controller determines the number of filling columns based on the number of rows and the number of columns of the matrix to be transposed;

The controller controls the matrix reader to perform padding after the read matrix element based on the number of padding columns;

The controller controls the matrix writer to perform padding removal on at least one transposed sub-matrix output by the transposer, including:

The controller controls the matrix writer to perform padding removal on the at least one transposed submatrix output by the transposer based on the number of padding columns, wherein the number of rows of the sum of matrix elements to be removed from the at least one transposed submatrix is the same as the number of padding columns.

The method according to any one of claims 12 to 15, characterized in that the method further comprises:

Based on the number of columns of the sub-matrix, the controller controls the transposer to perform data left shift on matrix elements of the sub-matrix.

The method according to any one of claims 11 to 16, characterized in that the transposer is composed of a plurality of multiplexers;

The method further comprises:

The controller configures the data output terminal of the multiplexer in the transposer based on the matrix size of the matrix to be transposed;

The transposer performs gating output on the matrix elements of the sub-matrix through the configured multiplexer to obtain the transposed sub-matrix.

An AI processor, characterized in that the processor comprises the matrix transposition device according to any one of claims 1 to 10.

A computer device, characterized in that the computer device comprises the AI processor as claimed in claim 18 and a memory, and the AI processor is connected to the memory via a bus.