CN117574036A

CN117574036A - Computing device, operating method and machine-readable storage medium

Info

Publication number: CN117574036A
Application number: CN202410056538.8A
Authority: CN
Inventors: Name withheld at the inventor's request
Original assignee: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd
Current assignee: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd
Priority date: 2024-01-16
Filing date: 2024-01-16
Publication date: 2024-02-20
Anticipated expiration: 2044-01-16
Also published as: CN117574036B

Abstract

The invention provides a computing device, an operating method, and a machine-readable storage medium. The computing device multiplies an input matrix by a weight matrix and comprises an input buffer, a ring buffer, a control circuit, and an arithmetic circuit. All elements of the input matrix are stored at a plurality of consecutive addresses of the input buffer. The control circuit moves a plurality of elements of the input matrix from a run of consecutive addresses of the input buffer to the ring buffer. The control circuit defines boundary positions in the ring buffer based on the dimensions of the input matrix, and defines a padding range at each boundary position based on the current weight element of the weight matrix. The control circuit changes the element values within the padding range of the ring buffer, thereby performing a padding operation on the input matrix. The arithmetic circuit then performs the multiplication using the elements in the ring buffer.

Description

Computing device, operating method and machine-readable storage medium

Technical field

The invention relates to a computing device, an operating method, and a machine-readable storage medium.

Background

A computing device can multiply a matrix A (the multiplicand) by another matrix B (the multiplier). In many applications of matrix multiplication, matrix A undergoes a padding operation before the multiplication in order to adjust the dimensions of the result matrix C (the product). Padding means adding one or more columns of padding elements (redundant elements) to the left and right sides of matrix A, and/or adding one or more rows of padding elements to its top and bottom sides. All padding elements share the same padding value, which is independent of the original matrix A. The padded matrix A therefore carries many rows and/or columns of padding elements, and these padding elements occupy space in the input buffer of the computing device.

Summary of the invention

The invention provides a computing device, an operating method, and a machine-readable storage medium that reduce the input-buffer space occupied by padding elements in applications that pad an input matrix.

In an embodiment according to the invention, the computing device performs multiplication of an input matrix and a weight matrix. The computing device includes an input buffer, a ring buffer, a control circuit, and an arithmetic circuit. The input buffer temporarily stores the input matrix, with all elements of the input matrix stored at a plurality of consecutive addresses of the input buffer. The control circuit is coupled to the input buffer and the ring buffer. The control circuit moves a plurality of elements of the input matrix from a run of consecutive addresses of the input buffer to the ring buffer. The control circuit defines at least one boundary position in the ring buffer based on the dimensions of the input matrix. The control circuit defines at least one padding range at the at least one boundary position based on the current weight element of the weight matrix. The control circuit changes the element values of those of the plurality of elements that lie within the at least one padding range of the ring buffer, thereby performing a padding operation on the input matrix. The arithmetic circuit is coupled to the ring buffer to read the plurality of elements after the padding operation, and uses them to perform the multiplication.

In an embodiment according to the invention, the operating method of the computing device includes: storing the input matrix at a plurality of consecutive addresses of an input buffer of the computing device; moving, by a control circuit of the computing device, a plurality of elements of the input matrix from a run of consecutive addresses of the input buffer to a ring buffer of the computing device; defining, by the control circuit, at least one boundary position in the ring buffer based on the dimensions of the input matrix; defining, by the control circuit, at least one padding range at the at least one boundary position based on the current weight element of the weight matrix; changing, by the control circuit, the element values of those of the plurality of elements that lie within the at least one padding range of the ring buffer, thereby performing a padding operation on the input matrix; reading, by an arithmetic circuit of the computing device, the plurality of elements after the padding operation from the ring buffer; and performing, by the arithmetic circuit, the multiplication using the plurality of elements after the padding operation.

In an embodiment according to the invention, the machine-readable storage medium stores non-transitory machine-readable instructions. When the non-transitory machine-readable instructions are executed by a computer, the operating method is carried out.

Based on the above, the computing device according to the embodiments of the invention can store all elements of the original input matrix (a matrix containing no padding elements) at a plurality of consecutive addresses of its input buffer. In each iteration of the matrix multiplication, the corresponding elements of the input matrix (a part of the input matrix) are moved from a run of consecutive addresses of the input buffer to the ring buffer, and the control circuit then selectively changes the values of elements in the ring buffer according to the current iteration. The changed elements can be regarded as padding elements, which is equivalent to padding the input matrix. The computing device can therefore reduce the input-buffer space occupied by padding elements in applications that pad an input matrix.

Brief description of the drawings

FIG. 1 is a circuit block diagram of a computing apparatus according to an embodiment of the invention.

FIG. 2 is a schematic diagram, according to an embodiment, of a scenario in which the complete matrix multiplication is performed only after the input matrix has been padded.

FIG. 3 is a schematic diagram, according to an embodiment, of a scenario in which elements of the padded input matrix are moved to the ring buffer in different iterations of the matrix multiplication.

FIG. 4 is a flowchart of an operating method of a computing device according to an embodiment of the invention.

FIG. 5 is a schematic diagram, according to an embodiment of the invention, of the process of moving matrix elements from the input buffer to the ring buffer in different iterations of the matrix multiplication and then selectively changing some of the elements in the ring buffer.

Description of reference signs

100: computing apparatus

110: storage unit

120: computing device

130: next-stage circuit

C1: multiplication result matrix

CC1: arithmetic circuit

CONT1: control circuit

IB1: input buffer

IN1: input matrix

IN1': padded input matrix

RB1: ring buffer

t31, t32, t33, t34, t35, t50, t51, t52, t53, t54, t55, t56, t57, t58, t59: times

W: weight matrix

S410, S420, S430, S440, S450, S460, S470: steps

Detailed description

Reference will now be made in detail to exemplary embodiments of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numerals are used in the drawings and the description to refer to the same or similar parts.

The term "coupled (or connected)" as used throughout this specification (including the claims) may refer to any direct or indirect means of connection. For example, if the text describes a first device as coupled (or connected) to a second device, this should be interpreted to mean that the first device may be connected to the second device directly, or indirectly through other devices or some other means of connection. Terms such as "first" and "second" used throughout this specification (including the claims) serve to name components or to distinguish different embodiments or scopes; they do not set an upper or lower limit on the number of components, nor do they restrict the order of components. In addition, wherever possible, components/members/steps bearing the same reference numerals in the drawings and embodiments denote the same or similar parts. Descriptions of components/members/steps that bear the same reference numerals or use the same terms in different embodiments may be cross-referenced.

FIG. 1 is a circuit block diagram of a computing apparatus 100 according to an embodiment of the invention. The computing apparatus 100 shown in FIG. 1 includes a storage unit 110, a computing device 120, and a next-stage circuit 130. Depending on the actual design and application, in some embodiments the storage unit 110 may include any type of memory, such as high bandwidth memory (HBM) or other dynamic random-access memory (DRAM). In other embodiments, the storage unit 110 may include any storage device, such as a solid-state drive (SSD) or another storage device. The computing device 120 can fetch all elements of the input matrix IN1 from the external storage unit 110. The input matrix IN1 is the original, unpadded input matrix (a matrix containing no padding elements). The computing device 120 can multiply the input matrix IN1 by the weight matrix W. In each iteration of the matrix multiplication, the computing device 120 can pad only the part of the input matrix used in that iteration (the elements corresponding to the current iteration). In applications that pad the input matrix IN1, the computing device 120 can therefore reduce the space that padding elements occupy in its input buffer IB1. After multiplying the input matrix IN1 by the weight matrix W, the computing device 120 can provide the result of the matrix multiplication to the next-stage circuit 130. Depending on the actual design and application, the next-stage circuit 130 is, for example, a memory or another computing device. In some applications, the computing device 120 may be a tensor core, and the other computing device may be a vector core or another arithmetic circuit.

In the embodiment shown in FIG. 1, the computing device 120 includes an input buffer IB1, a ring buffer RB1, a control circuit CONT1, and an arithmetic circuit CC1. Depending on the actual design, the arithmetic circuit CC1 includes, for example, a general matrix multiply (GEMM) circuit or another arithmetic circuit. The input buffer IB1 includes, for example, a GEMM input buffer (GIB) or another buffer. The input buffer IB1 temporarily stores the input matrix IN1. The control circuit CONT1 is coupled to the input buffer IB1, the ring buffer RB1, and the arithmetic circuit CC1. The arithmetic circuit CC1 is coupled to the ring buffer RB1. Depending on the design, in some embodiments the computing device 120, the arithmetic circuit CC1, and/or the control circuit CONT1 may be implemented as hardware circuits. In other embodiments, they may be implemented as firmware, software (that is, programs), or a combination of the two. In still other embodiments, they may be implemented as a combination of hardware, firmware, and software.

In hardware form, the computing device 120, the arithmetic circuit CC1, and/or the control circuit CONT1 may be implemented as logic circuits on an integrated circuit. For example, their functions may be implemented in the various logic blocks, modules, and circuits of one or more hardware controllers, microcontrollers, hardware processors, microprocessors, application-specific integrated circuits (ASICs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), central processing units (CPUs), and/or other processing units. These functions may be realized as hardware circuits, such as the logic blocks, modules, and circuits of an integrated circuit, by using a hardware description language (for example, Verilog HDL or VHDL) or another suitable programming language.

In software and/or firmware form, the functions of the computing device 120, the arithmetic circuit CC1, and/or the control circuit CONT1 may be implemented as programming code, for example by using a general-purpose programming language (such as C, C++, or assembly language) or another suitable programming language. The programming code may be recorded or stored on a non-transitory machine-readable storage medium. In some embodiments, the non-transitory machine-readable storage medium includes, for example, semiconductor memory and/or a storage device. The semiconductor memory includes a memory card, read-only memory (ROM), flash memory, a programmable logic circuit, or other semiconductor memory. The storage device includes a tape, a disk, a hard disk drive (HDD), a solid-state drive (SSD), or another storage device. An electronic device (such as a computer, CPU, hardware controller, microcontroller, hardware processor, or microprocessor) can read the programming code from the non-transitory machine-readable storage medium and execute it, thereby realizing the functions of the computing device 120, the arithmetic circuit CC1, and/or the control circuit CONT1.

FIG. 2 is a schematic diagram, according to an embodiment, of a scenario in which the complete matrix multiplication is performed only after the input matrix IN1 has been padded. Referring to FIG. 1 and FIG. 2, the computing device 120 can read the input matrix IN1 from the storage unit 110. The actual dimensions of the input matrix IN1 depend on the actual design. In the scenario shown in FIG. 2, the input matrix IN1 is assumed to be a 4*10 matrix, where A00, A01, ..., A39 denote the elements at different positions of the input matrix IN1. The computing device 120 can pad the input matrix IN1 and then store the padded input matrix IN1' in the input buffer IB1. Padding means adding one or more columns of padding elements (redundant elements) to the left and right sides of the input matrix IN1, and/or adding one or more rows of padding elements to its top and bottom sides.

The computing device 120 can perform the padding operation first and then multiply the input matrix IN1 by the weight matrix W. In the scenario shown in FIG. 2, the weight matrix W is assumed to be a 1*5 matrix, where W00, W01, ..., W04 denote the elements at different positions of the weight matrix W. Given the dimensions of the weight matrix W, for the result matrix C1 to have the same dimensions as the input matrix IN1, the number of columns to pad on each of the left and right sides of the input matrix IN1 is (w_column_number - 1)/2 = (5 - 1)/2 = 2, and the number of rows to pad on each of the top and bottom sides is (w_row_number - 1)/2 = (1 - 1)/2 = 0, where w_column_number is the number of columns of the weight matrix W and w_row_number is its number of rows. The computing device 120 therefore pads two columns of padding elements (all with the same padding value P0) on each of the left and right sides of the input matrix IN1, forming the padded input matrix IN1' in the input buffer IB1. Depending on the actual design, the padding value P0 may be 0 or another value. Only after the input matrix IN1 has been padded does the computing device 120 multiply the input matrix IN1' by the weight matrix W to obtain the result matrix C1 (where C00, C01, ..., C39 denote the elements at different positions of the result matrix C1). After completing the matrix multiplication, the computing device 120 can provide the result matrix C1 to the next-stage circuit 130.
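The padding arithmetic of the FIG. 2 scenario can be sketched in a few lines of Python. This is only an illustration: the helper names `pad_counts` and `pad_columns`, the integer encoding of the elements, and the choice P0 = 0 are assumptions made for the example, and only column padding is shown because the row padding count is 0 here.

```python
def pad_counts(w_rows, w_cols):
    """Rows/columns to pad per side, using (n - 1) / 2 from the text.

    Integer division is used, which assumes odd weight dimensions.
    """
    return (w_rows - 1) // 2, (w_cols - 1) // 2


def pad_columns(matrix, pad_cols, pad_value=0):
    """Return a copy of `matrix` with `pad_cols` columns of `pad_value` per side."""
    side = [pad_value] * pad_cols
    return [side + list(row) + side for row in matrix]


# Illustrative 4*10 input matrix: element A[r][c] is encoded as 10*r + c + 1.
IN1 = [[10 * r + c + 1 for c in range(10)] for r in range(4)]

pad_rows, pad_cols = pad_counts(1, 5)      # 1*5 weight matrix
IN1_padded = pad_columns(IN1, pad_cols)    # 4*14 matrix IN1'

print(pad_rows, pad_cols)                  # 0 2
print(len(IN1_padded), len(IN1_padded[0])) # 4 14
```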

FIG. 3 is a schematic diagram, according to an embodiment, of a scenario in which elements of the padded input matrix IN1' are moved to the ring buffer RB1 in different iterations of the matrix multiplication. Referring to FIG. 1 and FIG. 3, the matrix multiplication has five iterations, as determined by the dimensions of the weight matrix W. At time t31 (the first iteration), columns 1 to 10 of the input matrix IN1' are moved from the input buffer IB1 to the ring buffer RB1, and the arithmetic circuit CC1 can then perform the first iteration using the elements in the ring buffer RB1 and element W00 of the weight matrix W. At time t32 (the second iteration), columns 2 to 11 of the input matrix IN1' are moved from the input buffer IB1 to the ring buffer RB1, and the arithmetic circuit CC1 can then perform the second iteration using the elements in the ring buffer RB1 and element W01. At time t33 (the third iteration), columns 3 to 12 of the input matrix IN1' are moved from the input buffer IB1 to the ring buffer RB1, and the arithmetic circuit CC1 can then perform the third iteration using the elements in the ring buffer RB1 and element W02. At time t34 (the fourth iteration), columns 4 to 13 of the input matrix IN1' are moved from the input buffer IB1 to the ring buffer RB1, and the arithmetic circuit CC1 can then perform the fourth iteration using the elements in the ring buffer RB1 and element W03. At time t35 (the fifth iteration), columns 5 to 14 of the input matrix IN1' are moved from the input buffer IB1 to the ring buffer RB1, and the arithmetic circuit CC1 can then perform the fifth iteration using the elements in the ring buffer RB1 and element W04.
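The five iterations of FIG. 3 can be sketched as follows. The text does not fix the algorithm used by the arithmetic circuit CC1, so this sketch assumes that each iteration multiplies a 10-column window of IN1' by one weight element and accumulates the product into C1. The function name `multiply_by_iterations`, the integer element encoding, the weight values, and P0 = 0 are all illustrative assumptions.

```python
def multiply_by_iterations(padded, weights, out_cols):
    """Accumulate one shifted window of `padded` per weight element."""
    result = [[0] * out_cols for _ in padded]
    for k, w in enumerate(weights):               # iteration k uses weight W0k
        for r, row in enumerate(padded):
            window = row[k:k + out_cols]          # columns k+1 .. k+out_cols (1-based)
            for c, value in enumerate(window):
                result[r][c] += w * value
    return result


# Illustrative data: A[r][c] = 10*r + c + 1, padded with two zero columns per side.
IN1 = [[10 * r + c + 1 for c in range(10)] for r in range(4)]
IN1_padded = [[0, 0] + row + [0, 0] for row in IN1]
W = [1, 2, 3, 4, 5]                               # hypothetical weight values

C1 = multiply_by_iterations(IN1_padded, W, out_cols=10)
print(C1[0][0])   # 26 = 1*0 + 2*0 + 3*1 + 4*2 + 5*3
```

The result C1 has the same 4*10 shape as IN1, which is the purpose of padding two columns on each side.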

The arithmetic circuit CC1 can perform a matrix multiplication that has multiple iterations. After multiplying the input matrix IN1' by the weight matrix W, the arithmetic circuit CC1 can provide the result matrix C1 to the next-stage circuit 130. This embodiment does not limit the specific algorithm that the arithmetic circuit CC1 uses for the matrix multiplication. Depending on the actual design, the arithmetic circuit CC1 may use a well-known matrix multiplication algorithm or another algorithm. As FIG. 2 and FIG. 3 show, all padding elements of the input matrix IN1' have the same padding value P0, which is independent of the original input matrix IN1. The padded input matrix IN1' thus contains many columns of padding elements, and these padding elements occupy space in the input buffer IB1.

FIG. 4 is a flowchart of an operating method of the computing device 120 according to an embodiment of the invention. In some embodiments, the operating method shown in FIG. 4 may be implemented in firmware or software (that is, a program). For example, the operations of the method shown in FIG. 4 may be implemented as non-transitory machine-readable instructions (programming code or a program), which may be stored on a machine-readable storage medium. The operating method shown in FIG. 4 is carried out when the non-transitory machine-readable instructions are executed by a computer. In other embodiments, the operating method shown in FIG. 4 may be implemented in hardware, for example in the computing device 120 shown in FIG. 1.

FIG. 5 is a schematic diagram, according to an embodiment of the invention, of the process of moving matrix elements from the input buffer IB1 to the ring buffer RB1 in different iterations of the matrix multiplication and then selectively changing some of the elements in the ring buffer RB1. Referring to FIG. 1, FIG. 4, and FIG. 5, the input buffer IB1 can obtain all elements of the input matrix IN1 from the storage unit 110 outside the computing device 120. In step S410, the input buffer IB1 stores all elements of the original, not yet padded input matrix IN1 at a plurality of consecutive addresses of the input buffer IB1. In the embodiment shown in FIG. 5, two padding elements (with the same padding value P0) are placed in front of the first element A00 of the input matrix IN1. No padding elements of the padding operation are present among the elements A00 to A39 of the input matrix IN1 stored at these consecutive addresses of the input buffer IB1. Compared with the embodiment shown in FIG. 2 and FIG. 3, the embodiment shown in FIG. 4 and FIG. 5 effectively reduces the number of padding elements (padding value P0) in the input buffer IB1. Because large numbers of padding elements no longer occupy space in the input buffer IB1, its space utilization improves.
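As a rough illustration of the saving, the following sketch counts input-buffer entries for the two layouts in the 4*10 example. The function names are hypothetical, and the count for the FIG. 5 layout assumes that only the two leading padding elements mentioned in the text are stored besides the 40 matrix elements.

```python
def padded_footprint(rows, cols, pad_rows, pad_cols):
    """Entries needed to store the fully padded matrix IN1' (FIG. 2 layout)."""
    return (rows + 2 * pad_rows) * (cols + 2 * pad_cols)


def flat_footprint(rows, cols, lead_pad):
    """Entries needed for the unpadded matrix plus leading padding (FIG. 5 layout)."""
    return lead_pad + rows * cols


full = padded_footprint(4, 10, pad_rows=0, pad_cols=2)
flat = flat_footprint(4, 10, lead_pad=2)
print(full, flat)   # 56 42
```

In the padded layout the number of padding entries grows with the number of rows (4 entries per row here), whereas in the flat layout it stays at 2 regardless of the row count.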

In step S420, the control circuit CONT1 moves a plurality of elements of the input matrix IN1 from a run of consecutive addresses of the input buffer IB1 to the ring buffer RB1. For example, the middle part of FIG. 5 shows the contents of the ring buffer RB1 at different times (different iterations). As shown in FIG. 5, the matrix multiplication has five iterations (as determined by the dimensions of the weight matrix W). At time t50 (the first iteration, corresponding to element W00 of the weight matrix W), the 1st to 40th consecutive elements of the input buffer IB1, "P0, P0, A00, A01, ..., A37", are moved from the input buffer IB1 to the ring buffer RB1. At time t52 (the second iteration, corresponding to element W01), the 2nd to 41st consecutive elements of the input buffer IB1, "P0, A00, A01, ..., A37, A38", are moved to the ring buffer RB1. At time t54 (the third iteration, corresponding to element W02), the 3rd to 42nd consecutive elements of the input buffer IB1, "A00, A01, ..., A37, A38, A39", are moved to the ring buffer RB1. At time t56 (the fourth iteration, corresponding to element W03), the 4th to 42nd consecutive elements of the input buffer IB1, "A01, ..., A37, A38, A39", are moved to the ring buffer RB1. At time t58 (the fifth iteration, corresponding to element W04), the 5th to 42nd consecutive elements of the input buffer IB1, "A02, ..., A37, A38, A39", are moved to the ring buffer RB1.
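The address ranges of step S420 can be reproduced with simple list slicing. This is a minimal sketch assuming the FIG. 5 layout of two leading P0 entries followed by A00 to A39; the names `IB1` and `window_for_iteration` and the string encoding of the elements are illustrative only.

```python
P0 = "P0"

# Flat input buffer: two leading padding entries, then A00 .. A39 row by row.
IN1_flat = [f"A{r}{c}" for r in range(4) for c in range(10)]
IB1 = [P0, P0] + IN1_flat                      # 42 entries in total


def window_for_iteration(ib, k, length=40):
    """Elements moved to the ring buffer in iteration k (0-based).

    The slice stops at the end of the buffer, so later iterations
    move fewer than `length` elements.
    """
    return ib[k:k + length]


windows = [window_for_iteration(IB1, k) for k in range(5)]
for k, win in enumerate(windows):
    print(k + 1, len(win), win[0], win[-1])
```

The fourth and fifth iterations move only 39 and 38 elements, because the window runs past the last stored element A39.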

In step S430, the control circuit CONT1 defines at least one boundary position in the ring buffer RB1 based on the dimensions of the input matrix IN1. In some embodiments, the control circuit CONT1 may define the boundary positions based on the number of columns of the input matrix IN1. For example (but not limited to this), the distance between two adjacent boundary positions is the number of columns of the input matrix IN1. Taking the input matrix IN1 shown in FIG. 2 as an example, the number of columns is 10, so adjacent boundary positions are 10 elements apart, and the control circuit CONT1 can define the at least one boundary position in the ring buffer RB1 at the dashed vertical lines shown in FIG. 5.

In step S440, the control circuit CONT1 defines a padding range at the boundary position based on the current weight element of the weight matrix W. In some embodiments, the control circuit CONT1 may calculate a padding length of the padding range based on the column position of the current weight element in the weight matrix W, where the padding range starts at the boundary position and extends for the padding length. For example (but not limited thereto), the control circuit CONT1 may evaluate Equation A below to obtain the padding length, where mask_offset denotes the padding length, w_id denotes the column position of the current weight element in the weight matrix W, and pad_number denotes the number of columns that the padding operation should add on one side of the input matrix IN1. The control circuit CONT1 may evaluate Equation B below to obtain pad_number, where w_column_number denotes the number of columns of the weight matrix W. In the Figure 5 example described later, a negative mask_offset selects elements to the right of a boundary position and a positive mask_offset selects elements to the left.

mask_offset = w_id - pad_number    (Equation A)

pad_number = (w_column_number - 1)/2    (Equation B)
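Equations A and B can be evaluated directly. The sketch below uses the symbol names from the equations as function and parameter names. Rejecting an even column count is an assumption added here so that the division in Equation B is exact; the text does not discuss that case.

```python
def pad_number(w_column_number):
    """Equation B: columns to pad on one side of the input matrix."""
    if w_column_number % 2 == 0:
        # Assumption: only odd weight-column counts are handled in this sketch.
        raise ValueError("this sketch assumes an odd number of weight columns")
    return (w_column_number - 1) // 2

def mask_offset(w_id, w_column_number):
    """Equation A: signed padding length for the weight column w_id (0-based)."""
    return w_id - pad_number(w_column_number)

# Padding lengths for a weight matrix with five columns (W00..W04).
print([mask_offset(w_id, 5) for w_id in range(5)])
```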

In step S450, the control circuit CONT1 changes the element values of the elements in the padding range of the ring buffer RB1 to implement the padding operation on the input matrix IN1. In step S460, the arithmetic circuit CC1 may read the padded elements from the ring buffer RB1. In step S470, the arithmetic circuit CC1 may use these padded elements to perform the matrix multiplication. Figure 5 is used as an illustrative example below.

As shown in Figure 5, the matrix multiplication has five iterations (based on the dimensions of the weight matrix W). The middle part of Figure 5 shows the contents of the ring buffer RB1 at different times (different iterations). At time t50 (the first iteration, corresponding to element W00 of the weight matrix W), the 1st to 40th consecutive elements "P0, P0, A00, A01, ..., A37" of the input buffer IB1 are moved from the input buffer IB1 to the ring buffer RB1. The current weight element W00 is in the first column of the weight matrix W (column position w_id = 0). The number of columns to be padded on one side of the input matrix IN1 is pad_number = (w_column_number - 1)/2 = (5 - 1)/2 = 2. Therefore, in step S440 the control circuit CONT1 can calculate the padding length mask_offset = w_id - pad_number = 0 - 2 = -2. Based on mask_offset = -2, the control circuit CONT1 can in step S440 define "the range of two elements to the right of each boundary position in the ring buffer RB1 (the dashed lines shown in Figure 5)" as the padding range. As shown in Figure 5, at time t50 elements A08 to A09 of the ring buffer RB1 are defined as one padding range, elements A18 to A19 as another padding range, and elements A28 to A29 as yet another padding range.
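The mapping from a signed padding length to ring-buffer positions can be sketched as follows. The helper `padding_positions` and the list model of the buffers are hypothetical. The sign convention (negative means to the right of a boundary, positive means to the left) follows the Figure 5 walkthrough.

```python
PAD = "P0"
# Input buffer IB1 modeled as in the text: "P0, P0, A00, ..., A39".
input_buffer = [PAD, PAD] + [f"A{i:02d}" for i in range(40)]

def padding_positions(boundaries, mask_offset):
    """Ring-buffer indices covered by the padding ranges.

    Negative mask_offset: |mask_offset| elements to the right of each boundary.
    Positive mask_offset: mask_offset elements to the left of each boundary.
    Zero: no positions at all.
    """
    positions = []
    for b in boundaries:
        if mask_offset < 0:
            positions.extend(range(b, b - mask_offset))
        else:
            positions.extend(range(b - mask_offset, b))
    return positions

# First iteration (time t50): window starts at the 1st element, mask_offset = -2.
ring = input_buffer[0:40]
selected = [ring[p] for p in padding_positions([10, 20, 30], -2)]
print(selected)
```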

The upper part of Figure 5 shows the contents of the ring buffer RB1 after they have been changed at different times (different iterations). In step S450, the control circuit CONT1 may change the element values of the elements in the padding range of the ring buffer RB1. For example, at time t51 after time t50 (still the first iteration, corresponding to element W00 of the weight matrix W), the control circuit CONT1 may reset all elements in the padding ranges of the ring buffer RB1 (elements A08 to A09, A18 to A19, and A28 to A29) to the same padding value P0, thereby implementing the padding operation on the input matrix IN1. Depending on the actual design, the padding value P0 may be 0 or another value. After all elements in the padding ranges have been reset to the padding value P0, the arithmetic circuit CC1 may use the elements of the ring buffer RB1 and element W00 of the weight matrix W to perform the first iteration of the operation.
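The reset in step S450 acts on the copy held in the ring buffer, not on the input buffer. A small sketch with numeric stand-ins illustrates this. The encoding is an assumption made for the example: 0 stands for P0, and 100 + i stands for element A(i). The helper name `apply_padding` is hypothetical.

```python
def apply_padding(ring, positions, pad_value=0):
    """Overwrite the given ring-buffer positions with the padding value, in place."""
    for p in positions:
        ring[p] = pad_value
    return ring

# Stand-in data: 0 plays the role of P0, and 100 + i plays the role of A(i).
input_buffer = [0, 0] + [100 + i for i in range(40)]

# First iteration: copy the 1st to 40th elements into the ring buffer.
ring = input_buffer[0:40]

# Padding ranges at time t50: two elements to the right of each boundary.
apply_padding(ring, [10, 11, 20, 21, 30, 31])

print(ring[8:14])          # ring-buffer contents around the first boundary
print(input_buffer[8:14])  # the input buffer keeps the original values
```

Keeping the input buffer untouched is what lets the next iteration reuse the same stored elements with a different window and a different padding range.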

At time t52 after time t51 (the second iteration, corresponding to element W01 of the weight matrix W), the 2nd to 41st consecutive elements "P0, A00, A01, ..., A37, A38" of the input buffer IB1 are moved from the input buffer IB1 to the ring buffer RB1. The current weight element W01 is in the second column of the weight matrix W (column position w_id = 1), so the control circuit CONT1 can calculate the padding length mask_offset = 1 - 2 = -1. Based on mask_offset = -1, the control circuit CONT1 can in step S440 define "the range of one element to the right of each boundary position in the ring buffer RB1 (the dashed lines shown in Figure 5)" as the padding range. As shown in Figure 5, at time t52 element A09 of the ring buffer RB1 is defined as one padding range, element A19 as another padding range, and element A29 as yet another padding range. At time t53 after time t52 (still the second iteration), the control circuit CONT1 may reset all elements in the padding ranges of the ring buffer RB1 (elements A09, A19, and A29) to the same padding value P0, thereby implementing the padding operation on the input matrix IN1. After all elements in the padding ranges have been reset to the padding value P0, the arithmetic circuit CC1 may use the elements of the ring buffer RB1 and element W01 of the weight matrix W to perform the second iteration of the operation.

At time t54 after time t53 (the third iteration, corresponding to element W02 of the weight matrix W), the 3rd to 42nd consecutive elements "A00, A01, ..., A37, A38, A39" of the input buffer IB1 are moved from the input buffer IB1 to the ring buffer RB1. The current weight element W02 is in the third column of the weight matrix W (column position w_id = 2), so the control circuit CONT1 can calculate the padding length mask_offset = 2 - 2 = 0. Since mask_offset = 0, the control circuit CONT1 knows that the current padding range has length 0, that is, no element needs to be set to the padding value P0. At time t55 after time t54 (still the third iteration), the arithmetic circuit CC1 may use the elements of the ring buffer RB1 and element W02 of the weight matrix W to perform the third iteration of the operation.

At time t56 after time t55 (the fourth iteration, corresponding to element W03 of the weight matrix W), the 4th to 42nd consecutive elements "A01, ..., A37, A38, A39" of the input buffer IB1 are moved from the input buffer IB1 to the ring buffer RB1. The current weight element W03 is in the fourth column of the weight matrix W (column position w_id = 3), so the control circuit CONT1 can calculate the padding length mask_offset = 3 - 2 = 1. Based on mask_offset = 1, the control circuit CONT1 can in step S440 define "the range of one element to the left of each boundary position in the ring buffer RB1 (the dashed lines shown in Figure 5)" as the padding range. As shown in Figure 5, at time t56 element A10 of the ring buffer RB1 is defined as one padding range, element A20 as another padding range, and element A30 as yet another padding range. At time t57 after time t56 (still the fourth iteration), the control circuit CONT1 may reset all elements in the padding ranges of the ring buffer RB1 (elements A10, A20, and A30) to the same padding value P0, thereby implementing the padding operation on the input matrix IN1. After all elements in the padding ranges have been reset to the padding value P0, the arithmetic circuit CC1 may use the elements of the ring buffer RB1 and element W03 of the weight matrix W to perform the fourth iteration of the operation.

At time t58 after time t57 (the fifth iteration, corresponding to element W04 of the weight matrix W), the 5th to 42nd consecutive elements "A02, ..., A37, A38, A39" of the input buffer IB1 are moved from the input buffer IB1 to the ring buffer RB1. The current weight element W04 is in the fifth column of the weight matrix W (column position w_id = 4), so the control circuit CONT1 can calculate the padding length mask_offset = 4 - 2 = 2. Based on mask_offset = 2, the control circuit CONT1 can in step S440 define "the range of two elements to the left of each boundary position in the ring buffer RB1 (the dashed lines shown in Figure 5)" as the padding range. As shown in Figure 5, at time t58 elements A10 to A11 of the ring buffer RB1 are defined as one padding range, elements A20 to A21 as another padding range, and elements A30 to A31 as yet another padding range. At time t59 after time t58 (still the fifth iteration), the control circuit CONT1 may reset all elements in the padding ranges of the ring buffer RB1 (elements A10 to A11, A20 to A21, and A30 to A31) to the same padding value P0, thereby implementing the padding operation on the input matrix IN1. After all elements in the padding ranges have been reset to the padding value P0, the arithmetic circuit CC1 may use the elements of the ring buffer RB1 and element W04 of the weight matrix W to perform the fifth iteration of the operation.
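The five iterations walked through above can be replayed in a few lines to list which elements are overwritten in each one. This is a sketch under the same modeling assumptions as the text's example (input buffer "P0, P0, A00, ..., A39", boundaries every 10 elements). The function name `overwritten_elements` is hypothetical.

```python
PAD = "P0"
ROWS, COLS, W_COLS = 4, 10, 5
input_buffer = [PAD, PAD] + [f"A{i:02d}" for i in range(ROWS * COLS)]

def overwritten_elements(w_id):
    """Names of the ring-buffer elements that are reset to P0 in iteration w_id."""
    ring = input_buffer[w_id:w_id + ROWS * COLS]    # step S420: move the window
    offset = w_id - (W_COLS - 1) // 2               # step S440: Equations A and B
    names = []
    for b in range(COLS, len(ring), COLS):          # step S430: boundary positions
        if offset < 0:
            lo, hi = b, b - offset                  # elements right of the boundary
        else:
            lo, hi = b - offset, b                  # elements left of the boundary
        names.extend(ring[lo:hi])
    return names

for w_id in range(W_COLS):
    print(f"W0{w_id}:", overwritten_elements(w_id))
```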

As described above, the arithmetic circuit CC1 can perform a matrix multiplication having a plurality of iterations. After completing the matrix multiplication, the arithmetic circuit CC1 may provide the multiplication result matrix C1 to the next-stage circuit 130.
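One way to check the scheme is to compare it against an explicitly padded computation. The sketch below rests on assumptions that are not stated in the text: each iteration is modeled as an element-wise multiply-accumulate of the ring buffer with the current weight element, ring-buffer positions beyond the moved elements are assumed to read as P0, and the matrix and weight values are arbitrary test data. All function names are hypothetical.

```python
ROWS, COLS, W_COLS, P0 = 4, 10, 5, 0
PAD_N = (W_COLS - 1) // 2                      # Equation B

# Arbitrary test data, chosen only for this sketch.
matrix = [[r * COLS + c + 1 for c in range(COLS)] for r in range(ROWS)]
weights = [1, -2, 3, -4, 5]

def ring_buffer_product(matrix, weights):
    """Accumulate one iteration per weight element using masked windows."""
    flat = [v for row in matrix for v in row]
    input_buffer = [P0] * PAD_N + flat         # "P0, P0, A00, ..., A39"
    acc = [0] * (ROWS * COLS)
    for w_id, w in enumerate(weights):
        ring = input_buffer[w_id:w_id + ROWS * COLS]
        # Assumption: positions past the moved elements read as P0.
        ring += [P0] * (ROWS * COLS - len(ring))
        offset = w_id - PAD_N                  # Equation A
        for b in range(COLS, ROWS * COLS, COLS):
            lo, hi = (b, b - offset) if offset < 0 else (b - offset, b)
            ring[lo:hi] = [P0] * (hi - lo)     # reset the padding range
        for j, v in enumerate(ring):
            acc[j] += w * v                    # assumed multiply-accumulate
    return [acc[r * COLS:(r + 1) * COLS] for r in range(ROWS)]

def padded_reference(matrix, weights):
    """Same result computed from rows explicitly padded on both sides."""
    out = []
    for row in matrix:
        padded = [P0] * PAD_N + row + [P0] * PAD_N
        out.append([sum(w * padded[c + k] for k, w in enumerate(weights))
                    for c in range(COLS)])
    return out

print(ring_buffer_product(matrix, weights) == padded_reference(matrix, weights))
```

Under these assumptions the masked ring-buffer iterations give the same result as operating on rows that were padded in memory, while the stored data holds 42 elements instead of the 56 an explicitly padded 4 x 14 matrix would need.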

In summary, the computing device 120 can store all elements of the original input matrix IN1 (a matrix without padding elements) at a plurality of consecutive addresses of the input buffer IB1. In each iteration of the matrix multiplication, the corresponding elements of the input matrix IN1 (a part of the input matrix IN1) are moved from a segment of consecutive addresses of the input buffer IB1 to the ring buffer RB1, and the control circuit CONT1 then selectively changes the element values of elements in the ring buffer RB1 according to the current iteration (the changed elements can be regarded as padding elements, which is equivalent to performing the padding operation on the input matrix IN1). Therefore, in application scenarios that require a padding operation on the input matrix IN1, the computing device 120 can reduce the space that padding elements occupy in the input buffer IB1.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, and some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A computing device for performing multiplication of an input matrix and a weight matrix, the computing device comprising:

an input buffer for buffering the input matrix, wherein all elements of the input matrix are stored at a plurality of consecutive addresses of the input buffer;

a ring buffer;

a control circuit coupled to the input buffer and the ring buffer, wherein the control circuit moves a plurality of elements of the input matrix to the ring buffer from a segment of consecutive addresses of the input buffer, the control circuit defines at least one boundary position in the ring buffer based on a dimension of the input matrix, the control circuit defines at least one fill range at the at least one boundary position based on a current weight element of the weight matrix, and the control circuit changes element values of the plurality of elements in the at least one fill range of the ring buffer to achieve a fill operation on the input matrix; and

an arithmetic circuit coupled to the ring buffer to read the plurality of elements after the fill operation, wherein the arithmetic circuit uses the plurality of elements after the fill operation to perform the multiplication.

2. The computing device of claim 1, wherein the input buffer retrieves all elements of the input matrix from a storage unit external to the computing device.

3. The computing device of claim 2, wherein the input buffer comprises a general matrix multiplication (GEMM) input buffer.

4. The computing device of claim 2, wherein the storage unit comprises a high bandwidth memory.

5. The computing device of claim 1, wherein no fill elements of the fill operation are present in all elements of the input matrix that are deposited at the plurality of consecutive addresses of the input buffer.

6. The computing device of claim 1, wherein the arithmetic circuit provides a result of the multiplication to a memory or to another computing device.

7. The computing device of claim 6, wherein the computing device is a tensor core and the arithmetic circuit comprises a general matrix multiplication (GEMM) circuit.

8. The computing device of claim 6, wherein the other computing device comprises a vector core.

9. The computing device of claim 1, wherein the control circuit defines the at least one boundary position based on a number of columns of the input matrix.

10. The computing device of claim 9, wherein a distance between two adjacent boundary positions among the at least one boundary position is the number of columns of the input matrix.

11. The computing device of claim 1, wherein the control circuit calculates a fill length of the at least one fill range based on a column position of the current weight element in the weight matrix, and the at least one fill range is a range that starts at the at least one boundary position and extends for the fill length.

12. The computing device of claim 11, wherein the control circuit calculates mask_offset = w_id - pad_number to obtain the fill length, wherein mask_offset represents the fill length, w_id represents the column position of the current weight element in the weight matrix, and pad_number represents a number of columns that the fill operation should fill on one side of the input matrix.

13. The computing device of claim 12, wherein the number of columns pad_number to be filled on the side of the input matrix is pad_number = (w_column_number - 1)/2, where w_column_number represents the number of columns of the weight matrix.

14. The computing device of claim 1, wherein the control circuit resets all of the plurality of elements in the at least one fill range of the ring buffer to a same fill value to effect the fill operation on the input matrix.

15. The computing device of claim 14, wherein the same fill value is 0.

16. A method of operation of a computing device for performing multiplication of an input matrix and a weight matrix, the method comprising:

storing the input matrix at a plurality of consecutive addresses of an input buffer of the computing device;

moving a plurality of elements of the input matrix from a segment of consecutive addresses of the input buffer to a ring buffer of the computing device;

defining at least one boundary position in the ring buffer based on a dimension of the input matrix;

defining at least one fill range at the at least one boundary position based on a current weight element of the weight matrix;

changing element values of the plurality of elements in the at least one fill range of the ring buffer to effect a fill operation on the input matrix;

reading, by an arithmetic circuit of the computing device, the plurality of elements from the ring buffer after the fill operation, wherein the arithmetic circuit is coupled to the ring buffer; and

performing, by the arithmetic circuit, the multiplication using the plurality of elements after the fill operation.

17. The method of operation of claim 16, further comprising:

retrieving, by the input buffer, all elements of the input matrix from a storage unit external to the computing device.

18. The method of operation of claim 17, wherein the input buffer comprises a general matrix multiplication (GEMM) input buffer.

19. The method of operation of claim 17, wherein the storage unit comprises a high bandwidth memory.

20. The method of operation of claim 16, wherein no fill elements of the fill operation are present in all elements of the input matrix deposited at the plurality of consecutive addresses of the input buffer.

21. The method of operation of claim 16, further comprising:

providing, by the arithmetic circuit, a result of the multiplication to a memory or to another computing device.

22. The method of operation of claim 21, wherein the computing device is a tensor core and the arithmetic circuit comprises a general matrix multiplication (GEMM) circuit.

23. The method of operation of claim 21, wherein the other computing device comprises a vector core.

24. The method of operation of claim 16, further comprising:

defining the at least one boundary position based on a number of columns of the input matrix.

25. The method of operation of claim 24, wherein a distance between two adjacent boundary positions among the at least one boundary position is the number of columns of the input matrix.

26. The method of operation of claim 16, further comprising:

calculating a fill length of the at least one fill range based on a column position of the current weight element in the weight matrix, wherein the at least one fill range is a range that starts at the at least one boundary position and extends for the fill length.

27. The method of operation of claim 26, further comprising:

calculating mask_offset = w_id - pad_number to obtain the fill length, wherein mask_offset represents the fill length, w_id represents the column position of the current weight element in the weight matrix, and pad_number represents a number of columns that the fill operation should fill on one side of the input matrix.

28. The method of operation of claim 27, wherein the number of columns pad_number to be filled on the side of the input matrix is pad_number = (w_column_number - 1)/2, where w_column_number represents the number of columns of the weight matrix.

29. The method of operation of claim 16, further comprising:

resetting all of the plurality of elements in the at least one fill range of the ring buffer to a same fill value to effect the fill operation on the input matrix.

30. The method of operation of claim 29, wherein the same fill value is 0.

31. A machine-readable storage medium storing non-transitory machine-readable instructions which, when executed by a computer, implement the method of operation of any one of claims 16-30.