
CN114692810A - Device and board card for calculating Winograd convolution - Google Patents


Info

Publication number
CN114692810A
Authority
CN
China
Prior art keywords
data
winograd
neuron
convolution
transformation
Prior art date
Legal status
Pending
Application number
CN202011579087.4A
Other languages
Chinese (zh)
Inventor
Not announced (inventor requested non-publication)
Current Assignee
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd
Priority to CN202011579087.4A
Publication of CN114692810A

Classifications

    • G06N 3/045 — Combinations of networks
    • G06F 17/15 — Correlation function computation including computation of convolution operations
    • G06F 17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06N 20/00 — Machine learning
    • G06N 3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 — Learning methods


Abstract

The present invention relates to a device and a board card for calculating Winograd convolution, wherein the device is connected to an off-chip memory that stores neuron data and a plurality of instructions. The device includes a neuron buffer and a forward transformation unit. The neuron buffer is used to initiate loading of the neuron data from the off-chip memory, and the forward transformation unit is used to initiate, before the neuron data has finished loading, reading the neuron data from the neuron buffer and performing the forward transformation on it to produce forward-transformed data. The invention achieves the technical effects of preserving network accuracy, accelerating performance, reducing area, and reducing power consumption.

Description

Device and board card for calculating Winograd convolution

Technical Field

The present invention relates generally to the field of neural networks, and more specifically to a device and a board card for calculating Winograd convolution.

Background

With the rapid development of the information age, research in artificial intelligence and machine learning is in high demand and the related industries are booming. Convolutional neural networks play a wide role in computer vision, autonomous driving, machine translation, speech recognition, smart homes, and many other areas.

Convolutional neural networks involve large numbers of parameters and heavy computation, so the execution performance of such models is severely limited by the restricted area and computing power of portable mobile terminals; moreover, running convolution operations on processors not designed for the purpose incurs a huge power-consumption overhead.

Winograd convolution is a convolution acceleration method based on polynomial interpolation. It partitions the two inputs of the convolution operation, the neurons and the weights, at a certain granularity, applies a linear transformation to each (the Winograd forward transformation), performs element-wise multiplication of the transformed neurons and weights, applies a further linear transformation to the element-wise product (the Winograd inverse transformation), and finally obtains a convolution result equivalent to that of the original convolution operation.

Because the forward and inverse transformation matrices for the neurons and weights in Winograd convolution consist of simple fixed values, these forward and inverse transformations can be realized using additions alone. The multiplications required by the Winograd algorithm occur only in the element-wise multiplication step, whose multiplication complexity is considerably lower than that of the original convolution algorithm. Since the hardware cost (timing, power, area) of a multiplier is much higher than that of an adder of the same bit width, replacing the original convolution with Winograd convolution yields clear gains in hardware energy efficiency and computation time.

However, no existing hardware is designed specifically for the Winograd convolution acceleration algorithm, so current artificial intelligence chips cannot fully exploit the advantages of the Winograd convolution operation. A hardware device that can run the Winograd convolution algorithm efficiently is therefore urgently needed.

Summary of the Invention

To at least partially solve the technical problems mentioned in the background, the present invention provides a device and a board card for calculating Winograd convolution.

In one aspect, the present invention discloses a device for calculating Winograd convolution, connected to an off-chip memory that stores neuron data and a plurality of instructions. The device includes: a neuron buffer, used to initiate loading of the neuron data from the off-chip memory; and a forward transformation unit, used to initiate, before the neuron data has finished loading, reading the neuron data from the neuron buffer and performing the forward transformation to produce forward-transformed data.

In another aspect, the present invention discloses an integrated circuit device including the aforementioned device, and further discloses a board card including the aforementioned integrated circuit device.

The hardware structure proposed by the present invention matches the Winograd convolution acceleration algorithm, achieving the technical effects of preserving network accuracy, accelerating performance, reducing area, and reducing power consumption.

Brief Description of the Drawings

The above and other objects, features, and advantages of exemplary embodiments of the present invention will become easier to understand by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown by way of example rather than limitation, and identical or corresponding reference numerals denote identical or corresponding parts, wherein:

FIG. 1 is a schematic diagram showing the conversion of an original convolution F(2×2, 3×3) into a Winograd convolution;

FIG. 2 is a structural diagram of a board card according to an embodiment of the present invention;

FIG. 3 is a structural diagram of an integrated circuit device according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the internal structure of a computing device according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a forward-transformed data buffer according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an element-wise multiply-accumulate operator according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a pipeline according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on these embodiments, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.

It should be understood that the terms "first", "second", "third", and "fourth" in the claims, description, and drawings of the present invention are used to distinguish different objects rather than to describe a specific order. The terms "comprise" and "include" used in the description and claims indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the present invention. As used in the description and claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the description and claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this description and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting".

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

The Winograd convolution acceleration algorithm (hereinafter the Winograd algorithm or Winograd convolution) applies linear transformations to the operands of the convolution operation to find the transformation requiring the smallest number of multiplications, replacing the eliminated multiplications with a number of additional additions. At the hardware level, a multiplier requires a more complex structure than an adder, consumes more area and power, and has worse overall processing performance, so the Winograd algorithm, which replaces multiplication with addition, has a great advantage when processing two-dimensional convolution operations.

For a two-dimensional convolution, the result can be expressed as F(m×n, r×s), i.e., the output shape is m×n and the weight shape is r×s. The matrix form of the Winograd algorithm is:

Y = A^T [(G g G^T) ⊙ (B^T d B)] A

where Y is the output matrix of the convolution operation, A^T is the left-multiplication constant matrix of the inverse transformation, G is the left-multiplication constant matrix of the weight transformation, g is the weight of the original convolution, G^T is the right-multiplication constant matrix of the weight transformation, ⊙ denotes element-wise multiplication, B^T is the left-multiplication constant matrix of the neuron transformation, d is the neuron data, B is the right-multiplication constant matrix of the neuron transformation, and A is the right-multiplication constant matrix of the inverse transformation. The left- and right-multiplication matrices of each transformation are simply transposes of each other.

Taking F(2×2, 3×3) as an example, the aforementioned constant matrices are as follows:

B^T = [ 1   0  −1   0 ]
      [ 0   1   1   0 ]
      [ 0  −1   1   0 ]
      [ 0   1   0  −1 ]

G   = [ 1     0    0  ]
      [ 1/2  1/2  1/2 ]
      [ 1/2 −1/2  1/2 ]
      [ 0     0    1  ]

A^T = [ 1  1   1   0 ]
      [ 0  1  −1  −1 ]
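With these matrices, the identity Y = A^T [(G g G^T) ⊙ (B^T d B)] A can be checked numerically. The sketch below uses the standard F(2×2, 3×3) transformation constants (assumed here, since the patent presents its matrices as figures) and compares against a direct sliding-window convolution:

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transformation constants (assumed values).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """Y = A^T [(G g G^T) . (B^T d B)] A for a 4x4 input tile d and 3x3 weight g."""
    U = G @ g @ G.T          # transformed weight, 4x4
    V = B_T @ d @ B_T.T      # transformed neuron tile, 4x4 (note B = B_T.T)
    M = U * V                # element-wise multiplication: the only 16 multiplies
    return A_T @ M @ A_T.T   # inverse transformation, 2x2 output

def direct_conv2d_valid(d, g):
    """Reference: direct 'valid' convolution of a 4x4 input with a 3x3 weight."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i+3, j:j+3] * g)
    return out

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
assert np.allclose(winograd_f2x2_3x3(d, g), direct_conv2d_valid(d, g))
```

The element-wise product `U * V` is the only step containing multiplications; every other step multiplies by matrices whose entries are 0, ±1, or ±1/2 and can be reduced to additions (and, for G, shifts).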

FIG. 1 shows a schematic diagram of the conversion of the original convolution F(2×2, 3×3) into a Winograd convolution. As shown in the figure, neuron data 101 is convolved with convolution kernel 102. During computation, the elements of neuron data 101 covered by sliding window 103 are arranged in a row; sliding the window 4 times forms the 4×9 matrix 104. The elements of convolution kernel 102 are arranged in a column to form the 9×1 matrix 105. Multiplying the 4×9 matrix 104 by the 9×1 matrix 105 carries out the convolution operation, yielding the 4×1 convolution result 106.

Partitioning along the dotted lines in the figure, the 4×9 matrix 104 becomes the 2×3 block matrix 107, the 9×1 matrix 105 becomes the 3×1 block matrix 108, and the 4×1 convolution result 106 becomes the 2×1 convolution result 109. After the linear transformation, the first element of the 2×1 convolution result 109 is R0 = M0 + M1 + M2, and R1 = M1 − M2 − M3, where M0, M1, M2, and M3 can be expressed by the following formulas:

M0 = (K0 − K2) · W0
M1 = (K1 + K2) · (W0 + W1 + W2) / 2
M2 = (K2 − K1) · (W0 − W1 + W2) / 2
M3 = (K1 − K3) · W2

where K0 to K3 denote the blocks of matrix 107 and W0 to W2 denote the blocks of matrix 108.

Through the above partitioning and linear transformation, the original convolution operation would require 36 multiplications, while the Winograd algorithm requires only 16, reducing the multiplication complexity by a factor of 2.25.
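The relations R0 = M0 + M1 + M2 and R1 = M1 − M2 − M3 can be verified with the scalar (one-dimensional) F(2,3) form of the decomposition; the operand names d0–d3 and g0–g2 below are assumed for illustration. Nesting this 1-D form in both dimensions gives the 6² = 36 → 4² = 16 reduction stated above:

```python
import numpy as np

rng = np.random.default_rng(1)
d0, d1, d2, d3 = rng.standard_normal(4)   # 4 input elements of one tile
g0, g1, g2 = rng.standard_normal(3)       # 3 weight elements

# Direct 1-D convolution F(2,3): 6 multiplications
r0_direct = d0*g0 + d1*g1 + d2*g2
r1_direct = d1*g0 + d2*g1 + d3*g2

# Winograd F(2,3): 4 multiplications, one per M term
# (the weight-side factors like (g0+g1+g2)/2 can be precomputed offline)
m0 = (d0 - d2) * g0
m1 = (d1 + d2) * (g0 + g1 + g2) / 2
m2 = (d2 - d1) * (g0 - g1 + g2) / 2
m3 = (d1 - d3) * g2

assert np.isclose(m0 + m1 + m2, r0_direct)   # R0 = M0 + M1 + M2
assert np.isclose(m1 - m2 - m3, r1_direct)   # R1 = M1 - M2 - M3
```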

As the above conversion of the two-dimensional convolution shows, the Winograd algorithm consists mainly of the following steps. First, the weights are left- and right-multiplied by the weight constant matrices, i.e., G g G^T, to obtain the Winograd-transformed weights; at the same time, the neuron data is left- and right-multiplied by the neuron constant matrices, i.e., B^T d B, to obtain the Winograd-transformed neurons. Next, the transformed neuron and weight matrices are multiplied element-wise, i.e., (G g G^T) ⊙ (B^T d B), to obtain the element-wise product. Finally, the element-wise product is left- and right-multiplied by the inverse transformation constant matrices, i.e., A^T [(G g G^T) ⊙ (B^T d B)] A, finally yielding a convolution result equivalent to the original convolution.

From a hardware design perspective, the present invention pipelines these three major transformation steps, in view of the dependencies among them and their distinct computational characteristics, to achieve more efficient acceleration.

FIG. 2 shows a schematic structural diagram of a board card 20 according to an embodiment of the present invention. As shown in FIG. 2, the board card 20 includes a chip 201, which is a system-on-chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in cloud intelligence; a notable feature of cloud intelligence applications is the large volume of input data, which places high demands on the storage capacity and computing power of the platform. The board card 20 of this embodiment is suited to cloud intelligence applications, having large off-chip storage, large on-chip storage, and substantial computing power.

The chip 201 is connected to an external device 203 through an external interface device 202. The external device 203 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed can be transferred from the external device 203 to the chip 201 through the external interface device 202, and the computation results of the chip 201 can be transmitted back to the external device 203 via the external interface device 202. Depending on the application scenario, the external interface device 202 may take different interface forms, such as a PCIe interface.

The board card 20 also includes a storage device 204 for storing data, which includes one or more storage units 205. The storage device 204 is connected to the control device 206 and the chip 201 through a bus for data transmission. The control device 206 on the board card 20 is configured to regulate the state of the chip 201; to this end, in one application scenario, the control device 206 may include a micro controller unit (MCU).

FIG. 3 is a structural diagram of the combined processing device in the chip 201 of this embodiment. As shown in FIG. 3, the combined processing device 30 includes a computing device 301, an interface device 302, a processing device 303, and a DRAM 304.

The computing device 301 is configured to perform user-specified operations. It is mainly implemented as a single-core or multi-core intelligent processor that performs deep learning or machine learning computations, in particular the Winograd convolution operation, and it can interact with the processing device 303 through the interface device 302 to jointly complete the user-specified operations.

The interface device 302 is used to transfer data and control instructions between the computing device 301 and the processing device 303. For example, the computing device 301 may obtain input data from the processing device 303 via the interface device 302 and write it into the on-chip storage of the computing device 301. Further, the computing device 301 may obtain control instructions from the processing device 303 via the interface device 302 and write them into the on-chip control buffer. Alternatively or additionally, the interface device 302 may read data from the storage of the computing device 301 and transmit it to the processing device 303.

The processing device 303, as a general-purpose processing device, performs basic control including but not limited to data transfer and starting and/or stopping the computing device 301. Depending on the implementation, the processing device 303 may be one or more types of processor among a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components; their number can be determined according to actual needs. As noted above, the computing device 301 of the present invention, considered alone, can be regarded as having a single-core structure or a homogeneous multi-core structure; when the computing device 301 and the processing device 303 are considered together, however, the two are regarded as forming a heterogeneous multi-core structure.

The DRAM 304 is an off-chip memory used to store data to be processed, typically 16 GB or larger in size. It holds the data of the computing device 301 and/or the processing device 303, in particular the neuron data and weights on which the Winograd convolution operation is to be performed. In this embodiment, the processing device 303 has linearly transformed the weights of the original convolution into the Winograd weights G g G^T in advance and stored them in the DRAM 304.

FIG. 4 shows a structural diagram of the computing device 301. The computing device 301 includes a bus 401, a direct memory access (DMA) module 402, an instruction buffer (Iram) 407, a decoding unit (IDU) 408, a neuron buffer (Nram) 409, a neuron transformation unit (NTU) 410 for the forward transformation, a forward-transformed data buffer (WNram) 411, a weight buffer (Wram) 412, an element-wise multiply-accumulate operator (MAC) 413, an element-wise product buffer (WRram) 414, an inverse transformation unit (ITU) 415, a result buffer (Rram) 416, and an arithmetic logic unit (ALU) 417.

The bus 401 is the common communication trunk that conveys information between the devices, a harness of transmission wires. According to the kinds of information transmitted by the combined processing device 30, the bus 401 is the collective name for the data bus, the address bus, and the control bus, used to transfer data, data addresses, and instructions. The bus 401 serves as the communication channel between the DRAM 304 and the computing device 301, and in this embodiment is specifically PCIe.

The DMA module 402 copies data from one address space to another, typically moving data between external memory (such as the DRAM 304) and the internal buffers of the computing device 301. When a DMA transfer is performed, the processing device 303 hands bus control to the DMA module 402, which controls the bus 401 to carry out the data transfer; after the DMA transfer ends, the DMA module 402 hands bus control back to the processing device 303.

The DMA module 402 includes a neuron direct memory access (NDMA) 403, a weight direct memory access (WDMA) 404, an instruction direct memory access (IDMA) 405, and a result direct memory access (RDMA) 406. The NDMA 403 loads neuron data from the DRAM 304, the WDMA 404 loads Winograd weights from the DRAM 304, the IDMA 405 loads instructions from the DRAM 304, and the RDMA 406 writes computation results back to the DRAM 304. In other embodiments, the NDMA 403, WDMA 404, IDMA 405, and RDMA 406 may be implemented by a single direct memory access unit.

The Iram 407 temporarily stores the instructions loaded by the IDMA 405; the IDU 408 fetches instructions from the Iram 407, decodes them, and controls the operation of the other units according to the decoded instructions. The IDU 408 is the decoding and scheduling unit of the entire computing device 301: it decodes the control instructions obtained from the DRAM 304 and converts them into control signals that coordinate the operation of the on-chip modules and units, and it is also responsible for instruction ordering, dependency resolution, branch prediction, exception handling, interrupt handling, and many other tasks. In the figure, thin arrows denote control flow and thick arrows denote data flow.

According to the decoded instructions, the Nram 409 temporarily stores the neuron data sent by the NDMA 403, and the NTU 410 reads the neuron data from the Nram 409 and performs the forward transformation, i.e., the computation B^T d B, to produce the forward-transformed data, which is temporarily stored in the WNram 411.
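Because every entry of B^T is 0 or ±1 for F(2×2, 3×3), the computation B^T d B reduces entirely to additions and subtractions of tile elements, which is what allows the NTU to be built from adders alone. A minimal sketch (standard F(2,3) matrix values assumed; the function names are illustrative):

```python
import numpy as np

def forward_transform_rows(t):
    """Apply B^T to a 4xN tile using only additions/subtractions
    (every entry of B^T is 0 or +/-1)."""
    return np.stack([t[0] - t[2],    # row 0 of B^T: [1,  0, -1,  0]
                     t[1] + t[2],    # row 1 of B^T: [0,  1,  1,  0]
                     t[2] - t[1],    # row 2 of B^T: [0, -1,  1,  0]
                     t[1] - t[3]])   # row 3 of B^T: [0,  1,  0, -1]

def forward_transform(d):
    """B^T d B: left-multiply combines rows; right-multiplying by B = (B^T)^T
    is the same row operation applied to the transpose."""
    x = forward_transform_rows(d)           # B^T d
    return forward_transform_rows(x.T).T    # (B^T d) B

# Check against the explicit matrix product.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
d = np.random.default_rng(2).standard_normal((4, 4))
assert np.allclose(forward_transform(d), B_T @ d @ B_T.T)
```

This mirrors the ordering described below for the adder groups: the left-multiplication by B^T is computed first, then the right-multiplication by B.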

The NTU 410 includes an input buffer, a register file, adder groups, and an output buffer.

When the NTU 410 receives an instruction to load neuron data from the Nram 409, the input buffer acts as a first-in-first-out queue that temporarily stores the neuron data. The neuron-loading stage continues until all data has been received; convolution filters of different sizes are configured with fixed, independent buffer resource partitions and input counts, and the whole process is controlled by instructions sent by the IDU 408.

According to the decoded instructions and the planned operation order, the register file fetches the buffered neuron data from the input buffer and stores it at specific addresses in the register file; the neuron data stored at these addresses becomes the addition operands. In this embodiment, since the input, operation, and output stages have pipeline intervals of equal length, dependencies on the buffer hardware resources arise. To resolve this resource dependency, the register file is split into a ping storage unit and a pong storage unit of equal size: the i-th addition operands and the forward-transformed data computed from them are held in the ping unit, the (i+1)-th addition operands and the (i+1)-th forward-transformed data in the pong unit, and the (i+2)-th addition operands and the (i+2)-th forward-transformed data again in the ping unit, overwriting the i-th addition operands and the i-th forward-transformed data; the register file continues storing according to this rule.
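The ping/pong alternation described above can be sketched as follows; the class and method names are hypothetical, chosen only to illustrate the bank-swapping rule, and are not taken from the patent.

```python
# Minimal sketch of the ping/pong register-file rule: tile i uses the ping
# bank, tile i+1 the pong bank, and tile i+2 overwrites tile i in ping.
class PingPongRegFile:
    def __init__(self):
        self.banks = [None, None]  # [ping, pong]

    def store(self, i, operands, result):
        # Even tiles go to ping (bank 0), odd tiles to pong (bank 1);
        # writing tile i+2 overwrites whatever tile i left in the same bank.
        self.banks[i % 2] = (operands, result)

    def load(self, i):
        return self.banks[i % 2]

rf = PingPongRegFile()
rf.store(0, "ops0", "res0")
rf.store(1, "ops1", "res1")
rf.store(2, "ops2", "res2")   # overwrites tile 0 in the ping bank
print(rf.load(2))  # ('ops2', 'res2')
```

This double-buffering lets the output stage drain one bank while the adder array fills the other, which is why the three equal-length pipeline stages can run back to back.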

According to the decoded instructions, the adder array sequentially reads the addition operands from the specific register-file addresses and performs the additions. In this embodiment there are 2 adder groups, corresponding to the scheduling direction of the addition operations; each group contains 16 adders, corresponding to the vectorization direction, and each adder is an FB32 adder. The additions of the forward transformation of the Winograd convolution are executed along the channel dimension of the neuron data in a specific order: first the additions for the left-multiplication matrix B^T of the Winograd convolution, then the additions for the right-multiplication matrix B, finally producing the forward-transformed data, which is written back into the register file. The operation order, register allocation, and operation time all depend on the size of the convolution filter and are controlled by instructions issued by IDU 408. This operation stage has a data dependency on the aforementioned neuron-data loading stage; the two are executed in a pipelined manner, implemented in hardware by counting.
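As a concrete illustration of this adder-only datapath, the sketch below assumes the standard F(2×2, 3×3) Winograd transform (the patent does not fix a tile size), whose B^T matrix contains only 0 and ±1, so both the left multiplication by B^T and the right multiplication by B reduce to pure additions and subtractions:

```python
# B^T for F(2x2,3x3) is [[1,0,-1,0],[0,1,1,0],[0,-1,1,0],[0,1,0,-1]], so
# B^T · d · B needs no multipliers, matching the adder-only NTU datapath.
def bt_left(d):
    # Rows of B^T · d: each output row is an add/sub of two input rows.
    return [
        [d[0][c] - d[2][c] for c in range(4)],
        [d[1][c] + d[2][c] for c in range(4)],
        [d[2][c] - d[1][c] for c in range(4)],
        [d[1][c] - d[3][c] for c in range(4)],
    ]

def b_right(m):
    # (·) · B: the same add/sub pattern applied along columns.
    return [
        [m[r][0] - m[r][2], m[r][1] + m[r][2], m[r][2] - m[r][1], m[r][1] - m[r][3]]
        for r in range(4)
    ]

d = [[1, 2, 3, 4]] * 4          # toy 4x4 input tile
V = b_right(bt_left(d))         # left-multiply first, then right-multiply
print(V[1])  # [-4, 10, 2, -4]
```

The two-step order (left matrix, then right matrix) mirrors the specific order stated above for the adder array.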

The output buffer is likewise a FIFO queue, used to temporarily store the forward-transformed data arriving in turn from the ping and pong storage units. This output stage depends on the full completion of the operation stage before the corresponding buffered data can be output.

WNram 411 includes a plurality of cache units. FIG. 5 shows a schematic diagram of an exemplary WNram 411; as shown, WNram 411 includes four cache units: a first cache unit 501, a second cache unit 502, a third cache unit 503, and a fourth cache unit 504. The forward-transformed data from NTU 410 is sent to one or more of these cache units by routed distribution.

Returning to FIG. 4, Wram 412, according to the decoded instructions, temporarily stores the Winograd weights sent by WDMA 404. MAC 413, according to the decoded instructions, reads the Winograd weights from Wram 412 and the forward-transformed data from WNram 411, and performs an element-wise multiply-accumulate on the two, i.e., the operation [(G g G^T) ⊙ (B^T d B)], to produce element-wise product data, which is temporarily stored in WRram 414.
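A minimal sketch of this element-wise multiply-accumulate, assuming 4×4 tiles (as in F(2×2, 3×3)) and accumulation over input channels; the names and shapes are illustrative, not from the patent:

```python
# For each input channel c, the transformed weight tile U[c] is multiplied
# element-wise (Hadamard product) with the transformed input tile V[c],
# and the products are accumulated across channels.
def eltwise_mac(U, V):
    C = len(U)                         # number of input channels (assumed)
    acc = [[0.0] * 4 for _ in range(4)]
    for c in range(C):
        for i in range(4):
            for j in range(4):
                acc[i][j] += U[c][i][j] * V[c][i][j]
    return acc

U = [[[1.0] * 4 for _ in range(4)]] * 2   # two channels of all-ones weights
V = [[[2.0] * 4 for _ in range(4)]] * 2   # two channels of all-twos inputs
M = eltwise_mac(U, V)
print(M[0][0])  # 4.0  (2 channels x 1*2)
```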

In this embodiment, MAC 413 includes 64 MAC operators, evenly divided into 4 groups that process 4 different batches; the 16 MAC operators within each group are independently distributed. The forward-transformed data in WNram 411 must be sent to all 64 MAC operators simultaneously so that it can be multiply-accumulated against different Winograd weights; WNram 411 therefore sends the forward-transformed data by broadcast or distribution routing. Because the output load is large, to guarantee drive strength and timing, the forward-transformed data of WNram 411 passes through two levels of broadcast or distribution routing, N1 and N2: it is first sent to 4 N1 nodes, each N1 node broadcasts or routes it to 4 N2 nodes, and each N2 node in turn broadcasts or routes it to 4 MAC operators.
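The two-level fan-out can be sketched as follows; the point is that 4 × 4 × 4 = 64 destinations are reached while no single driver ever feeds more than 4 loads (the function name is illustrative):

```python
# Two-level N1/N2 broadcast tree: root -> 4 N1 nodes -> 16 N2 nodes -> 64 MACs.
def broadcast_two_level(data):
    n1_nodes = [data] * 4                                # root drives 4 N1 nodes
    n2_nodes = [d for d in n1_nodes for _ in range(4)]   # each N1 drives 4 N2 nodes
    macs = [d for d in n2_nodes for _ in range(4)]       # each N2 drives 4 MACs
    return macs

dests = broadcast_two_level("V_tile")
print(len(dests))  # 64
```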

FIG. 6 shows a schematic diagram of one group of MAC operators 601. MAC operator 601 first performs an element-wise multiplication and then accumulates the elements of the resulting vector in turn; its logical function is equivalent to computing a vector inner product, or computing one element value in a matrix multiplication.

ITU 415, according to the decoded instructions, reads the element-wise product data from WRram 414 and applies the inverse transformation to it, i.e., performs the operation A^T [(G g G^T) ⊙ (B^T d B)] A, to obtain the convolution result, which is temporarily stored in Rram 416.
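Putting the three phases together, the following sketch evaluates the full formula A^T [(G g G^T) ⊙ (B^T d B)] A for a single tile and checks it against a direct sliding-window convolution. It assumes the standard F(2×2, 3×3) transform matrices, which the patent itself does not specify:

```python
# Standard F(2x2,3x3) Winograd matrices (an illustrative choice).
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G   = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
A_T = [[1, 1, 1, 0], [0, 1, -1, -1]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def T(X):  # transpose
    return [list(r) for r in zip(*X)]

def winograd_2x2_3x3(d, g):
    V = matmul(matmul(B_T, d), T(B_T))                             # B^T d B
    U = matmul(matmul(G, g), T(G))                                 # G g G^T
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]  # Hadamard
    return matmul(matmul(A_T, M), T(A_T))                          # A^T M A

d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Y = winograd_2x2_3x3(d, g)
# Direct 'valid' convolution (correlation form) for comparison:
ref = [[sum(d[r + i][c + j] * g[i][j] for i in range(3) for j in range(3))
        for c in range(2)] for r in range(2)]
print(ref)  # [[18, 21], [30, 33]]
```

On this input the Winograd result Y matches the direct convolution, a 2×2 output from a 4×4 tile and 3×3 filter.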

ITU 415 comprises an input buffer, a register file, an adder array, and an output buffer.

When ITU 415 receives an instruction to load element-wise product data from WRram 414, the input buffer acts as a FIFO queue that temporarily stores the element-wise product data. The loading stage continues until all data has been received; convolution filters of different sizes are assigned fixed, independent buffer-resource partitions and input counts, and the whole process is controlled by instructions issued by IDU 408.

According to the decoded instructions and a fixed operation order, the register file fetches the buffered element-wise product data from the input buffer and stores it at specific addresses in the register file; the element-wise product data stored at these addresses becomes the addition operands. Likewise, to resolve the resource dependency, the register file has ping and pong storage units of equal size: the i-th addition operands and the convolution result computed from them are held in the ping unit, the (i+1)-th addition operands and the (i+1)-th convolution result in the pong unit, and the (i+2)-th addition operands and the (i+2)-th convolution result again in the ping unit, overwriting the i-th addition operands and the i-th convolution result; the register file continues storing according to this rule.

According to the decoded instructions, the adder array sequentially reads the addition operands from the specific register-file addresses and performs the additions. As in NTU 410, there are 2 adder groups corresponding to the scheduling direction of the addition operations; each group contains 16 adders corresponding to the vectorization direction, and each adder is an FB32 adder. The additions of the inverse transformation of the Winograd convolution are executed along the channel dimension of the element-wise product data in a specific order: first the additions for the left-multiplication matrix A^T of the Winograd convolution, then the additions for the right-multiplication matrix A, finally producing the convolution result, which is written back into the register file. The operation order, register allocation, and operation time all depend on the size of the convolution filter and are controlled by instructions issued by IDU 408. This operation stage has a data dependency on the aforementioned element-wise-product-data loading stage; the two are executed in a pipelined manner, implemented in hardware by counting.

The output buffer is likewise a FIFO queue, used to temporarily store the convolution results arriving in turn from the ping and pong storage units. This output stage depends on the full completion of the operation stage before the corresponding buffered data can be output.

In addition to the Winograd convolution, computing device 301 can perform all other neural-network-related operations. ALU 417 performs two classes of tasks according to the decoded instructions. The first is fused-convolution operations, i.e., operations that can be completed on-chip in a single pass together with the convolution layer without requiring additional data, including activation, bias addition, directional partial sums, and accumulation. The second is non-convolution operations. The results produced by ALU 417 are likewise temporarily stored in Rram 416. The presence of ALU 417 ensures that every operation of a convolutional neural network can be fully executed within computing device 301, giving computing device 301 the generality and completeness required for neural networks.
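A hypothetical sketch of one such fused operation: bias addition followed by an activation, applied to a convolution-result tile before write-out. The function name and the choice of ReLU are illustrative assumptions, not taken from the patent:

```python
# Fused post-convolution step: add a per-output bias, then apply ReLU,
# all before the result leaves the chip.
def fuse_bias_relu(tile, bias):
    return [[max(0.0, x + bias) for x in row] for row in tile]

out = fuse_bias_relu([[-2.0, 1.0], [0.5, -0.5]], bias=1.0)
print(out)  # [[0.0, 2.0], [1.5, 0.5]]
```

Fusing such steps avoids a round trip to DRAM 304 between the convolution and its activation, which is the stated purpose of performing them "on-chip in a single pass".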

According to the decoded instructions, RDMA 406 fetches the convolution result from Rram 416 and outputs it to DRAM 304, completing the entire convolution operation. Likewise, RDMA 406 can also, according to the decoded instructions, fetch the other operation results produced by ALU 417 from Rram 416 and output them to DRAM 304.

Because the data involved in the convolution operation is very large, in order to reduce the launch overhead of the instructions themselves, this embodiment further uses instruction control to pipeline the relevant modules/units, improving hardware utilization.

From the above, the input timing and data size of the neuron data affect the neuron forward-transformation phase of the Winograd convolution instruction; the input timing and data size of the weight data likewise affect the element-wise multiply-accumulate phase; and the completion timing of the Winograd inverse transformation affects the execution of the convolution-result output instruction. From a control perspective, therefore, the ordering of the instructions and their execution times are critical. Moreover, this embodiment must insert synchronization instructions between instructions that have dependencies, in order to resolve the data dependencies between the input/output programs and the Winograd convolution program.

FIG. 7 shows a schematic diagram of the pipeline of this embodiment; IDU 408 primarily controls the pipelined operation among Nram 409, NTU 410, Wram 412, MAC 413, ITU 415, and Rram 416.

When performing the i-th convolution, IDU 408 sends an instruction that causes Nram 409 to begin loading neuron data i from DRAM 304 at time T1; the load completes at time T2. Before neuron data i finishes loading, at time T3, IDU 408, according to a synchronization instruction, causes NTU 410 to begin reading neuron data i from Nram 409 and performing the forward transformation, producing forward-transformed data i. From time T3 onward, while Nram 409 is still loading neuron data i, NTU 410 is simultaneously reading neuron data i from Nram 409 and transforming it; forward-transformed data i is completed at time T4.

The convolution of neuron data i must be paired with Winograd weights i. In the hardware structure of this embodiment, the input of neuron data i is handled by NDMA 403 and the input of Winograd weights i by WDMA 404, so the two can proceed in parallel. However, since the I/O bandwidth of computing device 301 is fixed, and neuron data i must first pass through the forward transformation in NTU 410 before MAC 413 can multiply-accumulate it against Winograd weights i, this embodiment is designed so that neuron data i is loaded first at time T1 and forward-transformed at time T3, while Winograd weights i are input to Wram 412 somewhat later. This lets Nram 409, NTU 410, Wram 412, and MAC 413 cooperate well with one another, avoiding, as far as possible, leaving any module/unit idle or blocked. To this end, before forward-transformed data i is fully produced, IDU 408, according to a synchronization instruction, causes Wram 412 to begin loading Winograd weights i from DRAM 304. The time at which this load starts can be chosen according to the I/O bandwidth of computing device 301; preferably it is time T3, i.e., starting the forward transformation and starting the load of Winograd weights i are executed simultaneously. Assume Winograd weights i also finish loading at time T4.

Before Winograd weights i finish loading, at time T5, IDU 408, according to a synchronization instruction, causes MAC 413 to begin the element-wise multiply-accumulate of forward-transformed data i with Winograd weights i, producing element-wise product data i. From time T5 onward, while Wram 412 is still loading Winograd weights i, MAC 413 is simultaneously performing the multiply-accumulate; element-wise product data i is completed at time T6.

Before element-wise product data i is fully produced, at time T7, IDU 408, according to an instruction, causes ITU 415 to begin reading element-wise product data i from WRram 414 and applying the inverse transformation, producing convolution result i. From time T7 onward, while MAC 413 is still performing the multiply-accumulate, ITU 415 is simultaneously performing the inverse transformation; convolution result i is completed at time T8.

There are two possible times at which Rram 416 may begin buffering convolution result i: before the result is fully produced, i.e., between times T7 and T8, or after it is fully produced. FIG. 7 takes the latter case as an example: at time T8, IDU 408, according to a synchronization instruction, causes Rram 416 to begin buffering convolution result i, and the buffering completes at time T9.
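The overlap described for iteration i can be summarized as a toy schedule; the tick values below are illustrative placeholders, and the only property checked is the one the text asserts: each stage starts after its producer starts but before its producer finishes.

```python
# Each stage begins BEFORE its producer completes; the synchronization
# instructions only guarantee that enough data is already available.
stages = {               # name: (start_tick, end_tick), illustrative values
    "load_neuron_i":   (1, 4),   # T1..T2
    "fwd_transform_i": (3, 7),   # T3..T4
    "load_weights_i":  (3, 7),   # starts together with the forward transform
    "mac_i":           (5, 9),   # T5..T6
    "inv_transform_i": (8, 11),  # T7..T8
}
producers = {
    "fwd_transform_i": "load_neuron_i",
    "mac_i":           "fwd_transform_i",
    "inv_transform_i": "mac_i",
}
for consumer, producer in producers.items():
    c_start = stages[consumer][0]
    p_start, p_end = stages[producer]
    assert p_start < c_start < p_end   # overlapped, not a serial hand-off
print("pipelined: every stage overlaps its producer")
```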

After neuron data i has been fully output, the (i+1)-th convolution can begin. IDU 408 sends an instruction that causes Nram 409 to begin loading neuron data i+1 from DRAM 304 at time T2; the load completes at time T10. In other words, Nram 409 begins loading neuron data i+1 before forward-transformed data i is fully produced. Before neuron data i+1 finishes loading, at time T4, IDU 408, according to a synchronization instruction, causes NTU 410 to begin reading neuron data i+1 from Nram 409 and performing the forward transformation, producing forward-transformed data i+1. From time T4 onward, while Nram 409 is still loading neuron data i+1, NTU 410 is simultaneously reading and transforming it; forward-transformed data i+1 is completed at time T11.

Before forward-transformed data i+1 is fully produced, and before MAC 413 has fully produced element-wise product data i, IDU 408, according to a synchronization instruction, causes Wram 412 to begin loading Winograd weights i+1 from DRAM 304. The time at which this load starts can be chosen according to the I/O bandwidth of computing device 301; preferably it is time T4, i.e., starting the forward transformation and starting the load of Winograd weights i+1 are executed simultaneously. Assume Winograd weights i+1 also finish loading at time T11.

Before Winograd weights i+1 finish loading, at time T6, and before ITU 415 has fully produced convolution result i, IDU 408, according to a synchronization instruction, causes MAC 413 to begin the element-wise multiply-accumulate of forward-transformed data i+1 with Winograd weights i+1, producing element-wise product data i+1. From time T6 onward, while Wram 412 is still loading Winograd weights i+1, MAC 413 is simultaneously performing the multiply-accumulate; element-wise product data i+1 is completed at time T12.

Before element-wise product data i+1 is fully produced, at time T8, IDU 408, according to an instruction, causes ITU 415 to begin reading element-wise product data i+1 from WRram 414 and applying the inverse transformation, producing convolution result i+1. From time T8 onward, while MAC 413 is still performing the multiply-accumulate, ITU 415 is simultaneously performing the inverse transformation; convolution result i+1 is completed at time T13.

Likewise, there are two possible times at which Rram 416 may begin buffering convolution result i+1: before the result is fully produced, i.e., between times T9 and T13, or after it is fully produced. Taking the latter case as an example, at time T13, IDU 408, according to a synchronization instruction, causes Rram 416 to begin buffering convolution result i+1, and the buffering completes at time T14.

Based on the structure of computing device 301 shown in FIG. 4, this embodiment executes the Winograd convolution according to the pipeline described above, making full use of the hardware's strengths and improving both I/O and computational efficiency.

The present invention designs hardware around the characteristics of the Winograd algorithm to achieve general-purpose acceleration, proposes a pipelined mode of operation that accelerates the Winograd convolution, and, in the hardware implementation, makes full use of reusable resources through methods such as time-division multiplexing and broadcast routing. The proposed hardware structure matches the Winograd convolution algorithm and has the technical effects of preserving network accuracy, accelerating performance, reducing area, and lowering power consumption.

Depending on the application scenario, the electronic device or apparatus of the present invention may include a server, cloud server, server cluster, data-processing apparatus, robot, computer, printer, scanner, tablet, smart terminal, PC, IoT terminal, mobile terminal, mobile phone, dashboard camera, navigator, sensor, webcam, still camera, video camera, projector, watch, earphones, mobile storage, wearable device, vision terminal, autonomous-driving terminal, vehicle, household appliance, and/or medical device. The vehicles include aircraft, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include MRI machines, ultrasound machines, and/or electrocardiographs. The electronic device or apparatus of the present invention may also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare.
Further, the electronic device or apparatus of the present invention may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, an electronic device or apparatus with high computing power according to the present solution may be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal and/or edge device are mutually compatible, so that, based on the hardware information of the terminal and/or edge device, suitable hardware resources can be matched from the cloud device's hardware resources to simulate the hardware resources of the terminal and/or edge device, thereby achieving unified management, scheduling, and collaboration in device-cloud integration or cloud-edge-device integration.

It should be noted that, for brevity, the present invention describes some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present invention is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching of the present invention, those skilled in the art will understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art will understand that the embodiments described herein may be regarded as optional embodiments, i.e., the actions or modules involved are not necessarily required for the realization of one or more solutions of the present invention. In addition, depending on the solution, the descriptions of different embodiments have different emphases. In view of this, for parts not described in detail in a given embodiment, those skilled in the art may refer to the related descriptions of other embodiments.

In terms of specific implementation, based on the disclosure and teaching of the present invention, those skilled in the art will understand that the several embodiments disclosed herein may also be realized in ways not disclosed here. For example, the units in the electronic device or apparatus embodiments described above are divided on the basis of logical function, and other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connections between different units or components, the connections discussed above in conjunction with the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.

In the present invention, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present invention. Further, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit, or each unit may exist physically on its own.

In other implementation scenarios, the above integrated units may also be implemented in hardware, i.e., as concrete hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the hardware structure of a circuit may include, but is not limited to, physical devices, which may in turn include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g., the computing device or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), which may be, for example, resistive random-access memory (RRAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), enhanced dynamic random-access memory (EDRAM), high-bandwidth memory (HBM), a hybrid memory cube (HMC), ROM, RAM, and the like.

The foregoing may be better understood in light of the following clauses:

Clause A1. An apparatus for computing Winograd convolution, connected to an off-chip memory that stores neuron data and a plurality of instructions, the apparatus comprising: a neuron cache configured to initiate loading of the neuron data from the off-chip memory; and a forward-transform unit configured to, before loading of the neuron data completes, initiate reading the neuron data from the neuron cache and forward-transforming it to produce forward-transformed data.
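As a purely illustrative sketch of the forward transform this clause describes: the standard 1-D Winograd F(2,3) input-transform matrix from the literature is assumed here (the clause itself does not fix a tile size or a particular transform):

```python
import numpy as np

# Standard Winograd F(2,3) input-transform matrix B^T (assumed for
# illustration only; the patent does not prescribe this tile size).
B_T = np.array([
    [1,  0, -1,  0],
    [0,  1,  1,  0],
    [0, -1,  1,  0],
    [0,  1,  0, -1],
], dtype=np.float64)

def forward_transform(d):
    """Forward-transform a 4-element tile of neuron data: V = B^T d."""
    return B_T @ np.asarray(d, dtype=np.float64)

print(forward_transform([1.0, 2.0, 3.0, 4.0]))  # -> [-2.  5.  1. -2.]
```

The transformed tile V is what the later clauses multiply element-wise against the Winograd weights.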

Clause A2. The apparatus of clause A1, wherein the off-chip memory further stores Winograd weights, and the apparatus further comprises a weight cache configured to initiate loading of the Winograd weights from the off-chip memory before the forward-transformed data are fully produced.

Clause A3. The apparatus of clause A2, wherein initiating the forward transform and initiating the loading of the Winograd weights are performed simultaneously.

Clause A4. The apparatus of clause A2, further comprising an element-wise multiply-accumulate operator configured to, before loading of the Winograd weights completes, initiate an element-wise multiply-accumulate operation on the forward-transformed data and the Winograd weights to produce element-wise product data.
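A minimal sketch of this element-wise multiply step (the Hadamard product of Winograd convolution), again assuming the standard F(2,3) weight-transform matrix G from the literature rather than anything fixed by the clause:

```python
import numpy as np

# Standard F(2,3) weight-transform matrix G (assumed for illustration).
G = np.array([
    [1.0,  0.0, 0.0],
    [0.5,  0.5, 0.5],
    [0.5, -0.5, 0.5],
    [0.0,  0.0, 1.0],
])

def elementwise_multiply(V, U):
    """Element-wise (Hadamard) product of the forward-transformed
    neuron data V and the transformed Winograd weight U."""
    return V * U

g = np.array([1.0, 1.0, 1.0])          # a 3-tap filter
U = G @ g                              # Winograd weight, pre-transformed
V = np.array([-2.0, 5.0, 1.0, -2.0])   # forward-transformed input tile
print(elementwise_multiply(V, U))      # elementwise product: [-2, 7.5, 0.5, -2]
```

In the apparatus the weights arrive already in Winograd form (clause A2 loads "Winograd weights"), so only the product itself runs on the operator.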

Clause A5. The apparatus of clause A4, further comprising an inverse-transform unit configured to initiate inverse-transforming the element-wise product data to produce a convolution result before the element-wise product data are fully produced.
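Putting the steps of clauses A1, A4, and A5 together, an illustrative end-to-end F(2,3) pipeline (standard matrices assumed, not prescribed by the patent) can be checked against direct convolution:

```python
import numpy as np

# Standard F(2,3) Winograd matrices (assumed for illustration only).
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G   = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
                [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """F(2,3): two convolution outputs from a 4-element tile and a 3-tap filter."""
    V = B_T @ d      # forward transform of neuron data (clause A1)
    U = G @ g        # Winograd weight
    M = U * V        # element-wise multiply (clause A4)
    return A_T @ M   # inverse transform -> convolution result (clause A5)

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 2.0, 3.0])
# Direct sliding-window reference:
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
print(winograd_f23(d, g))  # -> [14. 20.]
```

The hardware version differs in that the four stages run on separate units and overlap in time, as the remaining clauses spell out.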

Clause A6. The apparatus of clause A5, further comprising a convolution-result cache configured to start temporarily storing the convolution result before the convolution result is fully produced.

Clause A7. The apparatus of clause A5, further comprising a convolution-result cache configured to start temporarily storing the convolution result after the convolution result is fully produced.

Clause A8. The apparatus of clause A5, wherein before the inverse-transform unit fully produces the convolution result, the element-wise multiply-accumulate operator initiates an element-wise multiply-accumulate operation on the next forward-transformed data and the next Winograd weights to produce the next element-wise product data.

Clause A9. The apparatus of clause A4, wherein the weight cache initiates loading of the next Winograd weights from the off-chip memory before the element-wise multiply-accumulate operator fully produces the element-wise product data.

Clause A10. The apparatus of clause A1, wherein the neuron cache initiates loading of the next neuron data from the off-chip memory before the forward-transform unit fully produces the forward-transformed data.
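The overlap described in clauses A8 to A10 amounts to double buffering: the next tile's load is in flight while the current tile is being transformed. A hypothetical software sketch (all function bodies here are invented stand-ins; the apparatus does this with hardware caches and DMA, not threads):

```python
from concurrent.futures import ThreadPoolExecutor

def load_tile(i):
    """Stand-in for an off-chip load of neuron-data tile i."""
    return list(range(i, i + 4))

def forward_transform(tile):
    """Stand-in for the real forward transform."""
    return [x * 2 for x in tile]

def pipeline(num_tiles):
    """Transform tile i while the load of tile i+1 is already in flight."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_tile, 0)        # start the first load
        for i in range(num_tiles):
            tile = pending.result()                  # wait for tile i
            if i + 1 < num_tiles:
                pending = loader.submit(load_tile, i + 1)  # prefetch tile i+1 ...
            results.append(forward_transform(tile))  # ... while transforming tile i
    return results

print(pipeline(3))  # -> [[0, 2, 4, 6], [2, 4, 6, 8], [4, 6, 8, 10]]
```

The same prefetch pattern applies stage by stage: weights ahead of the multiply (clause A9), and the next multiply ahead of the inverse transform (clause A8).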

Clause A11. An integrated circuit device comprising the apparatus of any one of clauses A1 to A10.

Clause A12. A board card comprising the integrated circuit device of clause A11.

The embodiments of the present invention have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present invention. The description of the above embodiments is intended only to help understand the method of the present invention and its core ideas. At the same time, persons of ordinary skill in the art, following the ideas of the present invention, may make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (12)

1. An apparatus for computing Winograd convolution, connected to an off-chip memory that stores neuron data and a plurality of instructions, the apparatus comprising: a neuron cache configured to initiate loading of the neuron data from the off-chip memory; and a forward-transform unit configured to, before loading of the neuron data completes, initiate reading the neuron data from the neuron cache and forward-transforming it to produce forward-transformed data.
2. The apparatus of claim 1, wherein the off-chip memory further stores Winograd weights, and the apparatus further comprises a weight cache configured to initiate loading of the Winograd weights from the off-chip memory before the forward-transformed data are fully produced.
3. The apparatus of claim 2, wherein initiating the forward transform and initiating the loading of the Winograd weights are performed simultaneously.
4. The apparatus of claim 2, further comprising an element-wise multiply-accumulate operator configured to, before loading of the Winograd weights completes, initiate an element-wise multiply-accumulate operation on the forward-transformed data and the Winograd weights to produce element-wise product data.
5. The apparatus of claim 4, further comprising an inverse-transform unit configured to initiate inverse-transforming the element-wise product data to produce a convolution result before the element-wise product data are fully produced.
6. The apparatus of claim 5, further comprising a convolution-result cache configured to start temporarily storing the convolution result before the convolution result is fully produced.
7. The apparatus of claim 5, further comprising a convolution-result cache configured to start temporarily storing the convolution result after the convolution result is fully produced.
8. The apparatus of claim 5, wherein before the inverse-transform unit fully produces the convolution result, the element-wise multiply-accumulate operator initiates an element-wise multiply-accumulate operation on the next forward-transformed data and the next Winograd weights to produce the next element-wise product data.
9. The apparatus of claim 4, wherein the weight cache initiates loading of the next Winograd weights from the off-chip memory before the element-wise multiply-accumulate operator fully produces the element-wise product data.
10. The apparatus of claim 1, wherein the neuron cache initiates loading of the next neuron data from the off-chip memory before the forward-transform unit fully produces the forward-transformed data.
11. An integrated circuit device comprising the apparatus of any one of claims 1 to 10.
12. A board card comprising the integrated circuit device of claim 11.
CN202011579087.4A 2020-12-28 2020-12-28 Device and board card for calculating Winograd convolution Pending CN114692810A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011579087.4A CN114692810A (en) 2020-12-28 2020-12-28 Device and board card for calculating Winograd convolution

Publications (1)

Publication Number Publication Date
CN114692810A true CN114692810A (en) 2022-07-01

Family

ID=82129552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011579087.4A Pending CN114692810A (en) 2020-12-28 2020-12-28 Device and board card for calculating Winograd convolution

Country Status (1)

Country Link
CN (1) CN114692810A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN107679621A (en) * 2017-04-19 2018-02-09 北京深鉴科技有限公司 Artificial Neural Network Processing Device
CN108053848A (en) * 2018-01-02 2018-05-18 清华大学 Circuit structure and neural network chip
CN109325591A (en) * 2018-09-26 2019-02-12 中国科学院计算技术研究所 A neural network processor for Winograd convolution
CN109919311A (en) * 2019-03-13 2019-06-21 北京地平线机器人技术研发有限公司 The method for generating instruction sequence, the method and apparatus for executing neural network computing
CN111340201A (en) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolutional operation thereof
CN111459877A (en) * 2020-04-02 2020-07-28 北京工商大学 Winograd YOLOv2 target detection model method based on FPGA acceleration

Similar Documents

Publication Publication Date Title
CN110597559B (en) Computing device and computing method
CN109522052B (en) Computing device and board card
CN109543832B (en) Computing device and board card
CN115221102B (en) Method for optimizing convolution operation of system-on-chip and related product
CN107632965B (en) Reconfigurable S-shaped computing device and computing method
CN111047022B (en) Computing device and related product
CN111488976B (en) Neural network computing device, neural network computing method and related products
CN111488963A (en) Neural network computing device and method
CN113918221B (en) Operation module, flow optimization method and related products
CN111353124A (en) Computing method, apparatus, computer equipment and storage medium
CN115438777A (en) Apparatus for performing Winograd convolutional forward transformation on neuron data
CN111368986B (en) A neural network computing device and method
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN114692810A (en) Device and board card for calculating Winograd convolution
CN111368967A (en) A neural network computing device and method
CN114692850A (en) Device and board for performing Winograd convolution forward transformation on neuron data
CN114692811A (en) Device and board card for executing Winograd convolution
CN114692848A (en) Devices and boards for obtaining convolution results
CN112766475B (en) Processing component and artificial intelligence processor
CN115081601A (en) Computing device and method for synchronous Winograd convolution
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN115081605A (en) Buffer memory, device and board card for temporarily storing neuron data in Winograd convolution
CN115438778A (en) Integrated circuit device for performing Winograd convolution
CN111368987B (en) Neural network computing device and method
WO2022134688A1 (en) Data processing circuit, data processing method, and related products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination