
CN108875925A - Control method and device for a convolutional neural network processor - Google Patents

Control method and device for a convolutional neural network processor

Info

Publication number
CN108875925A
CN108875925A (application CN201810685989.2A)
Authority
CN
China
Prior art keywords
input feature
loaded
numerical value
convolutional calculation
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810685989.2A
Other languages
Chinese (zh)
Inventor
韩银和 (Han Yinhe)
许浩博 (Xu Haobo)
王颖 (Wang Ying)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201810685989.2A
Publication of CN108875925A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides a control method, comprising: 1) determining the size n*n of the convolution operation to be performed; 2) according to the size n*n of the convolution operation to be performed, loading the values of the convolution kernel corresponding to the size into m² 7*7 convolution calculation units and filling each remaining value with 0, where 7m ≥ n; 3) determining the number of cycles required for the convolution calculation process according to the size of the convolution operation to be performed and the size of the input feature map to be convolved; 4) in each cycle of the convolution calculation process, loading the values of the corresponding input feature map into the m² 7*7 convolution calculation units, the distribution of the input feature map values in the m² 7*7 convolution calculation units being kept consistent with the distribution of the convolution kernel values in them; and controlling the m² 7*7 convolution calculation units loaded with the values of the convolution kernel and of the input feature map to perform the convolution calculations corresponding to the number of cycles.

Description

Control method and device for a convolutional neural network processor

Technical Field

The present invention relates to convolutional neural network processors, and in particular to improvements in hardware acceleration for convolutional neural network processors.

Background Art

Artificial intelligence technology has developed rapidly in recent years and attracted widespread attention all over the world; both industry and academia have carried out research on artificial intelligence, bringing it into fields such as visual perception, speech recognition, assisted driving, smart homes, and traffic scheduling. Deep learning is a booster for the development of artificial intelligence technology. Deep learning uses deep neural network topologies for training, optimization, and inference; deep neural networks include convolutional neural networks, deep belief networks, recurrent neural networks, and the like, which are refined through repeated iteration and training. Taking image recognition as an example, a deep learning algorithm can automatically extract the hidden feature data of an image through a deep neural network and produce better results than traditional analysis methods based on pattern recognition.

However, implementing existing deep learning techniques depends on an enormous amount of computation. In the training phase, the weights of the neural network must be obtained by repeated iterative calculation over massive data; in the inference phase, the neural network must likewise complete the processing of the input data within an extremely short response time (usually milliseconds). This requires the deployed neural network computing circuits (CPUs, GPUs, FPGAs, ASICs, and the like) to reach hundreds of billions or even trillions of operations per second. Hardware acceleration for deep learning, for example hardware acceleration of convolutional neural network processors, is therefore very necessary.

It is generally held that hardware acceleration can be achieved in roughly two ways: one is to use larger-scale hardware to perform computations in parallel, and the other is to improve processing speed or efficiency by designing dedicated hardware circuits.

Regarding the second approach, some prior art maps the neural network directly onto a hardware circuit, using a different computing unit for each network layer so that the computations for the layers proceed in a pipelined fashion. For example, every computing unit except the first takes the output of the preceding unit as its input, each unit performs only the computation of its corresponding network layer, and in different unit times of the pipeline a unit computes on different inputs to that layer. Such prior art generally targets scenarios that must process a continuous stream of different inputs, such as a video file containing many frames, and it usually targets neural networks with relatively few layers. The reason is that deep neural networks have many layers and a large scale; mapping such a network directly onto a hardware circuit incurs a very large circuit area, and power consumption grows with circuit area. Moreover, since the computation times of the layers differ considerably, realizing the pipeline requires the running time allotted to every pipeline stage to be forced equal, namely to the computation time of the slowest stage. For a deep neural network with many layers, designing the pipeline must take a great many factors into account in order to reduce the waiting time of the faster pipeline stages.

Other prior art, drawing on the regularity of neural network computation, proposes "time-division multiplexing" of the computing units in a neural network processor to raise their reuse rate. Unlike the pipelined approach above, the same computing unit computes the network layers one after another, for example the input layer, the first hidden layer, the second hidden layer, ..., the output layer, with the process repeated in the next iteration. Such prior art suits both networks with few layers and deep neural networks, and it is especially suitable for applications with limited hardware resources. In such applications, after the neural network processor has computed network layer A for one input, it may not need to compute layer A again for a long time; if every network layer had its own dedicated hardware computing unit, the hardware reuse rate would be low. Most existing techniques are based on this consideration and improve the neural network processor hardware with various forms of time-division multiplexing of the computing units.

However, no matter which of the above prior-art approaches is used to design a convolutional neural network processor, hardware utilization still leaves room for improvement.

Summary of the Invention

Therefore, the object of the present invention is to overcome the above defects of the prior art and to provide a control method for a convolutional neural network processor having 7*7 convolution calculation units, the control method comprising:

1) determining the convolution kernel size n*n of the convolution operation to be performed;

2) according to the convolution kernel size n*n of the convolution operation to be performed, loading the values of the convolution kernel corresponding to the size into m² 7*7 convolution calculation units, and filling each remaining value with 0, where 7m ≥ n;

3) determining the number of cycles required for the convolution calculation process according to the size of the convolution operation to be performed and the size of the input feature map to be convolved; and

4) according to the number of cycles, loading, in each cycle of the convolution calculation process, the values of the corresponding input feature map into the m² 7*7 convolution calculation units, the distribution of the input feature map values in the m² 7*7 convolution calculation units being kept consistent with the distribution of the convolution kernel values in the m² 7*7 convolution calculation units;

controlling the m² 7*7 convolution calculation units loaded with the values of the convolution kernel and of the input feature map to respectively perform the convolution calculations corresponding to the number of cycles;

5) accumulating the corresponding elements in the convolution calculation results of the m² 7*7 convolution calculation units to obtain the output feature map of the final convolution operation.
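
A minimal sketch of steps 1) through 5) in Python follows, assuming unit stride and no padding; the function name and the NumPy formulation are illustrative, not the patent's implementation.

```python
import numpy as np

def controlled_conv(feature_map, kernel):
    n = kernel.shape[0]                      # step 1: kernel size n*n
    m = -(-n // 7)                           # step 2: smallest m with 7m >= n
    side = 7 * m
    k_pad = np.zeros((side, side))
    k_pad[:n, :n] = kernel                   # remaining values filled with 0
    H, W = feature_map.shape
    out_h, out_w = H - n + 1, W - n + 1      # step 3: cycles = out_h * out_w
    out = np.zeros((out_h, out_w))
    for r in range(out_h):                   # one cycle per output element
        for c in range(out_w):
            x_pad = np.zeros((side, side))   # step 4: same distribution as kernel
            x_pad[:n, :n] = feature_map[r:r + n, c:c + n]
            acc = 0.0
            for i in range(m):               # step 5: accumulate over the m*m units
                for j in range(m):
                    tile = slice(7 * i, 7 * i + 7), slice(7 * j, 7 * j + 7)
                    acc += np.sum(k_pad[tile] * x_pad[tile])  # one unit's MAC
            out[r, c] = acc
    return out

# A 5*5 kernel over a 10*10 feature map, for example, yields a 6*6 output.
print(controlled_conv(np.random.rand(10, 10), np.random.rand(5, 5)).shape)
```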

Preferably, in the method, step 2) comprises:

if the size of the convolution operation to be performed is smaller than 7*7, loading the values of the convolution kernel corresponding to the size into a single 7*7 convolution calculation unit and filling each remaining value with 0;

if the size of the convolution operation to be performed is larger than 7*7, loading the values of the convolution kernel corresponding to the size into a corresponding number of 7*7 convolution calculation units and filling each remaining value with 0.

Preferably, in the method, step 4) comprises:

in each cycle of the convolution calculation process, if the values of the input feature map to be loaded include elements of the leftmost column of the input feature map, loading in one operation the elements of the input feature map that match the size of the convolution operation to be performed into the corresponding positions of the convolution calculation unit and filling the value of each remaining position with 0; otherwise, moving the elements identical to those of the previous cycle one unit to the left as a whole, and loading the elements of the input feature map that differ from the previous cycle and need updating into the positions vacated by the move.

Preferably, in the method, step 4) comprises:

in each cycle of the convolution calculation process, controlling the m² 7*7 convolution calculation units to respectively multiply the loaded elements of the input feature map by the kernel elements at the corresponding positions and to accumulate the results of the multiplications, so as to obtain the element at the corresponding position of the output feature map.

Preferably, in the method, step 2) comprises:

if the size of the convolution operation to be performed is 5*5, loading the values of the 5*5 convolution kernel into a single 7*7 convolution calculation unit and filling each remaining value with 0;

and step 4) comprises:

in every cycle of the convolution calculation, loading the values of the corresponding input feature map into the 7*7 convolution calculation unit, the distribution of the input feature map values in the 7*7 convolution calculation unit being kept consistent with the distribution of the 5*5 convolution kernel values in the 7*7 convolution calculation unit;

wherein, in each cycle of the convolution calculation process, if the values of the input feature map to be loaded include elements of the leftmost column of the input feature map, the 25 elements of the 5*5 region of the input feature map are loaded in one operation into the corresponding positions of the convolution calculation unit and the value of each remaining position is filled with 0; otherwise, the elements identical to those of the previous cycle are moved one unit to the left as a whole, and the 5 elements of the input feature map that differ from the previous cycle and need updating are loaded into the positions vacated by the move.

Preferably, in the method, step 2) comprises:

if the size of the convolution operation to be performed is 3*3, loading the values of 3*3 convolution kernels for at most 4 channels into a single 7*7 convolution calculation unit and filling each remaining value with 0;

and step 4) comprises:

in every cycle of the convolution calculation, loading the values of the corresponding input feature map into the 7*7 convolution calculation unit, the input feature map values being loaded into the 7*7 convolution calculation unit as one or more copies equal in number to the 3*3 convolution kernels, the distribution of the input feature map in the 7*7 convolution calculation unit corresponding to the distribution of the 3*3 convolution kernel values in the 7*7 convolution calculation unit;

wherein, in each cycle of the convolution calculation process, if the values of the input feature map to be loaded include elements of the leftmost column of the input feature map, the 9 elements of the 3*3 region of the input feature map are loaded in one operation into the corresponding positions of the convolution calculation unit and the value of each remaining position is filled with 0; otherwise, the elements identical to those of the previous cycle are moved one unit to the left as a whole, and the 3 elements of the input feature map that differ from the previous cycle and need updating are loaded into the corresponding positions vacated by the move.

Preferably, in the method, step 4) further comprises:

if 3*3 convolution kernel values for 2 or 4 channels are loaded into the same 7*7 convolution calculation unit and the values of the input feature map to be loaded do not include elements of the leftmost column of the input feature map, moving one unit to the left, as a whole, the elements identical to those of the previous cycle within the 2 copies of the input feature map that occupy the same columns but different rows of the 7*7 convolution calculation unit, and loading the 3 elements of the input feature map that differ from the previous cycle and need updating into the corresponding positions vacated by the move.

Preferably, in the method, step 2) comprises:

if the size of the convolution operation to be performed is 11*11, controlling four 7*7 convolution calculation units to jointly load the values of the 11*11 convolution kernel and filling each remaining value with 0;

and step 4) comprises:

in every cycle of the convolution calculation, loading the values of the corresponding input feature map into the four 7*7 convolution calculation units, the distribution of the input feature map values in the four 7*7 convolution calculation units being kept consistent with the distribution of the 11*11 convolution kernel values in the four 7*7 convolution calculation units;

wherein, in each cycle of the convolution calculation process, if the values of the input feature map to be loaded include elements of the leftmost column of the input feature map, the 121 elements of the 11*11 region of the input feature map are loaded in one operation into the corresponding positions of the convolution calculation units and the value of each remaining position is filled with 0; otherwise, the elements identical to those of the previous cycle are moved one unit to the left as a whole, and the 11 elements of the input feature map that differ from the previous cycle and need updating are loaded into the corresponding positions vacated by the move.

Preferably, in the method, step 4) comprises:

in each cycle of the convolution calculation process, controlling the four 7*7 convolution calculation units to respectively multiply the loaded elements of the input feature map by the kernel elements at the corresponding positions and to accumulate the results of the multiplications;

and step 5) comprises: accumulating the calculation results of all four 7*7 convolution calculation units to obtain the element at the corresponding position of the output feature map.

The invention also provides a control unit configured to implement any one of the control methods described above.

The invention further provides a convolutional neural network processor comprising 7*7 convolution calculation units and a control unit, the control unit being configured to implement any one of the methods described above.

Compared with the prior art, the present invention has the following advantages:

The reuse rate of the computing units that perform convolution is improved, reducing the number of hardware computing units that must be provided in a convolutional neural network processor. The processor need not provide a large number of hardware computing units of different sizes for the different convolutional layers that require convolution kernels of different sizes. When the computation of one convolutional layer is performed, computing units whose size does not match that layer's convolution kernel can also be used, thereby raising the utilization of the hardware computing units in the convolutional neural network processor.

Brief Description of the Drawings

Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:

Fig. 1 is a schematic diagram of the prior art in which M convolution kernels, each having N channels, are applied to an input layer to obtain output layers;

Fig. 2 is a schematic diagram of the prior art implementing a 7*7 convolution operation with one 7*7 computing unit;

Fig. 3 is a schematic diagram of implementing a 5*5 convolution operation with one 7*7 computing unit according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of implementing 3*3 convolution operations for 4 channels at once with one 7*7 computing unit according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of implementing an 11*11 convolution operation with four 7*7 computing units according to an embodiment of the present invention.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

In studying the prior art, the inventors found that the various existing classical neural networks, such as AlexNet, GoogLeNet, VGG, and ResNet, all contain different numbers of convolutional layers, and the kernel sizes used by the different convolutional layers also differ. In AlexNet, for example, the first layer of the network is a convolutional layer with an 11*11 kernel, the second layer a convolutional layer with a 5*5 kernel, the third layer a convolutional layer with a 3*3 kernel, and so on.

In the various existing neural network processors, however, different computing units are provided for convolution kernels of different sizes. As a result, while the computation of a given convolutional layer is being performed, all computing units whose size does not match that layer's convolution kernel sit idle.

For example, as shown in Fig. 1, a neural network processor may provide M different convolution kernels, denoted kernel 0 through kernel M-1, each having N channels used to perform convolution over the N channels of the input layer; convolving one kernel with an input layer yields one output layer, so applying all M kernels to one input layer yields M output layers. If a certain input layer needs a convolution operation with kernel 1, all computing units other than the one corresponding to kernel 1 are idle at that moment.
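
As a rough model of Fig. 1, with illustrative sizes chosen only for this sketch: an N-channel input layer convolved with M kernels of N channels each yields M output layers, one per kernel, the per-channel products being accumulated into each output value.

```python
import numpy as np

N, M, H, W, k = 3, 8, 10, 10, 5          # illustrative sizes
inputs = np.random.rand(N, H, W)         # one input layer with N channels
kernels = np.random.rand(M, N, k, k)     # kernel 0 .. kernel M-1
outputs = np.zeros((M, H - k + 1, W - k + 1))
for mi in range(M):                      # one output layer per kernel
    for r in range(H - k + 1):
        for c in range(W - k + 1):
            # products over all N channels accumulate into one output value
            outputs[mi, r, c] = np.sum(kernels[mi] * inputs[:, r:r + k, c:c + k])
print(outputs.shape)                     # (8, 6, 6): M output layers
```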

To this end, the present invention proposes a reuse scheme for the computing units that adjusts, through control, the data actually loaded into a computing unit (the same unit must be loaded both with the values of the convolution kernel and with values from the input feature map), so that computing units of 7*7 scale can carry out convolution operations of several sizes, reducing the scale of the hardware computing units that must be employed for convolution.

The neural network processor architecture adopted by the present invention may comprise the following five parts: an input data storage unit, a control unit, an output data storage unit, a weight storage unit, and a computing unit.

The input data storage unit stores the data participating in the computation; the output data storage unit stores the computed neuron response values; the weight storage unit stores the trained neural network weights.

The control unit is connected to the output data storage unit, the weight storage unit, and the computing unit, and controls the computing unit to perform the neural network computation according to the decoded control signals.

The computing unit performs the corresponding neural network computation according to the control signals generated by the control unit. It carries out most of the operations of the neural network algorithm, namely vector multiply-accumulate operations and the like.

The reuse of the computing units according to the present invention can be realized by having the above control unit control the computing units, as described in detail through several embodiments below.

First, consider how the traditional prior art uses a 7*7 computing unit to implement a 7*7 convolution operation. Referring to the example given in Fig. 2, in the prior art a computing unit of 7*7 scale implements the convolution operation as follows:

In the first cycle, each element of rows 1-7, columns 1-7 of the input feature map (referred to here as the sliding window over the input feature map) is multiplied by the element at the corresponding position of the convolution kernel, and the accumulated sum of the products becomes the element at row 1, column 1 of the output feature map, i.e. 2×(-4)+(3×2)+(-2×(-4))+(2×(-8))+(-7×3) = -31.

In the second cycle, each element of rows 1-7, columns 2-8 of the input feature map (the values in the sliding window for the current cycle) is multiplied by the element at the corresponding position of the kernel, and the accumulated sum of the products becomes the element at row 1, column 2 of the output feature map (not shown in Fig. 2).

Continuing in this way, the 7*7 sliding window is moved right or down 15 times in total, yielding an output feature map of size 4*4.
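
A compact check of this baseline behavior, with illustrative random values: a 7*7 kernel slid over a 10*10 input visits 4*4 window positions.

```python
import numpy as np

x, k = np.random.rand(10, 10), np.random.rand(7, 7)
out = np.array([[np.sum(x[r:r + 7, c:c + 7] * k) for c in range(4)]
                for r in range(4)])
print(out.shape)   # (4, 4): 16 window positions, reached by 15 moves
```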

The present invention does not exclude implementing a 7*7 convolution with a 7*7 computing unit in the above manner. Furthermore, in the present invention, a computing unit of 7*7 scale can also be controlled to carry out convolutions of kernel sizes other than 7*7, for example 5*5, 3*3, or 11*11.

As introduced above, in the traditional prior art the size of the output feature map is determined by the number of sliding-window moves and the kernel size. For example, for a 7*7 convolution over a 10*10 input feature map, the sliding window moves over a range of 4 units both horizontally and vertically, and a 4*4 output feature map is obtained over a number of cycles. This makes it very difficult to implement convolutions of other sizes with a 7*7 computing unit. It will be appreciated that, following the prior art, a 7*7 computing unit convolving a 10*10 input feature map can only yield a 4*4 output feature map (as shown, for example, in Fig. 2); neither the computing unit nor the processor knows how to move the sliding window so that the 7*7 unit realizes, say, a 5*5 convolution.

To this end, the present invention proposes a corresponding control method that schedules the input feature map and convolution kernel loaded into the computing unit and controls the multiply and add operations, so that a 7*7 computing unit performs a 5*5 convolution.

According to an embodiment of the present invention, referring to Fig. 3, the specific control method is as follows:

While the computing unit performs the convolution calculation, the control loads the appropriate values of the convolution kernel and of the input feature map into the 7*7 computing unit at each slide of the window.

As shown in Fig. 3, the input feature map is 10*10 and the convolution operation to be performed is 5*5, so the convolution calculation can be determined to require a total of 6×6 = 36 cycles.

In the first cycle, the elements of rows 1-5, columns 1-5 of the input feature map are loaded into the 7*7 computing unit as its rows 1-5, columns 1-5, and the remaining elements of rows 6-7 and columns 6-7 are filled with "0"; the 5*5 convolution kernel is likewise loaded as rows 1-5, columns 1-5, with the remaining elements of rows 6-7 and columns 6-7 filled with "0". The 7*7 computing unit thus holds the values of both the input feature map and the kernel. The 7*7 unit is controlled to multiply the input feature map and kernel elements at corresponding positions and accumulate the products, giving the element at row 1, column 1 of the output feature map: (2×(-4))+(3×2)+(-2×(-4))+(2×(-8)) = -10. Since every element of the computing unit outside the original 5*5 kernel values is 0, the result is exactly the same as a convolution performed on a genuine 5*5 computing unit.
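
The equivalence stated at the end of this step can be checked directly; the sketch below uses random values rather than the ones in Fig. 3.

```python
import numpy as np

patch, kernel = np.random.rand(5, 5), np.random.rand(5, 5)
unit_x, unit_k = np.zeros((7, 7)), np.zeros((7, 7))
unit_x[:5, :5], unit_k[:5, :5] = patch, kernel   # rows/columns 6-7 stay 0
# the padded positions contribute 0 to every product, so the 7*7 MAC
# equals a genuine 5*5 convolution at this window position
assert np.isclose(np.sum(unit_x * unit_k), np.sum(patch * kernel))
```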

In the second cycle, all elements of rows 1-5, columns 2-6 of the input feature map (i.e. "0,0,2,0,-3; 0,3,-2,5,0; 0,0,0,2,0; 0,0,0,3,0; 0,0,0,0,0") are loaded into the computing unit as its new rows 1-5, columns 1-5. The unit is controlled to multiply and accumulate the loaded elements to obtain the element at row 1, column 2 of the output feature map.

According to a preferred embodiment of the present invention, the way the input feature map data is loaded into the 7*7 computing unit in the second cycle can be further improved to raise loading efficiency. That is, referring to Fig. 3, all elements of rows 1-5, columns 2-5 of the 7*7 unit (i.e. "0,0,2,0; 0,3,-2,5; 0,0,0,2; 0,0,0,3; 0,0,0,0") are shifted left by one unit as a whole to become the new rows 1-5, columns 1-4, and the elements of rows 1-5, column 6 of the input feature map (i.e. "-3; 0; 0; 0; 0") are loaded into the unit as the new rows 1-5, column 5. The input feature map values held in the 7*7 unit are thereby updated, achieving an effect similar to the sliding window of the traditional scheme. The unit is then likewise controlled to multiply and accumulate the loaded elements to obtain the element at row 1, column 2 of the output feature map.
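
A minimal sketch of this shift-and-load update, assuming a 5*5 active region held in a 7*7 array; the function name is illustrative.

```python
import numpy as np

def shift_and_load(unit, new_column, size=5):
    unit[:size, :size - 1] = unit[:size, 1:size]   # shift the active region left
    unit[:size, size - 1] = new_column             # load only the 5 fresh values
    return unit

fmap = np.random.rand(10, 10)
unit = np.zeros((7, 7))
unit[:5, :5] = fmap[:5, :5]                        # cycle 1: full 25-value load
unit = shift_and_load(unit, fmap[:5, 5])           # cycle 2: only 5 values loaded
assert np.allclose(unit[:5, :5], fmap[:5, 1:6])    # same as sliding the window
```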

The third through sixth cycles proceed in the same way.

In the seventh cycle, the elements of rows 2-6, columns 1-5 of the input feature map are loaded into the 7*7 computing unit as its rows 1-5, columns 1-5, and the unit is controlled to multiply and accumulate the loaded elements to obtain the element at row 2, column 1 of the output feature map. In the subsequent eighth through twelfth cycles, the corresponding input feature map elements are loaded in a manner similar to the second through sixth cycles. This continues until all thirty-six cycles are complete and a 6*6 output feature map is obtained.

It can be seen that under this control method, the 25 values of the 5*5 region of the input feature map are loaded into the computing unit in one operation in the first cycle, and likewise in the seventh, thirteenth, nineteenth, twenty-fifth, and thirty-first cycles. Correspondingly, in the second through sixth cycles only 5 values of the input feature map need to be loaded each time, the 20 values also used in the previous cycle being shifted left, while the kernel values already loaded in the unit are left unchanged. The eighth through twelfth, fourteenth through eighteenth, twentieth through twenty-fourth, twenty-sixth through thirtieth, and thirty-second through thirty-sixth cycles load input feature map elements in the same manner as the second through sixth.

This ensures that in every cycle the position of each input feature map element in the computing unit corresponds one-to-one with the position of the kernel element it is multiplied by. Moreover, units other than the one implementing the control method of the present invention, such as the computing unit itself or the processor, are unaware that the 7*7 convolution unit is actually carrying out a 5*5 convolution. In addition, under this control method the input feature map values loaded by the computing unit in each cycle do not depend directly on a sliding window: the arrangement of the loaded input feature map values does not depend on the actual arrangement of the values within a sliding window of size 5*5, and the number of calculation cycles does not depend on the number of moves of a sliding window of size 7*7 (i.e. 4*4). The number and size of the outputs are governed by the control method of the present invention, so a 7*7 computing unit can perform a 5*5 convolution over a 10*10 input feature map and thereby obtain a 6*6 output.

Similarly, a convolution smaller than 7*7, for example a 3*3 convolution, can be performed in a single 7*7 computing unit in a manner analogous to the example of Fig. 3 above. That is, the 3*3 convolution kernel is loaded into the 7*7 computing unit and the remaining values are filled with "0"; the number of cycles required is determined from the size of the input feature map and the 3*3 size of the convolution; and in each cycle the corresponding values of the input feature map are loaded into the 7*7 unit to carry out the convolution.

It will be appreciated that performing only a single 3*3 convolution at a time on a 7*7 computing unit does not use the hardware efficiently. The present invention therefore also proposes a scheme in which 3*3 convolutions for four channels are performed at once on the same input feature map in a single 7*7 computing unit.

According to an embodiment of the present invention, a control method is provided for performing 3*3 convolutions with a 7*7 computing unit; referring to Fig. 4, the specific control method is as follows:

The input feature map is 10*10 and the convolution operation to be performed is 3*3, so the convolution calculation can be determined to require a total of 8×8 = 64 cycles.

In the first cycle, the elements of rows 1-3, columns 1-3 of the input feature map are copied four times and loaded into rows 1-3 columns 1-3, rows 1-3 columns 4-6, rows 4-6 columns 1-3, and rows 4-6 columns 4-6 of the 7*7 computing unit, and the remaining elements of row 7 and column 7 are filled with "0"; the four 3*3 convolution kernels for the four channels are loaded into the same four regions of the 7*7 unit, with the remaining elements of row 7 and column 7 filled with "0". In the embodiment shown in Fig. 3 each kernel serves one channel, and when the result of a multi-channel convolution is computed, the per-channel convolution results at the same position can be accumulated to give the output at that position. In the present invention, after the above elements of the input feature map and of the kernels have been loaded, the 7*7 computing unit is controlled to multiply and accumulate the elements at corresponding positions; the result obtained is identical to accumulating the convolution results of the four channels at the same position, giving the element at row 1, column 1 of the output feature map.
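
The claim that one 7*7 multiply-accumulate reproduces the accumulated four-channel result can be checked as below, with illustrative random values and the quadrant layout just described.

```python
import numpy as np

patch   = np.random.rand(3, 3)               # input rows 1-3, columns 1-3
kernels = np.random.rand(4, 3, 3)            # one 3*3 kernel per channel
unit_x, unit_k = np.zeros((7, 7)), np.zeros((7, 7))
slots = [(0, 0), (0, 3), (3, 0), (3, 3)]     # the four 3*3 regions of the unit
for (r, c), ker in zip(slots, kernels):
    unit_x[r:r + 3, c:c + 3] = patch         # four copies of the same patch
    unit_k[r:r + 3, c:c + 3] = ker
four_channel_sum = sum(np.sum(patch * ker) for ker in kernels)
assert np.isclose(np.sum(unit_x * unit_k), four_channel_sum)
```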

In the second cycle, all elements of rows 1-3, columns 2-3 of the 7*7 computing unit (i.e. "0,0; 0,3; 0,0") are shifted left by one unit as a whole to become the new rows 1-3, columns 1-2, and the elements of rows 1-3, column 4 of the input feature map (i.e. "2; -2; 0") are loaded into the unit as the new rows 1-3, column 3. The same shift-and-load operation is performed on the regions originally at rows 1-3 columns 4-6, rows 4-6 columns 1-3, and rows 4-6 columns 4-6, thereby updating the input feature map values held in the 7*7 unit. Further, in the present invention, rows 1-6 columns 2-3 and rows 1-6 columns 5-6 of the unit can each be shifted as one block, and/or the new elements to be loaded can be duplicated into two copies and loaded into the computing unit in one operation, reducing the number of control and operation steps. The unit is likewise controlled to multiply and accumulate the loaded elements to obtain the element at row 1, column 2 of the output feature map.

This continues until all cycles are complete and an 8*8 output feature map is obtained.

It can be seen that under this control method, 3*3 convolutions for four channels can be performed on the input feature map at once, which is especially suitable when the number of channels is large. When the number of channels is not equal to four, at least one 3*3 kernel region of the 7*7 computing unit can instead be filled with "0", for example filling all of rows 4-6, columns 4-6 with "0". When the number of channels exceeds four, for example with seven channels, two calculations can be run under control, loading 4 kernels in the first calculation and 3 in the second, and the elements at the same position in the two results are accumulated to give the output at that position.

The above embodiments have shown how a 7*7 computing unit is controlled to realize 5*5 and 3*3 convolution calculations; the following describes how 7*7 computing units are controlled to realize convolution calculations larger than 7*7.

According to an embodiment of the present invention, a control method is provided for performing an 11*11 convolution operation with 7*7 computing units; referring to Fig. 5, the specific control method is as follows:

First, since 11 > 7, more than one 7*7 computing unit is needed to jointly complete a convolution of size 11*11. One may select the k computing units that exactly suffice to hold data of size 11*11, choosing k = m² with m the smallest positive integer such that 7m ≥ n. Of course, more 7*7 computing units than this number may also be used to perform the 11*11 convolution. For the example shown in Fig. 5, k = 4 computing units are selected.
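
The selection rule reduces to a ceiling division, as the short sketch below shows; the helper name is illustrative.

```python
def units_needed(n):
    """k = m*m units, with m the smallest positive integer such that 7m >= n."""
    m = -(-n // 7)                      # ceil(n / 7)
    return m, m * m

assert units_needed(11) == (2, 4)       # the Fig. 5 case: four 7*7 units
assert units_needed(5) == (1, 1)        # anything up to 7*7 fits in one unit
```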

The control divides the values of the convolution kernel to be used into four parts and loads them into the four 7*7 computing units respectively, filling the remaining positions with "0"; and in each cycle, the control divides the corresponding data of the input feature map into four parts, loads them into the four 7*7 computing units respectively, and fills the remaining positions with "0". Within the four 7*7 units, the distribution of the kernel values is kept consistent with the distribution of the input feature map values.

Each computing unit is controlled to multiply and accumulate the elements loaded into it, and the corresponding calculation results of all four computing units are accumulated to obtain the corresponding value of the output feature map.
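
This split and the final accumulation can be checked as follows, with illustrative random values: the 11*11 kernel and the matching 11*11 input window are cut into 7*7, 7*4, 4*7, and 4*4 blocks, each zero-padded into its own 7*7 unit, and the four per-unit multiply-accumulates sum to the full 11*11 result.

```python
import numpy as np

kernel, window = np.random.rand(11, 11), np.random.rand(11, 11)
blocks = [(slice(0, 7), slice(0, 7)),  (slice(0, 7), slice(7, 11)),
          (slice(7, 11), slice(0, 7)), (slice(7, 11), slice(7, 11))]
acc = 0.0
for rs, cs in blocks:
    unit_k, unit_x = np.zeros((7, 7)), np.zeros((7, 7))
    kb, xb = kernel[rs, cs], window[rs, cs]
    unit_k[:kb.shape[0], :kb.shape[1]] = kb    # the rest of the unit stays 0
    unit_x[:xb.shape[0], :xb.shape[1]] = xb    # same distribution as the kernel
    acc += np.sum(unit_k * unit_x)             # one unit's multiply-accumulate
assert np.isclose(acc, np.sum(kernel * window))
```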

In this embodiment, the input feature map values in each computing unit can further be updated in each cycle by shifting and loading the corresponding values, in a manner similar to the preceding embodiments. For example, in the second cycle: the values in columns 2-7 of the first 7*7 unit (top left in Fig. 5) are shifted left by one unit and new values are loaded into column 7; the values in columns 2-4 of the second unit (top right in Fig. 5) are shifted left by one unit and new values are loaded into column 4; the values in columns 2-7, rows 1-4 of the third unit (bottom left in Fig. 5) are shifted left by one unit and new values are loaded into column 7, rows 1-4; and the values in columns 2-4, rows 1-4 of the fourth unit (bottom right in Fig. 5) are shifted left by one unit and new values are loaded into column 4, rows 1-4.

In the present invention, a control unit can be provided for the above control method. Such a control unit can be fitted to an existing convolutional neural network processor and reuse the computing units used for convolution by carrying out the control method; alternatively, a matching convolutional neural network processor can be designed around the hardware resources such a control unit requires, for example using the minimum amount of hardware resources while satisfying the above reuse scheme.

The scheme provided by the present invention improves the reuse rate of the computing units that perform convolution, so as to reduce the hardware computing units that must be provided in a convolutional neural network processor; the processor need not provide a large number of hardware computing units of different sizes for the different convolutional layers that require convolution kernels of different sizes. When the computation of one convolutional layer is performed, computing units of a single size can realize the convolution calculations of different convolutional layers, thereby raising the utilization of the hardware computing units in the convolutional neural network processor.

It will be appreciated that the present invention does not exclude using larger-scale hardware to perform computations in parallel, as introduced in the background, nor raising the reuse rate of the computing units through time-division multiplexing.

It should also be noted that not all the steps described in the above embodiments are essential; those skilled in the art may make appropriate omissions, substitutions, modifications, and the like according to actual needs.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail above with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution of the present invention do not depart from the spirit and scope of the technical solution, and shall all be covered by the scope of the claims of the present invention.

Claims (11)

1. A control method for a convolutional neural network processor, the convolutional neural network processor having 7*7 convolutional calculation units, the control method comprising:
1) determining the convolution kernel size n*n of the convolution operation to be performed;
2) according to the convolution kernel size n*n of the convolution operation to be performed, loading the values of the convolution kernel corresponding to that size into m² 7*7 convolutional calculation units and filling each of the remaining values with 0, where 7m >= n;
3) determining the number of cycles required for the convolution calculation process according to the size of the convolution operation to be performed and the size of the input feature map on which the convolution is to be performed; and
4) according to that number of cycles, in each cycle of the convolution calculation process, loading the values of the corresponding input feature map into the m² 7*7 convolutional calculation units, the distribution of the values of the input feature map in the m² 7*7 convolutional calculation units being consistent with the distribution of the values of the convolution kernel in the m² 7*7 convolutional calculation units;
controlling the m² 7*7 convolutional calculation units, loaded with the values of the convolution kernel and of the input feature map, to each perform the convolution calculations corresponding to that number of cycles;
5) accumulating the corresponding elements in the convolution calculation results of the m² 7*7 convolutional calculation units to obtain the final output feature map of the convolution operation.
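For illustration only (not part of the claim language), a minimal NumPy sketch of steps 1)-5) for the m = 1 case (kernel no larger than 7*7) follows. Stride 1, no padding, and top-left placement of the kernel are assumptions of the sketch, not requirements stated in the claims; with m = 1 there is a single unit, so step 5)'s cross-unit accumulation is trivial.

```python
import numpy as np

UNIT = 7

def conv_with_7x7_unit(fmap, kernel):
    """Illustrative model of steps 1)-5) for m = 1 (n <= 7), assuming
    stride 1 and no padding; one output element per cycle."""
    n = kernel.shape[0]                                   # step 1): kernel size n*n
    loaded_kernel = np.zeros((UNIT, UNIT), dtype=kernel.dtype)
    loaded_kernel[:n, :n] = kernel                        # step 2): load and zero-fill

    H, W = fmap.shape
    rows, cols = H - n + 1, W - n + 1                     # step 3): rows*cols cycles
    out = np.zeros((rows, cols), dtype=fmap.dtype)
    for r in range(rows):
        for c in range(cols):
            loaded_fmap = np.zeros((UNIT, UNIT), dtype=fmap.dtype)
            loaded_fmap[:n, :n] = fmap[r:r+n, c:c+n]      # step 4): same layout as kernel
            out[r, c] = (loaded_kernel * loaded_fmap).sum()  # claim 4's multiply-accumulate
    return out                                            # step 5): output feature map
```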
2. The method according to claim 1, wherein step 2) comprises:
if the size of the convolution operation to be performed is smaller than 7*7, loading the values of the convolution kernel corresponding to that size into a single 7*7 convolutional calculation unit and filling each of the remaining values with 0;
if the size of the convolution operation to be performed is larger than 7*7, loading the values of the convolution kernel corresponding to that size into the corresponding number of 7*7 convolutional calculation units and filling each of the remaining values with 0.
3. The method according to claim 1, wherein step 4) comprises:
in each cycle of the convolution calculation process, if the values of the input feature map to be loaded include elements from the leftmost column of the input feature map, loading at one time the elements of the input feature map that match the size of the convolution operation to be performed into the corresponding positions of the convolutional calculation unit and filling the values of the remaining positions with 0; otherwise, moving the elements that are the same as in the previous cycle one unit to the left as a whole, and loading the elements of the input feature map that differ from the previous cycle and need to be updated into the positions vacated by that movement.
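For illustration only (not part of the claim language), claim 3's loading rule can be pictured as follows; the sketch assumes stride 1 and that a window touching the leftmost column corresponds to column index 0:

```python
import numpy as np

def load_window(unit_array, fmap, n, r, c):
    """Claim 3's loading rule, stride 1 assumed: a full load when the
    n*n window at (r, c) touches the leftmost column of the feature
    map, otherwise a left shift plus a load of the n new elements."""
    if c == 0:
        unit_array[:n, :n] = fmap[r:r+n, :n]           # one-time full load
    else:
        unit_array[:n, :n-1] = unit_array[:n, 1:n]     # shift left as a whole
        unit_array[:n, n-1] = fmap[r:r+n, c+n-1]       # only n updated elements
    return unit_array
```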
4. The method according to claim 1, wherein step 4) comprises:
in each cycle of the convolution calculation process, controlling each of the m² 7*7 convolutional calculation units to multiply the elements at the corresponding positions of the input feature map and of the convolution kernel loaded into it and to accumulate the products of those multiplications, so as to obtain the element at the corresponding position of the output feature map.
5. The method according to any one of claims 1-4, wherein step 2) comprises:
if the size of the convolution operation to be performed is 5*5, loading the values of the 5*5 convolution kernel into a single 7*7 convolutional calculation unit and filling each of the remaining values with 0;
and step 4) comprises:
in each of the cycles in which the convolution calculation is performed, loading the values of the corresponding input feature map into the 7*7 convolutional calculation unit, the distribution of the values of the input feature map in the 7*7 convolutional calculation unit being consistent with the distribution of the values of the 5*5 convolution kernel in the 7*7 convolutional calculation unit;
wherein, in each cycle of the convolution calculation process, if the values of the input feature map to be loaded include elements from the leftmost column of the input feature map, loading at one time the 25 elements of the 5*5 window of the input feature map into the corresponding positions of the convolutional calculation unit and filling the values of the remaining positions with 0; otherwise, moving the elements that are the same as in the previous cycle one unit to the left as a whole, and loading the 5 elements of the input feature map that differ from the previous cycle and need to be updated into the positions vacated by that movement.
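As a concrete instance of claim 5 (again for illustration only), the m = 1 sketch given after claim 1 can be exercised with a 5*5 kernel; once the window has left the leftmost column, only 5 of the 49 loaded values change per cycle:

```python
import numpy as np

k5 = np.random.randint(-3, 4, (5, 5)).astype(np.int64)
fmap = np.random.randint(0, 8, (12, 12)).astype(np.int64)

out = conv_with_7x7_unit(fmap, k5)    # illustrative sketch defined after claim 1
assert out.shape == (8, 8)            # (12-5+1)*(12-5+1) cycles, one element each
```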
6. The method according to any one of claims 1-4, wherein step 2) comprises:
if the size of the convolution operation to be performed is 3*3, loading into a single 7*7 convolutional calculation unit the values of 3*3 convolution kernels for up to 4 channels and filling each of the remaining values with 0;
and step 4) comprises:
in each of the cycles in which the convolution calculation is performed, loading the values of the corresponding input feature map into the 7*7 convolutional calculation unit, the values of the input feature map being loaded into the 7*7 convolutional calculation unit as one or more copies equal in number to the 3*3 convolution kernels, and the distribution of the input feature map in the 7*7 convolutional calculation unit corresponding to the distribution of the values of the 3*3 convolution kernels in the 7*7 convolutional calculation unit;
wherein, in each cycle of the convolution calculation process, if the values of the input feature map to be loaded include elements from the leftmost column of the input feature map, loading at one time the 9 elements of the 3*3 window of the input feature map into the corresponding positions of the convolutional calculation unit and filling the values of the remaining positions with 0; otherwise, moving the elements that are the same as in the previous cycle one unit to the left as a whole, and loading the 3 elements of the input feature map that differ from the previous cycle and need to be updated into the corresponding positions vacated by that movement.
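One plausible layout for claim 6, for illustration only: four 3*3 kernels occupy the four 3*3 corners of the 7*7 unit (row 3 and column 3 stay 0), and the four channels' input windows are loaded in the same corners. The corner placement, and the reading that the single accumulated sum corresponds to a 4-input-channel convolution, are assumptions made for this sketch, not statements of the claim.

```python
import numpy as np

CORNERS = [(0, 0), (0, 4), (4, 0), (4, 4)]   # assumed placement of four 3*3 blocks

def pack_four_3x3(blocks):
    """Pack up to four 3*3 arrays into one 7*7 array, zeros elsewhere."""
    unit = np.zeros((7, 7), dtype=blocks[0].dtype)
    for (r, c), b in zip(CORNERS, blocks):
        unit[r:r+3, c:c+3] = b
    return unit

kernels = [np.random.randint(-2, 3, (3, 3)).astype(np.int64) for _ in range(4)]
windows = [np.random.randint(0, 5, (3, 3)).astype(np.int64) for _ in range(4)]

# A single multiply-accumulate over the packed unit equals the sum of
# the four per-channel MACs, i.e. the cross-channel accumulation of a
# 4-input-channel 3*3 convolution (one output element per cycle).
packed = (pack_four_3x3(kernels) * pack_four_3x3(windows)).sum()
assert packed == sum((k * w).sum() for k, w in zip(kernels, windows))
```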
7. The method according to claim 6, wherein step 4) further comprises:
if the values of 3*3 convolution kernels for 2 or 4 channels are loaded into the same 7*7 convolutional calculation unit, and the values of the input feature map to be loaded do not include elements from the leftmost column of the input feature map, then, for the 2 copies of the input feature map that occupy the same columns but different rows of the 7*7 convolutional calculation unit, moving the elements that are the same as in the previous cycle one unit to the left as a whole, and loading the 3 elements of the input feature map that differ from the previous cycle and need to be updated into the corresponding positions vacated by that movement.
8. The method according to any one of claims 1-4, wherein step 2) comprises:
if the size of the convolution operation to be performed is 11*11, controlling four 7*7 convolutional calculation units to jointly load the values of the 11*11 convolution kernel and filling each of the remaining values with 0;
and step 4) comprises:
in each of the cycles in which the convolution calculation is performed, loading the values of the corresponding input feature map into the four 7*7 convolutional calculation units, the distribution of the values of the input feature map in the four 7*7 convolutional calculation units being consistent with the distribution of the values of the 11*11 convolution kernel in the four 7*7 convolutional calculation units;
wherein, in each cycle of the convolution calculation process, if the values of the input feature map to be loaded include elements from the leftmost column of the input feature map, loading at one time the 121 elements of the 11*11 window of the input feature map into the corresponding positions of the convolutional calculation units and filling the values of the remaining positions with 0; otherwise, moving the elements that are the same as in the previous cycle one unit to the left as a whole, and loading the 11 elements of the input feature map that differ from the previous cycle and need to be updated into the corresponding positions vacated by that movement.
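For claim 8 (and the accumulation of claim 9), one way to picture the split, again as an illustrative assumption: the 11*11 kernel is tiled into a 7*7 block plus three remainder blocks, each zero-padded to 7*7; the input window is tiled identically, and the four units' accumulated products are summed to give one output element. The 7+4 split along each axis is an assumption of the sketch.

```python
import numpy as np

TILES = [(slice(0, 7), slice(0, 7)), (slice(0, 7), slice(7, 11)),
         (slice(7, 11), slice(0, 7)), (slice(7, 11), slice(7, 11))]

def split_to_four_units(a):
    """Tile an 11*11 array with an assumed 7+4 split along each axis,
    zero-padding every tile to 7*7."""
    blocks = []
    for rs, cs in TILES:
        unit = np.zeros((7, 7), dtype=a.dtype)
        part = a[rs, cs]
        unit[:part.shape[0], :part.shape[1]] = part
        blocks.append(unit)
    return blocks

k11 = np.random.randint(-2, 3, (11, 11)).astype(np.int64)
win = np.random.randint(0, 5, (11, 11)).astype(np.int64)

# Claim 9 / step 5): sum the four units' accumulated products to get
# one output element; it matches the direct 11*11 multiply-accumulate.
four_units = sum((kb * wb).sum() for kb, wb in
                 zip(split_to_four_units(k11), split_to_four_units(win)))
assert four_units == (k11 * win).sum()
```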
9. The method according to claim 8, wherein step 4) comprises:
in each cycle of the convolution calculation process, controlling each of the four 7*7 convolutional calculation units to multiply the elements at the corresponding positions of the input feature map and of the convolution kernel loaded into it and to accumulate the products of those multiplications;
and step 5) comprises: accumulating the calculation results of all four 7*7 convolutional calculation units to obtain the element at the corresponding position of the output feature map.
10. A control unit for implementing the control method according to any one of claims 1-9.
11. A convolutional neural network processor, comprising: 7*7 convolutional calculation units and a control unit, the control unit being configured to implement the method according to any one of claims 1-9.
CN201810685989.2A 2018-06-28 2018-06-28 A kind of control method and device for convolutional neural networks processor Pending CN108875925A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810685989.2A CN108875925A (en) 2018-06-28 2018-06-28 A kind of control method and device for convolutional neural networks processor

Publications (1)

Publication Number Publication Date
CN108875925A true CN108875925A (en) 2018-11-23

Family

ID=64295468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810685989.2A Pending CN108875925A (en) 2018-06-28 2018-06-28 A kind of control method and device for convolutional neural networks processor

Country Status (1)

Country Link
CN (1) CN108875925A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170192638A1 (en) * 2016-01-05 2017-07-06 Sentient Technologies (Barbados) Limited Machine learning based webinterface production and deployment system
US20170337467A1 (en) * 2016-05-18 2017-11-23 Nec Laboratories America, Inc. Security system using a convolutional neural network with pruned filters
CN108205700A (en) * 2016-12-20 2018-06-26 上海寒武纪信息科技有限公司 Neural network computing device and method
CN107818367A (en) * 2017-10-30 2018-03-20 中国科学院计算技术研究所 Processing system and processing method for neutral net
CN107844826A (en) * 2017-10-30 2018-03-27 中国科学院计算技术研究所 Neural-network processing unit and the processing system comprising the processing unit
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KOTA ANDO et al., "A Multithreaded CGRA for Convolutional Neural", Scientific Research *
LI DU et al., "A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things", IEEE Transactions on Circuits and Systems I: Regular Papers *
ZHANG Qiangqiang et al., "Acceleration of Convolutional Neural Networks with Large Image Inputs Based on Block Convolution", Sciencepaper Online (中国科技论文在线) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382859A (en) * 2018-12-27 2020-07-07 三星电子株式会社 Method and apparatus for processing convolution operations in a neural network
CN109711367A (en) * 2018-12-29 2019-05-03 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN113052291A (en) * 2019-12-27 2021-06-29 上海商汤智能科技有限公司 Data processing method and device
CN113052291B (en) * 2019-12-27 2024-04-16 上海商汤智能科技有限公司 Data processing method and device

Similar Documents

Publication Publication Date Title
CN111062472B (en) A Sparse Neural Network Accelerator and Acceleration Method Based on Structured Pruning
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
TWI804684B (en) Methods and devices for exploiting activation sparsity in deep neural networks
US11568258B2 (en) Operation method
CN108875917A (en) A kind of control method and device for convolutional neural networks processor
CN107818367B (en) Processing system and processing method for neural network
CN108985449A (en) A kind of control method and device of pair of convolutional neural networks processor
CN107609641B (en) Sparse neural network architecture and implementation method thereof
US20180157969A1 (en) Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
US10482380B2 (en) Conditional parallel processing in fully-connected neural networks
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
CN115238863B (en) A hardware acceleration method, system, and application for convolutional neural network convolutional layers
CN111788583B (en) Continuous Sparsity Pattern Neural Network
CN107918794A (en) Neural network processor based on computing array
WO2017116924A1 (en) Neural network training performance optimization framework
CN109376852A (en) Arithmetic unit and operation method
CN107341541A (en) An apparatus and method for performing fully connected layer neural network training
CN109299781A (en) Distributed Deep Learning System Based on Momentum and Pruning
CN111368988A (en) Deep learning training hardware accelerator utilizing sparsity
US20180137408A1 (en) Method and system for event-based neural networks
CN108875925A (en) A kind of control method and device for convolutional neural networks processor
KR20220078819A (en) Method and apparatus for performing deep learning operations
CN115600673A (en) Method and system for parallel training DNN model for multi-machine multi-card computing system
CN119323192A (en) Fluid solving method and system based on multi-step physical coding recursive convolutional neural network
CN116185604A (en) Pipeline parallel training method and system for deep learning model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181123