
CN106250103A - A system for data reuse in convolutional neural network cyclic convolution computation - Google Patents

A system for data reuse in convolutional neural network cyclic convolution computation

Info

Publication number
CN106250103A
Authority
CN
China
Prior art keywords
data
convolution
calculation
array
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610633040.9A
Other languages
Chinese (zh)
Inventor
刘波
朱智洋
陈壮
阮星
龚宇
曹鹏
杨军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201610633040.9A priority Critical patent/CN106250103A/en
Publication of CN106250103A publication Critical patent/CN106250103A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G06F 17/153 Multidimensional correlation or convolution
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 Register arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a system for data reuse in convolutional neural network cyclic convolution computation, oriented toward coarse-grained reconfigurable systems, comprising four parts: a main controller and connection control module, an input data reuse module, a cyclic convolution operation processing array, and a data transmission path. A cyclic convolution operation is, in essence, the multiplication of multiple two-dimensional input data matrices by multiple two-dimensional weight matrices; these matrices are generally large, and the multiplications account for most of the total convolution time. The invention uses a coarse-grained reconfigurable array to carry out the convolution computation. After a convolution request instruction is received, register rotation is used to fully exploit the reusability of the input data across loop iterations, which raises data utilization and reduces memory-bandwidth pressure. The array units are configurable and can perform cyclic convolutions of different sizes and strides.

Description

A System for Data Reuse in Convolutional Neural Network Cyclic Convolution Computation

Technical Field

The invention relates to the field of embedded reconfigurable design, and in particular to a system, oriented toward coarse-grained reconfigurable systems, for reusing data in convolutional neural network cyclic convolution computation. It can be used in high-performance reconfigurable systems to perform large numbers of cyclic convolution operations in a convolutional neural network, reusing data already on hand wherever possible so as to raise the computation rate and reduce the pressure on data-read bandwidth.

Background Art

A reconfigurable processor architecture is an ideal platform for application acceleration. Because the hardware structure can be reorganized according to the data-flow graph of a program, reconfigurable arrays have been shown to offer good performance-improvement potential for scientific computing and multimedia applications.

Convolution is widely used in image processing, for example in image filtering, image enhancement, and image analysis. Image convolution is essentially a matrix operation, characterized by a large amount of computation and a high data-reuse rate; computing image convolutions in software alone can hardly meet real-time requirements.

As a feed-forward multi-layer neural network, a convolutional neural network can learn automatically from large amounts of labeled data and extract complex features from it. Its advantage is that visual patterns can be recognized from pixel images with little preprocessing, recognition remains good for objects with considerable variation, and its recognition ability is not easily affected by image distortion or simple geometric transformations. As an important direction in multi-layer artificial neural network research, the convolutional neural network has been a research hotspot for many years.

Place the convolution template at the upper-left corner of the image lattice, so that it coincides with the upper-left sub-matrix of the image. Multiply the coinciding elements pairwise and sum all the products to obtain the first result point. Then shift the template one column to the right to obtain the second result point. By traversing the template over the whole image lattice in this way, the convolution of one image frame can be computed in full. The data-reuse rate is very high, but with a traditional cache, or with direct reads from external memory, the operation is inefficient because of the data-read bandwidth limit and the lack of a configurable array for multi-layer convolution loops.
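The sliding-template procedure described above can be sketched as follows. This is a minimal software illustration of the arithmetic, not the patented hardware; the function and variable names are chosen for this example only:

```python
import numpy as np

def convolve2d_valid(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slide a K*K template over the image; each output point is the
    elementwise product of the template with the sub-matrix it covers,
    summed (no kernel flip, i.e. cross-correlation, as is usual in CNNs)."""
    K = kernel.shape[0]
    H, W = image.shape
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # The "coinciding sub-matrix" under the template at this position.
            window = image[i * stride:i * stride + K, j * stride:j * stride + K]
            out[i, j] = np.sum(window * kernel)
    return out
```

Note that the naive loop above re-reads the full K*K window for every output point, which is exactly the redundant traffic the data reuse structure of the invention is meant to avoid.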

Summary of the Invention

Purpose of the invention: in view of the problems and deficiencies of the prior art, the invention provides a system, oriented toward coarse-grained reconfigurable systems, for reusing data in convolutional neural network cyclic convolution computation. It can accelerate large numbers of convolution computations and reduce bandwidth pressure, and its convolution operation array is configurable. Computation performance and hardware-resource occupancy are the two aspects that must be traded off when a convolutional neural network is implemented on a coarse-grained reconfigurable architecture. The design goal of a convolutional neural network based on a reconfigurable processing array is, on the premise of meeting application performance requirements, to make full use of the computing and storage resources that the reconfigurable array provides, to exploit the input-image data reuse structure and the high reuse rate inherent in cyclic convolution, and, with the configurability of the coarse-grained reconfigurable array, to complete the convolution computation under data-read bandwidth and computing-resource limits and so reach a good compromise.

Technical solution: a system for data reuse in convolutional neural network cyclic convolution computation, oriented toward coarse-grained reconfigurable systems, comprising a main controller and connection control module, an input data reuse module, a cyclic convolution operation processing array, and a data transmission path.

The main controller and connection control module receives external convolution requests, loads the configuration information of the computing array, returns computation results, monitors the loop execution state, and controls data transfers between the external memory and the input data reuse module.

The input data reuse module is the data reuse module connecting the external input data memory and the cyclic convolution operation processing array, and it carries out the input data reuse. The upper half of the module is a bank of FIFOs, one per image-matrix column (image-matrix-width many); the lower half is the same number of shift registers. The FIFOs continuously load input data from external memory, each corresponding to one column of the convolution computation. When the shift registers advance by the convolution stride, the FIFOs replace one of their columns, after which one convolution operation completes, achieving data reuse. The shift registers supply the updated neighborhood data provided by the FIFO half. Because the shift registers use ring addressing, data from the FIFOs always replaces the oldest data in the ring shift registers, and the data is then sent to the operation array to complete the convolution.

The module operates in the following concrete steps:

S 32-bit data words (1 <= S < maximum image-matrix width) are fed into the FIFOs at a time. Once the convolution has consumed the data in a register, the FIFO transfers its own data to the shift register. The shift register only needs to update one column of K 32-bit words (1 <= K < maximum image-matrix width, where K is the kernel-matrix width of the current convolution); together with the existing K-1 columns, the shift register transfers K*K data words to the convolution computing matrix and then continues to advance by the stride, again updating only one column, so the input data is reused.
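The column-rotation step can be modeled in software as follows. This is an illustrative sketch of the ring of columns, not the hardware itself, and all names are chosen for this example. For each stride step, only the new column(s) are fetched from "external memory"; the other K-1 columns are reused from the ring:

```python
from collections import deque
import numpy as np

def convolve_row_with_reuse(image, kernel, stride=1):
    """Compute one output row of a valid convolution, fetching each image
    column at most once: a ring of K columns is kept, and each stride step
    only replaces the oldest column(s) instead of re-reading the K*K window."""
    K = kernel.shape[0]
    # Prime the ring with the first K columns (one "FIFO load" per column).
    ring = deque((image[:K, c].copy() for c in range(K)), maxlen=K)
    outputs = []
    c = K  # index of the next column to fetch
    while True:
        window = np.stack(list(ring), axis=1)  # K x K window from reused columns
        outputs.append(int(np.sum(window * kernel)))
        if c + stride > image.shape[1]:
            break
        for s in range(stride):                # fetch only 'stride' new columns;
            ring.append(image[:K, c + s].copy())  # ring drops the oldest column
        c += stride
    return outputs
```

The `deque` with `maxlen=K` mimics the ring addressing: an appended column always evicts the oldest one, just as the text describes for the ring shift registers.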

The cyclic convolution operation processing array obtains the required input data from the input data reuse module, performs the convolution computation, and sends the data out when the computation completes.

The data transmission path is the channel that carries data among the main controller and interface control module, the cyclic convolution operation processing array, and the input data reuse module.

Further, the main controller and connection control module comprises the main controller and the connection controller. The connection controller performs prefetch judgment and data reuse configuration control: the prefetch judgment checks whether the data needed for the next convolution is in place; if it is, the cyclic convolution operation processing array executes the convolution loop computation, and if not, the array waits for the data. The data in the buffers is read from external memory; the invention reads it by direct memory access. When external data input is needed, the main controller issues a read command to external memory and thereafter no longer controls the transfer: the connection controller sends a halt signal to the main controller, which relinquishes the address bus, data bus, and associated control buses, and whenever the data in the input data reuse module must be updated, the connection controller reads the data in external memory directly.

The cyclic convolution operation processing array comprises an array configuration module, storage processing units, and computation processing units. When matched with the data reuse module, the array configuration module configures the computing array according to the convolution size and stride, using the computing resources the array makes available; after each computation completes, the array is reconfigured and the computation processing units are adjusted to the computation size for the next convolution.

As for the configuration controller of the convolution operation processing array: after the interface control module has loaded the configuration information, the operation array, based on the cyclic convolution size and the stride, can handle convolution image-matrix sizes ranging from 1 up to the maximum image-matrix width. The operation array can be reconfigured for every convolution, and even when the kernel is small, the convolution array can still use the entire convolution computing matrix, thereby shortening the total convolution time.

The storage processing unit's stored instructions are closely tied to the data reuse module. Driven by the loop-control component, the unit takes an address from the address queue, or computes one directly in the address-generation component, issues a read request to the data reuse module, and writes the returned data into the data queue; under the control of the loop-end component, it reads the data in the shift registers.

The computation processing units implement the computation and selection functions of the data flow: the loop index continually fetches data from the register file and passes it to the computation processing unit array, which operates according to a fixed interconnection, and the results of the operations are stored to designated locations.

The cyclic convolution operation processing array applies continuous pipelining. The loop is mapped onto the array configuration module, which configures the initial, final, and step values of the loop-control variables; execution of the loop program needs no external control, and the computing array units are chained into a pipeline that carries out the scheduling of the cyclic convolution on the pipeline.
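The loop-to-pipeline mapping can be illustrated in software with chained generator stages. This is purely a conceptual sketch of pipelined loop scheduling under configured initial/final/step values; the stage names and data are invented for this example and do not come from the patent:

```python
def index_stage(start, stop, step):
    """Loop-control stage: emits loop indices from the configured
    initial, final, and step values, with no external control."""
    i = start
    while i < stop:
        yield i
        i += step

def fetch_stage(indices, data):
    """Storage-processing stage: turns each index into a data word."""
    for i in indices:
        yield data[i]

def mac_stage(values, weight, acc=0):
    """Computation stage: multiply-accumulate along the stream."""
    for v in values:
        acc += v * weight
        yield acc

# Chain the stages into a pipeline: each element flows through every
# stage in turn, without the whole loop being materialized at once.
data = [1, 2, 3, 4, 5]
pipeline = mac_stage(fetch_stage(index_stage(0, len(data), 1), data), weight=2)
result = list(pipeline)[-1]  # 2*(1+2+3+4+5) = 30
```

Each generator plays the role of one array unit; values stream from stage to stage just as the text describes data flowing through the chained computing array units.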

Brief Description of the Drawings

Fig. 1 is the architecture diagram of the coarse-grained reconfigurable array for convolution computation in an embodiment of the invention;

Fig. 2 is the hardware structure diagram of the data-rotation scheduling of the input data reuse module in an embodiment of the invention;

Fig. 3 is the structural block diagram of a storage processing unit in the coarse-grained reconfigurable convolution computing array in an embodiment of the invention;

Fig. 4 is the structural block diagram of a computation processing unit in the coarse-grained reconfigurable convolution computing array in an embodiment of the invention;

Fig. 5 is the flow chart of implementing cyclic convolution in the reconfigurable array in an embodiment of the invention.

Detailed Description

The invention is further described below in connection with specific embodiments. It should be understood that these embodiments serve only to illustrate the invention and not to limit its scope; after reading the invention, those skilled in the art may make modifications to its various equivalent forms, all of which fall within the scope defined by the claims appended to this application.

The system for data reuse in convolutional neural network cyclic convolution computation, oriented toward coarse-grained reconfigurable systems, comprises a main controller and connection control module, an input data reuse module, a cyclic convolution operation processing array, and a data transmission path.

The main controller and connection control module receives external convolution requests, loads the configuration information of the computing array, returns computation results, monitors the loop execution state, and controls data transfers between the external memory and the input data reuse module.

The input data reuse module is the data reuse module connecting the external input data memory and the cyclic convolution operation processing array; the upper half of the module is a bank of image-matrix-width FIFOs, and the lower half is the same number of shift registers.

The cyclic convolution operation processing array obtains the required input data from the input data reuse module, performs the convolution computation, and sends the data out when the computation completes.

The data transmission path is the channel that carries data among the main controller and interface control module, the cyclic convolution operation processing array, and the input data reuse module.

The main controller and connection control module comprises the main controller and the connection controller. The connection controller performs prefetch judgment and data reuse configuration control: the prefetch judgment checks whether the data needed for the next convolution is in place; if it is, the cyclic convolution operation processing array executes the convolution loop computation, and if not, the array waits for the data. The data in the buffers is read from external memory; the invention reads it by direct memory access. When external data input is needed, the main controller issues a read command to external memory and thereafter no longer controls the transfer: the connection controller sends a halt signal to the main controller, which relinquishes the address bus, data bus, and associated control buses, and whenever the data in the input data reuse module must be updated, the connection controller reads the data in external memory directly.

Fig. 1 shows the coarse-grained reconfigurable array diagram of the concrete computing array and its data flow. The configurable PE units occupy the largest part, because the reconfigurable array is the part that actually performs the convolution; the remaining parts mainly carry the start and end instructions in. As Fig. 1 shows, the storage processing units in the configurable array connect directly to the input data reuse module (Fig. 2). Based on the stride and kernel-size information, the input data reuse module streams the data needed for the convolution to the computation processing units, and the routers route the configuration data stream through the interconnection network to each computation processing unit; meanwhile, once a convolution completes, the connection controller sends the data out and reconfigures the computation processing units to start the next operation.

Fig. 2 shows the data-rotation scheduling hardware of the input data reuse module, taking a K*K kernel (K being the kernel width) as an example. FIFOs are inserted between the external memory and the shift registers. S 32-bit data words are fed into the FIFOs at a time; once the convolution has consumed the data in a register, the FIFO transfers its own data to the shift register, which updates one column of K 32-bit words and, together with the existing K-1 columns, transfers K*K data words to the convolution computing matrix. This input-image data reuse structure underpins high-efficiency convolution.

Fig. 3 is the structural block diagram of a storage processing unit. An address signal arriving on the input channel corresponds to the unit's position in the array; the storage processing units generate the addresses of the corresponding data, each generated address maps to data in the input-image data reuse module, and that data is then output to the computation processing units. The loop controls the generation of the operand addresses and the end of the convolution, and the computed data is transferred synchronously to external memory. Moreover, when data is wrong or insufficient, the loop-judgment structure ends the current operation and passes the information to external memory for a data update.

Fig. 4 is the structure diagram of a computation processing unit. On receiving input data, the unit performs the convolution with its internal multipliers and adders; after each operation, the configuration controller reconfigures the computation processing units the next operation needs, completing the configurable control, so that the operation can still be carried out well when the outer-loop size or stride changes.

With reference to Figs. 1 and 2, the concrete steps of the convolution loop computation, shown in Fig. 5, are as follows:

1) When the coarse-grained reconfigurable array is to perform a large number of convolutions, a request is first issued to the convolution control system; on receiving the request, the main processor sends instructions to the connection processing unit;

2) The connection processing unit first checks whether the required data is already in place in the input data reuse module; if not, it issues a wait signal while transferring data into the buffers by direct memory access;

3) Once the data is ready, the waiting operation instructions are notified and the control loop starts; the configuration control unit in the cyclic convolution operation processing array configures the array, the memory-access configuration module in the computing array computes the position of the data, and the computing array then convolves the data at that position, with subsequent positions following in pipeline order.

4) Y FIFO buffers (Y being the maximum image-matrix width) continually refresh the already-used data in the registers by direct memory reads; by the time a position is revisited, its data has been updated, so the operations run without interruption and without going to external memory for every convolution.

5) The connection controller controls completion of the loop; when the computation finishes, the final data is output to external memory, and this round of the convolution operation array is done.

When performing large numbers of cyclic convolutions under limited computing resources, applying the data reuse method, together with the configurable reconfigurable array and pipelined convolution, improves operation efficiency and speed. A comparative experiment was set up with two systems: comparison system A, a traditional reconfigurable system that does not support array configuration and reuse, and comparison system B, the reconfigurable system proposed by the invention that supports data prefetch and reuse. A 16x16 input data matrix, a 3x3 convolution matrix, and a stride of 1 were selected, with ten input data sets and ten convolution weight matrices convolved simultaneously. The experimental results show that comparison system B achieves an average 1.76x performance improvement over comparison system A.
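As a back-of-the-envelope check of why column reuse relieves bandwidth (an illustration derived from the reuse scheme described above, not a figure reported in the patent): without reuse, every output point re-reads its full K*K window, while with column rotation each later window in a row fetches only stride*K new words.

```python
def window_fetches(W: int, K: int, stride: int = 1) -> tuple:
    """External-memory words fetched for one output row of a valid
    convolution: naively (a full K*K window per output point) versus
    with column rotation, where only the first window loads K columns
    and each later window loads 'stride' new columns of K words."""
    n_out = (W - K) // stride + 1
    naive = n_out * K * K
    reused = K * K + (n_out - 1) * stride * K
    return naive, reused

# The experiment in the text uses a 16x16 input, 3x3 kernel, stride 1:
# 14 windows per row, so 14*9 = 126 words naively vs 9 + 13*3 = 48 with reuse.
naive, reused = window_fetches(W=16, K=3, stride=1)
```

Under these illustrative assumptions, the reuse path fetches well under half the words of the naive path, which is consistent in direction with the measured speedup, though the 1.76x figure itself also reflects pipelining and reconfiguration effects.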

Claims (5)

1. A system for data reuse in convolutional neural network cyclic convolution computation, oriented to coarse-grained reconfigurable systems, characterized by comprising: a main controller and connection control module, an input data reuse module, a cyclic convolution operation processing array, and a data transmission path; the main controller and connection control module receives external convolution operation requests, loads configuration information for the computing array, returns computation results, monitors the loop running state, and controls data transfer between the external memory and the input data reuse module; the input data reuse module is a data reuse module connecting the external input data memory and the cyclic convolution operation processing array, in which the upper half of the module is a set of FIFOs equal in number to the image matrix width and the lower half is a set of shift registers equal in number to the image matrix width; the cyclic convolution operation processing array obtains the required input data from the input data reuse module, completes the convolution computation, and sends the data out after the computation is finished.
2. The data transmission path is the channel for data transfer among the main controller and interface control module, the cyclic convolution operation processing array, and the input data reuse module.
3. The system for data reuse in convolutional neural network cyclic convolution computation oriented to coarse-grained reconfigurable systems as claimed in claim 1, characterized in that: the main controller and connection control module comprises a main controller and a connection controller; the connection controller performs prefetch judgment and data-reuse configuration control; the prefetch judgment determines whether the data required for a convolution operation are in place: if the data are in place, the cyclic convolution operation processing array executes the convolution loop computation, and if not, it waits for the data to arrive; the data in the buffer are read from the external memory by direct memory access; when external data input is required, the main controller issues a command to read data from the external memory, after which the main controller no longer controls the memory read: the connection controller sends a halt signal to the main controller, the main controller relinquishes the right to use the address bus, the data bus, and the related control buses, and when the data in the input data reuse module need updating, the connection controller reads the data in the external memory directly.
4. The system for data reuse in convolutional neural network cyclic convolution computation oriented to coarse-grained reconfigurable systems as claimed in claim 1, characterized in that: the cyclic convolution operation processing array comprises an array configuration module, storage processing units, and computation processing units; when matching the input data reuse module, the array configuration module configures the computing array according to the convolution computation scale and stride, making use of the computing resources available in the array; after each computation is completed, the array is reconfigured and the computation processing units are adjusted according to the computation scale for the next convolution operation; the cyclic convolution operation processing array applies continuous pipelined operation, the loop being mapped onto the array configuration module, which configures the initial value, final value, and step value of the loop control variables; execution of the loop program requires no external control, and pipeline links are formed among the computing array units, completing the scheduling of the cyclic convolution on the pipeline.
5. The system for data reuse in convolutional neural network cyclic convolution computation oriented to coarse-grained reconfigurable systems as claimed in claim 1, characterized in that the input data reuse module is implemented in the following steps: S 32-bit data are input to the FIFOs at a time; once the convolution operation has consumed the data in a register, the FIFO transfers its own data to the shift register; the shift register updates one column of K 32-bit data, which together with the existing K-1 columns yields the K*K data that the shift register transmits to the convolution computation matrix; the window then continues to move by the stride, again updating only one column, thereby achieving reuse of the input data.
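The column-update scheme of claim 5 can be sketched in software as follows. This is an illustrative stride-1 model only (function and variable names are my own, not from the patent): each step keeps K-1 columns of the K x K window and loads just one new column, instead of refetching all K*K inputs.

```python
import numpy as np

def conv2d_column_reuse(image, kernel):
    """Stride-1 sliding-window correlation mimicking the claim-5 scheme:
    at each horizontal step the K x K window shifts left, reusing K-1
    existing columns and fetching only one new column of K values."""
    K = kernel.shape[0]
    H, W = image.shape
    out_h, out_w = H - K + 1, W - K + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        window = image[r:r + K, 0:K].copy()        # initial K x K window
        out[r, 0] = float(np.sum(window * kernel))
        for c in range(1, out_w):
            window[:, :-1] = window[:, 1:]          # reuse K-1 columns
            window[:, -1] = image[r:r + K, c + K - 1]  # fetch one new column
            out[r, c] = float(np.sum(window * kernel))
    return out
```

Run against the experiment's 16x16/3x3 shapes, the output is 14x14 and matches a direct per-window computation, while each inner step fetches only K new input words rather than K*K.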
CN201610633040.9A 2016-08-04 2016-08-04 A kind of convolutional neural networks cyclic convolution calculates the system of data reusing Pending CN106250103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610633040.9A CN106250103A (en) 2016-08-04 2016-08-04 A kind of convolutional neural networks cyclic convolution calculates the system of data reusing


Publications (1)

Publication Number Publication Date
CN106250103A true CN106250103A (en) 2016-12-21

Family

ID=58079364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610633040.9A Pending CN106250103A (en) 2016-08-04 2016-08-04 A kind of convolutional neural networks cyclic convolution calculates the system of data reusing

Country Status (1)

Country Link
CN (1) CN106250103A (en)

Cited By (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural network
CN106844294A (en) * 2016-12-29 2017-06-13 华为机器有限公司 Convolution operation chip and communication equipment
CN107103754A (en) * 2017-05-10 2017-08-29 华南师范大学 A kind of road traffic condition Forecasting Methodology and system
CN107229598A (en) * 2017-04-21 2017-10-03 东南大学 A kind of low power consumption voltage towards convolutional neural networks is adjustable convolution computing module
CN107590085A (en) * 2017-08-18 2018-01-16 浙江大学 A kind of dynamic reconfigurable array data path and its control method with multi-level buffer
CN107635138A (en) * 2017-10-19 2018-01-26 珠海格力电器股份有限公司 Image processing apparatus
CN107832262A (en) * 2017-10-19 2018-03-23 珠海格力电器股份有限公司 Convolution operation method and device
CN107862650A (en) * 2017-11-29 2018-03-30 中科亿海微电子科技(苏州)有限公司 The method of speed-up computation two dimensional image CNN convolution
CN108009126A (en) * 2017-12-15 2018-05-08 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 A kind of convolutional neural networks reasoning accelerator and method
CN108198125A (en) * 2017-12-29 2018-06-22 深圳云天励飞技术有限公司 A kind of image processing method and device
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A reconfigurable neural network acceleration method and architecture
WO2018137177A1 (en) * 2017-01-25 2018-08-02 北京大学 Method for convolution operation based on nor flash array
CN108564524A (en) * 2018-04-24 2018-09-21 开放智能机器(上海)有限公司 A kind of convolutional calculation optimization method of visual pattern
CN108595379A (en) * 2018-05-08 2018-09-28 济南浪潮高新科技投资发展有限公司 A kind of parallelization convolution algorithm method and system based on multi-level buffer
CN108596331A (en) * 2018-04-16 2018-09-28 浙江大学 A kind of optimization method of cell neural network hardware structure
CN108665063A (en) * 2018-05-18 2018-10-16 南京大学 Two-way simultaneous for BNN hardware accelerators handles convolution acceleration system
CN108681984A (en) * 2018-07-26 2018-10-19 珠海市微半导体有限公司 A kind of accelerating circuit of 3*3 convolution algorithms
CN108701015A (en) * 2017-11-30 2018-10-23 深圳市大疆创新科技有限公司 Computing device, chip, device and related method for neural network
CN108717571A (en) * 2018-06-01 2018-10-30 阿依瓦(北京)技术有限公司 A kind of acceleration method and device for artificial intelligence
CN108764182A (en) * 2018-06-01 2018-11-06 阿依瓦(北京)技术有限公司 A kind of acceleration method and device for artificial intelligence of optimization
WO2018232615A1 (en) * 2017-06-21 2018-12-27 华为技术有限公司 METHOD AND DEVICE FOR PROCESSING SIGNALS
CN109272112A (en) * 2018-07-03 2019-01-25 北京中科睿芯科技有限公司 A kind of data reusing command mappings method, system and device towards neural network
CN109284475A (en) * 2018-09-20 2019-01-29 郑州云海信息技术有限公司 A matrix convolution calculation module and matrix convolution calculation method
CN109375952A (en) * 2018-09-29 2019-02-22 北京字节跳动网络技术有限公司 Method and apparatus for storing data
CN109460813A (en) * 2018-09-10 2019-03-12 中国科学院深圳先进技术研究院 Accelerated method, device, equipment and the storage medium that convolutional neural networks calculate
CN109711533A (en) * 2018-12-20 2019-05-03 西安电子科技大学 FPGA-based convolutional neural network module
CN109754359A (en) * 2017-11-01 2019-05-14 腾讯科技(深圳)有限公司 A method and system for pooling processing applied to convolutional neural networks
CN109816093A (en) * 2018-12-17 2019-05-28 北京理工大学 A One-way Convolution Implementation Method
CN109992541A (en) * 2017-12-29 2019-07-09 深圳云天励飞技术有限公司 A data handling method, related product and computer storage medium
CN110069444A (en) * 2019-06-03 2019-07-30 南京宁麒智能计算芯片研究院有限公司 A kind of computing unit, array, module, hardware system and implementation method
CN110325963A (en) * 2017-02-28 2019-10-11 微软技术许可有限责任公司 Multi-function unit for programmable hardware nodes for neural network processing
CN110383237A (en) * 2017-02-28 2019-10-25 德克萨斯仪器股份有限公司 Reconfigurable matrix multiplier system and method
CN110377874A (en) * 2019-07-23 2019-10-25 江苏鼎速网络科技有限公司 Convolution algorithm method and system
CN110413561A (en) * 2018-04-28 2019-11-05 北京中科寒武纪科技有限公司 Data accelerate processing system
WO2019231254A1 (en) * 2018-05-30 2019-12-05 Samsung Electronics Co., Ltd. Processor, electronics apparatus and control method thereof
CN110705687A (en) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 Convolution neural network hardware computing device and method
WO2020051751A1 (en) * 2018-09-10 2020-03-19 中国科学院深圳先进技术研究院 Convolution neural network computing acceleration method and apparatus, device, and storage medium
CN111045958A (en) * 2018-10-11 2020-04-21 展讯通信(上海)有限公司 Acceleration engine and processor
WO2020077565A1 (en) * 2018-10-17 2020-04-23 北京比特大陆科技有限公司 Data processing method and apparatus, electronic device, and computer readable storage medium
CN111095242A (en) * 2017-07-24 2020-05-01 特斯拉公司 Vector calculation unit
CN111176727A (en) * 2017-07-20 2020-05-19 上海寒武纪信息科技有限公司 Computing device and computing method
CN111291880A (en) * 2017-10-30 2020-06-16 上海寒武纪信息科技有限公司 Computing device and computing method
CN111465924A (en) * 2017-12-12 2020-07-28 特斯拉公司 System and method for converting matrix input to vectorized input of a matrix processor
US10733742B2 (en) 2018-09-26 2020-08-04 International Business Machines Corporation Image labeling
CN111523642A (en) * 2020-04-10 2020-08-11 厦门星宸科技有限公司 Data reuse method, operation method and device, and chip for convolution operation
CN109800867B (en) * 2018-12-17 2020-09-29 北京理工大学 Data calling method based on FPGA off-chip memory
CN111859797A (en) * 2020-07-14 2020-10-30 Oppo广东移动通信有限公司 A data processing method and device, and storage medium
CN112204585A (en) * 2018-05-30 2021-01-08 三星电子株式会社 Processor, electronic device and control method thereof
WO2021007037A1 (en) * 2019-07-09 2021-01-14 MemryX Inc. Matrix data reuse techniques in processing systems
US10928456B2 (en) 2017-08-17 2021-02-23 Samsung Electronics Co., Ltd. Method and apparatus for estimating state of battery
CN112992248A (en) * 2021-03-12 2021-06-18 西安交通大学深圳研究院 PE (provider edge) calculation unit structure of FIFO (first in first out) -based variable-length cyclic shift register
US11176427B2 (en) 2018-09-26 2021-11-16 International Business Machines Corporation Overlapping CNN cache reuse in high resolution and streaming-based deep learning inference engines
CN114780910A (en) * 2022-06-16 2022-07-22 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation
WO2022179075A1 (en) * 2021-02-26 2022-09-01 成都商汤科技有限公司 Data processing method and apparatus, computer device and storage medium
CN115168284A (en) * 2022-07-06 2022-10-11 中国科学技术大学 Coarse-grained reconfigurable array system and computing method for deep learning
US11694074B2 (en) 2018-09-07 2023-07-04 Samsung Electronics Co., Ltd. Integrated circuit that extracts data, neural network processor including the integrated circuit, and neural network device
CN116842307A (en) * 2023-08-28 2023-10-03 腾讯科技(深圳)有限公司 Data processing method, device, equipment, chip and storage medium
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
CN118093018A (en) * 2023-12-19 2024-05-28 北京理工大学 In-memory computing core, in-memory computing method, in-memory processor and processing method
US12216610B2 (en) 2017-07-24 2025-02-04 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting

Citations (4)

Publication number Priority date Publication date Assignee Title
WO2001090927A1 (en) * 2000-05-19 2001-11-29 Philipson Lars H G Method and device in a convolution process
CN102208005A (en) * 2011-05-30 2011-10-05 华中科技大学 2-dimensional (2-D) convolver
CN104077233A (en) * 2014-06-18 2014-10-01 百度在线网络技术(北京)有限公司 Single-channel convolution layer and multi-channel convolution layer handling method and device
CN105681628A (en) * 2016-01-05 2016-06-15 西安交通大学 Convolution network arithmetic unit, reconfigurable convolution neural network processor and image de-noising method of reconfigurable convolution neural network processor


Non-Patent Citations (2)

Title
DOU Yong et al.: "Coarse-grained reconfigurable array architecture supporting automatic loop pipelining", Science in China Series E: Information Sciences *
LU Zhijian: "Research on FPGA-based parallel architectures for convolutional neural networks", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (103)

Publication number Priority date Publication date Assignee Title
CN106844294A (en) * 2016-12-29 2017-06-13 华为机器有限公司 Convolution operation chip and communication equipment
CN106844294B (en) * 2016-12-29 2019-05-03 华为机器有限公司 Convolution arithmetic chips and communication equipment
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural network
US11309026B2 (en) 2017-01-25 2022-04-19 Peking University Convolution operation method based on NOR flash array
WO2018137177A1 (en) * 2017-01-25 2018-08-02 北京大学 Method for convolution operation based on nor flash array
US12307355B2 (en) 2017-02-28 2025-05-20 Microsoft Technology Licensing, Llc Neural network processing with chained instructions
CN110383237B (en) * 2017-02-28 2023-05-26 德克萨斯仪器股份有限公司 Reconfigurable matrix multiplier system and method
CN110325963B (en) * 2017-02-28 2023-05-23 微软技术许可有限责任公司 Multifunctional unit for programmable hardware nodes for neural network processing
CN110383237A (en) * 2017-02-28 2019-10-25 德克萨斯仪器股份有限公司 Reconfigurable matrix multiplier system and method
CN110325963A (en) * 2017-02-28 2019-10-11 微软技术许可有限责任公司 Multi-function unit for programmable hardware nodes for neural network processing
US11663450B2 (en) 2017-02-28 2023-05-30 Microsoft Technology Licensing, Llc Neural network processing with chained instructions
CN107229598A (en) * 2017-04-21 2017-10-03 东南大学 A kind of low power consumption voltage towards convolutional neural networks is adjustable convolution computing module
CN107103754A (en) * 2017-05-10 2017-08-29 华南师范大学 A kind of road traffic condition Forecasting Methodology and system
WO2018232615A1 (en) * 2017-06-21 2018-12-27 华为技术有限公司 METHOD AND DEVICE FOR PROCESSING SIGNALS
CN111176727A (en) * 2017-07-20 2020-05-19 上海寒武纪信息科技有限公司 Computing device and computing method
CN111221578A (en) * 2017-07-20 2020-06-02 上海寒武纪信息科技有限公司 Computing device and computing method
CN111176727B (en) * 2017-07-20 2022-05-31 上海寒武纪信息科技有限公司 Computing device and computing method
CN111221578B (en) * 2017-07-20 2022-07-15 上海寒武纪信息科技有限公司 Computing device and computing method
US12216610B2 (en) 2017-07-24 2025-02-04 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
CN111095242A (en) * 2017-07-24 2020-05-01 特斯拉公司 Vector calculation unit
US12086097B2 (en) 2017-07-24 2024-09-10 Tesla, Inc. Vector computational unit
CN111095242B (en) * 2017-07-24 2024-03-22 特斯拉公司 vector calculation unit
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US10928456B2 (en) 2017-08-17 2021-02-23 Samsung Electronics Co., Ltd. Method and apparatus for estimating state of battery
CN107590085B (en) * 2017-08-18 2018-05-29 浙江大学 A kind of dynamic reconfigurable array data path and its control method with multi-level buffer
CN107590085A (en) * 2017-08-18 2018-01-16 浙江大学 A kind of dynamic reconfigurable array data path and its control method with multi-level buffer
CN107635138A (en) * 2017-10-19 2018-01-26 珠海格力电器股份有限公司 Image processing apparatus
CN107832262A (en) * 2017-10-19 2018-03-23 珠海格力电器股份有限公司 Convolution operation method and device
CN111291880B (en) * 2017-10-30 2024-05-14 上海寒武纪信息科技有限公司 Computing device and computing method
CN111291880A (en) * 2017-10-30 2020-06-16 上海寒武纪信息科技有限公司 Computing device and computing method
CN109754359A (en) * 2017-11-01 2019-05-14 腾讯科技(深圳)有限公司 A method and system for pooling processing applied to convolutional neural networks
US11734554B2 (en) 2017-11-01 2023-08-22 Tencent Technology (Shenzhen) Company Limited Pooling processing method and system applied to convolutional neural network
US11537857B2 (en) 2017-11-01 2022-12-27 Tencent Technology (Shenzhen) Company Limited Pooling processing method and system applied to convolutional neural network
CN107862650B (en) * 2017-11-29 2021-07-06 中科亿海微电子科技(苏州)有限公司 Method for accelerating calculation of CNN convolution of two-dimensional image
CN107862650A (en) * 2017-11-29 2018-03-30 中科亿海微电子科技(苏州)有限公司 The method of speed-up computation two dimensional image CNN convolution
CN108701015A (en) * 2017-11-30 2018-10-23 深圳市大疆创新科技有限公司 Computing device, chip, device and related method for neural network
CN111465924B (en) * 2017-12-12 2023-11-17 特斯拉公司 System and method for converting matrix input into vectorized input for matrix processor
CN111465924A (en) * 2017-12-12 2020-07-28 特斯拉公司 System and method for converting matrix input to vectorized input of a matrix processor
CN108009126A (en) * 2017-12-15 2018-05-08 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108198125B (en) * 2017-12-29 2021-10-08 深圳云天励飞技术有限公司 Image processing method and device
CN108198125A (en) * 2017-12-29 2018-06-22 深圳云天励飞技术有限公司 A kind of image processing method and device
CN109992541A (en) * 2017-12-29 2019-07-09 深圳云天励飞技术有限公司 A data handling method, related product and computer storage medium
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 A kind of convolutional neural networks reasoning accelerator and method
CN108182471B (en) * 2018-01-24 2022-02-15 上海岳芯电子科技有限公司 Convolutional neural network reasoning accelerator and method
CN108241890B (en) * 2018-01-29 2021-11-23 清华大学 Reconfigurable neural network acceleration method and architecture
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A reconfigurable neural network acceleration method and architecture
CN108596331A (en) * 2018-04-16 2018-09-28 浙江大学 A kind of optimization method of cell neural network hardware structure
CN108564524A (en) * 2018-04-24 2018-09-21 开放智能机器(上海)有限公司 A kind of convolutional calculation optimization method of visual pattern
CN110413561B (en) * 2018-04-28 2021-03-30 中科寒武纪科技股份有限公司 Data acceleration processing system
CN110413561A (en) * 2018-04-28 2019-11-05 北京中科寒武纪科技有限公司 Data accelerate processing system
CN108595379A (en) * 2018-05-08 2018-09-28 济南浪潮高新科技投资发展有限公司 A kind of parallelization convolution algorithm method and system based on multi-level buffer
CN108665063A (en) * 2018-05-18 2018-10-16 南京大学 Two-way simultaneous for BNN hardware accelerators handles convolution acceleration system
CN108665063B (en) * 2018-05-18 2022-03-18 南京大学 Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
WO2019231254A1 (en) * 2018-05-30 2019-12-05 Samsung Electronics Co., Ltd. Processor, electronics apparatus and control method thereof
US11244027B2 (en) 2018-05-30 2022-02-08 Samsung Electronics Co., Ltd. Processor, electronics apparatus and control method thereof
CN112204585A (en) * 2018-05-30 2021-01-08 三星电子株式会社 Processor, electronic device and control method thereof
CN112204585B (en) * 2018-05-30 2024-09-17 三星电子株式会社 Processor, electronic device and control method thereof
CN108764182B (en) * 2018-06-01 2020-12-08 阿依瓦(北京)技术有限公司 Optimized acceleration method and device for artificial intelligence
CN108717571A (en) * 2018-06-01 2018-10-30 阿依瓦(北京)技术有限公司 A kind of acceleration method and device for artificial intelligence
CN108717571B (en) * 2018-06-01 2020-09-15 阿依瓦(北京)技术有限公司 Acceleration method and device for artificial intelligence
CN108764182A (en) * 2018-06-01 2018-11-06 阿依瓦(北京)技术有限公司 A kind of acceleration method and device for artificial intelligence of optimization
CN109272112A (en) * 2018-07-03 2019-01-25 北京中科睿芯科技有限公司 A kind of data reusing command mappings method, system and device towards neural network
CN109272112B (en) * 2018-07-03 2021-08-27 北京中科睿芯科技集团有限公司 Data reuse instruction mapping method, system and device for neural network
CN108681984A (en) * 2018-07-26 2018-10-19 珠海市微半导体有限公司 A kind of accelerating circuit of 3*3 convolution algorithms
CN108681984B (en) * 2018-07-26 2023-08-15 珠海一微半导体股份有限公司 Acceleration circuit of 3*3 convolution algorithm
US11694074B2 (en) 2018-09-07 2023-07-04 Samsung Electronics Co., Ltd. Integrated circuit that extracts data, neural network processor including the integrated circuit, and neural network device
US12198053B2 (en) 2018-09-07 2025-01-14 Samsung Electronics Co., Ltd. Integrated circuit that extracts data, neural network processor including the integrated circuit, and neural network device
CN109460813A (en) * 2018-09-10 2019-03-12 中国科学院深圳先进技术研究院 Accelerated method, device, equipment and the storage medium that convolutional neural networks calculate
WO2020051751A1 (en) * 2018-09-10 2020-03-19 中国科学院深圳先进技术研究院 Convolution neural network computing acceleration method and apparatus, device, and storage medium
CN109284475A (en) * 2018-09-20 2019-01-29 郑州云海信息技术有限公司 A matrix convolution calculation module and matrix convolution calculation method
CN109284475B (en) * 2018-09-20 2021-10-29 郑州云海信息技术有限公司 A matrix convolution computing device and matrix convolution computing method
US10733742B2 (en) 2018-09-26 2020-08-04 International Business Machines Corporation Image labeling
US11176427B2 (en) 2018-09-26 2021-11-16 International Business Machines Corporation Overlapping CNN cache reuse in high resolution and streaming-based deep learning inference engines
US12039769B2 (en) 2018-09-26 2024-07-16 International Business Machines Corporation Identifying a type of object in a digital image based on overlapping areas of sub-images
CN109375952A (en) * 2018-09-29 2019-02-22 北京字节跳动网络技术有限公司 Method and apparatus for storing data
CN109375952B (en) * 2018-09-29 2021-01-26 北京字节跳动网络技术有限公司 Method and apparatus for storing data
CN111045958A (en) * 2018-10-11 2020-04-21 展讯通信(上海)有限公司 Acceleration engine and processor
CN111045958B (en) * 2018-10-11 2022-09-16 展讯通信(上海)有限公司 Acceleration engine and processor
WO2020077565A1 (en) * 2018-10-17 2020-04-23 北京比特大陆科技有限公司 Data processing method and apparatus, electronic device, and computer readable storage medium
CN109800867B (en) * 2018-12-17 2020-09-29 北京理工大学 Data calling method based on FPGA off-chip memory
CN109816093B (en) * 2018-12-17 2020-12-04 北京理工大学 A One-way Convolution Implementation Method
CN109816093A (en) * 2018-12-17 2019-05-28 北京理工大学 A One-way Convolution Implementation Method
CN109711533B (en) * 2018-12-20 2023-04-28 西安电子科技大学 FPGA-based Convolutional Neural Network Acceleration System
CN109711533A (en) * 2018-12-20 2019-05-03 西安电子科技大学 FPGA-based convolutional neural network module
CN110069444A (en) * 2019-06-03 2019-07-30 南京宁麒智能计算芯片研究院有限公司 A kind of computing unit, array, module, hardware system and implementation method
WO2021007037A1 (en) * 2019-07-09 2021-01-14 MemryX Inc. Matrix data reuse techniques in processing systems
US12353846B2 (en) 2019-07-09 2025-07-08 MemryX Matrix data reuse techniques in multiply and accumulate units of processing system
US11537535B2 (en) 2019-07-09 2022-12-27 Memryx Incorporated Non-volatile memory based processors and dataflow techniques
CN110377874B (en) * 2019-07-23 2023-05-02 江苏鼎速网络科技有限公司 Convolution operation method and system
CN110377874A (en) * 2019-07-23 2019-10-25 江苏鼎速网络科技有限公司 Convolution algorithm method and system
CN110705687A (en) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 Convolution neural network hardware computing device and method
CN111523642B (en) * 2020-04-10 2023-03-28 星宸科技股份有限公司 Data reuse method, operation method and device and chip for convolution operation
CN111523642A (en) * 2020-04-10 2020-08-11 厦门星宸科技有限公司 Data reuse method, operation method and device, and chip for convolution operation
CN111859797A (en) * 2020-07-14 2020-10-30 Oppo广东移动通信有限公司 A data processing method and device, and storage medium
WO2022179075A1 (en) * 2021-02-26 2022-09-01 成都商汤科技有限公司 Data processing method and apparatus, computer device and storage medium
CN112992248A (en) * 2021-03-12 2021-06-18 西安交通大学深圳研究院 PE (provider edge) calculation unit structure of FIFO (first in first out) -based variable-length cyclic shift register
CN114780910B (en) * 2022-06-16 2022-09-06 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation
CN114780910A (en) * 2022-06-16 2022-07-22 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation
CN115168284A (en) * 2022-07-06 2022-10-11 中国科学技术大学 Coarse-grained reconfigurable array system and computing method for deep learning
CN116842307B (en) * 2023-08-28 2023-11-28 腾讯科技(深圳)有限公司 Data processing methods, devices, equipment, chips and storage media
CN116842307A (en) * 2023-08-28 2023-10-03 腾讯科技(深圳)有限公司 Data processing method, device, equipment, chip and storage medium
CN118093018A (en) * 2023-12-19 2024-05-28 北京理工大学 In-memory computing core, in-memory computing method, in-memory processor and processing method
CN118093018B (en) * 2023-12-19 2025-04-11 北京理工大学 In-memory computing core, in-memory computing method, in-memory processor and processing method

Similar Documents

Publication Publication Date Title
CN106250103A (en) A kind of convolutional neural networks cyclic convolution calculates the system of data reusing
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
CN107657581B (en) A convolutional neural network CNN hardware accelerator and acceleration method
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
CN110516801A (en) A High Throughput Dynamically Reconfigurable Convolutional Neural Network Accelerator Architecture
CN110007961B (en) RISC-V-based edge computing hardware architecture
CN107688853B (en) Device and method for executing neural network operation
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN103345461B (en) Based on the polycaryon processor network-on-a-chip with accelerator of FPGA
CN109711533B (en) FPGA-based Convolutional Neural Network Acceleration System
CN108388537B (en) A convolutional neural network acceleration device and method
CN109447241B (en) A Dynamic Reconfigurable Convolutional Neural Network Accelerator Architecture for the Internet of Things
CN110210610A (en) Convolutional calculation accelerator, convolutional calculation method and convolutional calculation equipment
CN101089840A (en) Multi-FPGA-based Parallel Computing System for Matrix Multiplication
CN104899182A (en) Matrix multiplication acceleration method for supporting variable blocks
WO2021115163A1 (en) Neural network processor, chip and electronic device
US11789733B2 (en) Instruction processing apparatus, acceleration unit, and server
CN107403117A (en) Three dimensional convolution device based on FPGA
WO2022001550A1 (en) Address generation method, related device and storage medium
CN111797982A (en) Image processing system based on convolutional neural network
CN112905530A (en) On-chip architecture, pooled computational accelerator array, unit and control method
CN102306371A (en) Hierarchical parallel modular sequence image real-time processing device
CN117632844A (en) Reconfigurable AI algorithm hardware accelerator
CN105955896B (en) A reconfigurable DBF algorithm hardware accelerator and control method
CN111047035B (en) Neural network processor, chip and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20161221