
WO2021243839A1 - Composite-granularity, near-storage and approximation-based acceleration structure and method for long short-term memory network - Google Patents


Info

Publication number
WO2021243839A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
configuration
module
storage
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2020/106988
Other languages
French (fr)
Chinese (zh)
Inventor
王镇 (Wang Zhen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Prochip Electronic Technology Co Ltd
Original Assignee
Nanjing Prochip Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Prochip Electronic Technology Co Ltd filed Critical Nanjing Prochip Electronic Technology Co Ltd
Publication of WO2021243839A1
Anticipated expiration
Current legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • The invention belongs to the technical field of long short-term memory (LSTM) network acceleration, and specifically relates to a composite-granularity near-storage approximate acceleration structure and method for an LSTM network.
  • The purpose of the present invention is to overcome the deficiencies of the prior art by proposing a composite-granularity near-storage approximate acceleration structure and method for long short-term memory networks: based on a composite-granularity network model, the scheduling strategy of the LSTM network is divided, and a near-storage approximate acceleration operation module is used to perform the operations, which improves both the parallelism and the speed of the computation.
  • the present invention proposes the following technical solutions:
  • the present invention proposes a compound granularity near-storage approximate acceleration structure of a long and short-term memory network based on a compound granularity task division strategy.
  • the composite granularity is composed of coarse granularity and fine granularity.
  • Coarse granularity is the parallel acceleration at the cell level, and fine granularity is the acceleration of the matrix inside the cell.
  • the compound granularity near-storage approximate acceleration structure of the long-short-term memory network proposed by the present invention includes: a near-storage approximate acceleration storage module, a matrix vector operation module, a near-storage approximate acceleration operation module, and a functional configuration module for near-storage approximate acceleration operation.
  • The matrix-vector operation module is used to perform calculations between matrices and vectors.
  • The calculated intermediate vector data is stored in the near-storage approximate acceleration storage module, which supplies the data for the various calculations performed by the near-storage approximate acceleration calculation module.
  • the near-storage approximate acceleration calculation module is used to perform vector-to-vector calculations, and the near-storage approximate acceleration calculation function configuration module is used to configure the near-storage approximate acceleration calculation module.
  • the vector-to-vector calculation tasks performed by the near-storage approximate acceleration calculation module include several different vector calculation types, and the near-storage approximate acceleration calculation function configuration module supports different vector calculation types.
  • the matrix vector operation module is mainly used to calculate multiplication and addition operations
  • the near storage approximate acceleration operation module is used to calculate activation functions or addition operations.
  • The composite-granularity task division strategy divides the calculation tasks so that matrix-vector tasks are sent to the matrix vector operation module and vector-vector tasks are sent to the near-storage approximate acceleration operation module; the two computing modules process their tasks in parallel, achieving computational acceleration with higher execution efficiency and lower power consumption.
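The division described above can be illustrated with a small, purely behavioural sketch (this is not the patented circuit; the worker functions and the thread pool below are illustrative assumptions):

```python
# Behavioural sketch (not the patented circuit) of the composite-granularity
# task division: matrix-vector tasks go to one worker, vector-vector tasks to
# another, and both run in parallel. All names here are illustrative.
from concurrent.futures import ThreadPoolExecutor

def matvec(W, x):
    # Multiply-accumulate work assigned to the matrix vector operation module.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vec_op(a, b, op):
    # Element-wise work assigned to the near-storage approximate acceleration module.
    return [op(ai, bi) for ai, bi in zip(a, b)]

W = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, -0.5]
u, v = [1.0, 2.0], [3.0, 4.0]

with ThreadPoolExecutor(max_workers=2) as pool:
    f_mv = pool.submit(matvec, W, x)                      # matrix-vector task
    f_vv = pool.submit(vec_op, u, v, lambda a, b: a + b)  # vector-vector task
    mv, vv = f_mv.result(), f_vv.result()

print(mv, vv)  # [-0.5, -0.5] [4.0, 6.0]
```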
  • the near-storage approximate acceleration calculation module includes: a first data storage module, a second data storage module, and a data processing unit.
  • the data that needs to be calculated are respectively input into the first data storage module and the second data storage module, and the data obtained by the calculation is output by the first data storage module.
  • Both the first data storage module and the second data storage module are address storage areas with a size of 1 KB, each with a bit width of 16 × 16 bits and a depth of 32.
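The stated geometry is self-consistent: 32 entries of a 16 × 16-bit word come to 8192 bits, i.e. exactly 1 KB per module. A one-line check:

```python
# Consistency check of the stated storage geometry:
# depth 32, word width 16 x 16 bits -> 1 KB per data storage module.
depth = 32            # addressable entries
lanes = 16            # 16 values per entry
bits_per_value = 16
total_bytes = depth * lanes * bits_per_value // 8
print(total_bytes)  # 1024, i.e. 1 KB
```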
  • The calculation steps for the first partial sum S_0i and the second partial sum S_1i of the LSTM are as follows:
  • Step A01: suppose that at time t the network reads the t-th input x_t and configures the weights W and bias values b of the input gate i, the forget gate f, the memory cell c, and the output gate o.
  • Step A02: calculate the response values of the input gate i, the forget gate f, the memory cell c, and the output gate o.
  • The network reads the hidden-layer state value h_(t-1) at time t-1, and the first partial sum and the second partial sum are updated accordingly.
  • Step A03: calculate the response values of the input gate i, the forget gate f, the memory cell c, and the output gate o; the network reads the memory-cell vector value c_(t-1) at time t-1, and the first and second partial sums satisfy the corresponding formulas.
  • Step A04: calculate the response values of the input gate i, the forget gate f, the memory cell c, and the output gate o; the first and second partial sums satisfy the corresponding formulas.
  • Step A05: calculate the vector values i_t and f_t of the input gate i and the forget gate f, and calculate the response values of the memory cell c and the output gate o; the first and second partial sums satisfy the following formulas:
  • i_t = σ(W_ix x_t + W_ih h_(t-1) + W_ic c_(t-1) + b_i)
  • f_t = σ(W_fx x_t + W_fh h_(t-1) + W_fc c_(t-1) + b_f), where σ is the sigmoid function.
  • Step A06: calculate the response values of the memory cell c and the output gate o; the first and second partial sums satisfy the corresponding formulas.
  • Step A07: calculate the vector value c_t of the memory cell c and the response value of the output gate o; the first and second partial sums satisfy the following formula:
  • c_t = f_t ⊙ c_(t-1) + i_t ⊙ φ(W_cx x_t + W_ch h_(t-1) + b_c), where ⊙ denotes element-wise multiplication and φ is the hyperbolic tangent function.
  • Step A08: calculate the response value of the output gate o; the first and second partial sums satisfy the corresponding formulas.
  • Step A09: calculate the response value of the output gate o; the first and second partial sums satisfy the corresponding formulas.
  • Step A10: calculate the vector value o_t of the output gate o; the first and second partial sums satisfy the following formula:
  • o_t = σ(W_ox x_t + W_oh h_(t-1) + W_oc c_(t-1) + b_o).
  • Step A11: calculate the hidden-layer state value h_t at time t; the first and second partial sums satisfy the following formula:
  • h_t = o_t ⊙ φ(c_t).
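The gate equations of steps A05, A07, A10 and A11 can be checked with a scalar reference computation (floating point here for readability; the hardware operates on vectors of 16-bit fixed-point numbers, and the weight and bias values below are purely illustrative):

```python
# Scalar reference computation of the LSTM formulas in steps A05, A07, A10, A11.
# Floating point for readability; the hardware uses vectors of 16-bit fixed point.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    i_t = sigmoid(W['ix'] * x_t + W['ih'] * h_prev + W['ic'] * c_prev + b['i'])
    f_t = sigmoid(W['fx'] * x_t + W['fh'] * h_prev + W['fc'] * c_prev + b['f'])
    c_t = f_t * c_prev + i_t * math.tanh(W['cx'] * x_t + W['ch'] * h_prev + b['c'])
    o_t = sigmoid(W['ox'] * x_t + W['oh'] * h_prev + W['oc'] * c_prev + b['o'])
    h_t = o_t * math.tanh(c_t)   # step A11
    return h_t, c_t

# Illustrative parameters (not from the patent): every weight 0.1, every bias 0.
W = {k: 0.1 for k in ('ix', 'ih', 'ic', 'fx', 'fh', 'fc', 'cx', 'ch', 'ox', 'oh', 'oc')}
b = {k: 0.0 for k in 'ifco'}
h_t, c_t = lstm_step(1.0, 0.0, 0.0, W, b)
print(round(h_t, 4), round(c_t, 4))
```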
  • the data processing unit in the near storage approximate acceleration calculation module includes: a configuration file cache, a configuration file parser, a first address generator, a second address generator, a multiplexer, and a multifunctional array processor.
  • The configuration environment file is loaded into the configuration file cache, and the configuration file parser then performs the address-configuration parsing operation; the resulting address configuration files are loaded into the first address generator and the second address generator respectively. The first address generator decides, according to the address configuration file, whether to select the corresponding first address from the first data storage module, and the second address generator decides whether to select the corresponding second address from the second data storage module.
  • The first address loaded into the first address generator and the second address loaded into the second address generator are both input into the multiplexer. The configuration file parser also configures the multiplexer, which finally selects the data corresponding to one of the two addresses as its output; this output is fed into the multi-function array processor together with the first address. At the same time, the configuration file parser configures the calculation performed by the multi-function array processor, which operates on the multiplexer's output data and then stores the result at the first address.
  • the configuration file cache is a cache array dedicated to configuration environment files;
  • the multi-function array processor is a reconfigurable multi-function array processor, whose input is a 16-bit fixed-point number, which can perform addition, multiplication, and sigmoid operations.
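As a rough behavioural model of those three operations, the sketch below uses 16-bit words in an assumed Q8.8 fixed-point format (the document does not specify the format, so the scaling here is an assumption):

```python
# Behavioural model of the RMPA's three operations on 16-bit fixed point.
# The document does not state the fixed-point format; Q8.8 is assumed here.
import math

FRAC = 8                       # assumed fractional bits (Q8.8)
SCALE = 1 << FRAC

def to_fix(x):
    return int(round(x * SCALE)) & 0xFFFF

def from_fix(v):
    v &= 0xFFFF                # interpret as signed 16-bit
    return (v - 0x10000 if v & 0x8000 else v) / SCALE

def fix_add(a, b):
    return (a + b) & 0xFFFF    # wrap-around 16-bit addition

def fix_mul(a, b):
    # Behavioural only: real hardware would shift the 32-bit product right by FRAC.
    return to_fix(from_fix(a) * from_fix(b))

def fix_sigmoid(a):
    return to_fix(1.0 / (1.0 + math.exp(-from_fix(a))))

print(from_fix(fix_add(to_fix(1.5), to_fix(0.25))))  # 1.75
print(from_fix(fix_mul(to_fix(1.5), to_fix(0.25))))  # 0.375
```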
  • the function configuration is realized by the function configuration module of the near-storage approximate acceleration operation.
  • The function configuration module of the near-storage approximate acceleration operation has a bit width of 16 and includes an address configuration unit, a multiplexer configuration unit, and a calculation configuration unit.
  • Bits 0 to 7 of the function configuration module form the address configuration unit; among them, bits 0 to 2 are the address generator selection unit, used to select the address generator, and bits 3 to 7 are the address selection unit, used to select the address within the address generator.
  • Bits 8 to 11 of the function configuration module form the multiplexer configuration unit, which selects the data source for the multiplexer.
  • the 12th to 15th bits of the functional configuration module are the calculation configuration unit, which is used to indicate the type of operation.
  • the calculation configuration unit can indicate addition, multiplication, logical operation, sigmoid operation, and approximate multiplication.
  • The last two bits of the calculation configuration unit configure the number of iterations for approximate multiplication; the more iterations, the more accurate the calculation result.
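The bit layout above can be captured by a small pack/unpack helper (assuming, as the field order suggests, that bit 0 is the least significant bit of the 16-bit word):

```python
# Pack/unpack helpers for the 16-bit function-configuration word described
# above, assuming bit 0 is the least significant bit:
#   bits 0-2 bank | bits 3-7 address | bits 8-11 MUX | bits 12-15 OpCode
def pack_cfg(bank, address, mux, opcode):
    assert bank < 8 and address < 32 and mux < 16 and opcode < 16
    return bank | (address << 3) | (mux << 8) | (opcode << 12)

def unpack_cfg(word):
    return {
        'bank':    word & 0b111,           # address generator selection unit
        'address': (word >> 3) & 0b11111,  # address selection unit
        'mux':     (word >> 8) & 0b1111,   # multiplexer configuration unit
        'opcode':  (word >> 12) & 0b1111,  # calculation configuration unit
    }

w = pack_cfg(bank=0b001, address=0b00001, mux=0b0000, opcode=0b0100)
print(unpack_cfg(w))  # {'bank': 1, 'address': 1, 'mux': 0, 'opcode': 4}
```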
  • the present invention also proposes a compound-granularity near-storage approximate acceleration method for a long and short-term memory network.
  • the steps of the acceleration method are as follows:
  • Step S1, load the configuration file: load the configuration environment file into the configuration file cache;
  • Step S2 parsing the loaded configuration file, specifically includes the following three parallel steps:
  • Step S2-1 resolve the address configuration:
  • The configuration file parser performs the address-configuration parsing operation; the resulting address configuration files are loaded into the first address generator and the second address generator respectively and are used to select whether to take the address in the first address generator or the address in the second address generator;
  • The first address generator decides, according to the address configuration file, whether to select the corresponding first address from the first data storage module and load it into the first address generator;
  • The second address generator decides, according to the address configuration file, whether to select the corresponding second address from the second data storage module and load it into the second address generator;
  • Step S2-2 parse the multiplexer configuration:
  • the configuration file parser performs the multiplexer configuration analysis operation, obtains the multiplexer configuration file, and the multiplexer selects the data source;
  • The first address loaded into the first address generator and the second address loaded into the second address generator are both input into the multiplexer;
  • The multiplexer selects the data corresponding to one of the first and second addresses as its output, which is input into the multifunctional array processor together with the first address;
  • the configuration file parser performs the calculation configuration of the multi-function array processor.
  • The reconfigurable multi-function array processor (RMPA) is configured accordingly and performs the RMPA calculation on the output data of the multiplexer, after which the calculation result is stored at the first address;
  • Step S3: it is judged whether any configuration files remain; if so, return to step S1; if not, the procedure ends.
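Steps S1 to S3 amount to an interpreter loop over configuration words. The sketch below is a minimal behavioural model using the bit layout described elsewhere in this document (bits 3 to 7 address, 8 to 11 multiplexer, 12 to 15 OpCode); the operand-selection convention (MUX field 0 selects the first bank, nonzero selects the second) is an assumption for illustration:

```python
# Minimal behavioural model of the S1-S3 loop: fetch configuration words,
# decode the three fields, select an operand via the multiplexer, compute,
# and write the result back to the first address. The operand-selection
# convention (MUX field 0 -> bank0, nonzero -> bank1) is an assumption.
import math

bank0 = [0.0] * 32   # first data storage module (results written back here)
bank1 = [0.0] * 32   # second data storage module

OPS = {
    0b0000: lambda a, b: a + b,                       # addition
    0b0100: lambda a, b: a * b,                       # multiplication
    0b1100: lambda a, b: 1.0 / (1.0 + math.exp(-a)),  # sigmoid (ignores b)
}

def run(config_words):
    for word in config_words:               # S1: load next configuration
        address = (word >> 3) & 0b11111     # S2-1: address configuration
        mux = (word >> 8) & 0b1111          # S2-2: multiplexer configuration
        opcode = (word >> 12) & 0b1111      # S2-3: calculation configuration
        operand = bank1[address] if mux else bank0[address]
        bank0[address] = OPS[opcode](bank0[address], operand)
    # S3: no configuration words remain, so the procedure ends.

bank0[2], bank1[2] = 3.0, 4.0
run([(0b0000 << 12) | (1 << 8) | (2 << 3)])  # add bank1[2] into bank0[2]
print(bank0[2])  # 7.0
```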
  • the near-storage approximate acceleration structure oriented to the long-short-term memory neural network proposed by the present invention has the following benefits:
  • the storage structure and the approximate computing unit structure are tightly coupled.
  • a more efficient and flexible acceleration structure is designed for the long- and short-term memory neural network.
  • The parallelism of tasks is increased by more than 30%, and the near-memory approximate acceleration structure and method further reduce power consumption, improving the energy efficiency of the system by more than 20%.
  • FIG. 1 is a structural frame diagram of the near-storage approximate acceleration computing module in the compound-granularity near-storage approximate acceleration structure of the long-short-term memory network proposed by the present invention
  • FIG. 2 is a working flow chart of the near-storage approximate acceleration calculation module circuit in the compound-granularity near-storage approximate acceleration structure of the long and short-term memory network proposed by the present invention
  • FIG. 3 is a schematic diagram of the functional configuration module structure of the near-memory approximate acceleration operation in the compound-granularity near-memory approximate acceleration structure of the long and short-term memory network proposed by the present invention.
  • Example 1 The acceleration structure of the traditional long- and short-term memory network is designed based on a single-granularity task division strategy, while the present invention proposes a compound-granularity near-storage approximate acceleration structure of the long- and short-term memory network, which is designed based on a compound-granularity task division strategy.
  • the composite granularity is composed of coarse granularity and fine granularity.
  • Coarse granularity is parallel acceleration at the cell level, and fine granularity is matrix acceleration inside the cell.
  • the task division strategy based on compound granularity can break the partition relationship between the cell level and the gate level.
  • Step A01: suppose that at time t the network reads the t-th input x_t and configures the weights W and bias values b of the input gate i, the forget gate f, the memory cell c, and the output gate o.
  • Step A02: calculate the response values of the input gate i, the forget gate f, the memory cell c, and the output gate o.
  • The network reads the hidden-layer state value h_(t-1) at time t-1, and the first partial sum and the second partial sum are updated accordingly.
  • Step A03: calculate the response values of the input gate i, the forget gate f, the memory cell c, and the output gate o; the network reads the memory-cell vector value c_(t-1) at time t-1, and the first and second partial sums satisfy the corresponding formulas.
  • Step A04: calculate the response values of the input gate i, the forget gate f, the memory cell c, and the output gate o; the first and second partial sums satisfy the corresponding formulas.
  • Step A05: calculate the vector values i_t and f_t of the input gate i and the forget gate f, and calculate the response values of the memory cell c and the output gate o; the first and second partial sums satisfy the following formulas:
  • i_t = σ(W_ix x_t + W_ih h_(t-1) + W_ic c_(t-1) + b_i)
  • f_t = σ(W_fx x_t + W_fh h_(t-1) + W_fc c_(t-1) + b_f), where σ is the sigmoid function.
  • Step A06: calculate the response values of the memory cell c and the output gate o; the first and second partial sums satisfy the corresponding formulas.
  • Step A07: calculate the vector value c_t of the memory cell c and the response value of the output gate o; the first and second partial sums satisfy the following formula:
  • c_t = f_t ⊙ c_(t-1) + i_t ⊙ φ(W_cx x_t + W_ch h_(t-1) + b_c), where ⊙ denotes element-wise multiplication and φ is the hyperbolic tangent function.
  • Step A08: calculate the response value of the output gate o; the first and second partial sums satisfy the corresponding formulas.
  • Step A09: calculate the response value of the output gate o; the first and second partial sums satisfy the corresponding formulas.
  • Step A10: calculate the vector value o_t of the output gate o; the first and second partial sums satisfy the following formula:
  • o_t = σ(W_ox x_t + W_oh h_(t-1) + W_oc c_(t-1) + b_o).
  • Step A11: calculate the hidden-layer state value h_t at time t; the first and second partial sums satisfy the following formula:
  • h_t = o_t ⊙ φ(c_t).
  • the parallel division strategy of computing tasks based on compound-granularity LSTM can better realize the parallelism of data-level operations under fine-grainedness and the parallelism of cell-level operations under coarse-grainedness.
  • By reorganizing and redistributing tasks in LSTM Improve parallelism.
  • In the composite-granularity network allocation model, the same operation in different gates and cells is executed only once; dependent operations between gates and cells in the fine-grained model, and between stages in the coarse-grained model, are converted into independent operations.
  • the compound granularity near-storage approximate acceleration structure of the long-short-term memory network proposed by the present invention includes: a near-storage approximate acceleration storage module, a matrix vector operation module, a near-storage approximate acceleration operation module, and a functional configuration module for near-storage approximate acceleration operation.
  • The matrix vector operation module is used for calculations between matrices and vectors; the resulting intermediate vector data is stored in the near-storage approximate acceleration storage module, which feeds the data to the near-storage approximate acceleration calculation module.
  • the near-storage approximate acceleration calculation module is used to perform calculations between vectors, and the near-storage approximate acceleration calculation function configuration module is used to configure the near-storage approximate acceleration calculation module.
  • the vector-to-vector calculation tasks performed by the near-storage approximate acceleration calculation module include several different vector calculation types, and the near-storage approximate acceleration calculation function configuration module supports different vector calculation types.
  • the matrix vector operation module is mainly used to calculate multiplication and addition operations
  • the near storage approximate acceleration operation module is used to calculate activation functions or addition operations.
  • The composite-granularity task division strategy divides the calculation tasks so that matrix-vector tasks are sent to the matrix vector operation module and vector-vector tasks are sent to the near-storage approximate acceleration operation module; the two computing modules process their tasks in parallel, achieving computational acceleration with higher execution efficiency and lower power consumption.
  • Example 3 In the compound-granularity near-storage approximate acceleration structure of the long-short-term memory network proposed in the present invention, the structure of the near-storage approximate acceleration calculation module is shown in FIG. 1.
  • The calculation module includes: a first data storage module, a second data storage module, and a data processing unit.
  • the data that needs to be calculated are respectively input into the first data storage module and the second data storage module, and the data obtained by the calculation is output by the first data storage module.
  • the solid line represents the data flow
  • the dashed line represents the configuration flow.
  • The first data storage module and the second data storage module are both address storage areas with a size of 1 KB, with a bit width of 16 × 16 bits and a depth of 32, corresponding to the composite granularity model.
  • the data processing unit includes: a configuration file cache, a configuration file parser, a first address generator, a second address generator, a multiplexer, and a multifunctional array processor.
  • The first address Add_0_x and the second address Add_1_x are input into the multiplexer; the configuration file parser also configures the multiplexer, which finally selects the data corresponding to one of the two addresses as its output.
  • The output data of the multiplexer is input to the multi-function array processor together with the first address Add_0_x; at the same time, the configuration file parser configures the calculation performed by the multi-function array processor, which operates on the multiplexer's output data and then stores the calculation result at the first address Add_0_x.
  • the configuration file cache is a cache array dedicated to configuration environment files;
  • the multi-function array processor is a reconfigurable multi-function array processor, whose input is a 16-bit fixed-point number, which can perform addition, multiplication, and sigmoid operations.
  • the fine-grained data-level parallelism and the coarse-grained unit-level parallelism are further utilized.
  • In the composite-granularity network partition model, the same or similar types of operations in different gates and units are processed as one task, so that dependent operations between gates and units in the fine-grained model, and between stages in the coarse-grained model, are converted into independent operations.
  • Example 4 The present invention also proposes a compound-granularity near-storage approximate acceleration method for a long and short-term memory network.
  • the steps of the acceleration method are shown in Figure 2, and the details are as follows:
  • Step S1, load the configuration file: load the configuration environment file into the configuration file cache;
  • Step S2 parsing the loaded configuration file, specifically includes the following three parallel steps:
  • Step S2-1 resolve the address configuration:
  • The configuration file parser performs the address-configuration parsing operation; the resulting address configuration files are loaded into the first address generator and the second address generator respectively and are used to select whether to take the address in the first address generator or the address in the second address generator;
  • Step S2-2 parse the multiplexer configuration:
  • the configuration file parser performs the multiplexer configuration analysis operation, obtains the multiplexer configuration file, and the multiplexer selects the data source;
  • The first address Add_0_x loaded into the first address generator and the second address Add_1_x loaded into the second address generator are both input into the multiplexer;
  • The multiplexer selects the data corresponding to one of the addresses Add_0_x and Add_1_x as its output, which is input into the multifunctional array processor together with the first address Add_0_x;
  • the configuration file parser performs the calculation configuration of the multi-function array processor.
  • The reconfigurable multi-function array processor (RMPA) is configured accordingly and performs the RMPA calculation on the output data of the multiplexer, after which the calculation result is stored at the first address Add_0_x;
  • Step S3: it is judged whether any configuration files remain; if so, return to step S1; if not, the procedure ends.
  • Example 5 In the composite granularity near-storage approximate acceleration structure of the long-short-term memory network proposed by the present invention, the function configuration is realized by the function configuration module of the near-storage approximate acceleration operation, and the bit width of the function configuration module of the near-storage approximate acceleration operation is 16, as shown in Figure 3.
  • the functional configuration module includes: address configuration unit, multiplexer configuration unit, and calculation configuration unit.
  • Bits 0 to 7 of the function configuration module form the address configuration unit; among them, bits 0 to 2 are the address generator selection unit Bank, used to select the address generator, and bits 3 to 7 are the address selection unit Address, used to select the address within the address generator.
  • When the address generator selection unit Bank is 000, the first address generator is selected; when the address selection unit Address is 00000, the first address Add_0_0 is selected, when Address is 00001, the first address Add_0_1 is selected, and so on, until Address 11111 selects the first address Add_0_31.
  • When the address generator selection unit Bank is 001, the second address generator is selected; when the address selection unit Address is 00000, the second address Add_1_0 is selected, when Address is 00001, the second address Add_1_1 is selected, and so on, until Address 11111 selects the second address Add_1_31.
  • Bits 8 to 11 of the function configuration module form the multiplexer configuration unit MUX, which is used to select the data source for the multiplexer.
  • the 12th to 15th bits of the functional configuration module are the calculation configuration unit OpCode, which is used to indicate the type of operation to be performed.
  • An OpCode of 0000 indicates addition, and an OpCode of 0100 indicates multiplication.
  • An OpCode of 1000 indicates a logical operation, and an OpCode of 1100 indicates a sigmoid operation. For approximate multiplication, the last two bits of the OpCode configure the number of iterations: OpCodes 0100, 0101 and 0110 indicate 0, 1 and 2 iterations respectively. The more iterations, the more accurate the calculation result.
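The document does not disclose which approximate-multiplication algorithm is used. As one concrete illustration of how an iteration count trades accuracy for work, the sketch below uses Mitchell-style logarithmic multiplication with recursive error compensation (an assumption for illustration, not the patented method):

```python
# Illustrative only: the document does not disclose its approximate multiplier.
# Mitchell-style logarithmic multiplication with recursive error compensation
# shows how an iteration count trades accuracy for work: each extra iteration
# approximates the residual x1*x2 term that plain Mitchell drops.
def approx_mul(a, b, iters=0):
    if a == 0 or b == 0:
        return 0
    k1, k2 = a.bit_length() - 1, b.bit_length() - 1
    x1, x2 = a - (1 << k1), b - (1 << k2)        # a = 2^k1 + x1, b = 2^k2 + x2
    result = (1 << (k1 + k2)) + (x1 << k2) + (x2 << k1)
    if iters > 0:
        result += approx_mul(x1, x2, iters - 1)  # compensate the dropped x1*x2
    return result

errors = [7 * 7 - approx_mul(7, 7, i) for i in range(3)]
print(errors)  # [9, 1, 0] -- the error shrinks as iterations increase
```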

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to the field of acceleration techniques for long short-term memory networks. Provided are a composite-granularity, near-storage and approximation-based acceleration structure and method for a long short-term memory network. The invention mainly employs a composite-granularity-based division policy to perform parallel division with respect to computation tasks. The acceleration structure comprises a matrix vector operation module, a near-storage and approximation-based acceleration storage module, a near-storage and approximation-based acceleration operation module, and a function configuration module for near-storage and approximation-based acceleration operations. The composite-granularity, near-storage and approximation-based acceleration structure and method for a long short-term memory network closely couple a storage structure with an approximation-based computation unit structure, and employ a composite-granularity-based task division and parallel computation policy to provide a highly efficient and flexible acceleration structure for a long short-term memory neural network.

Description

Composite-Granularity, Near-Storage and Approximation-Based Acceleration Structure and Method for Long Short-Term Memory Network

Technical Field

The present invention belongs to the technical field of long short-term memory (LSTM) network acceleration, and specifically relates to a composite-granularity, near-storage and approximation-based acceleration structure and method for a long short-term memory network.

Background Art

In recent years, with the continuous development of deep learning, the long short-term memory (LSTM) network, a special type of recurrent neural network, has been widely applied in many fields such as audio and video because of its excellent performance on long sequences. However, as neural network applications rapidly multiply, network scale keeps growing, the volume of data to be processed increases sharply, and requirements on processing latency and power consumption tighten further, memory and bandwidth face great challenges; at the same time, dependencies among data and the demands of centralized computation severely limit the performance of network accelerators. This makes it difficult for traditionally structured LSTM networks to meet design requirements.

Therefore, to address the high memory-bandwidth requirements and high computational power consumption of LSTM networks, the existing technology needs to be improved so as to increase the processing parallelism and operation speed of LSTM neural networks.

Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art by proposing a composite-granularity, near-storage and approximation-based acceleration structure and method for a long short-term memory network, which partitions the LSTM scheduling strategy on the basis of a composite-granularity network model and performs operations with a near-storage approximate acceleration operation module, thereby better improving operational parallelism and speed.

To solve the above technical problems, the present invention proposes the following technical solutions:

Based on a composite-granularity task division strategy, the present invention proposes a composite-granularity, near-storage and approximation-based acceleration structure for a long short-term memory network. The composite granularity consists of a coarse granularity and a fine granularity: the coarse granularity is cell-level parallel acceleration, and the fine granularity is matrix acceleration inside a cell.

The composite-granularity, near-storage and approximation-based acceleration structure proposed by the present invention comprises: a near-storage approximate acceleration storage module, a matrix-vector operation module, a near-storage approximate acceleration operation module, and a function configuration module for near-storage approximate acceleration operations.

First, the matrix-vector operation module performs calculations between matrices and vectors. The intermediate vector data obtained are stored in the near-storage approximate acceleration storage module, which supplies the various vector data to the near-storage approximate acceleration operation module. The near-storage approximate acceleration operation module performs calculations between vectors, and the function configuration module for near-storage approximate acceleration operations configures the near-storage approximate acceleration operation module.

Further, the vector-vector calculation tasks performed by the near-storage approximate acceleration operation module comprise several different vector calculation types, and the function configuration module for near-storage approximate acceleration operations provides support for these different vector calculation types.

Further, the matrix-vector operation module is mainly used to compute multiply-accumulate operations, while the near-storage approximate acceleration operation module is used to compute activation functions and addition operations.

A composite-granularity task division strategy is applied to the LSTM network to partition the calculation tasks: matrix-vector calculation tasks are sent to the matrix-vector operation module, vector-vector calculation tasks are sent to the near-storage approximate acceleration operation module, and the two operation modules compute their tasks in parallel, thereby achieving computational acceleration with higher execution efficiency and lower power consumption.

In the composite-granularity, near-storage and approximation-based acceleration structure proposed by the present invention, the near-storage approximate acceleration operation module comprises: a first data storage module, a second data storage module, and a data processing unit. The data to be operated on are input into the first data storage module and the second data storage module respectively, and the resulting data are output by the first data storage module.

Both the first data storage module and the second data storage module are 1 KB address storage areas with a bit width of 16×16 bits and a depth of 32. The first partial sums S_0i (i = 1, 2, …, 9, A, B) and the second partial sums S_1i (i = 1, 2, …, 9, A, B) are stored in the first data storage module and the second data storage module respectively.

Further, under the composite-granularity task division strategy, the calculation steps for the first partial sums S_0i and the second partial sums S_1i of the LSTM are as follows:

Step A01: suppose that at time t, the network reads the t-th input x_t and at the same time configures the weights W and bias values b for the responses of the input gate i, forget gate f, memory cell c and output gate o. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000001

Step A02: to calculate the response values of the input gate i, forget gate f, memory cell c and output gate o, the network reads the hidden-layer state value h_(t-1) at time t-1. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000002

Step A03: to calculate the response values of the input gate i, forget gate f, memory cell c and output gate o, the network reads the memory-cell vector value c_(t-1) at time t-1. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000003

Step A04: calculate the response values of the input gate i, forget gate f, memory cell c and output gate o. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000004

Step A05: the vector values i_t and f_t of the input gate i and the forget gate f are obtained by calculation, and the response values of the memory cell c and the output gate o are calculated. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000005

In the above formulas, i_t = σ(W_ix x_t + W_ih h_(t-1) + W_ic c_(t-1) + b_i) and f_t = σ(W_fx x_t + W_fh h_(t-1) + W_fc c_(t-1) + b_f), where σ is the sigmoid function.

Step A06: calculate the response values of the memory cell c and the output gate o. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000006

Step A07: the vector value c_t of the memory cell c is obtained by calculation, and the response value of the output gate o is calculated. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000007

In the above formulas, c_t = f_t ⊙ c_(t-1) + i_t ⊙ φ(W_cx x_t + W_ch h_(t-1) + b_c), where ⊙ denotes element-wise multiplication and φ is the hyperbolic tangent function.

Step A08: calculate the response value of the output gate o. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000008

Step A09: calculate the response value of the output gate o. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000009

Step A10: the vector value o_t of the output gate o is obtained by calculation. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000010

In the above formulas, o_t = σ(W_ox x_t + W_oh h_(t-1) + W_oc c_(t-1) + b_o).

Step A11: the hidden-layer state value h_t at time t is obtained by calculation. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000011

In the above formulas, h_t = o_t ⊙ φ(c_t).
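Taken together, steps A01 to A11 compute one standard LSTM cell update with peephole connections. The following plain floating-point reference of the same formulas (scalar dimensions for brevity, without the hardware split into the partial sums S_0i and S_1i) is an illustrative sketch, not the patent's fixed-point implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step matching the formulas for i_t, f_t, c_t, o_t, h_t above."""
    # i_t = sigmoid(W_ix x_t + W_ih h_(t-1) + W_ic c_(t-1) + b_i)
    i_t = sigmoid(W['ix'] * x_t + W['ih'] * h_prev + W['ic'] * c_prev + b['i'])
    # f_t = sigmoid(W_fx x_t + W_fh h_(t-1) + W_fc c_(t-1) + b_f)
    f_t = sigmoid(W['fx'] * x_t + W['fh'] * h_prev + W['fc'] * c_prev + b['f'])
    # c_t = f_t * c_(t-1) + i_t * tanh(W_cx x_t + W_ch h_(t-1) + b_c)
    c_t = f_t * c_prev + i_t * math.tanh(W['cx'] * x_t + W['ch'] * h_prev + b['c'])
    # o_t = sigmoid(W_ox x_t + W_oh h_(t-1) + W_oc c_(t-1) + b_o)
    o_t = sigmoid(W['ox'] * x_t + W['oh'] * h_prev + W['oc'] * c_prev + b['o'])
    # h_t = o_t * tanh(c_t)
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to σ(0) = 0.5, which gives a quick sanity check of the recurrence: c_t halves the previous cell state.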

The data processing unit in the near-storage approximate acceleration operation module comprises: a configuration file cache, a configuration file parser, a first address generator, a second address generator, a multiplexer, and a multifunctional array processor.

A configuration environment file is loaded into the configuration file cache, and the configuration file parser then performs an address-configuration analysis; the resulting address configurations are loaded into the first address generator and the second address generator respectively. According to the address configuration, the first address generator decides whether to select the corresponding first address from the first data storage module, and the second address generator decides whether to select the corresponding second address from the second data storage module. The first address loaded into the first address generator and the second address loaded into the second address generator are both input into the multiplexer. The configuration file parser also configures the multiplexer, which finally selects the data corresponding to one of the first and second addresses as its output data; this output is input into the multifunctional array processor together with the first address. At the same time, the configuration file parser performs the computation configuration of the multifunctional array processor, which computes on the multiplexer's output data and then stores the result at the first address.

Further, the configuration file cache is a cache array dedicated to configuration environment files; the multifunctional array processor is a reconfigurable multifunctional array processor whose inputs are 16-bit fixed-point numbers and which can perform addition, multiplication and sigmoid operations.
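As a concrete illustration of what such a 16-bit fixed-point datapath computes, the sketch below models saturating addition, multiplication and a piecewise sigmoid on 16-bit values. The Q8.8 split and the particular sigmoid approximation are assumptions made for illustration; the patent only specifies 16-bit fixed-point inputs and the three operation types:

```python
FRAC = 8  # assumed Q8.8 split; the text only states 16-bit fixed point

def to_fix(x):
    """Quantize a real number to a saturated 16-bit fixed-point code."""
    v = int(round(x * (1 << FRAC)))
    return max(-32768, min(32767, v))

def fix_add(a, b):
    """Saturating 16-bit fixed-point addition."""
    return max(-32768, min(32767, a + b))

def fix_mul(a, b):
    """16x16 -> 32-bit product, rescaled back to Q8.8 with saturation."""
    return max(-32768, min(32767, (a * b) >> FRAC))

def fix_sigmoid(a):
    """Piecewise sigmoid, a common hardware-friendly approximation."""
    x = a / (1 << FRAC)
    if x <= -4.0:
        y = 0.0
    elif x >= 4.0:
        y = 1.0
    else:
        y = 0.5 + x * (1.0 - abs(x) / 8.0) / 4.0
    return to_fix(y)
```

With 8 fractional bits, fix_mul(to_fix(1.5), to_fix(2.0)) returns the code for 3.0, and fix_sigmoid(0) returns the code for 0.5.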

In the composite-granularity, near-storage and approximation-based acceleration structure proposed by the present invention, function configuration is realized by the function configuration module for near-storage approximate acceleration operations. The function configuration module has a bit width of 16 and comprises: an address configuration unit, a multiplexer configuration unit, and a computation configuration unit.

Further, bits 0 to 7 of the function configuration module form the address configuration unit; among them, bits 0 to 2 form the address-generator selection unit, used to select an address generator, and bits 3 to 7 form the address selection unit, used to select an address within the address generator.

Further, bits 8 to 11 of the function configuration module form the multiplexer configuration unit, used by the multiplexer to select the operand data for the operation.

Further, bits 12 to 15 of the function configuration module form the computation configuration unit, which indicates the type of operation to be performed; the computation configuration unit can indicate addition, multiplication, logic operations, sigmoid operations and approximate multiplication. For approximate multiplication, the last two bits of the computation configuration unit configure the number of iterations of the approximate multiplication; the more iterations, the more accurate the result.
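The bit layout just described can be decoded mechanically. In the sketch below, the field boundaries follow the text (bits 0-2 address-generator select, bits 3-7 address select, bits 8-11 multiplexer configuration, bits 12-15 operation type, with the top two bits of the operation field giving the approximate-multiplication iteration count); the dictionary keys and any numeric op-code meanings are illustrative assumptions:

```python
def decode_config(word):
    """Decode a 16-bit function-configuration word into its fields."""
    assert 0 <= word <= 0xFFFF
    cfg = {
        'addr_gen':  word        & 0x7,   # bits 0-2: which address generator
        'addr_sel': (word >> 3)  & 0x1F,  # bits 3-7: address within it (0-31)
        'mux_sel':  (word >> 8)  & 0xF,   # bits 8-11: multiplexer selection
        'op':       (word >> 12) & 0xF,   # bits 12-15: operation type
    }
    # For approximate multiplication, the top two bits of the op field
    # give the iteration count (more iterations -> higher accuracy).
    cfg['approx_iters'] = (word >> 14) & 0x3
    return cfg
```

For example, the word 0xB5B2 decodes to address generator 2, address 22, multiplexer selection 5, operation code 11, and an iteration count of 2.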

The present invention also proposes a composite-granularity, near-storage and approximation-based acceleration method for a long short-term memory network. The steps of the acceleration method are as follows:

Step S1, load the configuration file: load the configuration environment file into the configuration file cache.

Step S2, parse the loaded configuration file, which specifically comprises the following three steps carried out in parallel:

Step S2-1, parse the address configuration:

The configuration file parser performs the address-configuration analysis, and the resulting address configurations are loaded into the first address generator and the second address generator respectively; they are used to choose whether to take the address in the first address generator or the address in the second address generator.

Further, the first address generator decides, according to the address configuration, whether to select the corresponding first address from the first data storage module and load it into the first address generator; the second address generator decides, according to the address configuration, whether to select the corresponding second address from the second data storage module and load it into the second address generator.

Step S2-2, parse the multiplexer configuration:

The configuration file parser performs the multiplexer-configuration analysis to obtain the multiplexer configuration, and the multiplexer selects the data source.

Further, the first address loaded into the first address generator and the second address loaded into the second address generator are both input into the multiplexer.

Once configured, the multiplexer selects the data corresponding to one of the first and second addresses as its output data, which is input into the multifunctional array processor together with the first address.

Step S2-3, parse the computation configuration:

The configuration file parser performs the computation configuration of the multifunctional array processor: the reconfigurable multifunctional array processor (RMPA) is configured accordingly, performs the RMPA computation on the multiplexer's output data, and then stores the result at the first address.

Step S3, judge whether there are any more configuration files: if so, return to step S1; if not, the process ends.
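Steps S1 to S3 amount to a configuration-driven fetch/decode/execute loop. A minimal software model of that loop is sketched below; the compact word encoding, the two-entry operation table and the memory naming are simplifying assumptions for the model, not the exact configuration format described above:

```python
def run_accelerator(config_words, mem0, mem1):
    """Software model of the S1-S3 loop: fetch a configuration word,
    resolve the two addresses, select one operand via the mux, compute,
    and write the result back to the first address."""
    ops = {0: lambda a, b: a + b,   # addition
           1: lambda a, b: a * b}   # multiplication (exact, for the model)
    for w in config_words:          # S1: load the next configuration
        addr0 = (w >> 3) & 0x1F     # S2-1: first-address selection
        addr1 = (w >> 8) & 0x1F     #       second-address selection (model)
        use_mem1 = (w >> 13) & 0x1  # S2-2: mux picks the mem0 or mem1 operand
        operand = mem1[addr1] if use_mem1 else mem0[addr1]
        op = ops[(w >> 14) & 0x1]   # S2-3: computation configuration
        mem0[addr0] = op(mem0[addr0], operand)  # result back to first address
    return mem0                     # S3: loop ends when no words remain
```

Running two words over initialized memories, e.g. an add of mem1[1] into mem0[0] followed by squaring mem0[0], accumulates the result in place in mem0.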

Compared with the prior art, the near-storage approximate acceleration structure for long short-term memory neural networks proposed by the present invention has the following benefits:

The storage structure and the approximate computing unit structure are designed with tight coupling, and a more efficient and flexible acceleration structure is designed for long short-term memory neural networks through composite-granularity task division and a parallel computing strategy. In concrete computations, the composite-granularity task division and parallel computing strategy increase task parallelism by more than 30%, and the near-storage approximate acceleration structure and method further reduce power consumption, improving system energy efficiency by more than 20%.

Description of the Drawings

Figure 1 is a structural block diagram of the near-storage approximate acceleration operation module in the composite-granularity, near-storage and approximation-based acceleration structure proposed by the present invention;

Figure 2 is a working flowchart of the near-storage approximate acceleration operation module circuit in the composite-granularity, near-storage and approximation-based acceleration structure proposed by the present invention;

Figure 3 is a schematic diagram of the structure of the function configuration module for near-storage approximate acceleration operations in the composite-granularity, near-storage and approximation-based acceleration structure proposed by the present invention.

Detailed Description

The present invention is described in further detail below in conjunction with embodiments.

Embodiment 1. The acceleration structure of a traditional LSTM network is designed on the basis of a single-granularity task division strategy, whereas the composite-granularity, near-storage and approximation-based acceleration structure of an LSTM network proposed by the present invention is designed on the basis of a composite-granularity task division strategy.

Further, the composite granularity consists of a coarse granularity and a fine granularity: the coarse granularity is cell-level parallel acceleration, and the fine granularity is matrix acceleration inside a cell. The composite-granularity task division strategy can break the separation between the cell level and the gate level.

Under the composite-granularity task division strategy, the specific steps of the LSTM calculation are as follows:

Step A01: suppose that at time t, the network reads the t-th input x_t and at the same time configures the weights W and bias values b for the responses of the input gate i, forget gate f, memory cell c and output gate o. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000012

Step A02: to calculate the response values of the input gate i, forget gate f, memory cell c and output gate o, the network reads the hidden-layer state value h_(t-1) at time t-1. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000013

Step A03: to calculate the response values of the input gate i, forget gate f, memory cell c and output gate o, the network reads the memory-cell vector value c_(t-1) at time t-1. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000014

Step A04: calculate the response values of the input gate i, forget gate f, memory cell c and output gate o. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000015

Step A05: the vector values i_t and f_t of the input gate i and the forget gate f are obtained by calculation, and the response values of the memory cell c and the output gate o are calculated. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000016

In the above formulas, i_t = σ(W_ix x_t + W_ih h_(t-1) + W_ic c_(t-1) + b_i) and f_t = σ(W_fx x_t + W_fh h_(t-1) + W_fc c_(t-1) + b_f), where σ is the sigmoid function.

Step A06: calculate the response values of the memory cell c and the output gate o. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000017

Step A07: the vector value c_t of the memory cell c is obtained by calculation, and the response value of the output gate o is calculated. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000018

In the above formulas, c_t = f_t ⊙ c_(t-1) + i_t ⊙ φ(W_cx x_t + W_ch h_(t-1) + b_c), where ⊙ denotes element-wise multiplication and φ is the hyperbolic tangent function.

Step A08: calculate the response value of the output gate o. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000019

Step A09: calculate the response value of the output gate o. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000020

Step A10: the vector value o_t of the output gate o is obtained by calculation. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000021

In the above formulas, o_t = σ(W_ox x_t + W_oh h_(t-1) + W_oc c_(t-1) + b_o).

Step A11: the hidden-layer state value h_t at time t is obtained by calculation. The first and second partial sums satisfy the following formulas:

Figure PCTCN2020106988-appb-000022

In the above formulas, h_t = o_t ⊙ φ(c_t).

The composite-granularity parallel division strategy for LSTM calculation tasks can better realize both the parallelism of data-level operations at fine granularity and the parallelism of cell-level operations at coarse granularity; by reorganizing and redistributing the tasks in the LSTM, parallelism is improved. In the composite-granularity network allocation model, identical operations in different gates and cells are executed only once: dependent operations between gates and cells in the fine-grained model, and between stages in the coarse-grained model, are converted into independent operations.
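One concrete consequence of treating the same operation across different gates as a single task is that the four gates' weight matrices can be stacked and swept against the input once, instead of issuing four separate matrix-vector products. The sketch below is illustrative of this regrouping (the patent does not prescribe this exact scheduling), shown here for the products against one input vector:

```python
def fused_gate_products(W, x):
    """Stack the i/f/c/o weight matrices row-wise and sweep the input x
    once, so the four gates' matrix-vector products run as one task
    rather than four separately issued ones."""
    gates = ['i', 'f', 'c', 'o']
    stacked = [row for g in gates for row in W[g]]          # (4H) rows of len(x)
    y = [sum(w * v for w, v in zip(row, x)) for row in stacked]
    H = len(W['i'])                                         # hidden size per gate
    return {g: y[k * H:(k + 1) * H] for k, g in enumerate(gates)}
```

The same stacking applies to the products against h_(t-1); only the split of the stacked result back into per-gate slices depends on the gate order chosen.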

Embodiment 2. The composite-granularity, near-storage and approximation-based acceleration structure proposed by the present invention comprises: a near-storage approximate acceleration storage module, a matrix-vector operation module, a near-storage approximate acceleration operation module, and a function configuration module for near-storage approximate acceleration operations.

Here, first, the matrix-vector operation module performs calculations between matrices and vectors; the intermediate vector data obtained are stored in the near-storage approximate acceleration storage module, which supplies the various vector data to the near-storage approximate acceleration operation module. The near-storage approximate acceleration operation module performs calculations between vectors, and the function configuration module for near-storage approximate acceleration operations configures the near-storage approximate acceleration operation module.

Further, the vector-vector calculation tasks performed by the near-storage approximate acceleration operation module comprise several different vector calculation types, and the function configuration module for near-storage approximate acceleration operations provides support for these different vector calculation types.

Further, the matrix-vector operation module is mainly used to compute multiply-accumulate operations, while the near-storage approximate acceleration operation module is used to compute activation functions and addition operations.

A composite-granularity task division strategy is applied to the LSTM network to partition the calculation tasks: matrix-vector calculation tasks are sent to the matrix-vector operation module, vector-vector calculation tasks are sent to the near-storage approximate acceleration operation module, and the two operation modules compute their tasks in parallel, thereby achieving computational acceleration with higher execution efficiency and lower power consumption.

Embodiment 3. In the composite-granularity, near-storage and approximation-based acceleration structure proposed by the present invention, the structure of the near-storage approximate acceleration operation module is shown in Figure 1. The operation module comprises: a first data storage module, a second data storage module, and a data processing unit. The data to be operated on are input into the first data storage module and the second data storage module respectively, and the resulting data are output by the first data storage module. In Figure 1, solid lines represent data flows and dashed lines represent configuration flows.

Both the first data storage module and the second data storage module are 1 KB address storage areas with a bit width of 16×16 bits and a depth of 32, corresponding to the composite-granularity model. The first partial sums S_0i (i = 1, 2, …, 9, A, B) and the second partial sums S_1i (i = 1, 2, …, 9, A, B) are stored in the first data storage module and the second data storage module respectively.

The data processing unit comprises: a configuration file cache, a configuration file parser, a first address generator, a second address generator, a multiplexer, and a multifunctional array processor.

A configuration environment file is loaded into the configuration file cache, and the configuration file parser then performs an address-configuration analysis; the resulting address configurations are loaded into the first address generator and the second address generator respectively. According to the address configuration, the first address generator decides whether to select the corresponding first address Add_0_x (x = 0, 1, 2, …, 30, 31) from the first data storage module, and the second address generator decides whether to select the corresponding second address Add_1_x (x = 0, 1, 2, …, 30, 31) from the second data storage module. The first address Add_0_x loaded into the first address generator and the second address Add_1_x loaded into the second address generator are both input into the multiplexer. The configuration file parser also configures the multiplexer, which finally selects the data corresponding to one of the first address Add_0_x and the second address Add_1_x as its output data; this output is input into the multifunctional array processor together with the first address Add_0_x. At the same time, the configuration file parser performs the computation configuration of the multifunctional array processor, which computes on the multiplexer's output data and then stores the result at the first address Add_0_x.

Further, the configuration file cache is a cache array dedicated to configuration environment files. The multifunctional array processor is reconfigurable; its inputs are 16-bit fixed-point numbers, and it can perform addition, multiplication, and sigmoid operations.
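As a rough illustration of 16-bit fixed-point addition, multiplication, and sigmoid evaluation, the sketch below assumes a Q8.8 format (8 fraction bits). The patent does not state the fraction width, and a hardware RMPA would evaluate the sigmoid with a lookup table or piecewise approximation rather than `math.exp`; this is only a reference model of the arithmetic.

```python
import math

# Hypothetical Q8.8 interpretation of the 16-bit fixed-point inputs
# (the fraction width is an assumption, not stated in the text).
FRAC = 8
SCALE = 1 << FRAC

def to_fixed(x: float) -> int:
    return int(round(x * SCALE))

def to_float(q: int) -> float:
    return q / SCALE

def fx_add(a: int, b: int) -> int:
    return a + b                     # same scale, so plain integer add

def fx_mul(a: int, b: int) -> int:
    return (a * b) >> FRAC           # rescale after multiplication

def fx_sigmoid(a: int) -> int:
    # Reference behavior; hardware would use a table/piecewise approximation.
    return to_fixed(1.0 / (1.0 + math.exp(-to_float(a))))
```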

The composite-granularity network model partitions and reorganizes the tasks in the long short-term memory network, further exploiting fine-grained data-level parallelism and coarse-grained cell-level parallelism. In the composite-granularity partitioning model, operations of the same or similar type across different gates and cells are handled as a single task, so that the dependent operations between gates and cells in the fine-grained model, and between stages in the coarse-grained model, are converted into independent operations.
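The effect of this regrouping can be illustrated in a few lines: the four gate pre-activations of an LSTM cell are all matrix-vector products, so same-type operations across the gates can be stacked into one matrix-vector task instead of four dependent per-gate tasks. The NumPy sketch below is a toy model with assumed dimensions, not part of the disclosure.

```python
import numpy as np

H, X = 4, 3                       # hidden size, input size (toy dimensions)
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((H, X)) for g in "ifco"}   # per-gate weights
x = rng.standard_normal(X)

# Fine-grained / per-gate view: four separate matrix-vector tasks.
per_gate = {g: W[g] @ x for g in "ifco"}

# Composite-granularity view: one stacked matrix-vector task that can be
# dispatched as a whole to the matrix-vector operation module.
W_stack = np.vstack([W[g] for g in "ifco"])            # shape (4H, X)
fused = W_stack @ x

# The fused task reproduces the per-gate results, but with no dependencies
# between the four gate computations.
for k, g in enumerate("ifco"):
    assert np.allclose(fused[k * H:(k + 1) * H], per_gate[g])
```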

Embodiment 4. The present invention further provides a composite-granularity, near-storage, approximation-based acceleration method for a long short-term memory network. The steps of the acceleration method are shown in Figure 2 and are as follows:

Step S1, load the configuration file: the configuration environment file is loaded into the configuration file cache;

Step S2, parse the loaded configuration file; this comprises the following three steps, carried out in parallel:

Step S2-1, parse the address configuration:

The configuration file parser performs address-configuration analysis; the resulting address configuration is loaded into the first address generator and the second address generator, respectively, and is used to select either an address from the first address generator or an address from the second address generator;

Further, based on the address configuration, the first address generator decides whether to select the corresponding first address Add_0_x (x = 0, 1, 2, ..., 30, 31) from the first data storage module and load it into the first address generator, and the second address generator decides whether to select the corresponding second address Add_1_x (x = 0, 1, 2, ..., 30, 31) from the second data storage module and load it into the second address generator;

Step S2-2, parse the multiplexer configuration:

The configuration file parser performs multiplexer-configuration analysis to obtain the multiplexer configuration, according to which the multiplexer selects its data source;

Further, the first address Add_0_x loaded into the first address generator and the second address Add_1_x loaded into the second address generator are both fed to the multiplexer;

Once configured, the multiplexer selects the data corresponding to either the first address Add_0_x or the second address Add_1_x as its output, which is passed to the multifunctional array processor together with the first address Add_0_x;

Step S2-3, parse the computation configuration:

The configuration file parser performs the computation configuration of the multifunctional array processor; through this computation configuration, the reconfigurable multifunctional array processor (RMPA) performs the RMPA computation on the multiplexer's output data and then stores the result at the first address Add_0_x;

步骤S3,判断还有没有更多的配置文件,如果有返回步骤S1,如果没有则 结束。In step S3, it is judged whether there are more configuration files, if there are any more configuration files, return to step S1, and if there are none, then end.

Embodiment 5. In the composite-granularity, near-storage, approximation-based acceleration structure for a long short-term memory network proposed by the present invention, function configuration is performed by the function configuration module for the near-storage approximate acceleration operation, which is 16 bits wide. As shown in Figure 3, the function configuration module comprises an address configuration unit, a multiplexer configuration unit, and a computation configuration unit.

Further, bits 0 to 7 of the function configuration module form the address configuration unit. Within it, bits 0 to 2 are the address-generator selection field Bank, used to select an address generator, and bits 3 to 7 are the address selection field Address, used to select an address within that address generator.

In this preferred embodiment, a Bank value of 000 selects the first address generator; an Address value of 00000 then selects the first address Add_0_0, a value of 00001 selects Add_0_1, and so on, up to a value of 11111, which selects Add_0_31.

In this preferred embodiment, a Bank value of 001 selects the second address generator; an Address value of 00000 then selects the second address Add_1_0, a value of 00001 selects Add_1_1, and so on, up to a value of 11111, which selects Add_1_31.

Further, bits 8 to 11 of the function configuration module form the multiplexer configuration field MUX, used to select the data for the multiplier operation.

Further, bits 12 to 15 of the function configuration module form the computation configuration field OpCode, which indicates the type of operation to perform. In this preferred embodiment, an OpCode of 0000 denotes addition, 0100 denotes multiplication, 1000 denotes a logical operation, and 1100 denotes the sigmoid operation. For approximate multiplication, the last two bits of the OpCode configure the number of iterations: OpCode values of 0100, 0101, and 0110 denote 0, 1, and 2 iterations, respectively; the more iterations, the more accurate the result.
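A decoder for this 16-bit configuration word might look as follows. The field positions follow the text above; treating bit 0 as the least significant bit of the word is an assumption.

```python
# Sketch of a decoder for the 16-bit function configuration word.
OPS = {0b0000: "add", 0b1000: "logic", 0b1100: "sigmoid"}

def decode(word: int) -> dict:
    bank    = word & 0b111            # bits 0-2: Bank (which address generator)
    address = (word >> 3) & 0b11111   # bits 3-7: Address within the generator
    mux     = (word >> 8) & 0b1111    # bits 8-11: MUX (operand selection)
    opcode  = (word >> 12) & 0b1111   # bits 12-15: OpCode (operation type)
    if opcode >> 2 == 0b01:           # 01xx: (approximate) multiplication;
        return {"bank": bank, "address": address, "mux": mux,
                "op": "mul", "iterations": opcode & 0b11}  # last 2 bits = iterations
    return {"bank": bank, "address": address, "mux": mux,
            "op": OPS.get(opcode, "reserved")}

# Example: Bank=001 (second generator), Address=00001 (Add_1_1),
# OpCode=0101 (approximate multiply with 1 correction iteration).
cfg = decode((0b0101 << 12) | (0b0001 << 3) | 0b001)
```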

The above specific implementations and embodiments support the technical ideas of the composite-granularity, near-storage, approximation-based acceleration structure and method for a long short-term memory network proposed by the present invention, and shall not be construed as limiting the scope of protection of the present invention. Any equivalent change or modification made on the basis of this technical solution in accordance with the technical ideas of the present invention still falls within the scope of protection of the technical solution of the present invention.

Claims (6)

1. A composite-granularity, near-storage, approximation-based acceleration structure for a long short-term memory network, comprising: a near-storage approximate acceleration storage module, a matrix-vector operation module, a near-storage approximate acceleration operation module, and a function configuration module for the near-storage approximate acceleration operation, characterized in that:

the composite-granularity near-storage approximate acceleration structure carries out computation tasks in parallel based on a composite-granularity task partitioning strategy, whereby computation tasks between matrices and vectors are dispatched to the matrix-vector operation module and computation tasks between vectors are dispatched to the near-storage approximate acceleration operation module, the matrix-vector operation module and the near-storage approximate acceleration operation module carrying out their computation tasks simultaneously and in parallel; the matrix-vector operation module is mainly used to compute multiply-accumulate operations, and the near-storage approximate acceleration operation module is used to compute activation functions or addition operations;

the composite granularity consists of a coarse granularity and a fine granularity, the coarse granularity being cell-level parallel acceleration and the fine granularity being matrix acceleration within a cell;

the near-storage approximate acceleration operation module comprises: a first data storage module, a second data storage module, and a data processing unit; the data to be operated on are input into the first data storage module and the second data storage module respectively, that is, the first data storage module stores the first partial sum and the second data storage module stores the second partial sum, and the resulting data are output by the first data storage module;

the function configuration module for the near-storage approximate acceleration operation comprises: an address configuration unit, a multiplexer configuration unit, and a computation configuration unit;

the matrix-vector operation module first performs the computations between matrices and vectors; the resulting intermediate vector data are stored in the near-storage approximate acceleration storage module, which supplies the various vector data to the near-storage approximate acceleration operation module; the function configuration module for the near-storage approximate acceleration operation is used to configure the functions of the near-storage approximate acceleration operation module.

2. The composite-granularity near-storage approximate acceleration structure for a long short-term memory network according to claim 1, characterized in that the first data storage module and the second data storage module are each address storage areas of 1 KB in size, with a width of 16×16 bits and a depth of 32.

3. The composite-granularity near-storage approximate acceleration structure for a long short-term memory network according to claim 1, characterized in that the data processing unit comprises: a configuration file cache, a configuration file parser, a first address generator, a second address generator, a multiplexer, and a multifunctional array processor;

the configuration environment file is loaded into the configuration file cache, after which the configuration file parser performs address-configuration analysis, and the resulting address configuration is loaded into the first address generator and the second address generator; based on the address configuration, the first address generator decides whether to select the corresponding first address from the first data storage module, and the second address generator decides whether to select the corresponding second address from the second data storage module; the first address loaded into the first address generator and the second address loaded into the second address generator are both fed to the multiplexer; the configuration file parser also configures the multiplexer, which selects the data corresponding to either the first address or the second address as its output, this output being passed to the multifunctional array processor together with the first address; meanwhile, the configuration file parser applies the computation configuration to the multifunctional array processor, which operates on the multiplexer's output data and then stores the result at the first address;

further, the configuration file cache is a cache array dedicated to configuration environment files; the multifunctional array processor is reconfigurable, its inputs are 16-bit fixed-point numbers, and it can perform addition, multiplication, and sigmoid operations.

4. The composite-granularity near-storage approximate acceleration structure for a long short-term memory network according to claim 1, characterized in that the function configuration module for the near-storage approximate acceleration operation is 16 bits wide;

bits 0 to 7 of the function configuration module form the address configuration unit, within which bits 0 to 2 are the address-generator selection unit, used to select an address generator, and bits 3 to 7 are the address selection unit, used to select an address within the address generator;

bits 8 to 11 of the function configuration module form the multiplexer configuration unit, used to select the data for the multiplier operation;

bits 12 to 15 of the function configuration module form the computation configuration unit, which indicates the type of operation to perform; the computation configuration unit can denote addition, multiplication, logical operations, the sigmoid operation, and approximate multiplication, and for approximate multiplication the last two bits of the computation configuration unit configure the number of iterations.

5. A composite-granularity, near-storage, approximation-based acceleration method for a long short-term memory network, characterized in that the steps of the acceleration method are as follows:

Step S1, load the configuration file: the configuration environment file is loaded into the configuration file cache;

Step S2, parse the loaded configuration file; this comprises the following three steps, carried out in parallel:

Step S2-1, parse the address configuration: the configuration file parser performs address-configuration analysis; the resulting address configuration is loaded into the first address generator and the second address generator, respectively, and is used to select either an address from the first address generator or an address from the second address generator; further, based on the address configuration, the first address generator decides whether to select the corresponding first address from the first data storage module and load it into the first address generator, and the second address generator decides whether to select the corresponding second address from the second data storage module and load it into the second address generator;

Step S2-2, parse the multiplexer configuration: the configuration file parser performs multiplexer-configuration analysis to obtain the multiplexer configuration, according to which the multiplexer selects its data source; further, the first address loaded into the first address generator and the second address loaded into the second address generator are both fed to the multiplexer; once configured, the multiplexer selects the data corresponding to either the first address or the second address as its output, which is passed to the multifunctional array processor together with the first address;

Step S2-3, parse the computation configuration: the configuration file parser performs the computation configuration of the multifunctional array processor; through this computation configuration, the reconfigurable multifunctional array processor (RMPA) performs the RMPA computation on the multiplexer's output data and then stores the result at the first address;

Step S3, determine whether more configuration files remain; if so, return to step S1; if not, end.
6. The composite-granularity near-storage approximate acceleration structure for a long short-term memory network according to claim 1, characterized in that, under the composite-granularity task partitioning strategy, the computation steps for the first partial sum S_0i stored in the first data storage module and the second partial sum S_1i stored in the second data storage module are as follows, where i denotes the computation step:

Step A01, at time t the network reads the t-th input x_t and configures the weights W and bias values b for the responses of the input gate i, the forget gate f, the memory cell c, and the output gate o; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100001]
Step A02, compute the response values of the input gate i, the forget gate f, the memory cell c, and the output gate o; the network reads the hidden-layer state value h_(t-1) at time t-1; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100002]
Step A03, compute the response values of the input gate i, the forget gate f, the memory cell c, and the output gate o; the network reads the memory-cell vector value c_(t-1) at time t-1; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100003]
Step A04, compute the response values of the input gate i, the forget gate f, the memory cell c, and the output gate o; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100004]
Step A05, compute the vector values i_t and f_t of the input gate i and the forget gate f, and compute the response values of the memory cell c and the output gate o; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100005]
In the above formula, i_t = σ(W_ix·x_t + W_ih·h_(t-1) + W_ic·c_(t-1) + b_i) and f_t = σ(W_fx·x_t + W_fh·h_(t-1) + W_fc·c_(t-1) + b_f), where σ is the sigmoid function;

Step A06, compute the response values of the memory cell c and the output gate o; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100006]
Step A07, compute the vector value c_t of the memory cell c, and compute the response value of the output gate o; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100007]
In the above formula, c_t = f_t ⊙ c_(t-1) + i_t ⊙ φ(W_cx·x_t + W_ch·h_(t-1) + b_c), where ⊙ denotes the element-wise multiplication operation and φ is the hyperbolic tangent function;

Step A08, compute the response value of the output gate o; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100008]
Step A09, compute the response value of the output gate o; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100009]
Step A10, compute the vector value of the output gate o; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100010]
In the above formula, o_t = σ(W_ox·x_t + W_oh·h_(t-1) + W_oc·c_(t-1) + b_o);

Step A11, compute the hidden-layer state value h_t at time t; the first partial sum and the second partial sum satisfy the following formula:
[Formula image PCTCN2020106988-appb-100011]
In the above formula, h_t = o_t ⊙ φ(c_t).
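Steps A01 through A11 evaluate the gate equations of a peephole LSTM cell. As a reference model (not the claimed partial-sum hardware schedule), those equations can be evaluated directly; in the sketch below the peephole terms W_ic·c_(t-1), W_fc·c_(t-1), and W_oc·c_(t-1) are taken element-wise, a common convention that the claim text does not fix, and all array shapes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One time step per steps A01-A11; W holds W['ix'], W['ih'], W['ic'], ...
    with the peephole weights W['ic'], W['fc'], W['oc'] as vectors (assumed)."""
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + W["ic"] * c_prev + b["i"])
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + W["fc"] * c_prev + b["f"])
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + b["c"])
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + W["oc"] * c_prev + b["o"])
    h_t = o_t * np.tanh(c_t)     # h_t = o_t ⊙ φ(c_t)
    return h_t, c_t
```

The composite-granularity schedule of claim 6 splits each of these expressions into the two partial sums S_0i and S_1i so that the matrix-vector and vector-vector portions can proceed on separate modules.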
PCT/CN2020/106988 2020-06-04 2020-08-05 Composite-granularity, near-storage and approximation-based acceleration structure and method for long short-term memory network Ceased WO2021243839A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010501118.8A CN111652361B (en) 2020-06-04 2020-06-04 Composite granularity near storage approximate acceleration structure system and method for long-short-term memory network
CN202010501118.8 2020-06-04

Publications (1)

Publication Number Publication Date
WO2021243839A1 true WO2021243839A1 (en) 2021-12-09




Also Published As

Publication number Publication date
CN111652361B (en) 2023-09-26
CN111652361A (en) 2020-09-11


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20939329

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20939329

Country of ref document: EP

Kind code of ref document: A1
