CN112799599B - A data storage method, computing core, chip and electronic device - Google Patents
- Publication number
- CN112799599B (grant of application CN202110172560.5A, published as CN202110172560A)
- Authority
- CN
- China
- Prior art keywords
- data
- processing
- convolution
- storage unit
- address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present disclosure relates to a data storage method, a computing core, a chip, and an electronic device. The method is applied to a computing core of a processor; each computing core includes a processing component and a storage component, and the storage component includes two or more storage units. The processing component writes the weight data and the processing data of each convolutional layer into the storage units, in the order in which the convolutional layers are processed during a multi-layer convolution operation, and each storage unit receives and stores the weight data and the processing data. Data of adjacent layers can be dynamically overwritten while the computation proceeds, which enlarges the remaining contiguous dynamically available space, frees up more room for subsequent computations, improves the spatio-temporal efficiency of the computing core, and thus improves chip performance.
Description
Technical Field
The present disclosure relates to the field of neuromorphic engineering, and in particular to a data storage method, a computing core, a chip, and an electronic device.
Background
The convolution operation is a common operation in neural networks. During mapping, a many-core neuromorphic chip stores and transfers the data on which convolution operations are to be performed. This data can be divided into static data and dynamic data: static data is not erased, while dynamic data can be erased repeatedly. The memory space occupied by static data on the chip is essentially fixed and must not be overwritten, whereas the memory space occupied by dynamic data can change and be overwritten as the computation schedule requires. Static data may include the weights of the convolution kernels of a neural network; dynamic data may include the processing data.
In the process of storing and transferring data, data in overlapping regions is easily overwritten or destroyed. Even if the data is first moved to stagger the overlapping regions before being sent, the number of compute clock cycles increases and the computational efficiency of the chip drops. The chip's memory accesses are time-consuming, occupy a large amount of space, and reuse dynamic space inefficiently.
Summary of the Invention
In view of this, the present disclosure proposes a data storage method, a computing core, a chip, and an electronic device.
According to one aspect of the present disclosure, a data storage method is provided, applied to a computing core of a processor. The processor includes a plurality of computing cores; each computing core includes a processing component and a storage component, and the storage component includes two or more storage units.
The method includes: the processing component writes the weight data and the processing data of each convolutional layer into the storage units, in the order in which the convolutional layers are processed during the multi-layer convolution operation; and each storage unit receives and stores the weight data and the processing data;
where, among the storage units, the processing data and the weight data of the same convolutional layer are stored in different storage units, and within one storage unit the weight data space holding weight data and the processing data space holding processing data are laid out one after another along a first address direction;
where the weight data spaces holding the weight data of the respective convolutional layers are laid out along the first address direction in the order in which the convolutional layers are processed, and the processing data spaces holding the processing data of the respective convolutional layers are laid out along the first address direction in the same order;
and where, within the weight data space holding the weight data of a given convolutional layer and within the processing data space holding the processing data of that layer, the weight data and the processing data are each laid out along a second address direction, the first address direction being opposite to the second address direction.
In a possible implementation, the first address direction runs from high addresses to low addresses, and the second address direction runs from low addresses to high addresses.
In a possible implementation, the method further includes:
the processing component sends the operation result data of each convolutional layer of the multi-layer convolution operation into the storage units for storage; and
each storage unit receives and stores the operation result data, where the operation result data of any convolutional layer is stored in the same storage unit as the processing data of that layer and shares storage space with the processing data of the next convolutional layer.
In a possible implementation, the method further includes:
the processing component writes first data received from outside the computing core into a storage unit, and reads second data from a storage unit to send outside the computing core;
where, across multiple write operations, the start addresses of successive write operations are arranged along the first address direction, and within each write operation the first data is written along the second address direction; and
across multiple read operations, the start addresses of successive read operations are arranged along the first address direction, and within each read operation the second data is read along the second address direction.
In a possible implementation, in the storage units, the storage order of the processing data and the weight data of each convolutional layer matches a convolution operation that proceeds along the depth direction first, then the horizontal direction, then the vertical direction.
In a possible implementation, each layer's convolution operation uses a plurality of convolution kernel groups, and each convolution kernel group includes a plurality of convolution kernels;
in the storage units, the weight data of the kernel groups of each layer's convolution operation is stored group by group, and within each group the weight data is stored in the order of kernel number, kernel depth direction, horizontal direction, and vertical direction.
According to another aspect of the present disclosure, a computing core is provided. The computing core includes a processing component and a storage component.
The processing component writes the weight data and the processing data of each convolutional layer into the storage units, in the order in which the convolutional layers are processed during the multi-layer convolution operation.
The storage component includes two or more storage units, each of which receives and stores the weight data and the processing data;
where, among the storage units, the processing data and the weight data of the same convolutional layer are stored in different storage units, and within one storage unit the weight data space holding weight data and the processing data space holding processing data are laid out one after another along a first address direction;
where the weight data spaces holding the weight data of the respective convolutional layers are laid out along the first address direction in the order in which the convolutional layers are processed, and the processing data spaces holding the processing data of the respective convolutional layers are laid out along the first address direction in the same order;
and where, within the weight data space holding the weight data of a given convolutional layer and within the processing data space holding the processing data of that layer, the weight data and the processing data are each laid out along a second address direction, the first address direction being opposite to the second address direction.
For the above computing core, in a possible implementation, the first address direction runs from high addresses to low addresses, and the second address direction runs from low addresses to high addresses.
For the above computing core, in a possible implementation, the processing component is further configured to send the operation result data of each convolutional layer of the multi-layer convolution operation into the storage units for storage; and
each storage unit receives and stores the operation result data, where the operation result data of any convolutional layer is stored in the same storage unit as the processing data of that layer and shares storage space with the processing data of the next convolutional layer.
For the above computing core, in a possible implementation, the processing component is further configured to write first data received from outside the computing core into a storage unit, and to read second data from a storage unit to send outside the computing core;
where, across multiple write operations, the start addresses of successive write operations are arranged along the first address direction, and within each write operation the first data is written along the second address direction; and
across multiple read operations, the start addresses of successive read operations are arranged along the first address direction, and within each read operation the second data is read along the second address direction.
For the above computing core, in a possible implementation, in the storage units, the storage order of the processing data and the weight data of each convolutional layer matches a convolution operation that proceeds along the depth direction first, then the horizontal direction, then the vertical direction.
For the above computing core, in a possible implementation, each layer's convolution operation uses a plurality of convolution kernel groups, and each convolution kernel group includes a plurality of convolution kernels;
in the storage units, the weight data of the kernel groups of each layer's convolution operation is stored group by group, and within each group the weight data is stored in the order of kernel number, kernel depth direction, horizontal direction, and vertical direction.
According to another aspect of the present disclosure, an artificial intelligence chip is provided. The chip includes a plurality of the computing cores.
According to another aspect of the present disclosure, an electronic device is provided, including one or more of the artificial intelligence chips.
According to embodiments of the present disclosure, storing the weight data and the processing data along the first address direction in the weight data space and the processing data space of each storage unit, in the order in which the convolutional layers are processed, allows the data of adjacent convolutional layers to be dynamically overwritten while the computation proceeds, which enlarges the remaining contiguous dynamically available space and frees up more room for subsequent computations. Laying out the weight data and the processing data of a given convolutional layer along the second address direction closely matches a convolution operation that proceeds depth-first, then horizontally, then vertically, improving the fit between the storage scheme and the convolution process. When the addresses of data to be sent and data to be received overlap in a storage unit, arranging the start addresses of successive read and write operations along the first address direction prevents the data in the overlapping region from being destroyed or overwritten by the processing component's multiple concurrent read and write operations, improving the spatio-temporal efficiency of the computing core and thus chip performance.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
FIG. 1 is a schematic diagram of a computing core according to an embodiment of the present disclosure.
FIG. 2a is a schematic diagram of data storage in the related art.
FIG. 2b is a schematic diagram of data storage according to an embodiment of the present disclosure.
FIG. 3 is a flowchart according to an embodiment of the present disclosure.
FIG. 4a is a schematic diagram of data transfer in a storage unit in the related art.
FIG. 4b is a schematic diagram of data transfer according to an embodiment of the present disclosure.
FIG. 5 is a flowchart according to an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of the storage order of processing data according to an embodiment of the present disclosure.
FIG. 7 is a schematic diagram of the storage order of weight data according to an embodiment of the present disclosure.
FIG. 8 is a schematic diagram of the storage-order principle of the convolution operation according to an embodiment of the present disclosure.
FIG. 9 is a block diagram of an electronic apparatus according to an embodiment of the present disclosure.
FIG. 10 is a block diagram of an electronic apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third", "fourth", and the like in the claims, description, and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. The terms "include" and "comprise" used in the description and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the description and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this description and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrases "if it is determined" or "if the [described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined", "in response to determining", "once the [described condition or event] is detected", or "in response to detecting the [described condition or event]".
It should also be understood that a tensor, as a container for data, can be regarded as a multi-dimensional array. Image data and other perceptual data (for example, audio and video) can be represented as multi-dimensional tensors and stored in memory in binary form. To facilitate understanding of the technical solutions of the present disclosure, image data is used below as the example of processing data. The image data used in this description is only for the purpose of describing particular embodiments and is not intended to limit the present disclosure; the present disclosure applies to any processing data, including video, audio, and images, that can be stored in memory in binary form.
FIG. 1 is a schematic diagram of a computing core according to an embodiment of the present disclosure. The data storage method according to the embodiments of the present disclosure is applied to a computing core of a processor, and the processor includes a plurality of computing cores.
As shown in FIG. 1, each computing core includes a processing component and a storage component. The processing component may include a dendrite unit, an axon unit, a soma unit, and a routing unit. The storage component may include two or more storage units, which can store processing data and weight data; the space in a storage unit that stores processing data may be called the processing data space, and the space that stores weight data may be called the weight data space.
In a possible implementation, the processor may be a brain-inspired computing chip, that is, a chip that takes the brain's processing model as a reference and improves processing efficiency while reducing power consumption by simulating the way neurons in the brain transmit and process information. The processor may include a plurality of computing cores, which can process different tasks independently, for example convolution tasks, pooling tasks, or fully connected tasks; the cores can also process the same task in parallel, that is, each computing core can process a different part of the same assigned task, for example the convolution operations of some layers of a multi-layer convolutional neural network. It should be noted that the present disclosure places no restriction on the number of computing cores in the chip or on the tasks the computing cores run.
A processing component and a storage component may be provided within each computing core. The processing component may include a dendrite unit, an axon unit, a soma unit, and a routing unit, and can emulate the way the brain's neurons process information: the dendrite unit receives signals, the axon unit sends spike signals, the soma unit performs integrated transformation of signals, and the routing unit transfers information to and from other computing cores. The processing component within a computing core can perform read and write accesses to the multiple storage units of the storage component to exchange data with the storage component within the core, and its units can each undertake their own data processing and/or data transfer tasks to obtain data processing results or to communicate with other computing cores. The present disclosure places no restriction on the application field of the processing component.
In a possible implementation, the storage component may include two or more storage units, and a storage unit may be a static random-access memory (SRAM). For example, a storage unit may include an SRAM with a read/write width of 32 B and a capacity of 64 KB. The present disclosure places no restriction on the read/write width or the capacity of a storage unit.
In a possible implementation, each storage unit includes a processing data space and a weight data space. The processing data space can store dynamic data, that is, data that may change at runtime, data that must be input or output at runtime, and data that is changed by associated operations. The weight data space can store static data, that is, data used for control or reference while a program runs. Static data need not change as the program runs, and so may remain unchanged for a long time. For example, for a convolutional neural network, the static data may be the weight data of the convolution operations, and the dynamic data may be the processing data, for example the input data of a convolutional layer (which may, for instance, be convolutional-layer input data that has passed through a shortcut connection); dynamic data can be erased and overwritten as the processing data changes. For example, during the layer-by-layer iterative computation of a convolutional neural network, the processing data is modified continually as the schedule requires, and new processing data overwrites old processing data.
With the computing core according to the embodiments of the present disclosure, the processing component and the storage component can be placed inside the computing core, so that the storage component directly serves the processing component's read and write accesses and no storage outside the computing core needs to be read or written. This optimizes memory read/write speed and suits the processing components of a many-core architecture.
In a possible implementation, the method can be used to store and transfer data during a multi-layer convolution operation, where the data of the multi-layer convolution operation includes the processing data and the weight data of each layer's convolution operation.
For example, when a neural network algorithm is applied to recognize targets in an image, a multi-layer convolution operation is performed on the input image processing data. The basic data structure of a neural network is the layer. Neural networks of all kinds, including deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN), and deep residual networks (ResNet), are organic combinations of multiple layers. A layer can be understood as a data processing module; different types of data processing require different layers, and different layers have different layer-specific states, namely the layers' weights. Under the action of the weights belonging to the different layers, one or more inputs can be transformed into one or more outputs.
If a neural network is comparatively small, a single computing core can supply the resources required to process all of its layers, and the processing data and weight data of the network can be stored within one computing core.
If the neural network is very large, with many layers each requiring a large amount of computation, a single computing core cannot supply the required resources. In that case, multiple computing cores must cooperate to compute the network: the network can be split into several parts that are sent to the corresponding computing cores assigned by the processor. For example, suppose the neural network includes seven convolutional layers L1 to L7; the processor may assign the processing data and weight data of layers L1 and L2 to computing core A, those of layers L3 and L4 to computing core B, and those of layers L5 to L7 to computing core C. It should be noted that the present disclosure places no restriction on the type of neural network a computing core runs or on how a neural network is split and assigned to different computing cores.
In the related art, when the processor stores the data of a multi-layer convolution operation into the storage units, it stores the processing data (dynamic data) and the weight data (static data) in separate storage units; that is, it stores the weight data in one storage unit and the processing data in another.
FIG. 2a is a schematic diagram of data storage in the related art. As shown in FIG. 2a, there are two storage units, MEM0 and MEM1, each with a storage capacity of 64 KB and a read/write width of 32 B. The weight data W of the multi-layer convolution operation is stored in MEM0, and the processing data X is stored in MEM1. L5, L6, and L7 denote the corresponding convolutional layers of the multi-layer convolutional neural network; the layers' convolution operations are executed in the order L5, L6, L7, and so on.
The weight data of layers L5, L6, and L7 is placed in MEM0, which stores, one after another along the second address direction, 8 KB of L5 weight data W, 18 KB of L6 weight data W, and 8 KB of L7 weight data W, leaving 30 KB of contiguous dynamically available space. Herein the second address direction may run from low addresses to high addresses, and the contiguous dynamically available space is blank space in which no data is stored.
The processing data of layers L5, L6, and L7 is placed in the other storage unit, MEM1, which stores, one after another along the second address direction, 28 KB of L5 processing data X, 15 KB of L6 processing data X, and 15 KB of L7 processing data X, leaving 6 KB of contiguous dynamically available space.
When the convolution operations of the whole node unit formed by layers L5, L6, and L7 finish, the results must be sent to other computing cores while the data sent by the computing core of the previous node unit is received. Because the data is stored in forward order along the second address direction, the computing core cannot dynamically overwrite the data in the space occupied by layer L6 while it computes L6 and outputs the L6 results to layer L7.
Therefore, laying out the processing data and the weight data in their respective storage units in forward order along the second address direction, following the convolutional-layer processing order, leads to poor reuse of dynamic data and little remaining contiguous space: the data of a layer cannot be dynamically overwritten while it is being computed, which limits the storage space for subsequent computations, and the storage process is accompanied by continual data movement that increases computation time.
To address the problems of data storage in the related art shown in FIG. 2a, FIG. 3 shows a flowchart of a data storage method according to an embodiment of the present disclosure. As shown in FIG. 3, the method may include the following steps:
Step S1: the processing component writes the weight data and the processing data of each convolutional layer into the storage units, in the order in which the convolutional layers are processed during the multi-layer convolution operation.
Step S2: each storage unit receives and stores the weight data and the processing data.
In a possible implementation, within the same storage unit, the weight data space holding weight data and the processing data space holding processing data are laid out one after another along the first address direction. Herein the first address direction may run from high addresses to low addresses, opposite to the second address direction.
As an example, FIG. 2b is a schematic diagram of data storage according to an embodiment of the present disclosure. As shown in FIG. 2b, assume the storage component includes two storage units, MEM0 and MEM1, each of which may use an SRAM with a read/write width of 32 B and a capacity of 64 KB. L5, L6, and L7 denote the corresponding convolutional layers of the multi-layer convolutional neural network.
As shown in FIG. 2b, MEM0 contains, one after another along the first address direction, a weight data space storing 8 KB of static data W, a processing data space storing 30 KB (15 KB + 15 KB) of dynamic data X, and 40 KB of remaining contiguous dynamically available space.
As shown in FIG. 2b, MEM1 contains, one after another along the first address direction, a weight data space storing 26 KB (18 KB + 8 KB) of static data W, a processing data space storing 28 KB of dynamic data X, and 10 KB of remaining contiguous dynamically available space.
Each storage unit includes a processing data space that can store dynamic data and a weight data space that can store static data, and the weight data space and the processing data space are laid out one after another along the first address direction, which increases the remaining contiguous dynamically available space of the storage unit.
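As a rough illustrative sketch only (the allocator below and its names are invented here and are not part of the present disclosure), the top-down layout of FIG. 2b can be modeled in a few lines of Python: segments are packed from the high-address end downward (the first address direction), while the bytes inside each segment still run from its low address upward (the second address direction).

```python
KB = 1024
UNIT_SIZE = 64 * KB  # one storage unit: a 64 KB SRAM with a 32 B read/write width

def pack_from_top(segments):
    """Pack (name, size) segments from the high-address end downward (the
    first address direction); the bytes inside each segment still run from
    its low address upward (the second address direction).  Returns the
    layout and the size of the remaining contiguous free space."""
    layout, top = {}, UNIT_SIZE
    for name, size in segments:
        layout[name] = (top - size, top)  # [start, end) of this segment
        top -= size
    return layout, top

# MEM0 of FIG. 2b: L5 weights, then L6 and L7 processing data.
mem0, free0 = pack_from_top([("W_L5", 8 * KB), ("X_L6", 15 * KB), ("X_L7", 15 * KB)])
# MEM1 of FIG. 2b: L6 and L7 weights, then L5 processing data.
mem1, free1 = pack_from_top([("W_L6", 18 * KB), ("W_L7", 8 * KB), ("X_L5", 28 * KB)])

assert free1 == 10 * KB  # matches the 10 KB quoted for MEM1
# free0 comes out as 26 KB under plain packing; the 40 KB quoted for MEM0
# additionally assumes the dynamic overlay described below, in which the
# L6 results reuse the space of the already-consumed L6 processing data.
```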
In a possible implementation, the processing data and the weight data of the same convolutional layer are stored in different storage units, while the processing data and weight data of different convolutional layers may be stored in the same storage unit.
For example, as shown in FIG. 2b, the processing component sends the weight data W of layer L5 into the weight data space of storage unit MEM0 and the processing data X of layer L5 into the processing data space of storage unit MEM1; sends the weight data W of layer L6 into the weight data space of MEM1 and the processing data X of layer L6 into the processing data space of MEM0; and sends the weight data W of layer L7 into the weight data space of MEM1 and the processing data X of layer L7 into the processing data space of MEM0.
Because the processing data and the weight data of the same convolutional layer are stored in different storage units, the processing component can access multiple storage units in parallel to read the processing data and the weight data, improving the computational efficiency of the computing core.
In a possible implementation, the weight data spaces holding the weight data of the respective convolutional layers are laid out along the first address direction in the order in which the layers are processed, and the processing data spaces holding the processing data of the respective layers are laid out along the first address direction in the same order.
For example, as shown in FIG. 2b, the L5 weight data W and the L6 and L7 processing data X reside in MEM0, which may store, one after another along the first address direction, 8 KB of L5 weight data W, 15 KB of L6 processing data X, and 15 KB of L7 processing data X, with 40 KB of remaining contiguous dynamically available space.
The L6 and L7 weight data W and the L5 processing data X reside in MEM1, which may store, one after another along the first address direction, 18 KB of L6 weight data W, 8 KB of L7 weight data W, and 28 KB of L5 processing data X, with 10 KB of remaining contiguous dynamically available space.
Compared with the 36 KB of remaining contiguous dynamically available space of MEM0 and MEM1 in the related art of FIG. 2a (MEM0: 30 KB + MEM1: 6 KB), MEM0 and MEM1 in FIG. 2b retain 50 KB (MEM0: 40 KB + MEM1: 10 KB); the method used in the embodiments of the present disclosure thus increases the contiguous dynamically available space of the storage units.
In a possible implementation, the processing component sends the operation result data of each convolutional layer of the multi-layer convolution operation into the storage units for storage, and each storage unit receives and stores the operation result data. The operation result data of any convolutional layer is stored in the same storage unit as the processing data of that layer and shares storage space with the processing data of the next convolutional layer.
For example, in contrast to the related art shown in FIG. 2a, where the L6 processing data X cannot be dynamically overwritten during the computation and the space it occupies cannot be reclaimed while the L6 results are output as the L7 processing data X: as shown in FIG. 2b, the processing component can dynamically overwrite the L6 processing data X in MEM0 while computing. The result data of the L6 convolution can serve as the L7 processing data X and can directly overwrite the L6 processing data X that has already been consumed; that is, the result data of the L6 convolution can share storage space with the processing data X of the adjacent layer L7, reusing one region of the storage unit and freeing enough space for subsequent computations.
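This overlay can be sketched in simplified form. The following is a minimal illustration under invented assumptions (one-dimensional data, a helper name made up here); it shows why a result written at index i never clobbers an input that is still needed, since result i only reads inputs from index i*stride onward and i <= i*stride for any stride of at least 1.

```python
def conv1d_in_place(buf, n_in, kernel, stride):
    """1-D convolution of buf[0:n_in] with `kernel`, writing each result over
    the front of the same buffer.  Result i is stored at index i but reads
    inputs starting at index i*stride; the reads of iteration i happen
    before its write, so the write position never overtakes unread input."""
    k = len(kernel)
    n_out = (n_in - k) // stride + 1
    for i in range(n_out):
        base = i * stride
        buf[i] = sum(buf[base + j] * kernel[j] for j in range(k))
    return n_out  # buf[0:n_out] now holds the results (the next layer's input)

data = [1, 2, 3, 4, 5, 6, 7, 8]          # stand-in for the L6 processing data X
n = conv1d_in_place(data, 8, [1, 1], 2)  # toy 1x2 kernel, stride 2
assert data[:n] == [3, 7, 11, 15]        # L6 results, now serving as L7's input
```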
Therefore, by providing each storage unit with a weight data space and a processing data space, the method can, while ensuring that the processing data and the weight data of the same layer are sent to different storage units for storage, store the weight data and the processing data along the first address direction in the weight data space and the processing data space respectively, in the convolutional-layer processing order, and can dynamically overwrite the data of adjacent layers while computing. This enlarges the remaining contiguous dynamically available space, frees up more room for subsequent computations, and improves the spatio-temporal efficiency of the computing core and hence chip performance.
In the related art, when a computing core performs the multi-layer convolution operations on the processing data and stores the results, once one layer's convolution is finished and the next layer's computation begins, the core must send the data it has just finished computing to other computing cores for storage while receiving the finished data of the previous layer's convolution. The received and sent data often occupy the same piece of storage space, possibly with overlapping regions.
FIG. 4a is a schematic diagram of data transfer in a storage unit in the related art. As shown in FIG. 4a, storage unit MEM1 has 28 KB of data to receive and 28 KB of data to send, and the addresses of the data to be sent and the data to be received overlap over a 14 KB region, namely the data space at MEM1 addresses 0x5400 to 0x6200.
Suppose the data to be sent and the data to be received are transferred in forward order through the storage unit, that is, across the processing component's multiple read and write operations on the storage unit, the start addresses of successive operations are arranged along the second address direction. Then the 12 KB of first data received by MEM1 in the first write (occupying the data space at MEM1 addresses 0x5400 to 0x6000) destroys and overwrites the second data awaiting the second send (in the data space at MEM1 addresses 0x5200 to 0x5E00); the destroyed portion corresponds to MEM1 addresses 0x5400 to 0x5E00. The second data subsequently sent is therefore corrupt.
Here, the first data may be the data a storage unit is to receive during write operations on it; the second data may be the data the storage unit is to send during read operations on it.
If instead the data in the region to be sent is first moved upward by 14 KB so that the two regions no longer overlap, and the data is then sent in order along the second address direction, the added data movement increases the transfer latency and the compute clock cycles, burdening the chip and reducing its operating efficiency.
In a possible implementation, the processing component may write first data received from outside the computing core into a storage unit and read second data from a storage unit to send outside the computing core.
Across multiple write operations, the start addresses of successive write operations are arranged along the first address direction, and within each write operation the first data is written along the second address direction.
Across multiple read operations, the start addresses of successive read operations are arranged along the first address direction, and within each read operation the second data is read along the second address direction.
As an example, FIG. 4b is a schematic diagram of data transfer according to an embodiment of the present disclosure. As shown in FIG. 4b, the first 12 KB of second data that MEM1 is to send starts at address 0x5600, while the first 12 KB of first data to be received starts at address 0x6400. The second 12 KB of second data to be sent starts at address 0x4A00, while the second 12 KB of first data to be received starts at address 0x5800. The third 4 KB of second data to be sent starts at address 0x4600, while the third 4 KB of first data to be received starts at address 0x5400.
As shown in FIG. 4b, the start addresses of the three read operations on MEM1 are, in order, 0x5600, 0x4A00, and 0x4600, and the start addresses of the three write operations are, in order, 0x6400, 0x5800, and 0x5400; that is, the start addresses of successive read and write operations are arranged along the first address direction. Within each read or write operation, the second or first data can be read or written from the start address along the second address direction. It should be understood that the processing component can perform many read and write operations on a storage unit; the present disclosure places no restriction on the data size of each read or write operation or on the number of such operations.
Within the storage unit, the space of data that has been sent away can be reused at the next moment. For example, receiving the second 12 KB of first data requires MEM1's data space at addresses 0x5800 to 0x6400; since in the previous, that is the first, operation the processing component already read the second data out of MEM1 and released the data space at addresses 0x5600 to 0x6200, the overlapping space at addresses 0x5800 to 0x6200 is not destroyed or overwritten. Likewise, receiving the third 4 KB of first data requires MEM1's data space at addresses 0x5400 to 0x5800; because the first read released the space at addresses 0x5600 to 0x6200 and the second read released the space at addresses 0x4A00 to 0x5600, the first two reads together released the space at addresses 0x4A00 to 0x6200, so the data space at addresses 0x5400 to 0x5800 needed for the third receive is not destroyed or overwritten.
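The scheme of FIG. 4b can be checked with a toy simulation. The sketch below is illustrative only and not part of the present disclosure: the interleaving is simplified to "send chunk k, then receive chunk k", and it assumes, as an inference from the figure, that the hex values are 4-byte-word addresses (which makes the 0x5400 to 0x6200 overlap come out to the quoted 14 KB).

```python
KB, WORD = 1024, 4            # assume the hex addresses index 4-byte words

def words(size_kb):
    return size_kb * KB // WORD

# Chunk start addresses descend (first address direction); the words inside
# each chunk are moved in ascending order (second address direction).
reads  = [(0x5600, words(12)), (0x4A00, words(12)), (0x4600, words(4))]
writes = [(0x6400, words(12)), (0x5800, words(12)), (0x5400, words(4))]

mem = {a: ("old", a) for a in range(0x4600, 0x7000)}  # tag every word
sent = []
for (rs, rn), (ws, wn) in zip(reads, writes):
    sent += [mem[rs + i] for i in range(rn)]          # send chunk k first ...
    for i in range(wn):                               # ... then receive chunk k
        mem[ws + i] = ("new", ws + i)

# Every word sent out is original data: each part of the overlapping region
# [0x5400, 0x6200) was read out before any received word landed on it.
assert all(tag == "old" for tag, _ in sent)
```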
Hence, in the data transfer process shown in FIG. 4b, even when the addresses of the data to be sent and the data to be received overlap in the storage unit, the data in the overlapping region is not destroyed or overwritten by the processing component's multiple concurrent read and write operations, and no extra time is spent moving data around to avoid such destruction. This eliminates routing transfer delay, improves the spatio-temporal efficiency of the chip, and improves chip performance.
In a possible implementation, within the weight data space holding the weight data of a given convolutional layer and within the processing data space holding the processing data of that layer, the weight data and the processing data are each laid out one after another along the second address direction, the first address direction being opposite to the second address direction.
Some examples of ordering the weight data and the processing data are given below. For each convolutional layer, the layer's ordered weight data and processing data can be laid out in that order (for example, the order of sequence numbers 0, 1, 2, and so on) along the second address direction in the layer's weight data space and along the second address direction in the layer's processing data space.
图5示出根据本公开一实施例的流程图。如图5所示,针对所述多层卷积运算过程中每一层卷积运算过程中数据的存储方法,可以包括以下步骤:FIG. 5 shows a flowchart according to an embodiment of the present disclosure. As shown in Figure 5, for the storage method of data in each layer of convolution operation process in the multi-layer convolution operation process, the following steps may be included:
Step S31: the processing component determines the storage order of the processing data and the weight data within each convolution layer.
In a possible implementation, for the processing data of each layer in the multi-layer convolution operation, the processing component determines the storage order of the processing data within each convolution layer to be depth direction first, then horizontal direction, then vertical direction.
FIG. 6 shows a schematic diagram of a processing data storage order according to an embodiment of the present disclosure. Assuming that the bit width of the storage unit of the computing core is 32B, the processing data within the layer may be the input whole-image data (512 pixels) split into a series of 32 consecutive frames of 4×4-pixel images. As shown in FIG. 6, the cuboid on the left of FIG. 6 can represent the processing data within the layer; there are 32 slices along the depth direction (z-axis direction), and each slice corresponds to one frame of a 4×4-pixel image.
The processing component can determine the storage order of the processing data within each convolution layer, that is, determine the order in which each small cube in the left cuboid of FIG. 6 is sent into the storage component. The processing component can number each small cube in the left cuboid of FIG. 6 in the order of depth direction first (z-axis direction), then horizontal direction (x-axis direction), then vertical direction (y-axis direction). The correspondence between the marked serial number and the coordinates (x, y, z) is:

serial(x, y, z) = z + M × (x + J × y)
where M is the depth of the processing data, that is, the dimension in the z direction, and J is the horizontal pixel count of each frame (J = 4 in this example).
Therefore, the serial numbers corresponding to the first frame are: [0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480]; the serial numbers corresponding to the second frame are: [1 33 65 97 129 161 193 225 257 289 321 353 385 417 449 481]; and so on, the serial numbers corresponding to the 32nd frame are: [31 63 95 127 159 191 223 255 287 319 351 383 415 447 479 511].
The serial-number order of the small cubes in the left cuboid of FIG. 6 represents the processing data storage order determined by the processing component.
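A small sketch of this depth-first numbering follows; the formula and dimensions (M = 32 depth, 4×4 frames) mirror the example above, while the function names are illustrative assumptions:

```python
# Depth-first (z), then horizontal (x), then vertical (y) numbering
# of processing data with X x Y x M = 4 x 4 x 32 = 512 pixels.
M, X, Y = 32, 4, 4

def serial(x, y, z):
    # z varies fastest, then x, then y
    return z + M * (x + X * y)

def frame(z):
    # serial numbers of one 4x4 frame at depth z
    return [serial(x, y, z) for y in range(Y) for x in range(X)]

print(frame(0))   # first frame:  [0, 32, 64, ..., 480]
print(frame(1))   # second frame: [1, 33, 65, ..., 481]
print(frame(31))  # 32nd frame:   [31, 63, 95, ..., 511]
```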
In a possible implementation, each layer of the convolution operation uses multiple convolution kernel groups, and each convolution kernel group includes multiple convolution kernels. In the storage unit, the weight data of the convolution kernel groups of each layer are stored in sequence, and the weight data of each convolution kernel group are stored in the order of kernel number, then kernel depth direction, then horizontal direction, then vertical direction.
For example, for the weight data of each layer in the multi-layer convolution operation, the processing component may first number and group the convolution kernels, and then determine a group-by-group storage order according to the kernel numbering order within each group, then the kernel depth direction, then the kernel horizontal direction, then the kernel vertical direction.
The weight data includes multiple groups of convolution kernels, each group including multiple convolution kernels. The processing component may group the convolution kernels in the weight data according to the depth of the processing data, and the number of convolution kernels in each group may equal the depth of the processing data.
FIG. 7 shows a schematic diagram of a weight data storage order according to an embodiment of the present disclosure. As shown in FIG. 7, each cuboid on the left of the figure represents one convolution kernel; the 64 convolution kernels in the figure can first be numbered W0, W1, ..., W63. Corresponding to the processing data depth shown in FIG. 6 (M = 32), the convolution kernels can be grouped in 32s, 32 kernels per group, into 2 groups (N = 2): W0, W1, ..., W31 form the first group, and W32, W33, ..., W63 form the second group. The number of groups N can represent the depth of the weight data.
The processing component can determine the storage order of the weight data within each convolution layer, that is, determine the order in which each small cube in each cuboid on the left of FIG. 7 is sent into the storage component.
The processing component may divide the cuboids on the left of FIG. 7 into N groups (N = 2) and determine a storage order in which the weight data is stored group by group according to the grouping order.
For the storage order within each group of convolution kernels, the small cubes in each cuboid on the left of FIG. 7 can be numbered first in kernel numbering order, then along the kernel depth direction (z-axis direction), then the kernel horizontal direction (x-axis direction), then the kernel vertical direction (y-axis direction). The kernel numbering direction corresponds to the depth direction of the processing data.
The correspondence between the marked serial number and the coordinates (x, y, z) is:

serial(x, y, m, n) = (n − 1) × kx × ky × M + m + M × (x + kx × y)

where n = 1, 2, ..., N is the group number, m = 0, 1, ..., M − 1 is the kernel number within the group, and kx × ky is the kernel window size (kx = ky = 4 in this example).
From this correspondence between serial numbers and coordinates (x, y, z), for the weight data of group N = 1, the serial numbers corresponding to convolution kernel W0 are: [0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480]; the serial numbers corresponding to convolution kernel W1 are: [1 33 65 97 129 161 193 225 257 289 321 353 385 417 449 481]; and so on, the serial numbers of convolution kernel W31 are: [31 63 95 127 159 191 223 255 287 319 351 383 415 447 479 511].
Similarly, for the weight data of group N = 2, the serial numbers corresponding to convolution kernel W32 are: [512 544 576 608 640 672 704 736 768 800 832 864 896 928 960 992]; the serial numbers corresponding to convolution kernel W33 are: [513 545 577 609 641 673 705 737 769 801 833 865 897 929 961 993]; and so on, the serial numbers corresponding to convolution kernel W63 are: [543 575 607 639 671 703 735 767 799 831 863 895 927 959 991 1023].
The serial-number order of the small cubes in the cuboids on the left of FIG. 7 can represent the weight data storage order determined by the processing component.
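The grouped kernel numbering can be sketched the same way; the 4×4 kernel window and group sizes below are taken from the example above, while the helper functions themselves are illustrative:

```python
# Group-by-group weight numbering: kernel number first, then kernel
# x, then kernel y; each group holds M kernels, N groups in total.
M, N, KX, KY = 32, 2, 4, 4

def w_serial(n, m, x, y):
    # n: group index (0-based), m: kernel number within the group
    return n * (M * KX * KY) + m + M * (x + KX * y)

def kernel(n, m):
    # serial numbers of one kernel's 4x4 window positions
    return [w_serial(n, m, x, y) for y in range(KY) for x in range(KX)]

print(kernel(0, 0))   # W0:  [0, 32, 64, ..., 480]
print(kernel(0, 31))  # W31: [31, 63, 95, ..., 511]
print(kernel(1, 0))   # W32: [512, 544, ..., 992]
```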
Step S32: the processing component sends the processing data and the weight data into the storage units according to the storage orders of the processing data and the weight data.
In a possible implementation, the processing component sends the processing data into the processing data space of a storage unit in the storage component along the second address direction, according to the storage order of the processing data.
For example, as shown in FIG. 6, the serial numbers marked on the cubes in the figure may correspond to storage addresses of the storage unit, and the processing component may send the pixel value corresponding to the serial number marked on a cube into the storage unit by accessing the corresponding address. The processing component can send the pixel value of the cube with serial number 0 into the space at address 0x0000 by accessing address 0x0000 of the storage unit; it can send the pixel value of the cube with serial number 1 into the space at address 0x0001 by accessing address 0x0001; and so on, the processing component can send the pixel value of the cube with serial number 511 into the space at address 0x01FF by accessing address 0x01FF of the storage unit.
As shown in FIG. 6, the processing component sends the processing data into the storage unit in the order of the marked serial numbers: depth direction first, then horizontal direction, then vertical direction.
In a possible implementation, the processing component sends the weight data into the weight data space of a storage unit in the storage component along the second address direction, according to the storage order of the weight data.
For example, as shown in FIG. 7, the serial numbers marked on the cubes in the figure may correspond to storage addresses of the storage unit, and the processing component may send the weight data corresponding to the serial number marked on a cube into the storage unit by accessing the corresponding address. The processing component can send the weight of the cube with serial number 0 into the space at address 0x0000 by accessing address 0x0000 of the storage unit; it can send the weight of the cube with serial number 1 into the space at address 0x0001 by accessing address 0x0001; and so on, the processing component can send the weight of the cube with serial number 1023 into the space at address 0x03FF by accessing address 0x03FF of the storage unit.
As shown in FIG. 7, the processing component sends the weight data into the storage unit group by group, in the order of the marked serial numbers: kernel numbering order first, then kernel depth direction, then kernel horizontal direction, then kernel vertical direction.
Step S33: the storage unit receives and stores the processing data and the weight data.
The weight data may be stored first, and the processing data after it; the weight data may be stored in the weight data space of one storage unit, and the processing data may be stored in the processing data space of another storage unit.
In a possible implementation, in the storage unit, the storage order of the processing data and the weight data of each convolution layer matches a convolution operation that proceeds depth direction first, then horizontal, then vertical.
FIG. 8 shows a schematic diagram of the storage-order principle of the convolution operation according to an embodiment of the present disclosure. As shown in FIG. 8, assuming the processing data is image data, it can be represented by a three-dimensional tensor X[i, j, m], as follows:
X[i,j,m],i=1,2,…,I,j=1,2,…,J,m=1,2,…,MX[i,j,m], i=1,2,...,I, j=1,2,...,J, m=1,2,...,M
I indicates that the image data X[i, j, m] has I pixels in the vertical dimension, J indicates that it has J pixels in the horizontal dimension, and M indicates that it has M pixels in the depth dimension; the size of the processing data is I×J×M.
The image data of depth M may be M frames of a continuous image sequence converted from video, may be M equally sized sub-images split from one whole image, or may be image data with a channel depth of M; the present disclosure does not limit this.
For example, consider performing N groups of sliding-window convolutions with a fixed 3×3 window on M pieces of processing data. If the N groups of 3×3 sliding-window convolutions are performed on the images one by one, then processing each image requires one access to the storage unit to read the convolution kernels, so the storage unit must be read M times repeatedly, causing frequent accesses to the storage unit. Instead, a three-dimensional sliding-window convolution can be performed on the processing data along the depth direction: the convolution kernels (that is, the weight data) read in a single access to the storage unit can be used for sliding-window convolutions on all M input images at the same time, realizing parallel operation over the input images.
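A back-of-the-envelope sketch of the saving follows; M and the stand-in kernel value are illustrative assumptions rather than the patent's figures:

```python
# Counting weight reads from the storage unit under the two schedules.
M = 32
weight_reads = 0

def read_kernels():
    global weight_reads
    weight_reads += 1
    return "kernels"  # stand-in for the weight data

# Image-by-image: the kernels are re-read once per frame.
for _ in range(M):
    kernels = read_kernels()
print(weight_reads)  # 32 reads

# Depth-first 3D sliding window: a single read serves all M frames.
weight_reads = 0
kernels = read_kernels()
print(weight_reads)  # 1 read
```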
As shown in FIG. 8, assuming there are M images along the depth direction, C[kxky, m] represents the image data captured by the sliding window of the sliding convolution. The window may slide through the image data depth direction first, then horizontally, then vertically; the present application does not limit the sliding manner of the window.
The weight data corresponding to the image data C[kxky, m] captured by the sliding window is K[kxky, m, n]. In FIG. 8, the weight data K[kxky, m, n] has N groups, each group having M convolution kernels, corresponding to the processing data of depth M. The image data C[kxky, m] captured by the sliding window can be multiplied and accumulated with each group of convolution kernels K[kxky, m, n] (n = 1, 2, ..., N) in turn, following the group count N of the convolution weights.
Since performing a convolution on the image data X[i, j, m] amounts to performing multiply-accumulate operations on it, by the commutative laws of multiplication and addition, reordering the steps of the convolution leaves its result unchanged. Therefore, to reduce the number of accesses to the storage unit, the order of the convolution operation can be adjusted to proceed first along the depth direction M, then along the horizontal direction J, then the vertical direction I.
The operation process corresponds to the following formula:

Y[i, j, n] = Σ_ky Σ_kx Σ_(m=1..M) C[kxky, m] × K[kxky, m, n], n = 1, 2, ..., N
Here, C[kxky, m] represents the image data captured by the sliding window, that is, the input image data covered by the kernel window; K[kxky, m, n] represents the convolution weights; and kx×ky is the size of one convolution kernel. The convolution weights are divided into N groups along the depth direction M, with M convolution kernels per group. In this convolution process, the multiply-accumulate is first carried out along the depth direction M of the kernels, and then, according to the kernel size kx×ky, the sliding-window convolution over the remaining data is completed along the horizontal direction J and the vertical direction I respectively.
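As an illustrative sketch of the reordered operation (a direct loop-nest reading of the formula; the dimensions and the no-padding, stride-1 choice are assumptions, not taken from the patent):

```python
import numpy as np

# Depth-first sliding-window convolution: the multiply-accumulate runs
# over the depth M innermost, then the window slides along J and I.
I, J, M = 6, 6, 32        # input height, width, depth (assumed)
KX = KY = 3               # kernel window size
N = 2                     # number of kernel groups

X = np.random.rand(I, J, M)        # processing data
K = np.random.rand(KY, KX, M, N)   # weight data: N groups of M kernels

Y = np.zeros((I - KY + 1, J - KX + 1, N))
for n in range(N):
    for i in range(I - KY + 1):        # vertical direction last
        for j in range(J - KX + 1):    # horizontal direction next
            acc = 0.0
            for ky in range(KY):
                for kx in range(KX):
                    for m in range(M):  # depth direction first
                        acc += X[i + ky, j + kx, m] * K[ky, kx, m, n]
            Y[i, j, n] = acc
```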
A processing component that performs the convolution depth first, then horizontally, then vertically facilitates parallel processing of data and improves the computing efficiency of the chip. For the processing data and weight data of the convolution operation, the processing component can store the processing data in the order of depth direction first, then horizontal direction, then vertical direction, and store the weight data group by group in the order of kernel number first, then kernel depth direction, kernel horizontal direction, and kernel vertical direction. In this way, the processing data and weight data stored first can be read first to participate in the operation, closely matching the depth-first, then horizontal, then vertical convolution process, improving the fit between the storage process and the convolution process, and improving computing efficiency.
The present disclosure also provides a computing core. FIG. 1 shows an example of the computing core, which includes a processing component and a storage component.
In a possible implementation, the processing component writes the weight data and processing data of each convolution layer into the storage units in turn for storage, according to the processing order of the convolution layers in the multi-layer convolution operation.
The storage component includes two or more storage units, and each storage unit receives and stores the weight data and the processing data.
Among the storage units, the processing data and the weight data of the same convolution layer are stored in different storage units; within the same storage unit, the weight data spaces storing weight data and the processing data spaces storing processing data are arranged sequentially along the first address direction.
The weight data spaces storing the weight data of the convolution layers are arranged sequentially along the first address direction according to the processing order of the convolution layers; the processing data spaces storing the processing data of the convolution layers are likewise arranged sequentially along the first address direction according to the processing order of the convolution layers.
Within the weight data space storing the weight data of one convolution layer, and within the processing data space storing the processing data of that same convolution layer, the weight data and the processing data are each arranged sequentially along the second address direction, the first address direction being opposite to the second address direction.
In a possible implementation, the first address direction is from high address to low address, and the second address direction is from low address to high address.
In a possible implementation, the processing component is further configured to send the operation result data of each convolution layer in the multi-layer convolution operation into the storage units for storage;
each storage unit receives and stores the operation result data, wherein the operation result data of any convolution layer is stored in the same storage unit as the processing data of that convolution layer and shares storage space with the processing data of the next convolution layer.
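The sharing of result space can be pictured with a small sketch; the data-space model and layer count below are illustrative assumptions, not the patent's notation:

```python
# The result of layer k lands in the same storage unit as layer k's
# processing data and is consumed in place as layer k+1's input.
data_unit = {"processing": "input"}   # data space of one storage unit

for k in range(3):
    result = f"conv{k}({data_unit['processing']})"
    # the result shares the space read as the next layer's processing data
    data_unit["processing"] = result

print(data_unit["processing"])  # conv2(conv1(conv0(input)))
```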
In a possible implementation, the processing component is further configured to write first data received from outside the computing core into a storage unit, and to read second data from a storage unit for sending outside the computing core;
wherein, over multiple write operations, the start addresses of the write operations are arranged along the first address direction, and within each write operation the first data is written along the second address direction;
over multiple read operations, the start addresses of the read operations are arranged along the first address direction, and within each read operation the second data is read along the second address direction.
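Put together, the write pattern can be sketched as follows (the unit size and block sizes are illustrative assumptions):

```python
# Successive write operations start at addresses descending along the
# first address direction (high -> low); each individual write then
# fills its span upward along the second direction (low -> high).
UNIT_TOP = 0x8000          # top of the storage unit (assumed)

def plan_writes(sizes, top=UNIT_TOP):
    spans, cursor = [], top
    for size in sizes:
        cursor -= size     # next start address is lower (1st direction)
        spans.append((cursor, cursor + size))
    return spans

for start, end in plan_writes([0xC00, 0xC00, 0xC00]):
    print(f"write {start:#06x} -> {end:#06x}")  # low->high inside span
# write 0x7400 -> 0x8000
# write 0x6800 -> 0x7400
# write 0x5c00 -> 0x6800
```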
In a possible implementation, in the storage unit, the storage order of the processing data and the weight data of each convolution layer matches a convolution operation that proceeds depth direction first, then horizontal, then vertical.
In a possible implementation, each layer of the convolution operation uses multiple convolution kernel groups, each convolution kernel group including multiple convolution kernels;
in the storage unit, the weight data of the convolution kernel groups of each layer are stored in sequence, and the weight data of each convolution kernel group are stored in the order of kernel number, then kernel depth direction, then horizontal direction, then vertical direction.
For the above implementations of the computing core, reference may be made to the description of the data storage method above, which is not repeated here.
In a possible implementation, an embodiment of the present disclosure further provides an artificial intelligence chip, the chip including at least one computing core as described above. The chip may include multiple processors, and a processor may include multiple computing cores; the present disclosure does not limit the number of computing cores in the chip.
In a possible implementation, an embodiment of the present disclosure provides an electronic device including one or more of the above artificial intelligence chips.
FIG. 9 is a structural diagram of a combined processing apparatus 1200 according to an embodiment of the present disclosure. As shown in FIG. 9, the combined processing apparatus 1200 includes a computing processing apparatus 1202 (for example, the above artificial intelligence processor including multiple computing cores), an interface apparatus 1204, other processing apparatus 1206, and a storage apparatus 1208. Depending on the application scenario, the computing processing apparatus may include one or more computing apparatuses 1210 (for example, computing cores).
In a possible implementation, the computing processing apparatus of the present disclosure may be configured to perform user-specified operations. In exemplary applications, the computing processing apparatus may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing apparatuses included in the computing processing apparatus may be implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core. When multiple computing apparatuses are implemented as artificial intelligence processor cores or parts of the hardware structure of such cores, the computing processing apparatus of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure.
In exemplary operation, the computing processing apparatus of the present disclosure may interact with other processing apparatuses through the interface apparatus to jointly complete user-specified operations. Depending on the implementation, the other processing apparatuses of the present disclosure may include one or more types of general-purpose and/or special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and an artificial intelligence processor. These processors may include, but are not limited to, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like, and their number may be determined according to actual needs. As mentioned above, the computing processing apparatus of the present disclosure considered on its own may be regarded as having a single-core structure or a homogeneous multi-core structure; however, when the computing processing apparatus and the other processing apparatuses are considered together, the two may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing apparatus may serve as an interface between the computing processing apparatus of the present disclosure (which may be embodied as a computing apparatus related to artificial intelligence operations such as neural network operations) and external data and control, performing basic controls including but not limited to data movement and the starting and/or stopping of the computing apparatus. In other embodiments, the other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
In one or more embodiments, the interface apparatus may be used to transfer data and control instructions between the computing processing apparatus and the other processing apparatuses. For example, the computing processing apparatus may obtain input data from the other processing apparatuses via the interface apparatus and write it into an on-chip storage apparatus (or memory) of the computing processing apparatus. Further, the computing processing apparatus may obtain control instructions from the other processing apparatuses via the interface apparatus and write them into an on-chip control cache of the computing processing apparatus. Alternatively or optionally, the interface apparatus may also read data in the storage apparatus of the computing processing apparatus and transmit it to the other processing apparatuses.
Additionally or optionally, the combined processing apparatus of the present disclosure may further include a storage apparatus. As shown in the figure, the storage apparatus is connected to the computing processing apparatus and the other processing apparatus respectively. In one or more embodiments, the storage apparatus may be used to hold data of the computing processing apparatus and/or the other processing apparatus, for example, data that cannot be fully held in the internal or on-chip storage of the computing processing apparatus or the other processing apparatus.
Depending on the application scenario, the artificial intelligence chip of the present disclosure may be used in servers, cloud servers, server clusters, data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, dashboard cameras, navigators, sensors, webcams, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical devices. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.
FIG. 10 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. Referring to FIG. 10, the electronic device 1900 includes a processing component 1922 (for example, an artificial intelligence processor including multiple computing cores), which further includes one or more computing cores, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 1922 is configured to execute instructions so as to perform the method described above.
The electronic device 1900 may further include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure. Furthermore, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit may physically exist separately.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments. The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The electronic device or processor of the present disclosure may also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare. Further, the electronic device or processor of the present disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, an electronic device or processor with high computing power according to the present disclosure may be applied to cloud devices (for example, cloud servers), while an electronic device or processor with low power consumption may be applied to terminal devices and/or edge devices (for example, smartphones or webcams). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are mutually compatible, so that, according to the hardware information of the terminal device and/or edge device, suitable hardware resources can be matched out of the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby completing unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (12)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110172560.5A CN112799599B (en) | 2021-02-08 | 2021-02-08 | A data storage method, computing core, chip and electronic device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112799599A CN112799599A (en) | 2021-05-14 |
| CN112799599B true CN112799599B (en) | 2022-07-15 |
Family
ID=75814832
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110172560.5A Active CN112799599B (en) | 2021-02-08 | 2021-02-08 | A data storage method, computing core, chip and electronic device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112799599B (en) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TWI799169B (en) * | 2021-05-19 | 2023-04-11 | 神盾股份有限公司 | Data processing method and circuit based on convolution computation |
| CN114201443A (en) * | 2021-12-16 | 2022-03-18 | 清华大学 | Data processing method, device, electronic device and storage medium |
| CN114942731B (en) * | 2022-07-25 | 2022-10-25 | 北京星天科技有限公司 | Data storage method and device |
| CN114968602B (en) * | 2022-08-01 | 2022-10-21 | 成都图影视讯科技有限公司 | Architecture, method and apparatus for a dynamically resource-allocated neural network chip |
| CN116301648A (en) * | 2023-03-24 | 2023-06-23 | 昆仑芯(北京)科技有限公司 | Data processing device, method, electronic device, and storage medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107329734A (en) * | 2016-04-29 | 2017-11-07 | 北京中科寒武纪科技有限公司 | A device and method for performing forward operation of convolutional neural network |
| CN111860812A (en) * | 2016-04-29 | 2020-10-30 | 中科寒武纪科技股份有限公司 | An apparatus and method for performing convolutional neural network training |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3087489A4 (en) * | 2013-12-26 | 2017-09-20 | Intel Corporation | Data reorder during memory access |
| CN108229648B (en) * | 2017-08-31 | 2020-10-09 | 深圳市商汤科技有限公司 | Convolution calculation method, device, equipment and medium for matching data bit width in memory |
| CN109992198B (en) * | 2017-12-29 | 2020-07-24 | 深圳云天励飞技术有限公司 | Data transmission method of neural network and related product |
| CN110309912B (en) * | 2018-03-27 | 2021-08-13 | 赛灵思公司 | Data access method and device, hardware accelerator, computing equipment and storage medium |
| US11340810B2 (en) * | 2018-10-09 | 2022-05-24 | Western Digital Technologies, Inc. | Optimizing data storage device operation by grouping logical block addresses and/or physical block addresses using hints |
- 2021-02-08 CN CN202110172560.5A patent/CN112799599B/en active Active
Also Published As
| Publication number | Publication date |
|---|---|
| CN112799599A (en) | 2021-05-14 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112799599B (en) | A data storage method, computing core, chip and electronic device | |
| US11775430B1 (en) | Memory access for multiple circuit components | |
| US20230026006A1 (en) | Convolution computation engine, artificial intelligence chip, and data processing method | |
| CN109284823A (en) | A computing device and related products | |
| EP4071619B1 (en) | Address generation method, related device and storage medium | |
| CN107632965B (en) | Reconfigurable S-shaped computing device and computing method | |
| CN112799598B (en) | Data processing method, processor and electronic equipment | |
| WO2023123919A1 (en) | Data processing circuit, data processing method, and related product | |
| WO2023045446A1 (en) | Computing apparatus, data processing method, and related product | |
| CN116185377A (en) | Calculation graph optimization method, computing device and related products | |
| CN113469333B (en) | Artificial intelligence processors, methods and related products for executing neural network models | |
| CN118916596A (en) | Matrix data transposition method, matrix data transposition device and chip | |
| CN114691083B (en) | Matrix multiplication circuit, method and related product | |
| WO2022257980A1 (en) | Computing apparatus, method for implementing convulution operation by using computing apparatus, and related product | |
| CN117235424A (en) | Computing device, computing method and related product | |
| CN112817898A (en) | Data transmission method, processor, chip and electronic equipment | |
| CN114692840A (en) | Data processing device, data processing method and related product | |
| CN112801276B (en) | Data processing method, processor and electronic device | |
| CN112084023A (en) | Data parallel processing method, electronic device and computer readable storage medium | |
| CN112801278B (en) | Data processing method, processor, chip and electronic equipment | |
| CN112766475B (en) | Processing component and artificial intelligence processor | |
| CN111291884A (en) | Neural network pruning method, apparatus, electronic device and computer readable medium | |
| CN114692844B (en) | Data processing device, data processing method and related products | |
| TWI798591B (en) | Convolutional neural network operation method and device | |
| WO2023087698A1 (en) | Computing apparatus and method for executing convolution operation, and related products |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 2021-05-14 | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| 2022-07-15 | GR01 | Patent grant | |