
WO2023103334A1 - Data processing method and apparatus of neural network simulator, and terminal - Google Patents


Info

Publication number
WO2023103334A1
WO2023103334A1 · PCT/CN2022/100386 · CN2022100386W
Authority
WO
WIPO (PCT)
Prior art keywords
data
cache
level
destination
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2022/100386
Other languages
French (fr)
Chinese (zh)
Inventor
袁华隆
蔡万伟
蒋文
汪永威
王和国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Intellifusion Technologies Co Ltd
Jiangsu Intellifusion Technologies Co Ltd
Original Assignee
Shenzhen Intellifusion Technologies Co Ltd
Jiangsu Intellifusion Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Intellifusion Technologies Co Ltd and Jiangsu Intellifusion Technologies Co Ltd
Publication of WO2023103334A1
Legal status: Ceased

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements

Definitions

  • The present application belongs to the technical field of data processing, and in particular relates to a data processing method, apparatus and terminal for a neural network simulator.
  • Neural network simulators have shown significant advantages in processor microarchitecture design, TVM tool chain development and promotion, and RTL development verification.
  • However, the current data processing methods of neural network simulators can no longer meet practical needs.
  • Embodiments of the present application provide a data processing method, apparatus, terminal and computer-readable storage medium for a neural network simulator, which can reduce the complexity of the neural network simulator, so that it plays a significant role in the neural network processor's large-data handling, large-computing-power characteristics, architecture evaluation, instruction set tool chain development, RTL verification and other aspects.
  • the first aspect of the embodiment of the present application provides a data processing method of a neural network simulator, including:
  • the instruction data includes a first transportation instruction carrying a first transportation parameter, a second transportation instruction carrying a second transportation parameter, and a granular operation instruction;
  • transferring the data at the source end to the destination end in a transaction-level transfer manner includes:
  • the data at the source end is transferred to the destination end in a transaction-level transfer manner.
  • the second transfer parameters include the operation mode corresponding to the currently transferred data;
  • the step of transferring the data of the destination end to the cache with cycle-level precision according to the second transfer parameter carried by the second transfer instruction includes:
  • according to the real data volume and the data volume transferred from the source end to the destination end, judging whether the data volume at the destination end is greater than or equal to the real data volume required by the granular operation;
  • if so, moving the data at the destination end to the cache with cycle-level precision according to the second transfer parameter.
  • the second transfer parameters include cutting parameters and the operation mode corresponding to the currently transferred data;
  • the step of transferring the data of the destination to the cache according to the second transfer parameter carried by the second transfer instruction with cycle-level precision includes:
  • when the target data required by the granular operation is part of the data in the matrix data of the destination end, the second transfer parameter includes the first position coordinates corresponding to each datum in that part of the data;
  • the transferring the data of the destination end to the cache according to the second transfer parameter with cycle-level accuracy includes:
  • the data corresponding to the first position coordinate in the matrix data of the destination end is transferred to the cache.
  • when the target data required by the granular operation is the value transformed before winograd, the second transfer parameter includes the second position coordinates of the data in the 4*4 data table required for the pre-winograd transformation;
  • the transferring the data of the destination end to the cache according to the second transfer parameter with cycle-level accuracy includes:
  • before the step of performing the granular operation based on the data in the cache to obtain cycle-level data processing results if the cache is not empty, the method includes:
  • Whether the cache is empty is judged according to whether the read address of the cache coincides with the write address and the value of the wrapping flag bit of the cache.
  • the second aspect of the embodiment of the present application provides a data processing device for a neural network simulator, including:
  • An acquisition unit configured to acquire instruction data;
  • the instruction data includes a first transportation instruction carrying a first transportation parameter, a second transportation instruction carrying a second transportation parameter, and a granular operation instruction;
  • the first transport unit is configured to transport the data at the source end to the destination end in a transaction-level transport manner according to the first transport parameters carried by the first transport instruction;
  • the second transfer unit is configured to transfer the data at the destination to the cache with cycle-level precision according to the second transfer parameters carried by the second transfer instruction;
  • the data processing unit is configured to, if the cache is not empty, perform granular operations based on the data in the cache to obtain cycle-level data processing results.
  • the first handling unit is also used for:
  • the data at the source end is transferred to the destination end in a transaction-level transfer manner.
  • the second handling parameters include the operation mode corresponding to the currently transported data
  • the second handling unit is also used for:
  • according to the real data volume and the data volume transferred from the source end to the destination end, judge whether the data volume at the destination end is greater than or equal to the real data volume required by the granular operation;
  • the data at the destination is moved to the cache with cycle-level precision according to the second moving parameter.
  • the second transfer parameters include cutting parameters and the operation mode corresponding to the currently transferred data;
  • the second handling unit is also used for:
  • when the target data required by the granular operation is part of the data in the matrix data of the destination end, the second transfer parameter includes the first position coordinates corresponding to each datum in that part of the data;
  • the second handling unit is also used for:
  • the data corresponding to the first position coordinate in the matrix data of the destination end is transferred to the cache.
  • when the target data required by the granular operation is the value transformed before winograd, the second transfer parameter includes the second position coordinates of the data in the 4*4 data table required for the pre-winograd transformation;
  • the second handling unit is also used for:
  • the data processing unit is further configured to:
  • before the granular operation is executed based on the data in the cache and the cycle-level data processing result is obtained, obtain the value of the wrapping flag bit of the cache, and the read address and write address of the cache;
  • Whether the cache is empty is judged according to whether the read address of the cache coincides with the write address and the value of the wrapping flag bit of the cache.
  • the third aspect of the embodiments of the present application provides a terminal, including a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • When the processor executes the computer program, the steps of the foregoing method are implemented.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the foregoing method are implemented.
  • The data at the source end is transferred to the destination end in a transaction-level transfer manner, the data at the destination end is transferred to the cache with cycle-level precision, and the granular operation is then executed based on the data in the cache and the granular operation instruction.
  • This makes the neural network simulator of the present application mix cycle-level and transaction-level simulator design methods, realizing cycle-level accurate instruction operation and transaction-level fuzzy data transfer. It also allows the simulator's instructions to be calculated at the cycle level, which maintains consistency and accuracy with the hardware, optimizes the cycle-level dependence of data handling, and reduces the complexity of the neural network simulator, enabling it to play a significant role in the neural network processor's large-data handling, large-computing-power characteristics, architecture evaluation, instruction set tool chain development, RTL verification and other aspects.
  • Fig. 1 is an implementation flowchart of a data processing method for a neural network simulator provided by an embodiment of the present application;
  • Fig. 2 is a schematic diagram of data cutting provided by the embodiment of the present application.
  • FIG. 3 is a schematic diagram of a first specific implementation flow chart of step 103 of a data processing method for a neural network simulator provided in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a second specific implementation flow chart of step 103 of a data processing method for a neural network simulator provided in an embodiment of the present application;
  • Fig. 5 is a schematic diagram of matrix data handling provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of judging whether the cache is empty provided by the embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a data processing device of a neural network simulator provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • a neural network simulator is a technical tool that provides modeling or some sort of research prototype for an artificial neural network.
  • neural network simulators are a resource for researchers to study how neural networks work. Different kinds of data collection help the simulator evaluate what's going on inside the artificial neural network.
  • neural network simulators often include versatile visual interfaces that display data graphically. Many of these have multiple windows that can be labeled as data modules for easy identification.
  • Embodiments of the present application provide a data processing method, apparatus, and terminal for a neural network simulator, which mix cycle-level and transaction-level simulator design methods, can realize cycle-level accurate instruction operations and transaction-level fuzzy data handling, maintain consistency and accuracy with the hardware, and can play a significant role in the neural network processor's big-data handling, large-computing-power characteristics, architecture evaluation, instruction set tool chain development, RTL verification, and other aspects.
  • FIG. 1 shows a schematic flowchart of a data processing method for a neural network simulator provided by an embodiment of the present application.
  • the method is applied to a terminal and can be executed by a data processing device of a neural network simulator configured on the terminal.
  • the above-mentioned terminal may be an intelligent terminal such as a computer or a server.
  • The data processing method of the above neural network simulator may include steps 101 to 104, described in detail as follows:
  • Step 101: acquire instruction data.
  • the above instruction data may include a first transportation instruction carrying a first transportation parameter, a second transportation instruction carrying a second transportation parameter, and a granular operation instruction.
  • the above instruction data may be instruction data acquired by different modules in the neural network simulator from the instruction control flow.
  • the above-mentioned first handling instruction is the instruction data obtained by the exdma module in the neural network simulator from the instruction control flow
  • the above-mentioned second handling instruction is the instruction data obtained by the xdma module in the neural network simulator from the instruction control flow
  • the above granular operation instruction is the instruction data obtained by the granular operation module cube in the neural network simulator from the instruction control flow.
  • the first transfer instruction is used to transfer the data at the source end to the destination end in a transaction-level transfer manner
  • the second transfer instruction is used to transfer the data at the destination end to the cache with cycle-level precision.
  • The above source end and destination end may be different modules in the neural network simulator; for example, the source end may be an exdma module in the neural network simulator, such as an eidma module or an eodma module, and the destination end may be an xdma module in the neural network simulator, such as an idma module or an odma module.
  • The above first transfer parameter may include the amount of data transferred from the source end to the destination end each time, that is, the handshake granularity, for example 1 Kb or 5 Kb, as well as the total amount of data that currently needs to be moved.
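  • As an illustration of the transaction-level transfer described above, the sketch below splits a total payload into handshake-granularity chunks. The function name, units, and numbers are hypothetical, not taken from the patent; the point is that transaction-level handling deals only in whole handshake-sized transactions with no cycle-level timing.

```python
def transaction_transfer(total_size, granularity):
    """Split a source-to-destination move into handshake-sized transactions.

    total_size and granularity are in the same (arbitrary) unit; returns the
    list of per-transaction chunk sizes. Purely illustrative.
    """
    chunks = []
    remaining = total_size
    while remaining > 0:
        chunk = min(granularity, remaining)  # one handshake moves at most `granularity`
        chunks.append(chunk)
        remaining -= chunk
    return chunks
```

For example, moving a 5000-unit payload at a granularity of 1024 takes five transactions, the last one partial.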
  • the above-mentioned second transport parameters may include cutting parameters and an operation mode corresponding to the current transport data.
  • The above operation modes may include a winograd operation mode, a matrix operation mode, a padding mode (padding), a 0-insertion mode (deconvolution), and an address-jump mode (dilated).
  • For example, in the winograd operation mode winograd_loop16, the 16 indicates 16 pixel cycles.
  • data ci0 can be obtained by cutting the data at the destination, and data ci0 to data ci7 can be obtained sequentially by cutting along the direction of ci.
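  • The cutting of ci0 through ci7 described above can be sketched as slicing the destination-end data into equal shares along the ci direction. The flat-list representation and equal-share split are assumptions for illustration; the patent does not specify the layout.

```python
def cut_along_ci(dest_data, num_slices=8):
    """Cut destination-end data into ci0..ci7 slices along the ci direction.

    dest_data is modelled as a flat list; each slice receives an equal,
    contiguous share. Illustrative only.
    """
    step = len(dest_data) // num_slices
    return [dest_data[i * step:(i + 1) * step] for i in range(num_slices)]
```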
  • the data at the source end may be image data or parameter data, which is not limited in this application.
  • Step 102: transfer the data at the source end to the destination end in a transaction-level transfer manner according to the first transfer parameter carried in the first transfer instruction.
  • Moving the data from the source end to the destination end belongs to the data preparation process. It uses transaction-level, loosely coupled handling, which can be independent of the cycle-level process of transferring the destination-end data to the cache, so that the destination-end data can be prepared in advance, reducing the waiting time when transferring that data to the cache.
  • moving the data at the source end to the destination end in a transaction-level transfer manner may include: based on the communication handshake between the source end and the destination end, transferring the Data is moved to the destination using transaction-level handling.
  • The communication handshake between the source end and the destination end means that the destination end needs to wait for the source end to write the data into the destination end, and after the source end writes the data, it needs to notify the destination end that the data is ready.
  • Before the destination end reads data from its own storage space dm, it sends a waiting signal dest_wo_src to the source end, waiting for the source end to write data into the destination end's dm. After the source end writes the data, it sends the enable signal src_ub_dest to the destination end; the destination end then reads the data from dm and, after completing the read, sends the enable signal dest_ub_src to the source end to notify it that the data space has been freed.
  • Moreover, before the destination end reads data from dm, it can continue to send the waiting signal dest_wo_src to the source end; the source end counts these cumulatively and stores data into dm in advance, without having to wait for the destination end to finish reading the data in dm. This reduces the waiting time for data reading and improves the efficiency of data storage and reading.
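  • The handshake above can be sketched as a toy channel model. The signal names (dest_wo_src, src_ub_dest, dest_ub_src) follow the text; the cumulative-counting scheme and the class structure are assumptions made for illustration, not the patent's implementation.

```python
class HandshakeChannel:
    """Toy model of the source/destination communication handshake.

    dest_wo_src: destination announces it is waiting for data;
    src_ub_dest: source signals that data has been written into dm;
    dest_read:  destination reads from dm, implicitly freeing space
                (the dest_ub_src notification back to the source).
    """
    def __init__(self):
        self.pending_waits = 0   # cumulative count of dest_wo_src signals
        self.dm = []             # destination storage space dm

    def dest_wo_src(self):
        # Destination may keep sending waits before reading; source counts them.
        self.pending_waits += 1

    def src_ub_dest(self, data):
        # Source writes into dm ahead of time, then enables the destination.
        self.dm.append(data)

    def dest_read(self):
        # Destination reads only prepared data, consuming one pending wait.
        if self.dm:
            self.pending_waits = max(0, self.pending_waits - 1)
            return self.dm.pop(0)
        return None
```

Because the source can satisfy accumulated waits in advance, the destination does not stall between reads.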
  • Step 103: transfer the data at the destination end to the cache with cycle-level precision according to the second transfer parameter carried by the second transfer instruction.
  • the cache refers to the storage space corresponding to the granular operation.
  • the above-mentioned transfer of destination data to the cache belongs to cycle-level transfer and calculation.
  • Transferring the data at the destination end to the cache with cycle-level precision according to the second transfer parameter carried by the second transfer instruction may include steps 301 to 303, as follows.
  • Step 301: cut the data of the destination end with cycle-level accuracy according to the cutting parameters in the second transfer parameters to obtain the cut data.
  • Step 302: perform calculations on the cut data according to the operation mode corresponding to the currently transferred data to obtain the target data required by the granular operation.
  • The above operation modes may include the padding mode (padding), the 0-insertion mode (deconvolution), and the address-jump mode (dilated).
  • Step 303: move the target data to the cache.
  • The target data is transferred directly to the cache, without an intermediate cache.
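  • Steps 301 to 303 can be sketched end to end as cut, transform, and move. Only a padding mode is modelled here, and the zero-padding shape is an assumption for illustration; the other operation modes are left as pass-through placeholders.

```python
def move_to_cache(dest_data, cut_size, op_mode, cache):
    """Sketch of steps 301-303: cut the destination-end data, derive the
    target data for the given operation mode, and push it straight into the
    cache with no intermediate buffer. Illustrative only."""
    # Step 301: cut with the cutting parameter
    cut = dest_data[:cut_size]
    # Step 302: derive the target data for the operation mode
    if op_mode == "padding":
        target = [0] + cut + [0]   # hypothetical zero padding on both ends
    else:
        target = cut               # pass-through for unmodelled modes
    # Step 303: move the target data directly to the cache
    cache.extend(target)
    return cache
```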
  • Step 104: if the cache is not empty, perform the granular operation based on the data in the cache to obtain cycle-level data processing results.
  • When the cache is not empty, the data used for the granular operation has been prepared; therefore, the granular operation, that is, the cube operation, can be performed directly based on the data in the cache. This calculation is cycle-level accurate, so cycle-level accurate results are output, which can be used for RTL data calculation comparison.
  • The data at the source end is transferred to the destination end in a transaction-level transfer manner, the data at the destination end is transferred to the cache with cycle-level precision, and the granular operation is then executed based on the data in the cache and the granular operation instruction.
  • This makes the neural network simulator of this application mix cycle-level and transaction-level simulator design methods, realizing cycle-level accurate instruction operation and transaction-level fuzzy data transfer, maintaining consistency and accuracy with the hardware, and allowing it to play a great role in the neural network processor's big-data handling, large-computing-power characteristics, architecture evaluation, instruction set tool chain development, RTL verification, and other aspects.
  • the data at the source end may be face image data
  • the first transfer parameter carried by the above first transfer instruction may include the data volume of the face image data
  • the first transfer instruction is used to transfer the face image data at the source end to the destination end in a transaction-level transfer manner
  • the above-mentioned cube operation refers to using a neural network algorithm to perform operations on the face image data in the cache to obtain a face classification result.
  • the face recognition result corresponding to the above face image data is, for example, Zhang San's face or Li Si's face.
  • the above neural network algorithm can include Layer Cubing algorithm (layer-by-layer algorithm), By-layer Spark Cubing algorithm, Fast(in-mem) Cubing algorithm, that is, “by segment” (By Segment) or “by block” (By Split) algorithm and other neural network algorithms, this application does not limit it.
  • This application transfers the face image data at the source end to the destination end in a transaction-level transfer manner, transfers the face image data at the destination end to the cache with cycle-level accuracy, and then executes the granular operation based on the face image data in the cache and the granular operation instruction.
  • In this way, the neural network simulator of this application mixes cycle-level and transaction-level simulator design methods, realizing cycle-level accurate instruction operations and transaction-level fuzzy data transfer.
  • The model thus maintains consistency and accuracy with the hardware, and can play a great role in the neural network processor's big-data handling, large-computing-power characteristics, architecture evaluation, instruction set tool chain development, RTL verification, and other aspects.
  • The above neural network simulator may also be a neural network simulator for simulating the working process of an artificial neural network model in other application scenarios; for example, it may be a simulator of the working process of an artificial neural network model for recognition, obstacle recognition or animal classification. The present application does not limit this.
  • the above-mentioned data at the source end may also be different types of data such as voice data, and this application does not limit the data type of the data at the source end.
  • the above-mentioned neural network simulator may be a neural network simulator for analyzing and processing speech data, for example, the above-mentioned neural network simulator may be a A neural network simulator for processing speech data such as classification, denoising, etc.
  • The amount of data transferred from the source end to the destination end and the real data volume required by the granular operation can first be synchronized to determine whether there is data at the destination end that can be transferred, and the data is then transferred. Specifically, this may be implemented as shown in FIG. 4.
  • Step 401: calculate the real data volume corresponding to the destination-end data required by the granular operation according to the operation mode corresponding to the currently transferred data.
  • The operation mode corresponding to the currently transferred data may include the winograd operation mode, the matrix operation mode, the padding mode (padding), the 0-insertion mode (deconvolution), and the address-jump mode (dilated).
  • Since the target data required by the granular operation may be obtained according to the operation mode corresponding to the currently transferred data, rather than being the destination-end data directly, the data volume corresponding to that target data and the real data volume corresponding to the destination-end data required by the granular operation may be inconsistent.
  • The data volume corresponding to the target data required by the granular operation is greater than the real data volume corresponding to the destination-end data required by the granular operation.
  • Because the data volume of the target data is known and fixed, the real data volume corresponding to the destination-end data required by the granular operation can be deduced inversely according to the operation mode corresponding to the currently transferred data.
  • That is, the real data volume corresponding to the destination-end data required by the granular operation can be calculated from the operation mode and the data volume of the target data.
  • Step 402: according to the real data volume and the data volume transferred from the source end to the destination end, determine whether the data volume at the destination end is greater than or equal to the real data volume required by the granular operation.
  • Step 403: if the data volume at the destination end is greater than or equal to the real data volume required by the granular operation, move the data at the destination end to the cache with cycle-level precision according to the second transfer parameter.
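  • Steps 401 to 403 can be sketched in miniature: back-derive the real (input) data volume from the operation mode and the known target-data volume, then check whether enough data has reached the destination end. The per-mode adjustment factors below are illustrative assumptions, not values from the patent.

```python
def ready_to_move(op_mode, target_volume, delivered_volume):
    """Return True when the destination end holds at least the real data
    volume the granular operation needs (steps 401-403, illustrative)."""
    # Step 401: invert the operation mode to get the real input volume.
    if op_mode == "padding":
        real_volume = target_volume - 2      # padding adds data, so less input suffices
    elif op_mode == "deconvolution":
        real_volume = target_volume // 2     # 0-insertion roughly doubles the output
    else:
        real_volume = target_volume          # e.g. matrix mode: volumes match
    # Steps 402-403: move only when the destination has enough real data.
    return delivered_volume >= real_volume
```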
  • The amount of data transferred from the source end to the destination end and the real data volume required by the granular operation need to be synchronized only once; that is, no cycle-level synchronization of intermediate processes is needed. This allows the transaction-level process of moving data from the source end to the destination end to be independent of the cycle-level process of moving data from the destination end to the cache, so that the destination-end data can be prepared in advance, reducing the waiting time when moving it to the cache.
  • The data at the source end can also be moved to the destination end using a transaction-level transfer method, and when the data at the destination end is transferred to the cache with cycle-level precision, the computation-dependent data sets of different modes can be abstracted, such as the data sets on which the transpose operation mode and the winograd operation depend. This reduces the operation-level iteration of complex signals in RTL cycle-level scenarios and the complexity of cycle-level modeling, while greatly improving the performance of the neural network simulator.
  • When the target data required by the granular operation is part of the data in the matrix data of the destination end, the above second transfer parameter may include the corresponding first position coordinates. In the above step 103, transferring the data of the destination end to the cache with cycle-level precision according to the second transfer parameter may include: transferring the data corresponding to the first position coordinates in the matrix data of the destination end to the cache.
  • For example, when the target data is a row of the matrix data at the destination end, the above second transfer parameters may include the first position coordinates corresponding to each datum in that row; the row of data is then selected from the matrix data at the destination end according to the first position coordinates and moved to the cache.
  • Similarly, the first position coordinates can be used to select a column containing 16 data and move it to the cache.
  • When the target data required by the granular operation is the value transformed before winograd, the second transfer parameter may include the second position coordinates of the data in the 4*4 data table required for the pre-winograd transformation.
  • Transferring the data of the destination end to the cache with cycle-level accuracy according to the second transfer parameter may then include: reading the 4*4 data table stored at the destination end according to the second position coordinates, calculating the pre-winograd transformed value based on the 4*4 data table, and transferring that value to the cache.
  • The 4*4 data table is obtained by shifting the coordinates downward and to the right by 3. No handshake with the hardware cache is needed, which decouples the data handling, effectively reduces the operation-level iteration of complex signals in RTL cycle-level scenarios, reduces the complexity of cycle-level modeling, and greatly improves the performance of the neural network simulator.
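  • For reference, in the standard winograd F(2x2, 3x3) formulation the pre-transform on a 4*4 input tile d is V = Bᵀ d B. The patent does not spell out the matrices, so the textbook input-transform matrix Bᵀ below is an assumption about which transform is meant.

```python
# Textbook input-transform matrix B^T for winograd F(2x2, 3x3).
BT = [[1,  0, -1,  0],
      [0,  1,  1,  0],
      [0, -1,  1,  0],
      [0,  1,  0, -1]]

def matmul4(a, b):
    """Plain 4x4 matrix product, no external dependencies."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def winograd_input_transform(d):
    """Compute V = B^T d B for a 4x4 input tile d (the pre-winograd value)."""
    B = [[BT[j][i] for j in range(4)] for i in range(4)]  # B = (B^T)^T
    return matmul4(matmul4(BT, d), B)
```

For a constant tile the transform concentrates the value: an all-ones tile yields a single nonzero entry.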
  • The neural network simulator instruction set of this application is calculated at the cycle level, which maintains consistency and accuracy with the hardware.
  • The data transfer adopts transaction-level handling and, through the basic handshake granularity, abstracts the data sets on which calculations depend in different modes and scenarios, such as transposition, the winograd algorithm, padding, deconvolution, and dilated. This reduces system complexity, iterates at the scratchpad transfer level for cycle-level scenarios with complex signal operations, reduces the complexity of cycle-level modeling, and greatly improves system performance.
  • Before the above step 104, it may be checked whether the cache is empty.
  • The cache adopts a first-in-first-out read/write method, and the value of the wrapping flag can be obtained through the read/write interaction between the cube computing module and the destination end.
  • Whether the cache is empty is determined from the value of the wrapping flag together with the read address and write address of the cache, which simplifies the handshake process of the cycle-level cube operation, improves the operation efficiency, and reduces the complexity of the neural network simulator, making the cycle-level cube operation and the hardware settings independent of each other.
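  • The empty test described above is the classic ring-buffer scheme: the cache is empty exactly when the read and write addresses coincide and the wrap (winding) flag bits agree, and full when the addresses coincide but the flags differ. The sketch below is a minimal model of that scheme; the field names are illustrative.

```python
class RingCache:
    """Minimal FIFO cache with wrap-flag-based empty/full detection."""
    def __init__(self, depth):
        self.depth = depth
        self.buf = [None] * depth
        self.rd, self.rd_wrap = 0, 0   # read address and its wrap flag
        self.wr, self.wr_wrap = 0, 0   # write address and its wrap flag

    def is_empty(self):
        # Same address AND same wrap flag -> nothing unread.
        return self.rd == self.wr and self.rd_wrap == self.wr_wrap

    def is_full(self):
        # Same address but wrap flags differ -> writer lapped the reader.
        return self.rd == self.wr and self.rd_wrap != self.wr_wrap

    def push(self, x):
        assert not self.is_full()
        self.buf[self.wr] = x
        self.wr += 1
        if self.wr == self.depth:
            self.wr, self.wr_wrap = 0, self.wr_wrap ^ 1  # wrap around

    def pop(self):
        assert not self.is_empty()
        x = self.buf[self.rd]
        self.rd += 1
        if self.rd == self.depth:
            self.rd, self.rd_wrap = 0, self.rd_wrap ^ 1  # wrap around
        return x
```

The wrap flag is what disambiguates the coinciding-address case, so the full depth of the buffer is usable without a separate element counter.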
  • After the granular operation is performed based on the data in the cache and the cycle-level data processing result is obtained, the result can also be stored at the destination end and provided to other granular operations.
  • The data at the source end is transferred to the destination end in a transaction-level transfer manner, the data at the destination end is transferred to the cache with cycle-level precision, and the granular operation is then executed based on the data in the cache and the granular operation instruction.
  • This makes the neural network simulator of the present application mix cycle-level and transaction-level simulator design methods, realizing cycle-level accurate instruction operation and transaction-level fuzzy data transfer, allowing the simulator's instructions to be calculated at the cycle level, maintaining consistency and accuracy with the hardware, optimizing the cycle-level dependence of data transfer, and reducing the complexity of the neural network simulator, so that it can play a significant role in the neural network processor's large-data handling, large-computing-power characteristics, architecture evaluation, instruction set tool chain development, RTL verification, and other aspects.
  • FIG. 7 shows a schematic structural diagram of a data processing device 700 for a neural network simulator provided by an embodiment of the present application, including an acquisition unit 701 , a first handling unit 702 , a second handling unit 703 and a data processing unit 704 .
  • An acquisition unit 701 configured to acquire instruction data; the instruction data includes a first transportation instruction carrying a first transportation parameter, a second transportation instruction carrying a second transportation parameter, and a granular operation instruction;
  • the first transfer unit 702 is configured to transfer the data at the source end to the destination end in a transaction-level transfer manner according to the first transfer parameters carried in the first transfer instruction;
  • the second transfer unit 703 is configured to transfer the data at the destination to the cache with cycle-level precision according to the second transfer parameters carried by the second transfer instruction;
  • the data processing unit 704 is configured to execute the granular operation based on the data in the cache if the cache is not empty, and obtain a cycle-level data processing result.
  • the above-mentioned first handling unit 702 is also used for:
  • based on the communication handshake between the source end and the destination end, the data at the source end is transferred to the destination end in a transaction-level transfer manner.
  • the second transport parameters include the calculation mode corresponding to the currently transported data; the above-mentioned second transport unit 703 is also used to:
  • according to the real data volume and the data volume transferred from the source end to the destination end, it is judged whether the data volume at the destination end is greater than or equal to the real data volume required by the granular operation;
  • the data at the destination is moved to the cache with cycle-level precision according to the second moving parameter.
  • the second transport parameters include cutting parameters and the calculation mode corresponding to the currently transported data;
  • the above-mentioned second handling unit 703 is also used for:
  • when the target data required by the granularity operation is part of the data in the matrix data of the destination end, the second transport parameters include the first position coordinates corresponding to each datum in the partial data; the above-mentioned second handling unit 703 is also used for:
  • the data corresponding to the first position coordinate in the matrix data of the destination end is transferred to the cache.
  • when the target data required by the granularity operation is a pre-winograd transform value, the second transport parameters include the second position coordinates of the data in the 4*4 data table required by the pre-winograd transform; the above-mentioned second handling unit 703 is also used for:
  • the above-mentioned data processing unit is also used for:
  • before the granular operation is executed based on the data in the cache to obtain the cycle-level data processing result, the value of the winding flag bit of the cache is acquired, as well as the read address and write address of the cache;
  • Whether the cache is empty is judged according to whether the read address of the cache coincides with the write address and the value of the wrapping flag bit of the cache.
  • for the specific working process of the data processing device 700 of the above-described neural network simulator, reference may be made to the corresponding processes of the methods described in FIG. 1 to FIG. 6 above, which will not be repeated here.
  • the present application provides a terminal for realizing the data processing method of the above-mentioned neural network simulator.
  • the terminal 8 may include: a processor 80, a memory 81, and a computer program 82, such as a memory allocation program, stored in the memory 81 and runnable on the processor 80.
  • when the processor 80 executes the computer program 82, the steps in the above embodiments of the data processing method of each neural network simulator are realized, such as steps 101 to 104 shown in FIG. 1.
  • alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the above device embodiments are realized, such as the functions of the units 701 to 704 shown in FIG. 7.
  • the computer program can be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 to complete the present application.
  • the one or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program in the terminal.
  • the computer program can be divided into an acquisition unit, a first handling unit, a second handling unit and a data processing unit, and the specific functions of each unit are as follows:
  • An acquisition unit configured to acquire instruction data;
  • the instruction data includes a first transportation instruction carrying a first transportation parameter, a second transportation instruction carrying a second transportation parameter, and a granular operation instruction;
  • the first transport unit is configured to transport the data at the source end to the destination end in a transaction-level transport manner according to the first transport parameters carried by the first transport instruction;
  • the second transfer unit is configured to transfer the data at the destination to the cache with cycle-level precision according to the second transfer parameters carried by the second transfer instruction;
  • the data processing unit is configured to execute the granular operation based on the data in the cache if the cache is not empty, to obtain a cycle-level data processing result.
  • the terminal may be a computing device such as a computer or a server.
  • the terminal may include, but not limited to, a processor 80 and a memory 81 .
  • FIG. 8 is only an example of a terminal and does not constitute a limitation on the terminal; it may include more or fewer components than shown in the figure, combine certain components, or use different components; for example, the terminal may also include input and output devices, network access devices, buses, and so on.
  • the so-called processor 80 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory 81 may be an internal storage unit of the terminal, such as a hard disk or memory of the terminal.
  • the memory 81 may also be an external storage device of the terminal, such as a plug-in hard disk equipped on the terminal, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash memory card (Flash Card), etc. Further, the memory 81 may also include both an internal storage unit of the terminal and an external storage device.
  • the memory 81 is used to store the computer program and other programs and data required by the terminal.
  • the memory 81 can also be used to temporarily store data that has been output or will be output.
  • the disclosed device/terminal and method may be implemented in other ways.
  • the device/terminal embodiments described above are only illustrative.
  • the division of the modules or units is only a logical function division.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated module/unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application can also be completed by instructing relevant hardware through computer programs.
  • the computer program can be stored in a computer-readable storage medium, and when the computer program is executed by the processor, the steps in the above-mentioned various method embodiments can be realized.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a read-only memory (Read-Only Memory, ROM) , Random Access Memory (Random Access Memory, RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc.
  • the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction.
  • for example, in some jurisdictions, according to legislation and patent practice, computer-readable media exclude electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application belongs to the technical field of data processing, and in particular relates to a data processing method and apparatus of a neural network simulator, and a terminal. The method comprises: acquiring instruction data; transporting data of a source end to a destination end in a transaction-level transportation manner according to a first transportation parameter carried in a first transportation instruction; transporting, with cycle-level precision, data of the destination end to a cache according to a second transportation parameter carried in a second transportation instruction; and if the cache is not empty, executing a granularity operation on the basis of the data in the cache, so as to obtain a cycle-level data processing result. Cycle-level precise instruction operation and transaction-level fuzzy data transportation are thereby realized, and the instructions of the neural network simulator can be calculated at the cycle level, thus maintaining consistency and precision with the hardware, optimizing the cycle-level dependency of data transportation, and reducing the complexity of the neural network simulator.

Description

Data processing method, device and terminal of a neural network simulator

Technical field

The present application belongs to the technical field of data processing, and in particular relates to a data processing method, device and terminal of a neural network simulator.

This application claims priority to the Chinese patent application with application number 202111494700.7, entitled "Data processing method, device and terminal of a neural network simulator", filed with the China Patent Office on December 8, 2021, the entire contents of which are incorporated herein by reference.

Background

With the development of artificial intelligence and big data technology, neural network simulators have shown significant advantages in processor microarchitecture design, TVM tool chain development and promotion, and RTL development and verification.

However, as neural network processors come to process large amounts of data with many data dimensions and complex, diverse calculation methods, the data processing methods of current neural network simulators can no longer meet usage requirements.

Technical problem

The embodiments of the present application provide a data processing method, device, terminal and computer-readable storage medium for a neural network simulator, which can reduce the complexity of the neural network simulator, so that the neural network simulator can play a significant role in large data handling, large computing power characteristics, architecture evaluation, instruction set tool chain development, RTL verification and other aspects of neural network processors.

A first aspect of the embodiments of the present application provides a data processing method of a neural network simulator, including:

acquiring instruction data, the instruction data including a first transport instruction carrying first transport parameters, a second transport instruction carrying second transport parameters, and a granular operation instruction;

transporting the data at the source end to the destination end in a transaction-level transport manner according to the first transport parameters carried by the first transport instruction;

transporting the data at the destination end to a cache with cycle-level precision according to the second transport parameters carried by the second transport instruction;

if the cache is not empty, performing a granular operation based on the data in the cache to obtain a cycle-level data processing result.
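
The four steps above can be sketched as a minimal simulator loop. This is only an illustrative sketch: the function names, the instruction-dictionary layout and the `beat` size are assumptions for demonstration, not part of the application.

```python
def transaction_move(src, dst):
    # Transaction-level transport: the whole block is moved at once,
    # with no per-cycle timing; only a single handshake is modeled.
    dst.extend(src)
    src.clear()

def cycle_move(dst, cache, beat):
    # Cycle-level transport: one 'beat' of data enters the cache per
    # simulated cycle.
    while dst:
        cache.append(dst[:beat])
        del dst[:beat]

def run(instruction):
    src = list(instruction["source_data"])
    dst, cache, results = [], [], []
    transaction_move(src, dst)                    # first transport instruction
    cycle_move(dst, cache, instruction["beat"])   # second transport instruction
    while cache:                                  # cache not empty
        results.append(instruction["op"](cache.pop(0)))  # granular operation
    return results
```

For example, `run({"source_data": [1, 2, 3, 4], "beat": 2, "op": sum})` stages two beats into the cache and applies the granular operation to each.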

According to some embodiments, based on the data processing method of a neural network simulator provided in the first aspect above, in a first possible implementation of the present application, transporting the data at the source end to the destination end in a transaction-level transport manner includes:

based on a communication handshake between the source end and the destination end, transporting the data at the source end to the destination end in a transaction-level transport manner.

According to some embodiments, based on the data processing method of a neural network simulator provided in the first aspect above, in a second possible implementation of the present application, the second transport parameters include the calculation mode corresponding to the currently transported data;

transporting the data at the destination end to the cache with cycle-level precision according to the second transport parameters carried by the second transport instruction includes:

calculating, according to the calculation mode corresponding to the currently transported data, the real data volume corresponding to the destination-end data required by the granular operation;

judging, according to the real data volume and the data volume transported from the source end to the destination end, whether the data volume at the destination end is greater than or equal to the real data volume required by the granular operation;

if the data volume at the destination end is greater than or equal to the real data volume required by the granular operation, transporting the data at the destination end to the cache with cycle-level precision according to the second transport parameters.
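
A minimal sketch of this volume check follows. The mapping from calculation mode to required real data volume is a made-up placeholder, since the actual volumes depend on the mode definitions of the simulator.

```python
# Hypothetical mapping from calculation mode to the real data volume one
# granular operation consumes at the destination end (values illustrative only).
REQUIRED_PER_GRANULE = {
    "normal": 64,      # e.g. one plain matrix tile
    "winograd": 16,    # e.g. one 4*4 input data table
    "transpose": 64,
}

def ready_to_stage(mode, volume_at_destination):
    """True when the destination end already holds at least the real data
    volume that one granular operation of `mode` needs."""
    real_volume = REQUIRED_PER_GRANULE[mode]
    return volume_at_destination >= real_volume
```

Only when `ready_to_stage` returns `True` would the cycle-level move into the cache proceed.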

According to some embodiments, based on the data processing method of a neural network simulator provided in the first aspect above, in a third possible implementation of the present application, the second transport parameters include cutting parameters and the calculation mode corresponding to the currently transported data;

transporting the data at the destination end to the cache with cycle-level precision according to the second transport parameters carried by the second transport instruction includes:

cutting the data at the destination end with cycle-level precision according to the cutting parameters in the second transport parameters, to obtain cut data;

operating on the cut data according to the calculation mode corresponding to the currently transported data, to obtain the target data required by the granular operation;

transporting the target data to the cache.
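
The cutting-then-staging step above can be sketched as follows; the cut length and the mode-specific transform passed in are illustrative placeholders.

```python
def cut_and_stage(dst_data, cut_len, mode_transform, cache):
    """Cut destination-end data into cycle-sized pieces, apply the
    mode-specific operation, and stage each result into the cache."""
    for i in range(0, len(dst_data), cut_len):
        chunk = dst_data[i:i + cut_len]     # one cut of cycle-level size
        cache.append(mode_transform(chunk)) # target data for the granular op
```

A caller would supply, for example, a doubling transform for demonstration: `cut_and_stage(list(range(6)), 2, lambda c: [x * 2 for x in c], cache)`.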

According to some embodiments, based on the data processing method of a neural network simulator provided in the first aspect above, as well as the first, second and third possible implementations above, in a fourth possible implementation of the present application, when the target data required by the granular operation is part of the data in the matrix data at the destination end, the second transport parameters include the first position coordinates corresponding to each datum in the partial data;

transporting the data at the destination end to the cache with cycle-level precision according to the second transport parameters includes:

transporting the data corresponding to the first position coordinates in the matrix data at the destination end to the cache.
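
A minimal sketch of transporting only the coordinate-addressed part of the destination matrix into the cache (the function name and row/column coordinate convention are illustrative):

```python
def gather_by_coordinates(matrix, first_position_coords):
    """Pick out of the destination-end matrix only the entries named by
    the first position coordinates, in the given order."""
    return [matrix[r][c] for r, c in first_position_coords]
```

The returned list is what would be appended to the cache for the granular operation.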

According to some embodiments, based on the data processing method of a neural network simulator provided in the first aspect above, as well as the first, second and third possible implementations above, in a fifth possible implementation of the present application, when the target data required by the granular operation is a pre-winograd transform value, the second transport parameters include the second position coordinates of the data in the 4*4 data table required by the pre-winograd transform;

transporting the data at the destination end to the cache with cycle-level precision according to the second transport parameters includes:

reading the 4*4 data table stored at the destination end according to the second position coordinates, computing the pre-winograd transform value based on the 4*4 data table, and transporting the pre-winograd transform value to the cache.
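
As a sketch of the pre-winograd transform step: for the common Winograd F(2x2, 3x3) variant, the input pre-transform of a 4*4 data table d is V = B^T d B. The application does not spell out its transform matrices, so the standard F(2x2, 3x3) input-transform matrix is assumed below.

```python
# B^T for the standard Winograd F(2x2, 3x3) input transform
# (assumed; the application does not give its transform matrices).
BT = [
    [1,  0, -1,  0],
    [0,  1,  1,  0],
    [0, -1,  1,  0],
    [0,  1,  0, -1],
]

def matmul4(a, b):
    # plain 4x4 matrix multiply
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def winograd_pre_transform(d):
    """Compute V = B^T * d * B for one 4*4 data table d."""
    B = [list(row) for row in zip(*BT)]  # B is the transpose of B^T
    return matmul4(matmul4(BT, d), B)
```

The resulting 4*4 table V is the pre-winograd transform value that would be transported to the cache.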

According to some embodiments, based on the data processing method of a neural network simulator provided in the first aspect above, as well as the first, second and third possible implementations above, in a sixth possible implementation of the present application, before performing the granular operation based on the data in the cache to obtain the cycle-level data processing result if the cache is not empty, the method includes:

acquiring the value of the winding flag bit of the cache, as well as the read address and write address of the cache;

judging whether the cache is empty according to whether the read address and write address of the cache coincide and according to the value of the winding flag bit of the cache.
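
The empty check described above (coinciding read/write addresses disambiguated by a winding flag bit that toggles on each wrap-around) can be sketched as a first-in-first-out cache model; the class name and depth are illustrative.

```python
class CycleLevelCache:
    """FIFO cache whose empty/full state is judged from the read/write
    addresses plus a one-bit winding (wrap) flag on each side."""

    def __init__(self, depth):
        self.depth = depth
        self.data = [None] * depth
        self.raddr = self.waddr = 0
        self.rwrap = self.wwrap = 0  # toggled each time an address wraps

    def empty(self):
        # same address, same winding flag: the reader has caught up
        return self.raddr == self.waddr and self.rwrap == self.wwrap

    def full(self):
        # same address, different winding flag: the writer has lapped
        return self.raddr == self.waddr and self.rwrap != self.wwrap

    def push(self, item):
        assert not self.full()
        self.data[self.waddr] = item
        self.waddr += 1
        if self.waddr == self.depth:
            self.waddr, self.wwrap = 0, self.wwrap ^ 1

    def pop(self):
        assert not self.empty()
        item = self.data[self.raddr]
        self.raddr += 1
        if self.raddr == self.depth:
            self.raddr, self.rwrap = 0, self.rwrap ^ 1
        return item
```

With this scheme, coinciding addresses alone are ambiguous; the winding flag distinguishes "empty" from "full", exactly the role the flag bit plays above.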

A second aspect of the embodiments of the present application provides a data processing device of a neural network simulator, including:

an acquisition unit, configured to acquire instruction data, the instruction data including a first transport instruction carrying first transport parameters, a second transport instruction carrying second transport parameters, and a granular operation instruction;

a first transport unit, configured to transport the data at the source end to the destination end in a transaction-level transport manner according to the first transport parameters carried by the first transport instruction;

a second transport unit, configured to transport the data at the destination end to a cache with cycle-level precision according to the second transport parameters carried by the second transport instruction;

a data processing unit, configured to, if the cache is not empty, perform a granular operation based on the data in the cache to obtain a cycle-level data processing result.

According to some embodiments, based on the data processing device of a neural network simulator provided in the second aspect above, in a first possible implementation of the present application, the first transport unit is further configured to:

transport, based on a communication handshake between the source end and the destination end, the data at the source end to the destination end in a transaction-level transport manner.

According to some embodiments, based on the data processing device of a neural network simulator provided in the second aspect above, in a second possible implementation of the present application, the second transport parameters include the calculation mode corresponding to the currently transported data;

the second transport unit is further configured to:

calculate, according to the calculation mode corresponding to the currently transported data, the real data volume corresponding to the destination-end data required by the granular operation;

judge, according to the real data volume and the data volume transported from the source end to the destination end, whether the data volume at the destination end is greater than or equal to the real data volume required by the granular operation;

if the data volume at the destination end is greater than or equal to the real data volume required by the granular operation, transport the data at the destination end to the cache with cycle-level precision according to the second transport parameters.

According to some embodiments, based on the data processing device of a neural network simulator provided in the second aspect above, in a third possible implementation of the present application, the second transport parameters include cutting parameters and the calculation mode corresponding to the currently transported data;

the second transport unit is further configured to:

cut the data at the destination end with cycle-level precision according to the cutting parameters in the second transport parameters, to obtain cut data;

operate on the cut data according to the calculation mode corresponding to the currently transported data, to obtain the target data required by the granular operation;

transport the target data to the cache.

According to some embodiments, based on the data processing device of a neural network simulator provided in the second aspect above, as well as the first, second and third possible implementations above, in a fourth possible implementation of the present application, when the target data required by the granular operation is part of the data in the matrix data at the destination end, the second transport parameters include the first position coordinates corresponding to each datum in the partial data;

the second transport unit is further configured to:

transport the data corresponding to the first position coordinates in the matrix data at the destination end to the cache.

According to some embodiments, based on the data processing device of a neural network simulator provided in the second aspect above, as well as the first, second and third possible implementations above, in a fifth possible implementation of the present application, when the target data required by the granular operation is a pre-winograd transform value, the second transport parameters include the second position coordinates of the data in the 4*4 data table required by the pre-winograd transform;

the second transport unit is further configured to:

read the 4*4 data table stored at the destination end according to the second position coordinates, compute the pre-winograd transform value based on the 4*4 data table, and transport the pre-winograd transform value to the cache.

According to some embodiments, based on the data processing device of a neural network simulator provided in the second aspect above, as well as the first, second and third possible implementations above, in a sixth possible implementation of the present application, the data processing unit is further configured to:

acquire, before performing the granular operation based on the data in the cache to obtain the cycle-level data processing result if the cache is not empty, the value of the winding flag bit of the cache, as well as the read address and write address of the cache;

judge whether the cache is empty according to whether the read address and write address of the cache coincide and according to the value of the winding flag bit of the cache.

A third aspect of the embodiments of the present application provides a terminal, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor implements the steps of the above method when executing the computer program.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, where the computer program implements the steps of the above method when executed by a processor.

In the embodiments of the present application, the data at the source end is transported to the destination end in a transaction-level transport manner, the data at the destination end is transported to the cache with cycle-level precision, and the granular operation is then performed based on the data in the cache and the granular operation instruction. The neural network simulator of the present application thus mixes cycle-level and transaction-level simulator design methods, realizing cycle-level accurate instruction operation and transaction-level fuzzy data transport, so that the simulator's instructions can be calculated at the cycle level, maintaining consistency and precision with the hardware, while also optimizing the cycle-level dependence of data transport and reducing the complexity of the neural network simulator, enabling it to play a significant role in large data transport, large computing power characteristics, architecture evaluation, instruction set tool chain development, RTL verification and other aspects of neural network processors.

附图说明Description of drawings

为了更清楚地说明本申请实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,应当理解,以下附图仅示出了本申请的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present application, the accompanying drawings that are required in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present application, and thus It should be regarded as a limitation on the scope, and those skilled in the art can also obtain other related drawings based on these drawings without creative work.

Fig. 1 is a schematic flowchart of the implementation of a data processing method of a neural network simulator provided by an embodiment of the present application;

Fig. 2 is a schematic diagram of data cutting provided by an embodiment of the present application;

Fig. 3 is a schematic flowchart of a first specific implementation of step 103 of a data processing method of a neural network simulator provided by an embodiment of the present application;

Fig. 4 is a schematic flowchart of a second specific implementation of step 103 of a data processing method of a neural network simulator provided by an embodiment of the present application;

Fig. 5 is a schematic diagram of matrix data transfer provided by an embodiment of the present application;

Fig. 6 is a schematic diagram of judging whether the cache is empty provided by an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a data processing apparatus of a neural network simulator provided by an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a terminal provided by an embodiment of the present application.

Embodiments of the Present Invention

In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.

It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terminology used in the specification of the present application is for the purpose of describing specific embodiments only and is not intended to limit the present application.

A neural network simulator is a technical tool that provides modeling or a research prototype for an artificial neural network. In general, neural network simulators are a resource for researchers studying how neural networks work. Different kinds of data collection help the simulator evaluate what happens inside the artificial neural network. To show researchers effectively how a neural network operates, neural network simulators usually include a versatile visual interface that displays data graphically. Many of them have multiple windows, which can be labeled as data modules for easy identification.

Traditional neural network simulators use either cycle-level modeling or transaction-level modeling. However, pure cycle-level modeling has a major impact on modeling complexity, running speed, modeling cycle, and toolchain usage for neural network processors, while pure transaction-level modeling can only be used for approximate simulation and early-stage evaluation and cannot truly stay consistent with the actual hardware. Neither kind of simulator can meet the needs of neural network processors, which are gradually evolving toward large data volumes, many data dimensions, and complex and diverse computation methods.

On this basis, the embodiments of the present application provide a data processing method, apparatus, and terminal for a neural network simulator that mix cycle-level and transaction-level simulator design methods, realizing cycle-accurate instruction operations and transaction-level approximate data transfer, maintaining consistency and accuracy with the hardware, and playing a major role in neural network processor big-data transfer, high-compute-power characteristics, architecture evaluation, instruction-set toolchain development, RTL verification, and the like.

In order to illustrate the technical solution of the present application, specific embodiments are described below.

Fig. 1 shows a schematic flowchart of the implementation of a data processing method of a neural network simulator provided by an embodiment of the present application. The method is applied to a terminal and may be executed by a data processing apparatus of a neural network simulator configured on the terminal, where the terminal may be an intelligent terminal such as a computer or a server. The data processing method may include steps 101 to 104, described in detail as follows.

Step 101: acquire instruction data.

In this embodiment, the instruction data may include a first transfer instruction carrying first transfer parameters, a second transfer instruction carrying second transfer parameters, and a granular operation instruction. The instruction data may be instruction data acquired from the instruction control flow by different modules in the neural network simulator.

For example, the first transfer instruction is instruction data acquired from the instruction control flow by the exdma module of the neural network simulator, the second transfer instruction is instruction data acquired from the instruction control flow by the xdma module, and the granular operation instruction is instruction data acquired from the instruction control flow by the granular operation module cube.

The first transfer instruction is used to transfer data at the source end to the destination end in a transaction-level manner, and the second transfer instruction is used to transfer data at the destination end to a cache with cycle-level precision.

The source end and the destination end may be different modules in the neural network simulator; for example, the source end may be an exdma module, such as an eidma module or an eodma module, and the destination end may be an xdma module, such as an idma module or an odma module.

In some embodiments of the present application, the first transfer parameters may include the amount of data transferred from the source end to the destination end each time, i.e. the handshake granularity (for example, 1 Kb or 5 Kb), and the total amount of data currently to be transferred.

The second transfer parameters may include cutting parameters and the operation mode corresponding to the currently transferred data.

The operation modes may include the winograd operation mode, the matrix operation mode, the padding mode, the zero-insertion mode deconvlution, and the address-jump mode dilated.

The cutting parameters may include the parameters H, W, and D for cutting the data along its height (zeta), width (epsilon = ci), and depth (dense) directions, as well as data such as the dense count, zeta count, epsilon count, the number of sliding windows in the kernel dense dimension, winograd_loop16, and the weight-parameter reuse count. When the operation mode corresponding to the currently transferred data is the winograd mode, winograd_loop16 = 16, representing 16 pixel loops; when it is a non-winograd mode, winograd_loop16 = 1.

For example, as shown in Fig. 2, when H = 16, W = 32, and D = 8, cutting the data at the destination end yields the data slice ci0, and cutting along the ci direction yields the slices ci0 through ci7 in sequence.
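The cutting described above can be sketched as follows. This is an illustrative reconstruction only: the buffer shape (zeta, epsilon, dense) = (16, 32, 8) follows Fig. 2, and the variable names are assumptions, not the patent's implementation.

```python
import numpy as np

# Destination buffer with height (zeta) = 16, width (epsilon = ci) = 32,
# depth (dense) = 8, matching H = 16, W = 32, D = 8 in Fig. 2.
H, W, D = 16, 32, 8
dest_data = np.arange(H * W * D).reshape(H, W, D)

# Cutting along the ci (depth) direction yields slices ci0 .. ci7 in sequence.
ci_slices = [dest_data[:, :, i] for i in range(D)]

assert len(ci_slices) == 8            # ci0 .. ci7
assert ci_slices[0].shape == (16, 32)
```

Each slice is then a fixed-size unit that the cycle-level transfer of step 103 can move toward the cache.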

Moreover, the data at the source end may be image data or parameter data; the present application places no limitation on this.

Step 102: transfer the data at the source end to the destination end in a transaction-level manner according to the first transfer parameters carried by the first transfer instruction.

In this embodiment, transferring data from the source end to the destination end is part of the data preparation process. It uses transaction-level, loosely coupled transfer and can run independently of the cycle-level transfer process, i.e. independently of transferring the destination data to the cache, so that the destination data can be prepared in advance, reducing the waiting time when the destination data is moved to the cache.

According to some embodiments, in step 102, transferring the data at the source end to the destination end in a transaction-level manner may include: transferring the data based on a communication handshake between the source end and the destination end.

In this embodiment, the communication handshake between the source end and the destination end means that, before the destination data is transferred to the cache with cycle-level precision, the destination end must wait for the source end to write the data into the destination end according to the first transfer parameters, and, after writing the data, the source end must notify the destination end that the data is ready.

Specifically, before the destination end reads data from its own storage space dm, it sends a wait signal dest_wo_src to the source end and waits for the source end to write the data into the destination end's dm. After writing the data, the source end sends an enable signal src_ub_dest to the destination end; the destination end then reads the data from dm and, after finishing the read, sends an enable signal dest_ub_src to the source end, notifying it that the data space has been released.
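The dest_wo_src / src_ub_dest / dest_ub_src handshake above can be sketched as follows. This is a minimal software model using a queue and a counter in place of hardware signals; the class names and method structure are assumptions for illustration, not the patent's implementation.

```python
from collections import deque

class Source:
    def __init__(self):
        self.released_slots = 0           # incremented by dest_ub_src

    def on_dest_wo_src(self, dest):
        # dest_wo_src received: write one chunk into the destination's dm,
        # then raise src_ub_dest to signal the data is ready.
        dest.dm.append("chunk")
        dest.on_src_ub_dest(self)

class Destination:
    def __init__(self):
        self.dm = deque()                 # destination storage space dm
        self.consumed = []

    def request(self, src):
        src.on_dest_wo_src(self)          # send dest_wo_src and wait

    def on_src_ub_dest(self, src):
        self.consumed.append(self.dm.popleft())  # read data from dm
        src.released_slots += 1           # dest_ub_src: dm space released

src, dst = Source(), Destination()
dst.request(src)
assert dst.consumed == ["chunk"] and src.released_slots == 1
```

Because dest_wo_src requests may be accumulated at the source, the source can fill dm ahead of the destination's reads, which is what decouples the transaction-level preparation from the cycle-level consumption.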

It should be noted that, in some embodiments of the present application, before reading data from dm, the destination end may keep sending the wait signal dest_wo_src to the source end; the source end accumulates a count and stores data into dm in advance, without having to wait for the destination end to finish reading the data in dm before the next wait signal dest_wo_src is sent. This reduces the waiting time for data reads and improves the efficiency of data storage and reading.

Step 103: transfer the data at the destination end to the cache with cycle-level precision according to the second transfer parameters carried by the second transfer instruction.

In this embodiment, the cache refers to the storage space corresponding to the granular operation. Transferring the destination data to the cache is cycle-level transfer and computation.

Specifically, in some embodiments, as shown in Fig. 3, step 103 may include the following steps 301 to 303.

Step 301: cut the data at the destination end with cycle-level precision according to the cutting parameters in the second transfer parameters to obtain cut data.

As described in step 101, the cutting parameters may include the parameters H, W, and D for cutting the data along its height (zeta), width (epsilon = ci), and depth (dense) directions, as well as data such as the dense count, zeta count, epsilon count, the number of sliding windows in the kernel dense dimension, winograd_loop16, and the weight-parameter reuse count.

Step 302: perform an operation on the cut data according to the operation mode corresponding to the currently transferred data to obtain the target data required by the granular operation.

In this embodiment, the operation modes may include the padding mode, the zero-insertion mode deconvlution, and the address-jump mode dilated.

Step 302 may include the following cases: if the operation mode corresponding to the currently transferred data is the deconvlution mode, zeros are inserted into the cut data to obtain the target data required by the granular operation; if the operation mode is dilated, the address-skipped data is computed from the cut data to obtain the target data required by the granular operation.
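The two cases of step 302 can be sketched as follows on a 1-D slice of cut data. This is an illustrative reconstruction, not the patent's exact implementation: the stride of 2 and dilation of 2 are assumed example values.

```python
def insert_zeros(data, stride=2):
    """deconvlution mode: insert stride-1 zeros between neighboring elements."""
    out = []
    for x in data:
        out.append(x)
        out.extend([0] * (stride - 1))
    # drop the trailing zeros after the last real element
    return out[: -(stride - 1)] if stride > 1 else out

def dilated_read(data, dilation=2):
    """dilated mode: read only every `dilation`-th address (address jump)."""
    return data[::dilation]

assert insert_zeros([1, 2, 3]) == [1, 0, 2, 0, 3]
assert dilated_read([1, 2, 3, 4, 5]) == [1, 3, 5]
```

In both cases the result is the target data that goes straight into the cache, so the zero-filled or address-skipped form never needs an intermediate buffer.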

Step 303: transfer the target data to the cache.

In this embodiment, during data transfer, once the target data required by the granular operation has been computed, it is moved directly into the cache, with no intermediate buffering needed.

Step 104: if the cache is not empty, execute the granular operation based on the data in the cache to obtain a cycle-level data processing result.

In this embodiment, when the cache is not empty, the data for the granular operation is ready; therefore, the granular operation, i.e. the cube operation, can be executed directly on the data in the cache. This computation is cycle-accurate, so the output is a cycle-accurate result, which can be used for RTL data computation comparison.

In this embodiment, by transferring the source data to the destination end in a transaction-level manner, transferring the destination data to the cache with cycle-level precision, and then executing the granular operation based on the data in the cache and the granular operation instruction, the neural network simulator of the present application mixes cycle-level and transaction-level simulator design methods, realizes cycle-accurate instruction operations and transaction-level approximate data transfer, maintains consistency and accuracy with the hardware, and can play a major role in neural network processor big-data transfer, high-compute-power characteristics, architecture evaluation, instruction-set toolchain development, RTL verification, and the like.

As an example, when the neural network simulator simulates an artificial neural network model for face recognition, the source data may be face image data. The first transfer parameters carried by the first transfer instruction may include the data volume of that face image data, and the first transfer instruction transfers the source face image data to the destination end in a transaction-level manner. The second transfer parameters carried by the second transfer instruction may include the cutting parameters H0, W0, and D0 for cutting the face image data along its height (zeta), width (epsilon = ci), and depth (dense) directions, used to cut the destination face image data with cycle-level precision into cut data, so that the destination face image data is transferred to the cache with cycle-level precision. When the cache is not empty, the granular operation, i.e. the cube operation, is executed on the data in the cache, finally yielding a cycle-level face recognition result.

Here, the cube operation refers to applying a neural network algorithm to the face image data in the cache to obtain a face classification result, for example, that the face image corresponds to Zhang San's face or Li Si's face.

The neural network algorithm may include the Layer Cubing (layer-by-layer) algorithm, the By-layer Spark Cubing algorithm, the Fast (in-mem) Cubing algorithm, i.e. the "By Segment" or "By Split" algorithm, and other such algorithms; the present application places no limitation on this.

By transferring the source face image data to the destination end in a transaction-level manner, transferring the destination face image data to the cache with cycle-level precision, and executing the granular operation based on the face image data in the cache and the granular operation instruction, the neural network simulator of the present application mixes cycle-level and transaction-level simulator design methods and realizes cycle-accurate instruction operations and transaction-level approximate data transfer. In the process of simulating face image recognition by an artificial neural network model, it maintains consistency and accuracy with the hardware and can play a major role in neural network processor big-data transfer, high-compute-power characteristics, architecture evaluation, instruction-set toolchain development, RTL verification, and the like.

It should also be noted that the neural network simulator may also be one for simulating the working process of artificial neural network models in other application scenarios; for example, it may simulate the working process of artificial neural network models for license plate recognition, obstacle recognition, or animal classification. The present application places no limitation on this.

It should also be noted that, besides image data, the source data may also be other types of data such as voice data; the present application places no limitation on the data type of the source data.

According to some embodiments, when the source data is voice data, the neural network simulator may correspondingly be one for analyzing and processing voice data, for example, a neural network simulator for classifying or denoising voice data.

According to some embodiments, in step 103, during the process of transferring the destination data to the cache with cycle-level precision, the amount of data transferred from the source end to the destination end may first be synchronized with the real data amount required by the granular operation, to determine whether the destination end holds transferable data, before the data is actually transferred. Specifically, as shown in Fig. 4, step 103 may also be implemented through the following steps 401 to 403.

Step 401: calculate, according to the operation mode corresponding to the currently transferred data, the real data amount of the destination data required by the granular operation.

In this embodiment, the operation mode corresponding to the currently transferred data may include the winograd operation mode, the matrix operation mode, the padding mode, the zero-insertion mode deconvlution, and the address-jump mode dilated.

Since the target data required by the granular operation may be obtained by computation according to the operation mode corresponding to the currently transferred data, rather than being the destination data directly, the data volume of the target data may differ from the real data amount of the destination data required by the granular operation. For example, the data volume of the target data may be greater than the real data amount of the destination data. Moreover, since the data volume of the target data is known and fixed, the real data amount of the destination data required by the granular operation can be derived inversely from the operation mode corresponding to the currently transferred data.

For example, when the operation mode corresponding to the currently transferred data is the zero-insertion mode, the real data amount of the destination data required by the granular operation can be calculated from that operation mode and the data volume of the target data.
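The inverse derivation of step 401 can be sketched as follows. The concrete relations used here (a stride-2 deconvlution that inserts one zero between elements, and symmetric padding) are assumed examples; only the idea of working backward from the fixed target volume comes from the text above.

```python
def real_volume(target_volume, mode, stride=2, pad=0):
    """Infer the real destination data volume from the known target volume."""
    if mode == "deconvlution":
        # stride-1 zeros inserted between elements:
        # n real values expand to stride*n - (stride-1) target values
        return (target_volume + stride - 1) // stride
    if mode == "padding":
        # `pad` filler values added on each side are not real data
        return target_volume - 2 * pad
    # modes that move data unchanged
    return target_volume

assert real_volume(31, "deconvlution") == 16  # 16 real values -> 31 after insertion
assert real_volume(20, "padding", pad=2) == 16
```

The value returned here is what gets handshaked against the source's transferred-data count in step 402.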

Step 402: according to the real data amount and the amount of data transferred from the source end to the destination end, judge whether the data amount at the destination end is greater than or equal to the real data amount required by the granular operation.

In this embodiment, the calculated real data amount is handshaked against the amount of data transferred by the source end, to determine whether the amount of data the source end has transferred to the destination end has reached the real data amount required by the granular operation; if it has not, the method waits for the source end to transfer data to the destination end.

Step 403: if the data amount at the destination end is greater than or equal to the real data amount required by the granular operation, transfer the destination data to the cache with cycle-level precision according to the second transfer parameters.

When the data amount at the destination end meets the real data amount required by the granular operation, the destination end already holds the data the granular operation needs, so the data transfer can proceed.

In this embodiment, because the amount of data transferred from the source end to the destination end and the real data amount required by the granular operation only need to be synchronized once during the transfer of the destination data to the cache with cycle-level precision, no cycle-level synchronization of intermediate steps is needed. The transaction-level process of moving data from the source end to the destination end can thus remain independent of the cycle-level process of moving the destination data to the cache, and the destination data can be prepared in advance, reducing the waiting time when it is moved to the cache. In addition, during data transfer, the embodiments of the present application compute the data that requires address jumps or filling in modes such as padding, deconvlution, and dilated while simultaneously transferring the real valid data in dm, avoiding intermediate accesses to the ddr cache and bandwidth and improving transfer efficiency.

According to some embodiments, the source data may also be transferred to the destination end in a transaction-level manner, and, when the destination data is transferred to the cache with cycle-level precision, the data sets that the operations in different modes depend on may be abstracted, such as the data sets depended on by the transpose operation mode and the winograd operation mode. This reduces the operation-level iteration over the complex signals of RTL cycle-level scenarios caused by system complexity, lowering the cycle-level modeling complexity while greatly improving the performance of the neural network simulator.

Specifically, in some embodiments, when the target data required by the granular operation is part of the matrix data at the destination end, the second transfer parameters may include the first position coordinates corresponding to each element of that part of the data. In step 103, transferring the destination data to the cache with cycle-level precision according to the second transfer parameters may include: transferring the elements of the destination matrix data corresponding to the first position coordinates to the cache.

For example, if the target data required by the granular operation is the first row of the transposed destination matrix, the second transfer parameters may include the first position coordinates of each element of that row within the destination matrix; that row of data is then picked out of the destination matrix according to the first position coordinates and moved to the cache.

Specifically, as shown in Fig. 5, when the destination matrix data is a 16*16 matrix and the target data required by the granular operation is the first column of the matrix, the 16 elements of that column can be picked out via the first position coordinates and moved to the cache.
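The coordinate-based selection can be sketched as follows: the first row of the transposed matrix is fetched directly from the 16*16 destination matrix by swapping each coordinate pair (x, y), with no transposed matrix ever materialized. The code is an illustrative reconstruction with assumed variable names.

```python
import numpy as np

# 16*16 destination matrix data, as in Fig. 5.
m = np.arange(16 * 16).reshape(16, 16)

# Coordinates of row 0 of the transposed matrix; swapping (x, y) -> (y, x)
# maps them onto column 0 of the original matrix.
coords = [(y, x) for (x, y) in ((0, j) for j in range(16))]
picked = [m[r][c] for (r, c) in coords]

assert picked == list(m[:, 0])  # exactly the first column of m
```

The second transfer parameters carry these coordinates, so the transpose is resolved at transfer time rather than by an extra matrix operation.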

It should be noted that a traditional neural network simulator must keep its data transfers synchronized with the hardware. It therefore cannot, based on the operation mode corresponding to the currently transferred data, abstract over matrix data containing multiple rows and columns to obtain the data that the operation mode depends on; that data can only be obtained through complex iterative operations. For example, for the 16*16 matrix data above, such a simulator cannot select the 16 pieces of data in a column by their first position coordinates. As a result, both the data operations and the system as a whole are highly complex.

The present application iterates over the parameter matrix via the input-data instruction and transposes particular rows of the matrix — for example, by swapping the coordinates (x, y) of each piece of data in the matrix to (y, x) — to obtain the data required by the granularity operation without synchronizing data with the hardware. This decouples the data transfer, effectively reduces the operation-level iteration over the complex signals of RTL cycle-level scenarios caused by system complexity, lowers the complexity of cycle-level modeling, and substantially improves the performance of the neural network simulator.
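As a minimal sketch of this coordinate-based selection (the Python names and the row-major layout are illustrative assumptions, not taken from the original implementation), a row of the transposed matrix can be read out directly by swapping each coordinate (x, y) to (y, x) — which is exactly the corresponding column of the untransposed matrix:

```python
# Select the data of a "transposed row" of a 16*16 destination matrix by
# swapping each coordinate (x, y) -> (y, x), i.e. by reading the
# corresponding column directly. All names here are illustrative.

N = 16
matrix = [[r * N + c for c in range(N)] for r in range(N)]  # destination matrix data

def first_coords_for_transposed_row(x):
    """Second transfer parameter: one (x, y) coordinate per needed datum."""
    return [(x, y) for y in range(N)]           # row x of the transpose ...

def move_to_cache(matrix, coords):
    """Pick out the data at the given first position coordinates."""
    return [matrix[y][x] for (x, y) in coords]  # ... is column x of the matrix

cache = move_to_cache(matrix, first_coords_for_transposed_row(0))
assert cache == [matrix[r][0] for r in range(N)]  # the first column, 16 values
```

Because the coordinates are derived once from the transfer parameter, no per-element synchronization with a hardware model is required.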

According to some embodiments, in some implementations of the present application, when the target data required by the granularity operation is a pre-winograd-transform value, the second transfer parameter may include second position coordinates of the data in the 4*4 data table required by the winograd pre-transform.

In this case, transferring the data at the destination to the cache with cycle-level precision according to the second transfer parameter in step 103 above may include: reading the 4*4 data table stored at the destination according to the second position coordinates, computing the pre-winograd-transform value based on the 4*4 data table, and transferring the pre-winograd-transform value to the cache.

For example, as shown in Table 1 below, starting from the coordinates of the first entry d0-d2-d8+d10 of the 4*4 data table required by the winograd pre-transform, the 4*4 data table is obtained by shifting those coordinates down and to the right by 3, and the pre-winograd-transform value is computed from that 4*4 data table.

Table 1:

[Table 1 is provided as an image in the original publication.]

It should be noted that, because a traditional neural network simulator must keep its data transfers synchronized with the hardware, it can read only one entry of the 4*4 data table at a time and cannot read the entire 4*4 data table stored at the destination according to the second position coordinates; that data can only be obtained through complex iterative operations. As a result, both the data operations and the system as a whole are highly complex.

The present application obtains the 4*4 data table by shifting the coordinates of its first entry d0-d2-d8+d10, as required by the winograd pre-transform, down and to the right by 3, without handshaking with the hardware cache. This decouples the data transfer, effectively reduces the operation-level iteration over the complex signals of RTL cycle-level scenarios caused by system complexity, lowers the complexity of cycle-level modeling, and substantially improves the performance of the neural network simulator.
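The winograd pre-transform itself can be sketched as the standard F(2x2, 3x3) input transform V = BᵀdB over a 4*4 tile (a hedged illustration: Bᵀ below is the textbook winograd input-transform matrix, and the row-major flattening d0..d15 is an assumption); note that the (0,0) entry of the result is exactly d0 - d2 - d8 + d10, the first entry cited above:

```python
# Sketch of the winograd F(2x2, 3x3) input (pre-)transform V = B^T d B on a
# 4*4 tile, showing why the first transformed entry is d0 - d2 - d8 + d10.

BT = [
    [1,  0, -1,  0],
    [0,  1,  1,  0],
    [0, -1,  1,  0],
    [0,  1,  0, -1],
]
B = [[BT[j][i] for j in range(4)] for i in range(4)]  # B is the transpose of B^T

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def winograd_input_transform(d):
    """Compute V = B^T d B for a 4*4 input tile d."""
    return matmul(matmul(BT, d), B)

d = [[4 * r + c for c in range(4)] for r in range(4)]   # tile entries d0..d15, row-major
flat = [d[r][c] for r in range(4) for c in range(4)]
V = winograd_input_transform(d)
assert V[0][0] == flat[0] - flat[2] - flat[8] + flat[10]  # d0 - d2 - d8 + d10
```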

The instruction set of the neural network simulator of the present application is computed at cycle level, maintaining consistency and accuracy with the hardware, while data transfer is performed at transaction level. Through a basic handshake granularity, the simulator abstracts the data sets that computations in different modes depend on — for example, acceleration of transposition and of the winograd algorithm, padding, deconvolution, and dilated-convolution scenarios — reducing the operation-level iteration over the complex signals of register-transfer-level cycle-level scenarios caused by system complexity, lowering the complexity of cycle-level modeling while substantially improving system performance.

According to some embodiments, in some implementations of the present application, before step 104 above, whether the cache is empty may first be checked.

Specifically, whether the cache is empty can be determined from the value of the cache's wrap-around flag bit and from whether the read address and the write address coincide. If the cache is empty, the system waits for data to be written; if it is not empty, the granularity operation is performed based on the data in the cache to obtain a cycle-level data processing result. The cache reads and writes data in first-in-first-out order, and the value of the wrap-around flag bit can be obtained through the read/write interaction between the cube operation module and the destination.

For example, while the destination writes data into the cache and the cube operation reads data from the cache, as shown in Fig. 6(a), a wrap-around flag ring_flag = 0 indicates that the cache reads and writes have not wrapped; in that case, identical read and write addresses indicate that the cache is empty. As shown in Fig. 6(b), ring_flag = 1 indicates that the cache reads and writes have wrapped; in that case, identical read and write addresses indicate that the cache is full.

In the embodiments of the present application, determining whether the cache is empty from the value of the wrap-around flag bit together with the cache's read and write addresses simplifies the handshake flow of the cycle-level cube operation, improves operation efficiency, and reduces the complexity of the neural network simulator, making the cycle-level cube operation independent of the hardware configuration.
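The empty/full test described above can be sketched as a FIFO ring buffer whose wrap-around flag disambiguates the case where the read and write addresses coincide (a minimal illustration; the class and method names are assumptions, not from the original implementation):

```python
# FIFO ring buffer: when the read and write addresses coincide, ring_flag
# distinguishes empty (no wrap outstanding) from full (write side has wrapped).

class RingCache:
    def __init__(self, depth):
        self.buf = [None] * depth
        self.depth = depth
        self.rd = 0          # read address
        self.wr = 0          # write address
        self.ring_flag = 0   # set when the write pointer wraps, cleared when the read pointer wraps

    def is_empty(self):
        return self.rd == self.wr and self.ring_flag == 0

    def is_full(self):
        return self.rd == self.wr and self.ring_flag == 1

    def write(self, item):
        assert not self.is_full(), "cache full"
        self.buf[self.wr] = item
        self.wr += 1
        if self.wr == self.depth:   # write pointer wraps around
            self.wr = 0
            self.ring_flag = 1

    def read(self):
        assert not self.is_empty(), "cache empty"
        item = self.buf[self.rd]
        self.rd += 1
        if self.rd == self.depth:   # read pointer wraps around
            self.rd = 0
            self.ring_flag = 0
        return item

cache = RingCache(4)
assert cache.is_empty()
for i in range(4):
    cache.write(i)
assert cache.is_full()
assert cache.read() == 0 and not cache.is_full()
```

A single flag suffices here because a full cache blocks further writes, so at most one wrap can be outstanding at a time.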

According to some embodiments, in some implementations of the present application, after the granularity operation is performed based on the data in the cache and the cycle-level data processing result is obtained, the result may further be stored at the destination and supplied to other granularity operations.

In the embodiments of the present application, the data at the source is transferred to the destination in a transaction-level transfer manner, the data at the destination is transferred to the cache with cycle-level precision, and the granularity operation is then performed based on the data in the cache and the granularity operation instruction. The neural network simulator of the present application therefore mixes cycle-level and transaction-level simulator design methods, achieving cycle-level accurate instruction operation together with transaction-level approximate data transfer. As a result, the simulator's instructions can be computed at cycle level, maintaining consistency and accuracy with the hardware, while the cycle-level dependencies of data transfer are optimized and the complexity of the neural network simulator is reduced. This can play a major role in the large data transfers and large computing power of neural network processors, in architecture evaluation, instruction-set toolchain development, RTL verification, and other areas.

It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions; however, those skilled in the art should appreciate that the present invention is not limited by the described order of actions, because according to the present invention certain steps may be performed in other orders.

Fig. 7 shows a schematic structural diagram of a data processing apparatus 700 of a neural network simulator provided by an embodiment of the present application, including an acquisition unit 701, a first transfer unit 702, a second transfer unit 703, and a data processing unit 704.

The acquisition unit 701 is configured to acquire instruction data, the instruction data including a first transfer instruction carrying first transfer parameters, a second transfer instruction carrying second transfer parameters, and a granularity operation instruction.

The first transfer unit 702 is configured to transfer the data at the source to the destination in a transaction-level transfer manner according to the first transfer parameters carried by the first transfer instruction.

The second transfer unit 703 is configured to transfer the data at the destination to the cache with cycle-level precision according to the second transfer parameters carried by the second transfer instruction.

The data processing unit 704 is configured to, if the cache is not empty, perform the granularity operation based on the data in the cache to obtain a cycle-level data processing result.

In some implementations of the present application, the first transfer unit 702 is further configured to:

transfer the data at the source to the destination in a transaction-level transfer manner based on a communication handshake between the source and the destination.

In some implementations of the present application, the second transfer parameters include an operation mode corresponding to the currently transferred data, and the second transfer unit 703 is further configured to:

compute, according to the operation mode corresponding to the currently transferred data, the real data amount corresponding to the destination data required by the granularity operation;

determine, according to the real data amount and the amount of data transferred from the source to the destination, whether the data amount at the destination is greater than or equal to the real data amount required by the granularity operation; and

if the data amount at the destination is greater than or equal to the real data amount required by the granularity operation, transfer the data at the destination to the cache with cycle-level precision according to the second transfer parameters.
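The check performed by the second transfer unit can be sketched as follows (a hedged illustration: the per-mode factors and function names are assumptions for demonstration, not values from this document):

```python
# Compute the real data amount the granularity operation needs for the current
# operation mode, and trigger the cycle-level transfer to the cache only once
# the transaction-level transfer has delivered at least that much data to the
# destination. The mode factors below are illustrative assumptions.

MODE_FACTOR = {"transpose": 1, "winograd": 4, "padding": 1}

def real_amount_needed(mode, nominal_amount):
    """Real data amount at the destination required by the granularity operation."""
    return nominal_amount * MODE_FACTOR.get(mode, 1)

def try_move_to_cache(mode, nominal_amount, amount_at_destination, move):
    """Perform the cycle-level move and return True once enough data has arrived."""
    if amount_at_destination >= real_amount_needed(mode, nominal_amount):
        move()           # cycle-level precision transfer to the cache
        return True
    return False         # keep waiting for transaction-level data

moved = []
assert try_move_to_cache("winograd", 4, 16, lambda: moved.append("tile"))
assert moved == ["tile"]
assert not try_move_to_cache("winograd", 4, 8, lambda: moved.append("x"))
```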

In some implementations of the present application, the second transfer parameters include cutting parameters and an operation mode corresponding to the currently transferred data.

The second transfer unit 703 is further configured to:

cut the data at the destination with cycle-level precision according to the cutting parameters in the second transfer parameters, to obtain cut data;

operate on the cut data according to the operation mode corresponding to the currently transferred data, to obtain the target data required by the granularity operation; and

transfer the target data to the cache.

In some implementations of the present application, when the target data required by the granularity operation is part of the matrix data at the destination, the second transfer parameters include a first position coordinate corresponding to each piece of that partial data, and the second transfer unit 703 is further configured to:

transfer the data corresponding to the first position coordinates in the matrix data at the destination to the cache.

In some implementations of the present application, when the target data required by the granularity operation is a pre-winograd-transform value, the second transfer parameters include second position coordinates of the data in the 4*4 data table required by the winograd pre-transform, and the second transfer unit 703 is further configured to:

read the 4*4 data table stored at the destination according to the second position coordinates, compute the pre-winograd-transform value based on the 4*4 data table, and transfer the pre-winograd-transform value to the cache.

In some implementations of the present application, the data processing unit is further configured to:

before the granularity operation is performed based on the data in the cache (if the cache is not empty) to obtain the cycle-level data processing result, obtain the value of the wrap-around flag bit of the cache, and the read address and write address of the cache; and

determine whether the cache is empty according to whether the read address and the write address of the cache coincide and the value of the wrap-around flag bit of the cache.

It should be noted that, for convenience and brevity of description, for the specific working process of the data processing apparatus 700 of the neural network simulator described above, reference may be made to the corresponding processes of the methods described in Figs. 1 to 6, which are not repeated here.

As shown in Fig. 8, the present application provides a terminal for implementing the data processing method of the above neural network simulator. The terminal 8 may include a processor 80, a memory 81, and a computer program 82 stored in the memory 81 and runnable on the processor 80, for example a memory allocation program. When the processor 80 executes the computer program 82, the steps in the above embodiments of the data processing method of the neural network simulator are implemented, for example steps 101 to 104 shown in Fig. 1. Alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the above apparatus embodiments are implemented, for example the functions of units 701 to 704 shown in Fig. 7.

The computer program may be divided into one or more modules/units, which are stored in the memory 81 and executed by the processor 80 to carry out the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing particular functions, the instruction segments being used to describe the execution process of the computer program in the terminal. For example, the computer program may be divided into an acquisition unit, a first transfer unit, a second transfer unit, and a data processing unit, the specific functions of each unit being as follows:

an acquisition unit, configured to acquire instruction data, the instruction data including a first transfer instruction carrying first transfer parameters, a second transfer instruction carrying second transfer parameters, and a granularity operation instruction;

a first transfer unit, configured to transfer the data at the source to the destination in a transaction-level transfer manner according to the first transfer parameters carried by the first transfer instruction;

a second transfer unit, configured to transfer the data at the destination to the cache with cycle-level precision according to the second transfer parameters carried by the second transfer instruction; and

a data processing unit, configured to, if the cache is not empty, perform the granularity operation based on the data in the cache to obtain a cycle-level data processing result.

The terminal may be a computing device such as a computer or a server. The terminal may include, but is not limited to, the processor 80 and the memory 81. Those skilled in the art will understand that Fig. 8 is merely an example of a terminal and does not constitute a limitation of the terminal, which may include more or fewer components than shown, combine certain components, or have different components; for example, the terminal may also include input/output devices, network access devices, buses, and the like.

The processor 80 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The memory 81 may be an internal storage unit of the terminal, for example a hard disk or memory of the terminal. The memory 81 may also be an external storage device of the terminal, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal. Further, the memory 81 may include both an internal storage unit of the terminal and an external storage device. The memory 81 is used to store the computer program and other programs and data required by the terminal. The memory 81 may also be used to temporarily store data that has been output or is to be output.

Those skilled in the art will clearly understand that, for convenience and brevity of description, only the division into the above functional units and modules is used as an example; in practical applications, the above functions may be allocated to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed or described in a given embodiment, reference may be made to the relevant descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal and method may be implemented in other ways. For example, the apparatus/terminal embodiments described above are merely illustrative; for example, the division into modules or units is only a division by logical function, and there may be other divisions in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application may also be accomplished by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media exclude electrical carrier signals and telecommunication signals.

The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (10)

一种神经网络模拟器的数据处理方法,其特征在于,包括:A data processing method for a neural network simulator, comprising: 获取指令数据;所述指令数据包括携带第一搬运参数的第一搬运指令、携带第二搬运参数的第二搬运指令以及颗粒度运算指令;Acquiring instruction data; the instruction data includes a first transportation instruction carrying a first transportation parameter, a second transportation instruction carrying a second transportation parameter, and a granular operation instruction; 根据所述第一搬运指令携带的第一搬运参数将源端的数据采用事务级搬运方式搬运至目的端;Transporting the data at the source end to the destination end in a transaction-level transport manner according to the first transport parameter carried by the first transport instruction; 根据所述第二搬运指令携带的第二搬运参数将所述目的端的数据按周期级精度搬运至缓存;Transporting the data at the destination to the cache with cycle-level precision according to the second transport parameter carried by the second transport instruction; 若所述缓存不为空,则基于所述缓存中的数据执行颗粒度运算,得到周期级的数据处理结果。If the cache is not empty, a granular operation is performed based on the data in the cache to obtain a cycle-level data processing result. 如权利要求1所述的神经网络模拟器的数据处理方法,其特征在于,所述将源端的数据采用事务级搬运方式搬运至目的端,包括:The data processing method of the neural network simulator according to claim 1, wherein said transferring the data at the source end to the destination end in a transaction-level transfer mode includes: 基于源端与目的端之间的通信握手,将所述源端的数据采用事务级搬运方式搬运至所述目的端。Based on the communication handshake between the source end and the destination end, the data at the source end is transferred to the destination end in a transaction-level transfer manner. 
如权利要求1所述的神经网络模拟器的数据处理方法,其特征在于,所述第二搬运参数包括当前搬运的数据对应的运算模式;The data processing method of neural network simulator as claimed in claim 1, is characterized in that, described second handling parameter comprises the computing mode corresponding to the data of current handling; 所述根据所述第二搬运指令携带的第二搬运参数将所述目的端的数据按周期级精度搬运至缓存,包括:The step of transferring the data of the destination to the cache according to the second transfer parameter carried by the second transfer instruction with cycle-level precision includes: 根据所述当前搬运的数据对应的运算模式计算所述数据处理所需要的目的端的数据对应的真实数据量;calculating the actual amount of data corresponding to the data at the destination end required for the data processing according to the operation mode corresponding to the currently transported data; 根据所述真实数据量以及所述源端搬运至目的端的数据量,判断所述目的端的数据量是否大于或等于所述颗粒度运算所需要的真实数据量;According to the real data volume and the data volume transferred from the source terminal to the destination terminal, it is judged whether the data volume of the destination terminal is greater than or equal to the real data volume required by the granular operation; 若所述目的端的数据量大于或等于所述颗粒度运算所需要的真实数据量,则根据所述第二搬运参数将所述目的端的数据按周期级的精度搬运至缓存。If the amount of data at the destination is greater than or equal to the actual amount of data required by the granularity calculation, the data at the destination is moved to the cache with cycle-level precision according to the second moving parameter. 
如权利要求1所述的神经网络模拟器的数据处理方法,其特征在于,所述第二搬运参数包括切割参数和当前搬运的数据对应的运算模式;The data processing method of the neural network simulator according to claim 1, wherein the second handling parameters include cutting parameters and the corresponding operation mode of the current handling data; 所述根据所述第二搬运指令携带的第二搬运参数将所述目的端的数据按周期级精度搬运至缓存,包括:The step of transferring the data of the destination to the cache according to the second transfer parameter carried by the second transfer instruction with cycle-level precision includes: 根据所述第二搬运参数中的切割参数将所述目的端的数据按周期级精度进行切割,得到切割数据;Cutting the data of the destination according to the cutting parameters in the second handling parameters according to cycle-level accuracy to obtain cutting data; 根据所述当前搬运的数据对应的运算模式对所述切割数据进行运算,得到所述颗粒度运算所需要的目标数据;performing calculations on the cutting data according to the calculation mode corresponding to the currently transported data, to obtain the target data required for the granularity calculation; 将所述目标数据搬运至所述缓存。Move the target data to the cache. 如权利要求1-4任意一项所述的神经网络模拟器的数据处理方法,其特征在于,当所述颗粒度运算所需要的目标数据为目的端的矩阵数据中的部分数据时,所述第二搬运参数包括所述部分数据中的每个数据对应的第一位置坐标;The data processing method of the neural network simulator according to any one of claims 1-4, wherein when the target data required for the granularity operation is part of the data in the matrix data of the destination, the second The second handling parameter includes the first position coordinates corresponding to each data in the partial data; 所述根据所述第二搬运参数将所述目的端的数据按周期级精度搬运至缓存,包括:The transferring the data of the destination end to the cache according to the second transfer parameter with cycle-level accuracy includes: 将所述目的端的矩阵数据中与所述第一位置坐标对应的数据搬运至所述缓存。The data corresponding to the first position coordinate in the matrix data of the destination end is transferred to the cache. 
The data processing method of the neural network simulator according to any one of claims 1 to 4, wherein, when the target data required by the granular operation is a value of a winograd pre-transform, the second transfer parameter comprises second position coordinates of the data in a 4*4 data table required by the winograd pre-transform; and the transferring the data at the destination to the cache with cycle-level precision according to the second transfer parameter comprises:
reading the 4*4 data table stored at the destination according to the second position coordinates, calculating the value of the winograd pre-transform based on the 4*4 data table, and transferring the value of the winograd pre-transform to the cache.
The data processing method of the neural network simulator according to any one of claims 1 to 4, wherein, before the performing the granular operation based on the data in the cache to obtain the cycle-level data processing result if the cache is not empty, the method comprises:
obtaining a value of a wrap flag bit of the cache, and a read address and a write address of the cache; and
determining whether the cache is empty according to whether the read address and the write address of the cache coincide, and the value of the wrap flag bit of the cache.
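The winograd pre-transform of a 4*4 tile can be sketched with the standard F(2x2, 3x3) input transform V = BᵀdB. The application does not state which winograd variant is used, so the transform matrix below is an assumption for illustration:

```python
import numpy as np

# B^T of the standard winograd F(2x2, 3x3) input transform (assumed variant).
BT = np.array([
    [1,  0, -1,  0],
    [0,  1,  1,  0],
    [0, -1,  1,  0],
    [0,  1,  0, -1],
], dtype=np.float64)

def winograd_pre_transform(tile: np.ndarray) -> np.ndarray:
    """Pre-transform V = B^T . d . B of one 4*4 input tile d."""
    assert tile.shape == (4, 4)
    return BT @ tile @ BT.T
```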
A data processing apparatus of a neural network simulator, comprising:
an acquisition unit configured to acquire instruction data, the instruction data comprising a first transfer instruction carrying a first transfer parameter, a second transfer instruction carrying a second transfer parameter, and a granular operation instruction;
a first transfer unit configured to transfer data at a source to a destination in a transaction-level transfer manner according to the first transfer parameter carried by the first transfer instruction;
a second transfer unit configured to transfer the data at the destination to a cache with cycle-level precision according to the second transfer parameter carried by the second transfer instruction; and
a data processing unit configured to, if the cache is not empty, perform a granular operation based on the data in the cache to obtain a cycle-level data processing result.
A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
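The empty test described in the method claims (coinciding read/write addresses disambiguated by a wrap flag) matches the classic circular-buffer technique; a minimal sketch with illustrative field names:

```python
from dataclasses import dataclass

# With one wrap flag per pointer, read_addr == write_addr means "empty" when
# the wrap flags agree and "full" when they differ. Field names are assumptions.
@dataclass
class CycleCache:
    read_addr: int = 0
    write_addr: int = 0
    read_wrap: int = 0   # toggled each time the read pointer wraps around
    write_wrap: int = 0  # toggled each time the write pointer wraps around

    def is_empty(self) -> bool:
        return self.read_addr == self.write_addr and self.read_wrap == self.write_wrap

    def is_full(self) -> bool:
        return self.read_addr == self.write_addr and self.read_wrap != self.write_wrap
```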
PCT/CN2022/100386 2021-12-08 2022-06-22 Data processing method and apparatus of neural network simulator, and terminal Ceased WO2023103334A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111494700.7A CN114356494B (en) 2021-12-08 2021-12-08 Data processing method, device and terminal for neural network simulator
CN202111494700.7 2021-12-08

Publications (1)

Publication Number Publication Date
WO2023103334A1 true WO2023103334A1 (en) 2023-06-15

Family

ID=81097443

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100386 Ceased WO2023103334A1 (en) 2021-12-08 2022-06-22 Data processing method and apparatus of neural network simulator, and terminal

Country Status (2)

Country Link
CN (1) CN114356494B (en)
WO (1) WO2023103334A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356494B (en) * 2021-12-08 2025-10-10 深圳云天励飞技术股份有限公司 Data processing method, device and terminal for neural network simulator

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653409A (en) * 2015-12-25 2016-06-08 北京时代民芯科技有限公司 Data type conversion-based hardware simulator validation data extraction system
US9984326B1 (en) * 2015-04-06 2018-05-29 Hrl Laboratories, Llc Spiking neural network simulator for image and video processing
CN108804380A (en) * 2018-05-21 2018-11-13 南京大学 The cascade Cycle accurate model of vector calculus hardware accelerator multinuclear
CN112632885A (en) * 2020-12-25 2021-04-09 山东产研鲲云人工智能研究院有限公司 Software and hardware combined verification system and method
CN113704043A (en) * 2021-08-30 2021-11-26 地平线(上海)人工智能技术有限公司 Chip function verification method and device, readable storage medium and electronic equipment
CN114356494A (en) * 2021-12-08 2022-04-15 深圳云天励飞技术股份有限公司 Data processing method and device of neural network simulator and terminal

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645052B (en) * 2008-08-06 2011-10-26 中兴通讯股份有限公司 Quick direct memory access (DMA) ping-pong caching method
CN111310893B (en) * 2016-08-05 2023-11-21 中科寒武纪科技股份有限公司 Device and method for executing neural network operation
CN109003132B (en) * 2017-10-30 2021-12-14 上海寒武纪信息科技有限公司 Advertising recommendation methods and related products
CN109446740B (en) * 2018-12-20 2023-11-14 湖南国科微电子股份有限公司 System-on-chip architecture performance simulation platform
CN112860597B (en) * 2019-11-27 2023-07-21 珠海格力电器股份有限公司 Neural network operation system, method, device and storage medium
CN111191777B (en) * 2019-12-27 2022-07-26 深圳云天励飞技术股份有限公司 Neural network processor and control method thereof
CN111797034B (en) * 2020-06-24 2024-10-08 深圳云天励飞技术有限公司 Data management method, neural network processor and terminal equipment
CN112016665B (en) * 2020-10-20 2021-04-06 深圳云天励飞技术股份有限公司 Method and device for calculating running time of neural network on processor


Also Published As

Publication number Publication date
CN114356494A (en) 2022-04-15
CN114356494B (en) 2025-10-10

Similar Documents

Publication Publication Date Title
US12190113B2 (en) Method and tensor traversal engine for strided memory access during execution of neural networks
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
US12106422B2 (en) Graphics processing
CN114461978B (en) Data processing method and device, electronic equipment and readable storage medium
CN104658033A (en) Method and device for global illumination rendering under multiple light sources
CN111563582A (en) A method for implementing and optimizing accelerated convolutional neural network on FPGA
EP3743821A1 (en) Wide key hash table for a graphics processing unit
JP2020021208A (en) Neural network processor, neural network processing method, and program
CN111241776A (en) TLM model for GPU geometric primitive starting mark management in plane clipping based on SystemC
CN112860935B (en) A cross-source image retrieval method, system, medium and equipment
CN114780239A (en) SpMV (SpMV) implementation method and system based on many-core processor in BCSR (binary coded redundancy check) storage format
CN118230055A (en) Method, system, medium and electronic device for judging coplanarity of local point cloud of rock mass based on multi-scale perception
WO2023103334A1 (en) Data processing method and apparatus of neural network simulator, and terminal
CN116227209B (en) A point cloud data multidimensional linear interpolation method, terminal device and storage medium
CN119091256B (en) A method, device, electronic device and storage medium for sample data expansion
CN114821140A (en) Image clustering method, terminal device and storage medium based on Manhattan distance
CN118298181A (en) Method, device, storage medium and computer equipment for semantic segmentation of few samples
CN114692840A (en) Data processing device, data processing method and related product
CN106909433A (en) A kind of D S evidence theory algorithm accelerated methods based on Zynq Series FPGAs
CN118333108A (en) A 4-bit quantization method for convolutional neural network input activation values
CN116187434A (en) Coarse-grained reconfigurable array simulator system and computing method for deep learning
CN114926596A (en) Oblique photography terrain file loading method, device, equipment and storage medium
CN107529638A (en) Accelerated method, data storage storehouse and the GPU system of linear solution device
CN114692838A (en) Data processing device, data processing method and related product
CN119648910B (en) Fast scene modeling method based on linear storage structure of point cloud 3D sparse matrix

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22902771

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22902771

Country of ref document: EP

Kind code of ref document: A1