
CN109961136A - Integrated circuit chip devices and related products - Google Patents


Info

Publication number
CN109961136A
CN109961136A (application CN201711346335.9A)
Authority
CN
China
Prior art keywords
processing circuit
data
circuit
basic
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711346335.9A
Other languages
Chinese (zh)
Other versions
CN109961136B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Cambrian Technology Co Ltd
Original Assignee
Beijing Zhongke Cambrian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Cambrian Technology Co Ltd filed Critical Beijing Zhongke Cambrian Technology Co Ltd
Priority to CN201911401050.XA priority Critical patent/CN110909872B/en
Priority to CN202010040822.8A priority patent/CN111242294B/en
Priority to CN201711346335.9A priority patent/CN109961136B/en
Priority to CN201911333469.6A priority patent/CN110826712B/en
Priority to CN201911401046.3A priority patent/CN111160542B/en
Publication of CN109961136A publication Critical patent/CN109961136A/en
Application granted granted Critical
Publication of CN109961136B publication Critical patent/CN109961136B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Image Processing (AREA)
  • Advance Control (AREA)

Abstract

The present disclosure provides an integrated circuit chip device and related products. The integrated circuit chip device comprises a main processing circuit and a plurality of basic processing circuits; the main processing circuit or at least one of the plurality of basic processing circuits comprises a data type operation circuit, which is used to perform conversion between floating-point data and fixed-point data. The plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column. The technical scheme provided by the disclosure has the advantages of a small amount of computation and low power consumption.

Description

Integrated circuit chip devices and related products

Technical Field

The present disclosure relates to the field of neural networks, and in particular to an integrated circuit chip device and related products.

Background

The artificial neural network (ANN) has been a research hotspot in the field of artificial intelligence since the 1980s. It abstracts the neuron network of the human brain from the perspective of information processing, builds a simple model, and forms different networks according to different connection schemes. In engineering and academia it is often referred to simply as a neural network or neural-network-like model. A neural network is a computational model composed of a large number of interconnected nodes (also called neurons). Existing neural network operations are implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); such operations involve a large amount of computation and high power consumption.

Summary

Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can improve the processing speed and the efficiency of a computing device.

In a first aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits; the main processing circuit or at least one of the plurality of basic processing circuits includes a data type operation circuit, and the data type operation circuit is configured to perform conversion between floating-point data and fixed-point data.

The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column.

The main processing circuit is configured to perform the successive operations in the neural network computation and to transmit data to and from the basic processing circuits connected to it.

The plurality of basic processing circuits are configured to perform the operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit.

In a second aspect, a neural network computing device is provided, which includes one or more of the integrated circuit chip devices provided in the first aspect.

In a third aspect, a combined processing device is provided, which includes the neural network computing device provided in the second aspect, a universal interconnection interface, and a general-purpose processing device.

The neural network computing device is connected to the general-purpose processing device through the universal interconnection interface.

In a fourth aspect, a chip is provided, which integrates the device of the first aspect, the device of the second aspect, or the device of the third aspect.

In a fifth aspect, an electronic device is provided, which includes the chip of the fourth aspect.

In a sixth aspect, a neural network operation method is provided. The method is applied in an integrated circuit chip device; the integrated circuit chip device includes the integrated circuit chip device described in the first aspect and is configured to perform neural network operations.

It can be seen that, in the embodiments of the present disclosure, a data type conversion circuit is provided to convert the type of the data blocks before they are operated on, which saves transmission resources and computing resources; the scheme therefore has the advantages of low power consumption and a small amount of computation.

Brief Description of the Drawings

FIG. 1a is a schematic structural diagram of an integrated circuit chip device.

FIG. 1b is a schematic structural diagram of another integrated circuit chip device.

FIG. 1c is a schematic structural diagram of a basic processing circuit.

FIG. 1d is a schematic structural diagram of a main processing circuit.

FIG. 1e is a schematic structural diagram of a fixed-point data type.

FIG. 2a is a schematic diagram of a method of using a basic processing circuit.

FIG. 2b is a schematic diagram of the main processing circuit transmitting data.

FIG. 2c is a schematic diagram of a matrix multiplied by a vector.

FIG. 2d is a schematic structural diagram of an integrated circuit chip device.

FIG. 2e is a schematic structural diagram of yet another integrated circuit chip device.

FIG. 2f is a schematic diagram of a matrix multiplied by a matrix.

FIG. 3a is a schematic diagram of convolution input data.

FIG. 3b is a schematic diagram of a convolution kernel.

FIG. 3c is a schematic diagram of an operation window of a three-dimensional data block of the input data.

FIG. 3d is a schematic diagram of another operation window of a three-dimensional data block of the input data.

FIG. 3e is a schematic diagram of yet another operation window of a three-dimensional data block of the input data.

FIG. 4a is a schematic diagram of the forward operation of a neural network.

FIG. 4b is a schematic diagram of the reverse operation of a neural network.

FIG. 4c is a schematic structural diagram of a combined processing device disclosed herein.

FIG. 4d is another schematic structural diagram of a combined processing device disclosed herein.

FIG. 5a is a schematic structural diagram of a neural network processor board card provided by an embodiment of the present disclosure.

FIG. 5b is a schematic structural diagram of a neural network chip packaging structure provided by an embodiment of the present disclosure.

FIG. 5c is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure.

FIG. 6 is a schematic diagram of a neural network chip packaging structure provided by an embodiment of the present disclosure.

FIG. 6a is a schematic diagram of another neural network chip packaging structure provided by an embodiment of the present disclosure.

Detailed Description

In order to enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure.

In the device provided in the first aspect, the main processing circuit is configured to obtain a data block to be computed and an operation instruction, convert the data block to be computed into a fixed-point data block through the data type operation circuit, and, according to the operation instruction, divide the fixed-point data block to be computed into a distribution data block and a broadcast data block; split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the basic processing circuits connected to it, and broadcast the broadcast data block to the basic processing circuits connected to it.

The basic processing circuit is configured to perform an inner product operation on the basic data block and the broadcast data block in the fixed-point data type to obtain an operation result, and send the operation result to the main processing circuit;

or to forward the basic data block and the broadcast data block to other basic processing circuits, which perform the inner product operation in the fixed-point data type to obtain an operation result and send the operation result to the main processing circuit.

The main processing circuit is configured to convert the operation result into floating-point data through the data type operation circuit, and process the floating-point data to obtain the instruction result of the data block to be computed and the operation instruction.

In the device provided in the first aspect, the main processing circuit is specifically configured to send the broadcast data block to the basic processing circuits connected to it in a single broadcast.

In the device provided in the first aspect, the basic processing circuit is specifically configured to perform inner product processing on the basic data block and the broadcast data block in the fixed-point data type to obtain an inner product result, accumulate the inner product result to obtain an operation result, and send the operation result to the main processing circuit.

In the device provided in the first aspect, the main processing circuit is configured to, when the operation result is the result of inner product processing, accumulate the operation results to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the data block to be computed and the operation instruction.

In the device provided in the first aspect, the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the plurality of partial broadcast data blocks to the basic processing circuits in multiple broadcasts; the plurality of partial broadcast data blocks together form the broadcast data block.

In the device provided in the first aspect, the basic processing circuit is specifically configured to perform one inner product operation on a partial broadcast data block and the basic data block in the fixed-point data type to obtain an inner product result, accumulate the inner product result to obtain a partial operation result, and send the partial operation result to the main processing circuit.

In the device provided in the first aspect, the basic processing circuit is specifically configured to reuse a partial broadcast data block n times, performing the inner product operation of that partial broadcast data block with n basic data blocks to obtain n partial processing results, accumulate the n partial processing results respectively to obtain n partial operation results, and send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.
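
As a rough illustration of this reuse scheme (not the patent's implementation; the function and variable names are hypothetical), the following Python sketch reuses one partial broadcast data block against n basic data blocks, producing one partial operation result per basic data block:

    # Sketch of reusing one partial broadcast data block against n basic data blocks.
    # All names (partial_broadcast, basic_blocks, ...) are illustrative assumptions.

    def reuse_partial_broadcast(partial_broadcast, basic_blocks):
        """Compute n partial operation results by multiplexing the same
        partial broadcast block over n basic data blocks."""
        partial_results = []
        for basic_block in basic_blocks:          # n >= 2 basic data blocks
            acc = 0
            for a, b in zip(partial_broadcast, basic_block):
                acc += a * b                      # accumulate the inner product
            partial_results.append(acc)           # one partial operation result per block
        return partial_results                    # sent back to the main processing circuit

    if __name__ == "__main__":
        partial_broadcast = [1, 2, 3]
        basic_blocks = [[4, 5, 6], [7, 8, 9]]     # n = 2
        print(reuse_partial_broadcast(partial_broadcast, basic_blocks))  # [32, 50]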

In the device provided in the first aspect, the main processing circuit includes a main register or a main on-chip cache circuit;

the basic processing circuit includes a basic register or a basic on-chip cache circuit.

In the device provided in the first aspect, the main processing circuit includes one or any combination of a vector operation circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access circuit, a data type operation circuit, and a data rearrangement circuit.

In the device provided in the first aspect, the main processing circuit is configured to obtain a data block to be computed and an operation instruction, and, according to the operation instruction, divide the data block to be computed into a distribution data block and a broadcast data block; split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the basic processing circuits connected to it, and broadcast the broadcast data block to the basic processing circuits connected to it.

The basic processing circuit is configured to convert the basic data block and the broadcast data block into fixed-point data blocks, perform the inner product operation on the fixed-point data blocks to obtain an operation result, convert the operation result into floating-point data, and send it to the main processing circuit;

or to convert the basic data block and the broadcast data block into fixed-point data blocks and forward the fixed-point data blocks to other basic processing circuits, which perform the inner product operation to obtain an operation result, convert the operation result into floating-point data, and send it to the main processing circuit.

The main processing circuit is configured to process the operation result to obtain the instruction result of the data block to be computed and the operation instruction.

In the device provided in the first aspect, the data is one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.

In the device provided in the first aspect, if the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is the broadcast data block and the multiplicand data block is the distribution data block;

or, if the operation instruction is a convolution instruction, the main processing circuit determines that the convolution input data block is the broadcast data block and the convolution kernel is the distribution data block.

In the method provided in the sixth aspect, the neural network operations include one or any combination of a convolution operation, a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.

Referring to FIG. 1a, FIG. 1a shows an integrated circuit chip device provided by the present disclosure. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits arranged in an m*n array, where m and n are integers greater than or equal to 1 and at least one of m and n is greater than or equal to 2. For the plurality of basic processing circuits distributed in the m*n array, each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k of the plurality of basic processing circuits, where the k basic processing circuits may be the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column. In the integrated circuit chip device shown in FIG. 1a, the main processing circuit and/or the plurality of basic processing circuits may include a data type conversion circuit; specifically, a subset of the basic processing circuits may include data type conversion circuits. For example, in an optional technical solution, the k basic processing circuits may be configured with data type conversion circuits, so that the n basic processing circuits of the first row can each be responsible for the data type conversion step for the data of the m basic processing circuits of their column. This arrangement improves computing efficiency and reduces power consumption, because the n basic processing circuits of the 1st row are the first to receive the data sent by the main processing circuit, and converting the received data into the fixed-point type reduces the amount of computation in the subsequent basic processing circuits and the amount of data transmitted to them. Likewise, configuring the m basic processing circuits of the first column with data type conversion circuits also yields a small amount of computation and low power consumption. In addition, with this structure the main processing circuit can adopt a dynamic data transmission strategy; for example, the main processing circuit broadcasts data to the m basic processing circuits of the 1st column and distributes data to the n basic processing circuits of the 1st row. The advantage is that different data are passed into a basic processing circuit through different data input ports, so the basic processing circuit does not need to distinguish what kind of data it has received; it only needs to determine from which receiving port the data arrived to know what kind of data it is.
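
To make the connection pattern concrete, the following Python sketch (an illustration only; the 0-based indexing is an assumption, since the patent counts rows and columns from 1) enumerates the k basic processing circuits of an m*n array that are directly wired to the main processing circuit, namely the 1st row, the m-th row, and the 1st column:

    # Enumerate the basic processing circuits directly wired to the main processing
    # circuit in an m*n array: row 1, row m, and column 1 (0-based indices here).

    def directly_connected(m, n):
        connected = set()
        for col in range(n):
            connected.add((0, col))        # the n circuits of the 1st row
            connected.add((m - 1, col))    # the n circuits of the m-th row
        for row in range(m):
            connected.add((row, 0))        # the m circuits of the 1st column
        return connected

    if __name__ == "__main__":
        k = directly_connected(4, 4)
        print(sorted(k))
        print(len(k))   # 4 + 4 + 4 - 2 shared corners = 10 circuits for a 4*4 array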

The main processing circuit is configured to perform the successive operations in the neural network computation and to transmit data to and from the basic processing circuits connected to it; the successive operations include, but are not limited to, accumulation operations, ALU operations, activation operations, and the like.

The plurality of basic processing circuits are configured to perform the operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit. The operations performed in parallel in the neural network include, but are not limited to, inner product operations, matrix or vector multiplication operations, and the like.

The main processing circuit may include a data sending circuit, a data receiving circuit, or an interface. The data sending circuit may integrate a data distribution circuit and a data broadcast circuit; of course, in practical applications, the data distribution circuit and the data broadcast circuit may also be provided separately. Broadcast data is data that needs to be sent to every basic processing circuit. Distribution data is data that needs to be selectively sent to some of the basic processing circuits. Specifically, taking the convolution operation as an example, the convolution input data needs to be sent to all basic processing circuits, so it is broadcast data, while the convolution kernels need to be selectively sent to some basic data blocks, so the convolution kernels are distribution data. How the distribution data is selected and sent to particular basic processing circuits may be determined by the main processing circuit according to the load and other allocation policies. For the broadcast transmission mode, the broadcast data is sent to every basic processing circuit in broadcast form. (In practical applications, the broadcast data may be sent to every basic processing circuit in a single broadcast or in multiple broadcasts; the embodiments of the present disclosure do not limit the number of broadcasts.) For the distribution transmission mode, the distribution data is selectively sent to some of the basic processing circuits. A sketch of this split for a convolution instruction is given below.
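
The following Python sketch illustrates the split for a convolution instruction; the round-robin policy used for the kernels is only a hypothetical example of the load-based selection mentioned above, and all names are illustrative:

    # Illustrative split of convolution data into broadcast data (the input) and
    # distribution data (the kernels), with a simple round-robin distribution policy.

    def plan_convolution_transfer(conv_input, kernels, num_basic_circuits):
        # the convolution input is broadcast data: every circuit receives it
        broadcast_plan = {i: conv_input for i in range(num_basic_circuits)}
        # the kernels are distribution data: each kernel goes to one selected circuit
        distribution_plan = {i: [] for i in range(num_basic_circuits)}
        for k, kernel in enumerate(kernels):
            distribution_plan[k % num_basic_circuits].append(kernel)
        return broadcast_plan, distribution_plan

    if __name__ == "__main__":
        conv_input = "input feature map"
        kernels = ["kernel0", "kernel1", "kernel2", "kernel3", "kernel4"]
        bcast, dist = plan_convolution_transfer(conv_input, kernels, 4)
        print(dist)  # {0: ['kernel0', 'kernel4'], 1: ['kernel1'], 2: ['kernel2'], 3: ['kernel3']}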

The main processing circuit (as shown in FIG. 1d) may include a register and/or an on-chip cache circuit, and may further include a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, and the like; of course, in practical applications, the main processing circuit may also include other circuits, such as a conversion circuit (for example, a matrix transpose circuit), a data rearrangement circuit, or an activation circuit.

Each basic processing circuit may include a basic register and/or a basic on-chip cache circuit; each basic processing circuit may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like. The inner product operator circuit, the vector operator circuit, and the accumulator circuit may all be integrated circuits, or they may be separately provided circuits.

Optionally, the accumulator circuits of the n basic processing circuits of the m-th row can perform the accumulation step of the inner product operation, because each basic processing circuit of the m-th row can receive the products of all the basic processing circuits in its column. Performing the accumulation of the inner product operation in the n basic processing circuits of the m-th row allocates computing resources effectively and has the advantage of saving power. This technical solution is especially suitable when m is large.

The assignment of which circuits perform the data type conversion can be made by the main processing circuit. Specifically, the circuits that perform the conversion can be assigned explicitly or implicitly. For the explicit way, the main processing circuit can issue a special indication or instruction: when a basic processing circuit receives the special indication or instruction, it determines that it should perform the data type conversion, and when a basic processing circuit does not receive the special indication or instruction, it determines that it should not perform the data type conversion. As another example, the assignment can be made implicitly; for example, when a basic processing circuit receives data of the floating-point type and determines that an inner product operation needs to be performed, it converts the data into the fixed-point type. For the explicit configuration, the special instruction or indication can carry a decrementing sequence whose value is decreased by 1 each time it passes through a basic processing circuit; each basic processing circuit reads the value of the decrementing sequence and performs the data type conversion if the value is greater than zero, and does not perform the data type conversion if the value is equal to or less than zero. This setting is configured according to the basic processing circuits arranged in the array. For example, for the m basic processing circuits of the i-th column, if the main processing circuit needs the first five basic processing circuits to perform the data type conversion, it issues a special instruction containing a decrementing sequence whose initial value may be 5; each time the instruction passes through a basic processing circuit, the value of the sequence is decreased by 1, so at the 5th basic processing circuit the value is 1 and at the 6th basic processing circuit the value is 0, at which point the 6th basic processing circuit no longer performs the data type conversion. In this way the main processing circuit can dynamically configure which circuits perform the data type conversion and how many times it is performed.
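
A minimal Python sketch of the decrementing-sequence mechanism described above (the data structures and names are assumptions for illustration): the special instruction carries a counter that is decremented as it passes down a column, and each basic processing circuit performs the conversion only while the value it reads is greater than zero.

    # Sketch of the explicit assignment of data type conversion via a decrementing
    # sequence carried by a special instruction passed down one column of circuits.

    def propagate_conversion_flag(initial_value, num_circuits_in_column):
        """Return, for each basic processing circuit in the column, whether it
        performs data type conversion."""
        performs_conversion = []
        value = initial_value                      # e.g. 5: the first five circuits convert
        for _ in range(num_circuits_in_column):
            performs_conversion.append(value > 0)  # convert only if the value read is > 0
            value -= 1                             # decrement as the instruction passes on
        return performs_conversion

    if __name__ == "__main__":
        # For a column of 8 circuits with an initial value of 5, the first five
        # circuits perform the conversion and the remaining three do not.
        print(propagate_conversion_flag(5, 8))
        # [True, True, True, True, True, False, False, False]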

An embodiment of the present disclosure provides an integrated circuit chip device, including a main processing circuit (which may also be called a main unit) and a plurality of basic processing circuits (which may also be called basic units); the structure of the embodiment is shown in FIG. 1b, in which the dashed box is the internal structure of the neural network computing device, the gray-filled arrows represent the data transmission paths between the main processing circuit and the basic processing circuit array, and the hollow arrows represent the data transmission paths between the basic processing circuits (adjacent basic processing circuits) in the array. The length and width of the basic processing circuit array may differ, that is, the values of m and n may be different or, of course, the same; the present disclosure does not limit these specific values.

The circuit structure of a basic processing circuit is shown in FIG. 1c. In the figure, the dashed box represents the boundary of the basic processing circuit, and the thick arrows crossing the dashed box represent data input and output channels (an arrow pointing into the dashed box is an input channel, and an arrow pointing out of the dashed box is an output channel); the rectangular boxes inside the dashed box represent the storage unit circuits (registers and/or on-chip cache), holding input data 1, input data 2, the multiplication or inner product result, and the accumulated data; the diamond boxes represent the operator circuits, including the multiplication or inner product operator and the adder.

In this embodiment, the neural network computing device includes one main processing circuit and 16 basic processing circuits (the 16 basic processing circuits are only illustrative; other numbers may be used in practical applications).

In this embodiment, each basic processing circuit has two data input interfaces and two data output interfaces. In the following description of this example, the horizontal input interface (the horizontal arrow pointing to the unit in FIG. 1b) is called input 0, and the vertical input interface (the vertical arrow pointing to the unit in FIG. 1b) is called input 1; each horizontal data output interface (the horizontal arrow pointing out of the unit in FIG. 1b) is called output 0, and the vertical data output interface (the vertical arrow pointing out of the unit in FIG. 1b) is called output 1.

The data input interfaces and data output interfaces of each basic processing circuit can be connected to different units, including the main processing circuit and other basic processing circuits.

In this example, input 0 of the four basic processing circuits 0, 4, 8, and 12 (numbered as in FIG. 1b) is connected to the data output interfaces of the main processing circuit;

in this example, input 1 of the four basic processing circuits 0, 1, 2, and 3 is connected to the data output interfaces of the main processing circuit;

in this example, output 1 of the four basic processing circuits 12, 13, 14, and 15 is connected to the data input interfaces of the main processing circuit;

in this example, the connections between the output interfaces of the basic processing circuits and the input interfaces of other basic processing circuits are as shown in FIG. 1b and are not listed one by one.

Specifically, the output interface S1 of unit S being connected to the input interface P1 of unit P means that unit P can receive, from its P1 interface, the data that unit S sends to its S1 interface.

This embodiment includes one main processing circuit. The main processing circuit is connected to an external device (that is, it has both input interfaces and output interfaces); a part of the data output interfaces of the main processing circuit are connected to the data input interfaces of a part of the basic processing circuits, and a part of the data input interfaces of the main processing circuit are connected to the data output interfaces of a part of the basic processing circuits.

Method of Using the Integrated Circuit Chip Device

The data involved in the usage methods provided by the present disclosure can be data of any type; for example, it can be data represented by floating-point numbers of any bit width or data represented by fixed-point numbers of any bit width.

A schematic structural diagram of the fixed-point data type is shown in FIG. 1e, which shows one way of expressing fixed-point data. In a computing system, one floating-point datum occupies 32 bits of storage, whereas for fixed-point data, in particular data represented with the fixed-point type shown in FIG. 1e, one fixed-point datum can be stored in 16 bits or fewer. This conversion therefore greatly reduces the transmission overhead between calculators. In addition, for a calculator, data with fewer bits occupies less storage space, so the storage overhead is smaller, and the amount of computation is also reduced, so the computation overhead is smaller; the conversion thus reduces both computation and storage overhead. However, the data type conversion itself also incurs some overhead, referred to below as the conversion overhead. For data involving a large amount of computation and a large amount of storage, the conversion overhead is almost negligible compared with the subsequent computation, storage, and transmission overheads, so for such data the present disclosure adopts the technical solution of converting the data into the fixed-point type. Conversely, for data involving a small amount of computation and a small amount of storage, the computation, storage, and transmission overheads are already small; if fixed-point data were used, since the precision of fixed-point data is slightly lower than that of floating-point data, and the amount of computation is small, the precision of the computation needs to be guaranteed, so in this case the fixed-point data is converted into floating-point data, that is, the precision of the computation is improved at the cost of a small increase in overhead.
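
As a rough illustration of this trade-off (the bit layout below is an assumption and not the exact format of FIG. 1e), the following Python sketch converts a 32-bit floating-point value to a signed 16-bit fixed-point representation with a configurable number of fractional bits and back, halving the stored and transmitted bit width at the cost of a small precision loss:

    # Illustrative float <-> 16-bit fixed-point conversion (not the exact format of
    # FIG. 1e): values are stored as signed 16-bit integers with `frac_bits`
    # fractional bits, which roughly halves storage and transfer cost versus float32.

    def float_to_fixed(x, frac_bits=8):
        scaled = int(round(x * (1 << frac_bits)))       # scale and round
        # saturate into the signed 16-bit range
        return max(-(1 << 15), min((1 << 15) - 1, scaled))

    def fixed_to_float(q, frac_bits=8):
        return q / (1 << frac_bits)                     # undo the scaling

    if __name__ == "__main__":
        x = 3.14159
        q = float_to_fixed(x)
        print(q, fixed_to_float(q))   # 804 3.140625 -- small precision loss, half the bits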

Operations that need to be completed in the basic processing circuits can be performed using the following methods.

The main processing circuit first converts the type of the data and then transmits it to the basic processing circuits for computation (for example, the main processing circuit can convert floating-point numbers into fixed-point numbers of lower bit width and then transmit them to the basic processing circuits; the advantages are that the bit width of the transmitted data is reduced, the total number of transmitted bits is reduced, and the basic processing circuits perform the low-bit-width fixed-point operations more efficiently and with lower power consumption).

The basic processing circuits can perform data type conversion after receiving the data and then perform the computation (for example, a basic processing circuit receives a floating-point number transmitted from the main processing circuit and converts it into a fixed-point number for the operation, which improves computing efficiency and reduces power consumption).

After the basic processing circuits compute a result, they can first perform data type conversion and then transmit the result to the main processing circuit (for example, a floating-point operation result computed by a basic processing circuit can first be converted into a low-bit-width fixed-point number and then transmitted to the main processing circuit; the benefit is that the data bit width during transmission is reduced, the efficiency is higher, and power consumption is saved).

Method of using the basic processing circuits (as shown in FIG. 2a):

The main processing circuit receives the input data to be computed from outside the device;

optionally, the main processing circuit processes the data using the various operation circuits of the unit, such as the vector operation circuit, the inner product operator circuit, and the accumulator circuit;

the main processing circuit sends data through its data output interfaces to the basic processing circuit array (the set of all basic processing circuits is called the basic processing circuit array), as shown in FIG. 2b;

the way of sending data here may be to send data directly to a part of the basic processing circuits, that is, in a multiple-broadcast manner;

the way of sending data here may also be to send different data to different basic processing circuits, that is, the distribution manner;

the basic processing circuit array computes on the data;

a basic processing circuit performs operations after receiving the input data;

optionally, after receiving data, a basic processing circuit transmits that data out through the data output interfaces of the unit (to other basic processing circuits that have not received data directly from the main processing circuit);

optionally, a basic processing circuit transmits an operation result (an intermediate result or a final result) out through its data output interface;

the main processing circuit receives the output data returned from the basic processing circuit array;

optionally, the main processing circuit continues to process the data received from the basic processing circuit array (for example, accumulation or activation operations);

when the main processing circuit has finished processing, it transmits the processing result from its data output interface to the outside of the device.

Using the circuit device to perform a matrix-multiply-vector operation:

(Multiplying a matrix by a vector can be done by computing the inner product of each row of the matrix with the vector and arranging these results into a vector in the order of the corresponding rows.)

The following describes the computation of the product of a matrix S of size M rows by L columns and a vector P of length L, as shown in FIG. 2c.
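
For reference, a plain Python sketch of this computation as M per-row inner products; this is the result that the distributed scheme described below reproduces:

    # Reference matrix-vector product: the inner product of each of the M rows of S
    # (size M x L) with the length-L vector P, arranged in row order.

    def matvec(S, P):
        assert all(len(row) == len(P) for row in S)      # each row has length L
        return [sum(s * p for s, p in zip(row, P)) for row in S]

    if __name__ == "__main__":
        S = [[1, 2, 3],
             [4, 5, 6]]           # M = 2, L = 3
        P = [1, 0, 2]             # length L = 3
        print(matvec(S, P))       # [7, 16]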

This method uses all or a part of the basic processing circuits of the neural network computing device; suppose K basic processing circuits are used.

The main processing circuit sends the data in some or all rows of the matrix S to each of the K basic processing circuits;

in an optional scheme, the control circuit of the main processing circuit sends the data of a certain row of the matrix S to a certain basic processing circuit one number or one portion of numbers at a time (for example, when sending one number at a time, for a given basic processing circuit it sends the 1st number of row 3 the first time, the 2nd number of row 3 the second time, the 3rd number of row 3 the third time, and so on; or, when sending a portion at a time, it sends the first two numbers of row 3 (i.e., the 1st and 2nd numbers) the first time, the 3rd and 4th numbers of row 3 the second time, the 5th and 6th numbers of row 3 the third time, and so on);

in an optional scheme, the control circuit of the main processing circuit sends the data of certain rows of the matrix S to a certain basic processing circuit one number or one portion of numbers per row at a time (for example, for a given basic processing circuit, it sends the 1st number of each of rows 3, 4, and 5 the first time, the 2nd number of each of rows 3, 4, and 5 the second time, the 3rd number of each of rows 3, 4, and 5 the third time, and so on; or the first two numbers of each of rows 3, 4, and 5 the first time, the 3rd and 4th numbers of each of rows 3, 4, and 5 the second time, the 5th and 6th numbers of each of rows 3, 4, and 5 the third time, and so on).

The control circuit of the main processing circuit sends the data of the vector P to the 0th basic processing circuit piece by piece;

after the 0th basic processing circuit receives the data of the vector P, it sends that data to the next basic processing circuit connected to it, namely basic processing circuit 1;

specifically, some basic processing circuits cannot obtain all the data required for the computation directly from the main processing circuit; for example, basic processing circuit 1 in FIG. 2d has only one data input interface connected to the main processing circuit, so it can only obtain the data of the matrix S directly from the main processing circuit, while the data of the vector P has to be output to basic processing circuit 1 by basic processing circuit 0; likewise, after receiving the data, basic processing circuit 1 has to continue to output the data of the vector P to basic processing circuit 2.

Each basic processing circuit performs operations on the received data, including but not limited to inner product operations, multiplications, additions, and so on;

in an optional scheme, the basic processing circuit computes the multiplication of one or more groups of two data items at a time and then accumulates the result into its register and/or on-chip cache;

in an optional scheme, the basic processing circuit computes the inner product of one or more groups of two vectors at a time and then accumulates the result into its register and/or on-chip cache;

after a basic processing circuit computes a result, it transmits the result out through its data output interface (that is, to other basic processing circuits connected to it);

in an optional scheme, the computation result can be the final result or an intermediate result of the inner product operation;

after a basic processing circuit receives a computation result from another basic processing circuit, it transmits that data to another basic processing circuit connected to it or to the main processing circuit;

the main processing circuit receives the result of the inner product operation of each basic processing circuit and processes the result to obtain the final result (the processing can be an accumulation operation, an activation operation, and so on).

An embodiment of implementing the matrix-multiply-vector method with the above computing device:

in an optional scheme, the plurality of basic processing circuits used in the method are arranged as shown in FIG. 2d or FIG. 2e;

as shown in FIG. 2c, the data type conversion circuit of the main processing circuit converts the matrix S and the vector P into fixed-point data; the control circuit of the main processing circuit divides the M rows of data of the matrix S into K groups, and the i-th basic processing circuit is responsible for the operations on the i-th group (the set of rows in that group of data is denoted Ai);

the method of grouping the M rows of data here can be any grouping method that does not assign a row more than once;

in an optional scheme, the following allocation is adopted: row j is assigned to the (j % K)-th basic processing circuit (% being the remainder operation);

in an optional scheme, for cases where the rows cannot be grouped evenly, a portion of the rows can first be allocated evenly and the remaining rows allocated in an arbitrary manner.
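
A small Python sketch of this grouping step (the helper names are illustrative): rows are assigned by j % K, and a variant handles the uneven case by first allocating an even portion and then handing out the remaining rows.

    # Group the M row indices of matrix S over K basic processing circuits.

    def group_rows_modulo(M, K):
        """Assign row j to basic processing circuit j % K (no row assigned twice)."""
        groups = [[] for _ in range(K)]
        for j in range(M):
            groups[j % K].append(j)
        return groups

    def group_rows_even_then_rest(M, K):
        """Variant for uneven splits: distribute an even portion first, then hand the
        leftover rows out one by one (here simply to the first circuits)."""
        per_circuit = M // K
        groups = [list(range(i * per_circuit, (i + 1) * per_circuit)) for i in range(K)]
        for j in range(K * per_circuit, M):      # remaining rows, allocated arbitrarily
            groups[j % K].append(j)
        return groups

    if __name__ == "__main__":
        print(group_rows_modulo(10, 4))          # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
        print(group_rows_even_then_rest(10, 4))  # [[0, 1, 8], [2, 3, 9], [4, 5], [6, 7]]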

The control circuit of the main processing circuit sends the data in some or all rows of the matrix S to the corresponding basic processing circuits in turn;

in an optional scheme, the control circuit of the main processing circuit sends, each time, one or more data items of one row of the i-th group of data Mi for which the i-th basic processing circuit is responsible;

in an optional scheme, the control circuit of the main processing circuit sends, each time, one or more data items of each row in some or all rows of the i-th group of data Mi for which the i-th basic processing circuit is responsible.

The control circuit of the main processing circuit sends the data of the vector P to the 1st basic processing circuit in turn;

in an optional scheme, the control circuit of the main processing circuit can send one or more data items of the vector P each time;

after the i-th basic processing circuit receives the data of the vector P, it sends it to the (i+1)-th basic processing circuit connected to it.

After each basic processing circuit receives one or more data items from one or more rows of the matrix S and one or more data items from the vector P, it performs operations (including but not limited to multiplication or addition);

in an optional scheme, the basic processing circuit computes the multiplication of one or more groups of two data items at a time and then accumulates the result into its register and/or on-chip cache;

in an optional scheme, the basic processing circuit computes the inner product of one or more groups of two vectors at a time and then accumulates the result into its register and/or on-chip cache;

in an optional scheme, the data received by a basic processing circuit can also be an intermediate result, which is stored in its register and/or on-chip cache.

A basic processing circuit transmits its local computation result to the next basic processing circuit connected to it or to the main processing circuit.

In an optional scheme, corresponding to the structure of FIG. 2d, only the output interface of the last basic processing circuit of each column is connected to the main processing circuit. In this case, only the last basic processing circuit can transmit its local computation result directly to the main processing circuit; the computation results of the other basic processing circuits must all be passed to their next basic processing circuit, which passes them to the next one, and so on until they all reach the last basic processing circuit. The last basic processing circuit accumulates its local computation result with the received results of the other basic processing circuits of its column to obtain an intermediate result and sends the intermediate result to the main processing circuit; alternatively, the last basic processing circuit can send the results of the other basic circuits of its column and its local processing result directly to the main processing circuit.
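
The following Python sketch illustrates the FIG. 2d style column described above, under the simplifying assumption that each circuit in the column holds a single local result: only the last circuit talks to the main processing circuit, and it either accumulates the column's results into one intermediate result or forwards them as-is.

    # Sketch of the FIG. 2d style column: only the last basic processing circuit in a
    # column is wired to the main processing circuit, so results are passed down the
    # chain and either accumulated by the last circuit or forwarded individually.

    def column_to_main(local_results, accumulate=True):
        """local_results[i] is the local computation result of the i-th circuit in the
        column, ordered from the top of the column down to the last circuit."""
        forwarded = []
        for result in local_results[:-1]:
            forwarded.append(result)          # passed circuit by circuit down the column
        last_local = local_results[-1]
        if accumulate:
            # the last circuit accumulates its own result with the forwarded ones
            return sum(forwarded) + last_local
        # or it sends the other circuits' results and its own result as-is
        return forwarded + [last_local]

    if __name__ == "__main__":
        print(column_to_main([3, 5, 7, 9]))                     # 24, one intermediate result
        print(column_to_main([3, 5, 7, 9], accumulate=False))   # [3, 5, 7, 9]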

在一种可选方案中,对应于图2e的结构,每一个基础处理电路都有与主处理电路相连接的输出接口,这种情况下,每一个基础处理电路都直接将本地的计算结果传输给主处理电路;In an optional solution, corresponding to the structure of FIG. 2e, each basic processing circuit has an output interface connected to the main processing circuit. In this case, each basic processing circuit directly transmits the local calculation result to the main processing circuit;

基础处理电路接收到其他基础处理电路传递过来的计算结果之后,传输给与其相连接的下一个基础处理电路或者主处理电路。After the basic processing circuit receives the calculation result passed by other basic processing circuits, it transmits it to the next basic processing circuit or main processing circuit connected to it.

主处理电路接收到M个内积运算的结果,作为矩阵乘向量的运算结果。The main processing circuit receives the results of the M inner product operations as the operation result of multiplying a matrix by a vector.
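The distribution-and-accumulation flow above can be pictured with a minimal software sketch. It is only an illustration under simplifying assumptions (the vector P is handed to every circuit directly rather than forwarded along the chain, and a simple round-robin row grouping is assumed as one possible non-repeating grouping); the function and variable names are hypothetical, not part of the disclosed device.

```python
# Minimal sketch of the matrix-times-vector scheme: rows of S are split
# across K "basic processing circuits"; each circuit accumulates inner
# products locally and the "main processing circuit" collects the M results.
# Hypothetical illustration only; names and structure are assumptions.

def matrix_vector(S, P, num_circuits):
    M = len(S)
    # Round-robin grouping: row j is assigned to circuit j % num_circuits.
    groups = [[] for _ in range(num_circuits)]
    for j in range(M):
        groups[j % num_circuits].append(j)

    result = [0.0] * M
    for i, rows in enumerate(groups):        # i-th basic processing circuit
        for j in rows:
            acc = 0.0                        # local register / on-chip cache
            for a, b in zip(S[j], P):        # inner product, element by element
                acc += a * b
            result[j] = acc                  # returned to the main processing circuit
    return result

S = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
P = [1, 1, 1]
print(matrix_vector(S, P, num_circuits=2))   # [6.0, 15.0, 24.0, 2.0]
```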

使用所述电路装置完成矩阵乘矩阵运算;using the circuit arrangement to perform a matrix-by-matrix operation;

下面描述计算尺寸是M行L列的矩阵S和尺寸是L行N列的矩阵P的乘法的运算,(矩阵S中的每一行与矩阵P的每一列长度相同,如图2f所示)The following describes the operation to calculate the multiplication of a matrix S of size M rows and L columns and a matrix P of size L rows and N columns, (each row in matrix S is the same length as each column of matrix P, as shown in Figure 2f)

本方法使用所述装置如图1b所示的实施例进行说明;The method is described using the embodiment of the device as shown in FIG. 1b;

主处理电路的数据转换运算电路将矩阵S和矩阵P转换成定点类型的数据;The data conversion operation circuit of the main processing circuit converts the matrix S and the matrix P into fixed-point type data;

The control circuit of the main processing circuit sends the data in some or all of the rows of the matrix S to those basic processing circuits that are directly connected to the main processing circuit through their horizontal data input interfaces (for example, the gray-filled vertical data paths at the top of Figure 1b);

In an optional solution, the control circuit of the main processing circuit sends the data of a certain row of the matrix S to a certain basic processing circuit one number, or one portion of the numbers, at a time (for example, for a certain basic processing circuit, the 1st number of row 3 is sent the 1st time, the 2nd number of row 3 the 2nd time, the 3rd number of row 3 the 3rd time, and so on; or the first two numbers of row 3 are sent the 1st time, the 3rd and 4th numbers of row 3 the 2nd time, the 5th and 6th numbers of row 3 the 3rd time, and so on);

In an optional solution, the control circuit of the main processing circuit sends the data of certain rows of the matrix S to a certain basic processing circuit one number, or one portion of the numbers, from each row at a time (for example, for a certain basic processing circuit, the 1st numbers of rows 3, 4 and 5 are sent the 1st time, the 2nd numbers of rows 3, 4 and 5 the 2nd time, the 3rd numbers of rows 3, 4 and 5 the 3rd time, and so on; or the first two numbers of rows 3, 4 and 5 are sent the 1st time, the 3rd and 4th numbers of rows 3, 4 and 5 the 2nd time, the 5th and 6th numbers of rows 3, 4 and 5 the 3rd time, and so on);

The control circuit of the main processing circuit sends the data in some or all of the columns of the matrix P to those basic processing circuits that are directly connected to the main processing circuit through their vertical data input interfaces (for example, the gray-filled horizontal data paths on the left side of the basic processing circuit array in Figure 1b);

In an optional solution, the control circuit of the main processing circuit sends the data of a certain column of the matrix P to a certain basic processing circuit one number, or one portion of the numbers, at a time (for example, for a certain basic processing circuit, the 1st number of column 3 is sent the 1st time, the 2nd number of column 3 the 2nd time, the 3rd number of column 3 the 3rd time, and so on; or the first two numbers of column 3 are sent the 1st time, the 3rd and 4th numbers of column 3 the 2nd time, the 5th and 6th numbers of column 3 the 3rd time, and so on);

In an optional solution, the control circuit of the main processing circuit sends the data of certain columns of the matrix P to a certain basic processing circuit one number, or one portion of the numbers, from each column at a time (for example, for a certain basic processing circuit, the 1st numbers of columns 3, 4 and 5 are sent the 1st time, the 2nd numbers of columns 3, 4 and 5 the 2nd time, the 3rd numbers of columns 3, 4 and 5 the 3rd time, and so on; or the first two numbers of columns 3, 4 and 5 are sent the 1st time, the 3rd and 4th numbers of columns 3, 4 and 5 the 2nd time, the 5th and 6th numbers of columns 3, 4 and 5 the 3rd time, and so on);

After a basic processing circuit receives data of the matrix S, it transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data paths in the middle of the basic processing circuit array in Figure 1b); after a basic processing circuit receives data of the matrix P, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in Figure 1b);

每一个基础处理电路对接收到的数据进行运算;Each basic processing circuit operates on the received data;

在一种可选方案中,基础处理电路每次计算一组或多组两个数据的乘法,然后将结果累加到寄存器和或片上缓存上;In an alternative, the base processing circuit computes the multiplication of one or more sets of two data at a time, and then accumulates the results into registers and/or on-chip caches;

In an optional solution, the basic processing circuit computes the inner products of one or more pairs of vectors at a time, and then accumulates the results into its registers and/or on-chip cache;

基础处理电路计算出结果后,可以将结果从数据输出接口传输出去;After the basic processing circuit calculates the result, the result can be transmitted from the data output interface;

在一种可选方案中,该计算结果可以是内积运算的最终结果或中间结果;In an optional solution, the calculation result may be the final result or the intermediate result of the inner product operation;

Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result toward the basic processing circuit that can output directly to the main processing circuit (for example, in Figure 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces).

基础处理电路接收到来自其他基础处理电路的计算结果之后,将该数据传输给与其相连接的其他基础处理电路或者主处理电路;After the basic processing circuit receives the calculation results from other basic processing circuits, it transmits the data to other basic processing circuits or main processing circuits connected to it;

The result is output toward the direction in which it can be passed directly to the main processing circuit (for example, in Figure 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces);

主处理电路接收到各个基础处理电路内积运算的结果,即可得到输出结果。The main processing circuit can obtain the output result after receiving the result of the inner product operation of each basic processing circuit.

“矩阵乘矩阵”方法的实施例:Example of "matrix by matrix" method:

方法用到按照如图1b所示方式排列的基础处理电路阵列,假设有h行,w列;The method uses the basic processing circuit array arranged as shown in Figure 1b, assuming that there are h rows and w columns;

主处理电路的数据转换运算电路将矩阵S和矩阵P转换成定点类型的数据;The data conversion operation circuit of the main processing circuit converts the matrix S and the matrix P into fixed-point type data;

The control circuit of the main processing circuit divides the M rows of data of the matrix S into h groups, with the i-th basic processing circuit responsible for the operations of the i-th group (the set of rows in that group is denoted Hi);

The rows may be grouped by any scheme in which no row is allocated more than once;

In an optional solution, the following allocation is used: the control circuit of the main processing circuit allocates row j to the (j % h)-th basic processing circuit;

In an optional solution, when the rows cannot be divided evenly, a portion of the rows may first be allocated evenly and the remaining rows allocated in an arbitrary manner.

The control circuit of the main processing circuit divides the N columns of data of the matrix P into w groups, with the i-th basic processing circuit responsible for the operations of the i-th group (the set of columns in that group is denoted Wi);

The columns may be grouped by any scheme in which no column is allocated more than once;

In an optional solution, the following allocation is used: the control circuit of the main processing circuit allocates column j to the (j % w)-th basic processing circuit;

In an optional solution, when the columns cannot be divided evenly, a portion of the columns may first be allocated evenly and the remaining columns allocated in an arbitrary manner.
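As a purely illustrative note, the round-robin allocation above can be sketched as follows (a hypothetical Python helper, not part of the disclosed device); the remainder case is handled naturally by the modulo rule:

```python
# Illustrative sketch of the j % h / j % w allocation described above.
# Hypothetical helper, not the device's control circuit.

def group_indices(count, num_groups):
    """Assign index j to group j % num_groups; no index is assigned twice."""
    groups = [[] for _ in range(num_groups)]
    for j in range(count):
        groups[j % num_groups].append(j)
    return groups

M, N, h, w = 10, 7, 4, 3
H = group_indices(M, h)   # row groups H1..Hh of matrix S
W = group_indices(N, w)   # column groups W1..Ww of matrix P
print(H)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
print(W)  # [[0, 3, 6], [1, 4], [2, 5]]
```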

主处理电路的控制电路将矩阵S的部分或全部行中的数据发送到基础处理电路阵列中每行的第一个基础处理电路;The control circuit of the main processing circuit sends the data in some or all of the rows of the matrix S to the first basic processing circuit of each row in the array of basic processing circuits;

In an optional solution, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th row of the basic processing circuit array, one or more data of one row in the i-th group of data Hi for which that circuit is responsible;

In an optional solution, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th row of the basic processing circuit array, one or more data of each row among some or all of the rows in the i-th group of data Hi for which that circuit is responsible;

主处理电路的控制电路将矩阵P的部分或全部列中的数据发送到基础处理电路阵列中每列的第一个基础处理电路;The control circuit of the main processing circuit sends the data in some or all of the columns of the matrix P to the first basic processing circuit of each column in the basic processing circuit array;

In an optional solution, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th column of the basic processing circuit array, one or more data of one column in the i-th group of data Wi for which that circuit is responsible;

In an optional solution, each time the control circuit of the main processing circuit sends, to the first basic processing circuit of the i-th column of the basic processing circuit array, one or more data of each column among some or all of the columns in the i-th group of data Wi for which that circuit is responsible;

After a basic processing circuit receives data of the matrix S, it transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data paths in the middle of the basic processing circuit array in Figure 1b); after a basic processing circuit receives data of the matrix P, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in Figure 1b);

每一个基础处理电路对接收到的数据进行运算;Each basic processing circuit operates on the received data;

在一种可选方案中,基础处理电路每次计算一组或多组两个数据的乘法,然后将结果累加到寄存器和或片上缓存上;In an alternative, the base processing circuit computes the multiplication of one or more sets of two data at a time, and then accumulates the results into registers and/or on-chip caches;

In an optional solution, the basic processing circuit computes the inner products of one or more pairs of vectors at a time, and then accumulates the results into its registers and/or on-chip cache;

基础处理电路计算出结果后,可以将结果从数据输出接口传输出去;After the basic processing circuit calculates the result, the result can be transmitted from the data output interface;

在一种可选方案中,该计算结果可以是内积运算的最终结果或中间结果;In an optional solution, the calculation result may be the final result or the intermediate result of the inner product operation;

Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result toward the basic processing circuit that can output directly to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces).

基础处理电路接收到来自其他基础处理电路的计算结果之后,将该数据传输给与其相连接的其他基础处理电路或者主处理电路;After the basic processing circuit receives the calculation results from other basic processing circuits, it transmits the data to other basic processing circuits or main processing circuits connected to it;

向着能够直接向主处理电路输出的方向输出结果(例如,最下面一行基础处理电路将其输出结果直接输出给主处理电路,其他基础处理电路从竖向的输出接口向下传输运算结果);Output results in a direction that can be directly output to the main processing circuit (for example, the basic processing circuit in the bottom row directly outputs its output results to the main processing circuit, and other basic processing circuits transmit the operation results downward from the vertical output interface);

主处理电路接收到各个基础处理电路内积运算的结果,即可得到输出结果。The main processing circuit can obtain the output result after receiving the result of the inner product operation of each basic processing circuit.
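To summarize the matrix-by-matrix scheme in software form, the following minimal sketch assumes an h x w array of basic processing circuits in which the circuit at position (i, j) handles row group Hi of S and column group Wj of P; the data forwarding along the rows and columns of the array is abstracted away, and all names are illustrative assumptions rather than the actual circuit.

```python
# Minimal sketch of the matrix-by-matrix scheme on an h x w array of
# "basic processing circuits". Circuit (i, j) handles row group Hi of S
# and column group Wj of P; the main processing circuit assembles C = S * P.
# Hypothetical illustration under the round-robin grouping described above.

def matrix_matrix(S, P, h, w):
    M, L, N = len(S), len(S[0]), len(P[0])
    H = [[r for r in range(M) if r % h == i] for i in range(h)]
    W = [[c for c in range(N) if c % w == j] for j in range(w)]

    C = [[0.0] * N for _ in range(M)]
    for i in range(h):                 # array row i
        for j in range(w):             # array column j: circuit (i, j)
            for r in H[i]:
                for c in W[j]:
                    acc = 0.0          # local accumulation register / cache
                    for k in range(L):
                        acc += S[r][k] * P[k][c]
                    C[r][c] = acc      # result returned to the main processing circuit
    return C

S = [[1, 2], [3, 4], [5, 6]]
P = [[1, 0, 2], [0, 1, 1]]
print(matrix_matrix(S, P, h=2, w=2))   # [[1.0, 2.0, 4.0], [3.0, 4.0, 10.0], [5.0, 6.0, 16.0]]
```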

The terms "horizontal" and "vertical" used in the above description refer only to the example shown in Figure 1b; in actual use, it is only necessary that the "horizontal" and "vertical" interfaces of each unit represent two different interfaces.

使用所述电路装置完成全连接运算:A fully connected operation is done using the described circuit arrangement:

If the input data of the fully connected layer is a vector (i.e., the case in which the input of the neural network is a single sample), the weight matrix of the fully connected layer is taken as the matrix S and the input vector as the vector P, and the operation is performed according to the matrix-times-vector method of the device;

If the input data of the fully connected layer is a matrix (i.e., the case in which the input of the neural network is multiple samples), the weight matrix of the fully connected layer is taken as the matrix S and the input data as the matrix P, or the weight matrix of the fully connected layer is taken as the matrix P and the input data as the matrix S, and the operation is performed according to the matrix-times-matrix method of the device;

使用所述电路装置完成卷积运算:The convolution operation is performed using the described circuit arrangement:

The convolution operation is described below. In the following figures, one block represents one datum. The input data is shown in Figure 3a (N samples, each sample having C channels, and the feature map of each channel having height H and width W); the weights, i.e., the convolution kernels, are shown in Figure 3b (there are M convolution kernels, each having C channels, with height KH and width KW). For the N samples of the input data, the rules of the convolution operation are the same; the following explains the process of performing convolution on one sample. On one sample, each of the M convolution kernels performs the same operation: each kernel produces one planar feature map, and the M kernels together produce M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product operation is performed at each planar position of the sample, and the kernel then slides along the H and W directions. For example, Figure 3c shows a convolution kernel performing the inner product operation at the lower-right corner of one sample of the input data; Figure 3d shows the convolution position sliding one step to the left, and Figure 3e shows the convolution position sliding one step upward.
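As a minimal software rendering of the sliding inner-product view just described (one sample, stride 1, no padding assumed; the helper names are illustrative, not the disclosed circuitry):

```python
# Sketch of convolution as sliding inner products over one sample.
# inp: C x H x W nested lists; kernels: M x C x KH x KW nested lists.
# Output: M planar feature maps of size (H-KH+1) x (W-KW+1).
# Illustrative only; stride 1 and no padding are assumed.

def conv2d_single_sample(inp, kernels):
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M, KH, KW = len(kernels), len(kernels[0][0]), len(kernels[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(M)]
    for m in range(M):                       # one feature map per kernel
        for oh in range(OH):
            for ow in range(OW):             # one convolution position
                acc = 0.0
                for c in range(C):
                    for kh in range(KH):
                        for kw in range(KW):
                            acc += inp[c][oh + kh][ow + kw] * kernels[m][c][kh][kw]
                out[m][oh][ow] = acc         # inner product at this position
    return out

inp = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]          # C=1, H=W=3
kernels = [[[[1, 0], [0, 1]]]]                      # M=1, C=1, KH=KW=2
print(conv2d_single_sample(inp, kernels))           # [[[6.0, 8.0], [12.0, 14.0]]]
```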

本方法使用所述装置如图1b所示的实施例进行说明;The method is described using the embodiment of the device as shown in FIG. 1b;

The data conversion operation circuit of the main processing circuit may convert the data of some or all of the convolution kernels of the weights into fixed-point data, and the control circuit of the main processing circuit sends the data of some or all of the convolution kernels of the weights to those basic processing circuits that are directly connected to the main processing circuit through their horizontal data input interfaces (for example, the gray-filled vertical data paths at the top of Figure 1b);

In an optional solution, the control circuit of the main processing circuit sends the data of a certain convolution kernel of the weights to a certain basic processing circuit one number, or one portion of the numbers, at a time (for example, for a certain basic processing circuit, the 1st number of row 3 is sent the 1st time, the 2nd number of row 3 the 2nd time, the 3rd number of row 3 the 3rd time, and so on; or the first two numbers of row 3 are sent the 1st time, the 3rd and 4th numbers of row 3 the 2nd time, the 5th and 6th numbers of row 3 the 3rd time, and so on);

In another optional solution, the control circuit of the main processing circuit sends the data of certain convolution kernels of the weights to a certain basic processing circuit one number, or one portion of the numbers, from each kernel at a time (for example, for a certain basic processing circuit, the 1st numbers of rows 3, 4 and 5 are sent the 1st time, the 2nd numbers of rows 3, 4 and 5 the 2nd time, the 3rd numbers of rows 3, 4 and 5 the 3rd time, and so on; or the first two numbers of rows 3, 4 and 5 are sent the 1st time, the 3rd and 4th numbers of rows 3, 4 and 5 the 2nd time, the 5th and 6th numbers of rows 3, 4 and 5 the 3rd time, and so on);

The control circuit of the main processing circuit divides the input data according to the convolution positions, and sends the data of some or all of the convolution positions in the input data to those basic processing circuits that are directly connected to the main processing circuit through their vertical data input interfaces (for example, the gray-filled horizontal data paths on the left side of the basic processing circuit array in Figure 1b);

In an optional solution, the control circuit of the main processing circuit sends the data of a certain convolution position in the input data to a certain basic processing circuit one number, or one portion of the numbers, at a time (for example, for a certain basic processing circuit, the 1st number of column 3 is sent the 1st time, the 2nd number of column 3 the 2nd time, the 3rd number of column 3 the 3rd time, and so on; or the first two numbers of column 3 are sent the 1st time, the 3rd and 4th numbers of column 3 the 2nd time, the 5th and 6th numbers of column 3 the 3rd time, and so on);

In another optional solution, the control circuit of the main processing circuit sends the data of certain convolution positions in the input data to a certain basic processing circuit one number, or one portion of the numbers, from each position at a time (for example, for a certain basic processing circuit, the 1st numbers of columns 3, 4 and 5 are sent the 1st time, the 2nd numbers of columns 3, 4 and 5 the 2nd time, the 3rd numbers of columns 3, 4 and 5 the 3rd time, and so on; or the first two numbers of columns 3, 4 and 5 are sent the 1st time, the 3rd and 4th numbers of columns 3, 4 and 5 the 2nd time, the 5th and 6th numbers of columns 3, 4 and 5 the 3rd time, and so on);

After a basic processing circuit receives data of the weights, it transmits that data through its horizontal data output interface to the next basic processing circuit connected to it (for example, the white-filled horizontal data paths in the middle of the basic processing circuit array in Figure 1b); after a basic processing circuit receives data of the input data, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it (for example, the white-filled vertical data paths in the middle of the basic processing circuit array in Figure 1b);

每一个基础处理电路对接收到的数据进行运算;Each basic processing circuit operates on the received data;

在一种可选方案中,基础处理电路每次计算一组或多组两个数据的乘法,然后将结果累加到寄存器和/或片上缓存上;In an alternative, the base processing circuit computes the multiplication of one or more sets of two data at a time, and then accumulates the results into registers and/or on-chip caches;

在一种可选方案中,基础处理电路每次计算一组或多组两个向量的内积,然后将结果累加到寄存器和/或片上缓存上;In an optional solution, the basic processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result into a register and/or on-chip cache;

基础处理电路计算出结果后,可以将结果从数据输出接口传输出去;After the basic processing circuit calculates the result, the result can be transmitted from the data output interface;

在一种可选方案中,该计算结果可以是内积运算的最终结果或中间结果;In an optional solution, the calculation result may be the final result or the intermediate result of the inner product operation;

Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result toward the basic processing circuit that can output directly to the main processing circuit (for example, in Figure 1b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces).

基础处理电路接收到来自其他基础处理电路的计算结果之后,将该数据传输给与其相连接的其他基础处理电路或者主处理电路;After the basic processing circuit receives the calculation results from other basic processing circuits, it transmits the data to other basic processing circuits or main processing circuits connected to it;

向着能够直接向主处理电路输出的方向输出结果(例如,最下面一行基础处理电路将其输出结果直接输出给主处理电路,其他基础处理电路从竖向的输出接口向下传输运算结果);Output results in a direction that can be directly output to the main processing circuit (for example, the basic processing circuit in the bottom row directly outputs its output results to the main processing circuit, and other basic processing circuits transmit the operation results downward from the vertical output interface);

主处理电路接收到各个基础处理电路内积运算的结果,即可得到输出结果。The main processing circuit can obtain the output result after receiving the result of the inner product operation of each basic processing circuit.

使用所述电路装置完成加偏置操作的方法;A method of performing a biasing operation using the circuit arrangement;

利用主处理电路的向量运算器电路可以实现两个向量或者两个矩阵相加的功能;The vector operator circuit of the main processing circuit can realize the function of adding two vectors or two matrices;

利用主处理电路的向量运算器电路可以实现把一个向量加到一个矩阵的每一行上,或者每一个列上的功能。The function of adding a vector to each row or each column of a matrix can be realized by using the vector operator circuit of the main processing circuit.

在一种可选方案中,所述矩阵可以来自所述装置执行矩阵乘矩阵运算的结果;In an alternative, the matrix may come from the result of the apparatus performing a matrix-by-matrix operation;

在一种可选方案中,所述向量可以来自所述装置执行矩阵乘向量运算的结果;In an optional solution, the vector may come from the result of the apparatus performing a matrix-multiply-vector operation;

在一种可选方案中,所述矩阵可以来自所述装置的主处理电路从外部接受的数据。In an alternative, the matrix may be from data received externally by the main processing circuit of the device.

在一种可选方案中,所述向量可以来自所述装置的主处理电路从外部接受的数据。In an alternative, the vector may come from data received externally by the main processing circuit of the device.

包括但不限于以上这些数据来源。Including but not limited to the above data sources.
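The bias-add behavior described above can be pictured with a short illustrative sketch (plain Python lists; the helper name and the row/column convention are assumptions, not part of the disclosed device):

```python
# Sketch of the vector-operator bias addition: add a bias vector to every
# row (axis="row") or every column (axis="col") of a matrix.
# Illustrative only.

def add_bias(matrix, bias, axis="row"):
    if axis == "row":      # bias has one entry per column
        return [[x + b for x, b in zip(row, bias)] for row in matrix]
    else:                  # axis == "col": bias has one entry per row
        return [[x + bias[i] for x in row] for i, row in enumerate(matrix)]

M = [[1, 2, 3], [4, 5, 6]]
print(add_bias(M, [10, 20, 30], axis="row"))  # [[11, 22, 33], [14, 25, 36]]
print(add_bias(M, [100, 200], axis="col"))    # [[101, 102, 103], [204, 205, 206]]
```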

使用所述电路装置完成激活函数运算的方法:A method for completing the activation function operation using the circuit arrangement:

利用主处理电路的激活电路,输入一向量,计算出该向量的激活向量;Using the activation circuit of the main processing circuit, input a vector, and calculate the activation vector of the vector;

In an optional solution, the activation circuit of the main processing circuit passes each value of the input vector through an activation function (the input of the activation function is a single value and its output is also a single value), and writes the computed value to the corresponding position of the output vector;

在一种可选方案中,激活函数可以是:y=max(m,x),其中x是输入数值,y是输出数值,m是一个常数;In an optional solution, the activation function can be: y=max(m,x), where x is the input value, y is the output value, and m is a constant;

在一种可选方案中,激活函数可以是:y=tanh(x),其中x是输入数值,y是输出数值;In an optional solution, the activation function may be: y=tanh(x), where x is the input value and y is the output value;

在一种可选方案中,激活函数可以是:y=sigmoid(x),其中x是输入数值,y是输出数值;In an optional solution, the activation function may be: y=sigmoid(x), where x is the input value and y is the output value;

在一种可选方案中,激活函数可以是一个分段线性函数;In an alternative, the activation function can be a piecewise linear function;

在一种可选方案中,激活函数可以是任意输入一个数,输出一个数的函数。In an alternative solution, the activation function can be any function that inputs a number and outputs a number.
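The activation alternatives listed above can be sketched as follows; the elementwise application matches the description above, while the specific constants and helper names are illustrative assumptions:

```python
import math

# Sketch of elementwise activation: each entry of the input vector is passed
# through a one-in, one-out activation function. Illustrative only.

def relu_like(x, m=0.0):                 # y = max(m, x)
    return max(m, x)

def sigmoid(x):                          # y = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def activate(vector, fn):
    return [fn(x) for x in vector]

v = [-2.0, 0.5, 3.0]
print(activate(v, relu_like))            # [0.0, 0.5, 3.0]
print(activate(v, math.tanh))            # tanh applied elementwise
print(activate(v, sigmoid))              # sigmoid applied elementwise
```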

在一种可选方案中,输入向量的来源有(包括但不限于):In an optional solution, the sources of input vectors are (including but not limited to):

所述装置的外部数据来源;external data sources for the device;

在一种可选方案中,输入数据来自所述装置进行矩阵乘向量的运算结果;In an optional solution, the input data comes from the operation result of matrix multiplication vector by the device;

在一种可选方案中,输入数据来自所述装置进行矩阵乘矩阵的运算结果;In an optional solution, the input data comes from the operation result of matrix multiplication performed by the device;

所述装置的主处理电路计算结果;the calculation result of the main processing circuit of the device;

在一种可选方案中,输入数据来自所述装置主处理电路实现加偏置之后的计算结果。In an optional solution, the input data comes from the calculation result of the main processing circuit of the device after adding the bias.

使用所述装置实现BLAS(Basic Linear Algebra Subprograms)的方法;A method for realizing BLAS (Basic Linear Algebra Subprograms) using the device;

GEMM computation refers to the matrix-matrix multiplication operation in the BLAS library. The usual form of this operation is C = alpha*op(S)*op(P) + beta*C, where S and P are the two input matrices, C is the output matrix, alpha and beta are scalars, and op denotes some operation applied to the matrix S or P; in addition, some auxiliary integers are supplied as parameters to describe the width and height of the matrices S and P;

使用所述装置实现GEMM计算的步骤为:The steps of using the device to realize GEMM calculation are:

主处理电路在进行OP操作之前可以将输入矩阵S和矩阵P进行数据类型的转换;The main processing circuit can convert the data type of the input matrix S and the matrix P before performing the OP operation;

主处理电路的转换电路对输入矩阵S和矩阵P进行各自相应的op操作;The conversion circuit of the main processing circuit performs respective op operations on the input matrix S and the matrix P;

In an optional solution, op may be a matrix transpose operation, realized using the vector operation function or the data rearrangement function of the main processing circuit (as mentioned above, the main processing circuit has a data rearrangement circuit); in practical applications, the op may also be realized directly by the conversion circuit, for example, for a matrix transpose operation the op is realized directly by the matrix transpose circuit;

在一种可选方案中,某个矩阵的op可以为空,OP操作不进行;In an optional solution, the op of a certain matrix can be empty, and the OP operation is not performed;

利用矩阵乘矩阵的计算方法完成op(S)与op(P)之间的矩阵乘法计算;The matrix multiplication calculation between op(S) and op(P) is completed by the calculation method of matrix multiplication matrix;

利用主处理电路的算术逻辑电路对op(S)*op(P)的结果中的每一个值进行乘以alpha的操作;Use the arithmetic logic circuit of the main processing circuit to multiply each value in the result of op(S)*op(P) by alpha;

在一种可选方案中,alpha为1的情况下乘以alpha操作不进行;In an optional solution, the multiplication by alpha operation is not performed when alpha is 1;

利用主处理电路的算术逻辑电路实现beta*C的运算;Use the arithmetic logic circuit of the main processing circuit to realize the operation of beta*C;

在一种可选方案中,beta为1的情况下,不进行乘以beta操作;In an optional solution, when beta is 1, the multiplication by beta operation is not performed;

利用主处理电路的算术逻辑电路,实现矩阵alpha*op(S)*op(P)和beta*C之间对应位置相加的步骤;Use the arithmetic logic circuit of the main processing circuit to realize the steps of adding the corresponding positions between the matrices alpha*op(S)*op(P) and beta*C;

在一种可选方案中,beta为0的情况下,不进行相加操作;In an alternative solution, when beta is 0, no addition operation is performed;
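A compact reference sketch of the GEMM steps above, with op limited to an optional transpose and the alpha = 1 and beta = 0 or 1 shortcuts applied as in the optional solutions above (the names are illustrative, not the device's instruction set):

```python
# Sketch of C = alpha * op(S) * op(P) + beta * C with op in {None, "T"}.
# Illustrative only; matrices are nested Python lists.

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gemm(alpha, S, P, beta, C, op_s=None, op_p=None):
    S = transpose(S) if op_s == "T" else S
    P = transpose(P) if op_p == "T" else P
    R = matmul(S, P)
    if alpha != 1:                        # skip the multiply when alpha == 1
        R = [[alpha * x for x in row] for row in R]
    if beta == 0:                         # skip the addition when beta == 0
        return R
    scaled_c = C if beta == 1 else [[beta * x for x in row] for row in C]
    return [[r + c for r, c in zip(rr, cr)] for rr, cr in zip(R, scaled_c)]

S = [[1, 2], [3, 4]]
P = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
print(gemm(2, S, P, 1, C))   # [[39, 45], [87, 101]]
```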

GEMV computation refers to the matrix-vector multiplication operation in the BLAS library. The usual form of this operation is C = alpha*op(S)*P + beta*C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op denotes some operation applied to the matrix S;

使用所述装置实现GEMV计算的步骤为:The steps of using the device to realize GEMV calculation are:

主处理电路在进行OP操作之前可以将输入矩阵S和矩阵P进行数据类型的转换;The main processing circuit can convert the data type of the input matrix S and the matrix P before performing the OP operation;

主处理电路的转换电路对输入矩阵S进行相应的op操作;The conversion circuit of the main processing circuit performs corresponding op operations on the input matrix S;

在一种可选方案中,op可以为矩阵的转置操作;利用主处理电路的矩阵转置电路实现矩阵转置操作;In an optional solution, op can be a matrix transposition operation; the matrix transposition operation is realized by using the matrix transposition circuit of the main processing circuit;

在一种可选方案中,某个矩阵的op可以为空,op操作不进行;In an optional solution, the op of a certain matrix can be empty, and the op operation is not performed;

用矩阵乘向量的计算方法完成矩阵op(S)与向量P之间的矩阵-向量乘法计算;The matrix-vector multiplication calculation between the matrix op(S) and the vector P is completed by the calculation method of matrix multiplication vector;

利用主处理电路的算术逻辑电路对op(S)*P的结果中的每一个值进行乘以alpha的操作;Use the arithmetic logic circuit of the main processing circuit to multiply each value in the result of op(S)*P by alpha;

在一种可选方案中,alpha为1的情况下乘以alpha操作不进行;In an optional solution, the multiplication by alpha operation is not performed when alpha is 1;

利用主处理电路的算术逻辑电路,实现beta*C的运算;Use the arithmetic logic circuit of the main processing circuit to realize the operation of beta*C;

在一种可选方案中,beta为1的情况下,不进行乘以beta操作;In an optional solution, when beta is 1, the multiplication by beta operation is not performed;

利用主处理电路的算术逻辑电路,实现矩阵alpha*op(S)*P和beta*C之间对应位置相加的步骤;Use the arithmetic logic circuit of the main processing circuit to realize the steps of adding the corresponding positions between the matrices alpha*op(S)*P and beta*C;

在一种可选方案中,beta为0的情况下,不进行相加操作;In an alternative solution, when beta is 0, no addition operation is performed;
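Analogously to the GEMM sketch above, a minimal illustration of the GEMV steps (op limited to an optional transpose; illustrative names only, not the device's actual interface):

```python
# Sketch of C = alpha * op(S) * P + beta * C for a matrix S and vectors P, C.
# Illustrative only.

def gemv(alpha, S, P, beta, C, op_s=None):
    if op_s == "T":
        S = [list(col) for col in zip(*S)]           # transpose
    R = [sum(s * p for s, p in zip(row, P)) for row in S]
    if alpha != 1:
        R = [alpha * x for x in R]
    if beta == 0:
        return R
    return [r + beta * c for r, c in zip(R, C)]

S = [[1, 2], [3, 4]]
print(gemv(1, S, [1, 1], 0, [0, 0]))   # [3, 7]
```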

实现数据类型转换Implement data type conversion

利用主处理电路的数据类型转换运算电路实现将数据类型的转换;The data type conversion is realized by using the data type conversion operation circuit of the main processing circuit;

在一种可选方案中,数据类型转化的形式包括但不限于:浮点数转定点数和定点数转浮点数等;In an optional solution, the form of data type conversion includes but is not limited to: floating-point number to fixed-point number, fixed-point number to floating-point number, etc.;
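For illustration of the floating-point/fixed-point conversion, the following is a minimal sketch using a simple signed fixed-point format with a configurable number of fractional bits; this particular format is an assumption for illustration only and is not the format defined by the data type conversion circuit:

```python
# Sketch of floating-point <-> fixed-point conversion with `frac_bits`
# fractional bits. The chosen format is an illustrative assumption.

def float_to_fixed(x, frac_bits=8):
    return int(round(x * (1 << frac_bits)))      # store as an integer

def fixed_to_float(q, frac_bits=8):
    return q / (1 << frac_bits)

x = 3.14159
q = float_to_fixed(x)          # 804 with 8 fractional bits
print(q, fixed_to_float(q))    # 804 3.140625 (quantization error ~0.001)
```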

更新权值的方法:How to update weights:

利用主处理电路的向量运算器电路实现神经网络训练过程中的权值更新功能,具体地,权值更新是指使用权值的梯度来更新权值的方法。The vector operator circuit of the main processing circuit is used to implement the weight update function in the training process of the neural network. Specifically, the weight update refers to a method of using the gradient of the weight to update the weight.

在一种可选方案中,使用主处理电路的向量运算器电路对权值和权值梯度这两个向量进行加减运算得到运算结果,该运算结果即为更新权值。In an optional solution, the vector operator circuit of the main processing circuit is used to perform addition and subtraction operations on the two vectors of the weight value and the weight value gradient to obtain an operation result, and the operation result is the updated weight value.

In an optional solution, the vector operator circuit of the main processing circuit multiplies or divides the weights and the weight gradients by a number to obtain intermediate weights and intermediate weight gradient values, and then performs addition and subtraction on the intermediate weights and the intermediate weight gradient values to obtain the operation result, which is the updated weights.

在一种可选方案中,可以先使用权值的梯度计算出一组动量,然后再使用动量与权值进行加减计算得到更新后的权值。In an optional solution, a set of momentums can be first calculated using the gradient of the weights, and then the updated weights can be obtained by adding and subtracting the momentums and the weights.
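A minimal sketch of the momentum-based update mentioned above; the specific update rule, learning rate and momentum coefficient are illustrative assumptions rather than the device's defined method:

```python
# Sketch of weight update using gradients and momentum:
# new_momentum = mu * momentum + gradient; new_weight = weight - lr * new_momentum.
# The update rule shown is one common choice, given only as an illustration.

def update_weights(weights, grads, momentum, lr=0.01, mu=0.9):
    new_momentum = [mu * m + g for m, g in zip(momentum, grads)]
    new_weights = [w - lr * nm for w, nm in zip(weights, new_momentum)]
    return new_weights, new_momentum

w = [0.5, -0.2]
g = [0.1, 0.3]
m = [0.0, 0.0]
w, m = update_weights(w, g, m)
print(w, m)   # approximately [0.499, -0.203] [0.1, 0.3]
```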

实现全连接层的反向运算的方法A method to implement the reverse operation of the fully connected layer

全连接层的反向运算可以分成两部分,如下图中,实线箭头表示全连接层的正向计算过程,虚线部分表示全连接层的反向计算过程。The reverse operation of the fully connected layer can be divided into two parts, as shown in the figure below, the solid line arrow represents the forward calculation process of the fully connected layer, and the dotted line part represents the reverse calculation process of the fully connected layer.

As can be seen from the figure above, the backward operation of the fully connected layer can be completed using the matrix multiplication method of the device described above;

实现卷积层的反向运算;Implement the reverse operation of the convolutional layer;

The backward operation of the convolutional layer can be divided into two parts: in Figure 4a below, the solid arrows indicate the forward computation process of the convolutional layer, while Figure 4b shows the backward computation process of the convolutional layer.

The backward operation of the convolutional layer shown in Figures 4a and 4b may be completed with the device shown in Figure 1a or the device shown in Figure 1b. Performing the forward operation or the backward operation in fact involves a plurality of neural network operations, including but not limited to one of, or any combination of, matrix-times-matrix, matrix-times-vector, convolution and activation operations; these operations are performed in the manners described in this disclosure and are not repeated here.

Embodiments of the present disclosure provide a neural network processor board card that can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, smart homes, household appliances, multiprocessor systems, microprocessor-based systems, robots, programmable consumer electronic devices, network personal computers (PCs), minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and the like.

请参照图5a,图5a为本披露实施例提供的一种神经网络处理器板卡的结构示意图。如图5a所示,上述神经网络处理器板卡10包括神经网络芯片封装结构11、第一电气及非电气连接装置12和第一基板(substrate)13。Please refer to FIG. 5a, which is a schematic structural diagram of a neural network processor board according to an embodiment of the present disclosure. As shown in FIG. 5 a , the neural network processor board 10 includes a neural network chip package structure 11 , a first electrical and non-electrical connection device 12 and a first substrate 13 .

The present disclosure does not limit the specific structure of the neural network chip package structure 11. Optionally, as shown in Figure 5b, the neural network chip package structure 11 includes a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.

The specific form of the neural network chip 111 involved in the present disclosure is not limited. The neural network chip 111 includes, but is not limited to, a neural network die integrating a neural network processor, and the die may be made of silicon, germanium, quantum materials, molecular materials, or the like. According to the actual situation (for example, a harsh environment) and different application requirements, the neural network die may be packaged so that most of the die is enclosed, while the pins on the die are connected to the outside of the package structure through conductors such as gold wires for electrical connection with outer layers.

本披露对于神经网络芯片111的具体结构不作限定,可选的,请参照图1a或图1b所示的装置。The present disclosure does not limit the specific structure of the neural network chip 111. Optionally, please refer to the device shown in FIG. 1a or FIG. 1b.

本披露对于第一基板13和第二基板113的类型不做限定,可以是印制电路板(printed circuit board,PCB)或(printed wiring board,PWB),还可能为其它电路板。对PCB的制作材料也不做限定。The present disclosure does not limit the types of the first substrate 13 and the second substrate 113, which may be a printed circuit board (PCB) or a printed wiring board (PWB), and may also be other circuit boards. The material for making the PCB is also not limited.

The second substrate 113 involved in the present disclosure is used to carry the neural network chip 111. The neural network chip package structure 11, obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112, is used to protect the neural network chip 111 and to facilitate further packaging of the neural network chip package structure 11 with the first substrate 13.

The packaging manner of the second electrical and non-electrical connection device 112 and the structure corresponding to that packaging manner are not limited; an appropriate packaging manner can be selected and simply adapted according to the actual situation and different application requirements, for example: Flip Chip Ball Grid Array Package (FCBGAP), Low-profile Quad Flat Package (LQFP), Quad Flat Package with Heat sink (HQFP), Quad Flat Non-lead Package (QFN), or Fine-pitch Ball Grid Array (FBGA) packaging.

Flip Chip packaging is suitable for cases in which the packaged area needs to be small or in which the design is sensitive to lead inductance and signal transmission time. In addition, Wire Bonding packaging can be used, which reduces cost and increases the flexibility of the package structure.

Ball Grid Array packaging can provide more pins, and the average lead length of the pins is short, allowing high-speed signal transmission; the package may alternatively be replaced by a Pin Grid Array (PGA), Zero Insertion Force (ZIF), Single Edge Contact Connection (SECC), Land Grid Array (LGA), or the like.

可选的,采用倒装芯片球栅阵列(Flip Chip Ball Grid Array)的封装方式对神经网络芯片111和第二基板113进行封装,具体的神经网络芯片封装结构的示意图可参照图6。如图6所示,上述神经网络芯片封装结构包括:神经网络芯片21、焊盘22、焊球23、第二基板24、第二基板24上的连接点25、引脚26。Optionally, the neural network chip 111 and the second substrate 113 are packaged in a flip chip ball grid array (Flip Chip Ball Grid Array) packaging method. Refer to FIG. 6 for a schematic diagram of a specific neural network chip packaging structure. As shown in FIG. 6 , the above-mentioned neural network chip package structure includes: a neural network chip 21 , pads 22 , solder balls 23 , a second substrate 24 , connection points 25 and pins 26 on the second substrate 24 .

The pads 22 are connected to the neural network chip 21, and solder balls 23 are formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, thereby connecting the neural network chip 21 and the second substrate 24 and realizing the packaging of the neural network chip 21.

The pins 26 are used for connection with an external circuit of the package structure (for example, the first substrate 13 on the neural network processor board card 10), enabling the transmission of external and internal data and facilitating data processing by the neural network chip 21, or by the neural network processor corresponding to the neural network chip 21. The type and number of pins are also not limited in the present disclosure; different pin forms can be selected according to different packaging technologies and arranged according to certain rules.

可选的,上述神经网络芯片封装结构还包括绝缘填充物,置于焊盘22、焊球23和连接点25之间的空隙中,用于防止焊球与焊球之间产生干扰。Optionally, the above-mentioned neural network chip package structure further includes insulating fillers, which are placed in the gaps between the pads 22 , the solder balls 23 and the connection points 25 , to prevent interference between the solder balls and the solder balls.

其中,绝缘填充物的材料可以是氮化硅、氧化硅或氧氮化硅;干扰包含电磁干扰、电感干扰等。Wherein, the material of the insulating filler may be silicon nitride, silicon oxide or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.

可选的,上述神经网络芯片封装结构还包括散热装置,用于散发神经网络芯片21运行时的热量。其中,散热装置可以是一块导热性良好的金属片、散热片或散热器,例如,风扇。Optionally, the above-mentioned neural network chip packaging structure further includes a heat dissipation device for dissipating heat during operation of the neural network chip 21 . Wherein, the heat dissipation device may be a metal sheet, a heat sink or a heat sink with good thermal conductivity, such as a fan.

举例来说,如图6a所示,神经网络芯片封装结构11包括:神经网络芯片21、焊盘22、焊球23、第二基板24、第二基板24上的连接点25、引脚26、绝缘填充物27、散热膏28和金属外壳散热片29。其中,散热膏28和金属外壳散热片29用于散发神经网络芯片21运行时的热量。For example, as shown in FIG. 6a, the neural network chip package structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, Insulation filler 27, thermal paste 28 and metal shell heat sink 29. Among them, the heat dissipation paste 28 and the metal shell heat dissipation fins 29 are used to dissipate the heat of the neural network chip 21 during operation.

可选的,上述神经网络芯片封装结构11还包括补强结构,与焊盘22连接,且内埋于焊球23中,以增强焊球23与焊盘22之间的连接强度。Optionally, the above-mentioned neural network chip package structure 11 further includes a reinforcing structure, which is connected to the pads 22 and embedded in the solder balls 23 to enhance the connection strength between the solder balls 23 and the pads 22 .

其中,补强结构可以是金属线结构或柱状结构,在此不做限定。Wherein, the reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.

The present disclosure also does not limit the specific form of the first electrical and non-electrical connection device 12; reference may be made to the description of the second electrical and non-electrical connection device 112, i.e., the neural network chip package structure 11 may be mounted by soldering, or the second substrate 113 and the first substrate 13 may be connected by connecting wires or by a pluggable connection, which facilitates subsequent replacement of the first substrate 13 or of the neural network chip package structure 11.

Optionally, the first substrate 13 includes interfaces for memory units used to expand the storage capacity, for example synchronous dynamic random access memory (SDRAM), double data rate SDRAM (DDR), and the like; expanding the memory improves the processing capability of the neural network processor.

The first substrate 13 may further include a Peripheral Component Interconnect-Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) bus interface, and the like, for data transmission between the package structure and external circuits, which can improve the operation speed and the convenience of operation.

The neural network processor is packaged as the neural network chip 111, the neural network chip 111 is packaged as the neural network chip package structure 11, and the neural network chip package structure 11 is packaged as the neural network processor board card 10, which exchanges data with an external circuit (for example, a computer motherboard) through an interface (a slot or a connector) on the board card; that is, the function of the neural network processor is realized directly by using the neural network processor board card 10, and the neural network chip 111 is protected. Other modules may also be added to the neural network processor board card 10, which increases the application scope and operation efficiency of the neural network processor.

在一个实施例里,本公开公开了一个电子装置,其包括了上述神经网络处理器板卡10或神经网络芯片封装结构11。In one embodiment, the present disclosure discloses an electronic device including the above-mentioned neural network processor board 10 or neural network chip package structure 11 .

电子装置包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。Electronic devices include data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cameras, video cameras, projectors, watches, headphones, mobile storage , wearable devices, vehicles, home appliances, and/or medical devices.

所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。The vehicles include airplanes, ships and/or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.

以上所述的具体实施例,对本披露的目的、技术方案和有益效果进行了进一步详细说明,所应理解的是,以上所述仅为本披露的具体实施例而已,并不用于限制本披露,凡在本披露的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本披露的保护范围之内。The specific embodiments described above further describe the purpose, technical solutions and beneficial effects of the present disclosure in further detail. It should be understood that the above are only specific embodiments of the present disclosure, and are not intended to limit the present disclosure. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this disclosure shall be included within the protection scope of this disclosure.

Claims (19)

1.一种集成电路芯片装置,其特征在于,所述集成电路芯片装置包括:主处理电路以及多个基础处理电路;所述主处理电路或多个基础处理电路中至少一个电路包括:数据类型运算电路;所述数据类型运算电路,用于执行浮点类型数据与定点类型数据之间的转换;1. An integrated circuit chip device, characterized in that the integrated circuit chip device comprises: a main processing circuit and a plurality of basic processing circuits; at least one circuit in the main processing circuit or the plurality of basic processing circuits comprises: a data type an arithmetic circuit; the data type arithmetic circuit is used to perform conversion between floating-point type data and fixed-point type data; 所述多个基础处理电路呈阵列分布;每个基础处理电路与相邻的其他基础处理电路连接,所述主处理电路连接第1行的n个基础处理电路、第m行的n个基础处理电路以及第1列的m个基础处理电路;The plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected to other adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits in the first row and the n basic processing circuits in the mth row. circuit and m basic processing circuits in column 1; 所述主处理电路,用于执行神经网络运算中的各个连续的运算以及和与其相连的所述基础处理电路传输数据;The main processing circuit is used to perform each successive operation in the neural network operation and transmit data with the basic processing circuit connected to it; 所述多个基础处理电路,用于依据传输的数据以并行方式执行神经网络中的运算,并将运算结果通过与所述主处理电路连接的基础处理电路传输给所述主处理电路。The plurality of basic processing circuits are used to perform operations in the neural network in parallel according to the transmitted data, and transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit. 2.根据权利要求1所述的集成电路芯片装置,其特征在于,2. The integrated circuit chip device according to claim 1, wherein, 所述主处理电路,用于获取待计算的数据块以及运算指令,通过所述数据类型运算电路将所述待计算的数据块转换成定点类型的数据块,依据该运算指令对所述定点类型的待计算的数据块划分成分发数据块以及广播数据块;对所述分发数据块进行拆分处理得到多个基本数据块,将所述多个基本数据块分发至与其连接的基础处理电路,将所述广播数据块广播至与其连接的基础处理电路;The main processing circuit is used to obtain the data block to be calculated and an operation instruction, and convert the data block to be calculated into a fixed-point type data block through the data type operation circuit, and the fixed-point type is processed according to the operation instruction. The data block to be calculated is divided into a distribution data block and a broadcast data block; the distribution data block is split to obtain a plurality of basic data blocks, and the plurality of basic data blocks are distributed to the basic processing circuit connected to it, broadcasting the broadcast data block to underlying processing circuits connected thereto; 所述基础处理电路,用于对所述基本数据块与所述广播数据块以定点数据类型执行内积运算得到运算结果,将所述运算结果发送至所述主处理电路;the basic processing circuit, configured to perform an inner product operation on the basic data block and the broadcast data block in a fixed-point data type to obtain an operation result, and send the operation result to the main processing circuit; 或将所述基本数据块与所述广播数据块转发给其他基础处理电路以定点数据类型执行内积运算得到运算结果,将所述运算结果发送至所述主处理电路;Or forward the basic data block and the broadcast data block to other basic processing circuits to perform an inner product operation with a fixed-point data type to obtain an operation result, and send the operation result to the main processing circuit; 所述主处理电路,用于通过所述数据类型运算电路将对所述运算结果转换成浮点类型数据,将浮点类型数据处理得到所述待计算的数据块以及运算指令的指令结果。The main processing circuit is configured to convert the operation result into floating point type data through the data type operation circuit, and process the floating point type data to obtain the data block to be calculated and the instruction result of the operation instruction. 
3. The integrated circuit chip device according to claim 2, characterized in that the main processing circuit is specifically configured to send the broadcast data block to the basic processing circuits connected to it in a single broadcast.

4. The integrated circuit chip device according to claim 2, characterized in that the basic processing circuit is specifically configured to perform inner-product processing on the basic data block and the broadcast data block in the fixed-point data type to obtain an inner-product result, accumulate the inner-product results to obtain an operation result, and send the operation result to the main processing circuit.

5. The integrated circuit chip device according to claim 4, characterized in that the main processing circuit is configured to, when the operation result is a result of inner-product processing, accumulate the operation results to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the data block to be computed and the operation instruction.

6. The integrated circuit chip device according to claim 2, characterized in that the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the plurality of partial broadcast data blocks to the basic processing circuits over multiple broadcasts; the plurality of partial broadcast data blocks combine to form the broadcast data block.

7. The integrated circuit chip device according to claim 6, characterized in that the basic processing circuit is specifically configured to perform one pass of inner-product processing on the partial broadcast data block and the basic data block in the fixed-point data type to obtain an inner-product result, accumulate the inner-product results to obtain a partial operation result, and send the partial operation result to the main processing circuit.

8. The integrated circuit chip device according to claim 7, characterized in that the basic processing circuit is specifically configured to reuse the partial broadcast data block n times, performing the inner-product operation of the partial broadcast data block with each of n basic data blocks to obtain n partial processing results, accumulate the n partial processing results respectively to obtain n partial operation results, and send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.

9. The integrated circuit chip device according to claim 1, characterized in that the main processing circuit comprises a main register or a main on-chip cache circuit, and the basic processing circuit comprises a basic register or a basic on-chip cache circuit.

10. The integrated circuit chip device according to claim 9, characterized in that the main processing circuit comprises one or any combination of: a vector arithmetic circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a data type conversion circuit, or a data rearrangement circuit.
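Claims 2 through 8 describe a distribute/broadcast data flow: the main processing circuit splits the distribution data block into basic data blocks, broadcasts the broadcast data block, and each basic processing circuit accumulates fixed-point inner products before returning its partial results. The Python sketch below simulates that flow in software for a small matrix-vector workload; the row-wise split, the number of circuits, and the use of plain integers as stand-ins for fixed-point values are illustrative assumptions, not features of the claimed device.

```python
# Software simulation of the distribute/broadcast inner-product flow of
# claims 2-8. Integers stand in for fixed-point values; the row-wise split
# across two circuits is an assumption made for illustration.

def split_rows(matrix, num_circuits):
    """Split the distribution data block into basic data blocks by rows."""
    return [matrix[i::num_circuits] for i in range(num_circuits)]

def basic_circuit(basic_block, broadcast_block):
    """One basic processing circuit: inner product plus accumulation (claim 4)."""
    results = []
    for row in basic_block:
        acc = 0
        for a, b in zip(row, broadcast_block):
            acc += a * b                # fixed-point multiply-accumulate
        results.append(acc)
    return results

if __name__ == "__main__":
    weights = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # distribution block
    inputs = [1, 0, 2]                                          # broadcast block
    partials = [basic_circuit(b, inputs) for b in split_rows(weights, 2)]
    print(partials)   # [[7, 25], [16, 34]]; the main circuit gathers and rearranges these
```

Claims 6 to 8 refine the same flow by sending the broadcast block in chunks that each basic circuit reuses against several basic data blocks; the accumulation step above would then run once per chunk.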
11. The integrated circuit chip device according to claim 1, characterized in that the main processing circuit is configured to obtain a data block to be computed and an operation instruction, and divide the data block to be computed into a distribution data block and a broadcast data block according to the operation instruction; split the distribution data block into a plurality of basic data blocks, distribute the plurality of basic data blocks to the basic processing circuits connected to it, and broadcast the broadcast data block to the basic processing circuits connected to it; the basic processing circuit is configured to convert the basic data block and the broadcast data block into fixed-point data blocks, perform the inner-product operation on the fixed-point data blocks to obtain an operation result, convert the operation result into floating-point data, and send it to the main processing circuit; or to convert the basic data block and the broadcast data block into fixed-point data blocks, forward the fixed-point data blocks to other basic processing circuits that perform the inner-product operation to obtain an operation result, convert the operation result into floating-point data, and send it to the main processing circuit; the main processing circuit is configured to process the operation result to obtain the instruction result of the data block to be computed and the operation instruction.

12. The integrated circuit chip device according to claim 1, characterized in that the data is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, or an n-dimensional data block.

13. The integrated circuit chip device according to claim 2, characterized in that, if the operation instruction is a multiplication instruction, the main processing circuit determines the multiplier data block to be the broadcast data block and the multiplicand data block to be the distribution data block; or, if the operation instruction is a convolution instruction, the main processing circuit determines the convolution input data block to be the broadcast data block and the convolution kernel to be the distribution data block.

14. A neural network computation device, characterized in that the neural network computation device comprises one or more integrated circuit chip devices according to any one of claims 1-13.

15. A combined processing device, characterized in that the combined processing device comprises the neural network computation device according to claim 14, a general interconnection interface, and a general-purpose processing device; the neural network computation device is connected to the general-purpose processing device through the general interconnection interface.
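Claim 13 assigns operand roles according to the operation instruction: for a multiplication the multiplier is broadcast and the multiplicand is distributed, while for a convolution the input data is broadcast and the kernel is distributed. A hedged Python sketch of that selection step follows; the instruction names and the operand dictionary are illustrative assumptions only, not the claimed control logic.

```python
# Illustrative operand-role selection corresponding to claim 13.
# Instruction names and the operand container are assumptions for this sketch.

def select_roles(instruction, operands):
    """Return (broadcast_block, distribution_block) for a given instruction."""
    if instruction == "multiply":
        # multiplier is broadcast, multiplicand is distributed
        return operands["multiplier"], operands["multiplicand"]
    if instruction == "convolution":
        # input data is broadcast, convolution kernel is distributed
        return operands["input"], operands["kernel"]
    raise ValueError(f"unsupported instruction: {instruction}")

if __name__ == "__main__":
    bcast, dist = select_roles(
        "multiply",
        {"multiplier": [1, 2], "multiplicand": [[3], [4]]},
    )
    print(bcast, dist)   # [1, 2] [[3], [4]]
```

The intuition behind the split is that the broadcast operand is shared by every basic processing circuit, whereas the distributed operand is partitioned across them.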
16. A chip, characterized in that the chip integrates the device according to any one of claims 1-15.

17. An electronic device, characterized in that the electronic device comprises the chip according to claim 16.

18. A neural network computation method, characterized in that the method is applied in an integrated circuit chip device, the integrated circuit chip device comprising the integrated circuit chip device according to any one of claims 1-13, and the integrated circuit chip device is configured to perform the neural network computation.

19. The method according to claim 16, characterized in that the neural network computation comprises one or any combination of: a convolution operation, a matrix-by-matrix multiplication operation, a matrix-by-vector multiplication operation, a bias operation, a fully-connected operation, a GEMM operation, a GEMV operation, and an activation operation.
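Claim 19 lists the supported neural network computations, several of which (matrix-by-vector multiplication, GEMV, fully-connected, bias) reduce to the same inner-product primitive used throughout the earlier claims. The short sketch below shows that reduction as a plain software reference; NumPy is assumed here only as convenient notation and is not part of the claimed device.

```python
# Software reference relating claim 19's operation list to the inner-product
# primitive: a GEMV and a fully-connected layer with bias.

import numpy as np

def gemv(matrix, vector):
    """Reference GEMV: each matrix row pairs with the broadcast vector."""
    return matrix @ vector

def fully_connected(inputs, weights, bias):
    """A fully-connected layer is a GEMV followed by a bias addition."""
    return gemv(weights, inputs) + bias

if __name__ == "__main__":
    W = np.arange(12).reshape(4, 3)      # weight matrix (distribution block)
    x = np.array([1.0, 0.0, 2.0])        # input vector (broadcast block)
    b = np.ones(4)                       # bias
    print(fully_connected(x, W, b))      # [ 5. 14. 23. 32.]
```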
CN201711346335.9A 2017-12-14 2017-12-14 Integrated circuit chip devices and related products Active CN109961136B (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN201911401050.XA CN110909872B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products
CN202010040822.8A CN111242294B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products
CN201711346335.9A CN109961136B (en) 2017-12-14 2017-12-14 Integrated circuit chip devices and related products
CN201911333469.6A CN110826712B (en) 2017-12-14 2017-12-14 Neural network processor board card and related products
CN201911401046.3A CN111160542B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711346335.9A CN109961136B (en) 2017-12-14 2017-12-14 Integrated circuit chip devices and related products

Related Child Applications (4)

Application Number Title Priority Date Filing Date
CN201911401046.3A Division CN111160542B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products
CN202010040822.8A Division CN111242294B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products
CN201911333469.6A Division CN110826712B (en) 2017-12-14 2017-12-14 Neural network processor board card and related products
CN201911401050.XA Division CN110909872B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products

Publications (2)

Publication Number Publication Date
CN109961136A true CN109961136A (en) 2019-07-02
CN109961136B CN109961136B (en) 2020-05-19

Family

ID=67018613

Family Applications (5)

Application Number Title Priority Date Filing Date
CN201911333469.6A Active CN110826712B (en) 2017-12-14 2017-12-14 Neural network processor board card and related products
CN201911401050.XA Active CN110909872B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products
CN201911401046.3A Active CN111160542B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products
CN202010040822.8A Active CN111242294B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products
CN201711346335.9A Active CN109961136B (en) 2017-12-14 2017-12-14 Integrated circuit chip devices and related products

Family Applications Before (4)

Application Number Title Priority Date Filing Date
CN201911333469.6A Active CN110826712B (en) 2017-12-14 2017-12-14 Neural network processor board card and related products
CN201911401050.XA Active CN110909872B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products
CN201911401046.3A Active CN111160542B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products
CN202010040822.8A Active CN111242294B (en) 2017-12-14 2017-12-14 Integrated circuit chip device and related products

Country Status (1)

Country Link
CN (5) CN110826712B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978151A (en) * 2017-12-27 2019-07-05 北京中科寒武纪科技有限公司 Neural network processor board and Related product
CN109978154A (en) * 2017-12-28 2019-07-05 北京中科寒武纪科技有限公司 Integrated circuit chip device and Related product
US11544546B2 (en) 2017-12-27 2023-01-03 Cambricon Technologies Corporation Limited Integrated circuit chip device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738432B (en) * 2020-08-10 2020-12-29 电子科技大学 A Neural Network Processing Circuit Supporting Adaptive Parallel Computing

Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572011A (en) * 2014-12-22 2015-04-29 上海交通大学 FPGA (Field Programmable Gate Array)-based general matrix fixed-point multiplier and calculation method thereof
CN104866904A (en) * 2015-06-16 2015-08-26 中电科软件信息服务有限公司 Parallelization method of BP neural network optimized by genetic algorithm based on spark
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN105740946A (en) * 2015-07-29 2016-07-06 上海磁宇信息科技有限公司 Method for realizing neural network calculation by using cell array computing system
CN106126481A (en) * 2016-06-29 2016-11-16 华为技术有限公司 A kind of computing engines and electronic equipment
CN106502626A (en) * 2016-11-03 2017-03-15 北京百度网讯科技有限公司 Data processing method and device
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural network
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
US20170270073A1 (en) * 2016-03-18 2017-09-21 Qualcomm Incorporated Memory Reduction Method For Fixed Point Matrix Multiply
CN107229967A (en) * 2016-08-22 2017-10-03 北京深鉴智能科技有限公司 A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of method of optimized artificial neural network
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
CN107301454A (en) * 2016-04-15 2017-10-27 北京中科寒武纪科技有限公司 The artificial neural network reverse train apparatus and method for supporting discrete data to represent
CN107301453A (en) * 2016-04-15 2017-10-27 北京中科寒武纪科技有限公司 Artificial neural network forward operation device and method supporting discrete data representation
CN107315571A (en) * 2016-04-27 2017-11-03 北京中科寒武纪科技有限公司 A device and method for performing forward operation of fully connected layer neural network
CN107316078A (en) * 2016-04-27 2017-11-03 北京中科寒武纪科技有限公司 Apparatus and method for performing artificial neural network self-learning operation
CN107329734A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A device and method for performing forward operation of convolutional neural network
CN107330515A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A device and method for performing forward operation of artificial neural network
CN107341542A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 Apparatus and method for performing recurrent neural network and LSTM operations
CN107341547A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 An apparatus and method for performing convolutional neural network training
CN107341541A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 An apparatus and method for performing fully connected layer neural network training
CN107451658A (en) * 2017-07-24 2017-12-08 杭州菲数科技有限公司 Floating-point operation fixed point method and system

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000124406A (en) * 1998-10-16 2000-04-28 Synthesis Corp Data communication device for integrated circuit, integrated circuit chip, and integrated circuit using this integrated circuit chip
US9818136B1 (en) * 2003-02-05 2017-11-14 Steven M. Hoffberg System and method for determining contingent relevance
US7609561B2 (en) * 2006-01-18 2009-10-27 Apple Inc. Disabling faulty flash memory dies
US8006021B1 (en) * 2008-03-27 2011-08-23 Xilinx, Inc. Processor local bus bridge for an embedded processor block core in an integrated circuit
WO2010080155A1 (en) * 2009-01-09 2010-07-15 Lsi Corporation Systems and methods for adaptive target search
CN101859172B (en) * 2009-04-07 2012-02-08 上海摩波彼克半导体有限公司 Integrated circuit SoC chip circuit structure capable of realizing power reduction and method thereof
DE102012220365A1 (en) * 2011-11-10 2013-05-16 Nvidia Corp. Method for preempting execution of program instructions in multi-process-assisted system, involves executing different program instructions in processing pipeline under utilization of one of contexts
CN102495719B (en) * 2011-12-15 2014-09-24 中国科学院自动化研究所 A vector floating point operation device and method
WO2015144950A1 (en) * 2014-03-28 2015-10-01 Universidad De Málaga Arithmetic units and related converters
US10489703B2 (en) * 2015-05-20 2019-11-26 Nec Corporation Memory efficiency for convolutional neural networks operating on graphics processing units
CN104915322B (en) * 2015-06-09 2018-05-01 中国人民解放军国防科学技术大学 A kind of hardware-accelerated method of convolutional neural networks
CN106485318B (en) * 2015-10-08 2019-08-30 上海兆芯集成电路有限公司 Processor with Hybrid Coprocessor/Execution Unit Neural Network Unit
EP3154001B1 (en) * 2015-10-08 2019-07-17 VIA Alliance Semiconductor Co., Ltd. Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory
US10671564B2 (en) * 2015-10-08 2020-06-02 Via Alliance Semiconductor Co., Ltd. Neural network unit that performs convolutions using collective shift register among array of neural processing units
CN106570559A (en) * 2015-10-09 2017-04-19 阿里巴巴集团控股有限公司 Data processing method and device based on neural network
CN105404925A (en) * 2015-11-02 2016-03-16 上海新储集成电路有限公司 Three-dimensional nerve network chip
CN106022468B (en) * 2016-05-17 2018-06-01 成都启英泰伦科技有限公司 the design method of artificial neural network processor integrated circuit and the integrated circuit
CN107273621B (en) * 2017-06-21 2020-10-09 上海研鸥信息科技有限公司 Transplanting method of FPGA application circuit
CN107368857A (en) * 2017-07-24 2017-11-21 深圳市图芯智能科技有限公司 Image object detection method, system and model treatment method, equipment, terminal

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572011A (en) * 2014-12-22 2015-04-29 上海交通大学 FPGA (Field Programmable Gate Array)-based general matrix fixed-point multiplier and calculation method thereof
CN104866904A (en) * 2015-06-16 2015-08-26 中电科软件信息服务有限公司 Parallelization method of BP neural network optimized by genetic algorithm based on spark
CN105740946A (en) * 2015-07-29 2016-07-06 上海磁宇信息科技有限公司 Method for realizing neural network calculation by using cell array computing system
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
US20170270073A1 (en) * 2016-03-18 2017-09-21 Qualcomm Incorporated Memory Reduction Method For Fixed Point Matrix Multiply
CN107301453A (en) * 2016-04-15 2017-10-27 北京中科寒武纪科技有限公司 Artificial neural network forward operation device and method supporting discrete data representation
CN107301454A (en) * 2016-04-15 2017-10-27 北京中科寒武纪科技有限公司 The artificial neural network reverse train apparatus and method for supporting discrete data to represent
CN107315571A (en) * 2016-04-27 2017-11-03 北京中科寒武纪科技有限公司 A device and method for performing forward operation of fully connected layer neural network
CN107316078A (en) * 2016-04-27 2017-11-03 北京中科寒武纪科技有限公司 Apparatus and method for performing artificial neural network self-learning operation
CN107341547A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 An apparatus and method for performing convolutional neural network training
CN107330515A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A device and method for performing forward operation of artificial neural network
CN107329734A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A device and method for performing forward operation of convolutional neural network
CN107341542A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 Apparatus and method for performing recurrent neural network and LSTM operations
CN107341541A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 An apparatus and method for performing fully connected layer neural network training
CN106126481A (en) * 2016-06-29 2016-11-16 华为技术有限公司 A kind of computing engines and electronic equipment
CN107239829A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of method of optimized artificial neural network
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
CN107229967A (en) * 2016-08-22 2017-10-03 北京深鉴智能科技有限公司 A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
CN106502626A (en) * 2016-11-03 2017-03-15 北京百度网讯科技有限公司 Data processing method and device
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural network
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN107451658A (en) * 2017-07-24 2017-12-08 杭州菲数科技有限公司 Floating-point operation fixed point method and system

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978151A (en) * 2017-12-27 2019-07-05 北京中科寒武纪科技有限公司 Neural network processor board and Related product
US11544546B2 (en) 2017-12-27 2023-01-03 Cambricon Technologies Corporation Limited Integrated circuit chip device
US11741351B2 (en) 2017-12-27 2023-08-29 Cambricon Technologies Corporation Limited Integrated circuit chip device
US11748603B2 (en) 2017-12-27 2023-09-05 Cambricon Technologies Corporation Limited Integrated circuit chip device
US11748602B2 (en) 2017-12-27 2023-09-05 Cambricon Technologies Corporation Limited Integrated circuit chip device
US11748605B2 (en) 2017-12-27 2023-09-05 Cambricon Technologies Corporation Limited Integrated circuit chip device
US11748601B2 (en) 2017-12-27 2023-09-05 Cambricon Technologies Corporation Limited Integrated circuit chip device
US11748604B2 (en) 2017-12-27 2023-09-05 Cambricon Technologies Corporation Limited Integrated circuit chip device
US11983621B2 (en) 2017-12-27 2024-05-14 Cambricon Technologies Corporation Limited Integrated circuit chip device
CN109978154A (en) * 2017-12-28 2019-07-05 北京中科寒武纪科技有限公司 Integrated circuit chip device and Related product

Also Published As

Publication number Publication date
CN110826712B (en) 2024-01-09
CN110909872B (en) 2023-08-25
CN111160542A (en) 2020-05-15
CN111242294A (en) 2020-06-05
CN111160542B (en) 2023-08-29
CN110909872A (en) 2020-03-24
CN111242294B (en) 2023-08-25
CN110826712A (en) 2020-02-21
CN109961136B (en) 2020-05-19

Similar Documents

Publication Publication Date Title
US11748605B2 (en) Integrated circuit chip device
TWI793225B (en) Method for neural network training and related product
TWI768159B (en) Integrated circuit chip apparatus and related product
WO2019114842A1 (en) Integrated circuit chip apparatus
TWI791725B (en) Neural network operation method, integrated circuit chip device and related products
CN109961136B (en) Integrated circuit chip devices and related products
TWI767097B (en) Integrated circuit chip apparatus and related product
CN110197264B (en) Neural network processor boards and related products
TWI767098B (en) Method for neural network forward computation and related product
TWI793224B (en) Integrated circuit chip apparatus and related product
CN109977446B (en) Integrated circuit chip device and related product
CN109978152B (en) Integrated circuit chip device and related product
CN110197267B (en) Neural network processor boards and related products
TWI768160B (en) Integrated circuit chip apparatus and related product
CN109978153B (en) Integrated circuit chip device and related product
CN109978156B (en) Integrated circuit chip device and related product
WO2019165946A1 (en) Integrated circuit chip device, board card and related product
CN109977071A (en) Neural network processor board and Related product
TWI795482B (en) Integrated circuit chip apparatus and related product
CN109978130A (en) Integrated circuit chip device and Related product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant
GR01 Patent grant