TWI768159B - Integrated circuit chip apparatus and related product - Google Patents
- Publication number
- TWI768159B (application TW107144034A)
- Authority
- TW
- Taiwan
- Prior art keywords
- data
- circuit
- processing circuit
- basic
- circuits
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/061—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Molecular Biology (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Advance Control (AREA)
- Image Processing (AREA)
- Complex Calculations (AREA)
- Logic Circuits (AREA)
- Container Filling Or Packaging Operations (AREA)
Abstract
The present disclosure provides an integrated circuit chip device and related products. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits; the main processing circuit, or at least one of the plurality of basic processing circuits, includes a data type operation circuit used to perform conversion between floating-point data and fixed-point data. The technical solution provided by the present disclosure has the advantages of a small amount of computation and low power consumption.
Description
The present disclosure relates to the field of neural networks, and in particular to an integrated circuit chip device and related products.

The artificial neural network (ANN) has been a research hotspot in the field of artificial intelligence since the 1980s. It abstracts the neuron network of the human brain from the perspective of information processing, establishes a simple model, and forms different networks according to different connection methods. In engineering and academia it is often referred to simply as a neural network or neural-like network. A neural network is a computational model composed of a large number of interconnected nodes (also called neurons). Existing neural network operations are implemented on a CPU (Central Processing Unit) or GPU (Graphics Processing Unit), and such operations involve a large amount of computation and high power consumption.

Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can improve the processing speed and efficiency of a computing device.
In a first aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes a main processing circuit, k branch circuits, and k groups of basic processing circuits. The main processing circuit is connected to each of the k branch circuits, and each of the k branch circuits corresponds to one group of the k groups of basic processing circuits, where a group of basic processing circuits includes at least one basic processing circuit. Each branch circuit includes a data type operation circuit used to perform conversion between floating-point data and fixed-point data. The main processing circuit is used to perform each successive operation in the neural network computation and to transmit data with the k branch circuits connected to it. The k branch circuits are used to forward the transmitted data between the main processing circuit and the k groups of basic processing circuits, and to control, according to the operation to be performed on the transmitted data, whether to start the data type operation circuit to convert the type of the transmitted data. The k groups of basic processing circuits are used to perform the operations in the neural network in parallel according to the transmitted data or the converted transmitted data, and to transmit the operation results to the main processing circuit through the branch circuits connected to the main processing circuit.
In a second aspect, a neural network computing device is provided. The neural network computing device includes one or more of the integrated circuit chip devices provided in the first aspect.

In a third aspect, a combined processing device is provided. The combined processing device includes the neural network computing device provided in the second aspect, a general interconnection interface, and a general processing device; the neural network computing device is connected to the general processing device through the general interconnection interface.

In a fourth aspect, a chip is provided, which integrates the device of the first aspect, the device of the second aspect, or the device of the third aspect.

In a fifth aspect, an electronic device is provided, which includes the chip of the fourth aspect.

In a sixth aspect, a neural network computing method is provided. The method is applied in an integrated circuit chip device, the integrated circuit chip device being the one described in the first aspect, which is used to perform the operations of a neural network.

It can be seen that, in the embodiments of the present disclosure, a data type conversion operation circuit is provided to convert the type of a data block before computation, which saves transmission resources and computing resources; the embodiments therefore have the advantages of low power consumption and a small amount of computation.
S201, S202, S203, S204, S201b, S202b, S203b, S301, S302, S303, S304: steps
A, Ai, B, S: matrices
P: vector
10: neural network processor board
11: neural network chip package structure
12: first electrical and non-electrical connection device
13: first substrate
111: neural network chip
112: second electrical and non-electrical connection device
113: second substrate
1111: storage unit
1112: direct memory access unit
1113: instruction cache unit
1114: weight cache unit
1115: input neuron cache unit
1116: output neuron cache unit
1117: control unit
1118: operation unit
21: neural network chip
22: pad
23: solder ball
24: second substrate
25: connection point on the second substrate 24
26: pin
27: insulating filler
28: thermal paste
29: metal housing heat sink
FIG. 1a is a schematic structural diagram of an integrated circuit chip device.
FIG. 1b is a schematic structural diagram of another integrated circuit chip device.
FIG. 1c is a schematic structural diagram of a basic processing circuit.
FIG. 1d is a schematic structural diagram of a fixed-point data type.
FIG. 2 is a schematic flowchart of multiplying a matrix by a vector.
FIG. 2a is a schematic diagram of a matrix multiplied by a vector.
FIG. 2b is a schematic flowchart of multiplying a matrix by a matrix.
FIG. 2c is a schematic diagram of the matrix Ai multiplied by a vector.
FIG. 2d is a schematic diagram of the matrix A multiplied by the matrix B.
FIG. 2e is a schematic diagram of the matrix Ai multiplied by the matrix B.
FIG. 3a is a schematic diagram of neural network training.
FIG. 3b is a schematic diagram of a convolution operation.
FIG. 4a is a schematic diagram of the forward operation of a neural network.
FIG. 4b is a schematic diagram of the backward operation of a neural network.
FIG. 4c is a schematic structural diagram of a combined processing device disclosed in the present disclosure.
FIG. 4d is another schematic structural diagram of a combined processing device disclosed in the present disclosure.
FIG. 5a is a schematic diagram of another forward operation of a neural network.
FIG. 5b is a schematic diagram of another backward operation of a neural network.
FIG. 5c is a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure.
FIG. 5d is a schematic structural diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
FIG. 5e is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
FIG. 6a is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.
In order to enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of the present disclosure.
In the device provided in the first aspect, the main processing circuit is used to obtain a data block to be computed and an operation instruction, and to divide the data block to be computed into a distribution data block and a broadcast data block according to the operation instruction; to split the distribution data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the k branch circuits connected to it; and to broadcast the broadcast data block to the k branch circuits connected to it. The k branch circuits are used to receive the basic data blocks and the broadcast data block, to start the data type operation circuit to convert the basic data blocks and the broadcast data block into the fixed-point data type, and to forward the basic data blocks and the broadcast data block in the fixed-point data type to the k groups of basic processing circuits. The basic processing circuits are used to perform an inner product operation on a basic data block and the broadcast data block in the fixed-point data type to obtain an operation result, and to send the operation result to the k branch circuits. The k branch circuits are used to convert the operation results into floating-point operation results and to send the floating-point operation results to the main processing circuit. The main processing circuit is used to process the floating-point operation results to obtain the instruction result of the data block to be computed and the operation instruction.
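The distribution/broadcast flow described above can be modeled in software. The sketch below illustrates a matrix-times-vector case: the matrix plays the role of the distribution data block (its rows are the basic data blocks), the vector plays the role of the broadcast data block, and each "basic processing circuit" computes one fixed-point inner product before the result is converted back to floating point. The 8-fractional-bit fixed-point format and all function names are illustrative assumptions, not taken from the disclosure.

```python
# Software model (not the circuit itself) of the first-aspect data flow.
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS  # fixed-point scaling factor (assumed format)

def to_fixed(xs):
    """Branch-circuit step: convert floating-point data to fixed point."""
    return [round(x * SCALE) for x in xs]

def matvec(matrix, vector):
    # Main processing circuit: the matrix rows are the basic data blocks,
    # the vector is the broadcast data block.
    fixed_vec = to_fixed(vector)
    results = []
    for row in matrix:
        fixed_row = to_fixed(row)
        # Basic processing circuit: fixed-point inner product.
        acc = sum(a * b for a, b in zip(fixed_row, fixed_vec))
        # Branch circuit: convert back to floating point; the product of
        # two scaled operands carries a factor of SCALE**2.
        results.append(acc / (SCALE * SCALE))
    return results
```

With exactly representable inputs, the round trip is lossless; in general the fractional bit count bounds the quantization error.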
In the device provided in the first aspect, the main processing circuit is specifically used to broadcast the broadcast data block to the k branch circuits in a single broadcast.
In the device provided in the first aspect, the main processing circuit is specifically used to divide the broadcast data block into a plurality of partial broadcast data blocks and to broadcast the plurality of partial broadcast data blocks to the k branch circuits through multiple broadcasts.
In the device provided in the first aspect, the basic processing circuit is specifically used to perform one inner product processing on the partial broadcast data block and a basic data block in the fixed-point type to obtain an inner product processing result, to accumulate the inner product processing results to obtain a partial operation result, and to send the partial operation result to the k branch circuits; the k branch circuits are used to convert the partial operation result into floating-point data and send it to the main processing circuit.
In the device provided in the first aspect, the basic processing circuit is specifically used to reuse the partial broadcast data block n times, performing, in the fixed-point data type, the inner product operations of the partial broadcast data block with n basic data blocks to obtain n partial processing results in the fixed-point data type; to accumulate the n partial processing results respectively to obtain n partial operation results in the fixed-point type; and to send the n fixed-point partial operation results to the branch circuit. The branch circuit is used to convert the n fixed-point partial operation results into n floating-point partial operation results and to send them to the main processing circuit, where n is an integer greater than or equal to 2.
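The n-fold reuse described above can be sketched as follows: one partial broadcast data block is held locally and multiplied against n basic data blocks, and the n partial results are accumulated independently across successive partial broadcasts. Plain integers stand in for fixed-point data; the function name and interface are hypothetical.

```python
# Illustrative model of reusing one partial broadcast block n times.
def reuse_inner_products(partial_broadcast, basic_blocks, partial_sums=None):
    """Multiplex one broadcast block against n basic blocks, accumulating
    one running partial sum per basic block."""
    n = len(basic_blocks)
    if partial_sums is None:
        partial_sums = [0] * n
    for i, block in enumerate(basic_blocks):
        partial_sums[i] += sum(a * b for a, b in zip(block, partial_broadcast))
    return partial_sums
```

Calling the function once per partial broadcast data block accumulates the full inner products without retransmitting the basic data blocks.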
In the device provided in the first aspect, the main processing circuit includes a main register or a main on-chip cache circuit; or the branch circuit includes a basic register or a basic on-chip cache circuit; or the basic processing circuit includes a basic register or a basic on-chip cache circuit.

In the device provided in the first aspect, the main processing circuit includes one or any combination of a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access circuit, a data type operation circuit, or a data rearrangement circuit.

In the device provided in the first aspect, the data is one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, or an n-dimensional data block.
In the device provided in the first aspect, if the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is the broadcast data block and the multiplicand data block is the distribution data block; if the operation instruction is a convolution instruction, the main processing circuit determines that the input data block is the broadcast data block and the convolution kernel is the distribution data block.
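The operand-role selection described above amounts to a small dispatch on the instruction type. The sketch below is illustrative only; the opcode strings and the dictionary interface are assumptions, not part of the disclosure.

```python
# Hypothetical sketch of how the main processing circuit might assign
# operand roles (broadcast vs. distribution) based on the instruction.
def classify_operands(opcode, operands):
    if opcode == "mul":
        multiplicand, multiplier = operands
        # Multiplication: multiplier is broadcast, multiplicand distributed.
        return {"broadcast": multiplier, "distribute": multiplicand}
    if opcode == "conv":
        input_block, kernel = operands
        # Convolution: input data is broadcast, kernel distributed.
        return {"broadcast": input_block, "distribute": kernel}
    raise ValueError("unsupported opcode: " + opcode)
```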
In the method provided in the sixth aspect, the operations of the neural network include one or any combination of a convolution operation, a matrix-times-matrix operation, a matrix-times-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, or an activation operation.
Referring to FIG. 1a, FIG. 1a is a schematic structural diagram of an integrated circuit chip device. As shown in FIG. 1a, the chip device includes a main processing circuit, basic processing circuits, and branch processing circuits. Specifically, the integrated circuit chip device includes a main processing circuit, k branch circuits (as shown in FIG. 1a, k=4; of course, in practical applications, k may also take other values, for example 8, 16, and so on), and k groups of basic processing circuits. The main processing circuit is connected to each of the k branch circuits, and each of the k branch circuits corresponds to one group of the k groups of basic processing circuits, where a group of basic processing circuits includes at least one basic processing circuit. Each branch circuit includes a data type operation circuit used to perform conversion between floating-point data and fixed-point data. The main processing circuit is used to perform each successive operation in the neural network computation and to transmit data with the k branch circuits connected to it. The k branch circuits are used to forward the transmitted data between the main processing circuit and the k groups of basic processing circuits, and to control, according to the operation to be performed on the transmitted data, whether to start the data type operation circuit to convert the type of the transmitted data. The k groups of basic processing circuits are used to perform the operations in the neural network in parallel according to the transmitted data or the converted transmitted data, and to transmit the operation results to the main processing circuit through the branch circuits connected to the main processing circuit.
The main processing circuit may include a register and/or an on-chip cache circuit, and may further include a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a DMA (direct memory access) circuit, and the like. Of course, in practical applications, the main processing circuit may also add other circuits, such as a conversion circuit (for example, a matrix transpose circuit), a data rearrangement circuit, or an activation circuit. Optionally, the main processing circuit may include a data type conversion operation circuit, which may be used to convert received or transmitted data from floating-point data into fixed-point data; of course, in practical applications, it may also convert fixed-point data into floating-point data. The present invention does not limit the specific form of the above data type conversion operation circuit.
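The disclosure does not specify the internal format of the data type conversion operation circuit (FIG. 1d only sketches a fixed-point type). A common software model is scaling by a fixed number of fractional bits with saturation, as in the hypothetical sketch below; the bit widths and round-to-nearest behavior are assumptions.

```python
# Hypothetical model of float <-> fixed-point conversion. The 16-bit word
# and 8 fractional bits are illustrative assumptions, not from the patent.
def float_to_fixed(values, frac_bits=8, word_bits=16):
    """Quantize floats to signed fixed-point integers with saturation."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return [max(lo, min(hi, round(v * scale))) for v in values]

def fixed_to_float(values, frac_bits=8):
    """Convert fixed-point integers back to floating point."""
    scale = 1 << frac_bits
    return [v / scale for v in values]
```

Values exactly representable in the chosen format round-trip losslessly; others incur at most half a least-significant-bit of quantization error.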
The main processing circuit also includes a data transmitting circuit, a data receiving circuit, or an interface. The data transmitting circuit may integrate a data distribution circuit and a data broadcast circuit; of course, in practical applications, the data distribution circuit and the data broadcast circuit may also be provided separately. In practical applications, the data transmitting circuit and the data receiving circuit may also be integrated together to form a data transceiving circuit. Broadcast data is data that needs to be sent to every basic processing circuit. Distribution data is data that needs to be selectively sent to some of the basic processing circuits; the specific selection may be determined by the main processing circuit according to the load and the computation method. In the broadcast transmission mode, the broadcast data is sent to every basic processing circuit in broadcast form. (In practical applications, the broadcast data may be sent to every basic processing circuit by a single broadcast, or by multiple broadcasts; the specific embodiments of the present application do not limit the number of broadcasts.) In the distribution transmission mode, the distribution data is selectively sent to some of the basic processing circuits.

When distributing data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits (the data may be identical or different; specifically, if data is sent by distribution, the data received by each receiving basic processing circuit may differ, and of course some basic processing circuits may also receive identical data). Specifically, when broadcasting data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and every basic processing circuit that receives data may receive the same data.
Optionally, the vector operator circuit of the main processing circuit may perform vector operations, including but not limited to addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division between a vector and a constant; or arbitrary operations performed on each element of a vector. The successive operations may specifically be addition, subtraction, multiplication, or division between a vector and a constant, activation operations, accumulation operations, and the like.
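For illustration, the vector operations listed above might be modeled in software as follows; the choice of ReLU as the activation is one common example, not mandated by the text, and all function names are hypothetical.

```python
# Software illustrations of the vector operator circuit's operations.
def vec_add(a, b):
    """Elementwise addition of two vectors."""
    return [x + y for x, y in zip(a, b)]

def vec_scale(a, c):
    """Multiplication of a vector by a constant."""
    return [x * c for x in a]

def relu(a):
    """An example activation applied to each element (ReLU assumed)."""
    return [max(0.0, x) for x in a]

def accumulate(a):
    """Accumulation of a vector into a single scalar."""
    total = 0.0
    for x in a:
        total += x
    return total
```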
Each basic processing circuit may include a basic register and/or a basic on-chip cache circuit; each basic processing circuit may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like. The inner product operator circuit, the vector operator circuit, and the accumulator circuit may all be integrated circuits, or they may be separately provided circuits.
Optionally, the chip device may further include one or more branch processing circuits. When branch processing circuits are provided, the main processing circuit is connected to the branch processing circuits, and each branch processing circuit is connected to basic processing circuits; the inner product operator circuit of a basic processing circuit is used to perform inner product operations between data blocks; the control circuit of the main processing circuit controls the data receiving circuit or the data transmitting circuit to send and receive external data, and controls the data transmitting circuit to distribute external data to the branch processing circuits; and the branch processing circuits are used to send and receive data from the main processing circuit or the basic processing circuits. The structure shown in FIG. 1a is suitable for the computation of complex data: because the number of units that the main processing circuit can connect to is limited, branch processing circuits need to be added between the main processing circuit and the basic processing circuits so that more basic processing circuits can be connected, thereby enabling the computation of complex data blocks. The connection structure between the branch processing circuits and the basic processing circuits may be arbitrary and is not limited to the H-shaped structure of FIG. 1a. Optionally, the structure from the main processing circuit to the basic processing circuits is a broadcast or distribution structure, and the structure from the basic processing circuits to the main processing circuit is a gather structure. Broadcast, distribution, and gather are defined as follows: in a distribution or broadcast structure, the number of basic processing circuits is greater than the number of main processing circuits, that is, one main processing circuit corresponds to multiple basic processing circuits, so the structure from the main processing circuit to the multiple basic processing circuits is a broadcast or distribution structure; conversely, the structure from the multiple basic processing circuits to the main processing circuit may be a gather structure.
A basic processing circuit receives the data distributed or broadcast by the main processing circuit and saves it in its on-chip cache; it may perform operations to generate results, and it may send data to the main processing circuit.

The data involved in the basic processing circuits may be of any data type: data represented by floating-point numbers of any bit width or data represented by fixed-point numbers of any bit width. All of the operation circuits and storage circuits involved may be operation circuits and storage circuits capable of processing any data type: operation circuits and storage circuits for floating-point numbers of any bit width, or operation circuits and storage circuits for fixed-point numbers of any bit width.

Optionally, every basic processing circuit may include a data type conversion operation circuit, or data type conversion operation circuits may be configured in only some of the basic processing circuits. The data type conversion operation circuit may be used to convert received or transmitted data from floating-point data into fixed-point data, and may also convert fixed-point data into floating-point data. The present invention does not limit the specific form of the above data type conversion operation circuit.
可選的,該基礎處理電路的向量運算器電路可以對數據類型轉換後的兩個向量執行的向量運算,當然在實際應用中,基礎處理電路的內積運算器電路可以對數據類型轉換後的兩個向量執行內積運算,累加器電路也可以對內積運算的結果進行累加。 Optionally, the vector operator circuit of the basic processing circuit can perform vector operations on the two vectors after data type conversion. Of course, in practical applications, the inner product operator circuit of the basic processing circuit can perform the data type conversion. Two vectors perform an inner product operation, and the accumulator circuit can also accumulate the results of the inner product operation.
In an optional solution, the two vectors can be stored in the on-chip cache and/or registers, and the basic processing circuit can fetch the two vectors to perform an operation as actual computation requires. Such operations include, but are not limited to, inner product, multiplication, addition, and other operations.
In an optional solution, the result of the inner product operation can be accumulated into the on-chip cache and/or registers. The advantages of this solution are that the volume of data transferred between the basic processing circuit and the main processing circuit is reduced, operational efficiency is improved, and data transmission power consumption is lowered.
In an optional solution, the result of the inner product operation is not accumulated but is transmitted directly as the result. The advantage of this solution is that the amount of computation inside the basic processing circuit is reduced and the operational efficiency of the basic processing circuit is improved.
In an optional solution, each basic processing circuit can perform multiple groups of two-vector inner product operations and can also accumulate the results of the groups separately. In an optional solution, the vector data of the multiple groups can be stored in the on-chip cache and/or registers. In an optional solution, the results of the multiple groups of inner product operations can be accumulated into the on-chip cache and/or registers respectively. In an optional solution, the result of each group's inner product operation can be transmitted directly as the result without accumulation. In an optional solution, each basic processing circuit can perform inner product operations of the same vector with several vectors (a "one-to-many" inner product, i.e., one of the two vectors in each group is shared across all groups), accumulating the inner product result corresponding to each vector separately. This solution allows the same set of weights to be used in multiple computations over different input data, increasing data reuse, reducing the volume of data transferred inside the basic processing circuit, improving computational efficiency, and lowering power consumption.
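The "one-to-many" inner product above, where one cached vector is shared across several groups, can be sketched in software as follows (a plain illustration of the arithmetic, not of the hardware circuit itself):

```python
def inner(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def one_to_many_inner_product(shared, others):
    """Compute the inner product of one shared vector (e.g. a set of weights,
    cached once in the basic processing circuit) with each of several vectors,
    keeping each result separate."""
    return [inner(shared, v) for v in others]

w = [1.0, 2.0, 3.0]                              # shared vector
inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]      # per-group non-shared vectors
assert one_to_many_inner_product(w, inputs) == [1.0, 5.0]
```

The shared vector is fetched once and reused for every group, which is exactly the data reuse the text credits with reducing internal data transfer.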
Specifically, among the data used to compute the inner products, the vector shared by all groups and the other vector of each group (the one that differs between groups) can come from different sources. In an optional solution, when computing the inner product, the shared vector comes from a broadcast or distribution by the main processing circuit or a branch processing circuit; in an optional solution, it comes from the on-chip cache; in an optional solution, it comes from registers. In an optional solution, when computing the inner product, the other, non-shared vector of each group comes from a broadcast or distribution by the main processing circuit or a branch processing circuit; in an optional solution, it comes from the on-chip cache; in an optional solution, it comes from registers. In an optional solution, when performing multiple groups of inner product operations, any number of copies of each group's shared vector can be kept in the on-chip cache and/or registers of the basic processing circuit. In an optional solution, one copy of the shared vector can be kept for each group's inner product; in an optional solution, only a single copy can be kept. Specifically, the results of the multiple groups of inner product operations can be accumulated into the on-chip cache and/or registers respectively; specifically, the result of each group's inner product operation can be transmitted directly as the result without accumulation.

Referring to the structure shown in FIG. 1a, it includes a main processing circuit (which can perform vector operations) and multiple basic processing circuits (which can perform inner product operations). The benefit of this combination is that the device can not only use the basic processing circuits to perform matrix and vector multiplication, but can also use the main processing circuit to perform any other vector operation, so that under a limited hardware circuit configuration the device completes more operations faster, reduces the number of data transfers with the outside of the device, improves computational efficiency, and lowers power consumption. In addition, a data type conversion circuit can be provided in the basic processing circuits and/or the main processing circuit of this chip, so that during neural network computation floating-point type data can be converted into fixed-point type data and fixed-point type data can be converted into floating-point type data. Moreover, the chip can dynamically assign which circuit performs the data type conversion according to the computational load of each circuit (mainly the main processing circuit and the basic processing circuits). This reduces the complexity of the data computation and lowers power consumption, and the dynamic assignment of data type conversion can be achieved without affecting the computational efficiency of the chip. The assignment methods include, but are not limited to, load balancing, minimum-load assignment, and the like.
Referring to the device shown in FIG. 1b, which is a computing device in which branch processing circuits are individually connected to the basic processing circuits: the device shown in FIG. 1b includes a main processing circuit and N basic processing circuits, where the main processing circuit (whose specific structure is shown in FIG. 1c) can be connected to the N basic processing circuits directly or indirectly. In the indirect case, an optional solution, as shown in FIG. 1a, may include N/4 branch processing circuits, each branch processing circuit connecting to 4 basic processing circuits. For the circuits contained in the main processing circuit and the N basic processing circuits, refer to the description of FIG. 1a above, which is not repeated here. It should be noted that the basic processing circuits may also be arranged inside the branch processing circuits, and the number of basic processing circuits connected to each branch processing circuit need not be limited to 4; manufacturers can configure it according to actual needs. The main processing circuit and/or the N basic processing circuits may each include a data type conversion circuit: specifically, the main processing circuit alone may include it, the N basic processing circuits (or a subset of them) may include it, or both the main processing circuit and the N basic processing circuits (or a subset of them) may include it. The main processing circuit can dynamically assign the entity that performs the data type conversion step according to the neural network computation instruction. Specifically, the main processing circuit can determine whether to perform the data type conversion step on received data according to its own load; more specifically, the load value can be divided into multiple intervals, each interval being assigned an entity that executes the data type conversion step. Taking 3 intervals as an example: in interval 1 the load is low, and the main processing circuit alone can perform the data type conversion step; in interval 2 the load lies between interval 1 and interval 3, and the main processing circuit or the N basic processing circuits can jointly perform the data type conversion step; in interval 3 the load is high, and the N basic processing circuits can perform the data type conversion step. This can be done explicitly; for example, the main processing circuit can be configured with a special indication or instruction, and when a basic processing circuit receives the special indication or instruction it determines that the data type conversion step is to be performed, whereas if it does not receive the special indication or instruction it determines that the data type conversion step is not to be performed. It can also be done implicitly; for example, when a basic processing circuit receives data of floating-point type and determines that an inner product operation needs to be performed, it converts the data into fixed-point type.
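The three-interval load assignment described above can be sketched as follows; the thresholds and the returned labels are hypothetical, since the text does not fix concrete interval boundaries:

```python
def conversion_assignee(load, t1=0.33, t2=0.66):
    """Pick which circuit performs the data type conversion step based on
    the main processing circuit's load. The interval thresholds t1/t2 and
    the three labels are illustrative assumptions."""
    if load < t1:        # interval 1: low load -> main processing circuit
        return "main"
    elif load < t2:      # interval 2: medium load -> main or basic circuits
        return "main_or_basic"
    else:                # interval 3: high load -> basic processing circuits
        return "basic"
```

A real implementation would couple this decision to the explicit special-instruction mechanism or the implicit type-based rule that the text describes.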
The following provides a computing method using the device shown in FIG. 1a. The method may specifically be a neural network computation, such as the forward operation of a neural network or the training of a neural network. In practical applications, the forward operation may perform matrix-times-matrix, convolution, activation, transformation, and other operations depending on the input data, and all of these operations can be implemented with the device shown in FIG. 1a.
The data conversion circuit of the main processing circuit first converts the type of the data, and the control circuit then transmits the data to the basic processing circuits for operation. For example, the data conversion circuit of the main processing circuit can convert floating-point numbers into fixed-point numbers of lower bit width before transmitting them to the basic processing circuits. The advantages are that the bit width of the transmitted data is reduced, the total number of bits transmitted is reduced, and the basic processing circuits execute low-bit-width fixed-point operations with higher efficiency and lower power consumption.
If the data received by a basic processing circuit is floating-point data, the basic processing circuit can, after receiving the data, have its data conversion circuit convert the data type first and then perform the computation. For example, when the basic processing circuit receives floating-point numbers transmitted from the main processing circuit, the data conversion circuit converts them into fixed-point numbers, and the inner product operator circuit, vector operator circuit, or accumulator circuit of the basic processing circuit then performs the operation, improving operational efficiency and reducing power consumption.
After the basic processing circuit computes a result, it can perform data type conversion before transmitting the result to the main processing circuit. For example, a floating-point operation result computed by the basic processing circuit can first be converted into a fixed-point number of low bit width and then transmitted to the main processing circuit. The benefits are that the data bit width during transmission is reduced, efficiency is higher, and power consumption is saved.
The main processing circuit transmits the data to be computed to all or some of the basic processing circuits. Taking matrix-times-vector computation as an example, the control circuit of the main processing circuit can split the matrix data so that each column serves as one piece of basic data; for example, an m*n matrix can be split into n vectors of m rows, and the control circuit of the main processing circuit distributes the n split vectors of m rows to multiple basic processing circuits. As for the vector, the control circuit of the main processing circuit can broadcast the vector as a whole to every basic processing circuit. If the value of m is relatively large, the control circuit can first split the m*n matrix into x*n vectors. Taking x=2 as an example, the matrix can be split into 2n vectors, each containing m/2 rows; that is, each of the n vectors of m rows is evenly split into 2 vectors. Taking the first one as an example: if the first of the n vectors of m rows has 1000 rows, splitting it evenly into 2 vectors can mean forming the first vector from the first 500 rows and the second vector from the last 500 rows; the control circuit then broadcasts the 2 vectors to the multiple basic processing circuits in 2 broadcasts.
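The column-wise splitting described above (an m*n matrix into n vectors, each optionally segmented when m is large) can be sketched as:

```python
def split_columns(mat, parts=1):
    """Split an m*n matrix (given as a list of m rows) into its n column
    vectors; when `parts` > 1, additionally split each column into that many
    equal segments, as when m is large (x = parts in the text's example)."""
    m, n = len(mat), len(mat[0])
    cols = [[mat[r][c] for r in range(m)] for c in range(n)]
    if parts == 1:
        return cols
    seg = m // parts                       # assumes m divisible by parts
    return [col[k * seg:(k + 1) * seg] for col in cols for k in range(parts)]

mat = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]   # m=4, n=3
assert split_columns(mat)[0] == [0, 3, 6, 9]           # first column vector
assert len(split_columns(mat, parts=2)) == 6           # x*n = 2*3 segments
```

Each returned segment corresponds to one unit of data that the control circuit distributes or broadcasts.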
The data transmission method may be broadcast, distribution, or any other possible transmission method. After a basic processing circuit receives the data, it performs an operation to obtain an operation result, and the basic processing circuit transmits the operation result back to the main processing circuit. The operation result may be an intermediate operation result or a final operation result.
The device shown in FIG. 1a is used to complete a matrix-times-vector operation. (Matrix-times-vector can mean computing the inner product of each row of the matrix with the vector and arranging these results into a vector in the order of the corresponding rows.)
The following describes the operation of multiplying a matrix S of size M rows and L columns by a vector P of length L, as shown in FIG. 2a. (Each row of the matrix S has the same length as the vector P, and their data correspond one to one by position.) The neural network computing device has K basic processing circuits. Referring to FIG. 2, which provides an implementation of matrix-times-vector, the method may specifically include: Step S201: the data conversion circuit of the main processing circuit converts each row of data in the matrix S into fixed-point type data, and the control circuit of the main processing circuit distributes it to one of the K basic processing circuits; the basic processing circuit saves the received distributed data in its on-chip cache and/or registers. In an optional solution, if the number of rows M of the matrix S satisfies M<=K, the control circuit of the main processing circuit distributes one row of the matrix S to each of the K basic processing circuits. In an optional solution, if the number of rows M of the matrix S satisfies M>K, the control circuit of the main processing circuit distributes one or more rows of the matrix S to each basic processing circuit.
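The row distribution of step S201 (one row per circuit when M<=K, multiple rows per circuit when M>K) can be sketched as follows; round-robin assignment is an illustrative choice, since the text leaves the exact policy open:

```python
def distribute_rows(M, K):
    """Assign the M row indices of S to K basic processing circuits.
    The resulting list for circuit i is its row set Ai (Mi rows)."""
    assignment = {i: [] for i in range(K)}
    for row in range(M):
        assignment[row % K].append(row)   # round-robin (illustrative policy)
    return assignment

# M <= K: each circuit receives at most one row
assert distribute_rows(3, 4) == {0: [0], 1: [1], 2: [2], 3: []}
# M > K: some circuits receive multiple rows
assert distribute_rows(5, 2) == {0: [0, 2, 4], 1: [1, 3]}
```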
The set of rows of S distributed to the i-th basic processing circuit is denoted Ai and contains Mi rows in total; FIG. 2c shows the computation to be performed on the i-th basic processing circuit.
In an optional solution, each basic processing circuit, for example the i-th, can save the received distributed data, for example the matrix Ai, in its registers and/or on-chip cache. The advantage is that the volume of data transferred in subsequent distributions is reduced, computational efficiency is improved, and power consumption is lowered.
Step S202: the data type conversion circuit of the main processing circuit converts the vector P into fixed-point type data, and the control circuit of the main processing circuit broadcasts the parts of the fixed-point vector P to the K basic processing circuits. In an optional solution, the control circuit of the main processing circuit can broadcast each part of the vector P only once into the registers or on-chip cache of each basic processing circuit; the i-th basic processing circuit then fully reuses the data of the vector P obtained from this single broadcast, completing the inner product operation corresponding to each row of the matrix Ai. The advantage is that the volume of data transferred in repeated transmissions of the vector P from the main processing circuit to the basic processing circuits is reduced, execution efficiency is improved, and transmission power consumption is lowered.
In an optional solution, the control circuit of the main processing circuit can broadcast the parts of the vector P to the registers or on-chip cache of each basic processing circuit multiple times; the i-th basic processing circuit does not reuse the data of the vector P obtained each time, and completes the inner product operation corresponding to each row of the matrix Ai in stages. The advantages are that the volume of data in a single internal transmission of the vector P is reduced, the capacity of the basic processing circuit's cache and/or registers can be reduced, execution efficiency is improved, transmission power consumption is lowered, and cost is reduced.
In an optional solution, the control circuit of the main processing circuit can broadcast the parts of the vector P to the registers or on-chip cache of each basic processing circuit multiple times; the i-th basic processing circuit partially reuses the data of the vector P obtained each time, completing the inner product operation corresponding to each row of the matrix Ai. The advantages are that the volume of data transferred from the main processing circuit to the basic processing circuits is reduced, the volume of data transferred inside the basic processing circuit is also reduced, execution efficiency is improved, and transmission power consumption is lowered.
Step S203: the inner product operator circuits of the K basic processing circuits compute the inner products of the data of the matrix S and the vector P; for example, the i-th basic processing circuit computes the inner product of the data of the matrix Ai and the data of the vector P. Step S204: the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain accumulated results, and transmit the accumulated results back to the main processing circuit in fixed-point form.
In an optional solution, the partial sum obtained each time the basic processing circuit performs an inner product operation (a partial sum is a part of the accumulated result; for example, if the accumulated result is F1*G1+F2*G2+F3*G3+F4*G4+F5*G5, a partial sum may be the value of F1*G1+F2*G2+F3*G3) can be transmitted back to the main processing circuit for accumulation. The advantage is that the amount of computation inside the basic processing circuit is reduced and the operational efficiency of the basic processing circuit is improved.
In an optional solution, the partial sum obtained from each inner product operation performed by the basic processing circuit can also be saved in the registers and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after accumulation finishes. The advantages are that the volume of data transferred between the basic processing circuit and the main processing circuit is reduced, operational efficiency is improved, and data transmission power consumption is lowered.
In an optional solution, the partial sums obtained from the inner product operations performed by the basic processing circuit can in some cases be saved in the registers and/or on-chip cache of the basic processing circuit for accumulation, and in other cases be transmitted to the main processing circuit for accumulation, then transmitted back to the main processing circuit after accumulation finishes. The advantages are that the volume of data transferred between the basic processing circuit and the main processing circuit is reduced, operational efficiency is improved, data transmission power consumption is lowered, the amount of computation inside the basic processing circuit is reduced, and the operational efficiency of the basic processing circuit is improved.
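Steps S201-S204 taken together amount to the following computation; this is a software sketch of the data flow only, omitting the fixed-point conversion and the partial-sum options, with round-robin row distribution as an illustrative assumption:

```python
def inner(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matrix_vector(S, P, K):
    """Sketch of matrix-times-vector on the device: rows of S are distributed
    across K basic processing circuits, P is broadcast to all of them, each
    circuit computes the inner products for its row set Ai, and the results
    are gathered back at the main processing circuit."""
    M = len(S)
    result = [0.0] * M
    for i in range(K):                 # each basic processing circuit i
        for r in range(i, M, K):       # its distributed row set Ai
            result[r] = inner(S[r], P)
    return result

S = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
P = [1.0, 1.0]
assert matrix_vector(S, P, K=2) == [3.0, 7.0, 11.0]
```

In hardware the K circuits run in parallel; the sequential loop over `i` here only mimics the partitioning.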
Referring to FIG. 2b, the device shown in FIG. 1a is used to complete a matrix-times-matrix operation. The following describes the multiplication of a matrix S of size M rows and L columns by a matrix W of size L rows and N columns (each row of the matrix S has the same length as each column of the matrix W, as shown in FIG. 2d). The neural network computing device has K basic processing circuits. Step S201b: the control circuit of the main processing circuit distributes each row of data of the matrix S to one of the K basic processing circuits, and the basic processing circuit saves the received data in its on-chip cache and/or registers. In an optional solution, if the number of rows M of S satisfies M<=K, the control circuit of the main processing circuit distributes one row of the matrix S to each of M basic processing circuits. In an optional solution, if the number of rows M of S satisfies M>K, the control circuit of the main processing circuit distributes one or more rows of the matrix S to each basic processing circuit.
Mi rows of S are distributed to the i-th basic processing circuit; the set of these Mi rows is called Ai. FIG. 2e shows the computation to be performed on the i-th basic processing circuit.
In an optional solution, each basic processing circuit, for example the i-th, saves the received matrix Ai distributed by the main processing circuit in its registers and/or on-chip cache. The advantage is that the volume of data transferred subsequently is reduced, computational efficiency is improved, and power consumption is lowered.
Step S202b: the control circuit of the main processing circuit broadcasts the parts of the matrix W to the basic processing circuits. In an optional solution, the parts of the matrix W can be broadcast only once into the registers or on-chip cache of each basic processing circuit; the i-th basic processing circuit fully reuses the data of the matrix W obtained from this single broadcast, completing the inner product operation corresponding to each row of the matrix Ai. "Reuse" in this embodiment specifically means repeated use by the basic processing circuit during computation; for example, reuse of the data of the matrix W can mean using the data of the matrix W multiple times.
In an optional solution, the control circuit of the main processing circuit can broadcast the parts of the matrix W to the registers or on-chip cache of each basic processing circuit multiple times; the i-th basic processing circuit does not reuse the data of the matrix W obtained each time, and completes the inner product operations corresponding to each row of the matrix Ai in stages. In an optional solution, the control circuit of the main processing circuit can broadcast the parts of the matrix W to the registers or on-chip cache of each basic processing circuit multiple times; the i-th basic processing circuit partially reuses the data of the matrix W obtained each time, completing the inner product operations corresponding to each row of the matrix Ai. In an optional solution, each basic processing circuit, for example the i-th, computes the inner products of the data of the matrix Ai and the data of the matrix W. Step S203b: the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits them back to the main processing circuit.
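The matrix-times-matrix procedure of steps S201b-S203b can be sketched the same way: rows of S distributed, W broadcast, and each circuit forming inner products of its rows Ai with the columns of W (round-robin distribution is again an illustrative assumption):

```python
def inner(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matrix_matrix(S, W, K):
    """Sketch of matrix-times-matrix on the device: each basic processing
    circuit i holds its row set Ai of S and computes, for every column of the
    broadcast matrix W, the inner product of its rows with that column."""
    M, N = len(S), len(W[0])
    cols = [[W[r][c] for r in range(len(W))] for c in range(N)]
    out = [[0.0] * N for _ in range(M)]
    for i in range(K):
        for r in range(i, M, K):       # rows Ai held by circuit i
            for c in range(N):
                out[r][c] = inner(S[r], cols[c])
    return out

S = [[1.0, 2.0]]                        # 1x2
W = [[1.0, 0.0], [0.0, 1.0]]            # 2x2 identity
assert matrix_matrix(S, W, K=1) == [[1.0, 2.0]]
```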
In an optional solution, the basic processing circuit can transmit the partial sum obtained from each inner product operation back to the main processing circuit for accumulation. In an optional solution, the partial sum obtained from each inner product operation performed by the basic processing circuit can also be saved in the registers and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after accumulation finishes. In an optional solution, the partial sums obtained from the inner product operations performed by the basic processing circuit can in some cases be saved in the registers and/or on-chip cache of the basic processing circuit for accumulation, and in other cases be transmitted to the main processing circuit for accumulation, then transmitted back to the main processing circuit after accumulation finishes.

Referring to FIG. 3a, the device shown in FIG. 1a is used to complete a fully connected operation. If the input data of the fully connected layer is a vector (i.e., the input of the neural network is a single sample), the weight matrix of the fully connected layer is taken as the matrix S and the input vector as the vector P, and the matrix-times-vector operation shown in FIG. 2 is executed according to the first usage method of the device. If the input data of the fully connected layer is a matrix (i.e., the input of the neural network is a batch of multiple samples), the weight matrix of the fully connected layer is taken as the matrix S and the input as the matrix W, or the weight matrix of the fully connected layer is taken as the matrix W and the input as the matrix S, and the matrix-times-matrix operation shown in FIG. 2c is executed according to the usage of the device.

Referring to FIG. 3b, the device shown in FIG. 1a is used to complete a convolution operation. For a convolutional layer, let the number of its convolution kernels be M. Step S301: the control circuit of the main processing circuit distributes the weights of each convolution kernel of the convolutional layer to one of the K basic processing circuits, which saves them in its on-chip cache and/or registers. In an optional solution, if the number of convolution kernels M satisfies M<=K, the control circuit of the main processing circuit distributes the weights of one convolution kernel to each of M basic processing circuits. In an optional solution, if M>K, the control circuit of the main processing circuit distributes the weights of one or more convolution kernels to each basic processing circuit.
A total of Mi convolution kernels are distributed to the i-th basic processing circuit; the set of these convolution kernel weights is called Ai.
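With kernels and input patches flattened to vectors, the convolution mapping of steps S301-S304 reduces to inner products of distributed kernels with broadcast input data; a sketch (extraction of patches from the input data T is omitted, and round-robin kernel distribution is an illustrative assumption):

```python
def inner(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def conv_as_inner_products(kernels, patches, K):
    """Sketch of convolution on the device: the M kernel weight vectors are
    distributed round-robin to K basic processing circuits, the input patches
    (taken from data T) are broadcast, and each output value is the inner
    product of one kernel with one patch."""
    M = len(kernels)
    out = [[0.0] * len(patches) for _ in range(M)]
    for i in range(K):
        for k in range(i, M, K):        # kernel set Ai held by circuit i
            for p, patch in enumerate(patches):
                out[k][p] = inner(kernels[k], patch)
    return out

kernels = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0]]  # two flattened 2x2 kernels
patches = [[1.0, 2.0, 3.0, 4.0]]                         # one flattened patch of T
assert conv_as_inner_products(kernels, patches, K=2) == [[10.0], [5.0]]
```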
In an optional solution, each basic processing circuit, for example the i-th basic processing circuit, stores the convolution kernel weights Ai distributed by the main processing circuit in its registers and/or on-chip cache.

Step S302: the control circuit of the main processing circuit transmits the parts of the input data T to the basic processing circuits by broadcasting.

In an optional solution, the control circuit of the main processing circuit broadcasts each part of the input data T only once to the registers or on-chip caches of the basic processing circuits; the i-th basic processing circuit fully reuses the data of the input data T obtained from this single broadcast and completes the inner product operation corresponding to every convolution kernel in Ai.

In an optional solution, the control circuit of the main processing circuit broadcasts each part of the input data T multiple times to the registers or on-chip caches of the basic processing circuits; the i-th basic processing circuit does not reuse the data of the input data T obtained from each broadcast, and completes the inner product operations corresponding to the convolution kernels in Ai in separate passes.

In an optional solution, the control circuit of the main processing circuit broadcasts each part of the input data T multiple times to the registers or on-chip caches of the basic processing circuits; the i-th basic processing circuit partially reuses the data of the input data T obtained from each broadcast and completes the inner product operation corresponding to each convolution kernel in Ai.

Step S303: each basic processing circuit computes the inner product of the convolution kernels and the data of the input data T; for example, the i-th basic processing circuit computes the inner product of each convolution kernel in Ai with the data of the input data T.

Step S304: the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits them back to the main processing circuit.

In an optional solution, the basic processing circuit may transmit the partial sum obtained from each inner product operation back to the main processing circuit for accumulation. In an optional solution, the basic processing circuit may instead store the partial sums obtained from the inner product operations in its registers and/or on-chip cache and transmit the result back to the main processing circuit after the accumulation is finished. In an optional solution, the basic processing circuit may, in some cases, store the partial sums in its registers and/or on-chip cache for accumulation, and in other cases transmit them to the main processing circuit for accumulation, transmitting the result back to the main processing circuit after the accumulation is finished.
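Steps S302 to S304 can be sketched in pure Python. Here `basic_circuit` stands in for the i-th basic processing circuit under the full-reuse option: the broadcast input data is received once and reused against every kernel in its set Ai (all function and variable names are illustrative, not from the disclosure):

```python
def inner_product(kernel, data):
    # Multiply corresponding elements and accumulate, as the basic
    # circuit's multiplier and accumulator would.
    return sum(k * d for k, d in zip(kernel, data))

def basic_circuit(kernels_ai, broadcast_data):
    # The broadcast data is received once and fully reused for every
    # kernel in Ai (the "full reuse" option of step S302).
    return [inner_product(k, broadcast_data) for k in kernels_ai]

def main_circuit(kernel_sets, input_data):
    # Step S304: results accumulated by each basic circuit are
    # collected back at the main processing circuit.
    return [basic_circuit(ai, input_data) for ai in kernel_sets]
```

For example, with two kernel sets `[[1, 0], [0, 1]]` and `[[2, 2]]` and input `[3, 4]`, the main circuit collects `[[3, 4], [14]]`.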
Method of updating weights using the apparatus shown in Fig. 1a: the vector operator circuit of the main processing circuit implements the weight-update function of the neural network training process. Specifically, weight updating refers to updating the weights using the gradients of the weights.
In an optional solution, the vector operator circuit of the main processing circuit performs addition and subtraction on the two vectors, the weights and the weight gradients, to obtain an operation result; that result is the updated weights.
In an optional solution, the vector operator circuit of the main processing circuit first multiplies or divides the weights and the weight gradients by a number to obtain intermediate weights and intermediate weight gradient values, and then performs addition and subtraction on the intermediate weights and intermediate weight gradient values to obtain an operation result; that result is the updated weights.
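The weight-update schemes of this section — direct add/subtract of weights and gradients, scaled intermediate values, and a momentum term computed from the gradients — amount to element-wise vector operations of the kind the vector operator circuit performs. A minimal sketch (the learning rate `lr` and momentum coefficient `mu` are illustrative assumptions, not values from the disclosure):

```python
def sgd_update(weights, grads, lr=0.1):
    # Scale the gradient by a number, then subtract element-wise:
    # the "intermediate weight gradient" variant.
    return [w - lr * g for w, g in zip(weights, grads)]

def momentum_update(weights, grads, velocity, lr=0.1, mu=0.9):
    # First compute a set of momenta from the gradients, then
    # add/subtract them against the weights.
    velocity = [mu * v + g for v, g in zip(velocity, grads)]
    weights = [w - lr * v for w, v in zip(weights, velocity)]
    return weights, velocity
```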
In an optional solution, a set of momenta may first be computed from the gradients of the weights, and the updated weights are then obtained by adding and subtracting the momenta and the weights.

Method of implementing the reverse operation of a fully connected layer using the apparatus shown in Fig. 1a:
The reverse operation of a fully connected layer can be divided into two parts: in Fig. 4a, the solid arrows represent the forward computation of the fully connected layer, and Fig. 4b represents its reverse computation.
The reverse operation of the fully connected layer shown in Figs. 4a and 4b can be carried out with the apparatus shown in Fig. 1a using the matrix-multiplied-by-matrix method shown in Fig. 2b. Method of implementing the reverse operation of a convolution layer using the apparatus shown in Fig. 1a: the reverse operation of a convolution layer can be divided into two parts; in Fig. 5a, the solid arrows represent the forward computation of the convolution layer, and Fig. 5b represents its reverse computation.
The reverse operation of the convolution layer shown in Figs. 5a and 5b can be carried out with the apparatus shown in Fig. 1a using the method shown in Fig. 3b.
Method of implementing BLAS (Basic Linear Algebra Subprograms) functions using the apparatus shown in Fig. 1a:
GEMM computation refers to the matrix-matrix multiplication operation of the BLAS library. The usual form of this operation is C = alpha*op(S)*op(P) + beta*C, where S and P are the two input matrices, C is the output matrix, alpha and beta are scalars, and op denotes some operation applied to matrix S or P; in addition, some auxiliary integers are passed as parameters to describe the widths and heights of matrices S and P. The steps of implementing the GEMM computation with the apparatus of Fig. 1a include: the data type conversion circuit of the main processing circuit may perform data type conversion on matrix S and matrix P; the conversion circuit of the main processing circuit performs the respective op operation on the input matrices S and P. In an optional solution, op may be a matrix transpose, which can be implemented with the matrix transpose circuit of the main processing circuit. In an optional solution, after the op operations on matrices S and P have been performed, the data type conversion circuit of the main processing circuit may additionally perform a data type conversion, i.e. convert the data types of op(S) and op(P) from floating-point data to fixed-point data, and then carry out the matrix multiplication shown in Fig. 2b.
In an optional solution, the op of a given matrix may be empty, in which case the op operation is not performed. The matrix multiplication between op(S) and op(P) is carried out with the apparatus shown in Fig. 1a using the matrix-multiplied-by-matrix method described in Fig. 2b. The arithmetic logic unit of the main processing circuit multiplies each value in the result of op(S)*op(P) by alpha; in an optional solution, when alpha is 1, the multiplication by alpha is not performed. The arithmetic logic unit of the main processing circuit computes beta*C; in an optional solution, when beta is 1, the multiplication by beta is not performed. The vector operator circuit of the main processing circuit performs the element-wise addition of the matrices alpha*op(S)*op(P) and beta*C to obtain the result of the GEMM computation.
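The GEMM procedure above can be sketched in pure Python with nested lists standing in for matrices; `op` is either a transpose or left empty, and the alpha == 1, beta == 1, and beta == 0 shortcuts mirror the optional schemes. This is a behavioral sketch only, with no data type conversion step:

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    # Matrix multiplication via row-by-column inner products.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gemm(alpha, s, p, beta, c, op_s=None, op_p=None):
    # C = alpha * op(S) * op(P) + beta * C
    s = transpose(s) if op_s == "T" else s   # empty op: leave as-is
    p = transpose(p) if op_p == "T" else p
    sp = matmul(s, p)
    if alpha != 1:                           # skip multiply when alpha == 1
        sp = [[alpha * v for v in row] for row in sp]
    if beta == 0:                            # skip the addition when beta == 0
        return sp
    bc = c if beta == 1 else [[beta * v for v in row] for row in c]
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(sp, bc)]
```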
In an optional solution, when beta is 0, this step is not performed. GEMV computation refers to the matrix-vector multiplication operation of the BLAS library. The usual form of this operation is C = alpha*op(S)*P + beta*C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op denotes some operation applied to matrix S. The steps of implementing the GEMV computation with the apparatus of Fig. 1a are: the data type conversion circuit of the main processing circuit may perform data type conversion on the input matrix S and the vector P; the conversion circuit of the main processing circuit performs the corresponding op operation on the input matrix S. In an optional solution, op may be a matrix transpose, implemented with the conversion circuit of the main processing circuit. In an optional solution, the op of a given matrix may be empty, in which case the transpose is not performed. The matrix-vector multiplication between matrix op(S) and vector P is carried out with the apparatus shown in Fig. 1a using the matrix-multiplied-by-vector method described in Fig. 2a. The arithmetic logic unit of the main processing circuit multiplies each value in the result of op(S)*P by alpha; in an optional solution, when alpha is 1, the multiplication by alpha is not performed. The arithmetic logic unit of the main processing circuit computes beta*C; in an optional solution, when beta is 1, the multiplication by beta is not performed. The vector operator circuit of the main processing circuit performs the element-wise addition of alpha*op(S)*P and beta*C to obtain the result of the GEMV computation.
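The GEMV procedure follows the same pattern as GEMM, with the vector P replacing the second matrix; a minimal pure-Python sketch with the same alpha/beta shortcuts (names illustrative):

```python
def gemv(alpha, s, p, beta, c, op_s=None):
    # C = alpha * op(S) * P + beta * C, with S a matrix and P, C vectors.
    if op_s == "T":
        s = [list(col) for col in zip(*s)]   # optional transpose op
    sp = [sum(x * y for x, y in zip(row, p)) for row in s]
    if alpha != 1:                           # skip multiply when alpha == 1
        sp = [alpha * v for v in sp]
    if beta == 0:                            # the addition step is skipped
        return sp
    bc = c if beta == 1 else [beta * v for v in c]
    return [x + y for x, y in zip(sp, bc)]
```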
In an optional solution, when beta is 0, the addition step is not performed. Method of implementing an activation function using the apparatus of Fig. 1a: the activation circuit of the main processing circuit takes a vector as input and computes the activation vector of that vector. In an optional solution, the activation circuit of the main processing circuit passes each value of the input vector through an activation function (whose input is a numerical value and whose output is also a numerical value) and writes the computed value to the corresponding position of the output vector. In an optional solution, the activation function may be y=max(m,x), where x is the input value, y is the output value, and m is a constant. In an optional solution, the activation function may be y=tanh(x), where x is the input value and y is the output value. In an optional solution, the activation function may be y=sigmoid(x), where x is the input value and y is the output value. In an optional solution, the activation function may be a piecewise linear function. In an optional solution, the activation function may be any function that takes a number as input and outputs a number.
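The activation step applies a scalar function to each position of the input vector; a minimal sketch using the y=max(m,x), tanh, and sigmoid forms named above:

```python
import math

def activate(vector, fn):
    # Each value of the input vector passes through the scalar
    # activation function; the result is written to the same
    # position of the output vector.
    return [fn(x) for x in vector]

relu_like = lambda x, m=0.0: max(m, x)            # y = max(m, x)
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))    # y = sigmoid(x)
```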
In an optional solution, the sources of the input vector include (but are not limited to): an external data source of the apparatus. In an optional solution, the input data comes from the result of a matrix-multiplied-by-vector operation performed by the apparatus; in an optional solution, the input data comes from the result of a matrix-multiplied-by-matrix operation performed by the apparatus, or from a computation result of the main processing circuit of the apparatus; in an optional solution, the input data comes from the computation result obtained after the main processing circuit of the apparatus performs the bias addition.
It should be noted that the above activation operation may be implemented by the arithmetic logic circuit and the accumulator circuit in the main processing circuit, or a dedicated activation circuit may be added to the main processing circuit to implement it.
Implementing the bias-addition operation using the apparatus of Fig. 1a: the vector operator circuit of the main processing circuit can add two vectors or two matrices; it can also add a vector to each row, or to each column, of a matrix.
In an optional solution, the matrix may come from the result of a matrix-multiplied-by-matrix operation performed by the apparatus; in an optional solution, the matrix may come from the result of a matrix-multiplied-by-vector operation performed by the apparatus; in an optional solution, the matrix may come from data received from outside by the main processing circuit of the apparatus.
In an optional solution, the vector may come from data received from outside by the main processing circuit of the apparatus.
The data sources include, but are not limited to, the above.
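The bias-addition functions described here — adding a vector to every row, or to every column, of a matrix — can be sketched as follows (names are illustrative):

```python
def add_bias_rows(matrix, bias):
    # Add the bias vector to each row of the matrix.
    return [[v + b for v, b in zip(row, bias)] for row in matrix]

def add_bias_cols(matrix, bias):
    # Add bias[i] to every element of row i, i.e. the vector is
    # added along each column of the matrix.
    return [[v + b for v in row] for row, b in zip(matrix, bias)]
```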
Implementing data type conversion using the apparatus of Fig. 1a: the data type conversion circuit of the main processing circuit performs the conversion of data types. In an optional solution, the data type conversion circuit of the main processing circuit performs the data type conversion of a group of data. In an optional solution, the forms of data type conversion include, but are not limited to, floating-point to fixed-point conversion and fixed-point to floating-point conversion. The present disclosure also provides a chip that includes a computing apparatus. The computing apparatus includes a main processing circuit; the data involved in the main processing circuit may be of any data type, and in an optional solution may be data represented by floating-point numbers of any bit width or by fixed-point numbers of any bit width. All of the arithmetic circuits and storage circuits involved may be arithmetic circuits and storage circuits of any data type; in an optional solution, they may be arithmetic circuits and storage circuits for floating-point numbers of any bit width or for fixed-point numbers of any bit width.
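The floating-point to fixed-point conversion mentioned above can be sketched with a simple scaled-integer format; the number of fractional bits and the rounding policy are illustrative assumptions, since the disclosure allows any bit width:

```python
def float_to_fixed(x, frac_bits=8):
    # Represent x as an integer scaled by 2**frac_bits (assumed format).
    return round(x * (1 << frac_bits))

def fixed_to_float(q, frac_bits=8):
    # Recover the floating-point value from the scaled integer.
    return q / (1 << frac_bits)
```

The round trip is lossy for values that are not multiples of 2**-frac_bits, with error bounded by half a quantization step.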
In an optional solution, the main processing circuit includes a data type conversion circuit; in an optional solution, the main processing circuit includes a vector operation unit that performs data type conversion. Specifically, the main processing circuit includes a data input interface that receives input data. In an optional solution, the source of the received data may be: the outside of the neural network operation circuit apparatus, or some or all of the basic processing circuits of the neural network operation circuit apparatus. In an optional solution, there may be multiple data input interfaces. Specifically, the main processing circuit may include a data output interface that outputs data; in an optional solution, the destination of the output data may be: the outside of the neural network operation apparatus, or some or all of the basic processing circuits of the neural network operation circuit apparatus. In an optional solution, there may be multiple data output interfaces. In an optional solution, the main processing circuit includes an on-chip cache and/or registers. In an optional solution, the main processing circuit includes an operation unit that can perform data operations; in an optional solution, the main processing circuit includes an arithmetic operation unit; in an optional solution, the main processing circuit includes a vector operation unit that can perform operations on a group of data simultaneously. Specifically, the arithmetic operations and/or vector operations may be operations of any type, including but not limited to: addition, subtraction, multiplication, and division of two numbers; addition, subtraction, multiplication, and division of a number and a constant; exponential, power, logarithmic, and various nonlinear operations on a number; and comparison and logical operations on two numbers. They also include: addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division of each element of a vector with a constant; exponential, power, logarithmic, and various nonlinear operations on each element of a vector; and comparison and logical operations on each pair of corresponding elements of two vectors.
In an optional solution, the main processing circuit includes a data rearrangement unit, used to transmit data to the basic processing circuits in a certain order, or to rearrange data in place in a certain order. In an optional solution, the ordering of the data includes permuting the dimension order of a multi-dimensional data block; the ordering may further include partitioning a data block into blocks to be sent to different basic processing circuits.
The computing apparatus further includes multiple basic processing circuits. Each basic processing circuit is used to compute the inner product of two vectors: for the two groups of numbers it receives, the basic processing circuit multiplies the corresponding elements of the two groups and accumulates the products. The result of the inner product is transmitted out; depending on the position of the basic processing circuit, it may be transmitted to other basic processing circuits or directly to the main processing circuit.
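The per-circuit behavior described here — multiply corresponding elements, accumulate, then pass the result either to another basic processing circuit or to the main processing circuit — can be sketched as a chain in which each circuit adds its own inner product to the running partial sum before forwarding it. The chain topology is one of the possibilities the text allows; names are illustrative:

```python
def chained_inner_product(pairs):
    # pairs: one (vector_a, vector_b) per basic processing circuit.
    partial = 0
    for a, b in pairs:
        # Each circuit multiplies corresponding elements, accumulates,
        # and forwards the running sum to the next circuit; the last
        # one transmits it to the main processing circuit.
        partial += sum(x * y for x, y in zip(a, b))
    return partial
```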
The data involved in the basic processing circuits may be of any data type; in an optional solution, it may be data represented by floating-point numbers of any bit width or by fixed-point numbers of any bit width. All of the arithmetic circuits and storage circuits involved may be arithmetic circuits and storage circuits of any data type; in an optional solution, they may be arithmetic circuits and storage circuits for floating-point numbers of any bit width or for fixed-point numbers of any bit width. In an optional solution, the basic processing circuit includes a data type conversion circuit; in an optional solution, the basic processing circuit includes a vector operation unit that performs data type conversion. Specifically, the basic processing circuit includes a storage unit composed of an on-chip cache and/or registers, and one or more data input interfaces that receive data. In an optional solution, it includes two data input interfaces, and one or more data can be obtained from each of the two data input interfaces at a time. In an optional solution, the basic processing circuit may store the input data received from the data input interfaces in its registers and/or on-chip cache. The sources of the data received by the data input interfaces may be: other basic processing circuits and/or the main processing circuit.
That is, the main processing circuit of the neural network operation circuit apparatus, or other basic processing circuits of the neural network operation circuit apparatus (the apparatus has multiple basic processing circuits). Specifically, the basic processing circuit includes one or more data output interfaces that transmit output data; in an optional solution, one or more data can be transmitted out through a data output interface. Specifically, the data transmitted out through a data output interface may be one or any combination of: data received from a data input interface, data stored in the on-chip cache and/or registers, a multiplier operation result, an accumulator operation result, or an inner product operator operation result.
In an optional solution, the basic processing circuit includes three data output interfaces, two of which correspond respectively to the two data input interfaces, each passing on to the next layer the data received from its data input interface, while the third data output interface is responsible for outputting operation results. Specifically, the destinations of the data transmitted by the data output interfaces may be as follows; together, the data sources above and the data destinations here determine the connection relationships of the basic processing circuits within the apparatus.
That is, the main processing circuit of the neural network operation circuit apparatus, or other basic processing circuits of the neural network operation circuit apparatus (the apparatus has multiple basic processing circuits). Specifically, the basic processing circuit includes an arithmetic operation circuit, which may be one or any combination of: one or more multiplier circuits, one or more accumulator circuits, and one or more circuits that perform the inner product operation of two groups of numbers.
In an optional solution, the basic processing circuit can perform the multiplication of two numbers, and the result can be stored in the on-chip cache and/or registers, or directly accumulated into the registers and/or on-chip cache. In an optional solution, it can perform the inner product operation of two groups of data, and the result can be stored in the on-chip cache and/or registers, or directly accumulated into the registers and/or on-chip cache. In an optional solution, it can perform a data accumulation operation, accumulating data into the on-chip cache and/or registers. Specifically, the data accumulated by the accumulator circuit may be one or any combination of: data received from a data input interface, data stored in the on-chip cache and/or registers, a multiplier operation result, an accumulator operation result, or an inner product operator operation result.
It should be noted that the "data input interface" and "data output interface" used in the above description of the basic processing circuit refer to the data input and output interfaces of each basic processing circuit, not the data input and output interfaces of the apparatus as a whole.
The present disclosure also discloses a neural network operation apparatus, which includes one or more chips as shown in Fig. 1a or Fig. 1b, used to obtain data to be operated on and control information from other processing apparatuses and to perform specified neural network operations, with the execution results passed to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one chip as shown in Fig. 1a or Fig. 1b is included, the chips can be linked and transmit data through a specific structure, for example interconnected and transmitting data through a PCIE bus, to support larger-scale neural network operations. In this case, the chips may share the same control system or have their own independent control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection may be any interconnection topology.
The neural network operation apparatus has high compatibility and can be connected to various types of servers through a PCIE interface.
The present disclosure also discloses a combined processing apparatus, which includes the above neural network operation apparatus, a universal interconnection interface, and another processing apparatus (i.e. a general-purpose processing apparatus). The neural network operation apparatus interacts with the other processing apparatus to jointly complete operations specified by the user. Fig. 4c is a schematic diagram of the combined processing apparatus.
The other processing apparatus includes one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor. The number of processors included in the other processing apparatus is not limited. The other processing apparatus serves as the interface between the neural network operation apparatus and external data and control, including data transfer, and performs basic controls such as starting and stopping the neural network operation apparatus; the other processing apparatus may also cooperate with the neural network operation apparatus to jointly complete operation tasks.
The universal interconnection interface is used to transmit data and control instructions between the neural network operation apparatus and the other processing apparatus. The neural network operation apparatus obtains the required input data from the other processing apparatus and writes it into the on-chip storage of the neural network operation apparatus; it may obtain control instructions from the other processing apparatus and write them into an on-chip control cache of the neural network operation apparatus; it may also read the data in the storage module of the neural network operation apparatus and transmit it to the other processing apparatus.
As shown in Fig. 4d, the structure optionally further includes a storage apparatus for saving the data required by this operation unit/operation apparatus or other operation units; it is especially suitable for data that is required for operation but cannot be fully saved in the internal storage of the neural network operation apparatus or the other processing apparatus.
The combined processing apparatus can serve as a system-on-chip (SoC) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing apparatus is connected to certain components of the device, such as a camera, display, mouse, keyboard, network card, or Wi-Fi interface.
Embodiments of the present disclosure provide a neural network processor board card that can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, smart homes, home appliances, multiprocessor systems, microprocessor-based systems, robots, programmable consumer electronics, network personal computers (PCs), minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Please refer to Fig. 5c, which is a schematic structural diagram of a neural network processor board card provided by an embodiment of the present disclosure. As shown in Fig. 5c, the neural network processor board card 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection apparatus 12, and a first substrate 13.
The present disclosure does not limit the specific structure of the neural network chip package structure 11. Optionally, as shown in Fig. 5d, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection apparatus 112, and a second substrate 113.
The specific form of the neural network chip 111 involved in the present disclosure is not limited. The neural network chip 111 includes, but is not limited to, a neural network die integrating a neural network processor; the die may be made of silicon, germanium, quantum materials, molecular materials, or the like. According to the actual situation (for example, a harsh environment) and different application requirements, the neural network die may be packaged so that most of the die is enclosed, with the pins on the die connected to the outside of the package structure through conductors such as gold wires for electrical connection to outer layers.
The present disclosure does not limit the specific structure of the neural network chip 111; optionally, refer to the apparatus shown in Fig. 1a or Fig. 1b.
The present disclosure does not limit the types of the first substrate 13 and the second substrate 113; they may be printed circuit boards (PCB) or printed wiring boards (PWB), and possibly other circuit boards. The materials used to fabricate the PCB are likewise not limited.
The second substrate 113 involved in the present disclosure is used to carry the neural network chip 111. The neural network chip package structure 11, obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection apparatus 112, is used to protect the neural network chip 111 and facilitate the further packaging of the neural network chip package structure 11 with the first substrate 13.
The specific packaging method of the second electrical and non-electrical connection apparatus 112 and the structure corresponding to that packaging method are not limited; a suitable packaging method may be selected and simply improved according to the actual situation and different application requirements, for example: Flip Chip Ball Grid Array Package (FCBGAP), Low-profile Quad Flat Package (LQFP), Quad Flat Package with Heat sink (HQFP), Quad Flat Non-lead Package (QFN), or Fine-pitch Ball Grid Array (FBGA).
Flip chip packaging is suitable when there are strict requirements on the packaged area, or when the design is sensitive to wire inductance and signal transmission time. Alternatively, wire bonding may be used, which reduces cost and improves the flexibility of the package structure.
A Ball Grid Array can provide more pins, and the average lead length of the pins is short, enabling high-speed signal transmission. The package may alternatively be replaced by a Pin Grid Array (PGA), Zero Insertion Force (ZIF) socket, Single Edge Contact Connection (SECC), Land Grid Array (LGA), or the like.
Optionally, the neural network chip 111 and the second substrate 113 are packaged by Flip Chip Ball Grid Array packaging; a schematic diagram of a specific neural network chip package structure is shown in FIG. 6. As shown in FIG. 6, the neural network chip package structure includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.
The pads 22 are connected to the neural network chip 21, and the solder balls 23 are formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, thereby connecting the neural network chip 21 and the second substrate 24 and realizing the packaging of the neural network chip 21.
The pins 26 connect to circuits external to the package structure (for example, the first substrate 13 on the neural network processor board 10), enabling transmission of external and internal data and facilitating data processing by the neural network chip 21 or by the neural network processor corresponding to the neural network chip 21. The present disclosure likewise does not limit the type or number of pins; different pin forms may be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the above-mentioned neural network chip package structure further includes an insulating filler placed in the gaps between the pads 22, the solder balls 23, and the connection points 25, to prevent interference between adjacent solder balls.
The material of the insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the above-mentioned neural network chip package structure further includes a heat dissipation device for dissipating the heat generated by the neural network chip 21 during operation. The heat dissipation device may be a metal sheet with good thermal conductivity, a heat sink, or a cooler, for example, a fan.
For example, as shown in FIG. 6a, the neural network chip package structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal paste 28, and a metal-housing heat sink 29. The thermal paste 28 and the metal-housing heat sink 29 dissipate the heat generated by the neural network chip 21 during operation.
Optionally, the above-mentioned neural network chip package structure 11 further includes a reinforcing structure, connected to the pads 22 and embedded in the solder balls 23, to enhance the connection strength between the solder balls 23 and the pads 22.
The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.
The present disclosure also does not limit the specific form of the first electrical and non-electrical connection device 12; reference may be made to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be attached by soldering, or the second substrate 113 and the first substrate 13 may be connected by connecting wires or in a pluggable manner, facilitating subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit that expands storage capacity, for example: Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR), and the like. Expanding the memory improves the processing capability of the neural network processor.
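As a rough illustration of why an expanded DDR memory interface raises the processing capability mentioned above, the theoretical peak bandwidth of a DDR channel can be estimated from its transfer rate and bus width. The figures below are generic illustrative values, not parameters taken from this disclosure:

```python
def ddr_peak_bandwidth_gbs(transfers_per_sec: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s for a DDR-style memory interface.

    DDR transfers data on both clock edges, so `transfers_per_sec` is
    already twice the clock frequency (e.g. a DDR4-3200 channel performs
    3.2e9 transfers per second).
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_sec * bytes_per_transfer / 1e9

# Generic example: one 64-bit DDR4-3200 channel.
bw = ddr_peak_bandwidth_gbs(3.2e9, 64)
print(f"{bw:.1f} GB/s")  # 25.6 GB/s
```

Widening the bus or adding channels scales this figure linearly, which is why a board-level memory expansion interface can directly improve throughput for memory-bound neural network workloads.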
The first substrate 13 may further include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) bus interface, and the like, for data transmission between the package structure and external circuits, improving computing speed and operational convenience.
The neural network processor is packaged into the neural network chip 111, the neural network chip 111 is packaged into the neural network chip package structure 11, and the neural network chip package structure 11 is packaged into the neural network processor board 10, which exchanges data with external circuits (for example, a computer motherboard) through an interface (a slot or a ferrule) on the board. In other words, the functions of the neural network processor are realized directly by using the neural network processor board 10, which also protects the neural network chip 111. Moreover, other modules may be added to the neural network processor board 10, extending the application range and computing efficiency of the neural network processor.
In one embodiment, the present disclosure discloses an electronic device including the above-mentioned neural network processor board 10 or neural network chip package structure 11.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound instruments, and/or electrocardiographs.
The specific embodiments described above explain the purposes, technical solutions, and beneficial effects of the present disclosure in further detail. It should be understood that the above are only specific embodiments of the present disclosure and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within its scope of protection.
Claims (15)
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711347406.7 | 2017-12-14 | ||
| CN201711347406.7A CN109961134B (en) | 2017-12-14 | 2017-12-14 | Integrated circuit chip device and related product |
| CN201711347406.7 | 2017-12-14 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW201931220A TW201931220A (en) | 2019-08-01 |
| TWI768159B true TWI768159B (en) | 2022-06-21 |
Family
ID=67018575
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW107144034A TWI768159B (en) | 2017-12-14 | 2018-12-07 | Integrated circuit chip apparatus and related product |
Country Status (2)
| Country | Link |
|---|---|
| CN (4) | CN109961134B (en) |
| TW (1) | TWI768159B (en) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109978150A (en) * | 2017-12-27 | 2019-07-05 | 北京中科寒武纪科技有限公司 | Neural network processor board and Related product |
| CN109978147A (en) * | 2017-12-27 | 2019-07-05 | 北京中科寒武纪科技有限公司 | Integrated circuit chip device and Related product |
| CN109978155A (en) * | 2017-12-28 | 2019-07-05 | 北京中科寒武纪科技有限公司 | Integrated circuit chip device and Related product |
| CN113867798B (en) * | 2020-06-30 | 2025-12-02 | 上海寒武纪信息科技有限公司 | Integrated computing devices, integrated circuit chips, circuit boards, and computing methods |
| CN111783972A (en) * | 2020-07-28 | 2020-10-16 | 深圳矽速科技有限公司 | Neural network computing device |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106126481A (en) * | 2016-06-29 | 2016-11-16 | 华为技术有限公司 | A kind of computing engines and electronic equipment |
| TW201706872A (en) * | 2015-05-21 | 2017-02-16 | 咕果公司 | Prefetch weights for use in class-like neural network processors |
| US20170300828A1 (en) * | 2016-04-14 | 2017-10-19 | Yahoo! Inc. | Method and system for distributed machine learning |
Family Cites Families (18)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH0276062A (en) * | 1988-09-12 | 1990-03-15 | Nippon Telegr & Teleph Corp <Ntt> | Constituting method for neural circuit network and neural circuit network |
| JPH064504A (en) * | 1992-06-18 | 1994-01-14 | Matsushita Electric Ind Co Ltd | Neural network circuit |
| EP1102163A3 (en) * | 1999-11-15 | 2005-06-29 | Texas Instruments Incorporated | Microprocessor with improved instruction set architecture |
| GB2471067B (en) * | 2009-06-12 | 2011-11-30 | Graeme Roy Smith | Shared resource multi-thread array processor |
| CN101673645B (en) * | 2009-10-28 | 2012-01-25 | 胡聪娟 | Automatic regulator for rated protection current of circuit breakers |
| US9501276B2 (en) * | 2012-12-31 | 2016-11-22 | Intel Corporation | Instructions and logic to vectorize conditional loops |
| CN103199806B (en) * | 2013-02-04 | 2015-12-02 | 中国科学院电子学研究所 | To the programmable analogue unit of sensor signal process |
| US9766888B2 (en) * | 2014-03-28 | 2017-09-19 | Intel Corporation | Processor instruction to store indexes of source data elements in positions representing a sorted order of the source data elements |
| US20160026912A1 (en) * | 2014-07-22 | 2016-01-28 | Intel Corporation | Weight-shifting mechanism for convolutional neural networks |
| CN104518567B (en) * | 2014-11-26 | 2016-11-23 | 国家电网公司 | A kind of electrical equipment state on-line tracing method |
| US9678749B2 (en) * | 2014-12-22 | 2017-06-13 | Intel Corporation | Instruction and logic for shift-sum multiplier |
| US10489703B2 (en) * | 2015-05-20 | 2019-11-26 | Nec Corporation | Memory efficiency for convolutional neural networks operating on graphics processing units |
| CN106570559A (en) * | 2015-10-09 | 2017-04-19 | 阿里巴巴集团控股有限公司 | Data processing method and device based on neural network |
| CN120893470A (en) * | 2016-04-29 | 2025-11-04 | 中科寒武纪科技股份有限公司 | Device and method for supporting neural network operation of fewer fixed-point numbers |
| CN105956660A (en) * | 2016-05-16 | 2016-09-21 | 浪潮集团有限公司 | Neural network chip realization method used for real-time image identification |
| CN107229967B (en) * | 2016-08-22 | 2021-06-15 | 赛灵思公司 | Hardware accelerator and method for realizing sparse GRU neural network based on FPGA |
| CN106940815B (en) * | 2017-02-13 | 2020-07-28 | 西安交通大学 | A Programmable Convolutional Neural Network Coprocessor IP Core |
| CN107016175B (en) * | 2017-03-23 | 2018-08-31 | 中国科学院计算技术研究所 | It is applicable in the Automation Design method, apparatus and optimization method of neural network processor |
- 2017
- 2017-12-14 CN CN201711347406.7A patent/CN109961134B/en active Active
- 2017-12-14 CN CN201911390541.9A patent/CN111160541B/en active Active
- 2017-12-14 CN CN201911401047.8A patent/CN111126588B/en active Active
- 2017-12-14 CN CN201911335145.6A patent/CN111105033B/en active Active
- 2018
- 2018-12-07 TW TW107144034A patent/TWI768159B/en active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW201706872A (en) * | 2015-05-21 | 2017-02-16 | 咕果公司 | Prefetch weights for use in class-like neural network processors |
| US20170300828A1 (en) * | 2016-04-14 | 2017-10-19 | Yahoo! Inc. | Method and system for distributed machine learning |
| CN106126481A (en) * | 2016-06-29 | 2016-11-16 | 华为技术有限公司 | A kind of computing engines and electronic equipment |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109961134B (en) | 2020-06-23 |
| CN111126588A (en) | 2020-05-08 |
| CN111160541A (en) | 2020-05-15 |
| CN111126588B (en) | 2023-05-23 |
| CN111105033B (en) | 2024-01-12 |
| CN111105033A (en) | 2020-05-05 |
| CN111160541B (en) | 2023-05-19 |
| CN109961134A (en) | 2019-07-02 |
| TW201931220A (en) | 2019-08-01 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11748605B2 (en) | Integrated circuit chip device | |
| TWI768159B (en) | Integrated circuit chip apparatus and related product | |
| TWI793225B (en) | Method for neural network training and related product | |
| WO2019114842A1 (en) | Integrated circuit chip apparatus | |
| TWI791725B (en) | Neural network operation method, integrated circuit chip device and related products | |
| CN111242294B (en) | Integrated circuit chip device and related products | |
| TWI793224B (en) | Integrated circuit chip apparatus and related product | |
| TWI767097B (en) | Integrated circuit chip apparatus and related product | |
| TWI767098B (en) | Method for neural network forward computation and related product | |
| CN110197264B (en) | Neural network processor boards and related products | |
| CN109977446B (en) | Integrated circuit chip device and related product | |
| CN110197267B (en) | Neural network processor boards and related products | |
| TWI795482B (en) | Integrated circuit chip apparatus and related product | |
| CN109978153B (en) | Integrated circuit chip device and related product | |
| TWI768160B (en) | Integrated circuit chip apparatus and related product | |
| CN109978156B (en) | Integrated circuit chip device and related product | |
| CN109977071A (en) | Neural network processor board and Related product | |
| WO2019165946A1 (en) | Integrated circuit chip device, board card and related product | |
| CN109978130A (en) | Integrated circuit chip device and Related product | |
| WO2019165940A1 (en) | Integrated circuit chip apparatus, board card and related product |