WO2021210527A1 - Method for controlling a neural network circuit - Google Patents
Method for controlling a neural network circuit
- Publication number
- WO2021210527A1 (PCT/JP2021/015148)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- circuit
- semaphore
- quantization
- convolution
- memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
- G06F9/526—Mutual exclusion algorithms
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16Y—INFORMATION AND COMMUNICATION TECHNOLOGY SPECIALLY ADAPTED FOR THE INTERNET OF THINGS [IoT]
- G16Y20/00—Information sensed or collected by the things
Definitions
- the present invention relates to a method for controlling a neural network circuit.
- the present application claims priority based on Japanese Patent Application No. 2020-071933 filed in Japan on April 13, 2020, the contents of which are incorporated herein by reference.
- a convolutional neural network has been used as a model for image recognition and the like.
- a convolutional neural network has a multi-layer structure having a convolutional layer and a pooling layer, and requires a large number of operations such as a convolutional operation.
- Various arithmetic methods have been devised to speed up the arithmetic by the convolutional neural network (Patent Document 1 and the like).
- the method for controlling the neural network circuit according to the first aspect of the present invention is a method for controlling a neural network circuit including: a first memory that stores input data; a convolution operation circuit that performs a convolution operation on the input data stored in the first memory; a second memory that stores the convolution operation output data of the convolution operation circuit; a quantization operation circuit that performs a quantization operation on the convolution operation output data stored in the second memory; a second write semaphore that restricts writing to the second memory by the convolution operation circuit; a second read semaphore that restricts reading from the second memory by the quantization operation circuit; a third write semaphore that restricts writing to the first memory by the quantization operation circuit; and a third read semaphore that restricts reading from the first memory by the convolution operation circuit. The method causes the convolution operation circuit to perform the convolution operation based on the third read semaphore and the second write semaphore.
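The claimed semaphore scheme can be pictured as two producer-consumer loops sharing a ring of two memories. The Python sketch below is illustrative only: the buffer depth, the seed data, and the stand-in operations (doubling for convolution, incrementing for quantization) are assumptions, not from the patent; only the four semaphore roles follow the claim.

```python
import threading

DEPTH = 2  # assumed number of buffer slots per memory

first_mem = [None] * DEPTH   # input data / quantization outputs
second_mem = [None] * DEPTH  # convolution outputs

# The four semaphores named in the claim:
third_write = threading.Semaphore(DEPTH)   # free slots in the first memory
third_read = threading.Semaphore(0)        # filled slots in the first memory
second_write = threading.Semaphore(DEPTH)  # free slots in the second memory
second_read = threading.Semaphore(0)       # filled slots in the second memory

results = []

def conv_circuit(n):
    # The convolution runs only when the third read semaphore grants input
    # data and the second write semaphore grants space for its output.
    for i in range(n):
        third_read.acquire()
        second_write.acquire()
        a = first_mem[i % DEPTH]
        second_mem[i % DEPTH] = a * 2   # stand-in for the convolution
        third_write.release()
        second_read.release()

def quant_circuit(n):
    for i in range(n):
        second_read.acquire()
        third_write.acquire()
        f = second_mem[i % DEPTH]
        out = f + 1                     # stand-in for the quantization
        first_mem[(i + 1) % DEPTH] = out
        results.append(out)
        second_write.release()
        third_read.release()

# The host seeds the layer-1 input data, then both circuits run in parallel.
third_write.acquire()
first_mem[0] = 1
third_read.release()

t_conv = threading.Thread(target=conv_circuit, args=(3,))
t_quant = threading.Thread(target=quant_circuit, args=(3,))
t_conv.start(); t_quant.start()
t_conv.join(); t_quant.join()
print(results)  # each layer pair doubles then increments: [3, 7, 15]
```

Because each circuit blocks on the semaphores rather than on the other circuit directly, the two loops need no central scheduler, which mirrors the point of the claim.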
- the neural network circuit control method of the present invention makes it possible to operate, with high performance, a neural network circuit that can be incorporated into an embedded device such as an IoT device.
- FIG. 1 is a diagram showing a convolutional neural network 200 (hereinafter referred to as “CNN200”).
- the calculation performed by the neural network circuit 100 (hereinafter referred to as “NN circuit 100”) according to the first embodiment is at least a part of the learned CNN 200 used at the time of inference.
- the CNN 200 is a multi-layered network including a convolution layer 210 that performs a convolution calculation, a quantization calculation layer 220 that performs a quantization calculation, and an output layer 230. In at least a part of the CNN 200, the convolution layer 210 and the quantization calculation layer 220 are alternately connected.
- the CNN200 is a model widely used for image recognition and video recognition.
- the CNN 200 may further have a layer having other functions such as a fully connected layer.
- FIG. 2 is a diagram illustrating a convolution operation performed by the convolution layer 210.
- the convolution layer 210 performs a convolution operation using the weight w on the input data a.
- the convolution layer 210 performs a product-sum operation with the input data a and the weight w as inputs.
- the input data a (also referred to as activation data or feature map) to the convolution layer 210 is multidimensional data such as image data.
- the input data a is a three-dimensional tensor composed of elements (x, y, c).
- the convolution layer 210 of the CNN 200 performs a convolution operation on the low-bit input data a.
- the element of the input data a is a 2-bit unsigned integer (0,1,2,3).
- the element of the input data a may be, for example, a 4-bit or 8-bit unsigned integer.
- the CNN 200 may further have an input layer that performs type conversion, quantization, or the like in front of the convolution layer 210.
- the weight w (also called a filter or kernel) of the convolution layer 210 is multidimensional data having elements that are learnable parameters.
- the weight w is a four-dimensional tensor composed of elements (i, j, c, d).
- the weight w consists of d three-dimensional tensors, each composed of elements (i, j, c).
- the weight w in the trained CNN 200 is the trained data.
- the convolution layer 210 of the CNN 200 performs a convolution operation using a low bit weight w.
- the element of the weight w is a 1-bit signed integer (0,1), the value "0" represents +1 and the value "1" represents -1.
- the convolution layer 210 performs the convolution operation shown in Equation 1 and outputs the output data f.
- s represents a stride.
- the area shown by the dotted line in FIG. 2 indicates one of the areas ao (hereinafter referred to as “applicable area ao”) to which the weight w is applied to the input data a.
- the elements of the application area ao are represented by (x + i, y + j, c).
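Equation 1 itself is not reproduced in this text. Based on the surrounding definitions (input a(x+i, y+j, c), weight w(i, j, c, d), stride s, and the application area ao), a plausible reading is f(x, y, d) = Σ over i, j, c of a(x·s+i, y·s+j, c)·w(i, j, c, d), sketched below in Python; the kernel shape and example values are assumptions.

```python
from itertools import product

def conv(a, w, stride=1):
    """f(x, y, d) = sum over i, j, c of a(x*s+i, y*s+j, c) * w(i, j, c, d).

    a[x][y][c]: 2-bit unsigned activations; w[i][j][c][d]: weights in {+1, -1}.
    """
    X, Y, C = len(a), len(a[0]), len(a[0][0])
    K = len(w)           # kernel size (assumed square)
    D = len(w[0][0][0])  # number of output channels
    out_x = (X - K) // stride + 1
    out_y = (Y - K) // stride + 1
    f = [[[0] * D for _ in range(out_y)] for _ in range(out_x)]
    for x, y, d in product(range(out_x), range(out_y), range(D)):
        f[x][y][d] = sum(a[x * stride + i][y * stride + j][c] * w[i][j][c][d]
                         for i, j, c in product(range(K), range(K), range(C)))
    return f

# 2x2 input with one channel, a 2x2 all-(+1) kernel, one output channel:
a = [[[1], [2]], [[3], [0]]]
w = [[[[1]], [[1]]], [[[1]], [[1]]]]
print(conv(a, w))  # [[[6]]]
```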
- the quantization calculation layer 220 performs quantization or the like on the output of the convolution calculation output by the convolution layer 210.
- the quantization calculation layer 220 includes a pooling layer 221, a Batch Normalization layer 222, an activation function layer 223, and a quantization layer 224.
- the pooling layer 221 compresses the output data f of the convolution operation output by the convolution layer 210 by performing operations such as average pooling (Equation 2) or MAX pooling (Equation 3) on it.
- in Equations 2 and 3, u denotes the input tensor, v the output tensor, and T the size of the pooling region.
- max is a function that outputs the maximum value of u for the combination of i and j included in T.
- the Batch Normalization layer 222 normalizes the data distribution of the output data of the convolution layer 210 or the pooling layer 221 by, for example, the operation shown in Equation 4.
- in Equation 4, u denotes the input tensor, v the output tensor, α the scale, and β the bias.
- α and β are trained constant vectors.
- the activation function layer 223 applies an activation function such as ReLU (Equation 5) to the output of the convolution layer 210, the pooling layer 221, or the Batch Normalization layer 222.
- u is the input tensor
- v is the output tensor.
- max is a function that outputs the maximum of its arguments.
- the quantization layer 224 performs the quantization of the outputs of the pooling layer 221 and the activation function layer 223, for example, as shown in Equation 6 based on the quantization parameters.
- the quantization shown in Equation 6 reduces the input tensor u to 2 bits.
- q (c) is a vector of quantization parameters.
- q (c) is a trained constant vector.
- the inequality sign "≤" in Equation 6 may be replaced with "<".
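The bodies of Equations 2 through 6 are not reproduced in this text, so the following Python sketch of the quantization calculation layer 220 infers plausible forms from the description: 2×2 MAX pooling, an affine Batch Normalization v = αu + β with trained constants, ReLU, and a three-threshold 2-bit quantization against q(c). All numeric values are illustrative assumptions.

```python
def max_pool2(u):
    """MAX pooling (in the spirit of Equation 3): 2x2 window, stride 2."""
    return [[max(u[2 * i][2 * j], u[2 * i][2 * j + 1],
                 u[2 * i + 1][2 * j], u[2 * i + 1][2 * j + 1])
             for j in range(len(u[0]) // 2)]
            for i in range(len(u) // 2)]

def batch_norm(u, alpha, beta):
    """Batch Normalization (assumed form of Equation 4): v = alpha*u + beta."""
    return alpha * u + beta

def relu(u):
    """ReLU (Equation 5): v = max(0, u)."""
    return max(0, u)

def quantize2(u, q):
    """2-bit quantization (assumed form of Equation 6): count thresholds
    q[0] <= q[1] <= q[2] that u meets, giving a value in {0, 1, 2, 3}."""
    return sum(1 for t in q if u >= t)

# Example: one 4x4 channel through the whole sublayer chain.
u = [[4, 1, 2, 0], [0, 3, 1, 1], [2, 2, 5, 0], [1, 0, 0, 6]]
p = max_pool2(u)  # [[4, 2], [2, 6]]
q_out = [[quantize2(relu(batch_norm(x, 1.0, -2.0)), [0.5, 1.5, 2.5])
          for x in row] for row in p]
print(q_out)  # [[2, 0], [0, 3]]
```

The point of the chain is visible in the result: a 16-element 16-bit-range input is reduced to four 2-bit values, which is what keeps the next convolution layer's load low.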
- the output layer 230 is a layer that outputs the result of CNN200 by an identity function, a softmax function, or the like.
- the layer in front of the output layer 230 may be a convolution layer 210 or a quantization calculation layer 220.
- in the CNN 200, the load of the convolution operation of the convolution layer 210 is smaller than in other convolutional neural networks that do not perform quantization.
- the NN circuit 100 divides the input data of the convolution calculation (Equation 1) of the convolution layer 210 into partial tensors and performs the calculation.
- the method of dividing into partial tensors and the number of divisions are not particularly limited.
- the partial tensor is formed, for example, by dividing the input data a (x + i, y + j, c) into a (x + i, y + j, co).
- the NN circuit 100 can also perform calculations without dividing the input data of the convolution calculation (Equation 1) of the convolution layer 210.
- the variable c in Equation 1 is divided into blocks of size Bc as shown in Equation 7. Further, the variable d in Equation 1 is divided into blocks of size Bd as shown in Equation 8.
- co is the offset and ci is the index from 0 to (Bc-1).
- do is the offset and di is the index from 0 to (Bd-1).
- the size Bc and the size Bd may be the same.
- the input data a (x + i, y + j, c) in Equation 1 is divided by the size Bc in the c-axis direction and is represented by the divided input data a (x + i, y + j, co).
- a (x + i, y + j, co) is also referred to as "divided input data a".
- the weight w (i, j, c, d) in Equation 1 is divided by the size Bc in the c-axis direction and by the size Bd in the d-axis direction, and is represented by the divided weight w (i, j, co, do).
- w (i, j, co, do) is also referred to as "divided weight w".
- the output data f (x, y, do) divided by the size Bd can be obtained by Equation 9.
- the final output data f (x, y, d) can be calculated by combining the divided output data f (x, y, do).
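The effect of Equations 7 through 9 can be checked with a small sketch: splitting the channel index as c = co·Bc + ci and accumulating blockwise partial sums reproduces the full product-sum. The data below are illustrative stand-ins, not from the patent.

```python
Bc = 4
C = 12  # assumed full channel count, a multiple of Bc

a = list(range(C))                               # stand-in input slice along c
w = [1 if i % 2 == 0 else -1 for i in range(C)]  # stand-in +1/-1 weights

# Full product-sum over the undivided channel index c:
full = sum(a[c] * w[c] for c in range(C))

# Blockwise: c = co*Bc + ci with ci in 0..Bc-1 (Equation 7), then the
# partial sums over each block offset co are accumulated (Equation 9):
partial = 0
for co in range(C // Bc):
    partial += sum(a[co * Bc + ci] * w[co * Bc + ci] for ci in range(Bc))

print(full == partial)  # True: block accumulation matches the full sum
```

The same identity applied per (x, y, do) is what lets the circuit compute the divided output data f (x, y, do) independently and combine them afterwards.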
- the NN circuit 100 expands the input data a and the weight w in the convolution operation of the convolution layer 210 to perform the convolution operation.
- FIG. 3 is a diagram illustrating the development of data for the convolution operation.
- the divided input data a (x + i, y + j, co) is expanded into vector data having Bc elements.
- the elements of the divided input data a are indexed by ci (0 ≤ ci < Bc).
- the divided input data a expanded into vector data for each i and j is also referred to as "input vector A".
- the input vector A has, as its elements, the divided input data a (x + i, y + j, co × Bc) to the divided input data a (x + i, y + j, co × Bc + (Bc - 1)).
- the divided weight w (i, j, co, do) is expanded into matrix data having Bc × Bd elements.
- the elements of the divided weight w expanded into the matrix data are indexed by ci and di (0 ≤ di < Bd).
- the divided weight w expanded into matrix data for each i and j is also referred to as "weight matrix W".
- the weight matrix W has, as its elements, the divided weight w (i, j, co × Bc, do × Bd) to the divided weight w (i, j, co × Bc + (Bc - 1), do × Bd + (Bd - 1)).
- Vector data is calculated by multiplying the input vector A and the weight matrix W.
- Output data f (x, y, do) can be obtained by shaping the vector data calculated for each i, j, and co into a three-dimensional tensor. By expanding such data, the convolution operation of the convolution layer 210 can be performed by multiplying the vector data and the matrix data.
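A minimal sketch of this expansion, with assumed sizes Bc = 2 and Bd = 3: for each (i, j, co), the Bc-element input vector A multiplied by the Bc × Bd weight matrix W yields Bd partial outputs, which would then be accumulated over i, j, and co. The example values are illustrative.

```python
Bc, Bd = 2, 3  # assumed block sizes

def vec_mat(A, W):
    """Multiply the input vector A (length Bc) by the weight matrix W
    (Bc x Bd), producing the Bd partial outputs O(di)."""
    return [sum(A[ci] * W[ci][di] for ci in range(Bc)) for di in range(Bd)]

A = [3, 1]                    # 2-bit unsigned activations
W = [[1, -1, 1], [-1, 1, 1]]  # 1-bit signed weights (+1 / -1)
O = vec_mat(A, W)
print(O)  # [2, -2, 4]
```

This is exactly the shape of work the multiplier 42 described later performs in parallel with its Bc × Bd product-sum units.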
- FIG. 4 is a diagram showing an overall configuration of the NN circuit 100 according to the present embodiment.
- the NN circuit 100 includes a first memory 1, a second memory 2, a DMA controller 3 (hereinafter also referred to as "DMAC 3"), a convolution operation circuit 4, a quantization operation circuit 5, and a controller 6.
- the NN circuit 100 is characterized in that the convolution calculation circuit 4 and the quantization calculation circuit 5 are formed in a loop shape via the first memory 1 and the second memory 2.
- the first memory (first memory unit) 1 is a rewritable memory such as a volatile memory composed of, for example, SRAM (Static RAM). Data is written to and read from the first memory 1 via the DMAC 3 and the controller 6.
- the first memory 1 is connected to the input port of the convolution calculation circuit 4, and the convolution calculation circuit 4 can read data from the first memory 1. Further, the first memory 1 is connected to the output port of the quantization calculation circuit 5, and the quantization calculation circuit 5 can write data to the first memory 1.
- the external host CPU can input / output data to / from the NN circuit 100 by writing / reading data to / from the first memory 1.
- the second memory (second memory unit) 2 is a rewritable memory such as a volatile memory composed of, for example, SRAM (Static RAM). Data is written to and read from the second memory 2 via the DMAC 3 and the controller 6.
- the second memory 2 is connected to the input port of the quantization calculation circuit 5, and the quantization calculation circuit 5 can read data from the second memory 2. Further, the second memory 2 is connected to the output port of the convolution calculation circuit 4, and the convolution calculation circuit 4 can write data to the second memory 2.
- the external host CPU can input / output data to / from the NN circuit 100 by writing / reading data to / from the second memory 2.
- the DMAC 3 is connected to the external bus EB and transfers data between an external memory such as a DRAM and the first memory 1. Further, the DMAC 3 transfers data between an external memory such as a DRAM and the second memory 2. Further, the DMAC 3 transfers data between an external memory such as a DRAM and the convolution calculation circuit 4. Further, the DMAC 3 transfers data between an external memory such as a DRAM and the quantization calculation circuit 5.
- the convolution calculation circuit 4 is a circuit that performs a convolution calculation in the convolution layer 210 of the trained CNN 200.
- the convolution calculation circuit 4 reads the input data a stored in the first memory 1 and performs a convolution calculation on the input data a.
- the convolution operation circuit 4 writes the output data f of the convolution operation (hereinafter, also referred to as “convolution operation output data”) to the second memory 2.
- the quantization calculation circuit 5 is a circuit that performs at least a part of the quantization calculation in the quantization calculation layer 220 of the trained CNN200.
- the quantization operation circuit 5 reads out the output data f of the convolution operation stored in the second memory 2 and performs on it at least the quantization among the operations of the quantization calculation (pooling, Batch Normalization, activation function, and quantization).
- the quantization calculation circuit 5 writes the output data of the quantization calculation (hereinafter, also referred to as “quantization calculation output data”) to the first memory 1.
- the controller 6 is connected to the external bus EB and operates as a slave of the external host CPU.
- the controller 6 has a register 61 including a parameter register and a status register.
- the parameter register is a register that controls the operation of the NN circuit 100.
- the status register is a register indicating the state of the NN circuit 100 including the semaphore S.
- the external host CPU can access the register 61 via the controller 6.
- the controller 6 is connected to the first memory 1, the second memory 2, the DMAC 3, the convolution calculation circuit 4, and the quantization calculation circuit 5 via the internal bus IB.
- the external host CPU can access each block via the controller 6. For example, the external host CPU can issue instructions to the DMAC 3, the convolution calculation circuit 4, and the quantization calculation circuit 5 via the controller 6. Further, the DMAC 3, the convolution calculation circuit 4, and the quantization calculation circuit 5 can update the status registers (including the semaphore S) of the controller 6 via the internal bus IB.
- the state register (including the semaphore S) may be configured to be updated via a dedicated wiring connected to the DMAC 3, the convolution calculation circuit 4, and the quantization calculation circuit 5.
- because the NN circuit 100 has the first memory 1 and the second memory 2, the number of transfers of duplicate data by the DMAC 3 from an external memory such as a DRAM can be reduced. As a result, the power consumption caused by memory access can be significantly reduced.
- FIG. 5 is a timing chart showing an operation example of the NN circuit 100.
- the DMAC 3 stores the input data a of the layer 1 in the first memory 1.
- the DMAC 3 may divide the input data a of the layer 1 and transfer it to the first memory 1 according to the order of the convolution operations performed by the convolution operation circuit 4.
- the convolution calculation circuit 4 reads the input data a of the layer 1 stored in the first memory 1.
- the convolution calculation circuit 4 performs the convolution calculation of layer 1 shown in FIG. 1 with respect to the input data a of layer 1.
- the output data f of the layer 1 convolution operation is stored in the second memory 2.
- the quantization calculation circuit 5 reads out the output data f of the layer 1 stored in the second memory 2.
- the quantization calculation circuit 5 performs a layer 2 quantization calculation on the output data f of the layer 1.
- the output data of the layer 2 quantization operation is stored in the first memory 1.
- the convolution operation circuit 4 reads out the output data of the quantization operation of the layer 2 stored in the first memory 1.
- the convolution calculation circuit 4 performs the convolution calculation of the layer 3 with the output data of the quantization calculation of the layer 2 as the input data a.
- the output data f of the layer 3 convolution operation is stored in the second memory 2.
- the convolution operation circuit 4 reads out the output data of the quantization operation of layer 2M-2 (M is a natural number) stored in the first memory 1.
- the convolution operation circuit 4 performs the convolution operation of the layer 2M-1 by using the output data of the quantization operation of the layer 2M-2 as the input data a.
- the output data f of the convolution operation of the layer 2M-1 is stored in the second memory 2.
- the quantization calculation circuit 5 reads out the output data f of the layer 2M-1 stored in the second memory 2.
- the quantization calculation circuit 5 performs a layer 2M quantization calculation on the output data f of the 2M-1 layer.
- the output data of the layer 2M quantization operation is stored in the first memory 1.
- the convolution operation circuit 4 reads out the output data of the layer 2M quantization operation stored in the first memory 1.
- the convolution calculation circuit 4 performs the convolution operation of the layer 2M + 1 by using the output data of the quantization operation of the layer 2M as the input data a.
- the output data f of the convolution operation of the layer 2M + 1 is stored in the second memory 2.
- the convolution calculation circuit 4 and the quantization calculation circuit 5 alternately perform calculations, and proceed with the calculation of CNN200 shown in FIG.
- the convolution calculation circuit 4 performs the convolution calculation of the layer 2M-1 and the layer 2M + 1 by time division.
- the quantization calculation circuit 5 performs the quantization calculation of the layer 2M-2 and the layer 2M by time division. Therefore, the circuit scale of the NN circuit 100 is significantly smaller than that in the case where the convolution calculation circuit 4 and the quantization calculation circuit 5 are separately mounted for each layer.
- the NN circuit 100 calculates the operation of the CNN 200, which is a multi-layer structure of a plurality of layers, by a circuit formed in a loop shape.
- the NN circuit 100 can efficiently use hardware resources due to the loop-shaped circuit configuration.
- the parameters in the convolution calculation circuit 4 and the quantization calculation circuit 5 that change in each layer are updated as appropriate.
- the NN circuit 100 transfers intermediate data to an external calculation device such as an external host CPU. After the external calculation device performs the calculation on the intermediate data, the calculation result by the external calculation device is input to the first memory 1 and the second memory 2. The NN circuit 100 restarts the calculation on the calculation result by the external calculation device.
- FIG. 6 is a timing chart showing another operation example of the NN circuit 100.
- the NN circuit 100 may divide the input data a into partial tensors and perform operations on the partial tensors by time division.
- the method of dividing into partial tensors and the number of divisions are not particularly limited.
- FIG. 6 shows an operation example when the input data a is decomposed into two partial tensors.
- the decomposed partial tensors are referred to as the "first partial tensor a1" and the "second partial tensor a2".
- the convolution operation of layer 2M-1 is decomposed into a convolution operation corresponding to the first partial tensor a1 (denoted as "layer 2M-1 (a1)" in FIG. 6) and a convolution operation corresponding to the second partial tensor a2 (denoted as "layer 2M-1 (a2)" in FIG. 6).
- the convolution operation and the quantization operation corresponding to the first partial tensor a1 and the convolution operation and the quantization operation corresponding to the second partial tensor a2 can be performed independently.
- the convolution operation circuit 4 performs the convolution operation of layer 2M-1 corresponding to the first partial tensor a1 (the operation denoted as layer 2M-1 (a1) in FIG. 6). Then, the convolution operation circuit 4 performs the convolution operation of layer 2M-1 corresponding to the second partial tensor a2 (the operation denoted as layer 2M-1 (a2) in FIG. 6).
- the quantization operation circuit 5 performs the quantization operation of layer 2M corresponding to the first partial tensor a1 (the operation denoted as layer 2M (a1) in FIG. 6). In this way, the NN circuit 100 can perform the convolution operation of layer 2M-1 corresponding to the second partial tensor a2 and the quantization operation of layer 2M corresponding to the first partial tensor a1 in parallel.
- the convolution operation circuit 4 performs the convolution operation of layer 2M+1 corresponding to the first partial tensor a1 (the operation denoted as layer 2M+1 (a1) in FIG. 6).
- the quantization operation circuit 5 performs the quantization operation of layer 2M corresponding to the second partial tensor a2 (the operation denoted as layer 2M (a2) in FIG. 6).
- the NN circuit 100 can perform the convolution operation of layer 2M+1 corresponding to the first partial tensor a1 and the quantization operation of layer 2M corresponding to the second partial tensor a2 in parallel.
- the convolution operation and the quantization operation corresponding to the first partial tensor a1 and the convolution operation and the quantization operation corresponding to the second partial tensor a2 can be performed independently. Therefore, the NN circuit 100 may, for example, perform the convolution operation of layer 2M-1 corresponding to the first partial tensor a1 and the quantization operation of layer 2M+2 corresponding to the second partial tensor a2 in parallel. That is, the convolution operation and the quantization operation performed in parallel by the NN circuit 100 are not limited to the operations of consecutive layers.
- the NN circuit 100 can operate the convolution operation circuit 4 and the quantization operation circuit 5 in parallel. As a result, the waiting time of the convolution calculation circuit 4 and the quantization calculation circuit 5 is reduced, and the calculation processing efficiency of the NN circuit 100 is improved.
- in the above example the number of divisions was two, but the NN circuit 100 can likewise operate the convolution operation circuit 4 and the quantization operation circuit 5 in parallel when the number of divisions is larger than two.
- for example, the NN circuit 100 may perform the convolution operation of layer 2M-1 corresponding to the second partial tensor a2 and the quantization operation of layer 2M corresponding to the third partial tensor a3 in parallel.
- the order of operations is appropriately changed depending on the storage status of the input data a in the first memory 1 and the second memory 2.
- FIG. 6 shows an example (method 1) in which the convolution operation circuit 4 or the quantization operation circuit 5 performs the operations of the partial tensors in the same layer before performing the operations of the partial tensors in the next layer.
- for example, the convolution operations of layer 2M-1 corresponding to the first partial tensor a1 and the second partial tensor a2 (the operations denoted as layer 2M-1 (a1) and layer 2M-1 (a2) in FIG. 6) are performed, and then the convolution operations of layer 2M+1 corresponding to the first partial tensor a1 and the second partial tensor a2 (the operations denoted as layer 2M+1 (a1) and layer 2M+1 (a2) in FIG. 6) are performed.
- the calculation method for the partial tensor is not limited to this.
- the calculation method for the partial tensors may instead be a method in which the operations of one partial tensor are carried through a plurality of layers before the operations of the remaining partial tensors are performed (method 2). For example, the convolution operation circuit 4 may perform the convolution operation of layer 2M-1 corresponding to the first partial tensor a1 and then the convolution operation of layer 2M+1 corresponding to the first partial tensor a1, and thereafter perform the convolution operation of layer 2M-1 corresponding to the second partial tensor a2 and the convolution operation of layer 2M+1 corresponding to the second partial tensor a2.
- the calculation method for the partial tensors may also combine method 1 and method 2.
- in method 2, the operations must be performed in accordance with the dependencies in the operation order of the partial tensors.
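The difference between the two orderings can be sketched for the convolution operation circuit with two partial tensors. The layer numbers and tensor names below are illustrative stand-ins for layers 2M-1 and 2M+1 and the tensors a1 and a2, not from the patent.

```python
layers = [1, 3]          # stand-ins for convolution layers 2M-1 and 2M+1
tensors = ["a1", "a2"]   # the two partial tensors

# Method 1: finish all partial tensors of one layer before the next layer.
method1 = [(layer, t) for layer in layers for t in tensors]

# Method 2: carry one partial tensor through several layers first.
method2 = [(layer, t) for t in tensors for layer in layers]

print(method1)  # [(1, 'a1'), (1, 'a2'), (3, 'a1'), (3, 'a2')]
print(method2)  # [(1, 'a1'), (3, 'a1'), (1, 'a2'), (3, 'a2')]
```

Both orderings respect the per-tensor dependency (layer 1 before layer 3 for the same tensor); method 2 additionally requires checking that dependency explicitly, since the circuit no longer drains a whole layer before moving on.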
- FIG. 7 is an internal block diagram of the DMAC3.
- the DMAC 3 has a data transfer circuit 31 and a state controller 32.
- the DMAC 3 has a state controller 32 dedicated to the data transfer circuit 31, and when an instruction command is input, the DMA data can be transferred without the need for an external controller.
- the data transfer circuit 31 is connected to the external bus EB and transfers DMA data between an external memory such as a DRAM and the first memory 1. Further, the data transfer circuit 31 transfers DMA data between an external memory such as a DRAM and the second memory 2. Further, the data transfer circuit 31 transfers data between an external memory such as a DRAM and the convolution calculation circuit 4. Further, the data transfer circuit 31 transfers data between an external memory such as a DRAM and the quantization calculation circuit 5.
- the number of DMA channels in the data transfer circuit 31 is not limited. For example, each of the first memory 1 and the second memory 2 may have a dedicated DMA channel.
- the state controller 32 controls the state of the data transfer circuit 31. Further, the state controller 32 is connected to the controller 6 via the internal bus IB.
- the state controller 32 has an instruction queue 33 and a control circuit 34.
- the instruction queue 33 is a queue in which the instruction command C3 for DMAC3 is stored, and is composed of, for example, a FIFO memory. One or more instruction commands C3 are written to the instruction queue 33 via the internal bus IB.
- the control circuit 34 is a state machine that decodes the instruction command C3 and sequentially controls the data transfer circuit 31 based on the instruction command C3.
- the control circuit 34 may be implemented by a logic circuit or by a CPU controlled by software.
- FIG. 8 is a state transition diagram of the control circuit 34.
- the control circuit 34 transitions from the idle state ST1 to the decode state ST2.
- the control circuit 34 decodes the instruction command C3 output from the instruction queue 33 in the decode state ST2. Further, the control circuit 34 reads the semaphore S stored in the register 61 of the controller 6 and determines whether or not the operation of the data transfer circuit 31 instructed by the instruction command C3 can be executed. If it is not executable (Not ready), the control circuit 34 waits until it becomes executable (Wait). If it is executable, the control circuit 34 transitions from the decode state ST2 to the execution state ST3.
- the control circuit 34 controls the data transfer circuit 31 in the execution state ST3 to cause the data transfer circuit 31 to perform the operation instructed by the instruction command C3.
- the control circuit 34 removes the executed instruction command C3 from the instruction queue 33 and updates the semaphore S stored in the register 61 of the controller 6.
- the control circuit 34 transitions from the execution state ST3 to the decode state ST2.
- the control circuit 34 transitions from the execution state ST3 to the idle state ST1.
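The DECODE-state semaphore check and the Wait loop described above can be sketched as a small state machine. This is a simplification for illustration: the states mirror FIG. 8, but the per-instruction "ready" flags stand in for reading the semaphore S from the register 61, and a not-ready flag is modeled as becoming ready after one wait cycle.

```python
from collections import deque

def run_state_machine(instructions, semaphore_ready):
    """Trace the control circuit's states for a queue of instruction
    commands; semaphore_ready holds one readiness flag per instruction."""
    trace, state = [], "IDLE"
    queue = deque(instructions)
    ready = deque(semaphore_ready)
    while True:
        trace.append(state)
        if state == "IDLE":
            if not queue:
                break
            state = "DECODE"           # an instruction command was input
        elif state == "DECODE":
            if not ready[0]:
                trace.append("WAIT")   # Not ready: wait, then recheck
                ready[0] = True        # semaphore becomes available
            else:
                state = "EXECUTE"
        elif state == "EXECUTE":
            queue.popleft()            # remove executed command from queue
            ready.popleft()            # ...and update the semaphore
            state = "DECODE" if queue else "IDLE"
    return trace

trace = run_state_machine(["C3-1", "C3-2"], [True, False])
print(trace)
```

The second instruction triggers the Wait branch once before executing, so the trace shows the decode-wait-decode loop the text describes.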
- FIG. 9 is an internal block diagram of the convolution operation circuit 4.
- the convolution operation circuit 4 includes a weight memory 41, a multiplier 42, an accumulator circuit 43, and a state controller 44.
- the convolution operation circuit 4 has a state controller 44 dedicated to the multiplier 42 and the accumulator circuit 43, and when an instruction command is input, the convolution operation can be performed without the need for an external controller.
- the weight memory 41 is a memory in which the weight w used for the convolution operation is stored, and is a rewritable memory such as a volatile memory composed of, for example, SRAM (Static RAM).
- the DMAC 3 writes the weight w required for the convolution operation to the weight memory 41 by DMA transfer.
- FIG. 10 is an internal block diagram of the multiplier 42.
- the multiplier 42 multiplies the input vector A and the weight matrix W.
- the input vector A is vector data obtained by expanding the divided input data a (x + i, y + j, co) into Bc elements for each i and j.
- the weight matrix W is matrix data obtained by expanding the division weights w (i, j, co, do) into Bc × Bd elements for each i and j.
- the multiplier 42 has Bc × Bd product-sum calculation units 47, and can perform the multiplication of the input vector A and the weight matrix W in parallel.
- the multiplier 42 reads the input vector A and the weight matrix W required for multiplication from the first memory 1 and the weight memory 41 and performs multiplication.
- the multiplier 42 outputs Bd product-sum operation results O (di).
- FIG. 11 is an internal block diagram of the product-sum calculation unit 47.
- the product-sum calculation unit 47 multiplies the element A (ci) of the input vector A and the element W (ci, di) of the weight matrix W. Further, the product-sum calculation unit 47 adds the multiplication result and the multiplication result S (ci, di) of another product-sum calculation unit 47.
- the product-sum calculation unit 47 outputs the addition result S (ci + 1, di).
- Element A (ci) is a 2-bit unsigned integer (0,1,2,3).
- the element W (ci, di) is a 1-bit signed integer (0,1), where the value "0" represents +1 and the value "1" represents -1.
- the product-sum calculation unit 47 has an inversion device (inverter) 47a, a selector 47b, and an adder 47c.
- the multiply-accumulate unit 47 does not use a multiplier, but uses only the inverter 47a and the selector 47b to perform multiplication.
- the selector 47b selects the input of the element A (ci) when the element W (ci, di) is "0". When the element W (ci, di) is "1", the selector 47b selects the complement of the element A (ci) inverted by the inverter 47a.
- the element W (ci, di) is also input to the Carry-in of the adder 47c.
- the adder 47c outputs a value obtained by adding the element A (ci) to the S (ci, di) when the element W (ci, di) is “0”.
- the adder 47c outputs a value obtained by subtracting the element A (ci) from S (ci, di) when W (ci, di) is “1”.
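The inverter/selector/carry-in arrangement described above is the standard two's-complement identity -A = (~A + 1): feeding the inverted A and a carry-in of 1 into the adder subtracts A without a dedicated multiplier. A hedged Python model of the product-sum calculation unit 47 (the 16-bit accumulator width is an assumption; only the 2-bit element and 1-bit weight widths come from the text):

```python
# Illustrative model of product-sum calculation unit 47: multiplication by a
# 1-bit weight (0 -> +1, 1 -> -1) is done with only an inverter, a selector,
# and the adder's carry-in, since -A equals (~A + 1) in two's complement.
WIDTH = 16                        # assumed accumulator width
MASK = (1 << WIDTH) - 1

def mac_unit(a, w, s):
    """a: 2-bit unsigned element A(ci); w: 1-bit weight W(ci,di);
    s: running sum S(ci,di). Returns S(ci+1,di)."""
    operand = a if w == 0 else (~a & MASK)  # selector 47b picks A or ~A
    carry_in = w                            # W(ci,di) also feeds Carry-in
    return (s + operand + carry_in) & MASK  # adder 47c

def to_signed(x):
    """Interpret a WIDTH-bit value as a signed integer."""
    return x - (1 << WIDTH) if x & (1 << (WIDTH - 1)) else x

s = mac_unit(3, 0, 0)   # w = 0: adds A(ci)       -> running sum 3
s = mac_unit(2, 1, s)   # w = 1: subtracts A(ci)  -> running sum 1
```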
- FIG. 12 is an internal block diagram of the accumulator circuit 43.
- the accumulator circuit 43 accumulates the product-sum calculation result O (di) of the multiplier 42 into the second memory 2.
- the accumulator circuit 43 has Bd accumulator units 48, and can accumulate Bd product-sum calculation results O (di) in parallel in the second memory 2.
- FIG. 13 is an internal block diagram of the accumulator unit 48.
- the accumulator unit 48 has an adder 48a and a mask unit 48b.
- the adder 48a adds the element O (di) of the product-sum calculation result O to the partial sum stored in the second memory 2, which is an intermediate result of the convolution operation shown in Equation 1.
- the addition result is 16 bits per element.
- the addition result is not limited to 16 bits per element, and may be, for example, 15 bits or 17 bits per element.
- the adder 48a writes the addition result to the same address in the second memory 2.
- when the initialization signal clear is asserted, the mask unit 48b masks the output from the second memory 2 and sets the value added to the element O (di) to zero.
- the initialization signal clear is asserted when no in-progress partial sum is stored in the second memory 2.
- the output data f (x, y, do) is stored in the second memory.
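The accumulate-with-clear behavior of the accumulator unit 48 can be sketched as follows; this is a hedged model in which the second memory 2 is a dict keyed by address (an assumption), and only the 16-bit element width follows the text above.

```python
# Hedged Python model of accumulator unit 48.
def accumulate(memory, addr, o_di, clear):
    """Add element O(di) to the partial sum held at `addr` in the second
    memory. When the initialization signal `clear` is asserted (no partial
    sum in progress), the mask unit 48b zeroes the memory output so that
    O(di) is stored as-is."""
    partial = 0 if clear else memory.get(addr, 0)  # mask unit 48b
    memory[addr] = (partial + o_di) & 0xFFFF       # adder 48a, 16 bits/element
    return memory[addr]

mem = {}
accumulate(mem, 0x10, 5, clear=True)   # first pass initializes the partial sum
accumulate(mem, 0x10, 7, clear=False)  # later passes accumulate: 5 + 7
```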
- the state controller 44 controls the states of the multiplier 42 and the accumulator circuit 43. Further, the state controller 44 is connected to the controller 6 via the internal bus IB.
- the state controller 44 has an instruction queue 45 and a control circuit 46.
- the instruction queue 45 is a queue in which the instruction command C4 for the convolution operation circuit 4 is stored, and is composed of, for example, a FIFO memory.
- the instruction command C4 is written in the instruction queue 45 via the internal bus IB.
- the control circuit 46 is a state machine that decodes the instruction command C4 and controls the multiplier 42 and the accumulator circuit 43 based on the instruction command C4.
- the control circuit 46 has the same configuration as the control circuit 34 of the state controller 32 of the DMAC3.
- FIG. 14 is an internal block diagram of the quantization calculation circuit 5.
- the quantization calculation circuit 5 includes a quantization parameter memory 51, a vector calculation circuit 52, a quantization circuit 53, and a state controller 54. Since it has a state controller 54 dedicated to the vector calculation circuit 52 and the quantization circuit 53, the quantization operation can be performed without an external controller when an instruction command is input.
- the quantization parameter memory 51 is a memory in which the quantization parameter q used for the quantization operation is stored, and is a rewritable memory such as a volatile memory composed of, for example, SRAM (Static RAM).
- the DMAC3 writes the quantization parameter q required for the quantization operation into the quantization parameter memory 51 by DMA transfer.
- FIG. 15 is an internal block diagram of the vector calculation circuit 52 and the quantization circuit 53.
- the vector calculation circuit 52 performs a calculation on the output data f (x, y, do) stored in the second memory 2.
- the vector calculation circuit 52 has Bd calculation units 57, and performs SIMD calculations on the output data f (x, y, do) in parallel.
- FIG. 16 is a block diagram of the arithmetic unit 57.
- the arithmetic unit 57 has, for example, an ALU 57a, a first selector 57b, a second selector 57c, a register 57d, and a shifter 57e.
- the arithmetic unit 57 may further include other arithmetic units and the like included in the known general-purpose SIMD arithmetic circuit.
- the vector calculation circuit 52 combines the arithmetic units and the like of the calculation unit 57 to perform, on the output data f (x, y, do), at least one of the operations of the pooling layer 221, the Batch Normalization layer 222, and the activation function layer 223 in the quantization calculation layer 220.
- the arithmetic unit 57 can add the data stored in the register 57d and the element f (di) of the output data f (x, y, do) read from the second memory 2 by the ALU 57a.
- the arithmetic unit 57 can store the addition result by the ALU 57a in the register 57d.
- the calculation unit 57 can initialize the addition result by using the first selector 57b to input "0" to the ALU 57a instead of the data stored in the register 57d. For example, when the pooling area is 2 × 2, the shifter 57e can output the average value of the addition results by shifting the output of the ALU 57a to the right by 2 bits.
- the vector calculation circuit 52 can perform the average pooling calculation shown in Equation 2 by repeating the above calculation and the like by the Bd calculation units 57.
- the arithmetic unit 57 can compare the data stored in the register 57d with the element f (di) of the output data f (x, y, do) read from the second memory 2 by the ALU 57a.
- the arithmetic unit 57 controls the second selector 57c according to the comparison result by the ALU 57a, and can select the larger of the data stored in the register 57d and the element f (di).
- the calculation unit 57 can initialize the comparison target to the minimum value by inputting the minimum value of the possible values of the element f (di) into the ALU 57a by selecting the first selector 57b.
- the vector calculation circuit 52 can perform the MAX pooling calculation of Equation 3 by repeating the above calculation and the like by the Bd calculation units 57. In the MAX pooling operation, the shifter 57e does not shift the output of the second selector 57c.
- the arithmetic unit 57 can perform subtraction by the ALU 57a between the data stored in the register 57d and the element f (di) of the output data f (x, y, do) read from the second memory 2.
- the shifter 57e can shift the output of the ALU 57a to the left (ie, multiply) or to the right (ie, divide).
- the vector calculation circuit 52 can perform the Batch Normalization calculation of Equation 4 by repeating the above calculation and the like by the Bd calculation units 57.
- the arithmetic unit 57 can compare the element f (di) of the output data f (x, y, do) read from the second memory 2 with the “0” selected by the first selector 57b by the ALU 57a.
- the arithmetic unit 57 can select and output either the element f (di) or the constant value "0" stored in the register 57d in advance according to the comparison result by the ALU 57a.
- the vector calculation circuit 52 can perform the ReLU calculation of Equation 5 by repeating the above calculation and the like by the Bd calculation units 57.
- the vector calculation circuit 52 can perform average pooling, MAX pooling, Batch Normalization, activation function calculation, and a combination of these calculations. Since the vector calculation circuit 52 can perform a general-purpose SIMD calculation, other calculations necessary for the calculation in the quantization calculation layer 220 may be performed. Further, the vector calculation circuit 52 may perform a calculation other than the calculation in the quantization calculation layer 220.
- the quantization calculation circuit 5 does not have to have the vector calculation circuit 52; in that case, the output data f (x, y, do) is input directly to the quantization circuit 53.
- the quantization circuit 53 quantizes the output data of the vector calculation circuit 52. As shown in FIG. 15, the quantization circuit 53 has Bd quantization units 58, and performs operations in parallel with respect to the output data of the vector operation circuit 52.
- FIG. 17 is an internal block diagram of the quantization unit 58.
- the quantization unit 58 quantizes the element in (di) of the output data of the vector calculation circuit 52.
- the quantization unit 58 includes a comparator 58a and an encoder 58b.
- the quantization unit 58 performs the operation (Equation 6) of the quantization layer 224 in the quantization operation layer 220 on the output data (16 bits / element) of the vector calculation circuit 52.
- the quantization unit 58 reads out the necessary quantization parameters q (th0, th1, th2) from the quantization parameter memory 51, and compares the input in (di) with the quantization parameter q by the comparator 58a.
- the quantization unit 58 quantizes the comparison result of the comparator 58a into 2 bits / element by the encoder 58b. Since α (c) and β (c) in Equation 4 are parameters that differ for each variable c, the quantization parameters q (th0, th1, th2) reflecting α (c) and β (c) also differ for each in (di).
- by comparing the input in (di) with the three thresholds th0, th1, and th2, the quantization unit 58 classifies the input in (di) into four regions (for example, in ≤ th0, th0 < in ≤ th1, th1 < in ≤ th2, th2 < in), encodes the classification result into 2 bits, and outputs it.
- the quantization unit 58 can also perform Batch Normalization and activation function calculations in addition to quantization.
- the quantization unit 58 performs quantization by setting the threshold th0 to β (c) of Equation 4 and the threshold differences (th1 - th0) and (th2 - th1) to α (c) of Equation 4.
- the Batch Normalization operation shown in Equation 4 can be performed together with the quantization.
- α (c) can be reduced by increasing (th1 - th0) and (th2 - th1), and can be increased by reducing (th1 - th0) and (th2 - th1).
- the quantization unit 58 can perform the ReLU operation of the activation function together with the quantization of the input in (di). For example, the quantization unit 58 saturates the output value in the regions where in (di) ≤ th0 and th2 < in (di). The quantization unit 58 can perform the operation of the activation function together with the quantization by setting the quantization parameter q so that the output is non-linear.
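The three-threshold classification can be sketched as follows; the threshold values used are hypothetical, and the mapping of the four regions to the 2-bit codes 0b00..0b11 is an assumed encoder choice (the text only states that the comparison result is encoded into 2 bits).

```python
# Sketch of quantization unit 58: three comparators (58a) classify in(di)
# into four regions, and the encoder (58b) emits a 2-bit code.
# Regions: in <= th0, th0 < in <= th1, th1 < in <= th2, th2 < in.
# Setting th0 = beta(c) and equal steps (th1-th0) = (th2-th1) related to
# alpha(c) folds the Batch Normalization of Equation 4 into the thresholds;
# saturation outside [th0, th2] acts like the activation function.
def quantize(in_di, th0, th1, th2):
    if in_di <= th0:
        return 0b00              # saturated low region
    elif in_di <= th1:
        return 0b01
    elif in_di <= th2:
        return 0b10
    return 0b11                  # saturated high region

# Hypothetical parameters: beta = 10, step = 4 -> thresholds 10, 14, 18.
codes = [quantize(x, 10, 14, 18) for x in (3, 11, 16, 40)]
```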
- the state controller 54 controls the states of the vector calculation circuit 52 and the quantization circuit 53. Further, the state controller 54 is connected to the controller 6 via the internal bus IB.
- the state controller 54 has an instruction queue 55 and a control circuit 56.
- the instruction queue 55 is a queue in which the instruction command C5 for the quantization calculation circuit 5 is stored, and is composed of, for example, a FIFO memory.
- the instruction command C5 is written in the instruction queue 55 via the internal bus IB.
- the control circuit 56 is a state machine that decodes the instruction command C5 and controls the vector operation circuit 52 and the quantization circuit 53 based on the instruction command C5.
- the control circuit 56 has the same configuration as the control circuit 34 of the state controller 32 of the DMAC3.
- the quantization calculation circuit 5 writes the quantization calculation output data having Bd elements to the first memory 1.
- the preferable relationship between Bd and Bc is shown in Equation 10.
- in Equation 10, n is an integer.
- the controller 6 transfers the instruction commands transferred from the external host CPU to the instruction queues of the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5.
- the controller 6 may have an instruction memory for storing instruction commands for each circuit.
- the controller 6 is connected to the external bus EB and operates as a slave of the external host CPU.
- the controller 6 has a register 61 including a parameter register and a status register.
- the parameter register is a register that controls the operation of the NN circuit 100.
- the status register is a register indicating the state of the NN circuit 100 including the semaphore S.
- FIG. 18 is a diagram illustrating control of the NN circuit 100 by the semaphore S.
- the semaphore S has a first semaphore S1, a second semaphore S2, and a third semaphore S3.
- the semaphore S is decremented by the P operation and incremented by the V operation.
- the P operation and V operation by the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5 update the semaphore S of the controller 6 via the internal bus IB.
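The P/V semantics described above can be modeled minimally as a counting semaphore; the class and method names below are hypothetical, and the "waiting" circuit is modeled as a P operation that simply reports Not ready instead of blocking.

```python
# Minimal model of the semaphore S update: P decrements only when the count
# is non-zero (otherwise the circuit waits), V increments.
class Semaphore:
    def __init__(self, count):
        self.count = count

    def try_p(self):
        """P operation: returns False ("Not ready") when the count is 0,
        otherwise decrements and returns True."""
        if self.count == 0:
            return False
        self.count -= 1
        return True

    def v(self):
        """V operation: signals that a memory area has changed state."""
        self.count += 1

s1w = Semaphore(2)       # e.g. two writable areas in the first memory 1
assert s1w.try_p()       # DMAC 3 claims an area and starts a DMA transfer
```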
- the first semaphore S1 is used to control the first data flow F1.
- the first data flow F1 is a data flow in which the DMAC3 (Producer) writes the input data a to the first memory 1 and the convolution calculation circuit 4 (Consumer) reads the input data a.
- the first semaphore S1 has a first write semaphore S1W and a first read semaphore S1R.
- the second semaphore S2 is used to control the second data flow F2.
- the second data flow F2 is a data flow in which the convolution calculation circuit 4 (Producer) writes the output data f to the second memory 2, and the quantization calculation circuit 5 (Consumer) reads the output data f.
- the second semaphore S2 has a second write semaphore S2W and a second read semaphore S2R.
- the third semaphore S3 is used to control the third data flow F3.
- the third data flow F3 is a data flow in which the quantization calculation circuit 5 (Producer) writes the quantization calculation output data to the first memory 1, and the convolution calculation circuit 4 (Consumer) reads the quantization calculation output data of the quantization calculation circuit 5.
- the third semaphore S3 has a third write semaphore S3W and a third read semaphore S3R.
- FIG. 19 is a timing chart of the first data flow F1.
- the first write semaphore S1W is a semaphore that limits writing to the first memory 1 by DMAC3 in the first data flow F1.
- the first write semaphore S1W indicates the number of memory areas in the first memory 1 that can store data of a predetermined size, such as the input vector A, from which the data has already been read and to which other data can be written. When the first write semaphore S1W is "0", the DMAC 3 cannot write to the first memory 1 in the first data flow F1, and waits until the first write semaphore S1W becomes "1" or more.
- the first read semaphore S1R is a semaphore that limits reading from the first memory 1 by the convolution operation circuit 4 in the first data flow F1.
- the first read semaphore S1R indicates the number of memory areas in the first memory 1 that can store data of a predetermined size, such as the input vector A, in which the data has been written and can be read.
- when the first read semaphore S1R is "0", the convolution arithmetic circuit 4 cannot read from the first memory 1 in the first data flow F1, and waits until the first read semaphore S1R becomes "1" or more.
- the DMAC 3 starts the DMA transfer when the instruction command C3 is stored in the instruction queue 33. As shown in FIG. 19, since the first write semaphore S1W is not "0", the DMAC 3 starts the DMA transfer (DMA transfer 1). The DMAC 3 performs a P operation on the first write semaphore S1W when starting the DMA transfer. After the DMA transfer is completed, the DMAC 3 performs a V operation on the first read semaphore S1R.
- the convolution operation circuit 4 starts the convolution operation when the instruction command C4 is stored in the instruction queue 45. As shown in FIG. 19, since the first read semaphore S1R is "0", the convolution calculation circuit 4 waits until the first read semaphore S1R becomes "1" or more (Wait in the decode state ST2). When the first read semaphore S1R becomes "1" by the V operation by the DMAC 3, the convolution calculation circuit 4 starts the convolution calculation (convolution calculation 1). When starting the convolution calculation, the convolution calculation circuit 4 performs a P operation on the first read semaphore S1R. After the convolution calculation is completed, the convolution calculation circuit 4 performs a V operation on the first write semaphore S1W.
- when the DMAC 3 is to start the DMA transfer described as "DMA transfer 3" in FIG. 19, the first write semaphore S1W is "0", so the DMAC 3 waits until the first write semaphore S1W becomes "1" or more (Wait in the decode state ST2). When the first write semaphore S1W becomes "1" or more by the V operation by the convolution calculation circuit 4, the DMAC 3 starts the DMA transfer.
- the DMAC 3 and the convolution calculation circuit 4 can prevent access conflicts with respect to the first memory 1 in the first data flow F1. Further, the DMAC 3 and the convolution calculation circuit 4 can operate independently and in parallel while synchronizing the data transfer in the first data flow F1 by using the semaphore S1.
- FIG. 20 is a timing chart of the second data flow F2.
- the second write semaphore S2W is a semaphore that limits writing to the second memory 2 by the convolution arithmetic circuit 4 in the second data flow F2.
- the second write semaphore S2W indicates the number of memory areas in the second memory 2 that can store data of a predetermined size, such as the output data f, from which the data has already been read and to which other data can be written. When the second write semaphore S2W is "0", the convolution calculation circuit 4 cannot write to the second memory 2 in the second data flow F2, and waits until the second write semaphore S2W becomes "1" or more.
- the second read semaphore S2R is a semaphore that limits reading from the second memory 2 by the quantization calculation circuit 5 in the second data flow F2.
- the second read semaphore S2R indicates the number of memory areas in the second memory 2 that can store data of a predetermined size, such as output data f, in which the data has been written and can be read.
- when the second read semaphore S2R is "0", the quantization calculation circuit 5 cannot read from the second memory 2 in the second data flow F2, and waits until the second read semaphore S2R becomes "1" or more.
- the convolution calculation circuit 4 performs a P operation on the second write semaphore S2W when starting the convolution calculation.
- the convolution calculation circuit 4 performs a V operation on the second read semaphore S2R after the convolution calculation is completed.
- the quantization operation circuit 5 starts the quantization operation when the instruction command C5 is stored in the instruction queue 55. As shown in FIG. 20, since the second read semaphore S2R is "0", the quantization calculation circuit 5 waits until the second read semaphore S2R becomes "1" or more (Wait in the decode state ST2). When the second read semaphore S2R becomes "1" by the V operation by the convolution calculation circuit 4, the quantization calculation circuit 5 starts the quantization calculation (quantization calculation 1). When starting the quantization calculation, the quantization calculation circuit 5 performs a P operation on the second read semaphore S2R. After the quantization calculation is completed, the quantization calculation circuit 5 performs a V operation on the second write semaphore S2W.
- when the quantization operation circuit 5 is to start the quantization operation described as "quantization operation 2" in FIG. 20, the second read semaphore S2R is "0", so the quantization operation circuit 5 waits until the second read semaphore S2R becomes "1" or more (Wait in the decode state ST2). When the second read semaphore S2R becomes "1" or more by the V operation by the convolution calculation circuit 4, the quantization calculation circuit 5 starts the quantization calculation.
- the convolution calculation circuit 4 and the quantization calculation circuit 5 can prevent access conflicts with respect to the second memory 2 in the second data flow F2. Further, the convolution calculation circuit 4 and the quantization calculation circuit 5 can operate independently in parallel while synchronizing the data transfer in the second data flow F2 by using the semaphore S2.
- the third write semaphore S3W is a semaphore that limits writing to the first memory 1 by the quantization calculation circuit 5 in the third data flow F3.
- the third write semaphore S3W indicates the number of memory areas in the first memory 1 that can store data of a predetermined size, such as the quantization operation output data of the quantization operation circuit 5, from which the data has already been read and to which other data can be written.
- when the third write semaphore S3W is "0", the quantization calculation circuit 5 cannot write to the first memory 1 in the third data flow F3, and waits until the third write semaphore S3W becomes "1" or more.
- the third read semaphore S3R is a semaphore that limits reading from the first memory 1 by the convolution arithmetic circuit 4 in the third data flow F3.
- the third read semaphore S3R indicates the number of memory areas in the first memory 1 that can store data of a predetermined size, such as the quantization operation output data of the quantization operation circuit 5, to which the data has been written and from which it can be read.
- when the third read semaphore S3R is "0", the convolution arithmetic circuit 4 cannot read from the first memory 1 in the third data flow F3, and waits until the third read semaphore S3R becomes "1" or more.
- the quantization calculation circuit 5 and the convolution calculation circuit 4 can prevent access conflicts with respect to the first memory 1 in the third data flow F3. Further, the quantization calculation circuit 5 and the convolution calculation circuit 4 can operate independently and in parallel while synchronizing the data transfer in the third data flow F3 by using the semaphore S3.
- the first memory 1 is shared by the first data flow F1 and the third data flow F3.
- the NN circuit 100 can distinguish between the first data flow F1 and the third data flow F3 and synchronize the data transfer.
- when performing the convolution operation, the convolution operation circuit 4 reads from the first memory 1 and writes to the second memory 2. That is, the convolution calculation circuit 4 is a Consumer in the first data flow F1 and a Producer in the second data flow F2. Therefore, when starting the convolution calculation, the convolution calculation circuit 4 performs a P operation on the first read semaphore S1R (see FIG. 19) and a P operation on the second write semaphore S2W (see FIG. 20). After the convolution calculation is completed, the convolution calculation circuit 4 performs a V operation on the first write semaphore S1W (see FIG. 19) and a V operation on the second read semaphore S2R (see FIG. 20).
- when starting the convolution calculation, the convolution calculation circuit 4 waits until the first read semaphore S1R becomes "1" or more and the second write semaphore S2W becomes "1" or more (Wait in the decode state ST2).
- when performing the quantization calculation, the quantization calculation circuit 5 reads from the second memory 2 and writes to the first memory 1. That is, the quantization calculation circuit 5 is a Consumer in the second data flow F2 and a Producer in the third data flow F3. Therefore, when starting the quantization calculation, the quantization calculation circuit 5 performs a P operation on the second read semaphore S2R and a P operation on the third write semaphore S3W. After the quantization calculation is completed, the quantization calculation circuit 5 performs a V operation on the second write semaphore S2W and a V operation on the third read semaphore S3R.
- when starting the quantization calculation, the quantization calculation circuit 5 waits until the second read semaphore S2R becomes "1" or more and the third write semaphore S3W becomes "1" or more (Wait in the decode state ST2).
- the input data read from the first memory 1 by the convolution calculation circuit 4 may be the data written by the quantization calculation circuit 5 in the third data flow.
- the convolution calculation circuit 4 is a Consumer in the third data flow F3 and a Producer in the second data flow F2. Therefore, when the convolution calculation circuit 4 starts the convolution calculation, the convolution calculation circuit 4 performs a P operation on the third read semaphore S3R and a P operation on the second write semaphore S2W. After the convolution calculation is completed, the convolution calculation circuit 4 performs a V operation on the third write semaphore S3W and a V operation on the second read semaphore S2R.
- when starting the convolution calculation, the convolution calculation circuit 4 waits until the third read semaphore S3R becomes "1" or more and the second write semaphore S2W becomes "1" or more (Wait in the decode state ST2).
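The pattern above (P on the read semaphore of the input flow and the write semaphore of the output flow at the start; V on the write semaphore of the input flow and the read semaphore of the output flow at the end) can be sketched as follows. This is a hedged model using plain dicts as semaphore counters; all names are hypothetical.

```python
# Sketch of the producer/consumer handshake performed by a circuit that is a
# Consumer in one data flow and a Producer in the next.
def start_operation(read_sem, write_sem):
    """Start only when the input is readable and an output area is writable;
    then perform both P operations atomically (in this model)."""
    if read_sem["count"] >= 1 and write_sem["count"] >= 1:
        read_sem["count"] -= 1    # P on the read semaphore of the input flow
        write_sem["count"] -= 1   # P on the write semaphore of the output flow
        return True
    return False                  # Not ready: Wait in the decode state ST2

def finish_operation(write_back_sem, read_next_sem):
    write_back_sem["count"] += 1  # V: the input area may be overwritten again
    read_next_sem["count"] += 1   # V: the output data is ready for the Consumer

# Convolution circuit 4: Consumer of F1 (S1R/S1W), Producer of F2 (S2W/S2R).
s1r, s1w = {"count": 1}, {"count": 0}
s2w, s2r = {"count": 1}, {"count": 0}
ok = start_operation(s1r, s2w)    # P on S1R and S2W at the start
finish_operation(s1w, s2r)        # V on S1W and S2R at the end
```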
- FIG. 21 is a diagram illustrating a convolution operation execution command.
- the convolution operation execution command is one of the instruction commands C4 for the convolution operation circuit 4.
- the convolution operation execution instruction has an instruction field IF in which instructions for the convolution operation circuit 4 are stored, and a semaphore operation field SF in which operations for the semaphore S and the like are stored.
- the instruction field IF and the semaphore operation field SF are contained in one instruction as a convolution operation execution instruction.
- the instruction field IF of the convolution operation execution instruction is a field in which the instruction for the convolution operation circuit 4 is stored.
- in the instruction field IF, a command for causing the multiplier 42 and the accumulator circuit 43 to perform a convolution operation, a command for controlling the clear signal of the accumulator circuit 43, the sizes of the input vector A and the weight matrix W, memory addresses, and the like are stored.
- the semaphore operation field SF of the convolution operation execution instruction stores operations for the semaphore S related to the instruction stored in the instruction field IF.
- the convolution calculation circuit 4 is a Consumer that receives and consumes data in the first data flow F1 and the third data flow F3, and a Producer that transmits produced data in the second data flow F2. The related semaphores S are therefore the first semaphore S1, the second semaphore S2, and the third semaphore S3. Accordingly, as shown in FIG. 21, the semaphore operation field SF of the convolution operation execution instruction includes operation fields for the first semaphore S1, the second semaphore S2, and the third semaphore S3.
- the semaphore operation field SF is provided with a P operation field and a V operation field for each semaphore. As shown in FIG. 21, the semaphore operation field SF of the convolution operation execution instruction includes six operation fields. Each operation field of the semaphore operation field SF is 1 bit. Each operation field of the semaphore operation field SF may have a plurality of bits.
- for the first semaphore S1 and the third semaphore S3, which control the first data flow F1 and the third data flow F3 in which the convolution calculation circuit 4 is a Consumer, a P operation field for the read semaphores (S1R, S3R) and a V operation field for the write semaphores (S1W, S3W) are provided.
- the second semaphore S2 for the second data flow F2 in which the convolution calculation circuit 4 serves as a producer is provided with a P operation field for the write semaphore (S2W) and a V operation field for the read semaphore (S2R).
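The six 1-bit operation fields can be illustrated with a small encoder/decoder. The bit ordering below is a pure assumption; the patent only states that the SF contains six 1-bit operation fields (P/V for each of S1, S2, S3 as described above).

```python
# Hypothetical bit layout for the 6-bit semaphore operation field SF of the
# convolution operation execution instruction (field order assumed).
SF_BITS = ["P_S1R", "V_S1W", "P_S3R", "V_S3W", "P_S2W", "V_S2R"]

def encode_sf(**ops):
    """Set the named operation fields to '1'; all others default to '0'."""
    sf = 0
    for i, name in enumerate(SF_BITS):
        if ops.get(name, 0):
            sf |= 1 << i
    return sf

def decode_sf(sf):
    """Return the names of the operation fields set to '1'."""
    return [name for i, name in enumerate(SF_BITS) if sf & (1 << i)]

# Instruction 1 of FIG. 22: the P operation fields for the first read
# semaphore S1R and the second write semaphore S2W are set to "1".
sf1 = encode_sf(P_S1R=1, P_S2W=1)
```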
- FIG. 22 is a diagram showing a specific example of the convolution operation instruction.
- the specific example shown in FIG. 22 is composed of four convolution operation instructions (hereinafter referred to as "instruction 1" to "instruction 4"). The four convolution operation instructions divide the input data a (x + i, y + j, co) stored in the first memory 1 into four parts and cause the convolution calculation circuit 4 to perform the convolution operation in four passes.
- the state controller 44 of the convolution operation circuit 4 transitions to the decode state ST2 and decodes the instruction 1 stored first among the four instructions (instruction 1 to instruction 4) stored in the instruction queue 45.
- when a P operation field is set to "1", the state controller 44 reads the semaphore S corresponding to that P operation field from the controller 6 via the internal bus IB, and determines whether or not the execution condition is satisfied.
- the execution condition is that all the semaphores S corresponding to the P operation fields set to "1" are "1" or more.
- the P operation field for the first read semaphore S1R and the P operation field for the second write semaphore S2W are set to "1". Therefore, the state controller 44 reads out the first read semaphore S1R and the second write semaphore S2W, and determines whether or not the execution conditions are satisfied.
- the state controller 44 waits until the semaphore S corresponding to the P operation field set to "1" is updated and the execution condition is satisfied.
- in the case of instruction 1, if the first read semaphore S1R is not "1" or more or the second write semaphore S2W is not "1" or more (Not Ready), the state controller 44 waits until the semaphore S is updated and the execution condition is satisfied.
- the state controller 44 transitions to the execution state ST3 and executes the convolution operation based on the instruction field IF if the execution condition is satisfied.
- in the case of instruction 1, if the first read semaphore S1R is "1" or more and the second write semaphore S2W is "1" or more (Ready), the state controller 44 transitions to the execution state ST3 and performs the convolution operation based on the instruction field IF.
- When a P operation field is set to "1", the state controller 44 performs a P operation on each corresponding semaphore S before executing the convolution operation. For instruction 1, the state controller 44 performs P operations on the first read semaphore S1R and the second write semaphore S2W before performing the convolution operation.
- After executing instruction 1, the state controller 44 transitions to the decoding state ST2 and decodes instruction 2. In instruction 2, none of the semaphore operation fields SF is set to "1". The state controller 44 therefore transitions to the execution state ST3 without checking or updating any semaphore S and performs the convolution operation based on the instruction field IF.
- The state controller 44 then transitions to the decoding state ST2 and decodes instruction 3.
- In instruction 3, none of the semaphore operation fields SF is set to "1". The state controller 44 therefore transitions to the execution state ST3 without checking or updating any semaphore S and performs the convolution operation based on the instruction field IF.
- The state controller 44 then transitions to the decoding state ST2 and decodes instruction 4.
- In instruction 4, none of the P operation fields is set to "1". The state controller 44 therefore transitions to the execution state ST3 without checking the semaphores S and performs the convolution operation based on the instruction field IF.
- When a V operation field is set to "1", the state controller 44 performs a V operation on each corresponding semaphore S after the convolution operation completes.
- In instruction 4, the V operation field for the first write semaphore S1W and the V operation field for the second read semaphore S2R are set to "1". The state controller 44 therefore performs V operations on the first write semaphore S1W and the second read semaphore S2R after the convolution operation of instruction 4 completes.
- The state controller 44 then transitions to the idle state ST1, ending the execution of the series of convolution operation instructions composed of the four instructions.
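The four-instruction sequence above can be modeled as a small simulation (a hypothetical Python sketch; the dictionary encoding of instructions and the semaphore names are illustrative assumptions, and where the hardware would stall in the Not Ready case this model simply asserts):

```python
def run_instruction(sem, instr):
    # Decode (ST2): for every semaphore named in a P operation field,
    # require a value >= 1, then decrement it (P operation).
    for s in instr.get("p", []):
        assert sem[s] >= 1, f"Not Ready: would wait on {s}"
        sem[s] -= 1
    # Execute (ST3): the convolution on one quarter of the input data
    # would run here, driven by the instruction field IF.
    # After completion, increment every semaphore named in a V field.
    for s in instr.get("v", []):
        sem[s] += 1

sem = {"S1R": 1, "S1W": 0, "S2R": 0, "S2W": 1}
program = [
    {"p": ["S1R", "S2W"]},  # instruction 1: P on S1R and S2W
    {},                     # instruction 2: no semaphore fields
    {},                     # instruction 3: no semaphore fields
    {"v": ["S1W", "S2R"]},  # instruction 4: V on S1W and S2R
]
for instr in program:
    run_instruction(sem, instr)
```

After the sequence, S1R and S2W have been consumed and S1W and S2R have been released, matching the P operations at the start of instruction 1 and the V operations at the end of instruction 4.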
- When the convolution operation circuit 4 uses as input data the quantization operation output data written into the first memory 1 by the quantization operation circuit 5, the operation fields corresponding to the third semaphore S3 are used instead.
- A convolution operation execution instruction thus commands the convolution operation based on the instruction field IF and the checking and updating of the related semaphores S based on the semaphore operation field SF. Since the instruction field IF and the semaphore operation field SF are contained in one instruction, the number of instruction commands C4 needed to perform a convolution operation can be reduced, and the processing time spent on instruction handling such as decoding can be shortened.
- FIG. 23 is a diagram illustrating a quantization operation execution instruction.
- The quantization operation execution instruction is one of the instruction commands C5 for the quantization operation circuit 5.
- The quantization operation execution instruction has an instruction field IF in which an instruction for the quantization operation circuit 5 is stored and a semaphore operation field SF in which operations on the semaphores S and the like are stored.
- The instruction field IF and the semaphore operation field SF are contained in one instruction as the quantization operation execution instruction.
- The instruction field IF of the quantization operation execution instruction is a field in which the instruction for the quantization operation circuit 5 is stored.
- The instruction field IF stores, for example, a command causing the vector operation circuit 52 and the quantization circuit 53 to perform an operation, the sizes of the output data f and the quantization parameter q, memory addresses, and the like.
- the semaphore operation field SF of the quantization operation execution instruction stores operations for the semaphore S related to the instruction stored in the instruction field IF.
- The quantization operation circuit 5 is a Consumer in the second data flow F2 and a Producer in the third data flow F3, so the related semaphores S are the second semaphore S2 and the third semaphore S3. As shown in FIG. 23, the semaphore operation field SF of the quantization operation execution instruction therefore includes operation fields for the second semaphore S2 and the third semaphore S3.
- the second semaphore S2 for the second data flow F2 in which the quantization calculation circuit 5 is a Consumer is provided with a P operation field for the read semaphore (S2R) and a V operation field for the write semaphore (S2W).
- the third semaphore S3 for the third data flow F3 in which the quantization calculation circuit 5 serves as a producer is provided with a P operation field for the write semaphore (S3W) and a V operation field for the read semaphore (S3R).
- In response to a quantization operation execution instruction whose P operation fields and V operation fields are set to "1", the state controller 54 of the quantization operation circuit 5 checks and updates the semaphores S in the same manner as the state controller 44 does for a convolution operation execution instruction.
- FIG. 24 is a diagram illustrating a DMA transfer execution command.
- The DMA transfer execution instruction is one of the instruction commands C3 for the DMAC3.
- the DMA transfer execution instruction has an instruction field IF in which an instruction for DMAC3 is stored and a semaphore operation field SF in which operations for the semaphore S and the like are stored.
- the instruction field IF and the semaphore operation field SF are contained in one instruction as a DMA transfer execution instruction.
- the instruction field IF of the DMA transfer execution instruction is a field in which the instruction for DMAC3 is stored.
- The instruction field IF stores, for example, the memory addresses of the transfer source and transfer destination, the transfer data size, and the like.
- the semaphore operation field SF of the DMA transfer execution instruction stores operations for the semaphore S related to the instruction stored in the instruction field IF.
- The DMAC3 is a Producer in the first data flow F1, so the related semaphore S is the first semaphore S1. As shown in FIG. 24, the semaphore operation field SF of the DMA transfer execution instruction therefore includes an operation field for the first semaphore S1.
- the first semaphore S1 for the first data flow F1 in which the DMAC3 is a producer is provided with a P operation field for the write semaphore (S1W) and a V operation field for the read semaphore (S1R).
- In response to a DMA transfer execution instruction whose P operation field and V operation field are set to "1", the state controller 32 of the DMAC3 checks and updates the semaphore S in the same manner as the state controller 44 does for a convolution operation execution instruction.
- This allows the NN circuit 100, which can be embedded in devices such as IoT devices, to operate with high performance.
- In each of the convolution operation execution instruction, the quantization operation execution instruction, and the DMA transfer execution instruction, the instruction field IF and the semaphore operation field SF are contained in one instruction. This reduces the number of instruction commands needed to perform convolution operations and the like, and shortens the processing time spent on instruction handling such as decoding.
- (Modification 1) The above embodiment shows an example of an instruction in which a plurality of semaphore operation fields SF for one instruction field IF are contained in one instruction, but the form of the instruction is not limited to this.
- the instruction may be in a mode in which a plurality of instruction field IFs and a plurality of semaphore operation fields SF associated with each instruction field IF are contained in one instruction.
- the method of storing the instruction field IF and the semaphore operation field SF in one instruction is not limited to the configuration of the above embodiment.
- The instruction field IF and the semaphore operation field SF may instead be divided and stored across a plurality of instructions. The same effect can be obtained as long as each instruction field IF is associated with its corresponding semaphore operation field SF in the instruction.
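One way to picture packing an instruction field IF and a semaphore operation field SF into a single instruction word is the bit-field sketch below. The 32-bit layout, the bit positions, and the flag names are purely hypothetical illustrations; the patent does not specify a binary encoding:

```python
# Hypothetical layout: low 24 bits carry the instruction field IF payload,
# high 8 bits carry the semaphore operation field SF (one bit per P/V flag).
SF_BITS = {"S1W_P": 0, "S1R_V": 1, "S2W_P": 2, "S2R_V": 3}

def pack(if_payload, sf_flags):
    """Pack an IF payload and a set of SF flag names into one word."""
    sf = 0
    for name in sf_flags:
        sf |= 1 << SF_BITS[name]
    return (sf << 24) | (if_payload & 0xFFFFFF)

def unpack(word):
    """Split one instruction word back into (IF payload, SF bits)."""
    return word & 0xFFFFFF, (word >> 24) & 0xFF

word = pack(0x1234, ["S2W_P", "S2R_V"])
payload, sf = unpack(word)
assert payload == 0x1234
assert sf == (1 << 2) | (1 << 3)
```

Because the decoder recovers both fields from one word, one fetch suffices for the operation and its semaphore handling, which is the effect the embodiment attributes to the single-instruction format.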
- the first memory 1 and the second memory 2 are different memories, but the modes of the first memory 1 and the second memory 2 are not limited to this.
- the first memory 1 and the second memory 2 may be, for example, a first memory area and a second memory area in the same memory.
- the semaphore S is provided for the first data flow F1, the second data flow F2, and the third data flow F3, but the aspect of the semaphore S is not limited to this.
- The semaphore S may be provided, for example, for a data flow in which the DMAC 3 writes the weight w to the weight memory 41 and the multiplier 42 reads the weight w.
- The semaphore S may also be provided, for example, for a data flow in which the DMAC 3 writes the quantization parameter q to the quantization parameter memory 51 and the quantization circuit 53 reads the quantization parameter q.
- The data input to the NN circuit 100 described in the above embodiment is not limited to a single format and can be composed of still images, moving images, audio, text, numerical values, and combinations thereof.
- The data input to the NN circuit 100 is also not limited to measurement results from physical-quantity measuring instruments that may be mounted on the edge device provided with the NN circuit 100, such as optical sensors, thermometers, Global Positioning System (GPS) instruments, angular velocity measuring instruments, and anemometers.
- It may be combined with peripheral information received from peripheral devices via wired or wireless communication, such as base station information, information on vehicles and ships, weather information, and congestion status information, as well as different information such as financial information and personal information.
- The edge device provided with the NN circuit 100 is assumed to be a battery-driven device such as a communication device (e.g., a mobile phone), a smart device such as a personal computer, a digital camera, a game device, or a mobile device such as a robot product, but is not limited to these. Unprecedented effects can also be obtained by using the circuit in products subject to a limit on the peak power that can be supplied, such as by Power on Ethernet (PoE), in products that must reduce heat generation, or in products strongly required to operate for long periods. For example, applying it to in-vehicle cameras mounted on vehicles and ships, or to surveillance cameras installed in public facilities and on streets, not only enables long-duration recording but also contributes to weight reduction and higher durability. Similar effects can be obtained by applying it to display devices such as televisions and monitors, medical devices such as medical cameras and surgical robots, and work robots used at manufacturing and construction sites.
- A part or all of the NN circuit 100 may be realized using one or more processors.
- the NN circuit 100 may realize a part or all of the input layer or the output layer by software processing by a processor.
- a part of the input layer or the output layer realized by software processing is, for example, data normalization or conversion. This makes it possible to support various input formats or output formats.
- the software executed by the processor may be rewritable by using a communication means or an external medium.
- The NN circuit 100 may realize a part of the processing in the CNN 200 by combining it with a Graphics Processing Unit (GPU) or the like on the cloud.
- By performing further processing on the cloud in addition to the processing performed on the edge device provided with the NN circuit 100, or by performing processing on the edge device in addition to the processing on the cloud, the NN circuit 100 can realize more complicated processing with fewer resources. With such a configuration, the NN circuit 100 can also reduce the amount of communication between the edge device and the cloud through processing distribution.
- The calculation performed by the NN circuit 100 is at least a part of the trained CNN 200, but the target of the calculation performed by the NN circuit 100 is not limited to this.
- the operation performed by the NN circuit 100 may be at least a part of a trained neural network that repeats two types of operations, such as a convolution operation and a quantization operation.
- the present invention can be applied to the calculation of neural networks.
Abstract
Description
A method for controlling a neural network circuit according to a first aspect of the present invention is a method for controlling a neural network circuit comprising: a first memory that stores input data; a convolution operation circuit that performs a convolution operation on the input data stored in the first memory; a second memory that stores convolution operation output data of the convolution operation circuit; a quantization operation circuit that performs a quantization operation on the convolution operation output data stored in the second memory; a second write semaphore that restricts writing to the second memory by the convolution operation circuit; a second read semaphore that restricts reading from the second memory by the quantization operation circuit; a third write semaphore that restricts writing to the first memory by the quantization operation circuit; and a third read semaphore that restricts reading from the first memory by the convolution operation circuit, wherein the convolution operation circuit is caused to perform the convolution operation based on the third read semaphore and the second write semaphore.
A first embodiment of the present invention will be described with reference to FIGS. 1 to 24.
FIG. 1 shows a convolutional neural network 200 (hereinafter, "CNN 200"). The operations performed by the neural network circuit 100 according to the first embodiment (hereinafter, "NN circuit 100") are at least a part of a trained CNN 200 used at inference time.
The CNN 200 is a multilayer network including convolution layers 210 that perform convolution operations, quantization operation layers 220 that perform quantization operations, and an output layer 230. In at least a part of the CNN 200, the convolution layers 210 and the quantization operation layers 220 are connected alternately. The CNN 200 is a model widely used for image recognition and video recognition. The CNN 200 may further include layers with other functions, such as fully connected layers.
The convolution layer 210 performs a convolution operation on input data a using weights w. The convolution layer 210 performs a multiply-accumulate operation that takes the input data a and the weights w as inputs.
The NN circuit 100 divides the input data of the convolution operation (Equation 1) of the convolution layer 210 into partial tensors and operates on them. The method of division into partial tensors and the number of divisions are not particularly limited. The partial tensors are formed, for example, by dividing the input data a(x+i, y+j, c) into a(x+i, y+j, co). The NN circuit 100 can also operate on the input data of the convolution operation (Equation 1) of the convolution layer 210 without dividing it.
The NN circuit 100 unrolls the input data a and the weights w in the convolution operation of the convolution layer 210 and performs the convolution operation.
The divided input data a(x+i, y+j, co) is unrolled into vector data having Bc elements. The elements of the divided input data a are indexed by ci (0 ≤ ci < Bc). In the following description, the divided input data a unrolled into vector data for each i, j is also referred to as the "input vector A". The input vector A has as its elements the divided input data a(x+i, y+j, co×Bc) through a(x+i, y+j, co×Bc+(Bc-1)).
FIG. 4 shows the overall configuration of the NN circuit 100 according to the present embodiment.
The NN circuit 100 includes a first memory 1, a second memory 2, a DMA controller 3 (hereinafter also "DMAC 3"), a convolution operation circuit 4, a quantization operation circuit 5, and a controller 6. A feature of the NN circuit 100 is that the convolution operation circuit 4 and the quantization operation circuit 5 form a loop via the first memory 1 and the second memory 2.
FIG. 5 is a timing chart showing an operation example of the NN circuit 100.
The DMAC 3 stores the layer-1 input data a in the first memory 1. The DMAC 3 may divide the layer-1 input data a and transfer it to the first memory 1 in accordance with the order of the convolution operations performed by the convolution operation circuit 4.
FIG. 6 is a timing chart showing another operation example of the NN circuit 100.
The NN circuit 100 may divide the input data a into partial tensors and operate on the partial tensors in a time-division manner. The method of division into partial tensors and the number of divisions are not particularly limited.
FIG. 7 is an internal block diagram of the DMAC 3.
The DMAC 3 has a data transfer circuit 31 and a state controller 32. Since the DMAC 3 has a state controller 32 dedicated to the data transfer circuit 31, it can perform DMA data transfer without requiring an external controller once an instruction command is input.
When an instruction command C3 is input to the instruction queue 33 (Not empty), the control circuit 34 transitions from the idle state ST1 to the decode state ST2.
FIG. 9 is an internal block diagram of the convolution operation circuit 4.
The convolution operation circuit 4 has a weight memory 41, a multiplier 42, an accumulator circuit 43, and a state controller 44. Since the convolution operation circuit 4 has a state controller 44 dedicated to the multiplier 42 and the accumulator circuit 43, it can perform convolution operations without requiring an external controller once an instruction command is input.
The multiplier 42 multiplies the input vector A by the weight matrix W. As described above, the input vector A is vector data having Bc elements in which the divided input data a(x+i, y+j, co) is unrolled for each i, j. The weight matrix W is matrix data having Bc×Bd elements in which the divided weights w(i, j, co, do) are unrolled for each i, j. The multiplier 42 has Bc×Bd multiply-accumulate units 47 and can perform the multiplication of the input vector A and the weight matrix W in parallel.
The multiply-accumulate unit 47 multiplies an element A(ci) of the input vector A by an element W(ci, di) of the weight matrix W. The multiply-accumulate unit 47 also adds its multiplication result to the multiplication result S(ci, di) of another multiply-accumulate unit 47 and outputs the addition result S(ci+1, di). The element A(ci) is a 2-bit unsigned integer (0, 1, 2, 3). The element W(ci, di) is a 1-bit signed integer (0, 1), where the value "0" represents +1 and the value "1" represents -1.
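The behavior of one column of multiply-accumulate units 47 chained through the running sum S can be sketched as follows (a Python model; Bc = 4 is only an example size, and folding the chain into a loop is an illustrative simplification of the parallel hardware):

```python
def mac_column(A, W_col):
    """One column of multiply-accumulate units: each A(ci) is a 2-bit
    unsigned integer, each W(ci, di) is 1 bit where 0 encodes +1 and
    1 encodes -1. Each unit adds its product to the running sum S
    passed down from the previous unit."""
    s = 0
    for a, w_bit in zip(A, W_col):
        assert 0 <= a <= 3 and w_bit in (0, 1)
        s += a * (1 if w_bit == 0 else -1)
    return s

# Bc = 4 example with weights (+1, -1, +1, -1):
assert mac_column([3, 2, 1, 0], [0, 1, 0, 1]) == 3 - 2 + 1 - 0  # = 2
```

Because the multiplications reduce to conditional negation of a 2-bit value, each unit needs no general-purpose multiplier, which is what makes Bc×Bd units practical in parallel.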
The accumulator circuit 43 accumulates the multiply-accumulate result O(di) of the multiplier 42 into the second memory 2. The accumulator circuit 43 has Bd accumulator units 48 and can accumulate Bd multiply-accumulate results O(di) into the second memory 2 in parallel.
The accumulator unit 48 has an adder 48a and a mask unit 48b. The adder 48a adds an element O(di) of the multiply-accumulate result O to a partial sum, stored in the second memory 2, that is an intermediate result of the convolution operation shown in Equation 1. The addition result is 16 bits per element. The addition result is not limited to 16 bits per element and may be, for example, 15 or 17 bits per element.
FIG. 14 is an internal block diagram of the quantization operation circuit 5.
The quantization operation circuit 5 has a quantization parameter memory 51, a vector operation circuit 52, a quantization circuit 53, and a state controller 54. Since the quantization operation circuit 5 has a state controller 54 dedicated to the vector operation circuit 52 and the quantization circuit 53, it can perform quantization operations without requiring an external controller once an instruction command is input.
The vector operation circuit 52 operates on the output data f(x, y, do) stored in the second memory 2. The vector operation circuit 52 has Bd operation units 57 and performs SIMD operations on the output data f(x, y, do) in parallel.
The operation unit 57 has, for example, an ALU 57a, a first selector 57b, a second selector 57c, a register 57d, and a shifter 57e. The operation unit 57 may further have other arithmetic units found in known general-purpose SIMD operation circuits.
The operation unit 57 can control the second selector 57c according to the comparison result of the ALU 57a to select the larger of the data stored in the register 57d and the element f(di). By inputting, via the first selector 57b, the minimum value that the element f(di) can take into the ALU 57a, the operation unit 57 can initialize the comparison target to that minimum value. In the present embodiment, the element f(di) is a 16-bit signed integer, so the minimum value that the element f(di) can take is "0x8000". The vector operation circuit 52 can perform the MAX pooling operation of Equation 3 by repeating the above operations with the Bd operation units 57. In the MAX pooling operation, the shifter 57e does not shift the output of the second selector 57c.
The quantization unit 58 quantizes the element in(di) of the output data of the vector operation circuit 52. The quantization unit 58 has a comparator 58a and an encoder 58b. The quantization unit 58 performs the operation of the quantization layer 224 in the quantization operation layer 220 (Equation 6) on the output data (16 bits/element) of the vector operation circuit 52. The quantization unit 58 reads the necessary quantization parameters q(th0, th1, th2) from the quantization parameter memory 51, and the comparator 58a compares the input in(di) with the quantization parameters q. The quantization unit 58 quantizes the comparison result of the comparator 58a into 2 bits/element with the encoder 58b. Since α(c) and β(c) in Equation 4 are parameters that differ for each variable c, the quantization parameters q(th0, th1, th2), which reflect α(c) and β(c), differ for each in(di).
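The threshold comparison in the quantization unit 58 can be sketched as follows. The exact encoder mapping is an assumption: the text states only that the comparator 58a compares in(di) against th0, th1, and th2 and that the encoder 58b produces a 2-bit code, so counting the thresholds that the input reaches is one natural realization:

```python
def quantize(x, th0, th1, th2):
    """Comparator 58a compares the 16-bit input with three thresholds;
    encoder 58b emits a 2-bit code (0..3). Here the code is the number
    of thresholds the input reaches (an assumed encoding)."""
    return sum(x >= th for th in (th0, th1, th2))

assert quantize(-5, 0, 10, 20) == 0
assert quantize(12, 0, 10, 20) == 2
assert quantize(99, 0, 10, 20) == 3
```

Because th0, th1, and th2 come from the quantization parameter memory 51 and reflect α(c) and β(c), a different threshold triple would be loaded per channel in(di).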
The controller 6 transfers instruction commands, transferred from an external host CPU, to the instruction queues of the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5. The controller 6 may have an instruction memory that stores the instruction commands for each circuit.
FIG. 18 illustrates control of the NN circuit 100 by the semaphores S.
The semaphores S comprise a first semaphore S1, a second semaphore S2, and a third semaphore S3. A semaphore S is decremented by a P operation and incremented by a V operation. P operations and V operations by the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5 update the semaphores S held by the controller 6 via the internal bus IB.
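The P/V semantics described here can be sketched with a software semaphore (a minimal Python model; the real semaphores live in the controller 6 and are updated over the internal bus IB, and the hardware circuits stall rather than block on a condition variable):

```python
import threading

class Semaphore:
    """Counting semaphore: a P operation decrements the value, waiting
    while it is 0 (Not Ready); a V operation increments it and wakes
    any waiting circuit."""
    def __init__(self, value=0):
        self.value = value
        self.cond = threading.Condition()

    def p(self):
        with self.cond:
            while self.value == 0:   # Not Ready: wait for an update
                self.cond.wait()
            self.value -= 1

    def v(self):
        with self.cond:
            self.value += 1
            self.cond.notify_all()
```

With one such object per read/write semaphore (S1R, S1W, S2R, S2W, S3R, S3W), the producer's V on a read semaphore is exactly what releases the consumer's pending P.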
FIG. 19 is a timing chart of the first data flow F1.
The first write semaphore S1W is a semaphore that restricts writing to the first memory 1 by the DMAC 3 in the first data flow F1. The first write semaphore S1W indicates, among the memory areas in the first memory 1 that can store data of a predetermined size such as the input vector A, the number of memory areas whose data has already been read and into which other data can be written. When the first write semaphore S1W is "0", the DMAC 3 cannot write to the first memory 1 in the first data flow F1 and must wait until the first write semaphore S1W becomes "1" or greater.
FIG. 20 is a timing chart of the second data flow F2.
The second write semaphore S2W is a semaphore that restricts writing to the second memory 2 by the convolution operation circuit 4 in the second data flow F2. The second write semaphore S2W indicates, among the memory areas in the second memory 2 that can store data of a predetermined size such as the output data f, the number of memory areas whose data has already been read and into which other data can be written. When the second write semaphore S2W is "0", the convolution operation circuit 4 cannot write to the second memory 2 in the second data flow F2 and must wait until the second write semaphore S2W becomes "1" or greater.
The third write semaphore S3W is a semaphore that restricts writing to the first memory 1 by the quantization operation circuit 5 in the third data flow F3. The third write semaphore S3W indicates, among the memory areas in the first memory 1 that can store data of a predetermined size such as the quantization operation output data of the quantization operation circuit 5, the number of memory areas whose data has already been read and into which other data can be written. When the third write semaphore S3W is "0", the quantization operation circuit 5 cannot write to the first memory 1 in the third data flow F3 and must wait until the third write semaphore S3W becomes "1" or greater.
When performing a convolution operation, the convolution operation circuit 4 reads from the first memory 1 and writes to the second memory 2. That is, the convolution operation circuit 4 is a Consumer in the first data flow F1 and a Producer in the second data flow F2. Therefore, when starting a convolution operation, the convolution operation circuit 4 performs a P operation on the first read semaphore S1R (see FIG. 19) and a P operation on the second write semaphore S2W (see FIG. 20). After completing the convolution operation, the convolution operation circuit 4 performs a V operation on the first write semaphore S1W (see FIG. 19) and a V operation on the second read semaphore S2R (see FIG. 20).
When performing a quantization operation, the quantization operation circuit 5 reads from the second memory 2 and writes to the first memory 1. That is, the quantization operation circuit 5 is a Consumer in the second data flow F2 and a Producer in the third data flow F3. Therefore, when starting a quantization operation, the quantization operation circuit 5 performs a P operation on the second read semaphore S2R and a P operation on the third write semaphore S3W. After completing the quantization operation, the quantization operation circuit 5 performs a V operation on the second write semaphore S2W and a V operation on the third read semaphore S3R.
The input data that the convolution operation circuit 4 reads from the first memory 1 may be data written by the quantization operation circuit 5 in the third data flow F3. In this case, the convolution operation circuit 4 is a Consumer in the third data flow F3 and a Producer in the second data flow F2. Therefore, when starting a convolution operation, the convolution operation circuit 4 performs a P operation on the third read semaphore S3R and a P operation on the second write semaphore S2W. After completing the convolution operation, the convolution operation circuit 4 performs a V operation on the third write semaphore S3W and a V operation on the second read semaphore S2R.
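The convolution and quantization circuits alternating as Consumer and Producer around the loop through memories 1 and 2 can be simulated as follows (a Python sketch; the single-buffer initial semaphore values are an illustrative assumption, and where the hardware would stall this model asserts):

```python
def conv_step(sem):
    # Consumer in F3, Producer in F2: P on S3R and S2W before the operation.
    for s in ("S3R", "S2W"):
        assert sem[s] >= 1, f"would wait on {s}"
        sem[s] -= 1
    # ... convolution on data the quantization circuit wrote to memory 1 ...
    for s in ("S3W", "S2R"):   # V on S3W and S2R after completion
        sem[s] += 1

def quant_step(sem):
    # Consumer in F2, Producer in F3: P on S2R and S3W, then V on S2W and S3R.
    for s in ("S2R", "S3W"):
        assert sem[s] >= 1, f"would wait on {s}"
        sem[s] -= 1
    for s in ("S2W", "S3R"):
        sem[s] += 1

# One buffer available in each direction of the memory loop.
sem = {"S2R": 0, "S2W": 1, "S3R": 1, "S3W": 0}
for _ in range(3):             # three alternating layer operations
    conv_step(sem)
    quant_step(sem)
assert sem == {"S2R": 0, "S2W": 1, "S3R": 1, "S3W": 0}
```

Each convolution releases exactly the tokens the next quantization consumes and vice versa, so the pair can ping-pong indefinitely without an external scheduler.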
FIG. 21 illustrates the convolution operation execution instruction.
The convolution operation execution instruction is one of the instruction commands C4 for the convolution operation circuit 4. The convolution operation execution instruction has an instruction field IF in which an instruction for the convolution operation circuit 4 is stored and a semaphore operation field SF in which operations on the semaphores S and the like are stored. The instruction field IF and the semaphore operation field SF are contained in one instruction as the convolution operation execution instruction.
The specific example shown in FIG. 22 consists of four convolution operation instructions (hereinafter, "instruction 1" to "instruction 4"); the four convolution operation instructions divide the input data a(x+i, y+j, co) stored in the first memory 1 into four parts and cause the convolution operation circuit 4 to perform the convolution operation on each part.
FIG. 23 illustrates the quantization operation execution instruction.
The quantization operation execution instruction is one of the instruction commands C5 for the quantization operation circuit 5. The quantization operation execution instruction has an instruction field IF in which an instruction for the quantization operation circuit 5 is stored and a semaphore operation field SF in which operations on the semaphores S and the like are stored. The instruction field IF and the semaphore operation field SF are contained in one instruction as the quantization operation execution instruction.
FIG. 24 illustrates the DMA transfer execution instruction.
The DMA transfer execution instruction is one of the instruction commands C3 for the DMAC 3. The DMA transfer execution instruction has an instruction field IF in which an instruction for the DMAC 3 is stored and a semaphore operation field SF in which operations on the semaphores S and the like are stored. The instruction field IF and the semaphore operation field SF are contained in one instruction as the DMA transfer execution instruction.
In the above embodiment, an example was shown of an instruction in which a plurality of semaphore operation fields SF for one instruction field IF are contained in one instruction, but the form of the instruction is not limited to this. The instruction may take a form in which a plurality of instruction fields IF and a plurality of semaphore operation fields SF associated with each instruction field IF are contained in one instruction. The method of placing the instruction field IF and the semaphore operation field SF in one instruction is not limited to the configuration of the above embodiment. Furthermore, the instruction field IF and the semaphore operation field SF may be divided and stored across a plurality of instructions. The same effect can be obtained as long as each instruction field IF is associated with its corresponding semaphore operation field SF in the instruction.
In the above embodiment, the first memory 1 and the second memory 2 were separate memories, but the form of the first memory 1 and the second memory 2 is not limited to this. The first memory 1 and the second memory 2 may be, for example, a first memory area and a second memory area in the same memory.
In the above embodiment, the semaphores S were provided for the first data flow F1, the second data flow F2, and the third data flow F3, but the form of the semaphores S is not limited to this. A semaphore S may be provided, for example, for a data flow in which the DMAC 3 writes the weights w to the weight memory 41 and the multiplier 42 reads the weights w. A semaphore S may also be provided, for example, for a data flow in which the DMAC 3 writes the quantization parameters q to the quantization parameter memory 51 and the quantization circuit 53 reads the quantization parameters q.
For example, the data input to the NN circuit 100 described in the above embodiment is not limited to a single format and can be composed of still images, moving images, audio, text, numerical values, and combinations thereof. The data input to the NN circuit 100 is also not limited to measurement results from physical-quantity measuring instruments that may be mounted on the edge device provided with the NN circuit 100, such as optical sensors, thermometers, Global Positioning System (GPS) instruments, angular velocity measuring instruments, and anemometers. It may be combined with peripheral information received from peripheral devices via wired or wireless communication, such as base station information, information on vehicles and ships, weather information, and congestion status information, as well as different information such as financial information and personal information.
The edge device provided with the NN circuit 100 is assumed to be a battery-driven device such as a communication device (e.g., a mobile phone), a smart device such as a personal computer, a digital camera, a game device, or a mobile device such as a robot product, but is not limited to these. Unprecedented effects can also be obtained by using the circuit in products subject to a limit on the peak power that can be supplied, such as by Power on Ethernet (PoE), in products that must reduce heat generation, or in products strongly required to operate for long periods. For example, applying it to in-vehicle cameras mounted on vehicles and ships, or to surveillance cameras installed in public facilities and on streets, not only enables long-duration recording but also contributes to weight reduction and higher durability. Similar effects can be obtained by applying it to display devices such as televisions and monitors, medical devices such as medical cameras and surgical robots, and work robots used at manufacturing and construction sites.
A part or all of the NN circuit 100 may be realized using one or more processors. For example, the NN circuit 100 may realize a part or all of the input layer or the output layer by software processing on a processor. The part of the input layer or the output layer realized by software processing is, for example, data normalization or conversion. This makes it possible to support various input and output formats. The software executed by the processor may be configured to be rewritable via a communication means or external media.
The NN circuit 100 may realize a part of the processing in the CNN 200 by combining it with a Graphics Processing Unit (GPU) or the like on the cloud. By performing further processing on the cloud in addition to the processing performed on the edge device provided with the NN circuit 100, or by performing processing on the edge device in addition to the processing on the cloud, the NN circuit 100 can realize more complicated processing with fewer resources. With such a configuration, the NN circuit 100 can also reduce the amount of communication between the edge device and the cloud through processing distribution.
The calculation performed by the NN circuit 100 is at least a part of the trained CNN 200, but the target of the calculation performed by the NN circuit 100 is not limited to this. The operation performed by the NN circuit 100 may be at least a part of a trained neural network that repeats two types of operations, such as a convolution operation and a quantization operation.
100 Neural network circuit (NN circuit)
1 First memory
2 Second memory
3 DMA controller (DMAC)
4 Convolution operation circuit
42 Multiplier
43 Accumulator circuit
5 Quantization operation circuit
52 Vector operation circuit
53 Quantization circuit
6 Controller
61 Register
S Semaphore
F1 First data flow
F2 Second data flow
F3 Third data flow
Claims (12)
- 1. A method for controlling a neural network circuit, the neural network circuit comprising: a first memory that stores input data; a convolution operation circuit that performs a convolution operation on the input data stored in the first memory; a second memory that stores convolution operation output data of the convolution operation circuit; a quantization operation circuit that performs a quantization operation on the convolution operation output data stored in the second memory; a second write semaphore that restricts writing to the second memory by the convolution operation circuit; a second read semaphore that restricts reading from the second memory by the quantization operation circuit; a third write semaphore that restricts writing to the first memory by the quantization operation circuit; and a third read semaphore that restricts reading from the first memory by the convolution operation circuit, wherein the convolution operation circuit is caused to perform the convolution operation based on the third read semaphore and the second write semaphore.
- 2. The method for controlling a neural network circuit according to claim 1, comprising a convolution operation execution instruction that commands the convolution operation circuit, in one instruction, to determine whether an execution condition of the convolution operation is satisfied based on the third read semaphore and the second write semaphore and to perform the convolution operation based on the determination.
- 3. The method for controlling a neural network circuit according to claim 2, wherein the convolution operation execution instruction causes the convolution operation circuit to update the third read semaphore and the second write semaphore before performing the convolution operation.
- 4. The method for controlling a neural network circuit according to claim 2 or 3, wherein the convolution operation execution instruction causes the convolution operation circuit to update the third write semaphore and the second read semaphore after performing the convolution operation.
- 5. The method for controlling a neural network circuit according to claim 1, wherein the quantization operation circuit is caused to perform the quantization operation based on the second read semaphore and the third write semaphore.
- 6. The method for controlling a neural network circuit according to claim 5, comprising a quantization operation execution instruction that commands the quantization operation circuit, in one instruction, to determine whether an execution condition of the quantization operation is satisfied based on the second read semaphore and the third write semaphore and to perform the quantization operation based on the determination.
- 7. The method for controlling a neural network circuit according to claim 6, wherein the quantization operation execution instruction causes the quantization operation circuit to update the second read semaphore and the third write semaphore before performing the quantization operation.
- 8. The method for controlling a neural network circuit according to claim 6 or 7, wherein the quantization operation execution instruction causes the quantization operation circuit to update the second write semaphore and the third read semaphore after performing the quantization operation.
- 9. The method for controlling a neural network circuit according to claim 1, wherein the neural network circuit further comprises: a DMA controller that transfers the input data to the first memory; a first write semaphore that restricts writing to the first memory by the DMA controller; and a first read semaphore that restricts reading from the first memory by the convolution operation circuit, and wherein the convolution operation circuit is caused to perform the convolution operation based on the first read semaphore and the second write semaphore.
- 10. The method for controlling a neural network circuit according to claim 9, comprising a convolution operation execution instruction that commands the convolution operation circuit, in one instruction, to determine whether an execution condition of the convolution operation is satisfied based on the first read semaphore and the second write semaphore and to perform the convolution operation based on the determination.
- 11. The method for controlling a neural network circuit according to claim 10, wherein the convolution operation execution instruction causes the convolution operation circuit to update the first read semaphore and the second write semaphore before performing the convolution operation.
- 12. The method for controlling a neural network circuit according to claim 10 or 11, wherein the convolution operation execution instruction causes the convolution operation circuit to update the first write semaphore and the second read semaphore after performing the convolution operation.
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202180027773.7A CN115398447A (zh) | 2020-04-13 | 2021-04-12 | 神经网络电路的控制方法 |
| US17/917,795 US12475362B2 (en) | 2020-04-13 | 2021-04-12 | Method for controlling neural network circuit |
| JP2022515367A JPWO2021210527A5 (ja) | 2021-04-12 | Neural network circuit and method for controlling a neural network circuit |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2020-071933 | 2020-04-13 | ||
| JP2020071933 | 2020-04-13 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021210527A1 true WO2021210527A1 (ja) | 2021-10-21 |
Family
ID=78083802
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2021/015148 Ceased WO2021210527A1 (ja) | 2020-04-13 | 2021-04-12 | ニューラルネットワーク回路の制御方法 |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US12475362B2 (ja) |
| CN (1) | CN115398447A (ja) |
| WO (1) | WO2021210527A1 (ja) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115016948A (zh) * | 2022-08-08 | 2022-09-06 | 阿里巴巴(中国)有限公司 | 一种资源访问方法、装置、电子设备及可读存储介质 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20210152244A (ko) * | 2020-06-08 | 2021-12-15 | 삼성전자주식회사 | 뉴럴 네트워크를 구현하는 장치 및 그 동작 방법 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2010152926A (ja) * | 2001-07-11 | 2010-07-08 | Seiko Epson Corp | データ処理装置、データ入出力装置およびデータ入出力方法 |
| WO2013080289A1 (ja) * | 2011-11-28 | 2013-06-06 | 富士通株式会社 | 信号処理装置及び信号処理方法 |
| JP2013225218A (ja) * | 2012-04-20 | 2013-10-31 | Fuji Electric Co Ltd | 周辺装置アクセスシステム |
| JP2019139747A (ja) * | 2018-02-13 | 2019-08-22 | 北京曠視科技有限公司Beijing Kuangshi Technology Co., Ltd. | 演算装置、演算実行設備及び演算実行方法 |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP6964234B2 (ja) | 2016-11-09 | 2021-11-10 | パナソニックIpマネジメント株式会社 | 情報処理方法、情報処理装置およびプログラム |
| CN107239824A (zh) * | 2016-12-05 | 2017-10-10 | 北京深鉴智能科技有限公司 | 用于实现稀疏卷积神经网络加速器的装置和方法 |
| US12131250B2 (en) * | 2017-09-29 | 2024-10-29 | Intel Corporation | Inner product convolutional neural network accelerator |
| CN107704923B (zh) | 2017-10-19 | 2024-08-20 | 珠海格力电器股份有限公司 | 卷积神经网络运算电路 |
| CN110163334B (zh) * | 2018-02-11 | 2020-10-09 | 上海寒武纪信息科技有限公司 | 集成电路芯片装置及相关产品 |
| CN109685209B (zh) * | 2018-12-29 | 2020-11-06 | 瑞芯微电子股份有限公司 | 一种加快神经网络运算速度的装置和方法 |
| US11977388B2 (en) * | 2019-02-21 | 2024-05-07 | Nvidia Corporation | Quantizing autoencoders in a neural network |
| US11270197B2 (en) * | 2019-03-12 | 2022-03-08 | Nvidia Corp. | Efficient neural network accelerator dataflows |
| DE112020001258T5 (de) * | 2019-03-15 | 2021-12-23 | Intel Corporation | Grafikprozessoren und Grafikverarbeitungseinheiten mit Skalarproduktakkumulationsanweisungen für ein Hybrid-Gleitkommaformat |
-
2021
- 2021-04-12 CN CN202180027773.7A patent/CN115398447A/zh active Pending
- 2021-04-12 WO PCT/JP2021/015148 patent/WO2021210527A1/ja not_active Ceased
- 2021-04-12 US US17/917,795 patent/US12475362B2/en active Active
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2010152926A (ja) * | 2001-07-11 | 2010-07-08 | Seiko Epson Corp | データ処理装置、データ入出力装置およびデータ入出力方法 |
| WO2013080289A1 (ja) * | 2011-11-28 | 2013-06-06 | 富士通株式会社 | 信号処理装置及び信号処理方法 |
| JP2013225218A (ja) * | 2012-04-20 | 2013-10-31 | Fuji Electric Co Ltd | 周辺装置アクセスシステム |
| JP2019139747A (ja) * | 2018-02-13 | 2019-08-22 | 北京曠視科技有限公司Beijing Kuangshi Technology Co., Ltd. | 演算装置、演算実行設備及び演算実行方法 |
Non-Patent Citations (1)
| Title |
|---|
| Usui, Toshinori et al., "Compiler and optimization level estimation method aimed at improving the accuracy of anti-malware technology", Proceedings of CSS2013 Computer Security Symposium 2013 (jointly held: Malware Countermeasure Research Human Resource Development Workshop 2013), vol. 2013, no. 4, 23 October 2013 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115016948A (zh) * | 2022-08-08 | 2022-09-06 | 阿里巴巴(中国)有限公司 | 一种资源访问方法、装置、电子设备及可读存储介质 |
| CN115016948B (zh) * | 2022-08-08 | 2022-11-25 | 阿里巴巴(中国)有限公司 | 一种资源访问方法、装置、电子设备及可读存储介质 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115398447A (zh) | 2022-11-25 |
| US12475362B2 (en) | 2025-11-18 |
| JPWO2021210527A1 (ja) | 2021-10-21 |
| US20230138667A1 (en) | 2023-05-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP6896306B1 (ja) | Neural network circuit, edge device, and neural network operation method | |
| JP2025072666A (ja) | Neural network generation device | |
| WO2021210527A1 (ja) | Method for controlling a neural network circuit | |
| US20240095522A1 (en) | Neural network generation device, neural network computing device, edge device, neural network control method, and software generation program | |
| WO2022230906A1 (ja) | Neural network generation device, neural network operation device, edge device, neural network control method, and software generation program | |
| JP6931252B1 (ja) | Neural network circuit and method for controlling a neural network circuit | |
| WO2025105405A1 (ja) | Neural network circuit and neural network operation method | |
| JP2022105437A (ja) | Neural network circuit and neural network operation method | |
| US20250006230A1 (en) | Neural network circuit and neural network circuit control method | |
| JP2024118195A (ja) | Neural network circuit and neural network operation method | |
| JP2022183833A (ja) | Neural network circuit and neural network operation method | |
| JP2025065487A (ja) | Control device and control method | |
| WO2024038662A1 (ja) | Neural network learning device and neural network learning method | |
| JP2024075106A (ja) | Neural network circuit and neural network operation method | |
| WO2023139990A1 (ja) | Neural network circuit and neural network operation method | |
| JP2025015342A (ja) | Neural network circuit | |
| JP2023154880A (ja) | Neural network generation method and neural network generation program | |
| JP2022114698A (ja) | Neural network generation device, neural network control method, and software generation program | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21788250 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2022515367 Country of ref document: JP Kind code of ref document: A |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 21788250 Country of ref document: EP Kind code of ref document: A1 |
|
| WWG | Wipo information: grant in national office |
Ref document number: 17917795 Country of ref document: US |