
CN118014030A - Neural network accelerator and system - Google Patents

Neural network accelerator and system

Info

Publication number
CN118014030A
Authority
CN
China
Prior art keywords
field
dot product
neural network
eci
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410276738.4A
Other languages
Chinese (zh)
Inventor
李冰
张新坤
张雨宁
李一凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Haipu Fangman Technology Co ltd
Original Assignee
Beijing Haipu Fangman Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Haipu Fangman Technology Co ltd filed Critical Beijing Haipu Fangman Technology Co ltd
Priority to CN202410276738.4A priority Critical patent/CN118014030A/en
Publication of CN118014030A publication Critical patent/CN118014030A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a neural network accelerator and system, belongs to the technical field of neural networks, and solves the problem that neural networks are difficult to train and infer at low bit widths. The technical scheme of the invention mainly comprises: a preprocessing unit, a dot product unit array and a control unit. The preprocessing unit is used for decoding vectors of an entropy-coded integer coding matrix to generate decoded vectors, the elements of which are triplets of sign bit, significand and exponent. The dot product unit array is composed of a plurality of dot product units, each of which outputs the dot product of two decoded vectors and comprises an ECI vector multiplier and an addition tree. The ECI vector multiplier multiplies the triplets of the two decoded vectors item by item: the two significands are multiplied, the two sign bits are XOR-ed, a negative product result is converted to its two's complement, the two exponents are added, and the product result is shifted according to the exponent sum.

Description

Neural network accelerator and system
Technical Field
The invention belongs to the technical field of neural networks, and particularly relates to a neural network accelerator and a neural network system.
Background
The explosive development of deep learning requires ever more complex models, larger datasets and higher computational power. The compute required for deep learning doubles every few months, a trend sometimes called the "Moore's law of deep learning". However, Moore's law in the hardware field is expected to come to an end around 2030, and the development of computer hardware is slowing down, so new techniques are needed to resolve the contradiction between the two "laws".
Training and inference of deep learning models require a very large amount of computation and memory bandwidth, and this overhead constitutes the main cost of deep learning training and inference. GPT-3 is reported to have cost up to US$500,000 to train, and because of the huge number of deployed instances the cost of inference far exceeds that of training. It is well known that the power consumption and area of an arithmetic circuit are directly related to the word length of its operands: the power consumption and area of an adder grow linearly with the operand bit width, while those of a multiplier grow with the square of the operand bit width. Reducing the operand bit width by quantization is therefore one of the key techniques for reducing the power consumption and area overhead of deep learning processors/accelerators. Traditional fixed-point quantization, however, introduces large network prediction errors at low bit widths: the distributions of the model weights, activation values and gradients of a neural network are approximately Gaussian, as shown in fig. 1, the data distribution is very close to a bell curve, and a fixed-point number format has difficulty covering both the required range and precision at the same time, so fixed-point numbers suffer large information loss when expressing a network model.
In the prior art, quantization is performed using integers, floating-point numbers, logarithmic representations, look-up tables and the like.
Integers used as quantized values are uniformly distributed in the number system and have low precision, so the information loss is obvious, and the accumulation of errors through multi-layer iterative operations causes a large loss of accuracy; in other words, maintaining equivalent accuracy requires a higher bit width, which runs contrary to the original purpose of quantization.
Floating-point numbers can overcome the shortcomings of fixed-point numbers in terms of range and precision, and the bit width can be reduced while accuracy is preserved. For example, TPUv1 uses 8-bit integers as its quantized data type; TPUv4i uses BF16; the NVIDIA A100 GPU adopts data types such as FP16/BF16/TF32, and the newer Hopper H100 adds the FP8 type. However, floating-point arithmetic requires exponent alignment, normalization, rounding, exception handling and so on, so the area and power consumption cost of the circuit is high.
Quantization using a logarithmic number system significantly increases the overhead of performing additions.
Indirect or table look-up methods can markedly reduce the bit width and the model size, but a model expressed by a look-up table cannot be computed on directly: the true value of each weight must first be obtained through table look-up, and that true value is usually expressed as a high-precision floating-point number, so the hardware cost and power consumption of table look-up quantization are high. From the standpoint of compression effect and accuracy, the encoding of the quantized values themselves has a great influence on the quantization result. Mainstream deep learning training currently uses 32/16-bit floating-point precision (FP32/BF16/FP16/TF32), while inference usually uses 16-bit (FP16/BF16) or 8-bit (INT8/FP8) precision. It is an object of the present invention to provide an accelerator structure for neural networks that solves the problem that neural networks are difficult to train and infer at low bit widths.
Disclosure of Invention
In view of the above analysis, embodiments of the present invention aim to provide a neural network accelerator and system to solve the problem that neural networks are difficult to train and infer at low bit widths. The main computational overhead in neural networks comes from convolution (CONV) and matrix multiplication (GEMM) operations; since a convolution can be converted into several matrix multiplications (as sketched below), the core of the neural network accelerator is optimized for matrix multiplication.
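As an illustration of the CONV-to-GEMM reduction mentioned above, the following Python sketch lowers a stride-1, valid-padding 2-D convolution to a single matrix multiplication via the well-known im2col transformation. The sketch is not part of the patent text; the function and variable names are illustrative assumptions.

```python
import numpy as np

def im2col_conv2d(x, w):
    """Lower a 2-D convolution (one image, valid padding, stride 1) to one GEMM.
    x: input of shape (C, H, W); w: kernels of shape (K, C, R, S)."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    out_h, out_w = H - R + 1, W - S + 1
    # Each output position becomes one column holding the C*R*S input values it sees.
    cols = np.empty((C * R * S, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i + R, j:j + S].ravel()
    # The convolution is now a (K, C*R*S) x (C*R*S, out_h*out_w) matrix multiplication.
    y = w.reshape(K, -1) @ cols
    return y.reshape(K, out_h, out_w)

# Usage: a 3-channel 8x8 input with four 3x3 kernels yields a (4, 6, 6) output.
y = im2col_conv2d(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
```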
The invention provides a neural network accelerator for performing matrix multiplication on a parameter matrix quantized by entropy-coded integer coding, comprising: a preprocessing unit, a dot product unit array and a control unit;
the preprocessing unit is used for decoding vectors of the entropy-coded integer coding matrix to generate decoded vectors, the elements of which have a triplet structure of sign bit, significand and exponent;
The dot product unit array consists of a plurality of dot product units, each of which is used for outputting the dot product of two decoded vectors and comprises an ECI vector multiplier and an addition tree;
The ECI vector multiplier comprises a plurality of ECI scalar multipliers that multiply, item by item, the corresponding triplets of the two decoded vectors; each ECI scalar multiplier multiplies the two significands, performs an exclusive OR operation on the two sign bits, performs a two's-complement operation on a negative product result, adds the two exponents, and shifts the product result according to the exponent sum;
the addition tree is used for summing output vectors of the ECI vector multiplier;
The control unit is connected with the preprocessing unit and the dot product unit array and is used for controlling the preprocessing unit to acquire vectors of the entropy coding integer coding matrix and outputting decoding vectors to the dot product unit.
Further, the ECI scalar multiplier comprises an exclusive-OR operator, a significand multiplier, an exponent accumulator, a complementer, a selector and a shifter, wherein the exclusive-OR operator is connected with the selector, and the significand multiplier is connected with the selector on the one hand and with the complementer, which generates the two's complement in parallel, on the other hand; the output of the complementer is connected with the selector, and the selector outputs a result according to the generated sign bit; the selector is connected with the shifter, and the exponent accumulator is connected with the shifter.
In some embodiments, the addition tree comprises a Wallace addition tree or Dadda addition tree.
In some embodiments, the sign bit, the exponent field, and the significand field are in order from left to right in the entropy encoded integer encoding;
the sign bit is represented by 1 bit 0 or 1;
the exponent field adopts a binary system to represent the value of the exponent field;
the significant digital field adopts binary system to represent the value of the significant digital field;
The encoding also has a super parameter (N, E), where N represents the bit width of the entropy encoded integer encoding and E represents the maximum number of bits of the exponent field;
the preprocessing unit decodes the entropy-coded integer code as follows:
extracting the sign bit, the exponent field and the significand field from the code according to the hyper-parameters;
decoding the exponent field to obtain the decoded value of the exponent field, and decoding the significand field to obtain the decoded value of the significand field, to obtain the triplet structure;
the correspondence between the codes and the true values is expressed as:
s = δ
m = (1 ∥ μ), read as a binary number
A = (−1)^δ × m × 2^e
Wherein A represents the true value, s represents the decoded value of the sign bit, e represents the decoded value of the exponent field, m represents the decoded value of the significand field, δ represents the sign bit, ε represents the exponent field and μ represents the significand field; (1 ∥ μ) denotes the bit 1 and the significand field spliced together into one binary number.
In some embodiments, the exponent field includes a number of consecutive bits 1, the number of bits 1 in the consecutive bits 1 representing a value of the exponent field, the value of the exponent field being less than or equal to E;
If the decoding value of the exponent field is not equal to E, a separator is inserted between the exponent field and the significant digit field, and the separator is one bit 0;
If the value of the exponent field is equal to E, the significant digit field is spliced directly after the exponent field.
In some embodiments, the accelerator further comprises a first input buffer and a second input buffer, and the preprocessing unit comprises a first decoder and a second decoder; the first input buffer is connected with the first decoder, the second input buffer is connected with the second decoder, and the first decoder and the second decoder are connected with the dot product unit.
In some embodiments, the dot product unit further comprises a first vector register and a second vector register, the first decoder is connected to the ECI vector multiplier through the first vector register and the second vector register in turn, and the second decoder is connected to the ECI vector multiplier.
Further, the device also comprises a partial sum accumulation unit, wherein the partial sum accumulation unit is used for accumulating the result vector output by the dot product unit array with the buffered corresponding partial sum vector; the partial sum accumulation unit comprises a partial sum accumulator and a partial sum buffer, the partial sum accumulator being connected to the output of the dot product unit and to the partial sum buffer.
In some embodiments, the control unit is coupled to the first input buffer, the second input buffer and the partial sum accumulation unit;
The control unit is further configured to control the first input buffer to acquire a first entropy-coded integer coding matrix and output a first vector to the first decoder; and
to control the second input buffer to acquire a second entropy-coded integer coding matrix and output a second vector to the second decoder; and
to control the dot product unit to output a dot product result to the partial sum accumulator and to read the corresponding partial sum vector from the partial sum buffer.
The invention also provides a neural network accelerator system comprising a plurality of neural network accelerators as described in any of the embodiments above.
The embodiment of the invention has at least the following beneficial effects:
The ECI preprocessing unit and the ECI vector dot product unit array accelerate the operation of the neural network model, so that the hardware cost and power consumption of the neural network accelerator are significantly reduced.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the embodiments of the present description, and other drawings may be obtained according to these drawings for a person having ordinary skill in the art.
FIG. 1 is a schematic diagram of the overall architecture of a neural network accelerator according to the present invention;
FIG. 2 is a schematic diagram of a dot product cell architecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an ECI vector multiplier architecture according to an embodiment of the present invention;
FIG. 4 is a graph showing the comparison of data of relative energy consumption of computing units with different bit widths and formats according to an embodiment of the present invention;
FIG. 5 is a graph showing the comparison of the relative area data of the operation units with different bit-width formats according to the embodiment of the present invention;
FIG. 6 is a graph comparing accelerator power consumption data for different formats and bit widths provided by an embodiment of the present invention;
FIG. 7 is a graph comparing accelerator area data for different formats and bit widths provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of a cascaded accelerator system architecture according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of an accelerator system architecture interconnected through a network on chip according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. It should be noted that embodiments and features of embodiments in the present disclosure may be combined, separated, interchanged, and/or rearranged with one another without conflict. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising," and variations thereof, are used in the present specification, the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof is described, but the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof is not precluded. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximation terms and not as degree terms, and as such, are used to explain the inherent deviations of measured, calculated, and/or provided values that would be recognized by one of ordinary skill in the art.
With the breakthrough progress of deep learning in fields such as image classification, object detection and natural language processing, the demand for applying deep learning in real-life scenarios is growing ever stronger. Mobile and portable electronic devices already greatly facilitate people's lives, and deep learning will greatly improve the intelligence and entertainment value of these devices, so deploying deep-learning neural network models on mobile terminals and in embedded systems is an urgent need. In practical deployment, however, such models usually face the problem of excessive size: the file of a neural network model typically ranges from tens to hundreds of megabytes, and for a mobile terminal the data consumed during download and the long transmission delay caused by limited bandwidth are intolerable to users; for some embedded systems with limited memory, there may simply not be enough space to store such large model files. At the same time, deep-learning models place high demands on computing resources and computing capability: when a large model is used for computation, a mobile terminal or embedded system may be unable to provide the required resources, or the computation may be so slow that the response delay is too high for the application scenario. In addition, the neural network model also consumes considerable power: during computation the processor must frequently read the model parameters, so a larger model brings correspondingly more memory accesses, and frequent memory access greatly increases power consumption, which is unfavorable for deploying the model on a mobile terminal.
Therefore, in order to deploy a neural network with good performance on hardware with limited resources, the neural network model must be compressed and accelerated. The distributions of the model weights and activation values of a neural network approximate a Gaussian distribution, while the gradients approximate a log-normal distribution, and these tensor data are distributed very non-uniformly. Traditional fixed-point coding is easy to implement but causes an obvious loss of model accuracy at low bit widths; floating-point coding preserves model accuracy well but its encoding makes the arithmetic power consumption high. Entropy Coded Integer (ECI) coding resolves this contradiction between bit width and power consumption in the traditional coding schemes, and can significantly reduce the energy consumption of neural network training and deployment while keeping the model accuracy.
The present invention addresses, at the level of the hardware structure and based on ECI, the problem that neural networks are difficult to train and infer at low bit widths, and provides a neural network accelerator for performing matrix multiplication on parameter matrices quantized by entropy-coded integer coding. As shown in FIG. 1, it comprises: a preprocessing unit 1, a dot product unit array 2 and a control unit;
the preprocessing unit 1 is used for decoding vectors of the entropy-coded integer coding matrix to generate decoded vectors, the elements of which have a triplet structure of sign bit, significand and exponent;
As shown in fig. 1, the dot product unit array 2 comprises a number of dot product units (EDOT) 21, each dot product unit 21 being arranged to output the dot product of two of said decoded vectors; as shown in fig. 2, each dot product unit 21 comprises an ECI vector multiplier 211 and an addition tree 212.
The ECI vector multiplier 211 comprises a plurality of ECI scalar multipliers that multiply, item by item, the corresponding elements respectively belonging to the two decoded vectors; each ECI scalar multiplier is used for multiplying the two significands, performing an exclusive OR operation on the two sign bits, performing a two's-complement operation on a negative product result, adding the two exponents, and shifting the product result according to the exponent sum;
the summing tree 212 is used to sum the outputs of the ECI vector multipliers 211;
it should be appreciated that the output of ECI vector multiplier 211 is a vector and summing tree 212 is required to sum all elements in the output vector. In other words, the summing tree 212 sums the output results of several of the ECI scalar multipliers.
The control unit is connected to the preprocessing unit 1, the dot product unit array 2 and the partial sum accumulation unit 5, and is used for controlling the preprocessing unit 1 to acquire the vectors of the entropy-coded integer coding matrix and to output the decoded vectors to the dot product units 21.
It should be understood that the parameter matrix mentioned in this embodiment includes, but is not limited to, weights, activation values, gradients, etc. of each convolution layer in the neural network, and the operation of the common weight and activation value matrix is illustrated in this embodiment. In addition, the neural network accelerator provided in this embodiment is actually one operation subunit of the overall accelerator, and in practical application, the accelerator deploys a plurality of neural network accelerators provided in this embodiment to calculate each block matrix separately, so as to reduce the size of a single operation subunit and improve the parallel processing efficiency.
Specifically, the sign bit, the exponent field and the significant digit field are sequentially arranged from left to right in the entropy coding integer coding;
the sign bit is represented by 1 bit 0 or 1;
the exponent field adopts a binary system to represent the value of the exponent field;
the significand field adopts a binary system to represent the value of the significand field;
The encoding also has a pair of hyper-parameters (N, E), where N represents the bit width of the entropy-coded integer code and E represents the maximum number of bits of the exponent field;
the preprocessing unit decodes the entropy-coded integer code as follows:
extracting the sign bit, the exponent field and the significand field from the code according to the hyper-parameters;
decoding the exponent field to obtain the decoded value of the exponent field, and decoding the significand field to obtain the decoded value of the significand field, to obtain the triplet structure;
the correspondence between the codes and the true values is expressed as:
s = δ
m = (1 ∥ μ), read as a binary number
A = (−1)^δ × m × 2^e
Wherein A represents the true value, s represents the decoded value of the sign bit, e represents the decoded value of the exponent field, m represents the decoded value of the significand field, δ represents the sign bit, ε represents the exponent field and μ represents the significand field; (1 ∥ μ) denotes the bit 1 and the significand field spliced together into one binary number.
In some embodiments, the exponent field includes a number of consecutive bits 1, the number of bits 1 in the consecutive bits 1 representing a value of the exponent field, the value of the exponent field being less than or equal to E;
If the decoding value of the exponent field is not equal to E, a separator is inserted between the exponent field and the effective number field, and the separator is one bit 0;
If the value of the exponent field is equal to E, the significant digit field is spliced directly after the exponent field.
The ECI coding is designed so that the numerical distribution of the code matches the numerical distribution of the parameters to be quantized in the neural network; quantization error can therefore be reduced during quantization, the data of the quantized model can still express the knowledge of the original network model, and the accuracy of the quantized model remains nearly lossless.
Specifically, the ECI number system is a code with a non-uniform numerical distribution, and two hyper-parameters N and E are used to define it: N is the bit width of the ECI number, i.e. the number of bits of the ECI code, and E is the maximum number of bits of the exponent field. E is called the maximum number of exponent bits because, in an ECI(N, E) encoding with given hyper-parameters, the number of bits actually occupied by the exponent field can vary; for example, with E = 3 the exponent field may occupy 0, 1, 2 or 3 bits, and correspondingly, since the bit width N is fixed, the number of bits of the significand field also varies. According to the conversion formula between ECI codes and true values, the more bits the significand field has, the more values the significand field can take. In other words, in the entropy-coded integer code the number segments with a lower bit width have denser values and smaller differences between adjacent values (quantization intervals); conversely, the number segments corresponding to a higher bit width have sparser values and larger differences between adjacent values (quantization intervals). The density of the values is determined by the hyper-parameter E. It can be seen that the numerical distribution of ECI is more flexible than that of ordinary codes in adapting to the tensor data of a neural network.
The ECI number provided by the invention has the following characteristics:
1. For a fixed hyper-parameter E, every value that ECI(N, E) can represent is unique except zero; zero has two legal representations, +0 and −0.
2. The maximum value that ECI(N, E) can represent is 2^(N+E−2) (for an unsigned ECI number system, 2^(N+E)); when E = N−1, the maximum value that ECI can represent is 2^(2N−3).
3. By adjusting the hyper-parameter E, the spacing between adjacent ECI values is correspondingly enlarged or reduced, giving different value sets of ECI numbers for different values of E at a fixed bit width.
4. When E = 0 or 1, ECI(N, 0) and ECI(N, 1) are equivalent to a conventional N-bit integer system, so an integer system of the same bit width can be regarded as a subset or special case of the ECI number system.
Nonlinear operations in neural networks typically require comparison operations; for example, the ReLU activation function and MaxPooling both require comparisons. One possible approach is to decode the ECI numbers into integers before comparing, but since the decoded integer bit width is nearly doubled, the decode operation and the wide integer comparison occupy more resources (memory access bandwidth, chip area and power consumption). Because the ECI numbers and the integers have a strict order-preserving relation, comparison in the ECI domain can be carried out directly on the low-bit-width ECI codes without decoding; and since the ECI bit width is shorter than the bit width in the integer domain, the comparison can be implemented with a simpler and more efficient circuit.
The ECI dot product operation can be decomposed into sign-bit XOR, exponent accumulation, significand multiplication, complement operation, shift operation and reduction (accumulation). The main hardware overhead of the dot product is determined by the significand multiplications and the accumulation. Since the complexity (cost) of a multiplication is proportional to the square of the operand bit width, i.e. it is O(n^2), the lower the bit width, the lower the cost of the multiplier.
The multiplication between two ECI codes includes an exclusive OR operation of the sign bits, an accumulation operation of the decoded values of the exponent field, and a product operation of the decoded values of the significant digit field.
Specifically, if two ECI numbers x and y are decoded as (s_x, e_x, m_x) and (s_y, e_y, m_y), the product of x and y, stored and expressed as an integer, is calculated as:
x × y = (−1)^(s_x ⊕ s_y) × (m_x × m_y) × 2^(e_x + e_y)
wherein s_x and s_y represent the sign bits of x and y respectively, ⊕ denotes the exclusive OR of s_x and s_y, m_x and m_y represent the decoded significands of x and y respectively, and e_x and e_y represent the decoded exponents of x and y respectively. The multiplication between ECI numbers can therefore be completed without fully decoding to an integer format, saving resources in the operation of the quantized neural network.
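A minimal numeric sketch of this identity is shown below (illustrative Python, not part of the patent; the operands are assumed to be already-decoded (s, e, m) triplets as defined above).

```python
def eci_true_value(s, e, m):
    # True value of a decoded ECI triplet: A = (-1)^s * m * 2^e.
    return (-1) ** s * m * (1 << e)

def eci_multiply(x, y):
    """Multiply two decoded triplets without converting each operand to an integer:
    XOR the sign bits, add the exponents, multiply the significands."""
    sx, ex, mx = x
    sy, ey, my = y
    return (sx ^ sy, ex + ey, mx * my)

# The true value of the product triplet equals the product of the true values.
x, y = (0, 3, 5), (1, 2, 7)   # illustrative triplets
assert eci_true_value(*eci_multiply(x, y)) == eci_true_value(*x) * eci_true_value(*y)
```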
In some embodiments, as shown in fig. 3, each ECI scalar multiplier in the ECI vector multiplier 211 includes an exclusive-OR operator 211a, a significand multiplier 211b, an exponent accumulator 211c, a complementer 211d, a selector 211e and a shifter 211f. The exclusive-OR operator 211a is connected to the selector 211e, the significand multiplier 211b is connected both to the selector 211e and to the complementer 211d, the complementer 211d is connected to the selector 211e, the selector 211e is connected to the shifter 211f, and the exponent accumulator 211c is connected to the shifter 211f.
Preferably, under the control of the control unit, the preprocessing unit 1 decodes an ECI vector of bit width N and length D into a triplet vector of D × (N + logN + 1) bits. Taking a weight matrix and an activation-value matrix as an example, a given dot product unit 21 obtains one decoded weight vector and one decoded activation-value vector; the pair of weight/activation triplet vectors is fed into the ECI vector multiplier 211, which multiplies the corresponding triplets of the pair one by one. Finally, the product of each triplet pair is fed into the addition tree 212, which completes the accumulation and yields the dot product of the weight and activation-value triplet vectors.
Specifically, let a triplet element of the weight vector be (s_w, e_w, m_w) and a triplet element of the activation-value vector be (s_a, e_a, m_a). The exclusive-OR operator 211a takes the two sign bits s_w and s_a, XORs them and outputs the result to the selector 211e. The significand multiplier 211b takes the two significand fields m_w and m_a and multiplies them; the result is output directly to the selector 211e and, in parallel, to the complementer 211d, which forms the two's complement of the significand product and sends it to the selector 211e. According to the XOR-ed sign bit, the selector 211e outputs the complemented result to the shifter 211f if the result is negative, and otherwise outputs the plain significand product. The exponent accumulator 211c takes the two exponent fields e_w and e_a, adds them and outputs the sum to the shifter 211f, which shifts the received complemented or plain product by the exponent sum, producing a single triplet product of 4N−7 bits. This hardware structure realizes the product of a pair of triplets, and the two's-complement handling of negative numbers facilitates the subsequent accumulation of the product results. The resulting vector of D × (4N−7) bits is then output to the addition tree 212, where the D signed numbers of 4N−6 bits are added to obtain a dot product of 4N−6+logD bits.
In some embodiments, the addition tree 212 may be implemented as a Wallace tree, a Dadda tree or any other adder tree. Since the accumulation is completed in the integer domain, there are no alignment, rounding or normalization operations such as those required for floating-point numbers, so the logic circuit can be realized more efficiently and compactly.
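To make the datapath behavior concrete, the following Python sketch models one dot product unit at the word level: per-element sign XOR, significand multiplication, negation (the complement path) for negative products, an exponent-controlled left shift, and a balanced pairwise reduction standing in for the adder tree. Function names and the use of unbounded Python integers instead of fixed-width two's-complement words are simplifying assumptions, not details taken from the patent.

```python
def edot(vec_a, vec_b):
    """Dot product of two decoded ECI vectors whose elements are (s, e, m) triplets."""
    products = []
    for (sa, ea, ma), (sb, eb, mb) in zip(vec_a, vec_b):
        mag = (ma * mb) << (ea + eb)                 # significand product shifted by e_a + e_b
        products.append(-mag if (sa ^ sb) else mag)  # complement path when the XOR-ed sign is 1

    # Balanced pairwise reduction, the software stand-in for the adder tree.
    while len(products) > 1:
        nxt = [products[i] + products[i + 1] for i in range(0, len(products) - 1, 2)]
        if len(products) % 2:
            nxt.append(products[-1])
        products = nxt
    return products[0] if products else 0

# Usage: weight and activation vectors given as decoded (s, e, m) triplets.
w = [(0, 1, 3), (1, 0, 5), (0, 2, 1)]
a = [(0, 0, 2), (0, 1, 2), (1, 0, 7)]
print(edot(w, a))   # 12 - 20 - 28 = -36
```

The pairwise reduction has the same tree shape as a Wallace or Dadda tree, although a real implementation would use carry-save adders of fixed width.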
In some embodiments, as shown in fig. 1, further comprising a first input buffer 41 and a second input buffer 42, the preprocessing unit 1 comprises a first decoder 11 and a second decoder 12, the first input buffer 41 is connected to the first decoder 11, the second input buffer 42 is connected to the second decoder 12, and the first decoder 11 and the second decoder 12 are both connected to the dot product unit 21.
The power consumption overhead of ECI decoding within a single multiplication is about 10-15% of the multiplication itself. The preprocessing unit 1 is responsible for decoding the operands before the tensor data enter the dot product unit array 2, so a large number of otherwise repeated decoding operations can be performed once by the preprocessing unit, and the energy overhead of the decoder is amortized over multiple instances of the dot product unit array 2. An N-bit ECI number is decoded into an (s, e, m) triplet of N + logN + 1 bits; an ECI vector containing D elements is decoded into a vector of D × (N + logN + 1) bits.
Preferably, the decoding method employed by the first decoder 11 and the second decoder 12 comprises: acquiring the hyper-parameters (N, E) of the code, where N represents the bit width of the code and E represents the maximum number of bits of the exponent field;
extracting the exponent field and the significand field from the code according to the hyper-parameters;
decoding the exponent field to obtain the decoded value of the exponent field, and decoding the significand field to obtain the decoded value of the significand field;
calculating the encoded true value from the decoded significand and exponent values by the formula:
A = m × 2^e
wherein A represents the true value, e represents the decoded value of the exponent field, m represents the decoded value of the significand field, ε represents the binary exponent field and μ represents the binary significand field; m is obtained by splicing the bit 1 in front of the significand field and reading the result as a binary number.
In some embodiments, before extracting the exponent field and the significand field, the method further comprises:
extracting the sign bit located at the first bit of the code and taking its bit value as the decoded value of the sign bit.
In some embodiments, extracting the exponent field and the significand field from the code according to the hyper-parameters comprises:
extracting the consecutive 1 bits that follow the sign bit, up to at most E bits, as the exponent field, the number of 1 bits being the value of the exponent field;
if the value of the exponent field is not equal to E, extracting the separator that follows the exponent field and the significand field that follows the separator, the separator being a single 0 bit and the significand field being a binary number;
and if the value of the exponent field is equal to E, extracting the significand field that follows the exponent field, the significand field being a binary number.
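The extraction steps above can be summarized in a short Python sketch. It follows the field layout as stated in the text (a 1-bit sign, a run of up to E consecutive 1 bits as the exponent, a single 0 separator when the exponent value is below E, and the remaining bits as the significand field, decoded with the bit 1 spliced in front); corner cases such as the two encodings of zero are not modeled, and the function name is an illustrative assumption.

```python
def eci_decode(code, N, E):
    """Decode an N-bit ECI(N, E) code word into its (s, e, m) triplet and true value."""
    bits = format(code, f"0{N}b")   # MSB-first bit string of the code word
    s = int(bits[0])                # sign bit at the first position
    pos, e = 1, 0
    while e < E and pos < N and bits[pos] == "1":   # run of 1s gives the exponent value
        e += 1
        pos += 1
    if e < E and pos < N:           # a single 0 bit acts as the separator
        pos += 1
    m = int("1" + bits[pos:], 2)    # bit 1 spliced in front of the significand field
    return (s, e, m), (-1) ** s * m * (1 << e)

# Usage: decode one code word of a toy ECI(6, 3) format.
triplet, value = eci_decode(0b010110, N=6, E=3)
```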
The first input buffer 41 and the second input buffer 42 are used to store the two input matrices, respectively. Each of them contains two banks of SRAM operating in ping-pong mode.
In some embodiments, as shown in fig. 2, the dot product unit 21 further includes a first vector register 213 and a second vector register 214, the first decoder 11 is connected to the ECI vector multiplier 211 through the first vector register 213 and the second vector register 214 in sequence, and the second decoder 12 is connected to the ECI vector multiplier 211.
Inside each dot product unit there are two vector registers: a first vector register 213 and a second vector register 214, each of which can hold D × (N + logN + 1) bits, i.e. one decoded vector. The first vector register 213 is used to preload a new vector, while the second vector register 214 participates in the ongoing vector computation. The first vector register 213 can be regarded as a shadow register of the second vector register 214: while the second vector register 214 is used in a matrix operation, a new vector can be preloaded at the same time, and once the data in the second vector register 214 have been consumed, the control unit loads the contents of the first vector register 213 into the second vector register 214. Computation can therefore proceed without interruption, pauses caused by loading data from memory are avoided, and the memory access overhead is hidden.
In some embodiments, as shown in fig. 1, the apparatus further includes a partial sum accumulation unit 5, where the partial sum accumulation unit 5 is configured to accumulate the result vector output by the dot product unit array 2 with the buffered corresponding partial sum vector. The partial sum accumulation unit 5 comprises a partial sum accumulator 51 and a partial sum buffer 52; the partial sum accumulator 51 is connected to the output of the dot product unit 21 and to the partial sum buffer 52.
The partial sum buffer 52 consists of two banks of SRAM and operates in a classical read-write mode. In each cycle, under the control of the control unit, it reads the D-element vector stored at the designated address from the SRAM, accumulates it with the vector output by the dot product unit array 2, and writes the result back in place.
In some embodiments, the control unit is connected to the first input buffer 41, the second input buffer 42 and the partial sum accumulation unit 5.
The control unit is further configured to control the first input buffer 41 to acquire a first entropy-coded integer coding matrix and output a first vector to the first decoder 11; and
to control the second input buffer 42 to acquire a second entropy-coded integer coding matrix and output a second vector to the second decoder 12; and
to control the dot product unit 21 to output its dot product result to the partial sum accumulator 51 and to read the corresponding partial sum vector from the partial sum buffer 52.
In some embodiments, the control unit controls the neural network accelerator provided in the above embodiments according to the following flow:
Step 1: D vectors A0, A1, ..., A(D−1) are preloaded from the first input buffer 41 into the first vector registers 213 of the respective dot product units 21; this step requires D clock cycles.
Step 2: the vectors in the first vector registers 213 of all dot product units 21 are transferred to the second vector registers 214 simultaneously, in one cycle.
Step 3: a vector Bi is loaded from the second input buffer 42 and broadcast to all dot product units 21; the dot product units 21 simultaneously compute the dot products of the vector Bi with all the vectors A0, A1, ..., A(D−1).
Step 4: the dot product units 21 pass their outputs to the partial sum accumulation unit 5, which at the same time reads the corresponding partial sums from the partial sum buffer 52, accumulates the two, and writes the result back to the original address. Steps 3 and 4 are repeated until all vectors B0, B1, ... of the second input buffer have been processed.
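The four steps amount to a blocked matrix multiplication with the partial sums kept stationary in the partial sum buffer. The following Python sketch (illustrative; the argument names and the use of plain integer vectors in place of decoded ECI triplets are assumptions) models one pass of this control flow.

```python
import numpy as np

def accelerator_pass(a_vectors, b_vectors, partial_sums):
    """One pass of the described flow.
    a_vectors: the D vectors A0..A(D-1) preloaded from the first input buffer.
    b_vectors: the vectors B0, B1, ... streamed from the second input buffer.
    partial_sums: list with one partial-sum vector of D elements per Bi."""
    # Steps 1-2: preload the A vectors into the dot product units' vector registers.
    vector_regs = [np.asarray(a) for a in a_vectors]

    # Steps 3-4: broadcast each Bi to all dot product units, accumulate the D dot
    # products into the matching partial-sum vector, and write it back in place.
    for i, b in enumerate(b_vectors):
        b = np.asarray(b)
        dot_outputs = np.array([reg @ b for reg in vector_regs])
        partial_sums[i] = partial_sums[i] + dot_outputs
    return partial_sums
```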
In some embodiments, synthesis experiments were carried out with a TSMC 28 nm process library at 1 GHz, 1.0 V and 25 °C, giving the performance comparisons shown in FIGS. 4-7.
Fig. 4 and fig. 5 show the power consumption and area of multipliers and adders of different bit widths and formats. It can be seen that, compared with the FP16 format, the FP8 multiplier obtains a clear area/power benefit from the reduced bit width; however, to maintain sufficient accuracy the products of FP8 typically have to be accumulated in FP32 format, and the accumulation overhead of FP32 far exceeds the overhead of the FP8 and ECI8 multipliers.
Fig. 6 and fig. 7 show the overall power consumption and area comparison of the ECI accelerator and the FP8 accelerator, respectively. As can be seen from the figures, although FP8 is better than ECI in the energy/area of the multiplier logic, this advantage is offset by the 6.2× overhead required for the FP32 accumulation operation; moreover, because FP32 requires more pipeline stages and more logic, its register overhead also exceeds that of ECI. Whether measured by area or energy consumption, combinational logic or register overhead, the overall performance of the ECI accelerator is therefore better than that of the FP8 accelerator: its area efficiency is 1.72 times and its energy efficiency 1.36 times that of the FP8 accelerator.
In some embodiments, the neural network accelerator provided by the invention can be used both for inference scenarios and for training scenarios. Mobile and embedded application scenarios usually require an energy-efficient acceleration unit with a strict upper limit on power consumption. Such an embodiment of the invention may adopt a single EDOT array supporting a 4-bit ECI number system, the array containing 16 EDOT units, each of which can complete the dot product of two ECI vectors of length 16.
The invention also provides a neural network accelerator system comprising a plurality of neural network accelerators as described in any of the embodiments above.
It should be appreciated that the accelerator described above is deployed as a plurality of accelerator components participating in an accelerator system.
In some embodiments, a training accelerator usually needs to support more EDOT units than an inference accelerator in order to provide higher computational power, and each EDOT unit may handle longer vectors. As the number of EDOT units in the EDOT array 2 increases, the longer broadcast link requires higher power consumption to drive the broadcast signal, which negatively affects frequency and power consumption. In large-scale array designs it is therefore desirable to interconnect multiple arrays to improve peak performance, and there are several options for the manner of interconnection between arrays. The embodiment shown in fig. 8 provides a multi-array cascaded accelerator system.
In the illustrated embodiment, a total of 16 EDOT arrays 2 are cascaded serially through pipeline registers; each EDOT array 2 contains K = 16 EDOT units, and each EDOT unit can perform the dot product of two 8-bit ECI vectors of length D = 256. The scheme thus uses 16 cascaded small-scale arrays to form a larger 256x256 array, and the input activation values and weights between adjacent small arrays are stored and forwarded through the pipeline registers 6, which act as signal relays and strengthen the drive capability of the signals. It should be understood that the present invention is not limited to the number of arrays used in this embodiment.
In some embodiments, as shown in fig. 9, an accelerator system is provided in which the EDOT arrays 2 are interconnected by a ring network on chip.
In the illustrated embodiment, a total of 8 EDOT arrays 2 are interconnected through a ring network on chip; each EDOT array 2 contains K = 64 EDOT units, and each EDOT unit can perform the dot product of two 8-bit ECI vectors of length D = 64. This embodiment therefore uses 8 small-scale EDOT arrays 2 to form a larger 64x64x8 array; adjacent small arrays forward the input activation values, weights and/or partial sums through the ring network 7, which increases the degree of data reuse, reduces DRAM traffic, and lowers system latency and power consumption.
It should be appreciated that the ring network in this embodiment merely illustrates one manner of interconnecting the arrays in the accelerator; the scope of protection of this patent is not limited to ring networks. Interconnecting the arrays of the present invention using any form of network on chip, including but not limited to ring, mesh and torus networks, falls within the scope of the invention.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments illustrates the general principles of the invention and is not intended to limit its scope; the invention is not limited to the particular embodiments described, and any modifications, equivalents, improvements and the like made within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A neural network accelerator for performing matrix multiplication on a parameter matrix quantized by entropy-coded integer coding, comprising: a preprocessing unit, a dot product unit array and a control unit;
the preprocessing unit is used for decoding vectors of the entropy-coded integer coding matrix to generate decoded vectors, the elements of the decoded vectors having a triplet structure of sign bit, significand and exponent;
the dot product unit array comprises a plurality of dot product units, wherein the dot product units are used for outputting dot products of two decoding vectors, and the dot product units comprise ECI vector multipliers and addition trees;
The ECI vector multiplier comprises a plurality of ECI scalar multipliers for multiplying, item by item, the corresponding elements respectively belonging to the two decoded vectors; the ECI scalar multiplier is used for multiplying the two significands, performing an exclusive OR operation on the two sign bits, performing a complement operation on a negative product result, adding the two exponents, and shifting the product result according to the exponent operation result;
the addition tree is used for summing output result vectors of the ECI vector multiplier;
The control unit is connected with the preprocessing unit and the dot product unit array and is used for controlling the preprocessing unit to acquire vectors of the entropy coding integer coding matrix and outputting decoding vectors to the dot product unit.
2. The neural network accelerator of claim 1, wherein: the ECI scalar multiplier comprises an exclusive-OR operator, a significand multiplier, an exponent accumulator, a complementer, a selector and a shifter, wherein the exclusive-OR operator is connected with the selector, and the significand multiplier is connected with the selector on the one hand and with the complementer, which generates the two's complement in parallel, on the other hand; the output of the complementer is connected with the selector, and the selector outputs a result according to the generated sign bit; the selector is connected with the shifter, and the exponent accumulator is connected with the shifter.
3. The neural network accelerator of claim 1, wherein: the adder tree includes a Wallace adder tree or Dadda adder tree.
4. The neural network accelerator of claim 1, wherein: the sign bit, the exponent field and the effective number field are sequentially arranged from left to right in the entropy coding integer coding;
the sign bit is represented by 1 bit 0 or 1;
the exponent field adopts a binary system to represent the value of the exponent field;
the significant digital field adopts binary system to represent the value of the significant digital field;
The encoding also has a super parameter (N, E), where N represents the bit width of the entropy encoded integer encoding and E represents the maximum number of bits of the exponent field;
the preprocessing unit decodes the entropy-coded integer code, comprising:
extracting the sign bit, the exponent field and the significand field from the code according to the hyper-parameters;
decoding the exponent field to obtain the decoded value of the exponent field, and decoding the significand field to obtain the decoded value of the significand field, to obtain the triplet structure;
The correspondence between the entropy coding integer codes and the true values is expressed as:
s = δ
m = (1 ∥ μ), read as a binary number
A = (−1)^δ × m × 2^e
Wherein A represents the true value, s represents the decoded value of the sign bit, e represents the decoded value of the exponent field, m represents the decoded value of the significand field, δ represents the sign bit, ε represents the exponent field and μ represents the significand field; (1 ∥ μ) denotes the bit 1 and the significand field spliced together into one binary number.
5. The neural network accelerator of claim 4, wherein: the exponent field includes a number of consecutive bits 1, the number of bits 1 in the consecutive bits 1 representing a value of the exponent field, the value of the exponent field being less than or equal to E;
If the decoding value of the exponent field is not equal to E, a separator is inserted between the exponent field and the significant digit field, and the separator is one bit 0;
If the value of the exponent field is equal to E, the significant digit field is spliced directly after the exponent field.
6. The neural network accelerator of claim 1, wherein: the accelerator further comprises a first input buffer and a second input buffer, and the preprocessing unit comprises a first decoder and a second decoder; the first input buffer is connected with the first decoder, the second input buffer is connected with the second decoder, and the first decoder and the second decoder are connected with the dot product unit.
7. The neural network accelerator of claim 6, wherein: the dot product unit further comprises a first vector register and a second vector register, the first decoder is connected with the ECI vector multiplier through the first vector register and the second vector register in sequence, and the second decoder is connected with the ECI vector multiplier.
8. The neural network accelerator of claim 7, wherein: the accelerator further comprises a partial sum accumulation unit for accumulating the result vector output by the dot product unit array with the buffered corresponding partial sum vector; the partial sum accumulation unit comprises a partial sum accumulator and a partial sum buffer, the partial sum accumulator being connected to the output of the dot product unit and to the partial sum buffer.
9. The neural network accelerator of claim 8, wherein: the control unit is connected with the first input buffer, the second input buffer and the partial sum accumulation unit;
The control unit is further configured to control the first input buffer to acquire a first entropy-coded integer coding matrix and output a first vector to the first decoder; and
to control the second input buffer to acquire a second entropy-coded integer coding matrix and output a second vector to the second decoder; and
to control the dot product unit to output a dot product result to the partial sum accumulator and to read the corresponding partial sum vector from the partial sum buffer.
10. A neural network accelerator system, characterized by: comprising a plurality of neural network accelerators as claimed in any one of claims 1-9.
CN202410276738.4A 2024-03-12 2024-03-12 Neural network accelerator and system Pending CN118014030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410276738.4A CN118014030A (en) 2024-03-12 2024-03-12 Neural network accelerator and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410276738.4A CN118014030A (en) 2024-03-12 2024-03-12 Neural network accelerator and system

Publications (1)

Publication Number Publication Date
CN118014030A true CN118014030A (en) 2024-05-10

Family

ID=90946956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410276738.4A Pending CN118014030A (en) 2024-03-12 2024-03-12 Neural network accelerator and system

Country Status (1)

Country Link
CN (1) CN118014030A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118535124A (en) * 2024-05-27 2024-08-23 北京航空航天大学合肥创新研究院 Shift adder tree structure, computing core architecture, computing execution method and chip



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination