WO2022198685A1 - Reduced approximation sharing-based single-input multi-weights multiplier - Google Patents
- Publication number
- WO2022198685A1 (PCT/CN2021/083445)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- mac
- multiply
- intermediate results
- bit
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/52—Multiplying; Dividing
- G06F7/523—Multiplying only
Definitions
- This invention relates generally to the field of computing devices, and more specifically, to a neural network accelerator.
- NNs: neural networks
- MAC: multiply-accumulate
- Neural network accelerators are designed to accelerate given tasks in the neural network, as traditional computing units, such as CPUs and GPUs, are often not efficient in handling these kinds of tasks.
- Resource consumption and computation complexity are the two most important metrics in accelerator design.
- quantization is widely used, in which a small number of digits is used to represent the weights.
- approximate computation is an effective technique to reduce the computational complexity of MAC operations due to the inherent noise immunity of NNs.
- An object of the present invention is to provide a reduced approximation sharing-based single-input multi-weights multiplier, to reduce hardware resource consumption and speed up computation of neural networks. Additional features and advantages of this invention will become apparent from the following detailed descriptions.
- the MAC hardware accelerator may comprise a Pre-multiplier (PrM) and a plurality of Post-multipliers (PoM).
- the method may comprise multiplying in the PrM an input with a plurality of multiply factors to obtain a plurality of intermediate results, and obtaining the results of a plurality of MAC operations in the plurality of PoM by selecting the intermediate results in accordance with a plurality of weights, bit-shifting and accumulating the selected intermediate results.
- the multiply factors may be selected based on the weights.
- the multiplication result of the weight with the input may be obtained by bit-shifting and accumulating the intermediate results.
- the multiplication result of the weight with the input may not be obtained by bit-shifting and accumulating the intermediate results.
- the method may further comprise selecting an intermediate result based on the difference between the corresponding multiply factor for the intermediate result and the omitted multiply factor.
- the multiplication result of the weight with the input may be obtained by bit-shifting and accumulating the omitted intermediate result corresponding to the omitted multiply factor.
- the method may further comprise minimizing the difference between the corresponding multiply factor for the intermediate result and the omitted multiply factor.
- the method may further comprise compensating the intermediate result by adding the corresponding error in accordance with the weight.
- the plurality of MAC operations may be performed by a neural network.
- the PrM may comprise a memory for storing the plurality of intermediate results.
- the PoM may comprise a multiplexer (MUX) for selecting the plurality of intermediate results.
- MUX: multiplexer
- the PoM may further comprise a shifter for bit-shifting the selected intermediate results, and an adder for accumulating the selected and bit-shifted intermediate results.
- the accelerator may comprise a Pre-multiplier (PrM) comprising a multiplier for multiplying an input with a plurality of multiply factors to obtain a plurality of intermediate results, and a plurality of Post-multipliers (PoM), each comprising a multiplexer (MUX) for selecting the intermediate results in accordance with a plurality of weights, a shifter for bit-shifting the selected intermediate results, and an adder for accumulating the selected and bit-shifted intermediate results.
- PrM: Pre-multiplier
- PoM: Post-multiplier
- the multiply factors may be selected based on the weights.
- the multiplication result of the weight with the input may be obtained by bit-shifting and accumulating the intermediate results.
- the multiplication result of the weight with the input may not be obtained by bit-shifting and accumulating the intermediate results, and the MUX may be configured to select an intermediate result based on the difference between the corresponding multiply factor for the intermediate result and the omitted multiply factor.
- the multiplication result of the weight with the input may be obtained by bit-shifting and accumulating the omitted intermediate result corresponding to the omitted multiply factor.
- the MUX may be configured to select the intermediate result by minimizing the difference between the corresponding multiply factor for the intermediate result and the omitted multiply factor.
- the PoM may further comprise a compensator for compensating the intermediate result by adding the corresponding error in accordance with the weight.
- the MAC may be configured to perform the plurality of MAC operations by a neural network.
- the PrM may comprise a memory for storing the plurality of intermediate results.
- a further aspect of the present disclosure presents an integrated circuit device, which may comprise the aforementioned MAC hardware accelerator.
- a further aspect of the present disclosure presents a processor, which may comprise the aforementioned MAC hardware accelerator.
- an optimized multiplier design framework with quantization and approximate computing for multiplications in NNs is provided.
- the inputs and intermediate values are quantized to fixed-point numbers, and an approximate multiplier is designed to perform multiplications more efficiently by sharing the intermediate results. If the approximate multiplier produces unacceptable errors, a compensation method is provided to mitigate the errors.
- Fig. 1 is an exemplary diagram showing the sharing based single-input multi-weights multiplier (SB-SIMWM) architecture, in accordance with one embodiment of the present invention.
- SB-SIMWM: single-input multi-weights multiplier
- Fig. 2 is an exemplary diagram showing the reduced approximation SB-SIMWM, in accordance with one embodiment of the present invention.
- Fig. 3 is an exemplary diagram showing the reduced approximation SB-SIMWM with compensation, in accordance with one embodiment of the present invention.
- MACs are the main operations in NNs, and among them multiplications are the most time- and power-consuming operations.
- quantization of inputs and weights can speed up the computation and reduce the power consumption and chip area as compared to floating-point computation units. Accordingly, the data flow in this design is quantized to fixed-point numbers.
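As a rough illustration of this fixed-point quantization step, the sketch below converts a real value to a signed fixed-point integer. The word lengths (8 total bits, 4 fractional) and the function name are illustrative assumptions, not values fixed by this disclosure.

```python
def quantize(x: float, frac_bits: int = 4, total_bits: int = 8) -> int:
    """Quantize a real value to a signed fixed-point integer.

    The 4 fractional / 8 total bits are illustrative choices only;
    the disclosure does not fix a particular word length.
    """
    scale = 1 << frac_bits
    q = round(x * scale)
    # Saturate to the representable signed range.
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, q))

# 0.75 is stored as 12 (0.75 * 2^4); out-of-range values saturate.
assert quantize(0.75) == 12
assert quantize(100.0) == 127
```

Once quantized, all downstream multiplications operate on small integers, which is what makes the intermediate-result sharing below possible.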
- multiplications of inputs and weights in NNs are implemented as matrix multiplication in the fully-connected layers.
- the computations of the convolution layer can be seen as matrix multiplications when expanded along the dimension of convolution kernels.
- Matrix multiplication can be treated as each fixed input being multiplied by a series of weights respectively. Every input in the same layer follows this same pattern.
- SB-SIMWM: sharing-based single-input multi-weights multiplier
- the SB-SIMWM 100 is composed of a Pre-multiplier (PrM) 110 and a number of Post-multipliers (PoM) 120.
- the PrM 110 has a multiplier 111 and a memory 112 for storing the intermediate results.
- a specially selected sequence of intermediate multiplication results is organized in the PrM 110, where normal multiplications are implemented.
- the PrM 110 performs the multiplications of a number of multiply factors with the input X to obtain the intermediate results. Note that all possible multiplication results should be covered.
- eight 4-bit intermediate results (covering 0-15 times the input in the decimal system) will be calculated in the PrM 110 by multiplying the input with the following multiply factors: {1, 3, 5, 7, 9, 11, 13, 15}; all other multiplication results can be obtained by bit-shifting those intermediate results.
- there are other ways to select the multiply factors; they can be selected through software optimization or, in the case of a neural network, through training.
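A behavioural sketch of this PrM/PoM split, assuming the odd multiply-factor set {1, 3, 5, 7, 9, 11, 13, 15} named above (function names are illustrative): the PrM computes each odd multiple of the input once, and a PoM can then realise any 4-bit weight with one selection plus one left shift, since every positive weight factors as an odd number times a power of two.

```python
FACTORS = [1, 3, 5, 7, 9, 11, 13, 15]

def prm(x: int) -> dict:
    """Pre-multiplier: compute each intermediate result f*x exactly once."""
    return {f: f * x for f in FACTORS}

def pom(weight: int, intermediates: dict) -> int:
    """Post-multiplier: realise weight*x by select + shift only."""
    if weight == 0:
        return 0
    shift = 0
    while weight % 2 == 0:        # strip trailing zero bits: w = odd << shift
        weight //= 2
        shift += 1
    return intermediates[weight] << shift

inter = prm(3)
assert pom(10, inter) == 30       # 1010b: select 101b * x, shift left by 1
assert all(pom(w, inter) == w * 3 for w in range(16))
```

The same `inter` table is shared by every PoM, which is where the resource saving comes from: one multiplier in the PrM serves many multiplications.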
- Each PoM 120 has a multiplexer (MUX) 121, a shifter 122, an adder 123, and an encoder 124.
- the encoder 124 is used to encode the weights into corresponding control signals for the MUX 121, the shifter 122 and the adder 123.
- the control signal for the MUX 121 chooses one intermediate result from the PrM.
- the control signal for the shifter 122 determines how to shift the bits of the intermediate result.
- the MUX 121 selects the needed results from the PrM 110. The multiplication result is then produced by shifting and accumulating the selected results one by one. For example, to obtain 1010X, 101X is selected from the intermediate result sequence and shifted one bit to the left. Based on the intermediate results in the PrM 110, all weights can be realized by combinations of accumulation and bit-shifting in the PoM 120.
- the PrM 110 may have the following intermediate results {1011X, 0001X, 0101X}, and the final multiplication results will be obtained by selecting, shifting, and accumulating these entries.
- the intermediate result 0001X will only be calculated once in the PrM 110 and then be selected and shared across different target results in the PoMs 120.
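The sharing and the shift-and-accumulate path can be checked numerically. The sketch below assumes the three-entry PrM {1011X, 0001X, 0101X} from the example above; the particular weight decompositions shown are illustrative choices, not taken from this disclosure.

```python
x = 3  # any input value
inter = {0b1011: 0b1011 * x, 0b0001: 0b0001 * x, 0b0101: 0b0101 * x}

# 1100b * x = 1011b * x + 0001b * x: one adder, no extra multiplier.
assert inter[0b1011] + inter[0b0001] == 0b1100 * x

# 10110b * x = (1011b * x) << 1: one shifter reuses the same entry.
assert inter[0b1011] << 1 == 0b10110 * x

# 1011b * x = 0001b * x + (0101b * x << 1): the entry 0001X is computed
# once in the PrM and shared by every PoM that selects it.
assert inter[0b0001] + (inter[0b0101] << 1) == 0b1011 * x
```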
- a reduced SB-SIMWM architecture is further provided in accordance with the embodiments of the present invention, in which some intermediate results in the PrM are taken out.
- the reduced architecture is shown in Fig. 2.
- some of the larger intermediate results are chosen to be taken out, as the smaller intermediate results can realize more multiplication results within the limited bit width.
- when an omitted intermediate result is needed, it is substituted by the available intermediate result that is closest to it.
- this reduced architecture enables a chip area reduction of up to 30% as compared with the original design.
- the chip area can decrease by about 25%.
- the reduced SB-SIMWM 200 is composed of a Pre-multiplier (PrM) 210 and a number of Post-multipliers (PoM) 220.
- the PrM 210 has a multiplier 211 and a memory 212 for storing the intermediate results.
- Each PoM 220 has a multiplexer (MUX) 221, a shifter 222, an adder 223, and an encoder 224.
- k components, such as the third component, in the PrM 210 are discarded, and the MUX 221 [n:1] is modified to MUX 221 [n-k:1].
- however, these discarded components may still be needed during computation.
- the resulting relative error is only about 1/11.
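A sketch of the substitution rule; the retained factor set {1, 3, 5, 7} and the tie-breaking choice are illustrative assumptions, not fixed by this disclosure. Under these assumptions, the omitted factor 11 (1011b) is approximated by 10 = 5 shifted left by one (1010b), giving the roughly 1/11 relative error mentioned above.

```python
RETAINED = [1, 3, 5, 7]            # illustrative reduced PrM: 9..15 omitted

def closest_realisable(target: int, max_shift: int = 3) -> int:
    """Nearest factor realisable as f << s from the retained set.

    Ties are broken toward the smaller candidate (an assumption).
    """
    candidates = {f << s for f in RETAINED for s in range(max_shift + 1)}
    return min(candidates, key=lambda c: (abs(c - target), c))

assert closest_realisable(11) == 10        # 1011b ~ 1010b, error 1/11
assert closest_realisable(13) == 12        # 1101b ~ 1100b = 3 << 2
assert closest_realisable(7) == 7          # retained factors stay exact
```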
- the approximate computing result can be compensated.
- the approximation strategy can be fixed at design time, so the errors of the approximated weights are known. Also, when a weight is used for computation, it can be detected whether this multiplication result needs to be approximated. It is therefore possible to add an extra structure that compensates for the error with little overhead.
- the compensation strategy can be determined by the designer.
- the compensation architecture is shown in Fig. 3.
- the reduced SB-SIMWM 300 with compensation is composed of a Pre-multiplier (PrM) 310 and a number of Post-multipliers (PoM) 320.
- the PrM 310 has a multiplier 311 and a memory 312 for storing the intermediate results.
- Each PoM 320 has a multiplexer (MUX) 321, a shifter 322, an adder 323, an encoder 324, and a compensator 325.
- the encoder 324 can be used to determine the amount of error compensation that needs to be added to the result.
- the normal approximate computation operates by selecting the intermediate result 14X.
- the related compensation structure takes the weight 15 as input, decodes it, and determines that this intermediate value requires compensation. A corresponding error (X) is then added to it.
- This corresponding error can be chosen based on past experience or through software optimization, such as training of the neural network.
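A minimal sketch of this compensation path, under the assumption (taken from the 15X example above) that the reduced PrM keeps 7X but omits 15X; the error table and the function names are illustrative.

```python
x = 6                                  # example input
intermediates = {7: 7 * x}             # reduced PrM: 15X omitted

# Design-time table known to the encoder: weight -> (factor, shift, error/x).
APPROX = {15: (7, 1, 1)}               # 15X ~ (7X << 1) = 14X, error = +X

def pom_with_compensation(weight: int) -> int:
    factor, shift, err = APPROX[weight]
    approx = intermediates[factor] << shift   # normal approximate path: 14X
    return approx + err * x                   # compensator restores the +X

assert pom_with_compensation(15) == 15 * x    # exact product recovered
```

Because the approximation strategy is fixed at design time, the error table is a constant, so the compensator reduces to a small adder rather than a second multiplier.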
Landscapes
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Complex Calculations (AREA)
Abstract
An optimized multiplier design framework with quantization and approximate computing for multiplications in neural networks is provided. In the neural network accelerators according to the present invention, the inputs and intermediate values are quantized to fixed-point numbers, and an approximate multiplier is configured to perform multiplications more efficiently by sharing the intermediate results. If the approximate multiplier produces unacceptable errors, a compensation method is provided to mitigate the errors.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/083445 WO2022198685A1 (fr) | 2021-03-26 | 2021-03-26 | Reduced approximation sharing-based single-input multi-weights multiplier |
| CN202180093015.5A CN116888575A (zh) | 2021-03-26 | 2021-03-26 | Reduced approximation sharing-based single-input multi-weights multiplier |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/083445 WO2022198685A1 (fr) | 2021-03-26 | 2021-03-26 | Reduced approximation sharing-based single-input multi-weights multiplier |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022198685A1 (fr) | 2022-09-29 |
Family
ID=83395096
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/083445 (Ceased) WO2022198685A1 (fr) | Reduced approximation sharing-based single-input multi-weights multiplier | 2021-03-26 | 2021-03-26 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN116888575A (fr) |
| WO (1) | WO2022198685A1 (fr) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120146122B (zh) * | 2025-02-26 | 2025-10-28 | Harbin Institute of Technology | Design method for a compute-storage fusion architecture for neural network accelerators |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106909970A (zh) * | 2017-01-12 | 2017-06-30 | 南京大学 | 一种基于近似计算的二值权重卷积神经网络硬件加速器计算模块 |
| CN107797962A (zh) * | 2017-10-17 | 2018-03-13 | 清华大学 | 基于神经网络的计算阵列 |
| CN107908389A (zh) * | 2017-11-21 | 2018-04-13 | 天津大学 | 小点数fft旋转因子复数乘法加速器 |
| KR20190005043A (ko) * | 2017-07-05 | 2019-01-15 | 울산과학기술원 | 연산 속도를 향상시킨 simd mac 유닛, 그 동작 방법 및 simd mac 유닛의 배열을 이용한 콘볼루션 신경망 가속기 |
| CN109543140A (zh) * | 2018-09-20 | 2019-03-29 | 中国科学院计算技术研究所 | 一种卷积神经网络加速器 |
2021
- 2021-03-26 CN CN202180093015.5A patent/CN116888575A/zh active Pending
- 2021-03-26 WO PCT/CN2021/083445 patent/WO2022198685A1/fr not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| CN116888575A (zh) | 2023-10-13 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21932303; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 202180093015.5; Country of ref document: CN |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21932303; Country of ref document: EP; Kind code of ref document: A1 |