
WO2022198685A1 - Reduced approximation sharing-based single-input multi-weights multiplier - Google Patents

Reduced approximation sharing-based single-input multi-weights multiplier

Info

Publication number
WO2022198685A1
WO2022198685A1 · PCT/CN2021/083445 · CN2021083445W
Authority
WO
WIPO (PCT)
Prior art keywords
mac
multiply
intermediate results
bit
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/083445
Other languages
English (en)
Inventor
Chaolin RAO
Jindong ZHOU
Yu Ma
Minye WU
Xin LOU
Pingqiang ZHOU
Jingyi Yu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ShanghaiTech University
Original Assignee
ShanghaiTech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ShanghaiTech University filed Critical ShanghaiTech University
Priority to PCT/CN2021/083445 priority Critical patent/WO2022198685A1/fr
Priority to CN202180093015.5A priority patent/CN116888575A/zh
Publication of WO2022198685A1 publication Critical patent/WO2022198685A1/fr
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only

Definitions

  • This invention relates generally to the field of computing devices, and more specifically, to a neural network accelerator.
  • NNs: neural networks
  • MAC: multiply-accumulate
  • Neural network accelerators are designed to accelerate given tasks in the neural network, as traditional computing units, such as CPUs and GPUs, are often not efficient in handling such tasks.
  • Resource consumption and computational complexity are the two most important metrics in accelerator design.
  • quantization is widely used, in which a small number of bits is used to represent the weights.
  • approximate computation is an effective technique to reduce the computational complexity of MAC operations due to the inherent noise immunity of NNs.
  • An object of the present invention is to provide a reduced approximation sharing-based single-input multi-weights multiplier, to reduce hardware resource consumption and speed up computation of neural networks. Additional features and advantages of this invention will become apparent from the following detailed descriptions.
  • the MAC hardware accelerator may comprise a Pre-multiplier (PrM) and a plurality of Post-multipliers (PoM) .
  • the method may comprise multiplying in the PrM an input with a plurality of multiply factors to obtain a plurality of intermediate results, and obtaining the results of a plurality of MAC operations in the plurality of PoM by selecting the intermediate results in accordance with a plurality of weights, bit-shifting and accumulating the selected intermediate results.
  • the multiply factors may be selected based on the weights.
  • the multiplication result of the weight with the input may be obtained by bit-shifting and accumulating the intermediate results.
  • the multiplication result of the weight with the input may not be obtained by bit-shifting and accumulating the intermediate results.
  • the method may further comprise selecting an intermediate result based on the difference between the corresponding multiply factor for the intermediate result and the omitted multiply factor.
  • the multiplication result of the weight with the input may be obtained by bit-shifting and accumulating the omitted intermediate result corresponding to the omitted multiply factor.
  • the method may further comprise minimizing the difference between the corresponding multiply factor for the intermediate result and the omitted multiply factor.
  • the method may further comprise compensating the intermediate result by adding the corresponding error in accordance with the weight.
  • the plurality of MAC operations may be performed by a neural network.
  • the PrM may comprise a memory for storing the plurality of intermediate results.
  • the PoM may comprise a multiplexer (MUX) for selecting the plurality of intermediate results.
  • MUX: multiplexer
  • the PoM may further comprise a shifter for bit-shifting the selected intermediate results, and an adder for accumulating the selected and bit-shifted intermediate results.
  • the accelerator may comprise a Pre-multiplier (PrM) comprising a multiplier for multiplying an input with a plurality of multiply factors to obtain a plurality of intermediate results, and a plurality of Post-multipliers (PoM) , each comprising a multiplexer (MUX) for selecting the intermediate results in accordance with a plurality of weights, a shifter for bit-shifting the selected intermediate results, and an adder for accumulating the selected and bit-shifted intermediate results.
  • PrM: Pre-multiplier
  • PoM: Post-multiplier
  • the multiply factors may be selected based on the weights.
  • the multiplication result of the weight with the input may be obtained by bit-shifting and accumulating the intermediate results.
  • the multiplication result of the weight with the input may not be obtained by bit-shifting and accumulating the intermediate results, and the MUX may be configured to select an intermediate result based on the difference between the corresponding multiply factor for the intermediate result and the omitted multiply factor.
  • the multiplication result of the weight with the input may be obtained by bit-shifting and accumulating the omitted intermediate result corresponding to the omitted multiply factor.
  • the MUX may be configured to select the intermediate result by minimizing the difference between the corresponding multiply factor for the intermediate result and the omitted multiply factor.
  • the PoM may further comprise a compensator for compensating the intermediate result by adding the corresponding error in accordance with the weight.
  • the MAC may be configured to perform the plurality of MAC operations by a neural network.
  • the PrM may comprise a memory for storing the plurality of intermediate results.
  • a further aspect of the present disclosure presents an integrated circuit device, which may comprise the aforementioned MAC hardware accelerator.
  • a further aspect of the present disclosure presents a processor, which may comprise the aforementioned MAC hardware accelerator.
  • an optimized multiplier design framework with quantization and approximate computing for multiplications in NNs is provided.
  • the inputs and intermediate values are quantized to fixed-point numbers, and an approximate multiplier is designed to perform multiplications more efficiently by sharing the intermediate results. If the approximate multiplier produces unacceptable errors, a compensation method is provided to mitigate the errors.
  • Fig. 1 is an exemplary diagram showing the sharing based single-input multi-weights multiplier (SB-SIMWM) architecture, in accordance with one embodiment of the present invention.
  • SB-SIMWM: single-input multi-weights multiplier
  • Fig. 2 is an exemplary diagram showing the reduced approximation SB-SIMWM, in accordance with one embodiment of the present invention.
  • Fig. 3 is an exemplary diagram showing the reduced approximation SB-SIMWM with compensation, in accordance with one embodiment of the present invention.
  • MACs are the main operations, of which multiplications are the most time- and power-consuming.
  • quantization of inputs and weights can speed up the computation and reduce the power consumption and chip area as compared to floating point computation units. Accordingly, the data flow in this design is quantized to fixed-point numbers.
  • multiplications of inputs and weights in NNs are implemented as matrix multiplication in the fully-connected layers.
  • the computations of the convolution layer can be seen as matrix multiplications when expanded along the dimension of convolution kernels.
  • Matrix multiplication can be treated as each fixed input being multiplied by a series of weights, and every input in the same layer follows a similar pattern.
  • SB-SIMWM: sharing-based single-input multi-weights multiplier
  • the SB-SIMWM 100 is composed of a Pre-multiplier (PrM) 110 and a number of Post-multipliers (PoM) 120.
  • the PrM 110 has a multiplier 111 and a memory 112 for storing the intermediate results.
  • a specially selected sequence of intermediate multiplication results is organized in the PrM 110, where normal multiplications are implemented.
  • the PrM 110 performs the multiplications of a number of multiply factors with the input X to obtain the intermediate results. Note that all possible multiplication results should be covered.
  • 8 4-bit intermediate results (covering 0 to 15 times the input) will be calculated in the PrM 110 by multiplying the input with the following multiply factors: {1, 3, 5, 7, 9, 11, 13, 15}; all other multiplication results can then be obtained by bit-shifting those intermediate results.
  • There are other ways to select the multiply factors; they can be selected through software optimization or, in the case of a neural network, through training.
  • Each PoM 120 has a multiplexer (MUX) 121, a shifter 122, an adder 123, and an encoder 124.
  • the encoder 124 is used to encode the weights into corresponding control signals for the MUX 121, the shifter 122 and the adder 123.
  • the control signal for the MUX 121 chooses one intermediate result from the PrM.
  • the control signal for the shifter 122 determines how to shift the bits of the intermediate result.
  • the MUX 121 selects the needed results from the PrM 110. The multiplication result is then produced by shifting and accumulating the selected results one by one. For example, to obtain 1010X, 101X is selected from the intermediate result sequence and shifted one bit to the left. Based on the intermediate results in the PrM 110, all weights can be realized by combinations of accumulation and bit-shifting in the PoM 120 (a software sketch of this data flow is given at the end of this section).
  • the PrM 110 may have the following intermediate results {1011X, 0001X, 0101X}, and the final multiplication results will be combinations of these; for example, 10110001X can be obtained as (1011X << 4) + 0001X.
  • the intermediate result (0001) X will only be calculated once in PrM 110 and then be selected and shared in different target results in PoMs 120.
  • a reduced SB-SIMWM architecture is further provided in accordance with the embodiments of the present invention, in which some intermediate results in the PrM are taken out.
  • the reduced architecture is shown in Fig. 2.
  • some of the larger intermediate results are chosen to be taken out, as the smaller intermediate results can realize more multiplication results under limited-bit multiplication.
  • when an omitted intermediate result is needed, it is substituted by the available intermediate result that is closest to it.
  • this reduced architecture enables a chip area reduction of up to 30% as compared with the original design.
  • the chip area can decrease by about 25%.
  • the reduced SB-SIMWM 200 is composed of a Pre-multiplier (PrM) 210 and a number of Post-multipliers (PoM) 220.
  • the PrM 210 has a multiplier 211 and a memory 212 for storing the intermediate results.
  • Each PoM 220 has a multiplexer (MUX) 221, a shifter 222, an adder 223, and an encoder 224.
  • MUX multiplexer
  • k components, such as the third component, in the PrM 210 are discarded, and the MUX 221 [n:1] is modified to a MUX 221 [n-k:1].
  • in some cases, these discarded components may be needed.
  • the error is only about 1/11.
  • the approximate computing result can be compensated.
  • the approximation strategy can be fixed at design time, so the errors of the approximated weights are known. Also, when a weight is used for computation, it can be detected whether this multiplication result needs to be approximated. So, it is possible to add an extra structure to compensate the error with little overhead.
  • the compensation strategy can be determined by the designer.
  • the compensation architecture is shown in Fig. 3.
  • the reduced SB-SIMWM 300 with compensation is composed of a Pre-multiplier (PrM) 310 and a number of Post-multipliers (PoM) 320.
  • the PrM 310 has a multiplier 311 and a memory 312 for storing the intermediate results.
  • Each PoM 320 has a multiplexer (MUX) 321, a shifter 322, an adder 323, an encoder 324, and a compensator 325.
  • the encoder 324 can be used to determine the amount of error compensation that needs to be added to the result.
  • for example, the normal approximate computation operates to select the intermediate result 14X.
  • the related compensation structure takes the weight 15 as input, decodes it, and determines that this intermediate value requires compensation; a corresponding error (X) is then added to it.
  • This corresponding error can be chosen based on past experience or software optimization such as training of the neural network.
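
The following Python sketch (referred to above) models the SB-SIMWM data flow in software: the PrM multiplies the input once by each multiply factor, and each PoM reconstructs weight·X by selecting, bit-shifting, and accumulating the shared intermediate results, optionally compensating the approximation error of the reduced architecture. The factor set {1, 3, 5, 7, 9, 11} (with 13 and 15 omitted), the 4-bit chunking of wide weights, the nearest-reachable-value substitution rule, and names such as PreMultiplier and PostMultiplier are illustrative assumptions for this sketch, not the exact hardware described in the embodiments.

```python
def odd_part_and_shift(w):
    """Decompose w > 0 as w = odd << shift."""
    shift = 0
    while w % 2 == 0:
        w //= 2
        shift += 1
    return w, shift


class PreMultiplier:
    """PrM: multiplies the input once by every multiply factor and stores the
    intermediate results so that many PoMs can share them."""

    def __init__(self, factors):
        self.factors = sorted(factors)
        self.table = {}

    def load_input(self, x):
        # One real multiplication per factor; everything downstream is select/shift/add.
        self.table = {f: f * x for f in self.factors}


class PostMultiplier:
    """PoM: produces weight * x from the shared intermediate results."""

    def __init__(self, prm, compensate=True):
        self.prm = prm
        self.compensate = compensate

    def _select(self, nibble, x):
        """Return (selected value, approximation error) for one 4-bit weight chunk."""
        if nibble == 0:
            return 0, 0
        odd, shift = odd_part_and_shift(nibble)
        if odd in self.prm.table:                  # factor available: exact result
            return self.prm.table[odd] << shift, 0
        # Reduced architecture: the factor was omitted, so substitute the closest
        # value reachable as (available factor) << s within 4 bits.
        best_val, best_err = 0, None
        for f in self.prm.factors:
            s = 0
            while (f << s) <= 15:
                err = (nibble - (f << s)) * x      # signed error of this candidate
                if best_err is None or abs(err) < abs(best_err):
                    best_val, best_err = self.prm.table[f] << s, err
                s += 1
        return best_val, best_err

    def multiply(self, weight, x):
        """Compute weight * x by combining 4-bit chunks of the weight."""
        result, shift = 0, 0
        while weight > 0:
            val, err = self._select(weight & 0xF, x)
            if self.compensate:
                val += err                         # compensator adds the known error
            result += val << shift                 # shifter + adder
            weight >>= 4
            shift += 4
        return result
```

A short usage example under the same assumptions: with the factors 13 and 15 omitted, a weight of 15 is approximated by the intermediate result 14X (error X), and the compensated PoM recovers the exact product.

```python
x = 37
prm = PreMultiplier(factors=[1, 3, 5, 7, 9, 11])      # reduced PrM: 13 and 15 omitted
prm.load_input(x)

approx = PostMultiplier(prm, compensate=False)
exact = PostMultiplier(prm, compensate=True)

print(approx.multiply(15, x), 15 * x)                 # 518 555  (selects 14X, error X)
print(exact.multiply(0b10110001, x), 0b10110001 * x)  # 6549 6549 (error compensated)
```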

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

An optimized multiplier design framework with quantization and approximate computing for multiplications in neural networks is provided. In the neural network accelerators according to the present invention, the inputs and intermediate values are quantized to fixed-point numbers, and an approximate multiplier is configured to perform multiplications more efficiently by sharing the intermediate results. If the approximate multiplier produces unacceptable errors, a compensation method is provided to mitigate the errors.
PCT/CN2021/083445 2021-03-26 2021-03-26 Reduced approximation sharing-based single-input multi-weights multiplier Ceased WO2022198685A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/083445 WO2022198685A1 (fr) 2021-03-26 2021-03-26 Reduced approximation sharing-based single-input multi-weights multiplier
CN202180093015.5A CN116888575A (zh) 2021-03-26 2021-03-26 Reduced approximation sharing-based single-input multi-weights multiplier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/083445 WO2022198685A1 (fr) 2021-03-26 2021-03-26 Reduced approximation sharing-based single-input multi-weights multiplier

Publications (1)

Publication Number Publication Date
WO2022198685A1 true WO2022198685A1 (fr) 2022-09-29

Family

ID=83395096

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083445 Ceased WO2022198685A1 (fr) 2021-03-26 2021-03-26 Reduced approximation sharing-based single-input multi-weights multiplier

Country Status (2)

Country Link
CN (1) CN116888575A (fr)
WO (1) WO2022198685A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120146122B (zh) * 2025-02-26 2025-10-28 Harbin Institute of Technology Design method for a novel compute-storage fusion architecture for neural network accelerators


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909970A (zh) * 2017-01-12 2017-06-30 Nanjing University Computing module for a binary-weight convolutional neural network hardware accelerator based on approximate computing
KR20190005043A (ko) * 2017-07-05 2019-01-15 Ulsan National Institute of Science and Technology SIMD MAC unit with improved operation speed, operating method thereof, and convolutional neural network accelerator using an array of SIMD MAC units
CN107797962A (zh) * 2017-10-17 2018-03-13 Tsinghua University Neural network-based computing array
CN107908389A (zh) * 2017-11-21 2018-04-13 Tianjin University Complex multiplication accelerator for twiddle factors of small-point FFTs
CN109543140A (zh) * 2018-09-20 2019-03-29 Institute of Computing Technology, Chinese Academy of Sciences Convolutional neural network accelerator

Also Published As

Publication number Publication date
CN116888575A (zh) 2023-10-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21932303

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180093015.5

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21932303

Country of ref document: EP

Kind code of ref document: A1