CN118312136A - Computing circuits and AI accelerators - Google Patents
- Publication number
- CN118312136A CN118312136A CN202410444916.XA CN202410444916A CN118312136A CN 118312136 A CN118312136 A CN 118312136A CN 202410444916 A CN202410444916 A CN 202410444916A CN 118312136 A CN118312136 A CN 118312136A
- Authority
- CN
- China
- Prior art keywords
- mantissa
- exponent
- point numbers
- floating point
- term
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
- G06F7/575—Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/556—Logarithmic or exponential functions
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
Abstract
Embodiments of the present application provide a computing circuit and an artificial-intelligence accelerator, relating to the field of integrated-circuit technology. The computing circuit receives a plurality of first block floating-point numbers, operates on them, and obtains a plurality of second block floating-point numbers. It comprises a first computing unit, a conversion unit, and a second computing unit. The first computing unit receives the first block floating-point numbers and operates on their mantissa terms to obtain a plurality of first intermediate floating-point numbers; the conversion unit normalizes the first intermediate floating-point numbers into a plurality of normalized mantissa terms; the second computing unit operates on the normalized mantissa terms to obtain a plurality of second intermediate floating-point numbers, and normalizes these into the second block floating-point numbers based on the normalized mantissa terms. The circuit reduces logic complexity and power consumption, effectively lowers data-storage requirements, and addresses the shortage of storage space and bandwidth in artificial-intelligence workloads.
Description
Technical Field
The present application relates to the field of integrated-circuit technology, and in particular to a computing circuit and an artificial-intelligence accelerator.
Background
An AI (Artificial Intelligence) accelerator is specially designed hardware, or a computer system, whose main function is to accelerate artificial-intelligence applications, in particular artificial neural networks, machine learning, machine vision, and other data-intensive or sensor-driven tasks. Artificial-intelligence algorithms involve enormous amounts of computation and storage. For example, the normalized exponential function (the Softmax function) is one of the most common operators in AI models. It compresses an arbitrary K-dimensional real vector into another K-dimensional real vector in which every element lies in the interval (0, 1) and all elements sum to 1. In probability theory, the output of the normalized exponential function can be used to represent a categorical probability distribution. The normalized exponential function is defined as:

sigma(z)_i = e^{z_i} / (sum over j = 1..K of e^{z_j}),  for i = 1, ..., K
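The definition above can be sketched in a few lines of Python. `softmax` here is a plain software reference for the operator, not the hardware circuit described later in this application:

```python
import math

def softmax(z):
    """Normalized exponential: maps a K-vector to a probability vector."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
# every element lies in (0, 1) and the elements sum to 1
```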
However, commonly used circuits for the normalized exponential function take floating-point numbers at the circuit boundary. Computing the function involves a large number of exponentiation and division operations, and traditional designs use floating-point computation units throughout, i.e., floating-point numbers serve as both the input and the output of the circuit. This results in large area, high power consumption, and high storage and bandwidth requirements, leaving artificial-intelligence workloads short of computing power, storage space, and bandwidth.
Summary of the Invention
In order to solve, or at least partially solve, the above technical problems, embodiments of the present application provide a computing circuit and an artificial-intelligence accelerator.

In a first aspect, an embodiment of the present application provides a computing circuit. The computing circuit receives a plurality of first block floating-point numbers, operates on them, and obtains a plurality of second block floating-point numbers; the first block floating-point numbers share a first exponent term, and the second block floating-point numbers share a second exponent term.

The computing circuit comprises a first computing unit, a conversion unit, and a second computing unit; the first computing unit is coupled to the conversion unit, and the conversion unit is coupled to the second computing unit.

The first computing unit receives the first block floating-point numbers and operates on their mantissa terms to obtain a plurality of first intermediate floating-point numbers.

The conversion unit receives the first intermediate floating-point numbers and normalizes them into a plurality of normalized mantissa terms.

The second computing unit operates on the normalized mantissa terms to obtain a plurality of second intermediate floating-point numbers, and, based on the normalized mantissa terms, normalizes the second intermediate floating-point numbers into the second block floating-point numbers.
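The three-stage dataflow above can be modeled in software. The sketch below assumes inputs z_i = 2^p · d_i and outputs h_i = 2^q · m_i as defined later in the description; the function name is illustrative, and the real design performs each stage in fixed-point hardware rather than with Python floats:

```python
import math

def block_softmax(p, mantissas):
    """Software model of the circuit: softmax over a block-float vector.

    Input:  shared exponent p and mantissa terms d_i (z_i = 2**p * d_i).
    Output: shared exponent q and mantissa terms m_i (h_i = 2**q * m_i).
    """
    # First computing unit: subtract the max mantissa, then exponentiate.
    d_max = max(mantissas)
    inter = [math.exp((2 ** p) * (d - d_max)) for d in mantissas]
    # Conversion unit: align all intermediates to one shared exponent.
    parts = [math.frexp(v) for v in inter]
    e3 = max(e for _, e in parts)
    norm = [m / 2 ** (e3 - e) for m, e in parts]
    # Second computing unit: sum, reciprocal, multiply, renormalize.
    inv = 1.0 / sum(norm)
    prod = [m * inv for m in norm]
    parts2 = [math.frexp(v) for v in prod]
    q = max(e for _, e in parts2)
    return q, [m / 2 ** (q - e) for m, e in parts2]
```

With p = 0 the result matches an ordinary softmax: reconstructing m_i · 2^q recovers the probabilities.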
In an optional embodiment, the first computing unit comprises a maximum-value extraction subunit, a vector subtraction subunit, and a vector exponent calculation subunit.

The maximum-value extraction subunit receives the first block floating-point numbers, obtains the largest mantissa term among their mantissa terms, and takes it as the first mantissa term.

The vector subtraction subunit computes the difference between the mantissa term of each first block floating-point number and the first mantissa term, and takes each difference as a second mantissa term.

The vector exponent calculation subunit constructs a plurality of third block floating-point numbers from the first exponent term and the second mantissa terms, and performs an exponential operation on them to obtain the first intermediate floating-point numbers.
In an optional embodiment, the maximum-value extraction subunit comprises a comparator array, which compares the mantissa terms of the first block floating-point numbers to obtain the largest mantissa term among them.
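A comparator array of this kind is typically arranged as a pairwise reduction tree. The sketch below is a generic model of that structure; the patent does not fix a specific topology:

```python
def tree_max(values):
    """Pairwise comparator-tree reduction: ~log2(N) levels of 2-input max."""
    vals = list(values)
    while len(vals) > 1:
        # one level of the tree: compare adjacent pairs
        nxt = [max(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:  # an odd element passes through to the next level
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]
```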
In an optional embodiment, the vector subtraction subunit comprises a subtractor array, which computes the difference between the mantissa term of each first block floating-point number and the first mantissa term.
In an optional embodiment, the vector exponent calculation subunit comprises an exponent calculation module. The module concatenates the first exponent term with a second mantissa term and looks up the exponential-function value corresponding to the concatenation in a table; that value is the first intermediate floating-point number. Alternatively, the exponent calculation module obtains the first intermediate floating-point number by a fitting (approximation) method.
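A table-lookup exponential can be sketched as below. The input range [-4, 0] and the 6-bit table size are illustrative assumptions (after max-subtraction the exponential arguments are non-positive); the patent does not specify the table geometry:

```python
import math

# Hypothetical 6-bit lookup table for e^x over x in [-4, 0]; in hardware the
# concatenated exponent/mantissa bits would index the table directly.
TABLE_BITS = 6
X_MIN, X_MAX = -4.0, 0.0
STEP = (X_MAX - X_MIN) / ((1 << TABLE_BITS) - 1)
EXP_LUT = [math.exp(X_MIN + i * STEP) for i in range(1 << TABLE_BITS)]

def exp_lut(x):
    """Table-lookup approximation of e^x for x in [X_MIN, X_MAX]."""
    idx = round((x - X_MIN) / STEP)
    return EXP_LUT[min(max(idx, 0), (1 << TABLE_BITS) - 1)]
```

The accuracy/area trade-off is set by TABLE_BITS: each extra bit halves the worst-case step error but doubles the table size.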
In an optional embodiment, the conversion unit comprises a shared-exponent extraction subunit and a normalization subunit.

The shared-exponent extraction subunit receives the first intermediate floating-point numbers, obtains the largest exponent term among their exponent terms, and takes it as the third exponent term.

The normalization subunit normalizes the first intermediate floating-point numbers into the normalized mantissa terms according to the third exponent term.

In an optional embodiment, the normalization subunit comprises a plurality of basic circuits, each comprising an exponent extractor, a mantissa extractor, a subtractor, and a shift circuit.

The exponent extractor obtains the exponent term of a first intermediate floating-point number.

The mantissa extractor obtains the mantissa term of the first intermediate floating-point number.

The subtractor obtains the difference between the exponent term of the first intermediate floating-point number and the third exponent term.

The shift circuit shifts the mantissa term of the first intermediate floating-point number into alignment according to that difference, yielding the normalized mantissa term.
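The extract-subtract-shift path above amounts to aligning every value in the group to its largest exponent. A minimal model, using Python's `math.frexp` in place of the hardware exponent/mantissa extractors:

```python
import math

def normalize_to_shared_exponent(values):
    """Align a group of floats to one shared (third) exponent term by
    right-shifting the mantissas of the smaller values."""
    parts = [math.frexp(v) for v in values]   # (mantissa, exponent) pairs
    e_shared = max(e for _, e in parts)       # shared-exponent extraction
    # shift circuit: each mantissa moves right by the exponent gap
    mants = [m / 2 ** (e_shared - e) for m, e in parts]
    return e_shared, mants
```

Reconstructing m · 2^e_shared recovers each original value exactly; a hardware implementation would additionally truncate the shifted mantissas to a fixed width, which this float model does not apply.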
In an optional embodiment, the second computing unit comprises a summation subunit and a multiplication subunit.

The summation subunit computes the sum of the normalized mantissa terms and takes the reciprocal of that sum as an intermediate value.

The multiplication subunit computes the product of the mantissa term of the intermediate value with each normalized mantissa term (each product being a third mantissa term), normalizes and aligns the third mantissa terms with the exponent term of the intermediate value, and obtains target mantissa terms and a target exponent term. The target mantissa terms are the mantissa terms of the second block floating-point numbers, and the target exponent term is the second exponent term shared by the second block floating-point numbers.
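A software model of the summation and multiplication subunits, taking the normalized mantissa terms from the conversion unit as input (illustrative, using full-precision Python floats rather than fixed-point hardware):

```python
import math

def second_compute_unit(norm_mantissas):
    """Sum the normalized mantissas, take the reciprocal as the intermediate
    value, multiply it back in, and renormalize to one shared exponent."""
    inv = 1.0 / sum(norm_mantissas)                # summation subunit
    products = [m * inv for m in norm_mantissas]   # third mantissa terms
    # align the products to a shared target (second) exponent term q
    parts = [math.frexp(p) for p in products]
    q = max(e for _, e in parts)
    return q, [m / 2 ** (q - e) for m, e in parts]
```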
In an optional embodiment, the multiplication subunit comprises a multiplier and the normalization subunit.

The multiplier computes the product of the mantissa term of the intermediate value with each normalized mantissa term, obtaining the third mantissa terms.

The normalization subunit normalizes and aligns the third mantissa terms with the exponent term of the intermediate value, obtaining the target mantissa terms and the target exponent term.
In a second aspect, an embodiment of the present application provides an artificial-intelligence accelerator comprising the computing circuit of any embodiment of the present application.

The technical solutions provided by the embodiments of the present application bring at least the following beneficial effects:

The computing circuit provided by the embodiments receives a plurality of first block floating-point numbers, operates on them, and outputs a plurality of second block floating-point numbers. The circuit comprises a first computing unit, a conversion unit, and a second computing unit. The first computing unit receives the first block floating-point numbers and operates on their mantissa terms to obtain a plurality of first intermediate floating-point numbers; the conversion unit normalizes the first intermediate floating-point numbers into a plurality of normalized mantissa terms; the second computing unit operates on the normalized mantissa terms to obtain a plurality of second intermediate floating-point numbers, and normalizes these into the second block floating-point numbers based on the normalized mantissa terms.

The circuit uses block floating-point numbers as its inputs and outputs. Compared with traditional circuits whose inputs and outputs are ordinary floating-point numbers, its logic complexity and power consumption are greatly reduced, and data-storage requirements drop substantially, addressing the shortage of storage space and bandwidth in AI workloads such as large models. Moreover, because block floating-point numbers share an exponent, the circuit supports parallel computation, effectively raising processing throughput, and it is readily compatible with other block floating-point operators in AI model algorithms (such as matrix multiplication) without numerical conversion.
BRIEF DESCRIPTION OF THE DRAWINGS
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required by the embodiments or the prior-art description are briefly introduced below.
FIG. 1(a) is a schematic diagram of the structure of a single-precision floating-point number;
FIG. 1(b) is a schematic diagram of the structure of Google BFloat16;
FIG. 1(c) is a schematic diagram of the structure of a block floating-point number;
FIG. 2 is a schematic diagram of the structure of a computing circuit according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the structure of the first computing unit in the computing circuit of an embodiment of the present application;
FIG. 4 is a schematic diagram of the structure of the maximum-value extraction subunit in the computing circuit of an embodiment of the present application;
FIG. 5 is a schematic diagram of the structure of the vector subtraction subunit in the computing circuit of an embodiment of the present application;
FIG. 6 is a schematic diagram of the structure of the vector exponent calculation subunit in the computing circuit of an embodiment of the present application;
FIG. 7 is a schematic diagram of the structure of the conversion unit in the computing circuit of an embodiment of the present application;
FIG. 8 is a schematic diagram of the structure of the shared-exponent extraction subunit in the computing circuit of an embodiment of the present application;
FIG. 9 is a schematic diagram of the structure of the normalization subunit in the computing circuit of an embodiment of the present application;
FIG. 10 is a schematic diagram of the structure of the second computing unit in the computing circuit of an embodiment of the present application;
FIG. 11 is a schematic diagram of the structure of the summation subunit in the computing circuit of an embodiment of the present application;
FIG. 12 is a schematic diagram of the structure of the multiplication subunit in the computing circuit of an embodiment of the present application;
FIG. 13 is a schematic diagram of the data flow of the computing circuit of an embodiment of the present application.
DETAILED DESCRIPTION
Exemplary embodiments of the present application are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted below.

The terms "first", "second", and so on in the specification and claims are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application can be practiced in orders other than those illustrated or described here. Objects distinguished by "first", "second", etc. are generally of one type, and their number is not limited; for example, there may be one first object or several. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects it joins.

For ease of understanding, the technical terms relevant to this application are explained first.

A floating-point number expresses the value of a real number in a special encoding format: an integer or fixed-point number (the mantissa) multiplied by an integer power of some base (usually 2 in computers). In computer systems, traditional floating-point definitions such as float32 (32-bit floating point) and float64 (64-bit floating point) (see the IEEE-754 binary floating-point arithmetic standard) consist of three parts: a sign bit, an exponent term, and a mantissa term. A fixed-point number is a representation used in computers in which the position of the radix point of the numbers involved in an operation is fixed; fixed-point numbers include fixed-point integers and fixed-point fractions. Compared with fixed-point numbers, floating-point numbers offer higher precision and a wider dynamic range.

Block floating point is an innovative floating-point format. In recent years, thanks to its small storage footprint and relatively high precision, it has become a growing focus in accelerating AI model inference, with especially large gains for bandwidth- and storage-bound workloads such as large models. FIG. 1(a), FIG. 1(b), and FIG. 1(c) respectively show the structures of single-precision floating point (float32), Google BFloat16 (Google Brain Floating Point, a 16-bit floating-point format whose main idea is to keep the same dynamic range as standard IEEE FP32 at lower precision), and block floating point. In these figures, a0, a1, a2, and an denote different floating-point numbers; the purple boxes denote the sign bit, the blue boxes the exponent term, and the yellow boxes the mantissa term. As shown in FIG. 1(c), a block floating-point group consists of a set of sign bits and mantissa terms, while all data in the group share a single exponent term.
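The shared-exponent layout of FIG. 1(c) can be illustrated with a small encoder/decoder pair. The 8-bit mantissa width and round-to-nearest quantization below are illustrative assumptions, not the patent's bit format:

```python
import math

def to_block_float(values, mant_bits=8):
    """Encode a group of floats as one shared exponent plus signed
    fixed-point mantissas (a sketch of the FIG. 1(c) layout)."""
    shared_e = max(math.frexp(v)[1] for v in values)
    scale = 2 ** (mant_bits - 1)
    return shared_e, [round(v / 2 ** shared_e * scale) for v in values]

def from_block_float(shared_e, mants, mant_bits=8):
    """Decode: mantissa / 2^(mant_bits - 1) * 2^shared_e."""
    scale = 2 ** (mant_bits - 1)
    return [m / scale * 2 ** shared_e for m in mants]
```

Storing one exponent per group instead of one per element is where the memory saving comes from; values much smaller than the group maximum lose precision, which is the format's main trade-off.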
FIG. 2 is a schematic diagram of the structure of a computing circuit according to an embodiment of the present application. The circuit receives a plurality of first block floating-point numbers, written z_i = 2^p · d_i, where p and d_i are fixed-point values, p is the shared exponent term (the first exponent term) of the first block floating-point numbers z_i, d_i is the mantissa term of each first block floating-point number, and i is a positive integer greater than 1, for example 15.

The computing circuit operates on the first block floating-point numbers to obtain a plurality of second block floating-point numbers, written h_i = 2^q · m_i, where q and m_i are fixed-point numbers, q is the shared exponent term (the second exponent term) of the second block floating-point numbers h_i, and m_i is the mantissa term of each second block floating-point number.
The computing circuit of the embodiments of the present application can be applied to an artificial-intelligence accelerator, for example to improve the computational efficiency of probability-based multi-class methods such as multinomial logistic regression, multinomial linear discriminant analysis, and deep-learning algorithms, to reduce the accelerator's power consumption and data-storage requirements, and to save computing and storage resources. As an optional example, in the large-model Transformer architecture the normalized exponential function is widely used in the multi-head attention layer, as in the scaled dot-product attention formula

Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V,

so the computing circuit can be applied in an artificial-intelligence accelerator to speed up the multi-head attention layer.
As shown in FIG. 2, the computing circuit comprises a first computing unit H1, a conversion unit H2, and a second computing unit H3, where H1 is coupled to H2 and H2 is coupled to H3. In the computing circuit provided by the embodiments, H1, H2, and H3 are all hardware circuit units built from hardware circuitry. Their internal construction is not limited here: any hardware circuit that realizes the corresponding function may be used.

The first computing unit H1 receives the first block floating-point numbers and operates on their mantissa terms to obtain the first intermediate floating-point numbers.
The goal of the computing circuit of the embodiments of the present application is to compute softmax(z_i) = e^{z_i} / Σ_{k=1}^{K} e^{z_k}, where softmax(z_i) denotes the result corresponding to the i-th first block floating-point number z_i and K is the number of first block floating-point numbers. To prevent e^{z_i} from overflowing, the first computing unit applies a mathematically equivalent transformation: before the exponential calculation, the mantissa term of each first block floating-point number is reduced by the maximum mantissa term. Substituting z_i = 2^p·d_i yields: softmax(z_i) = e^{2^p·(d_i − d_max)} / Σ_{k=1}^{K} e^{2^p·(d_k − d_max)}.
Therefore, the first computing unit H1 receives the plurality of first block floating-point numbers, determines the maximum value among their mantissa terms, computes the difference between each mantissa term and that maximum, uses the difference as an exponent-argument adjustment, performs the exponential calculation based on the first exponent term shared by the first block floating-point numbers together with this adjustment, and obtains a plurality of first intermediate floating-point numbers.
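The equivalence that justifies subtracting the maximum mantissa can be checked numerically. The sketch below is an illustration in regular floating point, not the fixed-point hardware datapath; the function names are hypothetical:

```python
import math

def softmax_direct(p, d):
    # naive softmax over a block of numbers z_i = 2**p * d_i
    z = [(2 ** p) * di for di in d]
    e = [math.exp(zi) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def softmax_shifted(p, d):
    # subtract the maximum mantissa first: d'_i = d_i - d_max,
    # which keeps every exponential argument <= 0 and avoids overflow
    d_max = max(d)
    e = [math.exp((2 ** p) * (di - d_max)) for di in d]
    s = sum(e)
    return [ei / s for ei in e]
```

Both functions return the same probabilities, because the common factor e^{2^p·d_max} cancels between numerator and denominator.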
In an optional embodiment, FIG. 3 shows a schematic structural diagram of the first computing unit of an embodiment of the present application. As shown in FIG. 3, the first computing unit H1 includes a maximum-value extraction subunit Q1, a vector subtraction subunit Q2, and a vector exponential calculation subunit Q3.
The maximum-value extraction subunit Q1 receives the plurality of first block floating-point numbers z_i = 2^p·d_i, obtains the mantissa term d_i of each, compares the mantissa terms, and takes the largest mantissa term as the first mantissa term d_max.
In some optional implementation scenarios, the structure of the maximum-value extraction subunit Q1 is shown in FIG. 4. It consists of a comparator array built from multiple two-input comparators organized in a tree structure; each comparator outputs the larger of its two inputs. FIG. 4 shows a maximum-value extraction subunit Q1 built from fifteen comparators, which receives the mantissa terms {d_0, d_1, d_2, …, d_15} of sixteen first block floating-point numbers and outputs the maximum mantissa term d_max.
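The tree of two-input comparators can be sketched as a pairwise reduction; for 16 inputs it takes 8 + 4 + 2 + 1 = 15 comparisons, matching the fifteen comparators of FIG. 4. This is a behavioral model only (the function name is an assumption, and the input count is taken to be a power of two):

```python
def tree_max(values):
    # pairwise comparator tree: each level halves the number of candidates,
    # mirroring a tree of 2-input comparators (15 comparators for 16 inputs)
    level = list(values)
    while len(level) > 1:
        level = [max(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```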
The vector subtraction subunit Q2 computes, for each first block floating-point number, the difference between its mantissa term d_i and the first mantissa term d_max, and takes the difference as the second mantissa term d′_i.
In some optional implementation scenarios, the structure of the vector subtraction subunit Q2 is shown in FIG. 5. It consists of a subtractor array (FIG. 5 shows a vector subtraction subunit built from sixteen subtractors). One input of each subtractor receives a mantissa term d_i (i = 0, 1, 2, …, 15), the other input receives the first mantissa term d_max, and the subtractor outputs the difference d′_i = d_i − d_max.
The vector exponential calculation subunit Q3 constructs, from the first exponent term p and the second mantissa terms d′_i, a plurality of third block floating-point numbers z′_i = 2^p·d′_i, performs the exponential operation on them, and obtains a plurality of first intermediate floating-point numbers x_i. The subunit Q3 performs the exponential calculation on its block floating-point input and outputs regular floating-point numbers. Optionally, Q3 can implement the exponential calculation by a fitting method or by a look-up table (Look-Up-Table, LUT).
In some optional implementation scenarios, the structure of the vector exponential calculation subunit Q3 is shown in FIG. 6. It includes an exponential calculation module LUT, which concatenates the fixed-point number p with the fixed-point number d′_i to form a table-lookup index and looks up the exponential function value corresponding to the concatenated result, thereby obtaining the floating-point result x_i of the exponential calculation. Assuming the first exponent term p is 4-bit data and the second mantissa term d′_i is 8-bit data, their concatenation is 12-bit data. For example, if p = y_1y_2y_3y_4 and the second mantissa term d′_1 = y_5y_6y_7y_8y_9y_10y_11y_12, concatenating p with d′_1 gives y_1y_2y_3y_4y_5y_6y_7y_8y_9y_10y_11y_12, where y_1, y_2, …, y_12 denote the binary values (0 or 1) at the respective bit positions.
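The 12-bit index packing can be sketched as a shift-and-OR. This sketch treats p and d′_i as unsigned raw bit patterns and uses a toy table; the actual fixed-point encodings and the `lut_index` helper name are assumptions, since the text does not specify them:

```python
def lut_index(p_bits, d_bits, d_width=8):
    # concatenate the 4-bit shared-exponent bits with the 8-bit
    # mantissa-difference bits: p occupies the high bits of a 12-bit index
    return (p_bits << d_width) | d_bits

# a toy 4096-entry table; a real design would pre-compute exp() of the
# value that each 12-bit pattern encodes
table = [0.0] * 4096
table[lut_index(0b1010, 0b01100110)] = 1.5
```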
In other optional embodiments, the vector exponential calculation subunit Q3 may consist of a fitting unit, through which the floating-point result x_i of the exponential calculation is obtained.
In the embodiments of the present application, to prevent e^{z_i} from overflowing, the first computing unit applies a mathematically equivalent transformation: before the exponential calculation, the maximum mantissa term is subtracted from the mantissa term of each first block floating-point number. Substituting into the expression for z_i gives: softmax(z_i) = e^{2^p·(d_i − d_max)} / Σ_{k=1}^{K} e^{2^p·(d_k − d_max)}.
The conversion unit H2 receives the plurality of first intermediate floating-point numbers and normalizes them into a plurality of normalized mantissa terms.
After receiving the first intermediate floating-point numbers, the conversion unit H2 first obtains the exponent term from the binary code of each, determines the largest exponent term, then normalizes the first intermediate floating-point numbers based on that maximum exponent, and uses the normalized results as the mantissa terms of a block floating-point number (i.e., the normalized mantissa terms).
In an optional embodiment, as shown in FIG. 7, the conversion unit H2 includes a shared-exponent extraction subunit U1 and a normalization subunit U2.
The shared-exponent extraction subunit U1 receives the plurality of first intermediate floating-point numbers, obtains the largest among their exponent terms, and takes it as the third exponent term p′.
In some optional implementation scenarios, the structure of the shared-exponent extraction subunit U1 is shown in FIG. 8. It includes an exponent extraction unit and a comparator array. The exponent extraction unit extracts the exponent term from each input floating-point binary code and feeds it to the comparator array. Optionally, the comparator array in U1 can have the same structure as the one shown in FIG. 4; it compares the input exponent terms and outputs the maximum.
The normalization subunit U2 normalizes the first intermediate floating-point numbers into normalized mantissa terms according to the third exponent term p′. For example, U2 first splits the sign bit, exponent term, and mantissa term of each first intermediate floating-point number out of its binary code, subtracts the maximum exponent from the exponent term, shifts the mantissa term by that difference to align it, and concatenates the result with the sign bit to obtain the normalized mantissa term.
In some optional implementation scenarios, the structure of the normalization subunit U2 is shown in FIG. 9. It consists of multiple parallel basic circuits G, each of which includes an exponent extractor, a mantissa extractor, a subtractor, and a shift circuit.
The exponent extractor obtains the exponent term of the first intermediate floating-point number.
The mantissa extractor obtains the mantissa term of the first intermediate floating-point number.
The subtractor obtains the difference between the exponent term of the first intermediate floating-point number and the third exponent term.
The shift circuit shifts and aligns the mantissa term of the first intermediate floating-point number according to that difference, obtaining the normalized mantissa term d″_i.
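The behavior of the basic circuit G (extract exponent, subtract the shared maximum, shift the mantissa) can be sketched in Python, using `math.frexp` in place of the hardware exponent/mantissa extractors. This is a floating-point illustration under assumed naming, not the bit-level circuit:

```python
import math

def to_normalized_mantissas(xs):
    # split each float into mantissa and exponent, find the largest exponent,
    # then scale each mantissa down by the exponent gap (a right shift in
    # hardware) so that all values share the block exponent p'
    parts = [math.frexp(x) for x in xs]          # x = m * 2**e, 0.5 <= |m| < 1
    p_shared = max(e for _, e in parts)
    mantissas = [m * 2.0 ** (e - p_shared) for m, e in parts]
    return p_shared, mantissas
```

Each original value is recovered as m″_i · 2^{p′}, and every normalized mantissa has magnitude below 1.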
The second computing unit H3 performs a calculation on the normalized mantissa terms to obtain a plurality of second intermediate floating-point numbers, and normalizes the second intermediate floating-point numbers into a plurality of second block floating-point numbers based on the normalized mantissa terms.
In some optional embodiments, as shown in FIG. 10, the second computing unit H3 includes a summing subunit W1 and a multiplication subunit W2.
The summing subunit W1 computes the sum of the normalized mantissa terms d″_i, takes the reciprocal of that sum as an intermediate value, and outputs the exponent term p″ and the mantissa term s of the intermediate value: 1 / Σ_k d″_k = 2^{p″}·s.
In some optional implementation scenarios, the structure of the summing subunit W1 is shown in FIG. 11. It includes an adder array and a reciprocal module. The adder array is built from multiple adders in a tree structure; each adder outputs the sum of its two inputs. The reciprocal module computes the reciprocal of its input and outputs the exponent term and mantissa term of the result separately.
The multiplication subunit W2 computes the product of the intermediate value's mantissa term s and each normalized mantissa term d″_i, normalizes and aligns the product with the intermediate value's exponent term, and obtains a target mantissa term and a target exponent term. The target mantissa term is the mantissa term of the second block floating-point numbers, and the target exponent term is the second exponent term shared by the second block floating-point numbers.
In some optional implementation scenarios, the structure of the multiplication subunit W2 is shown in FIG. 12. The multiplication subunit W2 includes a multiplier and a normalization subunit.
The multiplier computes the product of the intermediate value's mantissa term and each normalized mantissa term, obtaining the third mantissa term d″′_i = d″_i·s.
The normalization subunit normalizes and aligns the third mantissa term d″′_i with the intermediate value's exponent term p″, obtaining the target mantissa term d″′_i and the target exponent term p″′.
The target mantissa term d″′_i and the target exponent term p″′ are the mantissa term m_i and the exponent term q of the block floating-point numbers output by the computing circuit.
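The W1/W2 tail of the pipeline (sum, reciprocal split into 2^{p″}·s, then per-element multiply) can be sketched as follows. Floating point stands in for the fixed-point datapath, `math.frexp` stands in for the reciprocal module's exponent/mantissa split, and the function name is an assumption:

```python
import math

def block_softmax_tail(mantissas):
    # W1: 1 / sum(d''_i) expressed as 2**p'' * s
    total = sum(mantissas)
    s, p2 = math.frexp(1.0 / total)
    # W2: each output mantissa is d'''_i = d''_i * s; the shared output
    # exponent is derived from p'' (final alignment details omitted)
    out = [m * s for m in mantissas]
    return p2, out
```

The reconstructed values d″′_i · 2^{p″} equal d″_i / Σ_k d″_k, i.e. the normalized outputs sum to 1.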
The function implemented by the computing circuit of the embodiments is the normalized exponential calculation on block floating-point numbers: softmax(z_i) = x_i / Σ_k x_k = (2^{p′}·d″_i) / Σ_k (2^{p′}·d″_k) = d″_i / Σ_k d″_k.
In this formula the common factor 2^{p′} of the numerator and denominator cancels, which simplifies the hardware circuit.
FIG. 13 shows the data flow of the computing circuit of an embodiment of the present application. As shown in FIG. 13, the computing circuit includes: a maximum-value extraction subunit Q1, a vector subtraction subunit Q2, a vector exponential calculation subunit Q3, a shared-exponent extraction subunit U1, a normalization subunit U2, a summing subunit W1, and a multiplication subunit W2.
Block floating-point numbers z_i = 2^p·d_i (i = 0, 1, 2, …, 15) are input to the computing circuit, where p and d_i are both fixed-point values: p is the exponent term shared by all block floating-point numbers z_i, taken as the first exponent term, and d_i is the mantissa term of each z_i.
The maximum-value extraction subunit Q1 compares the mantissa terms d_i of the block floating-point numbers z_i and outputs the maximum mantissa term d_max (process (1) in FIG. 13).
The vector subtraction subunit Q2 computes the difference between each mantissa term d_i and the maximum d_max, taking it as the second mantissa term d′_i = d_i − d_max (process (2) in FIG. 13).
The vector exponential calculation subunit Q3 constructs, from the shared exponent term p and the second mantissa terms d′_i, block floating-point numbers z′_i = 2^p·d′_i, performs the exponential operation on them, and obtains floating-point numbers x_i (process (3) in FIG. 13).
The shared-exponent extraction subunit U1 receives the floating-point numbers x_i, obtains the largest among their exponent terms, and takes it as the third exponent term p′ (process (4) in FIG. 13).
The normalization subunit U2 normalizes the floating-point numbers x_i into normalized mantissa terms d″_i according to the third exponent term p′ (process (5) in FIG. 13).
The summing subunit W1 computes the sum of the normalized mantissa terms d″_i, takes the reciprocal of the sum as the intermediate value, and outputs the exponent term p″ and the mantissa term s of the intermediate value (process (6) in FIG. 13).
The multiplication subunit W2 computes the product of the intermediate value's mantissa term s and each normalized mantissa term d″_i, normalizes and aligns the product with the intermediate value's exponent term, and obtains the target mantissa term d″′_i, which is the mantissa term of the second block floating-point numbers, and the target exponent term p″′, which is the second exponent term shared by the second block floating-point numbers (process (7) in FIG. 13).
The target mantissa term d″′_i and the target exponent term p″′ are the mantissa term and exponent term of the block floating-point numbers output by the computing circuit.
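The full data flow of FIG. 13, processes (1) through (7), can be summarized as one end-to-end behavioral model. This sketch uses regular floating point in place of the fixed-point hardware, `math.frexp` in place of the exponent extractors, and an assumed function name:

```python
import math

def block_softmax(p, d):
    # behavioral model of processes (1)-(7) of the data flow
    d_max = max(d)                                   # (1) maximum mantissa
    d1 = [di - d_max for di in d]                    # (2) vector subtraction
    x = [math.exp((2 ** p) * di) for di in d1]       # (3) exponentials
    parts = [math.frexp(xi) for xi in x]
    p1 = max(e for _, e in parts)                    # (4) shared exponent p'
    d2 = [m * 2.0 ** (e - p1) for m, e in parts]     # (5) normalized mantissas
    s, p2 = math.frexp(1.0 / sum(d2))                # (6) reciprocal of the sum
    d3 = [m * s for m in d2]                         # (7) output mantissas
    return p2, d3          # block FP result: value_i = d3_i * 2**p2
```

Reconstructing value_i = d″′_i · 2^{p″′} recovers the ordinary softmax of z_i = 2^p·d_i, with the 2^{p′} factor cancelling exactly as in the formula above.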
The computing circuit of the embodiments uses block floating-point numbers as both input and output, which preserves data precision and validity and makes it easy to interoperate with other block floating-point operators in AI model algorithms (e.g., matrix multiplication) without numerical conversion. Meanwhile, each subunit computes in fixed-point format, which is fast, has low overhead, and has low storage and bandwidth requirements. Moreover, because block floating-point numbers share an exponent, the computing circuit supports parallel computation, effectively improving processing throughput.
The embodiments in this specification are described in a related manner; for the parts that the embodiments have in common, reference may be made between them, and each embodiment focuses on its differences from the others. The above are only preferred embodiments of the present application and are not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present application falls within its scope of protection.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410444916.XA CN118312136B (en) | 2024-04-12 | 2024-04-12 | Computing circuits and AI accelerators |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118312136A true CN118312136A (en) | 2024-07-09 |
| CN118312136B CN118312136B (en) | 2025-02-11 |
Family
ID=91731548
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410444916.XA Active CN118312136B (en) | 2024-04-12 | 2024-04-12 | Computing circuits and AI accelerators |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118312136B (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9552189B1 (en) * | 2014-09-25 | 2017-01-24 | Altera Corporation | Embedded floating-point operator circuitry |
| CN112988110A (en) * | 2019-12-17 | 2021-06-18 | 深圳市中兴微电子技术有限公司 | Floating point processing device and data processing method |
| CN115812194A (en) * | 2020-10-31 | 2023-03-17 | 华为技术有限公司 | A floating-point number calculation circuit and a floating-point number calculation method |
| CN116783577A (en) * | 2021-01-29 | 2023-09-19 | 微软技术许可有限责任公司 | Digital circuit for normalization function |
| US20230376769A1 (en) * | 2022-05-18 | 2023-11-23 | Seyed Alireza GHAFFARI | Method and system for training machine learning models using dynamic fixed-point data representations |
Non-Patent Citations (1)
| Title |
|---|
| HESHAN ZHANG et al.: "A Block-Floating-Point Arithmetic Based FPGA Accelerator for Convolutional Neural Networks", 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 28 January 2020, pages 1-5 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118312136B (en) | 2025-02-11 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| TWI701612B (en) | Circuit system and processing method for neural network activation function | |
| CN110362292B (en) | Approximate multiplication method and approximate multiplier based on approximate 4-2 compressor | |
| CN112051980B (en) | Non-linear activation function computing device based on Newton iteration method | |
| CN114115803B (en) | An Approximate Floating-Point Multiplier Based on Partial Product Probabilistic Analysis | |
| CN112506935B (en) | Data processing method, device, electronic device, storage medium, and program product | |
| EP3769208B1 (en) | Stochastic rounding logic | |
| CN110888623B (en) | Data conversion method, multiplier, adder, terminal device and storage medium | |
| CN111984227A (en) | Approximate calculation device and method for complex square root | |
| CN116974517A (en) | Floating point number processing methods, devices, computer equipment and processors | |
| CN107967132A (en) | A kind of adder and multiplier for neural network processor | |
| CN112835551B (en) | Data processing method, electronic device and computer readable storage medium for processing unit | |
| CN111860792A (en) | A kind of hardware realization device and method of activation function | |
| CN112860218B (en) | Mixed precision arithmetic unit for FP16 floating point data and INT8 integer data operation | |
| WO2023078364A1 (en) | Operation method and apparatus for matrix multiplication | |
| CN113608718B (en) | Method for realizing prime number domain large integer modular multiplication calculation acceleration | |
| CN106682258A (en) | Method and system for multi-operand addition optimization in high-level synthesis tool | |
| CN115544447A (en) | Dot product arithmetic device | |
| CN118312136A (en) | Computing circuits and AI accelerators | |
| CN118259873B (en) | Computing circuits, chips, computing devices | |
| CN119806473B (en) | Multiplication device | |
| Kim et al. | Applying piecewise linear approximation for DNN non-linear activation functions to Bfloat16 MACs | |
| CN115374904B (en) | A low-power floating-point multiplication-accumulation method for neural network inference acceleration | |
| Yang et al. | A low-power approximate multiply-add unit | |
| CN114217764B (en) | A high-precision floating-point simulation method based on domestic heterogeneous many-core platform | |
| CN114237550A (en) | A Multi-input Shift-Sum Accumulator Based on Wallace Tree |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||