CN116306825A - Hardware acceleration circuit, data processing acceleration method, chip and accelerator - Google Patents
- Publication number: CN116306825A (application CN202111557306.3A)
- Authority: CN (China)
- Prior art keywords: data, circuit, mantissa, processing, exponent
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/084—Backpropagation, e.g. using gradient descent
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present application relates to a hardware acceleration circuit, a data processing acceleration method, a chip, and an accelerator. The hardware acceleration circuit includes: an exponential function module, configured to obtain multiple exponential function values of multiple data elements in a data set; an adder, configured to obtain the addition result of the multiple exponential function values; a first processing circuit, configured to perform preset processing on the addition result so as to process it into at least first data and second data; a second processing circuit, configured to perform preset processing on at least the first data and the second data so as to obtain the reciprocal of the addition result; and a third processing circuit, configured to perform preset processing on the exponential function value of the i-th data element among the multiple data elements and the reciprocal, so as to obtain a specific function value of the i-th data element. The solution provided by the present application can reduce the amount of computation in data processing and thereby speed up the acquisition of nonlinear function values.
Description
Technical Field

The present application relates to the technical field of artificial intelligence, and in particular to a hardware acceleration circuit, a data processing acceleration method, a chip, and an accelerator.
Background

Nonlinear functions introduce nonlinear characteristics into artificial neural networks and play a very important role in enabling such networks to learn and understand complex scenes. Nonlinear functions include, but are not limited to, the softmax function and the sigmoid function.

Take the softmax function, which is widely used in deep learning, as an example. In the related art, the value of the softmax function can be computed by a general-purpose computing unit such as a central processing unit (CPU) or a graphics processing unit (GPU). However, when the processing of a neural network is executed by a hardware circuit such as a Deep Learning Accelerator (DLA) or a Neural Network Processing Unit (NPU), and the softmax layer sits in the middle of the network, job migration overhead arises between the DLA/NPU and the CPU/GPU. This makes the scheme of using a CPU/GPU to determine nonlinear function values inefficient, increasing system bandwidth usage and power consumption.
Summary of the Invention

To solve, or partly solve, the problems existing in the related art, the present application provides a hardware acceleration circuit, a data processing acceleration method, a chip, and an accelerator, which can reduce the amount of computation in data processing and thereby speed up the acquisition of nonlinear function values.
In one aspect, the present application provides a hardware acceleration circuit, including:

an exponential function module, configured to obtain multiple exponential function values of multiple data elements in a data set;

an adder, configured to obtain the addition result of the multiple exponential function values;

a first processing circuit, configured to perform preset processing on the addition result so as to process it into at least first data and second data, wherein the length of the addition result is N1 bits, the length of the first data is N2 bits, the length of the second data is N3 bits, and both N2 and N3 are smaller than N1;

a second processing circuit, configured to perform preset processing on at least the first data and the second data so as to obtain the reciprocal of the addition result; and

a third processing circuit, configured to perform preset processing on the exponential function value of the i-th data element among the multiple data elements and the reciprocal, so as to obtain a specific function value of the i-th data element.
In another aspect, the present application provides an artificial intelligence chip including the hardware acceleration circuit described above.

In yet another aspect, the present application provides a data processing acceleration method, applied to an artificial intelligence accelerator, the method including:

obtaining multiple exponential function values of multiple data elements in a data set;

obtaining the addition result of the multiple exponential function values;

obtaining the reciprocal of the addition result; and

obtaining a specific function value of the i-th data element based on the exponential function value of the i-th data element among the multiple data elements and the reciprocal;

wherein obtaining the reciprocal of the addition result includes:

processing the addition result into at least first data and second data; and

obtaining the reciprocal of the addition result based on at least the first data and the second data;

wherein the length of the addition result is N1 bits, the length of the first data is N2 bits, the length of the second data is N3 bits, and both N2 and N3 are smaller than N1.
In a further aspect, the present application provides an artificial intelligence accelerator, including:

a processor; and

a memory storing executable code which, when executed by the processor, causes the processor to perform the method described above.
The technical solution provided by the present application may include the following beneficial effects:

In the technical solution of the embodiments of the present application, the addition result of the exponential function values of the data elements is processed into at least first data and second data, both shorter than the addition result, and the reciprocal of the addition result is obtained by performing preset processing on at least the first data and the second data. By reducing the bit width of the processed data, the amount of computation in data processing can be reduced, thereby speeding up the acquisition of nonlinear function values.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.
Brief Description of the Drawings

The above and other objects, features, and advantages of the present application will become more apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings, in which the same reference numerals generally denote the same parts.
Fig. 1 is a schematic structural diagram of a neural network according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a neural network for classification according to an embodiment of the present application;

Fig. 3 is a structural block diagram of a hardware acceleration circuit according to an embodiment of the present application;

Fig. 4A is a structural block diagram of a hardware acceleration circuit according to another embodiment of the present application;

Fig. 4B is a schematic structural diagram of a basic lookup-table circuit unit according to an embodiment of the present application;

Fig. 5 is a structural block diagram of a hardware acceleration circuit according to another embodiment of the present application;

Fig. 6 is a structural block diagram of a hardware acceleration circuit according to another embodiment of the present application;

Figs. 7 to 9 are schematic flowcharts of data processing acceleration methods according to some embodiments of the present application;

Fig. 10 is a structural block diagram of an artificial intelligence accelerator according to an embodiment of the present application.
Detailed Description

Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. Although embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the scope of this application to those skilled in the art.

The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the", and "said" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms "first", "second", "third", and so on may be used in this application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features. In the description of the present application, "multiple" means two or more, unless otherwise specifically defined.
The calculation of a nonlinear function may involve the evaluation of an exponential function and/or a reciprocal. For example, the calculation of the softmax function may involve an exponential (exp) and the reciprocal of a sum of exponentials (1/sum_of_exp).

An embodiment of the present application provides a data processing acceleration scheme in which the addition result of the exponential function values of the data elements is processed into at least first data and second data whose lengths (i.e., bit widths) are both smaller than that of the addition result, and the reciprocal of the addition result is obtained by performing preset processing on at least the first data and the second data. By reducing the bit width of the processed data, the amount of computation in data processing can be reduced, thereby speeding up the acquisition of nonlinear function values.
Fig. 1 is a schematic structural diagram of a neural network according to an embodiment of the present application.

Referring to Fig. 1, the topology of a neural network 100 is shown, including an input layer, hidden layers, and an output layer. The neural network 100 can perform calculations or operations based on the data elements I1 and I2 received by the input layer, and generate the output data O1 and O2 based on the results of those calculations.

For example, the neural network 100 may be a deep neural network (DNN) including one or more hidden layers. The neural network 100 in Fig. 1 includes an input layer L1, two hidden layers L2 and L3, and an output layer L4. DNNs include, but are not limited to, convolutional neural networks (CNN) and recurrent neural networks (RNN).

It should be noted that the four layers shown in Fig. 1 are only intended to facilitate understanding of the technical solution of the present application and should not be construed as limiting it. For example, a neural network may include more or fewer hidden layers.

Nodes in different layers of the neural network 100 may be connected to each other for data transmission. For example, a node may receive data from other nodes, perform calculations on the received data, and output the calculation results to nodes in other layers.
Each node may determine its output data based on the output data received from the nodes of the previous layer and the corresponding weights. For example, in Fig. 1, let $w_{11}^{(1)}$ denote the weight between the first node of the first layer and the first node of the second layer, $a_1^{(1)}$ denote the output data of the first node of the first layer, and $b_1^{(2)}$ denote the bias value of the first node of the second layer; with $w_{21}^{(1)}$ and $a_2^{(1)}$ defined analogously for the second node of the first layer, the output data of the first node of the second layer can be expressed as $a_1^{(2)} = w_{11}^{(1)} a_1^{(1)} + w_{21}^{(1)} a_2^{(1)} + b_1^{(2)}$. The output data of the other nodes are calculated in a similar manner and are not detailed here.
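As an illustration of the per-node computation just described, a minimal sketch with our own example values (the names and numbers are illustrative, not taken from the patent):

```python
# Sketch of the per-node computation: each node of the second layer forms a
# weighted sum of the previous layer's outputs plus its own bias value.
def node_output(prev_outputs, weights, bias):
    return sum(w * a for w, a in zip(weights, prev_outputs)) + bias

a_layer1 = [0.5, -1.0]   # outputs of the two first-layer nodes
w_to_node1 = [0.8, 0.2]  # weights into the first node of the second layer
b_node1 = 0.1            # bias of the first node of the second layer

z = node_output(a_layer1, w_to_node1, b_node1)  # 0.5*0.8 + (-1.0)*0.2 + 0.1 = 0.3
```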
In some embodiments, the neural network is configured with an activation function layer, for example a softmax layer, which can convert the result value for each class into a probability value.

In some embodiments, the neural network is configured with a loss function layer after the softmax layer; the loss function layer can calculate a loss as the objective function for training or learning.

It can be understood that the neural network can, in response to data to be processed, process that data and obtain a recognition result; the data to be processed may include, for example, at least one of speech data, text data, and image data.

A typical type of neural network is the neural network for classification. A neural network for classification can determine the class to which a data element belongs by calculating the probability of the data element corresponding to each class.
Fig. 2 is a schematic structural diagram of a neural network for classification according to an embodiment of the present application.

Referring to Fig. 2, the neural network 200 for classification of this embodiment may include a hidden layer 210, a fully connected layer (FC layer) 220, a softmax layer 230, and a loss function layer 240.

As shown in Fig. 2, in response to data to be processed, the neural network 200 performs calculations sequentially through the hidden layer 210 and the FC layer 220, and the FC layer 220 outputs a calculation result s corresponding to the classification probabilities of the data element. The FC layer 220 may include multiple nodes corresponding respectively to multiple classes, each node outputting a result value corresponding to the probability that the data element is classified into the corresponding class. For example, referring also to Fig. 1, the FC layer 220 corresponds to the output layer L4 in Fig. 1 and has two nodes corresponding to two classes (a first class and a second class); the output value of one node may be a result value representing the probability that the data element is classified into the first class, and the output value of the other node may be a result value representing the probability that it is classified into the second class. The FC layer 220 outputs the calculation result s to the softmax layer 230, which converts the result s into a probability value y and may also normalize the probability value y.

The softmax layer 230 outputs the probability value y to the loss function layer 240, which can calculate the cross-entropy loss L of the result s based on the probability value y.
During the back-propagation learning process, the softmax layer 230 calculates the gradient $\partial L / \partial s$ of the cross-entropy loss L with respect to the result s. The FC layer 220 then performs learning processing based on this gradient of the cross-entropy loss L; for example, the weights of the FC layer 220 may be updated according to a gradient descent algorithm. Further, subsequent learning processing may be performed in the hidden layer 210.
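A well-known property of the softmax-plus-cross-entropy combination (stated here as background, not quoted from the patent) is that the gradient of the loss L with respect to the result s is simply the probability vector minus the one-hot target, which is why the softmax layer can produce this gradient directly:

```python
import math

def softmax(s):
    m = max(s)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(s, target):
    """dL/ds for cross-entropy loss applied after softmax: y - t."""
    return [yi - ti for yi, ti in zip(softmax(s), target)]

grad = cross_entropy_grad([2.0, 1.0], [1.0, 0.0])  # gradient for a 2-class result
```

Note that the gradient components always sum to zero, since both the probabilities and the one-hot target sum to one.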
The neural network 200 may be implemented in software, in hardware circuits, or in a combination of software and hardware. For example, in the case of a hardware implementation, the hidden layer 210, the FC layer 220, the softmax layer 230, and the loss function layer 240 are all implemented by hardware circuits, which may be integrated in one artificial intelligence chip or distributed across multiple chips. With such a configuration, the data migration between the other layers of the neural network and a processor such as a CPU/GPU that would occur if the softmax layer 230 were implemented on the CPU/GPU is avoided, which can improve the efficiency of neural network data processing, reduce processing latency and power consumption, and avoid increased bandwidth usage.

The technical solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Fig. 3 is a structural block diagram of a hardware acceleration circuit according to an embodiment of the present application. In the present application, the hardware acceleration circuit can be used, for example but not limited to, to implement the softmax layer 230 in the neural network 200 described above; the hardware acceleration circuit may be, for example but not limited to, a circuit component of a CPLD (Complex Programmable Logic Device) chip, an FPGA (Field Programmable Gate Array) chip, a dedicated chip, or the like.
To facilitate understanding of the present application, the softmax function is explained as follows. Assuming there is an array X, the softmax value of the i-th element $x_i$ can be calculated as in formula (1):

$$\sigma(x)_i = \frac{e^{x_i - c}}{\sum_j e^{x_j - c}} \qquad (1)$$

In formula (1), $\sigma(x)_i$ denotes the softmax value of the i-th element $x_i$, $e$ is the natural constant, $x_i$ denotes the i-th element of the array X, $c$ denotes the largest element of the array X, and $\sum_j e^{x_j - c}$ denotes the addition result of the exponential function values of at least some of the elements of the array X.
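Subtracting the maximum element c in formula (1) leaves the softmax value mathematically unchanged, because the factor $e^{-c}$ cancels between the numerator and the sum, while keeping every exponential within a safe numeric range. A small sketch illustrating this, independent of the patent's circuit:

```python
import math

def softmax_stable(xs):
    c = max(xs)  # the largest element, as in formula (1)
    exps = [math.exp(x - c) for x in xs]
    total = sum(exps)  # sum_of_exp
    return [e / total for e in exps]

# Shifting every input by the same constant does not change the result,
# and the shifted form avoids overflow for large inputs.
p_large = softmax_stable([1000.0, 1001.0, 1002.0])
p_small = softmax_stable([0.0, 1.0, 2.0])
```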
Referring to Fig. 3, a hardware acceleration circuit includes an exponential function module 11, an adder 21, a first processing circuit 31, a second processing circuit 32, and a third processing circuit 33.

The exponential function module 11 is configured to obtain multiple exponential function values of multiple data elements in a data set.

In one embodiment, the exponential function module 11 may, in response to the index value of each data element in the data set, output the exponential function value corresponding to each data element based on a first lookup table, thereby outputting the respective exponential function values of all data elements in the data set.

The adder 21 is configured to obtain the addition result of the multiple exponential function values.

In one embodiment, the adder 21 may add the respective exponential function values of all data elements input by the exponential function module 11 and output the addition result of the exponential function values of all data elements in the data set.

It can be understood that the addition result of the exponential function values may be the result obtained by directly adding the exponential function values, or the result obtained by adding the exponential function values after a specific transformation. In the case of a specific transformation, depending on the type of transformation, a corresponding inverse transformation may or may not be applied to the subsequently obtained processing results. Likewise, the various processing applied to other data should also be broadly understood to include both of the above situations, rather than being limited to processing of the data itself. The same applies to other embodiments and will not be repeated below.
The first processing circuit 31 is configured to perform preset processing on the addition result so as to process it into at least first data and second data.

In one embodiment, the length of the addition result output by the adder 21 is N1 bits. The first processing circuit 31 may perform data conversion on the N1-bit addition result and output first data and second data, where the length of the first data is N2 bits, the length of the second data is N3 bits, and both N2 and N3 are smaller than N1.

The second processing circuit 32 is configured to perform preset processing on at least the first data and the second data so as to obtain the reciprocal of the addition result.

In one embodiment, the second processing circuit 32 performs preset processing on the first data and the second data: in response to the first data and the second data, it outputs, based on corresponding lookup tables, the lookup results corresponding to the first data and the second data, and obtains the reciprocal of the addition result through data operations on those lookup results.

The third processing circuit 33 is configured to perform preset processing on the exponential function value of the i-th data element among the multiple data elements and the reciprocal, so as to obtain a specific function value of the i-th data element.

In one embodiment, the third processing circuit 33 may multiply the exponential function value of the i-th data element among the multiple data elements by the reciprocal of the addition result and output the softmax value of the i-th data element.

In this embodiment, the addition result of the exponential function values of the data elements is processed into at least first data and second data whose lengths (i.e., bit widths) are both smaller than that of the addition result, and the reciprocal of the addition result is obtained by performing preset processing on at least the first data and the second data. By reducing the bit width of the processed data, the amount of computation in data processing can be reduced, thereby speeding up the acquisition of nonlinear function values.
Fig. 4A is a structural block diagram of a hardware acceleration circuit according to another embodiment of the present application.

Referring to Fig. 4A, a hardware acceleration circuit includes an exponential function module 11, an adder 21, a first processing circuit 31, a second processing circuit 32, and a third processing circuit 33.

The exponential function module 11 includes a first lookup table circuit 1101, configured to obtain, based on a first lookup table, the multiple exponential function values corresponding to the multiple data elements in the data set.

The first lookup table circuit 1101 may, based on the first lookup table, output the exponential function value corresponding to each data element, outputting the respective N4-bit exponential function values of all data elements in the data set.
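One hedged way to picture such a first lookup table, with the table size, input range, and fixed-point scale chosen by us for illustration (the patent does not specify them): since softmax inputs are shifted so that $x - c \le 0$, the table only needs to cover a bounded negative range.

```python
import math

SCALE = 1 << 15  # Q1.15 fixed point: 1.0 is stored as 32768 (fits in N4 = 16 bits)

def build_exp_lut(entries=256, x_min=-8.0):
    # Entry `code` holds e^x for x = x_min * (entries-1-code)/(entries-1),
    # so code 255 maps to x = 0 and code 0 maps to x = x_min.
    return [round(math.exp(x_min * (entries - 1 - code) / (entries - 1)) * SCALE)
            for code in range(entries)]

EXP_LUT = build_exp_lut()

def exp_lookup(code):
    return EXP_LUT[code]  # one table read replaces an exp() evaluation
```

In hardware, the table depth and output width trade ROM area against the precision of the downstream sum.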
The adder 21 is configured to obtain the addition result of the multiple exponential function values.

In one embodiment, the adder 21 may add the respective N4-bit exponential function values of all data elements input by the exponential function module 11 and output the addition result of the exponential function values of all data elements in the data set; the addition result may be an N1-bit fixed-point integer.

The first processing circuit 31 includes an integer-to-floating-point conversion circuit 311, configured to convert the addition result from an integer into a floating-point number represented by first exponent data and first mantissa data.
In a specific implementation, the integer-to-floating-point conversion circuit 311 includes a leading-zero counting circuit or a leading-one detection circuit, as well as a shifter and a subtractor.

The leading-zero counting circuit is used to output the number of leading zeros in the addition result, i.e., the number of 0s encountered when scanning from the most significant bit of the binary data until the first 1 is reached. The leading-one detection circuit is used to output the bit position of the leading one in the addition result, the leading one being the first 1 encountered when scanning from the most significant bit of the binary data.

The shifter is used to output the first mantissa data of the addition result according to the number of leading zeros or the position of the leading one. In a specific implementation, the shifter takes the number of leading zeros as the shift amount and shifts the addition result left by that amount, outputting shifted data with a bit width of N3 bits; that is, N3 consecutive bits are taken from the addition result, starting from the bit after the leading one and moving toward the least significant bit, as the first mantissa data of the addition result.

The subtractor is used to subtract the number of leading zeros (or the position of the leading one) from a preset value, so as to output the first exponent data of the addition result.
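The conversion just described can be modeled in software as follows; the bit widths (N1 = 32 for the addition result, N3 = 8 for the mantissa) and the preset value N1 − 1 are our illustrative assumptions, not values fixed by the patent:

```python
N1, N3 = 32, 8  # assumed widths of the addition result and first mantissa data

def int_to_float_parts(value):
    """Model of circuit 311: leading-zero count -> exponent, next N3 bits -> mantissa."""
    assert 0 < value < (1 << N1)
    leading_zeros = N1 - value.bit_length()
    exponent = (N1 - 1) - leading_zeros  # subtractor: preset value minus leading zeros
    shifted = (value << leading_zeros) & ((1 << N1) - 1)  # leading one moved to the MSB
    mantissa = (shifted >> (N1 - 1 - N3)) & ((1 << N3) - 1)  # N3 bits after the leading one
    return exponent, mantissa  # value ~= 2**exponent * (1 + mantissa / 2**N3)
```

The mantissa here excludes the leading one itself; presumably the padding-with-1 processing mentioned in a later embodiment restores that implicit bit before further shifting.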
The second processing circuit 32 includes a first conversion circuit 321, a second conversion circuit 322, and a third conversion circuit 323.

The first conversion circuit 321 is configured to convert the first exponent data into a negative number.

In one embodiment, the first conversion circuit 321 includes a second lookup table circuit 3212, configured to output, based on a second lookup table, the negative number corresponding to the first exponent data.
第二转换电路322,用于根据第一尾数数据,将由第一指数数据和第一尾数数据表示的浮点数的小数部分转换为由第二指数数据和第二尾数数据表示的另一浮点数。The second conversion circuit 322 is used for converting the fractional part of the floating point number represented by the first exponent data and the first mantissa data into another floating point number represented by the second exponent data and the second mantissa data according to the first mantissa data.
在一实施例中,第二转换电路322包括第三查找表电路3223、第四查找表电路3224;第三查找表电路3223,用于基于第三查找表,获得与第一尾数数据对应的第二指数数据exp2;第四查找表电路3224,用于基于第四查找表,获得与第一尾数数据对应的第二尾数数据frac1。In one embodiment, the second conversion circuit 322 includes a third lookup table circuit 3223 and a fourth lookup table circuit 3224; the third lookup table circuit 3223 is used to obtain the first mantissa corresponding to the first mantissa data based on the third lookup table. The second exponent data exp2; the fourth look-up table circuit 3224, configured to obtain the second mantissa data frac1 corresponding to the first mantissa data based on the fourth look-up table.
The third conversion circuit 323 includes an exponent adder 3231 and a shifter 3232. The exponent adder 3231 obtains the sum of the negative of the first exponent data and the second exponent data; the shifter 3232 uses this sum as a shift parameter to shift the second mantissa data, obtaining the reciprocal of the addition result.
It can be understood that shifting the second mantissa data may be performed after any necessary conversion or processing of the second mantissa data (for example, prepending the implicit leading 1, as described below).
The third processing circuit 33 includes a multiplier 331, configured to multiply the exponential function value of the i-th data element among the multiple data elements output by the exponential function module 11 by the reciprocal of the addition result of the multiple exponential function values output by the second processing circuit 32, and to output the softmax function value of the i-th data element.
It can be understood that, in other embodiments, some or all of the table lookups in the above process may instead be computed in software by a processor (such as a CPU or GPU).
A more detailed description is given below in conjunction with the formulas.
In one embodiment, the floating-point expression of the addition result fp0 is:

fp0 ≈ 2^exp0 × (1 + frac0)

where exp0 is the first exponent data and frac0 denotes the fractional value represented by the first mantissa data. Its reciprocal can then be expressed as:

1/fp0 ≈ 2^(−exp0) × 1/(1 + frac0)

Let

1/(1 + frac0) ≈ 2^exp1 × (1 + frac1)

then:

1/fp0 ≈ 2^(−exp0 + exp1) × (1 + frac1)
Combining the above formulas: the integer-to-floating-point circuit 311 converts the addition result fp0 from fixed-point integer format into a floating-point number represented by the first exponent data exp0 and the first mantissa data frac0. The second lookup table circuit 3212 outputs, based on the second lookup table, the negative number −exp0 corresponding to the first exponent data exp0. The third lookup table circuit 3223 of the second conversion circuit 322 outputs, based on the third lookup table, the second exponent data exp1 corresponding to the first mantissa data frac0. The fourth lookup table circuit 3224 of the second conversion circuit 322 outputs, based on the fourth lookup table, the second mantissa data frac1 corresponding to the first mantissa data frac0.
As shown in the formula above, the reciprocal of the addition result fp0 can be obtained by multiplying 2^(−exp0 + exp1) by (1 + frac1). In a specific implementation, this can be achieved by prepending the implicit leading 1 to frac1 and then shifting the result, using the value −exp0 + exp1 as the shift parameter.
It can be understood that the conversions of fp0 and frac0 in the above formulas are approximate; the error introduced by these conversions has a negligible effect on computation accuracy in practice.
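As a sketch of the derivation above, the third and fourth lookup tables can be modeled by computing the pair (exp1, frac1) on the fly; the names and the N3 = 8 width here are illustrative assumptions, and in hardware the pairs would be precomputed and stored in tables rather than calculated:

```python
import math

N3 = 8  # assumed mantissa bit width

def recip_lookup(frac0: int):
    """Model of the two mantissa lookup tables: for an N3-bit mantissa
    frac0, return (exp1, frac1) such that
    1/(1 + frac0/2**N3) ~= 2**exp1 * (1 + frac1/2**N3)."""
    m = 1.0 + frac0 / 2 ** N3          # mantissa value in [1, 2)
    r = 1.0 / m                        # its reciprocal, in (0.5, 1]
    exp1 = math.floor(math.log2(r))    # 0 when frac0 == 0, else -1
    frac1 = round((r / 2 ** exp1 - 1.0) * 2 ** N3)
    return exp1, frac1

def reciprocal(exp0: int, frac0: int) -> float:
    """1/fp0 ~= 2**(-exp0 + exp1) * (1 + frac1/2**N3)."""
    exp1, frac1 = recip_lookup(frac0)
    return 2.0 ** (-exp0 + exp1) * (1.0 + frac1 / 2 ** N3)
```

For instance, frac0 = 128 (i.e. a mantissa of 1.5) yields exp1 = −1 and frac1 = 85, and the reconstructed reciprocal 0.666 closely matches 1/1.5.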
The third processing circuit 33 is configured to multiply the N4-bit exponential function value of the i-th data element among the multiple data elements by the N5-bit reciprocal of the addition result, to obtain an N6-bit multiplication result for the i-th data element. Further, the N6-bit multiplication result may be converted, for example into an N7-bit result of lower bit width, and the converted result can serve as the softmax function value of the i-th data element output by the hardware acceleration circuit. It can be understood that converting the multiplication result from a bit width of N6 bits to N7 bits can be achieved, for example, by saturation, rounding, and similar processing. Rounding includes, for example, round-half-up, rounding up, rounding down, rounding toward zero, and so on.
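The saturate-and-round narrowing from an N6-bit product to an N7-bit output might look as follows; the function name and the round-half-up choice are illustrative assumptions:

```python
def narrow(value: int, drop_bits: int, out_bits: int) -> int:
    """Narrow an unsigned product: round away `drop_bits` low bits
    (round half up), then saturate to the out_bits-wide maximum."""
    rounded = (value + (1 << (drop_bits - 1))) >> drop_bits
    max_val = (1 << out_bits) - 1
    return min(rounded, max_val)
```

With N6 = 16 and N7 = 8 this drops the low 8 bits of the 16-bit product and clamps any overflow to 255.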
In one embodiment, the first exponent data and the second exponent data may both be N2 bits long, and the first mantissa data and the second mantissa data may both be N3 bits long. The values of N2 and N3 may lie in the range [1, 32]; in some specific examples the range may be [8, 12]. N2 and N3 may be equal or unequal.
In this embodiment, the first through fourth lookup tables may be stored in a storage module, which may be, for example, RAM (Random-Access Memory), ROM (Read-Only Memory), flash memory, etc.
In one embodiment, the hardware acceleration circuit includes at least two of the first through fourth lookup table circuits, i.e. two, three or all of them, and each of these lookup table circuits has its own basic lookup table circuit unit.
Referring to FIG. 4B, in one embodiment the basic lookup table circuit unit 20 includes an input terminal group 22, a control terminal group 23, an output terminal group 24, and a logic circuit 21. The input terminal group 22 is connected to the memory 10 and feeds the data of the lookup table into the logic circuit 21; according to the index value (also called the address) input from the control terminal group 23, the logic circuit 21 selects the value in the lookup table corresponding to that index value and outputs it from the output terminal group 24. The logic circuit 21 may be, for example, a logic gate circuit or a logic switch circuit. It can be understood that in this application a terminal group refers to a set of connection terminals, which may include one or more terminals. When the control terminal group 23 has A control terminals and the output terminal group 24 has B output terminals, the basic lookup table circuit unit 20 is said to be A-input B-output.
The basic lookup table circuit unit 20 can perform table lookups based on the stored lookup table. Taking the first lookup table as an example, the table is likewise A-input B-output: its data elements are index values with a bit width of A bits, and its output data are exponential function values with a bit width of B bits. The first lookup table in the storage area stores the true values of the exponential function, and the basic lookup table circuit unit implements the mapping between index values and these true values.
Taking as an example the case where each of the first through fourth lookup table circuits of the hardware acceleration circuit has its own basic lookup table circuit unit: the storage module includes first through fourth storage areas, and the first through fourth lookup tables are stored in the first through fourth storage areas respectively. The first lookup table circuit includes a first basic lookup table circuit unit, the second lookup table circuit includes a second basic lookup table circuit unit, the third lookup table circuit includes a third basic lookup table circuit unit, and the fourth lookup table circuit includes a fourth basic lookup table circuit unit.
The first basic lookup table circuit unit is connected to the first storage area and outputs, in response to the index value of the i-th data element, the corresponding exponential function value stored in the first lookup table of the first storage area. The second basic lookup table circuit unit is connected to the second storage area and outputs, in response to the index value of the first exponent data, the corresponding negative number stored in the second lookup table of the second storage area. The third basic lookup table circuit unit is connected to the third storage area and outputs, in response to the index value of the first mantissa data, the corresponding second exponent data stored in the third lookup table of the third storage area. The fourth basic lookup table circuit unit is connected to the fourth storage area and outputs, in response to the index value of the first mantissa data, the corresponding second mantissa data stored in the fourth lookup table of the fourth storage area. In another embodiment, the hardware acceleration circuit includes at least two of the first through fourth lookup table circuits, and some of the lookup table circuits share one basic lookup table circuit unit. By multiplexing basic lookup table circuit units, the number of units required can be reduced, effectively lowering the area and cost of the hardware acceleration circuit.
Taking as an example the case where the first and second lookup table circuits of the hardware acceleration circuit share one basic lookup table circuit unit (referred to here as the first basic lookup table circuit unit): this unit includes a first input terminal group, a first control terminal group, a first output terminal group and a first logic gate circuit. The first input terminal group is connected to the storage module, and the first logic gate circuit is configured to: during a first time period, in response to the index value of the i-th data element input from the first control terminal group, output from the first output terminal group, based on the first lookup table, the exponential function value corresponding to the i-th data element; and, during a second time period after the first time period, in response to the index value of the first exponent data input from the first control terminal group, output from the first output terminal group, based on the second lookup table, the negative number corresponding to the first exponent data.
It can be understood that, in one specific implementation of this embodiment, the storage module includes a first storage area, and the first and second lookup tables are stored in the first storage area in a time-shared manner. Since only one storage area needs to be provided to hold either of the two tables at a given time, the storage space occupied by the lookup tables is effectively reduced, lowering hardware cost. In another specific implementation, the storage module includes a first storage area and a second storage area; the first lookup table is stored in the first storage area and the second lookup table in the second storage area.
It can be understood that, in this application, the index value of a piece of data may be the data itself, or may be obtained by applying a specific transformation to the data.
In this embodiment, the exponential function value of each data element is obtained by table lookup through a hardware lookup table circuit, the addition result of the exponential function values is obtained through an adder, and the addition result is converted into several data parts each of lower bit width; the division by the addition result is then realized through table lookups and subsequent addition and multiplication, yielding the reciprocal of the addition result. Since complex exponentiation and reciprocal operations are avoided, the data processing speed of the nonlinear function computation is increased and the nonlinear function values are obtained faster. On the other hand, the large hardware circuit area and high cost that a direct implementation of exponentiation and reciprocal operations would incur are avoided.
Further, three lookups on lower-bit-width data are used: after the addition result is converted from an integer into a floating-point number represented by first exponent data and first mantissa data of reduced bit width, the negative of the first exponent data is obtained through one lookup table, and the second exponent data and second mantissa data are obtained through two further lookup tables. This significantly reduces the dependence of the lookup tables on large storage space, lowers the area and cost of the lookup table logic circuits, and shortens table lookup time, thereby increasing data processing speed.
For example, if the addition result is a 16-bit integer, the lookup table required for a direct lookup contains 2^16 (i.e. 65536) entries, requiring a large amount of storage and making the lookup table logic circuit too costly. Moreover, completing a single lookup result would take 65536 cycles, an excessively long processing time. In this application, by contrast, the 16-bit addition result can, for example, be converted from an integer into a floating-point number represented by 8-bit first exponent data and 8-bit first mantissa data; if the three lookup tables involved are each configured as 8-input 8-output, their total number of entries is 3 × 2^8 = 768. Clearly, the latter greatly reduces the storage space required by the lookup tables and the area and cost of the lookup table logic circuits, and speeds up table lookup.
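The storage saving in this example can be checked directly (the helper name is illustrative):

```python
def lut_entries(index_bits: int, tables: int = 1) -> int:
    """Total entries held by `tables` lookup tables, each indexed
    by an index_bits-bit value."""
    return tables * 2 ** index_bits

direct = lut_entries(16)           # one table indexed by the full 16-bit sum
split = lut_entries(8, tables=3)   # three 8-input 8-output tables
```

Here `direct` is 65536 and `split` is 768, roughly an 85x reduction in entry count.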
FIG. 5 is a structural block diagram of a hardware acceleration circuit according to another embodiment of the present application.
Referring to FIG. 5, a hardware acceleration circuit includes an exponential function module 11, an adder 21, a first processing circuit 31, a second processing circuit 32 and a third processing circuit 33.
The exponential function module 11 is configured to obtain multiple exponential function values of multiple data elements in a data set.
The adder 21 is configured to obtain the addition result of the multiple exponential function values.
In this embodiment, the addition result is a floating-point number. Based on the respective N4-bit exponential function values of all data elements input from the exponential function module 11, the adder 21 adds these values and outputs an N1-bit floating-point addition result of the exponential function values of all data elements in the data set.
The first processing circuit 31 includes a third lookup table circuit 313 and a fourth lookup table circuit 314. The third lookup table circuit 313 is configured to obtain, based on the third lookup table, the exponent data corresponding to the addition result; the fourth lookup table circuit 314 is configured to obtain, based on the fourth lookup table, the mantissa data corresponding to the addition result.
The second processing circuit 32 is configured to perform preset processing on the exponent data and the mantissa data, to obtain the reciprocal of the addition result.
The third processing circuit 33 is configured to perform preset processing on the exponential function value of the i-th data element among the multiple data elements and the reciprocal, to obtain the specific function value of the i-th data element.
FIG. 6 is a structural block diagram of a hardware acceleration circuit according to another embodiment of the present application.
Referring to FIG. 6, a hardware acceleration circuit includes a subtractor 61, an exponential function module 11, an adder 21, a first processing circuit 31, a second processing circuit 32 and a third processing circuit 33.
The subtractor 61 is configured to subtract the maximum value among multiple initial data in an initial data set from each of the multiple initial data, to obtain a data set containing multiple data elements.
The exponential function module 11 includes a first lookup table circuit 1101, configured to obtain, based on the first lookup table, the multiple exponential function values corresponding to the multiple data elements in the data set.
In a specific embodiment, a mathematical transformation is applied to the initial data set input to the hardware acceleration circuit, so that each element xi = xi′ − c. The subtractor 61 subtracts the maximum value of the initial data set from each initial datum in the initial data set X and outputs the data element corresponding to each initial datum in X; these data elements form the data set, and each data element in the data set is 0 or negative. Subtracting through the subtractor 61 narrows the value range of the data elements, making it easier to implement the present solution with lower-bit-width data and correspondingly smaller hardware circuits. On the other hand, since each data element in the data set is negative or 0, its base-e exponential function value is normalized into the range (0, 1].
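A minimal sketch of the subtractor's transformation (the function name is illustrative):

```python
def subtract_max(initial):
    """x_i = x_i' - max(x'): every output element is 0 or negative,
    so exp(x_i) falls in (0, 1]."""
    c = max(initial)
    return [x - c for x in initial]
```

This is the standard numerically stable first step of a softmax computation: it bounds the inputs to the exponential stage without changing the final softmax values.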
To better understand the lookup process of this embodiment, Table 1 below shows a specific example of the first lookup table, with N0-bit input and N4-bit output, where N0 and N4 are both 8. A data element of the first lookup table may be an index value with a bit width of N0 bits, and the output data may be an exponential function value with a bit width of N4 bits. For ease of understanding, the data in Table 1 are given in decimal format. It can be understood that the first lookup table in the storage module stores only the true values of the exponential function; the first lookup table circuit implements the mapping between index values and these true values. The "data element" and "normalized exponential function value" columns are included in the table only to aid understanding of this application.
Table 1
As shown in Table 1, the data elements output by the subtractor 61 are negative or 0, and the value range of the data elements is defined as [−10, 0]. For table lookup, the range [−10, 0] is discretized into the 256 (i.e. 2^N0) points shown in the "data element" column; the exponential function value corresponding to each point is shown in the "normalized exponential function value" column. Each data element point corresponds to an integer in the range [0, 255] shown in the "index value" column, and each normalized exponential function value corresponds to an integer in the range [0, 255] shown in the "exponential function value" column. The data in the "exponential function value" column are stored as true values in the first lookup table of the storage module, and lookups are performed via the index values.
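Under the assumptions of Table 1 (value range [−10, 0], 8-bit index, 8-bit output), a first lookup table of this shape could be generated as follows; the function name and the exact discretization are illustrative assumptions:

```python
import math

def build_exp_lut(n0: int = 8, n4: int = 8, lo: float = -10.0, hi: float = 0.0):
    """Discretize [lo, hi] into 2**n0 evenly spaced points; store exp(x)
    at each point, quantized so that exp(0) = 1 maps to the n4-bit maximum."""
    points = 2 ** n0
    scale = (1 << n4) - 1
    return [round(math.exp(lo + (hi - lo) * i / (points - 1)) * scale)
            for i in range(points)]
```

Index 255 (data element 0) maps to 255, and index 0 (data element −10) maps to 0, since e^−10 is below the quantization step.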
For the implementation of the adder 21, the first processing circuit 31, the second processing circuit 32 and the third processing circuit 33, reference may be made to the preceding implementation examples, which are not repeated here.
In a specific implementation, the data elements may be fixed-point integers with a bit width of 8 bits. Each exponential function value in the first lookup table is an 8-bit fixed-point integer; the addition result of the multiple exponential function values is a 32-bit fixed-point integer; the first exponent data, first mantissa data, second exponent data and second mantissa data of the addition result are all 8-bit fixed-point integers, i.e. the second, third and fourth lookup tables are all 8-input 8-output; the multiplication result is a 16-bit fixed-point integer; and the specific function value obtained by converting the multiplication result is an 8-bit fixed-point integer. That is, N0, N2, N3, N4, N5 and N7 are 8, N1 is 32, and N6 is 16.
It can be understood that, in other implementations, N0, N2, N3, N4, N5 and N7 may take other values; for example, they may each lie in the range [1, 32], and in some specific examples in the range [8, 12]. They also need not all be equal; for example, N0 and N3 may be 9, 10, 11 or 12 while N2, N4, N5 and N7 are 8. Because the dynamic range of softmax function values is very wide, the related art mostly implements this function in software modules. The embodiments of this application provide a hardware circuit solution essentially based on 8 bits, one that effectively balances important metrics such as circuit cost, power consumption, bandwidth, performance and data precision.
In this embodiment, in obtaining by table lookup the reciprocal of the addition result of the exponential function values of the data elements, the addition result is converted into floating-point form to obtain its exponent data and mantissa data; lookups are performed in multiple lookup tables based on the exponent data and the mantissa data respectively, and the reciprocal of the addition result is obtained from the outputs of these multiple lookups, yielding a reciprocal of higher precision.
Further, by converting the addition result into floating-point form during the softmax computation, obtaining its reciprocal from the data of multiple table lookups, and configuring the input/output data bit widths of those lookups within a small range, the storage resources occupied by the lookup tables, the area of the lookup table circuits, and the bandwidth consumed can all be reduced. On the other hand, within the range allowed by the required precision, the table lookup speed and fixed-point computation speed can be increased, further speeding up the circuit's response and lowering power consumption.
The present application also provides embodiments of a data processing acceleration method.
FIG. 7 is a schematic flowchart of a data processing acceleration method according to an embodiment of the present application.
Referring to FIG. 7, a data processing acceleration method includes:
In step S110, multiple exponential function values of multiple data elements in a data set are obtained.
In step S120, the addition result of the multiple exponential function values is obtained.
In step S130, the reciprocal of the addition result is obtained.
In step S140, the specific function value of the i-th data element is obtained based on the exponential function value of the i-th data element among the multiple data elements and the reciprocal of the addition result.
Obtaining the reciprocal of the addition result in step S130 includes:
Step S130A: converting the addition result into at least first data and second data.
Step S130B: obtaining the reciprocal of the addition result at least according to the first data and the second data.
Here, the addition result is data of length N1 bits, the first data is data of length N2 bits, and the second data is data of length N3 bits; N2 and N3 are both smaller than N1.
It can be understood that the addition result of the exponential function values may be the result of adding the exponential function values directly, or the result of adding them after a specific transformation. Where a transformation is applied, a corresponding inverse transformation may be applied to the subsequently obtained processing results, or omitted, depending on the type of transformation. Likewise, the processing applied to other data should be understood broadly as covering both situations, and not as limited to processing of the data itself.
图8是本申请另一实施例示出的数据处理加速方法的流程示意图。Fig. 8 is a schematic flowchart of a data processing acceleration method according to another embodiment of the present application.
参见图8,一种数据处理加速方法,包括:Referring to Figure 8, a data processing acceleration method includes:
在步骤S801中,获得数据集合中多个数据元素对应的多个指数函数值。In step S801, multiple exponential function values corresponding to multiple data elements in the data set are obtained.
在一实施例中,可以通过第一查找表模块,响应于数据集合中的各个数据元素的索引值,基于第一查找表输出与各个数据元素对应的指数函数值,输出数据集合中所有数据元素各自的N4比特的指数函数值。In an embodiment, the first lookup table module may respond to the index value of each data element in the data set, output the exponential function value corresponding to each data element based on the first lookup table, and output all the data elements in the data set The respective N4-bit exponential function value.
在步骤S802中,获得多个指数函数值的加法运算结果。In step S802, an addition operation result of a plurality of exponential function values is obtained.
在一实施例中,可以通过加法器对所有数据元素各自的N4比特的指数函数值进行加法运算,获得加法器输出的数据集合所有数据元素的指数函数值的N1比特的加法运算结果,加法运算结果可以是N1比特的定点整数。In one embodiment, the N4-bit exponential function values of all data elements can be added by an adder to obtain the N1-bit addition results of the exponential function values of all data elements in the data set output by the adder, and the addition operation The result may be an N1-bit fixed-point integer.
在步骤S803中,将加法运算结果从整数转换为由第一指数数据和第一尾数数据表示的浮点数。In step S803, the addition result is converted from an integer to a floating point number represented by the first exponent data and the first mantissa data.
在一实施例中,加法器输出的加法运算结果为定点表示的N1比特的整数,可以通过整数转浮点电路对定点整数进行数据转换,获得加法运算结果的N2比特的第一指数数据exp0和N3比特的第一尾数数据frac0。In one embodiment, the addition operation result output by the adder is an N1-bit integer represented by a fixed point, and the fixed-point integer may be converted to data by an integer-to-floating-point circuit to obtain the N2-bit first exponent data exp0 and First mantissa data frac0 of N3 bits.
在步骤S804中,将第一指数数据转换为负数。In step S804, the first exponent data is converted into a negative number.
在一实施例中,可以通过第二查找表电路,响应于第一指数数据的索引值,基于第二查找表,输出与第一指数数据对应的负数exp1。In an embodiment, the negative number exp1 corresponding to the first exponent data may be output based on the second lookup table in response to the index value of the first exponent data through the second lookup table circuit.
In step S805, based on the first mantissa data, the fractional part of the floating-point number is converted into another floating-point number represented by second exponent data and second mantissa data.
In one embodiment, a third lookup table circuit may, in response to the index value of the first mantissa data, output second exponent data exp2 corresponding to the first mantissa data based on a third lookup table; a fourth lookup table circuit may, in response to the index value of the first mantissa data, output second mantissa data frac1 corresponding to the first mantissa data based on a fourth lookup table.
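One plausible way to fill the third and fourth tables: for a mantissa m = 1 + frac0/2**N3 in [1, 2), its reciprocal 1/m lies in (0.5, 1] and can be renormalized as (1 + frac1/2**N3) * 2**exp2 with the normalized part back in [1, 2). This is a sketch under that assumed encoding, not the patent's exact table contents.

```python
N3 = 10   # assumed mantissa width, matching the integer-to-float step

def build_reciprocal_luts():
    """Tables indexed by frac0: exp2, and the fraction bits frac1 of the normalized reciprocal."""
    exp2_lut, frac1_lut = [], []
    for frac0 in range(1 << N3):
        m = 1 + frac0 / 2**N3            # mantissa value in [1, 2)
        r = 1.0 / m                      # reciprocal in (0.5, 1]
        exp2 = 0 if r >= 1.0 else -1     # only frac0 == 0 gives r == 1.0 exactly
        frac1 = r / 2**exp2              # renormalized into [1, 2)
        exp2_lut.append(exp2)
        frac1_lut.append(round((frac1 - 1) * 2**N3))   # store the fraction bits
    return exp2_lut, frac1_lut

EXP2_LUT, FRAC1_LUT = build_reciprocal_luts()
# Entry for frac0 = 512 (m = 1.5): 1/1.5 is about 0.667 = (1 + 341/1024) * 2**-1
print(EXP2_LUT[512], FRAC1_LUT[512])   # -1 341
```

Since the tables are indexed only by the N3-bit mantissa, they have 2**N3 entries regardless of how wide the original sum was.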
In step S806, the reciprocal of the addition result is obtained from the negative of the first exponent data, the second exponent data, and the second mantissa data.
In one embodiment, an exponent adder obtains the sum of the negative of the first exponent data and the second exponent data, and a shifter, using this sum as the shift parameter, shifts the second mantissa data to obtain the N5-bit reciprocal of the addition result.
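The exponent adder and shifter can be sketched as follows. The output format (N5 = 24 fraction bits) and the worked constants are illustrative assumptions carried over from the earlier sketches.

```python
N3, N5 = 10, 24   # assumed mantissa width and reciprocal output width (N5 fraction bits)

def reciprocal_from_parts(exp1, exp2, frac1):
    """Assemble 1/S = (1 + frac1/2**N3) * 2**(exp1 + exp2) as an integer with N5 fraction bits."""
    mant = (1 << N3) | frac1             # implicit leading 1 plus fraction bits
    shift = exp1 + exp2 + (N5 - N3)      # exponent adder output, aligned to N5 fraction bits
    return mant << shift if shift >= 0 else mant >> -shift

# For S = 1000: exp0 = 9 so exp1 = -9; the mantissa 1.953125 has reciprocal
# 0.512, stored as exp2 = -1 with frac1 = 25.
r = reciprocal_from_parts(-9, -1, 25)
print(r, r / 2**N5)   # 16784, roughly 0.001, i.e. 1/1000
```

Only an adder and a barrel shifter are exercised here, which is exactly why the scheme avoids a hardware divider.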
In step S807, preset processing is performed on the exponential function value of the i-th data element among the multiple data elements and the reciprocal of the addition result, to obtain a specific function value of the i-th data element.
In one embodiment, a multiplication circuit may multiply the N4-bit exponential function value of the i-th data element by the N5-bit reciprocal to obtain an N6-bit multiplication result for that element; further, the N6-bit multiplication result may be converted, for example to a lower-bit-width N7-bit result, and the converted result can serve as the Softmax function value of the i-th data element output by the hardware acceleration circuit. It should be understood that converting the multiplication result from a bit width of N6 bits to N7 bits can be implemented, for example, by saturation, rounding, and similar processing. Rounding includes, for example, rounding half up, rounding up, rounding down, and rounding toward zero.
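The final multiply and bit-width reduction can be sketched as below. The fraction-bit counts and the round-half-up policy are assumptions; the text permits other rounding modes.

```python
N4_FRAC, N5_FRAC, N7 = 15, 24, 8   # assumed fraction bits of the two operands and output width

def softmax_output(exp_val, recip):
    """Multiply, then round and saturate the product down to an unsigned N7-bit value."""
    prod = exp_val * recip                        # N6-bit product with N4_FRAC + N5_FRAC fraction bits
    drop = N4_FRAC + N5_FRAC - N7                 # bits discarded when narrowing to N7 fraction bits
    rounded = (prod + (1 << (drop - 1))) >> drop  # round half up
    return min(rounded, (1 << N7) - 1)            # saturate instead of wrapping on overflow

# 0.5 * 0.25 = 0.125, which is 32/256 in unsigned 8-bit fixed point
print(softmax_output(1 << (N4_FRAC - 1), 1 << (N5_FRAC - 2)))   # 32
```

Saturation matters here: without the `min`, a product at or above 1.0 would wrap around to a small value instead of clamping to the largest representable output.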
In one embodiment, the first exponent data and the second exponent data may both be N2 bits long, and the first mantissa data and the second mantissa data may both be N3 bits long; the values of N2 and N3 may lie in the range [1, 32], and in some specific examples in the range [8, 12].
In this embodiment, the exponential function value of each data element is obtained by table lookup through a hardware lookup table circuit, the addition result of the exponential function values is obtained through an adder, and the addition result is converted to floating point. Taking the exponent part and mantissa part of the floating-point addition result as inputs, the division is realized through lookup table circuits to obtain the reciprocal of the addition result. This avoids complex exponential and reciprocal operations, increases the data processing speed of the Softmax computation, and yields the Softmax function values faster. It also avoids the excessive hardware circuit area and cost that implementing exponential and reciprocal operations would otherwise incur.
Fig. 9 is a schematic flowchart of a data processing acceleration method according to another embodiment of the present application.
Referring to Fig. 9, a data processing acceleration method includes:
In step S901, the maximum value among the multiple initial data in an initial data set is subtracted from each of the initial data to obtain a data set containing multiple data elements.
In step S902, multiple exponential function values corresponding to the multiple data elements in the data set are obtained.
In step S903, the addition result of the multiple exponential function values is obtained.
In step S904, the reciprocal of the addition result is obtained.
In step S905, the specific function value of the i-th data element is obtained based on the exponential function value of the i-th data element among the multiple data elements and the reciprocal of the addition result.
Here, the addition result is a floating-point number.
Obtaining the reciprocal of the addition result in step S904 includes:
converting the addition result into exponent data and mantissa data; and
obtaining the reciprocal of the addition result based at least on the exponent data and the mantissa data.
The length of the addition result is N1 bits, the length of the exponent data is N2 bits, and the length of the mantissa data is N3 bits, where N2 and N3 are both smaller than N1.
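Steps S901 through S905 together compute the Softmax function. A plain floating-point reference model, useful as a golden model when verifying the fixed-point circuit, might look like:

```python
import math

def softmax_reference(xs):
    m = max(xs)                            # S901: subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]   # S902: exponential function value of each element
    s = sum(exps)                          # S903: addition result
    inv = 1.0 / s                          # S904: reciprocal of the addition result
    return [e * inv for e in exps]         # S905: per-element Softmax value

y = softmax_reference([1.0, 2.0, 3.0])
print(round(sum(y), 6))   # 1.0, since Softmax outputs sum to one
```

Comparing the circuit's N7-bit outputs against this model, element by element, gives a direct measure of the quantization error introduced by the lookup tables and bit-width reductions.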
For related features of the data processing acceleration method of this embodiment, refer to the corresponding content in the foregoing hardware acceleration circuit embodiments, which is not repeated here.
The data processing acceleration method of the embodiments of the present application can be applied to an artificial intelligence accelerator. Fig. 10 is a schematic structural diagram of an artificial intelligence accelerator according to an embodiment of the present application.
Referring to Fig. 10, an artificial intelligence accelerator 1000 includes a memory 1010 and a processor 1020.
The processor 1020 may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations. Artificial intelligence operations may include machine learning operations, brain-inspired computing, and so on; machine learning operations include neural network operations, k-means operations, support vector machine operations, and the like. The artificial intelligence processor may, for example, include one or a combination of a GPU (Graphics Processing Unit), DLA (Deep Learning Accelerator), NPU (Neural-Network Processing Unit), DSP (Digital Signal Processor), FPGA (Field-Programmable Gate Array), and ASIC (Application-Specific Integrated Circuit). The present application does not limit the specific type of the processor.
The memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 1020 or other modules of the computer. The persistent storage may be a readable and writable storage device, that is, a non-volatile storage device that does not lose its stored instructions and data even when the computer is powered off. In some embodiments, a mass storage device (such as a magnetic or optical disk, or flash memory) serves as the persistent storage; in other embodiments, the persistent storage may be a removable storage device (such as a floppy disk or optical drive). The system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random-access memory, and may store some or all of the instructions and data that the processor needs at runtime. In addition, the memory 1010 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be used.
In some embodiments, the memory 1010 may include readable and/or writable removable storage devices, such as compact discs (CD), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), read-only Blu-ray discs, ultra-density discs, flash memory cards (e.g., SD cards, mini SD cards, Micro-SD cards), magnetic floppy disks, and so on. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 1010 stores executable code; when the executable code is processed by the processor 1020, the processor 1020 can be caused to execute some or all of the methods described above.
In a possible implementation, the artificial intelligence accelerator may include multiple processors, each of which can independently run the various tasks assigned to it. The present application does not limit the processors or the tasks they run.
It should be understood that, unless otherwise specified, the functional units/modules in the embodiments of the present application may be integrated into one unit/module, may each exist physically on their own, or may be integrated together in groups of two or more. The integrated units/modules may be implemented in the form of hardware or in the form of software program modules.
If an integrated unit/module is implemented in hardware, the hardware may be a digital circuit, an analog circuit, and so on. Physical implementations of the hardware structure include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the artificial intelligence processor may be any appropriate hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the storage module may be any suitable magnetic or magneto-optical storage medium, such as resistive random-access memory (RRAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), enhanced dynamic random-access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and so on.
If an integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random-access memory (RAM), removable hard disk, magnetic disk, or optical disc.
In a possible implementation, an artificial intelligence chip is also disclosed, which includes the above hardware acceleration circuit.
In a possible implementation, a board card is also disclosed, which includes a storage device, an interface device, a control device, and the above artificial intelligence chip, where the artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively. The storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
In a possible implementation, an electronic device including the above artificial intelligence chip is disclosed. Electronic devices include data processing apparatuses, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, dashcams, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.
In addition, the method according to the present application may also be implemented as a computer program or computer program product, which includes computer program code instructions for executing some or all of the steps of the above method of the present application.
Alternatively, the present application may also be implemented as a computer-readable storage medium (or non-transitory machine-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by the processor of an electronic device (or server, etc.), the processor is caused to perform some or all of the steps of the above method according to the present application.
The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (21)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111557306.3A CN116306825A (en) | 2021-12-18 | 2021-12-18 | Hardware acceleration circuit, data processing acceleration method, chip and accelerator |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116306825A true CN116306825A (en) | 2023-06-23 |
Family
ID=86826277
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108021537A (en) * | 2018-01-05 | 2018-05-11 | 南京大学 | A kind of softmax implementations based on hardware platform |
| US20180189640A1 (en) * | 2016-12-31 | 2018-07-05 | Via Alliance Semiconductor Co., Ltd. | Neural network unit with segmentable array width rotator |
| CN109308520A (en) * | 2018-09-26 | 2019-02-05 | 阿里巴巴集团控股有限公司 | Realize the FPGA circuitry and method that softmax function calculates |
| US20190196790A1 (en) * | 2017-12-21 | 2019-06-27 | Cristina Anderson | Apparatus and method for processing reciprocal square root operations |
| CN110796247A (en) * | 2020-01-02 | 2020-02-14 | 深圳芯英科技有限公司 | Data processing method, device, processor and computer readable storage medium |
| CN112685693A (en) * | 2020-12-31 | 2021-04-20 | 南方电网科学研究院有限责任公司 | Device for realizing Softmax function |
| US20210294784A1 (en) * | 2020-03-17 | 2021-09-23 | Samsung Electronics Co., Ltd. | Method and apparatus with softmax approximation |
Non-Patent Citations (4)
| Title |
|---|
| GIAN CARLO CARDARILLI ET AL.: "A pseudo-softmax function for hardware-based high speed image classification", SCIENTIFIC REPORTS, 28 July 2021 (2021-07-28) * |
| HE JUN; WANG LI: "Design and implementation of a pipelined floating-point reciprocal approximation unit", JOURNAL OF NATIONAL UNIVERSITY OF DEFENSE TECHNOLOGY, no. 02, 28 April 2020 (2020-04-28) * |
| ZOU XI: "FPGA implementation of the exponential function based on CORDIC", POPULAR SCIENCE & TECHNOLOGY, no. 10, 10 October 2008 (2008-10-10) * |
| GAO JIANBO; ZHANG SHENGBING; HUANG XIAOPING; YAO TAO; LU BIN: "Design and implementation of exponential operations based on AltiVec", MICROELECTRONICS & COMPUTER, no. 09, 5 September 2010 (2010-09-05) * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||