Disclosure of Invention
In order to solve the above problems, the present application provides an FPGA-based floating-point multiplier, a calculation method and a device, wherein the floating-point multiplier is applied to floating-point multiplication in the IEEE-754 format and comprises a sign calculation module, an exponent addition module, a binary multiplier module and a normalization module. The sign calculation module is configured to determine the sign of a target output floating-point number through an exclusive-OR gate and the sign bits of a first input floating-point number and a second input floating-point number. The exponent addition module is configured to add the exponents of the first input floating-point number and the second input floating-point number and subtract the bias value of the corresponding format to obtain the exponent output of the target output floating-point number. The binary multiplier module is configured to multiply the first mantissa of the first input floating-point number and the second mantissa of the second input floating-point number by using the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm, based on the mantissa bit widths of the two input floating-point numbers, to obtain the mantissa product of the target output floating-point number. The normalization module is configured to perform a normalization operation based on the mantissa product, the normalization operation comprising at least one of a shifting operation and an exponent-value adjustment.
In one example, multiplying the first mantissa of the first input floating-point number and the second mantissa of the second input floating-point number by using the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm, based on the mantissa bit widths of the two input floating-point numbers, specifically includes: determining that the mantissa bit widths meet a first preset condition, and performing a divide-and-conquer multiplication operation on the first mantissa and the second mantissa by using an improved Karatsuba algorithm until either the mantissa product of the target output floating-point number is obtained or an intermediate product result meets a second preset condition; and, when the intermediate product result meets the second preset condition, inputting the intermediate product result into an improved Urdhva-Tiryagbhyam algorithm and continuing the multiplication operation until the mantissa product of the target output floating-point number is obtained.
In one example, the dividing and multiplying of the first mantissa and the second mantissa by using the improved Karatsuba algorithm specifically includes: determining a length ratio between the first mantissa and the second mantissa; determining a numerical distribution of the first mantissa and the second mantissa, the numerical distribution being the number and positions of zero-valued bits within adjacent preset bit ranges; determining dividing points of the first mantissa and the second mantissa based on the length ratio and the numerical distribution; and dividing the first mantissa and the second mantissa according to the dividing points.
In one example, the performing of the divide-and-conquer multiplication operation on the first mantissa and the second mantissa by using the improved Karatsuba algorithm specifically includes: providing a parallel computing unit, a register and a control logic module in the FPGA-based floating-point multiplier, wherein the parallel computing unit is configured to concurrently execute computing tasks within the same clock cycle, the computing tasks including multiplication, addition and result merging; the register is configured to temporarily store intermediate results and transfer data between different computing tasks; and the control logic module is configured to manage the starting, completion and error handling of each computing task.
In one example, before the intermediate product result is input into the improved Urdhva-Tiryagbhyam algorithm and the multiplication operation is continued, the binary multiplier module is further configured to determine the numbers of computations respectively corresponding to different preset computing tasks when the Urdhva-Tiryagbhyam algorithm is used, where the preset computing tasks include identical value pairs appearing at different time points or different positions, and fixed digit combinations appearing in the multiplication tasks, and to store, in a preset computation table, the input item group of each preset computing task, the type of adder used, and the computation result after hash mapping.
In one example, inputting the intermediate product result into the improved Urdhva-Tiryagbhyam algorithm and continuing the multiplication operation specifically includes: determining a local correlation between the intermediate product result and each input item group in the preset computation table; if the local correlation between an input item group and the intermediate product result is higher than a first preset threshold and lower than a second preset threshold, reading the adder model corresponding to the input item group from the preset computation table and using the adder model as the adder for the intermediate product result; and if the local correlation is higher than the second preset threshold, reading the corresponding computation result from the preset computation table as the computation result of a partial computation task of the first mantissa and the second mantissa.
In one example, the binary multiplier module has a plurality of types of adders disposed therein, wherein the types of adders include at least one of a carry propagate adder, a carry save adder, a carry select adder, and a carry look ahead adder.
In one example, the binary multiplier module is further configured to dynamically select the type of adder based on the bit width and data characteristics of an intermediate product result, which specifically includes: determining an addition carry probability of the intermediate product result based on a support vector machine model to determine a first adder expectation of the intermediate product result; determining a second adder expectation of the intermediate product result based on the bit width of the intermediate product result; determining a time limit for the addition operation based on a delay requirement set by a user, and determining a third adder expectation of the intermediate product result based on the time limit; determining a fourth adder expectation of the intermediate product result based on the number and types of available hardware resources; and determining the adder type corresponding to the intermediate product result based on the first, second, third and fourth adder expectations.
The application further provides a calculation method for the FPGA-based floating-point multiplier, which includes: determining the sign of a target output floating-point number through an exclusive-OR gate and the sign bits of a first input floating-point number and a second input floating-point number; adding the exponents of the first input floating-point number and the second input floating-point number and subtracting the bias value of the corresponding format to obtain the exponent output of the target output floating-point number; multiplying the first mantissa of the first input floating-point number and the second mantissa of the second input floating-point number by using the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm, based on the mantissa bit widths of the two input floating-point numbers, to obtain the mantissa product of the target output floating-point number; and performing a normalization operation based on the mantissa product, the normalization operation including at least one of a shifting operation and an exponent-value adjustment.
The application also provides a computing device for the FPGA-based floating-point multiplier, which includes at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of: determining the sign of a target output floating-point number through an exclusive-OR gate and the sign bits of a first input floating-point number and a second input floating-point number; adding the exponents of the first input floating-point number and the second input floating-point number and subtracting the bias value of the corresponding format to obtain the exponent output of the target output floating-point number; multiplying the first mantissa of the first input floating-point number and the second mantissa of the second input floating-point number by using the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm to obtain the mantissa product of the target output floating-point number; and performing a normalization operation based on the mantissa product, the normalization operation including at least one of a shifting operation and an exponent-value adjustment.
The method provided by the application has the following beneficial effects:
1. By combining the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm, the time required for the mantissa multiplication operation can be significantly reduced, improving the operating speed of the whole floating-point multiplier. In particular, fast floating-point multiplication can greatly improve system performance and response speed in high-speed application scenarios such as real-time signal processing and graphics rendering.
2. Compared with traditional floating-point multiplier designs, the method adopts this algorithm combination to reduce the number of multipliers and the hardware complexity. In particular, high-precision floating-point operations no longer require a large number of complex hardware circuits, which reduces the chip area requirement. On hardware platforms such as FPGAs, the method makes more effective use of limited resources and reduces manufacturing cost.
3. Because of the reduction of hardware complexity and the improvement of operation speed, the power consumption of the floating-point multiplier in the running process is correspondingly reduced. In applications sensitive to power consumption, such as mobile devices and embedded systems, the characteristic of low power consumption can prolong the endurance time of the device and improve the practicability of the device.
4. The floating point multiplier design provided by the application is based on an IEEE-754 standard format, and has good compatibility with the existing computer system and software. The system can be conveniently integrated into various computer hardware platforms and application programs, and large-scale modification of the existing system is not needed, so that the application cost and popularization difficulty are reduced.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
As shown in FIGS. 1 and 2, the present application provides an FPGA-based floating-point multiplier, which is applied to floating-point multiplication in the IEEE-754 format, where a floating-point number is composed of a sign, an exponent, a mantissa and an exponent base. For the multiplication of two floating-point numbers, the final result is obtained by calculating the product of the signs, exponents and mantissas separately. The floating-point multiplier includes:
And the sign calculation module, which is configured to determine the sign of the target output floating-point number through the exclusive-OR gate and the sign bits of the first input floating-point number and the second input floating-point number. Specifically, the sign calculation module determines the sign of the product based on the most significant bit (MSB) of each input floating-point number. For example, with a simple XOR gate, the sign of the product is positive if the sign bits of the two floating-point numbers are the same, and negative if they differ. This XOR-based sign calculation is simple and efficient, requires no complex circuit structure, can quickly determine the sign of the product, and saves time for the overall floating-point multiplication operation.
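For illustration only (a software mirror of the described XOR-gate rule, not part of the claimed hardware; the function name is ours):

```python
# The product's sign bit is simply the XOR of the two input sign bits
# (0 = positive, 1 = negative), matching a single XOR gate.
def product_sign(sign_a: int, sign_b: int) -> int:
    return sign_a ^ sign_b
```

Equal sign bits give 0 (a positive product); differing sign bits give 1 (a negative product), as described above.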
And the exponent addition module, which is configured to add the exponents of the first input floating-point number and the second input floating-point number and subtract the bias value of the corresponding format to obtain the exponent output of the target output floating-point number. Specifically, the exponent addition module adds the input exponents and subtracts the bias value of the corresponding format to obtain the actual product exponent. In the IEEE-754 standard, the exponent in single-precision format is 8 bits wide with a bias of 127, and the exponent in double-precision format is 11 bits wide with a bias of 1023. Because the mantissa multiplication takes much longer than the exponent addition, a simple ripple-carry adder and ripple-borrow subtractor are adopted for the exponent addition and subtraction; they occupy few resources and reduce hardware complexity and cost.
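The biased-exponent arithmetic above can be sketched as follows; this is an illustrative software model, not the ripple-carry hardware itself:

```python
# E_out = E_a + E_b - bias, where the bias is 127 (single) or 1023 (double)
# per IEEE-754. Inputs and output are the stored (biased) exponents.
BIAS = {"single": 127, "double": 1023}

def exponent_out(e_a: int, e_b: int, fmt: str = "single") -> int:
    return e_a + e_b - BIAS[fmt]
```

For example, 2^3 (stored exponent 130) times 2^4 (stored exponent 131) in single precision yields stored exponent 130 + 131 - 127 = 134, i.e. 2^7, as expected.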
And the binary multiplier module, which is configured to multiply the first mantissa of the first input floating-point number and the second mantissa of the second input floating-point number by using the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm, based on the mantissa bit widths of the two input floating-point numbers, so as to obtain the mantissa product of the target output floating-point number. In floating-point multiplication, the most complex and time-critical part is the mantissa multiplication. Multiplication requires more time than addition, and as the number of bits increases it consumes more area and time. In order to reduce the area and improve the efficiency of the multiplication, the invention adopts a method combining the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm.
And the result normalization module, which is configured to perform a normalization operation based on the mantissa product, the normalization operation including at least one of shifting the mantissa and adjusting the exponent value. In the IEEE-754 floating-point representation, the mantissa has one hidden bit, normally 1, which receives special handling during storage and operation. Because each input mantissa (with its hidden bit) lies in the range [1, 2), their product lies in [1, 4). When the most significant bit of the multiplication result is not in the position immediately to the left of the fraction, i.e. when the product is greater than or equal to 2, the result must be normalized to meet the IEEE-754 standard format: the mantissa is shifted right by one bit and the exponent value is incremented by 1. Conversely, if a left shift is needed to bring the leading 1 into position, the exponent value is decremented by 1 for each left-shift operation.
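A minimal software sketch of this normalization step (a fixed-point integer model of the mantissa datapath; rounding is deliberately omitted):

```python
def normalize(product: int, exponent: int, frac_bits: int):
    """Normalize a raw mantissa product.

    `product` is the integer product of two (frac_bits + 1)-bit mantissas
    that both include the hidden leading 1, so its real value lies in
    [1, 4) at scale 2**(2 * frac_bits).
    """
    top = 2 * frac_bits + 1          # this bit is set iff the product >= 2
    if product >> top:               # product in [2, 4): one right shift
        product >>= 1
        exponent += 1                # compensate the shift in the exponent
    # keep the leading 1 plus frac_bits fraction bits (truncation)
    mantissa = product >> frac_bits
    return mantissa, exponent
```

With 3 fraction bits, 1.5 × 1.5 (mantissas 0b1100 × 0b1100 = 144) normalizes to mantissa 0b1001 (1.125) with the exponent incremented, i.e. 2.25 = 1.125 × 2^1.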
By combining the advantages of the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm, the application not only reduces delay but also limits the increase in hardware area, offering significant advantages over traditional methods.
Specifically, as shown in fig. 3, when the binary multiplier module performs floating-point multiplication, the Karatsuba algorithm is a divide-and-conquer algorithm suitable for multiplying high-bit-width numbers: the larger the input bit width, the more efficient the algorithm. At lower bit widths, however, the Karatsuba algorithm is inefficient. To solve this problem, the invention uses the Urdhva-Tiryagbhyam algorithm, which suits low bit widths, at the lower levels. Specifically, it is determined that the mantissa bit width satisfies a first preset condition (e.g., the bit widths of the operands are all greater than eight bits), and the improved Karatsuba algorithm is used to perform a divide-and-conquer multiplication on the first mantissa and the second mantissa until either the mantissa product of the target output floating-point number is obtained or the intermediate product result satisfies a second preset condition (e.g., the bit widths of the operands are all eight bits). When the intermediate product result meets the second preset condition, the intermediate product result is input into the improved Urdhva-Tiryagbhyam algorithm, and the multiplication continues until the mantissa product of the target output floating-point number is obtained.
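The recursion described above can be sketched in software as follows. The eight-bit cutoff and the plain `*` at the base (standing in for the Urdhva-Tiryagbhyam stage) are illustrative assumptions; the hardware would use the crosswise partial-product unit instead:

```python
THRESHOLD_BITS = 8  # assumed second preset condition: hand off below this width

def hybrid_multiply(a: int, b: int, bits: int) -> int:
    """Karatsuba recursion that delegates narrow operands to a base multiplier."""
    if bits <= THRESHOLD_BITS:
        return a * b                 # stand-in for the Urdhva-Tiryagbhyam stage
    half = bits // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask
    # Karatsuba identity: three sub-multiplications instead of four
    p_hi = hybrid_multiply(a_hi, b_hi, bits - half)
    p_lo = hybrid_multiply(a_lo, b_lo, half)
    p_mid = hybrid_multiply(a_hi + a_lo, b_hi + b_lo, bits - half + 1) - p_hi - p_lo
    return (p_hi << (2 * half)) + (p_mid << half) + p_lo
```

Because the base case returns an exact product, the recursion is exact for any operand width; only the hand-off threshold changes where the second stage takes over.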
The differences between the improved Karatsuba algorithm of the present application and the existing Karatsuba algorithm are described herein:
The traditional Karatsuba algorithm adopts a fixed segmentation mode: it splits each large integer into two parts and then recursively computes the result of each group of sub-problems. This approach works in most cases but may not be optimal in some special cases. For example, when one operand is much larger than the other, direct application of Karatsuba may introduce unnecessary complexity.
Therefore, the application introduces a dynamic segmentation strategy that adaptively adjusts the segmentation points according to the characteristics of the input data. For unbalanced operands, the larger number may be split asymmetrically so that each recursive call can maximally exploit the speed advantage of the addition operation. This reduces unnecessary recursion levels and lowers overall delay. In the asymmetric division, a length ratio between the first mantissa (or the first operand in the calculation process) and the second mantissa (or the second operand in the calculation process) can be determined, together with a numerical distribution of the first mantissa and the second mantissa, the numerical distribution being the number and positions of zero-valued bits within adjacent preset bit ranges. The dividing points of the first mantissa and the second mantissa are then determined based on the length ratio and the numerical distribution, and the first mantissa and the second mantissa are divided according to the dividing points. When determining the dividing points, the segmentation expectation corresponding to dividing points at different positions can be calculated through preset weights, and a suitable dividing point selected according to the segmentation expectation. In a hardware implementation, the dividing point may be determined according to performance indices of the hardware (e.g., adder delay, multiplier delay, etc.). If the performance overhead of multiply and add operations on operands of different lengths is known for the hardware, the dividing point can be selected based on this information so that the overall computational overhead is minimized.
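One possible scoring of candidate dividing points is sketched below. The weights (2 for the length-ratio term, 1 per zero bit) and the 4-bit zero-inspection window are purely illustrative assumptions; the patent leaves the preset weights open:

```python
def choose_split(wide: int, wide_bits: int, narrow_bits: int) -> int:
    """Score every candidate cut of the wider operand and return the best.

    A cut is rewarded for (1) leaving a high part close in width to the
    narrow operand and (2) sitting just above a run of zero bits, which
    makes the low sub-product cheaper.
    """
    best_split, best_score = wide_bits // 2, float("-inf")
    for split in range(1, wide_bits):
        balance = -abs((wide_bits - split) - narrow_bits)   # length-ratio term
        window = (wide >> max(split - 4, 0)) & 0xF          # 4 bits below the cut
        zeros = 4 - bin(window).count("1")                  # zero-distribution term
        score = 2 * balance + zeros                         # assumed weights
        if score > best_score:
            best_split, best_score = split, score
    return best_split
```

In a hardware realization, the two score terms would instead be derived from the measured adder and multiplier delays mentioned above.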
The asymmetric segmentation provided by the application can reduce the recursion depth and thereby simplify the calculation flow: for unbalanced operands, it can split the larger number into parts closer in size to the smaller number, reducing the number of recursive calls. It also balances the addition load and improves parallel processing capacity, because the addition tasks generated by each recursion can be distributed more uniformly, avoiding excessive or insufficient addition demand at certain stages. Finally, it adapts to operands of different scales and avoids resource waste: for operands with large length differences, symmetric segmentation would waste part of the computing resources on processing redundant data, whereas asymmetric segmentation can be flexibly adjusted to the actual input so that each part is processed most effectively.
Meanwhile, the conventional Karatsuba algorithm is performed serially, i.e., it is necessary to wait for the completion of the previous step before starting the next step. This results in a long delay time in hardware implementation, especially when processing high precision values. Therefore, the application adopts a multi-stage pipeline structure, the task of each stage is decomposed into smaller subtasks, and the subtasks are executed in parallel. Specifically, a plurality of parallel units may be provided in hardware, each of which is responsible for different sub-multiplications and addition operations. This not only reduces overall delay but also improves throughput, since multiple multiplications can be performed at the same time. Specifically, a parallel computing unit, a register and a control logic module are arranged in a binary multiplier module of the floating-point multiplier. The parallel computing unit is used for concurrently executing computing tasks in the same clock period, wherein the computing tasks comprise multiplication computing, addition computing and result merging. The register is used for temporarily storing intermediate results and transmitting data among different computing tasks, and the control logic module is used for managing the starting, the completion and the error processing of each computing task.
In one embodiment, the existing Urdhva-Tiryagbhyam algorithm is essentially a method of vertical and cross multiplication that quickly generates partial products through a series of rules. However, for some common patterns or repeated data combinations, each recalculation results in wasted resources.
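A software rendering of the vertical-and-crosswise rule for binary operands (our own formulation for illustration; the hardware generates these column sums in parallel):

```python
def urdhva_multiply(a: int, b: int, bits: int) -> int:
    """Urdhva-Tiryagbhyam: column k sums all bit pairs a_i * b_j with i + j = k.

    Each column also absorbs the carry from the previous column; the low bit
    of the column sum is the result bit, the rest propagates as the new carry.
    """
    result, carry = 0, 0
    for k in range(2 * bits - 1):
        column = carry
        for i in range(max(0, k - bits + 1), min(k, bits - 1) + 1):
            column += ((a >> i) & 1) & ((b >> (k - i)) & 1)
        result |= (column & 1) << k
        carry = column >> 1
    return result | (carry << (2 * bits - 1))
```

For 4-bit operands, 13 × 11 works column by column: columns 0–3 produce bits 1, 1, 1, 1 with a growing carry, and the carries resolve to the final product 143.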
Thus, in the improved Urdhva-Tiryagbhyam algorithm of the present application, the local correlation of the data can be used for pre-computation and caching when generating partial products. A pre-computation table mechanism is added, in which a pre-computed portion of the multiplication results is stored in a small lookup table. Whenever the same data pattern is encountered, the result can be read directly from the lookup table, avoiding repeated calculation. The lookup table may be tailored to the expected application scenario so as to cover the most frequently occurring data patterns. In addition, compression techniques can be combined to reduce the size of the lookup table and save memory space. Specifically, when the Urdhva-Tiryagbhyam algorithm is used, the numbers of computations corresponding to different preset computing tasks are determined, where the preset computing tasks include identical value pairs appearing at different time points or different positions, and fixed digit combinations appearing in the multiplication tasks; then the input item group of each preset computing task, the type of adder used, and the computation result after hash mapping are stored in a preset computation table.
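A minimal cache sketch of this mechanism. The promotion threshold, the use of Python's built-in `hash`, and the plain `*` standing in for the crosswise datapath are all illustrative assumptions:

```python
class PrecomputeTable:
    """Count repeated input pairs and promote frequent ones to a lookup table."""

    def __init__(self, min_hits: int = 2):
        self.counts = {}     # how often each hash-mapped input pair was seen
        self.table = {}      # hash-mapped pair -> cached multiplication result
        self.min_hits = min_hits

    def multiply(self, a: int, b: int) -> int:
        key = hash((a, b))                      # hash mapping of the input pair
        if key in self.table:
            return self.table[key]              # cache hit: no recomputation
        self.counts[key] = self.counts.get(key, 0) + 1
        result = a * b                          # stand-in for the crosswise datapath
        if self.counts[key] >= self.min_hits:   # frequent pattern: store it
            self.table[key] = result
        return result
```

After a pair has been seen `min_hits` times, subsequent occurrences are served from the table rather than recomputed.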
Further, when the improved Urdhva-Tiryagbhyam algorithm is used for calculation, the local correlation between the intermediate product result and each input item group in the preset computation table must be determined. If the local correlation between an input item group and the intermediate product result is higher than a first preset threshold and lower than a second preset threshold, the adder model corresponding to the input item group is read from the preset computation table and used as the adder for the intermediate product result. If the local correlation is higher than the second preset threshold, the corresponding computation result is read from the preset computation table and used as the computation result of a partial computation task of the first mantissa and the second mantissa. The local correlation is calculated by traversing the two input operands to determine whether they contain numerical duplicates, identical bit-level patterns, or local digits with shared mathematical characteristics. A bit-level pattern here means that certain bit combinations occur frequently in the multiplication operations, for example the low-order or high-order bits are always fixed. Mathematical characteristics here refer to patterns formed by mathematical laws, such as one input operand being a power or a multiple of another input operand or of a historical operand.
In one embodiment, in order to accelerate the calculation process, the floating-point multiplier is provided with multiple types of adders, and an adder can be selected based on the characteristics of the input operands when an addition is performed. In one embodiment, the adder types include a carry-propagate adder, a carry-save adder, a carry-select adder and a carry-lookahead adder.
Each adder suits different input-operand characteristics. Specifically, the carry-propagate adder is suitable for small-scale additions with relaxed delay requirements. The carry-save adder is suited to accumulating the results of multiple additions, reducing carry-propagation delay. The carry-select adder is suitable for large-bit-width data and can quickly generate the final result. The carry-lookahead adder is suitable for delay-sensitive applications such as high-speed operation. When selecting an adder, a set of selection criteria may be defined to determine the adder type best suited to the current situation. These criteria are exemplified here, wherein the dynamic selection of the adder type according to the bit width and data characteristics of the intermediate product result specifically includes:
Determining the addition carry probability of the intermediate product result based on a support vector machine model to determine a first adder expectation of the intermediate product result; determining a second adder expectation based on the bit width of the intermediate product result; determining a time limit for the addition operation based on a delay requirement set by the user, and determining a third adder expectation based on the time limit; determining a fourth adder expectation based on the number and types of available hardware resources; and finally determining the adder type corresponding to the intermediate product result based on the first, second, third and fourth adder expectations. The dynamic selection process can be managed through a decision tree or finite state machine, where each node or state represents one possible selection condition and each edge or transition represents a change from one condition to another.
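A toy scoring model of this selection process. All thresholds and weights below are invented for illustration; the actual expectations would come from the carry-probability model, the delay budget, and the resource report:

```python
ADDERS = ("carry_propagate", "carry_save", "carry_select", "carry_lookahead")

def select_adder(carry_prob: float, bit_width: int,
                 time_limit_ns: float, free_luts: int) -> str:
    """Combine the four 'expectations' into one weighted score per adder type."""
    scores = dict.fromkeys(ADDERS, 0.0)
    scores["carry_save"] += carry_prob        # high carry activity favours CSA
    if bit_width >= 32:
        scores["carry_select"] += 1.0         # wide words favour carry-select
    else:
        scores["carry_propagate"] += 1.0      # narrow words: simplest adder
    if time_limit_ns < 5.0:
        scores["carry_lookahead"] += 2.0      # tight delay budget favours CLA
    if free_luts < 100:
        scores["carry_propagate"] += 1.0      # scarce resources: cheapest adder
    return max(scores, key=scores.get)
```

Such a scorer maps naturally onto the decision tree or finite state machine mentioned above: each condition becomes a branch, and the winning type is the leaf.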
Fig. 6 is a flowchart illustrating a method for calculating with an FPGA-based floating-point multiplier according to one or more embodiments of the present disclosure. The process may be performed by computing devices in the respective areas, and some input parameters or intermediate results in the process allow manual adjustment to help improve accuracy.
The implementation of the analysis method according to the embodiment of the present application may be a terminal device or a server, which is not particularly limited in the present application. For ease of understanding and description, the following embodiments are described in detail with reference to a server.
It should be noted that the server may be a single device, or may be a system composed of a plurality of devices, that is, a distributed server, which is not particularly limited in the present application.
As shown in fig. 6, an embodiment of the present application provides a method for calculating a floating point multiplier based on an FPGA, including:
S601, determining the sign of the target output floating point number through the exclusive OR gate and the sign bits of the first input floating point number and the second input floating point number.
S602, adding the exponents of the first input floating point number and the second input floating point number, and subtracting the deviation value of the corresponding format to obtain the exponent output of the target output floating point number.
S603, based on the mantissa bit widths of the first input floating-point number and the second input floating-point number, performing a multiplication operation on the first mantissa of the first input floating-point number and the second mantissa of the second input floating-point number by using the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm to obtain the mantissa product of the target output floating-point number.
And S604, performing a normalization operation based on the mantissa product, wherein the normalization operation includes at least one of shifting the mantissa and adjusting the exponent value.
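Putting S601 through S604 together, a reference model restricted to normal single-precision numbers (no zeros, subnormals, infinities, NaNs, or rounding) might look like this; it is a software check of the data flow, not the FPGA design:

```python
import struct

def fp32_multiply(x: float, y: float) -> float:
    """End-to-end sketch of S601-S604 for IEEE-754 single precision."""
    ax = struct.unpack(">I", struct.pack(">f", x))[0]
    ay = struct.unpack(">I", struct.pack(">f", y))[0]
    # S601: sign via XOR of the sign bits
    sign = (ax >> 31) ^ (ay >> 31)
    # S602: add biased exponents, subtract the format's bias (127)
    exp = ((ax >> 23) & 0xFF) + ((ay >> 23) & 0xFF) - 127
    # S603: multiply mantissas with the hidden leading 1 restored
    mx, my = (ax & 0x7FFFFF) | 0x800000, (ay & 0x7FFFFF) | 0x800000
    product = mx * my                      # stands in for the hybrid multiplier
    # S604: normalize - product of two [1, 2) mantissas lies in [1, 4)
    if product >> 47:                      # product >= 2: right shift once
        product >>= 1
        exp += 1
    mantissa = (product >> 23) & 0x7FFFFF  # drop hidden bit, truncate
    bits = (sign << 31) | (exp << 23) | mantissa
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

On exactly representable inputs the model reproduces the true product, e.g. 1.5 × 2.5 = 3.75; truncation (instead of round-to-nearest) only shows up on results needing more than 23 fraction bits.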
The embodiments of the present application are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for the apparatus and medium embodiments, the description is relatively simple, as it is substantially similar to the method embodiments, with reference to the section of the method embodiments being relevant.
The devices and media provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not repeated here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises that element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.