HK40017940A - Providing efficient floating-point operations using matrix processors in processor-based systems - Google Patents
Description
Priority application
The present application claims priority from U.S. provisional patent application No. 62/552,890, filed on August 31, 2017, and entitled "PROVIDING EFFICIENT FLOATING POINT ADDITION OPERATIONS USING MATRIX PROCESSORS IN PROCESSOR-BASED SYSTEMS," the contents of which are incorporated herein by reference in their entirety.
The present application also claims priority from U.S. patent application No. 16/118,099, filed on August 30, 2018, and entitled "PROVIDING EFFICIENT FLOATING POINT OPERATIONS USING MATRIX PROCESSORS IN PROCESSOR-BASED SYSTEMS."
Technical Field
The present technology relates generally to matrix processing in processor-based systems, and more particularly to techniques and apparatus for efficient floating point operations suitable for matrix multiplication.
Background
The field of machine learning relates to developing and studying algorithms that can make data-driven predictions or decisions by building models from sample inputs. Machine learning can be applied to compute tasks when designing and programming explicit algorithms with acceptable performance is difficult or infeasible. One class of machine learning techniques, referred to as "deep learning," employs an Artificial Neural Network (ANN) containing multiple hidden layers to perform tasks such as pattern analysis and classification. The ANN is first "trained" by determining operational parameters based on typical inputs and instances of corresponding desired outputs. The ANN may then perform "inference" in which the determined operational parameters are used to classify, identify, and/or process the new input.
In an ANN for deep learning, each hidden layer within the ANN uses the output of the previous layer as input. Because each layer is represented as a two-dimensional matrix, most of the computational operations involved in deep learning consist of matrix multiplication operations. Accordingly, optimization of matrix multiplication operations has the potential to greatly improve the performance of deep learning applications. In particular, processing units that perform floating-point matrix multiplication occupy more chip area and consume more power than processing units that perform integer-based matrix multiplication. Therefore, a more efficient apparatus for performing floating-point matrix multiplication operations is desirable.
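The centrality of dot products in these matrix multiplication operations can be sketched in a few lines (an illustrative Python model only, not part of the disclosed apparatus): each element of the output matrix is an independent dot product of a row of the first input matrix and a column of the second.

```python
# Illustrative only: each output element of C = A x B is a dot
# product of a row of A and a column of B, so multiplying n x n
# matrices performs n*n dot products, each of length n.
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

c = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# c == [[19, 22], [43, 50]]
```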
Disclosure of Invention
Aspects disclosed in the detailed description include providing efficient floating-point operations using matrix processors in processor-based systems. In this regard, in one aspect, a matrix-processor-based device is provided that includes a matrix processor. The matrix processor includes a positive partial sum accumulator and a negative partial sum accumulator. As the matrix processor processes multiple pairs of floating-point operands (such as when performing a matrix multiplication operation, as a non-limiting example), the matrix processor calculates an intermediate product based on a first floating-point operand and a second floating-point operand. After determining the sign of the intermediate product (i.e., whether the intermediate product is positive or negative), the matrix processor normalizes the intermediate product with the fraction of either the positive partial sum accumulator or the negative partial sum accumulator, depending on the sign. The matrix processor then adds the intermediate product to the positive partial sum accumulator if the intermediate product is positive, or to the negative partial sum accumulator if the intermediate product is negative. After processing all of the pairs of floating-point operands, the matrix processor subtracts the value of the negative partial sum accumulator from the value of the positive partial sum accumulator to arrive at a final sum, and then renormalizes the final sum only once (as opposed to performing renormalization after adding each intermediate product). In this manner, the matrix processor reduces the number of processor cycles spent on renormalization, thereby improving power consumption and overall processor performance.
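The two-accumulator scheme summarized above can be modeled at a high level as follows (an illustrative Python sketch; the actual matrix processor operates on separate fraction and exponent fields rather than on native floating-point values):

```python
def dot_product_two_accumulators(pairs):
    """Model of the positive/negative partial sum accumulator scheme:
    each intermediate product is added to exactly one accumulator
    based on its sign, and a single final subtraction produces the
    final sum, which is renormalized only once in hardware."""
    positive_sum = 0.0  # positive partial sum accumulator
    negative_sum = 0.0  # negative partial sum accumulator (magnitudes)
    for a, b in pairs:
        product = a * b  # intermediate product
        if product >= 0:
            positive_sum += product
        else:
            negative_sum += -product
    return positive_sum - negative_sum

result = dot_product_two_accumulators([(1.5, 2.0), (-0.5, 4.0), (3.0, -1.0)])
# result == 3.0 - (2.0 + 3.0) == -2.0
```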
In another aspect, a matrix-processor-based device is provided. The matrix-processor-based device includes a matrix processor that includes a positive partial sum accumulator and a negative partial sum accumulator. The matrix processor is configured to determine, for each first floating-point operand and second floating-point operand of a plurality of pairs of floating-point operands, a sign of an intermediate product of the first floating-point operand and the second floating-point operand, the sign indicating whether the intermediate product is positive or negative. The matrix processor is further configured to normalize the intermediate product with a partial sum fraction comprising one of a fraction of the positive partial sum accumulator and a fraction of the negative partial sum accumulator, based on the sign of the intermediate product. The matrix processor is also configured to add the intermediate product to one of the positive partial sum accumulator and the negative partial sum accumulator, based on the sign of the intermediate product. The matrix processor is additionally configured to subtract a value of the negative partial sum accumulator from a value of the positive partial sum accumulator to generate a final sum. The matrix processor is further configured to renormalize the final sum.
In another aspect, a matrix-processor-based device is provided. The matrix-processor-based device comprises means for determining, for each first floating-point operand and second floating-point operand of a plurality of pairs of floating-point operands, a sign of an intermediate product of the first floating-point operand and the second floating-point operand, the sign indicating whether the intermediate product is positive or negative. The matrix-processor-based device further comprises means for normalizing the intermediate product with a partial sum fraction comprising one of a fraction of a positive partial sum accumulator and a fraction of a negative partial sum accumulator, based on the sign of the intermediate product. The matrix-processor-based device also comprises means for adding the intermediate product to one of the positive partial sum accumulator and the negative partial sum accumulator, based on the sign of the intermediate product. The matrix-processor-based device additionally comprises means for subtracting a value of the negative partial sum accumulator from a value of the positive partial sum accumulator to generate a final sum. The matrix-processor-based device further comprises means for renormalizing the final sum.
In another aspect, a method for providing efficient floating-point operations is provided. The method comprises determining, by a matrix processor of a matrix-processor-based device, for each first floating-point operand and second floating-point operand of a plurality of pairs of floating-point operands, a sign of an intermediate product of the first floating-point operand and the second floating-point operand, the sign indicating whether the intermediate product is positive or negative. The method further comprises normalizing the intermediate product with a partial sum fraction comprising one of a fraction of a positive partial sum accumulator and a fraction of a negative partial sum accumulator, based on the sign of the intermediate product. The method also comprises adding the intermediate product to one of the positive partial sum accumulator and the negative partial sum accumulator, based on the sign of the intermediate product. The method additionally comprises subtracting a value of the negative partial sum accumulator from a value of the positive partial sum accumulator to generate a final sum. The method further comprises renormalizing the final sum.
Drawings
FIGS. 1A and 1B are block diagrams of an exemplary processor-based system including a matrix processor configured to provide efficient floating-point matrix multiplication;
FIG. 2 is a block diagram showing a conventional operation of performing a floating-point matrix multiply operation;
FIG. 3 is a block diagram depicting exemplary operations of the matrix processor of FIGS. 1A and 1B for efficiently performing floating-point matrix multiply operations;
FIG. 4 is a flow diagram depicting exemplary operations of the processor-based system of FIGS. 1A and 1B for efficiently performing floating-point matrix multiplication using a matrix processor; and
FIG. 5 is a block diagram of an exemplary processor-based system that may include the matrix processor of FIGS. 1A and 1B for providing efficient floating point operations.
Detailed Description
Referring now to the drawings, several exemplary aspects of the invention are described. The word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any aspect described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects.
Aspects disclosed in the detailed description include providing efficient floating-point operations using matrix processors in processor-based systems. In this regard, FIGS. 1A and 1B depict an exemplary matrix-processor-based device 100 configured to provide efficient floating-point matrix multiplication using matrix processors. Referring to FIG. 1A, the matrix-processor-based device 100 provides a host system 102, which in some aspects may comprise an x86-based server computer, as a non-limiting example. The host system 102 includes a processor 104 (e.g., one or more central processing units (CPUs), processors, and/or processor cores) and memory 106 (e.g., double data rate synchronous dynamic random access memory (DDR SDRAM)). The matrix-processor-based device 100 further provides a Peripheral Component Interconnect Express (PCIe) card 108, on which a system-on-chip (SoC) 110 is configured to communicate with the host system 102 via a PCIe interface 112 of the host system 102 and a PCIe interface 114 of the SoC 110. The PCIe card 108 also includes DDR memory 116 and high-bandwidth memory (HBM) 118, which interface with the SoC 110 via a memory controller 120 and a memory controller 122, respectively.
The SoC 110 provides a command processor 124, which in some aspects may comprise an x86-based processor, as a non-limiting example. The SoC 110 also includes a direct memory access (DMA) unit 126 that is configured to move data to and from the DDR memory 116 and the PCIe interface 114, and thereby to and from the host system 102. The SoC 110 of FIG. 1A provides eight (8) processor slices ("slices") 128(0) through 128(7), which are interconnected by a network-on-chip (NoC) 130. It is to be understood that, in some aspects, the SoC 110 may include more or fewer slices 128(0) through 128(7) than are depicted in FIG. 1A.
To illustrate the constituent elements of the slices 128(0) through 128(7), FIG. 1A shows an expanded view of the slice 128(7). The slice 128(7) comprises a plurality of microprocessors 132(0) through 132(P), along with a local scratchpad 134 and a global scratchpad 136. The local scratchpad 134 is a high-bandwidth memory that is accessible only by the microprocessors 132(0) through 132(P) of the slice 128(7). In contrast, the global scratchpad 136 is a lower-bandwidth memory that is accessible by any of the slices 128(0) through 128(7). To move data into and out of the local scratchpad 134 and the global scratchpad 136, the slice 128(7) provides a DMA unit 138 that is communicatively coupled to the NoC 130. It is to be understood that, in this example, each of the slices 128(0) through 128(6) includes elements corresponding to those of the slice 128(7) described above.
FIG. 1B provides a more detailed view of the constituent elements of the microprocessors 132(0) through 132(P) of the slice 128(7) of FIG. 1A, using the microprocessor 132(P) as an example. As seen in FIG. 1B, the microprocessor 132(P) provides a scalar processor 140 and a vector processor 142. The microprocessor 132(P) further provides a plurality of matrix processors 144(0) through 144(M). In the example of FIG. 1B, the matrix processors 144(0) through 144(M) are configured to use 16-bit floating-point precision, as higher precision is both unnecessary for many machine learning applications and would result in reduced performance. The scalar processor 140, the vector processor 142, and the matrix processors 144(0) through 144(M) are controlled by a CPU 146, which in some aspects provides a specialized instruction set for matrix processing. It is to be understood that, in the example of FIG. 1B, each of the microprocessors 132(0) through 132(P) includes elements corresponding to those of the microprocessor 132(P) described above.
The matrix-processor-based device 100 and its constituent elements as depicted in FIGS. 1A and 1B may encompass any known digital logic elements, semiconductor circuits, processing cores, and/or memory structures, among other elements, or combinations thereof. The aspects described herein are not restricted to any particular arrangement of elements, and the disclosed techniques may be easily extended to various structures and layouts on semiconductor dies or packages. It is to be understood that some aspects of the matrix-processor-based device 100 may include elements in addition to those illustrated in FIGS. 1A and 1B, and/or may omit some of the elements illustrated in FIGS. 1A and 1B.
To perform matrix multiplication, each element of the output matrix is calculated as a "dot product": the sum of the products of the elements of a row of the first input matrix and the elements of a corresponding column of the second input matrix. Some deep learning applications that may employ the matrix processors 144(0) through 144(M) of FIG. 1B require floating-point precision when performing matrix multiplication operations. However, floating-point matrix multiplication operations conventionally require a matrix processor that occupies a larger chip area and consumes more power than is required for integer matrix multiplication operations. Before describing operations for providing efficient floating-point operations as disclosed herein, the operations of a conventional matrix processor for performing floating-point matrix multiplication are first discussed. In this regard, FIG. 2 is provided. In FIG. 2, a first floating-point operand 200 is represented by a one-bit (1-bit) sign 202, a five-bit (5-bit) exponent 204, and a 10-bit fraction 206. Likewise, a second floating-point operand 208 is represented by a one-bit (1-bit) sign 210, a five-bit (5-bit) exponent 212, and a 10-bit fraction 214. The first floating-point operand 200 and the second floating-point operand 208 may be one pair of a plurality of pairs of floating-point operands to be multiplied and added to generate a dot product. A partial sum, represented by a one-bit (1-bit) sign 216, an eight-bit (8-bit) exponent 218, and a 23-bit fraction 220, is maintained in an accumulator to sum the results of the multiplication operations. It is to be understood that, in some aspects, the sizes of the exponents 204, 212 and the fractions 206, 214 may be greater or smaller than the five-bit (5-bit) exponents 204, 212 and the 10-bit fractions 206, 214 shown in FIG. 2.
To perform a floating-point multiplication when generating a dot product, the exponent 204 and the exponent 212 are first added together, as indicated by element 222. Note that, in the example of FIG. 2, element 222 also receives an input having a value of 97. This value is a constant that depends on the exponent biases of the input operands and of the calculated partial sum. In this example, the value 97 is calculated as 2 × (-15) + 127, where the value 15 represents the exponent bias of half-precision floating-point (FP16) numbers, and the value 127 represents the exponent bias of single-precision floating-point (FP32) numbers. The result calculated at element 222 is compared to the partial sum exponent 218, as indicated by element 224, and the comparison result is forwarded to elements 226 and 228, which are discussed in greater detail below. Next, the fraction 206 and the fraction 214 are multiplied. Because the leading "1" of the binary representations of the fractions 206 and 214 is "hidden" (i.e., not stored as part of the fractions 206 and 214), a "hidden" 1 is restored to each of the fractions 206, 214 by prepending a bit having a value of one (1) as the most significant bit of the fractions 206, 214, as indicated by elements 230 and 232. The resulting 11-bit values of the fractions 206 and 214 are then multiplied at element 234. The 22-bit product 236 is padded on the right with two zero (0) bits 238, and the resulting 24-bit value (referred to herein as the "intermediate product") is forwarded to element 228.
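The exponent arithmetic and hidden-bit handling described above can be checked with a small bit-level sketch (hypothetical Python; the field widths follow FIG. 2, and the function name is illustrative only):

```python
FP16_BIAS, FP32_BIAS = 15, 127
EXP_ADJUST = 2 * (-FP16_BIAS) + FP32_BIAS  # the constant 97 at element 222

def multiply_fp16_fields(sign1, exp1, frac1, sign2, exp2, frac2):
    """Multiply two FP16 operands given as raw bit fields, producing a
    sign, an FP32-biased exponent, and a 24-bit intermediate product."""
    sign = sign1 ^ sign2                 # product sign from XOR of signs
    exponent = exp1 + exp2 + EXP_ADJUST  # FP32-biased product exponent
    m1 = (1 << 10) | frac1               # restore the hidden leading 1
    m2 = (1 << 10) | frac2
    product = (m1 * m2) << 2             # 22-bit product padded to 24 bits
    return sign, exponent, product

# 1.5 (exp 15, frac 0x200) x 2.0 (exp 16, frac 0x000) = 3.0
sign, exponent, product = multiply_fp16_fields(0, 15, 0x200, 0, 16, 0x000)
# exponent == 128 (i.e., unbiased exponent 1), product == 0x600000
```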
At element 228, the intermediate product of the fractions 206 and 214 is compared to the fraction 220 of the partial sum (after a "hidden" 1 is restored to the fraction 220 at element 240 by prepending a bit having a value of one (1) as the most significant bit of the fraction 220). The larger of the two values remains unchanged, while the smaller of the two is normalized by shifting it to the right, as indicated by element 242. After the smaller value is normalized, the intermediate product of the fractions 206 and 214 is added to or subtracted from the fraction 220 as appropriate, as indicated by element 244. In particular, at element 244, if an exclusive-OR operation performed on the sign 202 of the first floating-point operand 200 and the sign 210 of the second floating-point operand 208 evaluates to true, the intermediate product is subtracted from the fraction 220, whereas if the exclusive-OR operation evaluates to false, the intermediate product is added to the fraction 220.
The final result is then renormalized at element 246. The renormalization process involves locating the leading "1" within the binary representation of the result, and then shifting the bits of the result to the left until the leading "1" has been shifted out of the binary representation. The partial sum exponent 218 is also adjusted as necessary, based on the renormalization and on the sum of the exponents 204 and 212 from element 224.
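The renormalization step can be modeled as follows (an illustrative Python sketch, assuming a value represented as an integer fraction scaled by a power of two and a 23-bit stored fraction field as in FIG. 2):

```python
def renormalize(fraction, exponent, frac_bits=23):
    """Normalize value = fraction * 2**exponent into the form 1.f * 2**e,
    returning the stored fraction f (with the leading 1 shifted out, i.e.
    hidden) and the adjusted unbiased exponent e."""
    if fraction == 0:
        return 0, 0
    msb = fraction.bit_length() - 1        # position of the leading 1
    e = exponent + msb                     # exponent of the leading 1
    if msb >= frac_bits:                   # align leading 1 above the
        f = fraction >> (msb - frac_bits)  # stored fraction field
    else:
        f = fraction << (frac_bits - msb)
    f &= (1 << frac_bits) - 1              # shift out (hide) the leading 1
    return f, e

# 3 * 2**0 = 1.1b * 2**1  ->  stored fraction 0x400000, exponent 1
f, e = renormalize(3, 0)
# (f, e) == (0x400000, 1)
```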
As the sizes of the matrices being multiplied grow, the number of operations required to perform a floating-point matrix multiplication operation increases significantly. As a non-limiting example, consider a matrix processor configured to multiply two 32 × 32 matrices (i.e., each matrix having 32 rows and 32 columns). If the matrix processor provides 1,024 multiply/accumulate (MAC) units, each MAC unit must perform a total of 32 floating-point multiplication operations in calculating its dot product, for a total of 32,768 floating-point multiplication operations. It is therefore desirable to optimize the process of multiplying floating-point values to reduce the required processing time and power.
In this regard, FIG. 3 is provided to illustrate exemplary elements of the matrix processors 144(0) through 144(M) of FIG. 1B, and the operations performed thereby, to more efficiently multiply floating-point values during matrix multiplication operations. In the example of FIG. 3, data is processed in a manner similar to that described above with respect to FIG. 2, except that the sum of the fraction 220 and the products of the fractions 206 and 214 is not renormalized during the calculation of each dot product, as is performed at element 246 of FIG. 2. Instead, each of the matrix processors 144(0) through 144(M) maintains two accumulators: a positive partial sum accumulator 300 for storing positive partial sums, and a negative partial sum accumulator 302 for storing negative partial sums. In the example of FIG. 3, the positive partial sum accumulator 300 and the negative partial sum accumulator 302 comprise 31-bit fractions 304 and 306, respectively, and eight-bit (8-bit) exponents 308 and 310, respectively.
In exemplary operation, after the fractions 206 and 214 are multiplied to produce the intermediate product 312, the matrix processors 144(0) through 144(M) determine a sign of the intermediate product 312 (i.e., whether the intermediate product 312 is positive or negative). The intermediate product 312 is then normalized with a "partial sum fraction," where the fraction 304 of the positive partial sum accumulator 300 is used as the partial sum fraction if the intermediate product 312 is positive, and the fraction 306 of the negative partial sum accumulator 302 is used as the partial sum fraction if the intermediate product 312 is negative. In some aspects, normalizing the intermediate product 312 with the partial sum fraction may comprise performing a bitwise right shift operation on the smaller of the intermediate product 312 and the partial sum fraction. During each processor cycle, only one of the positive partial sum accumulator 300 and the negative partial sum accumulator 302 is updated, based on the sign of the intermediate product 312 (i.e., the intermediate product 312 is added to the positive partial sum accumulator 300 if the intermediate product 312 is positive, or to the negative partial sum accumulator 302 if the intermediate product 312 is negative). The other of the positive partial sum accumulator 300 and the negative partial sum accumulator 302 (i.e., the one corresponding to the inverse of the sign of the intermediate product 312) may be clock-gated so that it does not consume power. At the end of the dot product calculation, the value stored in the negative partial sum accumulator 302 is subtracted from the value stored in the positive partial sum accumulator 300 to produce a final sum 314, which is renormalized only once. Renormalization thus still consumes power, but only during a single processor cycle at the end of the dot product calculation.
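The per-cycle normalization by right-shifting the smaller value can be sketched as follows (an illustrative Python model; magnitudes are represented as integer fractions scaled by powers of two, and the function name is an assumption):

```python
def align_and_accumulate(acc_frac, acc_exp, prod_frac, prod_exp):
    """Align two magnitudes (value = frac * 2**exp) by right-shifting the
    one with the smaller exponent, then add; this models adding an
    intermediate product into the matching partial sum accumulator
    without renormalizing the result."""
    if acc_exp >= prod_exp:
        prod_frac >>= (acc_exp - prod_exp)  # shift the smaller value right
        exp = acc_exp                       # (low-order bits are discarded)
    else:
        acc_frac >>= (prod_exp - acc_exp)
        exp = prod_exp
    return acc_frac + prod_frac, exp

# 8 * 2**0 + 8 * 2**1 = 24, held as 12 * 2**1 without renormalization
total, exp = align_and_accumulate(0b1000, 0, 0b1000, 1)
# (total, exp) == (12, 1)
```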
In some aspects, renormalization may be spread across multiple processor cycles, and the hardware performing the renormalization may be shared. As a non-limiting example, if a matrix multiplication operation requires 32 processor cycles, the renormalization process may also be completed in 32 cycles if it is performed in a second pipeline stage. Although the overall latency of the matrix multiplication operation increases to a total of 64 processor cycles, the throughput remains the same (i.e., one matrix multiplication operation per 32 processor cycles). By enabling the renormalization logic to "loop" over the various output partial sum registers, the renormalization requires less logic overall. In addition, the operations illustrated in FIG. 3 require less logic in the critical path, which allows the matrix processors 144(0) through 144(M) to operate at a higher clock frequency.
It should be noted that the operations depicted in FIG. 3 may adversely affect the precision of the floating-point multiplication operation, depending on the order of the input values. To compensate for this possibility, the positive partial sum accumulator 300 and the negative partial sum accumulator 302 according to some aspects may each provide more bits in the fractions 304 and 306 relative to the fractions 206, 214. For example, in FIG. 3, the fractions 206, 214 each comprise 10 bits, while the fractions 304, 306 each comprise 31 bits. Likewise, some aspects may also provide additional bits in the exponents 308 and 310 of the positive partial sum accumulator 300 and the negative partial sum accumulator 302, respectively. In the example of FIG. 3, the exponents 204, 212 each comprise five (5) bits, while the exponents 308, 310 each comprise eight (8) bits.
In some aspects, the matrix processors 144(0) through 144(M) may be configured to compute sums rather than dot products, which would require only units for performing addition and subtraction, rather than also requiring multiplication units. The matrix processors 144(0) through 144(M) according to some aspects may also be configured to use different resolutions for the input values and the partial sum accumulators. For example, rather than using 32 bits for each of the positive partial sum accumulator 300 and the negative partial sum accumulator 302, the positive partial sum accumulator 300 and the negative partial sum accumulator 302 may each comprise 64 bits.
FIG. 4 is provided to illustrate exemplary operations of the matrix-processor-based device 100 of FIGS. 1A and 1B for performing efficient floating-point matrix multiplication using a matrix processor. For the sake of clarity, elements of FIGS. 1A, 1B, and 3 are referenced in describing FIG. 4. The operations in FIG. 4 begin with a series of operations performed by a matrix processor, such as one of the matrix processors 144(0) through 144(M) of FIG. 1B, for each first floating-point operand 200 and second floating-point operand 208 of a plurality of pairs of floating-point operands (block 400). In some aspects, the matrix processors 144(0) through 144(M) may multiply the first fraction 206 of the first floating-point operand 200 by the second fraction 214 of the second floating-point operand 208 to generate an intermediate product (block 402). In this regard, the matrix processors 144(0) through 144(M) may be referred to herein as "a means for multiplying a first fraction of a first floating-point operand by a second fraction of a second floating-point operand to generate an intermediate product."
The matrix processors 144(0) through 144(M) determine a sign of an intermediate product 312 of the first floating-point operand 200 and the second floating-point operand 208, the sign indicating whether the intermediate product 312 is positive or negative (block 404). Accordingly, the matrix processors 144(0) through 144(M) may be referred to herein as "a means for determining, for each first floating-point operand and second floating-point operand of a plurality of pairs of floating-point operands, a sign of an intermediate product of the first floating-point operand and the second floating-point operand, the sign indicating whether the intermediate product is positive or negative." The matrix processors 144(0) through 144(M) then normalize the intermediate product 312 with a partial sum fraction comprising one of the fraction 304 of the positive partial sum accumulator 300 and the fraction 306 of the negative partial sum accumulator 302, based on the sign of the intermediate product 312 (block 406). The matrix processors 144(0) through 144(M) thus may be referred to herein as "a means for normalizing an intermediate product with a partial sum fraction comprising one of a fraction of a positive partial sum accumulator and a fraction of a negative partial sum accumulator, based on a sign of the intermediate product." In some aspects, the operations of block 406 for normalizing the intermediate product with the partial sum fraction may comprise performing a bitwise right shift operation on the smaller of the intermediate product 312 and the partial sum fraction (block 408). In this regard, the matrix processors 144(0) through 144(M) may be referred to herein as "a means for performing a bitwise right shift operation on the smaller of the intermediate product and the partial sum fraction."
The matrix processors 144(0) through 144(M) then add the intermediate product 312 to one of the positive partial sum accumulator 300 and the negative partial sum accumulator 302, based on the sign of the intermediate product 312 (block 410). Accordingly, the matrix processors 144(0) through 144(M) may be referred to herein as "a means for adding an intermediate product to one of a positive partial sum accumulator and a negative partial sum accumulator, based on a sign of the intermediate product." According to some aspects, the matrix processors 144(0) through 144(M) may also clock-gate the one of the positive partial sum accumulator 300 and the negative partial sum accumulator 302 corresponding to the inverse of the sign of the intermediate product 312 (block 412). The matrix processors 144(0) through 144(M) thus may be referred to herein as "a means for clock-gating one of a positive partial sum accumulator and a negative partial sum accumulator corresponding to an inverse of a sign of the intermediate product."
After each of the plurality of pairs of floating-point operands has been processed, the matrix processors 144(0) through 144(M) subtract the value of the negative partial sum accumulator 302 from the value of the positive partial sum accumulator 300 to generate the final sum 314 (block 414). In this regard, the matrix processors 144(0) through 144(M) may be referred to herein as "a means for subtracting a value of the negative partial sum accumulator from a value of the positive partial sum accumulator to generate a final sum." The matrix processors 144(0) through 144(M) then renormalize the final sum 314 (block 416). Accordingly, the matrix processors 144(0) through 144(M) may be referred to herein as "a means for renormalizing the final sum."
Providing efficient floating-point operations using matrix processors in processor-based systems according to aspects disclosed herein may be provided in or integrated into any processor-based device. Examples, without limitation, include a set-top box, an entertainment unit, a navigation device, a communications device, a fixed-location data unit, a mobile-location data unit, a global positioning system (GPS) device, a mobile phone, a cellular phone, a smartphone, a session initiation protocol (SIP) phone, a tablet computer, a phablet, a server, a computer, a portable computer, a mobile computing device, a wearable computing device (e.g., a smart watch, a health or wellness tracker, eyewear, and the like), a desktop computer, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a digital video player, a digital video disc (DVD) player, a portable digital video player, an automobile, a vehicle component, avionics systems, a drone, and a multicopter.
In this regard, FIG. 5 depicts an example of a processor-based system 500 that may correspond to the matrix-processor-based device 100 of FIG. 1A and that may include the matrix processors 144(0) through 144(M) of FIG. 1B. The processor-based system 500 includes one or more CPUs 502, each including one or more processors 504. The CPU(s) 502 may have cache memory 506 coupled to the processor(s) 504 for rapid access to temporarily stored data. The CPU(s) 502 is coupled to a system bus 508, which may intercouple master and slave devices included in the processor-based system 500. As is well known, the CPU(s) 502 communicates with these other devices by exchanging address, control, and data information over the system bus 508. For example, the CPU(s) 502 may communicate bus transaction requests to a memory controller 510, which is an example of a slave device.
Other master and slave devices may be connected to the system bus 508. As an example, as depicted in FIG. 5, these devices may include a memory system 512, one or more input devices 514, one or more output devices 516, one or more network interface devices 518, and one or more display controllers 520. The input device(s) 514 may include any type of input device, including, but not limited to, input keys, switches, voice processors, etc. The output device(s) 516 may include any type of output device, including, but not limited to, audio, video, other visual indicators, etc. The network interface device(s) 518 may be any device configured to allow the exchange of data to and from a network 522. The network 522 may be any type of network, including, but not limited to, a wired or wireless network, a private or public network, a local area network (LAN), a wireless local area network (WLAN), a wide area network (WAN), a BLUETOOTH™ network, and the Internet. The network interface device(s) 518 may be configured to support any type of communications protocol desired. The memory system 512 may include one or more memory units 524(0) through 524(N).
The CPU 502 may also be configured to access a display controller 520 over the system bus 508 to control information transfer to one or more displays 526. Display controller 520 communicates the information to display 526 for display via one or more video processors 528, such as one or more Graphics Processing Units (GPUs), as a non-limiting example, which process the information to be displayed into a form suitable for display 526. Display 526 may include any type of display including, but not limited to, a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a plasma display, and the like.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the aspects disclosed herein may be implemented as electronic hardware, instructions stored in a memory or another computer-readable medium and executed by a processor or other processing device, or combinations of both. As an example, the master and slave devices described herein may be used in any circuit, hardware component, Integrated Circuit (IC), or IC chip. The memory disclosed herein may be any type and size of memory and may be configured to store any type of information as desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How this functionality is implemented depends upon the particular application, design options, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. The processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
Aspects disclosed herein may be embodied in hardware and in instructions stored in hardware, and may reside, for example, in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
It should also be noted that the operational steps described herein in any of the exemplary aspects are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than those depicted. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary aspects may be combined. It is to be understood that the operational steps depicted in the flow diagrams may be subject to numerous different modifications as will be readily apparent to those skilled in the art. Those of skill in the art will also understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
The previous description of the invention is provided to enable any person skilled in the art to make or use the invention. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (20)
1. A matrix processor-based device comprising a matrix processor, the matrix processor comprising a positive partial-sum accumulator and a negative partial-sum accumulator,
the matrix processor configured to:
for each pair of a first floating-point operand and a second floating-point operand in a plurality of pairs of floating-point operands:
determine a sign of an intermediate product of the first floating-point operand and the second floating-point operand, the sign indicating whether the intermediate product is positive or negative;
normalize the intermediate product with a partial-sum fraction comprising one of a fraction of the positive partial-sum accumulator and a fraction of the negative partial-sum accumulator, based on the sign of the intermediate product; and
add the intermediate product to one of the positive partial-sum accumulator and the negative partial-sum accumulator based on the sign of the intermediate product;
subtract a value of the negative partial-sum accumulator from a value of the positive partial-sum accumulator to produce a final sum; and
renormalize the final sum.
2. The matrix processor-based device of claim 1, wherein the matrix processor is further configured to multiply a first fraction of the first floating point operand by a second fraction of the second floating point operand to generate the intermediate product.
3. The matrix processor-based device of claim 2, wherein the matrix processor is configured to normalize the intermediate product with the partial-sum fraction by being configured to perform a bitwise right shift operation on the smaller of the intermediate product and the partial-sum fraction.
4. The matrix processor-based device of claim 1, wherein the matrix processor is further configured to clock gate the one of the positive partial-sum accumulator and the negative partial-sum accumulator corresponding to the inverse of the sign of the intermediate product.
5. The matrix processor-based device of claim 1, wherein:
a fraction of the first floating-point operand and a fraction of the second floating-point operand each comprise 10 bits; and
the fraction of the positive partial-sum accumulator and the fraction of the negative partial-sum accumulator each comprise 31 bits.
6. The matrix processor-based device of claim 1, wherein:
an exponent of the first floating-point operand and an exponent of the second floating-point operand each comprise five (5) bits; and
the exponent of the positive partial-sum accumulator and the exponent of the negative partial-sum accumulator each comprise eight (8) bits.
7. The matrix processor-based device of claim 1, integrated into an integrated circuit (IC).
8. The matrix processor-based device of claim 1, integrated into a device selected from the group consisting of: a set top box, an entertainment unit, a navigation device, a communications device, a fixed location data unit, a mobile location data unit, a Global Positioning System (GPS) device, a mobile phone, a cellular phone, a smartphone, a Session Initiation Protocol (SIP) phone, a tablet computer, a tablet phone, a server, a computer, a portable computer, a mobile computing device, a wearable computing device, a desktop computer, a Personal Digital Assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a digital video player, a Digital Video Disc (DVD) player, a portable digital video player, an automobile, a vehicle component, an avionics system, an unmanned aerial vehicle, and a multi-rotor aircraft.
9. A matrix processor-based device, comprising:
means for determining, for each pair of a first floating-point operand and a second floating-point operand in a plurality of pairs of floating-point operands, a sign of an intermediate product of the first floating-point operand and the second floating-point operand, the sign indicating whether the intermediate product is positive or negative;
means for normalizing the intermediate product with a partial-sum fraction comprising one of a fraction of a positive partial-sum accumulator and a fraction of a negative partial-sum accumulator, based on the sign of the intermediate product;
means for adding the intermediate product to one of the positive partial-sum accumulator and the negative partial-sum accumulator based on the sign of the intermediate product;
means for subtracting a value of the negative partial-sum accumulator from a value of the positive partial-sum accumulator to produce a final sum; and
means for renormalizing the final sum.
10. The matrix processor-based device of claim 9, further comprising means for multiplying a first fraction of the first floating point operand by a second fraction of the second floating point operand to produce the intermediate product.
11. The matrix processor-based device of claim 10, wherein the means for normalizing the intermediate product with the partial-sum fraction comprises means for performing a bitwise right shift operation on the smaller of the intermediate product and the partial-sum fraction.
12. The matrix processor-based device of claim 9, further comprising means for clock gating the one of the positive partial-sum accumulator and the negative partial-sum accumulator corresponding to the inverse of the sign of the intermediate product.
13. The matrix processor-based device of claim 9, wherein:
a fraction of the first floating-point operand and a fraction of the second floating-point operand each comprise 10 bits; and
the fraction of the positive partial-sum accumulator and the fraction of the negative partial-sum accumulator each comprise 31 bits.
14. The matrix processor-based device of claim 9, wherein:
an exponent of the first floating-point operand and an exponent of the second floating-point operand each comprise five (5) bits; and
the exponent of the positive partial-sum accumulator and the exponent of the negative partial-sum accumulator each comprise eight (8) bits.
15. A method for providing efficient floating point operations, comprising:
for each pair of a first floating-point operand and a second floating-point operand in a plurality of pairs of floating-point operands:
determining, by a matrix processor of a matrix processor-based device, a sign of an intermediate product of the first floating-point operand and the second floating-point operand, the sign indicating whether the intermediate product is positive or negative;
normalizing the intermediate product with a partial-sum fraction comprising one of a fraction of a positive partial-sum accumulator and a fraction of a negative partial-sum accumulator, based on the sign of the intermediate product; and
adding the intermediate product to one of the positive partial-sum accumulator and the negative partial-sum accumulator based on the sign of the intermediate product;
subtracting a value of the negative partial-sum accumulator from a value of the positive partial-sum accumulator to produce a final sum; and
renormalizing the final sum.
16. The method of claim 15, further comprising multiplying, by the matrix processor, a first fraction of the first floating point operand by a second fraction of the second floating point operand to produce the intermediate product.
17. The method of claim 16, wherein normalizing the intermediate product with the partial-sum fraction comprises performing a bitwise right shift operation on the smaller of the intermediate product and the partial-sum fraction.
18. The method of claim 15, further comprising clock gating the one of the positive partial-sum accumulator and the negative partial-sum accumulator corresponding to the inverse of the sign of the intermediate product.
19. The method of claim 15, wherein:
a fraction of the first floating-point operand and a fraction of the second floating-point operand each comprise 10 bits; and
the fraction of the positive partial-sum accumulator and the fraction of the negative partial-sum accumulator each comprise 31 bits.
20. The method of claim 15, wherein:
an exponent of the first floating-point operand and an exponent of the second floating-point operand each comprise five (5) bits; and
the exponent of the positive partial-sum accumulator and the exponent of the negative partial-sum accumulator each comprise eight (8) bits.
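To make the accumulation scheme recited in claims 1, 9, and 15 concrete, the following Python sketch models the two partial-sum accumulators as exponent/fraction pairs. This is an illustrative approximation under stated assumptions, not the patented hardware: the function name, the use of Python floats for the operands, the truncating shifts, and the default 31-bit fraction width (matching claims 5, 13, and 19) are all assumed for the example.

```python
import math

def accumulate_products(pairs, frac_bits=31):
    """Sum the products of floating-point operand pairs using separate
    positive and negative partial-sum accumulators, as in claims 1, 9,
    and 15. Each accumulator is an [exponent, fraction] pair whose
    fraction is an unsigned integer magnitude, so the inner loop never
    needs two's-complement subtraction."""
    pos_acc = [0, 0]  # positive partial-sum accumulator
    neg_acc = [0, 0]  # negative partial-sum accumulator

    for a, b in pairs:
        product = a * b
        if product == 0:
            continue
        # Determine the sign of the intermediate product and select
        # the corresponding partial-sum accumulator.
        acc = neg_acc if product < 0 else pos_acc
        # Decompose |product| into an exponent and a fixed-width fraction.
        mantissa, exp = math.frexp(abs(product))
        frac = int(mantissa * (1 << frac_bits))
        # Normalize: right-shift the smaller of the intermediate product
        # and the partial-sum fraction so both share one exponent.
        if acc[1] == 0:
            acc[0] = exp
        elif exp < acc[0]:
            frac >>= (acc[0] - exp)
        elif exp > acc[0]:
            acc[1] >>= (exp - acc[0])
            acc[0] = exp
        # Add the aligned intermediate product to the selected
        # accumulator, renormalizing on carry-out.
        acc[1] += frac
        while acc[1] >= (1 << frac_bits):
            acc[1] >>= 1
            acc[0] += 1

    # Subtract the negative partial sum from the positive partial sum
    # to produce the final sum.
    pos = math.ldexp(pos_acc[1] / (1 << frac_bits), pos_acc[0])
    neg = math.ldexp(neg_acc[1] / (1 << frac_bits), neg_acc[0])
    return pos - neg
```

Because each accumulator only ever grows in magnitude, per-product addition needs no signed arithmetic, and the accumulator not selected by a product's sign can be clock gated, consistent with claims 4, 12, and 18.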
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US62/552,890 | 2017-08-31 | ||
| US16/118,099 | 2018-08-30 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK40017940A true HK40017940A (en) | 2020-09-25 |
| HK40017940B HK40017940B (en) | 2023-12-08 |