WO2025230529A1 - Reinterpretable data type format for accurate and efficient model compression - Google Patents
Reinterpretable data type format for accurate and efficient model compression
Info
- Publication number
- WO2025230529A1 (PCT/US2024/027423)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data type
- type format
- weights
- bit
- integer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Definitions
- LLM large language model
- FP32 32-bit floating point
- FIG. 1 is an illustration of an example of an artificial intelligence (AI) model compression solution according to an embodiment
- FIG. 2 is an illustration of an example of a decision tree according to an embodiment
- FIG. 3 is an illustration of an example of an AI model decompression solution according to an embodiment
- FIG. 4 is a flowchart of an example of a method of compressing a pre-trained AI model according to an embodiment
- FIG.5 is a flowchart of an example of a method of decompressing an output AI model according to an embodiment
- FIG. 6 is a block diagram of an example of a performance-enhanced computing system according to an embodiment
- FIG. 7 is an illustration of an example of a semiconductor package apparatus according to an embodiment
- FIG. 8 is a block diagram of an example of a processor according to an embodiment
- FIG.9 is a block diagram of an example of a multi-processor based computing system according to an embodiment.
- NF4 4-bit normal floating point
- the data type can be used only with floating-point instructions (e.g., 32-bit floating point/FP32, 16-bit floating point/FP16, 16-bit brain floating point/Bfloat16) when computing dot product operations during inferences.
- the NF4 data type does not allow the use of 8-bit integer (INT8) instructions, which are more power-efficient, performant, and available in most contemporary hardware (e.g., central processing units/CPUs, integrated and discrete graphics processing units/GPUs, and specialized accelerators such as network processing units/NPUs).
- Other proposed methods similar to NF4 include palettization, which may cluster and fit weights into a set of floating-point values. Although palettization solutions may improve accuracy, such solutions are typically not efficient during inference for certain platforms.
- the technology described herein provides a 4-bit reinterpretable floating point (RF4) data type, which is a unique 4-bit data type to store weights that can be interpreted as the floating-point data type or cast to an 8-bit integer data type depending on the available instruction set.
- RF4 4-bit reinterpretable floating point
- the proposed data type is performant in various types of hardware (HW) architectures.
- the proposed data type allows for preserving compressed model accuracy on a level similar to other less compute- efficient data types.
- the proposed RF4 type enables accurate post-training and training- time weight compression of Deep Learning models and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets.
- RF4 is highly applicable to the optimization of transformer- based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.).
- a pre-trained AI model 10 e.g., source LLM
- source weights 12 e.g., weight tensor, group of weights
- FP floating point
- a set of model parameters 14 is determined based on the source weights 12, wherein the model parameters 14 include information such as a weight scale factor, a decision tree, a lookup table, and so forth.
- the weight scale factor is determined based on a maximum value in the plurality of source weights 12.
- the plurality of source weights 12 are converted into a plurality of quantized weights 16 (16a, 16b) based on the model parameters 14.
- the quantized weights 16 are in the reinterpretable format (RF) data type and contained within a quantization range (e.g., [-1, 1]), wherein the RF data type is interpretable as quantized weights 16a in an FP data type format and quantized weights 16b in an integer (INT) data type format.
- An output AI model 18 is generated based on the plurality of quantized weights 16 and the model parameters 14.
- Configuring the quantized weights 16 in the RF data type enables accurate post- training and training-time weight compression of the output AI model 18 and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets.
- the RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.).
- the bit-width of the RF data type format is 4-bits (e.g., RF4).
- the two RF4 representations are connected by the fixed factor 1/127: RF4_float[i] = RF4_int8[i] * 1/127.0
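- A minimal sketch of the two coupled lookup tables implied by this relation (values copied from the description below; the Python representation is illustrative, not part of the patent):

```python
RF4_INT8 = [-127, -88, -67, -50, -36, -23, -12, 0,
            10, 20, 31, 43, 56, 71, 92, 127]

RF4_FLOAT = [-1.0, -0.69291339, -0.52755906, -0.39370079,
             -0.28346457, -0.18110236, -0.09448819, 0.0,
             0.07874016, 0.15748031, 0.24409449, 0.33858268,
             0.44094488, 0.55905512, 0.72440945, 1.0]

# The same 4-bit code indexes either table; the two views differ only by the
# fixed factor 1/127, so a weight can be reinterpreted without re-encoding.
for q_int, q_float in zip(RF4_INT8, RF4_FLOAT):
    assert abs(q_float - q_int / 127.0) < 1e-7
```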
- the quantization process from FP32 to RF4 type is as follows: first, the weight scale factor that projects weights to [-1, 1] is computed as scale = max_i |w_i|.
- This scale can be computed per weight tensor or per group of weights (e.g., for each 128 weights). Thus, every weight tensor of RF4 values in the model is accompanied by the scaling factor.
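- A minimal sketch of the per-group scale computation described here, assuming NumPy tensors and a group size of 128 (function names are illustrative, not from the patent):

```python
import numpy as np

def rf4_group_scales(weights: np.ndarray, group_size: int = 128) -> np.ndarray:
    """One scale per group of weights: the maximum absolute value in the group."""
    grouped = weights.reshape(-1, group_size)   # assumes size % group_size == 0
    return np.abs(grouped).max(axis=1, keepdims=True)

def rf4_normalize(weights: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Project each group into the [-1, 1] quantization range using its scale."""
    grouped = weights.reshape(-1, group_size)
    return grouped / rf4_group_scales(weights, group_size)
```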
- FIG.2 shows a decision tree 30 that may be used to convert the source weights 12 (FIG.1) into the quantized weights 16 (FIG.1).
- x is a quantized weight value in the [-1, 1] quantization range and the decision tree 30 returns one of sixteen values in an RF4 lookup table based on the location of the quantized weight value relative to the RF4_float floating point representation of the RF4 data type.
- a return operation 32 outputs the binary value “1111” if the quantized weight value is greater than the median value (e.g., “0.862204725”) between “0.72440945” and “1”.
- a return operation 34 outputs the binary value “1110” if the quantized weight value is greater than the median value (e.g., “0.641732285”) between “0.55905512” and “0.72440945”.
- Although the quantization process can be computationally expensive, quantization may not have any impact on inference latency since quantization is typically conducted offline during model preparation prior to deployment.
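- Comparing a value against the midpoints between adjacent RF4 float values is equivalent to walking the decision tree of FIG. 2; a minimal NumPy sketch (names are illustrative):

```python
import numpy as np

RF4_FLOAT = np.array([-1.0, -0.69291339, -0.52755906, -0.39370079,
                      -0.28346457, -0.18110236, -0.09448819, 0.0,
                      0.07874016, 0.15748031, 0.24409449, 0.33858268,
                      0.44094488, 0.55905512, 0.72440945, 1.0])
MIDPOINTS = (RF4_FLOAT[1:] + RF4_FLOAT[:-1]) / 2.0   # 15 decision thresholds

def rf4_quantize(x: np.ndarray) -> np.ndarray:
    """Map values in [-1, 1] to 4-bit RF4 codes (0..15) by nearest-value rounding."""
    return np.searchsorted(MIDPOINTS, x).astype(np.uint8)

# 0.9 exceeds the last threshold 0.862204725, so it maps to code 0b1111.
assert rf4_quantize(np.array([0.9]))[0] == 0b1111
```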
- FIG.3 shows a decompression solution in which weight dequantization occurs at inference time (e.g., right before a matrix multiplication/MatMul operation).
- the output AI model 18 includes the reinterpretable quantized weights 16.
- a determination is made (e.g., based on the instruction set architecture/ISA) as to whether a plurality of input activations 20 are in the FP data type format (e.g., FP32) or the INT data type format (e.g., INT8). If the plurality of input activations 20 are in the FP data type format, the quantized weights 16 in the RF data type format are interpreted as the quantized weights 16a in the FP data type format during an FP matrix multiplication.
- the FP matrix multiplication may be conducted based on first model parameters 22 including the weight scale factor associated with the plurality of quantized weights 16 and the lookup table associated with the RF data type format. If the plurality of input activations 20 are in the INT data type format, the quantized weights 16 in the RF data type are interpreted as the quantized weights 16b in the INT data type format during an INT matrix multiplication. In such a case, the INT matrix multiplication may be conducted based on the first model parameters 22 as well as second model parameters 24 including a fixed scale factor and an activation scale factor associated with the plurality of input activations 20.
- the fixed scale factor can be associated with an integer range (e.g., [-127, 127]) of the INT data type format.
- an integer range e.g., [-127, 127]
- either the FP matrix multiplication or the INT matrix multiplication generates matrix multiplication results 26.
- - Floating-point instructions are used for MatMul (dot product).
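- A minimal sketch of the two dequantization paths described above, assuming NumPy and dynamic INT8 activation quantization (helper names are illustrative, not from the patent):

```python
import numpy as np

RF4_INT8 = np.array([-127, -88, -67, -50, -36, -23, -12, 0,
                     10, 20, 31, 43, 56, 71, 92, 127], dtype=np.int32)
RF4_FLOAT = RF4_INT8 / 127.0

def matmul_fp(codes, w_scale, x):
    """FP path: interpret the 4-bit codes through the float lookup table."""
    return float(np.dot(RF4_FLOAT[codes] * w_scale, x))

def matmul_int8(codes, w_scale, x):
    """INT path: interpret the same codes as INT8 values and accumulate in integers."""
    a_scale = float(np.abs(x).max()) / 127.0        # dynamic activation scale
    x_q = np.round(x / a_scale).astype(np.int32)    # activations in [-127, 127]
    acc = int(np.dot(RF4_INT8[codes], x_q))         # would run on INT8 instructions
    return acc * w_scale * a_scale / 127.0          # undo fixed (127) and dynamic scales

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=128)
x = rng.standard_normal(128)
# Both paths agree up to the rounding error of activation quantization.
assert abs(matmul_fp(codes, 0.05, x) - matmul_int8(codes, 0.05, x)) < 0.05
```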
- FIG.4 shows a method 40 of compressing a pre-trained AI model.
- the method 40 may be implemented in one or more modules as a plurality of logic instructions (e.g., compression instructions) stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., in hardware, or any combination thereof.
- logic instructions e.g., compression instructions
- RAM random access memory
- ROM read only memory
- PROM programmable ROM
- firmware flash memory
- hardware implementations may include configurable logic, fixed-functionality logic, or any combination thereof.
- configurable logic e.g., configurable hardware
- PLAs programmable logic arrays
- FPGAs field programmable gate arrays
- CPLDs complex programmable logic devices
- general purpose microprocessors
- fixed-functionality logic e.g., fixed-functionality hardware
- ASICs application specific integrated circuits
- the configurable or fixed-functionality logic can be implemented with complementary metal oxide semiconductor (CMOS) logic circuits, transistor-transistor logic (TTL) logic circuits, or other circuits.
- CMOS complementary metal oxide semiconductor
- TTL transistor-transistor logic
- computer program code to carry out operations shown in the method 40 can be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- Illustrated block 42 provides for determining a weight scale factor for a plurality of source weights in a pre-trained model, wherein the plurality of source weights are contained within a source range. In an embodiment, the weight scale factor is determined based on a maximum value in the plurality of source weights.
- Block 44 converts the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range (e.g., [-1, 1]).
- Block 44 may also convert the plurality of source weights into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights.
- the first data type format e.g., RF data type format
- the first data type format is interpretable as an FP data type format and an INT data type format.
- a first bit-width of the first data type format may be less than a second bit-width of a second data type format (e.g., FP32) corresponding to the plurality of source weights and less than a third bit-width of the INT data type format (e.g., INT8).
- the first bit-width of the first data type is 4-bits (e.g., RF4) in one example.
- Other bit-widths may be used, however, for the RF data type format.
- Block 46 generates an output AI model based on the plurality of quantized weights and the weight scale factor.
- the method 40 therefore enhances performance at least to the extent that generating the output AI model based on the plurality of quantized weights and the weight scale factor enables accurate post-training and training-time weight compression of the output AI model and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets.
- the RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.).
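- A compact sketch of the method 40 flow (blocks 42, 44, 46), assuming NumPy and the RF4 tables shown earlier; packing two 4-bit codes per byte is omitted for brevity, and the function name is illustrative:

```python
import numpy as np

RF4_FLOAT = np.array([-127, -88, -67, -50, -36, -23, -12, 0,
                      10, 20, 31, 43, 56, 71, 92, 127]) / 127.0
MIDPOINTS = (RF4_FLOAT[1:] + RF4_FLOAT[:-1]) / 2.0

def compress_weights_rf4(w: np.ndarray, group_size: int = 128) -> dict:
    grouped = w.reshape(-1, group_size)                  # assumes divisibility
    scales = np.abs(grouped).max(axis=1, keepdims=True)  # block 42: weight scale factor
    normalized = grouped / scales                        # block 44: project to [-1, 1]
    codes = np.searchsorted(MIDPOINTS, normalized).astype(np.uint8)
    return {"rf4_codes": codes, "scales": scales}        # block 46: output model payload
```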
- FIG.5 shows a method 50 of decompressing an output AI model.
- the method 50 may be implemented in one or more modules as a plurality of logic instructions (e.g., decompression instructions) stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., in hardware, or any combination thereof.
- Illustrated processing block 52 provides for determining whether a plurality of input activations are in an FP data type format (e.g., FP16, FP32) or an INT data type format (e.g., INT8).
- Block 52 may determine the format of the input activations based on an ISA instruction (e.g., lookup instruction) associated with the underlying hardware.
- ISA instruction e.g., lookup instruction
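- A hypothetical dispatch helper for this determination; the feature names (e.g., "avx512_vnni", "amx_int8") are examples of INT8-capable ISA extensions and are not taken from the patent:

```python
def choose_activation_format(supported_isa: set) -> str:
    """Pick INT8 when the runtime reports an 8-bit dot-product capable ISA."""
    if {"avx512_vnni", "amx_int8"} & supported_isa:
        return "INT8"
    return "FP32"   # fall back to floating-point MatMul otherwise

assert choose_activation_format({"avx2", "avx512_vnni"}) == "INT8"
assert choose_activation_format({"sse4_2"}) == "FP32"
```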
- block 56 interprets, during an FP matrix multiplication (e.g., dot product), a first data type format (e.g., RF) corresponding to a plurality of quantized weights as the FP data type format.
- block 56 conducts the FP matrix multiplication based on a lookup table associated with the first data type format and a weight scale factor associated with the plurality of quantized weights.
- block 60 interprets, during an INT matrix multiplication (e.g., dot product), the first data type format as the INT data type format.
- block 60 conducts the INT matrix multiplication based on a lookup table associated with the first data type format, a weight scale factor associated with the plurality of quantized weights, a fixed scale factor (e.g., 127), and an activation scale factor associated with the plurality of input activations.
- the fixed scale factor can be associated with an integer range (e.g., [-127, 127]) of the INT data type format.
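- A one-weight numeric check of how the weight scale factor, the activation scale factor, and the fixed factor 127 recombine (all numbers below are assumed for illustration):

```python
# Code 14 of the RF4 table: RF4_int8 = 92, RF4_float = 0.72440945.
w_scale, a_scale = 0.05, 0.02      # weight scale factor, activation scale factor
x = 1.7                            # one activation value
x_q = round(x / a_scale)           # 85, inside the [-127, 127] integer range

fp_path = 0.72440945 * w_scale * x              # float lookup, float multiply
int_path = 92 * x_q * w_scale * a_scale / 127   # INT8 multiply, fixed 127 rescale
assert abs(fp_path - int_path) < 1e-6
```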
- a first bit-width (e.g., 4-bits) of the first data type can be less than a second bit-width (e.g., 32-bits) of the FP data type and a third bit-width (e.g., 8-bits) of the INT data type format.
- the plurality of quantized weights are contained within a quantization range such as, for example, [-1, 1]. The method 50 therefore enhances performance at least to the extent that the RF type format improves accuracy and/or reduces latency during inference operations.
- Turning now to FIG. 6, a performance-enhanced computing system 280 is shown.
- the system 280 may generally be part of an electronic device/platform having computing functionality (e.g., personal digital assistant/PDA, notebook computer, tablet computer, convertible tablet, server), communications functionality (e.g., smart phone), imaging functionality (e.g., camera, camcorder), media playing functionality (e.g., smart television/TV), wearable functionality (e.g., watch, eyewear, headwear, footwear, jewelry), vehicular functionality (e.g., car, truck, motorcycle), robotic functionality (e.g., autonomous robot), Internet of Things (IoT) functionality, drone functionality, etc., or any combination thereof.
- computing functionality e.g., personal digital assistant/PDA, notebook computer, tablet computer, convertible tablet, server
- communications functionality e.g., smart phone
- imaging functionality e.g., camera, camcorder
- media playing functionality e.g., smart television/TV
- wearable functionality e.g., watch, eyewear, headwear, footwear, jewelry
- vehicular functionality e.g., car, truck, motorcycle
- the system 280 includes a host processor 282 (e.g., central processing unit/CPU) having an integrated memory controller (IMC) 284 that is coupled to a system memory 286 (e.g., dual inline memory module/DIMM).
- IMC integrated memory controller
- an IO module 288 is coupled to the host processor 282.
- the illustrated IO module 288 communicates with, for example, a display 290 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), and a network controller 292 (e.g., conducting wired and/or wireless communications).
- the host processor 282 may be combined with the IO module 288, a graphics processor 294, and an AI accelerator 296 into a system on chip (SoC) 298.
- SoC system on chip
- the AI accelerator 296, the host processor 282 and/or the SoC 298 executes a plurality of executable program instructions 300 (e.g., compression and/or decompression instructions) retrieved from mass storage 302 and/or the system memory 286 to perform one or more aspects of the method 40 (FIG. 4) and/or the method 50 (FIG.5), already discussed.
- a plurality of executable program instructions 300 e.g., compression and/or decompression instructions
- execution of the instructions 300 causes the AI accelerator 296, the host processor 282 and/or the SoC 298 to determine a weight scale factor for a plurality of source weights in a pre-trained AI model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor.
- a first data type format (e.g., reinterpretable floating point/RF) corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format.
- execution of the instructions 300 causes the AI accelerator 296, the host processor 282 and/or the SoC to determine whether a plurality of input activations are in the floating point data type format or the integer data type format. If the plurality of input activations are in the floating point data type format, the first data type format is interpreted as the floating point data type format during a floating point matrix multiplication. If the plurality of input activations are in the integer data type format, the first data type format is interpreted as the integer data type format during an integer matrix multiplication.
- the computing system 280 is therefore considered performance-enhanced at least to the extent that generating the output AI model based on the plurality of quantized weights and the weight scale factor enables accurate post-training and training-time weight compression of the output AI model and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets.
- the RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.). Additionally, the RF type format improves accuracy and/or reduces latency during inference operations.
- FIG. 7 shows a semiconductor apparatus 350 (e.g., chip, die, package).
- the illustrated apparatus 350 includes one or more substrates 352 (e.g., silicon, sapphire, gallium arsenide) and logic 354 (e.g., transistor array and other integrated circuit/IC components) coupled to the substrate(s) 352.
- the logic 354 implements one or more aspects of the method 40 (FIG.4) and/or the method 50 (FIG. 5), already discussed.
- the logic 354 may be implemented at least partly in configurable or fixed- functionality hardware.
- the logic 354 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 352. Thus, the interface between the logic 354 and the substrate(s) 352 may not be an abrupt junction.
- the logic 354 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 352.
- the method 40 (FIG. 4) and/or the method 50 (FIG. 5) are incorporated into an INTEL OPENVINO toolkit, which streamlines AI model development and integration of deep learning in domains such as computer vision, large language models, and generative AI.
- the use of an RF data type format as described herein improves accuracy and/or reduces latency during inference operations.
- FIG. 8 illustrates a processor core 400 according to one embodiment.
- the processor core 400 may be the core for any type of processor, such as a micro-processor, an embedded processor, a digital signal processor (DSP), a network processor, or other device to execute code. Although only one processor core 400 is illustrated in FIG.8, a processing element may alternatively include more than one of the processor core 400 illustrated in FIG. 8.
- the processor core 400 may be a single-threaded core or, for at least one embodiment, the processor core 400 may be multithreaded in that it may include more than one hardware thread context (or “logical processor”) per core.
- FIG. 8 also illustrates a memory 470 coupled to the processor core 400.
- the memory 470 may be any of a wide variety of memories (including various layers of memory hierarchy) as are known or otherwise available to those of skill in the art.
- the memory 470 may include one or more code 413 instruction(s) to be executed by the processor core 400, wherein the code 413 may implement the method 40 (FIG.4) and/or the method 50 (FIG.5), already discussed.
- the processor core 400 follows a program sequence of instructions indicated by the code 413. Each instruction may enter a front end portion 410 and be processed by one or more decoders 420.
- the decoder 420 may generate as its output a micro operation such as a fixed width micro operation in a predefined format, or may generate other instructions, microinstructions, or control signals which reflect the original code instruction.
- the illustrated front end portion 410 also includes register renaming logic 425 and scheduling logic 430, which generally allocate resources and queue the operation corresponding to the convert instruction for execution.
- the processor core 400 is shown including execution logic 450 having a set of execution units 455-1 through 455-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function.
- the illustrated execution logic 450 performs the operations specified by code instructions. After completion of execution of the operations specified by the code instructions, back end logic 460 retires the instructions of the code 413. In one embodiment, the processor core 400 allows out of order execution but requires in order retirement of instructions.
- Retirement logic 465 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like).
- a processing element may include other elements on chip with the processor core 400.
- a processing element may include memory control logic along with the processor core 400.
- the processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic.
- the processing element may also include one or more caches. Referring now to FIG. 9, shown is a block diagram of a computing system 1000 in accordance with an embodiment.
- Shown in FIG. 9 is a multiprocessor system 1000 that includes a first processing element 1070 and a second processing element 1080. While two processing elements 1070 and 1080 are shown, it is to be understood that an embodiment of the system 1000 may also include only one such processing element.
- the system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in FIG.9 may be implemented as a multi-drop bus rather than point-to-point interconnect. As shown in FIG.
- each of processing elements 1070 and 1080 may be multicore processors, including first and second processor cores (i.e., processor cores 1074a and 1074b and processor cores 1084a and 1084b). Such cores 1074a, 1074b, 1084a, 1084b may be configured to execute instruction code in a manner similar to that discussed above in connection with FIG.8.
- Each processing element 1070, 1080 may include at least one shared cache 1896a, 1896b.
- the shared cache 1896a, 1896b may store data (e.g., instructions) that are utilized by one or more components of the processor, such as the cores 1074a, 1074b and 1084a, 1084b, respectively.
- the shared cache 1896a, 1896b may locally cache data stored in a memory 1032, 1034 for faster access by components of the processor.
- the shared cache 1896a, 1896b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While shown with only two processing elements 1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor.
- processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array.
- additional processing element(s) may include additional processor(s) that are the same as a first processor 1070, additional processor(s) that are heterogeneous or asymmetric to the first processor 1070, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element.
- accelerators such as, e.g., graphics accelerators or digital signal processing (DSP) units
- DSP digital signal processing
- the various processing elements 1070, 1080 may reside in the same die package.
- the first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078.
- the second processing element 1080 may include a MC 1082 and P-P interfaces 1086 and 1088.
- MC’s 1072 and 1082 couple the processors to respective memories, namely a memory 1032 and a memory 1034, which may be portions of main memory locally attached to the respective processors.
- the first processing element 1070 and the second processing element 1080 may be coupled to an I/O subsystem 1090 via P-P interconnects 1076 and 1086, respectively.
- the I/O subsystem 1090 includes P-P interfaces 1094 and 1098.
- I/O subsystem 1090 includes an interface 1092 to couple I/O subsystem 1090 with a high performance graphics engine 1038.
- bus 1049 may be used to couple the graphics engine 1038 to the I/O subsystem 1090.
- I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096.
- the first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.
- PCI Peripheral Component Interconnect
- various I/O devices 1014 e.g., biometric scanners, speakers, cameras, sensors
- the second bus 1020 may be a low pin count (LPC) bus.
- Various devices may be coupled to the second bus 1020 including, for example, a keyboard/mouse 1012, communication device(s) 1026, and a data storage unit 1019 such as a disk drive or other mass storage device which may include code 1030, in one embodiment.
- the illustrated code 1030 may implement the method 40 (FIG. 4) and/or the method 50 (FIG.5), already discussed.
- an audio I/O 1024 may be coupled to second bus 1020 and a battery 1010 may supply power to the computing system 1000. Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of FIG.
- Example 1 includes a performance-enhanced computing system comprising a network controller, a processor coupled to the network controller, and a memory coupled to the processor, the memory including a plurality of compression instructions, which when executed by the processor, cause the processor to determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor.
- AI artificial intelligence
- Example 2 includes the computing system of Example 1, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format.
- Example 3 includes the computing system of Example 2, wherein the memory further includes a plurality of decompression instructions, which when executed by the processor, cause the processor to determine whether a plurality of input activations are in the floating point data type format or the integer data type format, interpret, during a floating point matrix multiplication, the first data type format corresponding to the plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format.
- Example 4 includes the computing system of Example 2, wherein a first bit-width of the first data type format is less than a third bit-width of the integer data type format.
- Example 5 includes the computing system of Example 4, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits.
- Example 6 includes the computing system of Example 1, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights.
- Example 7 includes at least one computer readable storage medium comprising a plurality of compression instructions, which when executed by a computing system, cause the computing system to determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor.
- AI artificial intelligence
- Example 8 includes the at least one computer readable storage medium of Example 7, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format.
- Example 9 includes the at least one computer readable storage medium of Example 8, wherein a first bit-width of the first data type format is less than a third bit-width of the integer data type format.
- Example 10 includes the at least one computer readable storage medium of Example 9, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits.
- Example 11 includes the at least one computer readable storage medium of Example 7, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights.
- Example 12 includes the at least one computer readable storage medium of any one of Examples 7 to 11, wherein the weight scale factor is determined based on a maximum value in the plurality of source weights.
- Example 13 includes the at least one computer readable storage medium of any one of Examples 7 to 11, wherein the quantization range is [-1, 1].
- Example 14 includes at least one computer readable storage medium comprising a plurality of decompression instructions, which when executed by a computing system, cause the computing system to determine whether a plurality of input activations are in a floating point data type format or an integer data type format, interpret, during a floating point matrix multiplication, a first data type format corresponding to a plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format.
- Example 15 includes the at least one computer readable storage medium of Example 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the floating point matrix multiplication based on a lookup table associated with the first data type format and a weight scale factor associated with the plurality of quantized weights.
- Example 16 includes the at least one computer readable storage medium of Example 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the integer matrix multiplication based on a lookup table associated with the first data type format, a weight scale factor associated with the plurality of quantized weights, a fixed scale factor, and an activation scale factor associated with the plurality of input activations.
- Example 17 includes the at least one computer readable storage medium of Example 16, wherein the fixed scale factor is associated with an integer range of the integer data type format, and wherein the integer range is [-127, 127].
- Example 18 includes the at least one computer readable storage medium of any one of Examples 14 to 17, wherein a first bit-width of the first data type format is less than a second bit-width of the floating point data type format and a third bit-width of the integer data type format.
- Example 19 includes the at least one computer readable storage medium of Example 18, wherein the first bit-width of the first data type format is 4-bits.
- Example 20 includes the at least one computer readable storage medium of any one of Examples 14 to 17, wherein the plurality of quantized weights are contained within a quantization range, and wherein the quantization range is [-1, 1].
- Example 21 includes a method of compressing a pre-trained artificial intelligence (AI) model, the method comprising determining a weight scale factor for a plurality of source weights in the pre-trained AI model, wherein the plurality of source weights are contained within a source range, converting the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generating an output AI model based on the plurality of quantized weights and the weight scale factor.
- AI artificial intelligence
- Example 22 includes a method of decompressing an output artificial intelligence (AI) model, the method comprising determining whether a plurality of input activations are in a floating point data type format or an integer data type format, interpreting, during a floating point matrix multiplication, a first data type format corresponding to a plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpreting, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format.
- Example 23 includes an apparatus comprising means for performing the method of any of Examples 21 to 22. Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips.
- IC semiconductor integrated circuit
- Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like.
- PLAs programmable logic arrays
- SoCs systems on chip
- SSD/NAND controller ASICs
- signal conductor lines are represented with lines. Some may be drawn differently, to indicate more constituent signal paths; have a number label, to indicate a number of constituent signal paths; and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit.
- Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
- Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured.
- well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments.
- Coupled may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections.
- “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
- a list of items joined by the term “one or more of” may mean any combination of the listed terms.
- the phrase “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Neurology (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Systems, apparatuses and methods may provide for technology that determines a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the source weights are contained within a source range, converts the source weights into a plurality of quantized weights based on the weight scale factor, wherein the quantized weights are contained within a quantization range, and generates an output AI model based on the plurality of quantized weights and the weight scale factor. In addition, a first data type format corresponding to the quantized weights is interpretable as a floating point data type format and an integer data type format.
Description
REINTERPRETABLE DATA TYPE FORMAT FOR ACCURATE AND EFFICIENT MODEL COMPRESSION
BACKGROUND
A large language model (LLM) is a type of language model notable for the ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by using large amounts of data to learn billions of parameters during training and consuming large computational resources during training and operation (e.g., inference). Weight data used by LLMs may originally be in a relatively high precision format such as, for example, the 32-bit floating point (FP32) format. Execution of LLMs on edge/client devices may be limited due to memory pressure during the loading of weights throughout the inference process. Quantizing the weight data used by LLMs to a lower-precision format such as, for example, 4-bit integer (INT4), can reduce the computational and memory demands of these modern architectures. Conventional quantization approaches, however, may encounter accuracy problems that negate the benefits of quantization. Additionally, the weight representation format may not run efficiently on all instruction sets, which limits data type adoption.
BRIEF DESCRIPTION OF THE DRAWINGS
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
FIG. 1 is an illustration of an example of an artificial intelligence (AI) model compression solution according to an embodiment;
FIG. 2 is an illustration of an example of a decision tree according to an embodiment;
FIG. 3 is an illustration of an example of an AI model decompression solution according to an embodiment;
FIG. 4 is a flowchart of an example of a method of compressing a pre-trained AI model according to an embodiment;
FIG. 5 is a flowchart of an example of a method of decompressing an output AI model according to an embodiment;
FIG. 6 is a block diagram of an example of a performance-enhanced computing system according to an embodiment;
FIG. 7 is an illustration of an example of a semiconductor package apparatus according to an embodiment;
FIG. 8 is a block diagram of an example of a processor according to an embodiment; and
FIG. 9 is a block diagram of an example of a multi-processor based computing system according to an embodiment.
DETAILED DESCRIPTION
Current solutions in the model optimization domain may include methods for large language model (LLM) optimization that are inaccurate or inefficient, as well as post-optimization methods for accuracy improvement that do not focus on LLM optimization or involve relatively long model tuning procedures. For example, generative pre-trained transformer quantization (GPTQ) may quantize model weights in a layer-wise fashion into 4-bit integer (INT4) precision by default using zero-point and scale factor. A substantial challenge of GPTQ, however, is that the data type (INT4) that is used does not consider the nature of the weight distribution, which mostly corresponds to a unimodal normal distribution. As a result, less accurate compressed models are produced despite the complicated compression process. Another alternative is a 4-bit normal floating point (NF4) data type that may contain sixteen (2^4) floating point values in the range [-1,1] that are based on the quantiles of the normal distribution. Although this type allows a more accurate representation of the model weights and may lead to better accuracy after weight compression, one limitation of the NF4 type is that the data type can be used only with floating-point instructions (e.g., 32-bit floating point/FP32, 16-bit floating point/FP16, 16-bit brain floating point/Bfloat16) when computing dot product operations during inference. The NF4 data type does not allow the use of 8-bit integer (INT8) instructions, which are more power-efficient, performant, and available in most contemporary hardware (e.g., central processing units/CPUs, integrated and discrete graphics processing units/GPUs, and specialized accelerators such as network processing units/NPUs).
Other proposed methods similar to NF4 include palettization, which may cluster and fit weights into a set of floating-point values. Although palettization solutions may improve accuracy, such solutions are typically not efficient during inference for certain platforms. The technology described herein provides a 4-bit reinterpretable floating point (RF4) data type, which is a unique 4-bit data type to store weights that can be interpreted as the floating-point data type or cast to an 8-bit integer data type depending on the available instruction set. As a result, the proposed data type is performant in various types of hardware (HW) architectures. In addition, the proposed data type allows for preserving compressed model accuracy on a level similar to other less compute-efficient data types. The proposed RF4 type enables accurate post-training and training-time weight compression of Deep Learning models and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets. RF4 is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.). Turning now to FIG. 1, a compression solution is shown in which a pre-trained AI model 10 (e.g., source LLM) includes a plurality of source weights 12 (e.g., weight tensor, group of weights) that are in a floating point (FP) data type (e.g., FP32) and contained within a source range. A set of model parameters 14 is determined based on the source weights 12, wherein the model parameters 14 include information such as a weight scale factor, a decision tree, a lookup table, and so forth. In one example, the weight scale factor is determined based on a maximum value in the plurality of source weights 12. The plurality of source weights 12 are converted into a plurality of quantized weights 16 (16a, 16b) based on the model parameters 14. As will be discussed in greater detail, the quantized weights 16 are in the reinterpretable format (RF) data type and contained within a quantization range (e.g., [-1, 1]), wherein the RF data type is interpretable as quantized weights 16a in an FP data type format and quantized weights 16b in an integer (INT) data type format. An output AI model 18 is generated based on the plurality of quantized weights 16 and the model parameters 14. Configuring the quantized weights 16 in the RF data type enables accurate post-training and training-time weight compression of the output AI model 18 and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets. The RF data type is highly applicable to the
optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.). In one example, the bit-width of the RF data type format is 4-bits (e.g., RF4). Thus, considering a single linear (e.g., Fully-Connected) layer that is optimized with the RF4 data type format, weights of the source and optimized models have the following relationship: w_optimized[i] = rf4(w_source[i] / scale), where scale is estimated and stored in the output AI model 18 and rf4() maps the floating-point value from the range [-1,1] to the nearest value from the RF4 type. More particularly, the RF4 data type leverages the fact that model weights are normally distributed around zero. First, the [-1,1] interval is considered because any weight tensor can be projected into that interval using the following formula: w'_i = w_i / max_j |w_j|. Then, the 2^k (k = 4) values v_i of the data type are estimated from the quantile function Q() of the normal distribution (e.g., as midpoints of consecutive quantiles, v_i = (Q(i / (2^k + 1)) + Q((i + 1) / (2^k + 1))) / 2). This produces symmetrically distributed values with no zero value. To tackle this problem, 2^(k-1) negative values and 2^(k-1) + 1 positive values are sampled and one of the two zero values is dropped, which results overall in exactly 2^k asymmetrically distributed values in the range [-1,1]. Next, each value is projected into the integer range of [-127,127] using the equation: RF4_int8[i] = round(v_i * 127), where round() is the operation of rounding to the nearest neighbor. Finally, the following set of values is obtained, which represents projections of the NF4 data type into the INT8 data type: RF4_int8 = [-127, -88, -67, -50, -36, -23, -12, 0, 10, 20, 31, 43, 56, 71, 92, 127]. Scaling these values back to the floating point precision (e.g., by multiplying by 1/127.0) forms the floating-point representation of the RF4 data type: RF4_float = [-1, -0.69291339, -0.52755906, -0.39370079, -0.28346457, -0.18110236, -0.09448819, 0, 0.07874016, 0.15748031, 0.24409449, 0.33858268, 0.44094488, 0.55905512, 0.72440945, 1]. Both representations are connected as follows: RF4_float[i] = RF4_int8[i] * 1/127.0.
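The construction above can be sketched with NumPy and SciPy. The tail probability (offset) is not spelled out in this text, so the value below is assumed (it follows an NF4-style recipe), and the resulting code book is expected to be close to, but not necessarily identical with, the table listed above:

```python
import numpy as np
from scipy.stats import norm

k = 4
offset = 0.9677083                                                # assumed tail probability
neg = norm.ppf(np.linspace(1 - offset, 0.5, 2 ** (k - 1)))        # 7 negatives + exact zero
pos = norm.ppf(np.linspace(0.5, offset, 2 ** (k - 1) + 1))[1:]    # 8 positives (zero dropped)
values = np.concatenate([neg, pos])                               # 2^k = 16 asymmetric values
values /= np.abs(values).max()                                    # normalize into [-1, 1]
rf4_int8 = np.round(values * 127).astype(np.int8)                 # project onto [-127, 127]
rf4_float = rf4_int8 / 127.0                                      # coupled floating-point view
print(rf4_int8)
```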
The quantization process from FP32 to RF4 type is as follows:
- The weight scale factor that projects weights to [-1,1] is computed as follows: scale = max_i |w_i|. This scale can be computed per weight tensor or per group of weights (e.g., for each 128 weights). Thus, every weight tensor of RF4 values in the model is accompanied by the scaling factor.
- Weights are projected to the [-1,1] range using pre-computed scales: w'_i = w_i / scale.
- RF4 4-bit “nibbles” (e.g., half a byte) are determined using a decision tree associated with the RF4 data type format.
FIG. 2 shows a decision tree 30 that may be used to convert the source weights 12 (FIG. 1) into the quantized weights 16 (FIG. 1). In general, x is a quantized weight value in the [-1, 1] quantization range and the decision tree 30 returns one of sixteen values in an RF4 lookup table based on the location of the quantized weight value relative to the RF4_float floating point representation of the RF4 data type. For example, a return operation 32 outputs the binary value “1111” if the quantized weight value is greater than the median value (e.g., “0.862204725”) between “0.72440945” and “1”. By contrast, a return operation 34 outputs the binary value “1110” if the quantized weight value is greater than the median value (e.g., “0.641732285”) between “0.55905512” and “0.72440945”. Although the quantization process can be computationally expensive, quantization may not have any impact on inference latency since quantization is typically conducted offline during model preparation prior to deployment. FIG. 3 shows a decompression solution in which weight dequantization occurs at inference time (e.g., right before a matrix multiplication/MatMul operation). In the illustrated example, the output AI model 18 includes the reinterpretable quantized weights 16. A determination is made (e.g., based on the instruction set architecture/ISA) as to whether a plurality of input activations 20 are in the FP data type format (e.g., FP32) or the INT data type format (e.g., INT8). If the plurality of input activations 20 are in the FP data type format, the quantized weights 16 in the RF data type format are interpreted as the quantized weights 16a in the FP data type format during an FP matrix multiplication. In such a case, the FP matrix multiplication may be conducted based on first model parameters 22 including the weight scale factor associated with the plurality of quantized weights 16 and the lookup table associated with the RF data type format.
If the plurality of input activations 20 are in the INT data type format, the quantized weights 16 in the RF data type are interpreted as the quantized weights 16b in the INT data type format during an INT matrix multiplication. In such a case, the INT matrix multiplication may be conducted based on the first model parameters 22 as well as second model parameters 24 including a fixed scale factor and an activation scale factor associated with the plurality of input activations 20. As will be discussed in greater detail, the fixed scale factor can be associated with an integer range (e.g., [-127, 127]) of the INT data type format. In the illustrated example, either the FP matrix multiplication or the INT matrix multiplication generates matrix multiplication results 26. Thus, there are two possible options:
- Floating-point instructions are used for MatMul (dot product). In this case, the quantized weights 16a are mapped to the floating-point range using the lookup table and scaled back to the source range using the weight scaling factor: y = sum_i (RF4_float[w_i] * scale) * x_i = sum_i w_hat_i * x_i.
- Integer (INT8) instructions are used for MatMul (dot product). In this case, the input activations are quantized to 8 bits (e.g., using dynamic quantization). The RF4 weights are converted into the INT8 data type using the RF4_int8 values defined above: RF4_int8[i] = RF4_float[i] * 127.0. Then, the dot product operation can be computed in 8-bit precision as follows: y = sum_i (RF4_int8[w_i] * 1/127 * scale) * (x_q_i * scale_a) = (scale * scale_a * 1/127) * sum_i RF4_int8[w_i] * x_q_i, where scale_a is the activation scale factor and x_q_i are the quantized activations. Running the accumulation with INT8 instructions provides
performance gains in terms of latency. FIG.4 shows a method 40 of compressing a pre-trained AI model. The method 40 may be implemented in one or more modules as a plurality of logic instructions (e.g., compression instructions) stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., in hardware, or any combination thereof. For example, hardware implementations may include configurable logic, fixed-functionality logic, or any combination thereof. Examples of configurable logic (e.g., configurable hardware) include suitably configured programmable logic arrays (PLAs), field programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), and general purpose microprocessors. Examples of fixed-
functionality logic (e.g., fixed-functionality hardware) include suitably configured application specific integrated circuits (ASICs), combinational logic circuits, and sequential logic circuits. The configurable or fixed-functionality logic can be implemented with complementary metal oxide semiconductor (CMOS) logic circuits, transistor-transistor logic (TTL) logic circuits, or other circuits. For example, computer program code to carry out operations shown in the method 40 can be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.). Illustrated block 42 provides for determining a weight scale factor for a plurality of source weights in a pre-trained model, wherein the plurality of source weights are contained within a source range. In an embodiment, the weight scale factor is determined based on a maximum value in the plurality of source weights. Block 44 converts the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range (e.g., [-1, 1]). Block 44 may also convert the plurality of source weights into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights. As already noted, the first data type format (e.g., RF data type format) corresponding to the plurality of quantized weights is interpretable as an FP data type format and an INT data type format. Moreover, a first bit-width of the first data type format may be less than a second bit-width of a second data type format (e.g., FP32) corresponding to the plurality of source weights and less than a third bit-width of the INT data type format (e.g., INT8). For example, the first bit-width of the first data type is 4-bits (e.g., RF4) in one example. Other bit-widths may be used, however, for the RF data type format. Block 46 generates an output AI model based on the plurality of quantized weights and the weight scale factor.
Docket No. AF7863-PCT The method 40 therefore enhances performance at least to the extent that generating the output AI model based on the plurality of quantized weights and the weight scale factor enables accurate post-training and training-time weight compression of the output AI model and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets. The RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.). FIG.5 shows a method 50 of decompressing an output AI model. The method 50 may be implemented in one or more modules as a plurality of logic instructions (e.g., decompression instructions) stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., in hardware, or any combination thereof. Illustrated processing block 52 provides for determining whether a plurality of input activations are in an FP data type format (e.g., FP16, FP32) or an INT data type format (e.g., INT8). Block 52 may determine the format of the input activations based on an ISA instruction (e.g., lookup instruction) associated with the underlying hardware. If it is determined at block 54 that the input activations are in the FP data type format, block 56 interprets, during an FP matrix multiplication (e.g., dot product), a first data type format (e.g., RF) corresponding to a plurality of quantized weights as the FP data type format. In an embodiment, block 56 conducts the FP matrix multiplication based on a lookup table associated with the first data type format and a weight scale factor associated with the plurality of quantized weights. If it is determined at block 58 that the input activations are in the INT data type format, block 60 interprets, during an INT matrix multiplication (e.g., dot product), the first data type format as the INT data type format. In an embodiment, block 60 conducts the INT matrix multiplication based on a lookup table associated with the first data type format, a weight scale factor associated with the plurality of quantized weights, a fixed scale factor (e.g., 127), and an activation scale factor associated with the plurality of input activations. As already noted, the fixed scale factor can be associated with an integer range (e.g., [-127, 127]) of the INT data type format. Moreover, a first bit-width (e.g., 4-bits) of the first data type can be less than a second bit-width (e.g., 32-bits) of the FP data type and a third bit-width (e.g., 8-bits) of the INT data type format. In an embodiment, the plurality of quantized weights are contained within a quantization such as, for
Docket No. AF7863-PCT example, [-1, 1]. The method 50 therefore enhances performance at least to the extent that the RF type format improves accuracy and/or reduces latency during inference operations. Turning now to FIG. 6, a performance-enhanced computing system 280 is shown. The system 280 may generally be part of an electronic device/platform having computing functionality (e.g., personal digital assistant/PDA, notebook computer, tablet computer, convertible tablet, server), communications functionality (e.g., smart phone), imaging functionality (e.g., camera, camcorder), media playing functionality (e.g., smart television/TV), wearable functionality (e.g., watch, eyewear, headwear, footwear, jewelry), vehicular functionality (e.g., car, truck, motorcycle), robotic functionality (e.g., autonomous robot), Internet of Things (IoT) functionality, drone functionality, etc., or any combination thereof. In the illustrated example, the system 280 includes a host processor 282 (e.g., central processing unit/CPU) having an integrated memory controller (IMC) 284 that is coupled to a system memory 286 (e.g., dual inline memory module/DIMM). In an embodiment, an IO module 288 is coupled to the host processor 282. The illustrated IO module 288 communicates with, for example, a display 290 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), and a network controller 292 (e.g., conducting wired and/or wireless communications). The host processor 282 may be combined with the IO module 288, a graphics processor 294, and an AI accelerator 296 into a system on chip (SoC) 298. In an embodiment, the AI accelerator 296, the host processor 282 and/or the SoC 298 executes a plurality of executable program instructions 300 (e.g., compression and/or decompression instructions) retrieved from mass storage 302 and/or the system memory 286 to perform one or more aspects of the method 40 (FIG. 4) and/or the method 50 (FIG.5), already discussed. Thus, during compression, execution of the instructions 300 causes the AI accelerator 296, the host processor 282 and/or the SoC 298 to determine a weight scale factor for a plurality of source weights in a pre-trained AI model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor. In an embodiment, a first data type format (e.g., reinterpretable floating
point/RF) corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format. During decompression, execution of the instructions 300 causes the AI accelerator 296, the host processor 282 and/or the SoC to determine whether a plurality of input activations are in the floating point data type format or the integer data type format. If the plurality of input activations are in the floating point data type format, the first data type format is interpreted as the floating point data type format during a floating point matrix multiplication. If the plurality of input activations are in the integer data type format, the first data type format is interpreted as the integer data type format during an integer matrix multiplication. The computing system 280 is therefore considered performance-enhanced at least to the extent that generating the output AI model based on the plurality of quantized weights and the weight scale factor enables accurate post-training and training-time weight compression of the output AI model and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets. The RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.). Additionally, the RF type format improves accuracy and/or reduces latency during inference operations. FIG. 7 shows a semiconductor apparatus 350 (e.g., chip, die, package). The illustrated apparatus 350 includes one or more substrates 352 (e.g., silicon, sapphire, gallium arsenide) and logic 354 (e.g., transistor array and other integrated circuit/IC components) coupled to the substrate(s) 352. In an embodiment, the logic 354 implements one or more aspects of the method 40 (FIG.4) and/or the method 50 (FIG. 5), already discussed. The logic 354 may be implemented at least partly in configurable or fixed-functionality hardware. In one example, the logic 354 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 352. Thus, the interface between the logic 354 and the substrate(s) 352 may not be an abrupt junction. The logic 354 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 352. In an embodiment, the method 40 (FIG. 4) and/or the method 50 (FIG. 5) are incorporated into an INTEL OPENVINO toolkit, which streamlines AI model development and integration of deep learning in domains such as computer vision, large
Docket No. AF7863-PCT language models,^and generative AI. In such a case, the use of an RF data type format as described herein improves accuracy and/or reduces latency during inference operations. FIG. 8 illustrates a processor core 400 according to one embodiment. The processor core 400 may be the core for any type of processor, such as a micro-processor, an embedded processor, a digital signal processor (DSP), a network processor, or other device to execute code. Although only one processor core 400 is illustrated in FIG.8, a processing element may alternatively include more than one of the processor core 400 illustrated in FIG. 8. The processor core 400 may be a single-threaded core or, for at least one embodiment, the processor core 400 may be multithreaded in that it may include more than one hardware thread context (or “logical processor”) per core. FIG. 8 also illustrates a memory 470 coupled to the processor core 400. The memory 470 may be any of a wide variety of memories (including various layers of memory hierarchy) as are known or otherwise available to those of skill in the art. The memory 470 may include one or more code 413 instruction(s) to be executed by the processor core 400, wherein the code 413 may implement the method 40 (FIG.4) and/or the method 50 (FIG.5), already discussed. The processor core 400 follows a program sequence of instructions indicated by the code 413. Each instruction may enter a front end portion 410 and be processed by one or more decoders 420. The decoder 420 may generate as its output a micro operation such as a fixed width micro operation in a predefined format, or may generate other instructions, microinstructions, or control signals which reflect the original code instruction. The illustrated front end portion 410 also includes register renaming logic 425 and scheduling logic 430, which generally allocate resources and queue the operation corresponding to the convert instruction for execution. The processor core 400 is shown including execution logic 450 having a set of execution units 455-1 through 455-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The illustrated execution logic 450 performs the operations specified by code instructions. After completion of execution of the operations specified by the code instructions, back end logic 460 retires the instructions of the code 413. In one embodiment, the processor core 400 allows out of order execution but requires in order
Docket No. AF7863-PCT retirement of instructions. Retirement logic 465 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processor core 400 is transformed during execution of the code 413, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 425, and any registers (not shown) modified by the execution logic 450. Although not illustrated in FIG. 8, a processing element may include other elements on chip with the processor core 400. For example, a processing element may include memory control logic along with the processor core 400. The processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic. The processing element may also include one or more caches. Referring now to FIG.9, shown is a block diagram of a computing system 1000 embodiment in accordance with an embodiment. Shown in FIG.9 is a multiprocessor system 1000 that includes a first processing element 1070 and a second processing element 1080. While two processing elements 1070 and 1080 are shown, it is to be understood that an embodiment of the system 1000 may also include only one such processing element. The system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in FIG.9 may be implemented as a multi-drop bus rather than point-to-point interconnect. As shown in FIG. 9, each of processing elements 1070 and 1080 may be multicore processors, including first and second processor cores (i.e., processor cores 1074a and 1074b and processor cores 1084a and 1084b). Such cores 1074a, 1074b, 1084a, 1084b may be configured to execute instruction code in a manner similar to that discussed above in connection with FIG.8. Each processing element 1070, 1080 may include at least one shared cache 1896a, 1896b. The shared cache 1896a, 1896b may store data (e.g., instructions) that are utilized by one or more components of the processor, such as the cores 1074a, 1074b and 1084a, 1084b, respectively. For example, the shared cache 1896a, 1896b may locally cache data stored in a memory 1032, 1034 for faster access by components of the processor. In one or more embodiments, the shared cache 1896a, 1896b may
include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While shown with only two processing elements 1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, additional processing element(s) may include additional processor(s) that are the same as a first processor 1070, additional processor(s) that are heterogeneous or asymmetric to the first processor 1070, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the processing elements 1070, 1080 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 1070, 1080. For at least one embodiment, the various processing elements 1070, 1080 may reside in the same die package. The first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078. Similarly, the second processing element 1080 may include a MC 1082 and P-P interfaces 1086 and 1088. As shown in FIG. 9, MCs 1072 and 1082 couple the processors to respective memories, namely a memory 1032 and a memory 1034, which may be portions of main memory locally attached to the respective processors. While the MCs 1072 and 1082 are illustrated as integrated into the processing elements 1070, 1080, for alternative embodiments the MC logic may be discrete logic outside the processing elements 1070, 1080 rather than integrated therein. The first processing element 1070 and the second processing element 1080 may be coupled to an I/O subsystem 1090 via P-P interconnects 1076 and 1086, respectively. As shown in FIG. 9, the I/O subsystem 1090 includes P-P interfaces 1094 and 1098. Furthermore, I/O subsystem 1090 includes an interface 1092 to couple I/O subsystem 1090 with a high performance graphics engine 1038. In one embodiment, bus 1049 may be used to couple the graphics engine 1038 to the I/O subsystem 1090. Alternatively, a point-to-point interconnect may couple these components.
Docket No. AF7863-PCT In turn, I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096. In one embodiment, the first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments are not so limited. As shown in FIG. 9, various I/O devices 1014 (e.g., biometric scanners, speakers, cameras, sensors) may be coupled to the first bus 1016, along with a bus bridge 1018 which may couple the first bus 1016 to a second bus 1020. In one embodiment, the second bus 1020 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 1020 including, for example, a keyboard/mouse 1012, communication device(s) 1026, and a data storage unit 1019 such as a disk drive or other mass storage device which may include code 1030, in one embodiment. The illustrated code 1030 may implement the method 40 (FIG. 4) and/or the method 50 (FIG.5), already discussed. Further, an audio I/O 1024 may be coupled to second bus 1020 and a battery 1010 may supply power to the computing system 1000. Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of FIG. 9, a system may implement a multi-drop bus or another such communication topology. Also, the elements of FIG.9 may alternatively be partitioned using more or fewer integrated chips than shown in FIG.9. Additional Notes and Examples: Example 1 includes a performance-enhanced computing system comprising a network controller, a processor coupled to the network controller, and a memory coupled to the processor, the memory including a plurality of compression instructions, which when executed by the processor, cause the processor to determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor. Example 2 includes the computing system of Example 1, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format. Example 3 includes the computing system of Example 2, wherein the memory further includes a plurality of decompression instructions, which when executed by the
Docket No. AF7863-PCT processor, cause the processor to determine whether a plurality of input activations are in the floating point data type format or the integer data type format, interpret, during a floating point matrix multiplication, the first data type format corresponding to the plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format. Example 4 includes the computing system of Example 2, wherein a first bit- width of the first data type format is less than a third bit-width of the integer data type format. Example 5 includes the computing system of Example 4, wherein the first bit- width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits. Example 6 includes the computing system of Example 1, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights. Example 7 includes at least one computer readable storage medium comprising a plurality of compression instructions, which when executed by a computing system, cause the computing system to determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor. Example 8 includes the at least one computer readable storage medium of Example 7, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format. Example 9 includes the at least one computer readable storage medium of Example 8, wherein a first bit-width of the first data type format is less than a third bit- width of the integer data type format.
Docket No. AF7863-PCT Example 10 includes the at least one computer readable storage medium of Example 9, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits. Example 11 includes the at least one computer readable storage medium of Example 7, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights. Example 12 includes the at least one computer readable storage medium of any one of Examples 7 to 11, wherein the weight scale factor is determined based on a maximum value in the plurality of source weights. Example 13 includes the at least one computer readable storage medium of any one of Examples 7 to 11, wherein the quantization range is [-1, 1]. Example 14 includes at least one computer readable storage medium comprising a plurality of decompression instructions, which when executed by a computing system, cause the computing system to determine whether a plurality of input activations are in a floating point data type format or an integer data type format, interpret, during a floating point matrix multiplication, a first data type format corresponding to a plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format. Example 15 includes the at least one computer readable storage medium of Example 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the floating point matrix multiplication based on a lookup table associated with the first data type format and a weight scale factor associated with the plurality of quantized weights. Example 16 includes the at least one computer readable storage medium of Example 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the integer matrix multiplication based on a lookup table associated with the first data type format, a weight scale factor associated with the plurality of quantized weights, a fixed scale factor, and an activation scale factor associated with the plurality of input activations.
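As one illustrative reading of Examples 14 to 17 (and of method 50 of FIG. 5), the Python/NumPy sketch below shows how the same 4-bit codes could feed either a floating point or an integer dot product. The lookup table, the toy shapes, and the particular way the weight scale factor, the fixed scale factor of 127 and the activation scale factor are composed are assumptions made for illustration; production kernels would instead rely on the instruction set of the underlying hardware.

```python
# Illustrative sketch of the dual-interpretation inference path (Examples 14-17).
# NOTE: RF4_LUT is the same hypothetical table as in the compression sketch, and
# the scale composition in the integer branch is one consistent reading, not a
# formula mandated by this disclosure.
import numpy as np

RF4_LUT = np.linspace(-1.0, 1.0, 16)   # hypothetical RF4 code points in [-1, 1]
FIXED_SCALE = 127.0                    # fixed scale factor tied to the INT8 range [-127, 127]

def rf_matmul(activations, codes, weight_scale, act_scale=None):
    """Multiply activations by RF4-coded weights, picking the FP or INT interpretation."""
    if np.issubdtype(activations.dtype, np.floating):
        # FP path: interpret the codes as floating point values via the lookup table.
        w = (RF4_LUT[codes] * weight_scale).astype(activations.dtype)
        return activations @ w
    # INT path: reinterpret the very same codes as signed integers in [-127, 127].
    w_int = np.round(RF4_LUT[codes] * FIXED_SCALE).astype(np.int8)
    acc = activations.astype(np.int32) @ w_int.astype(np.int32)   # integer dot product
    # Fold the activation, weight and fixed scale factors back in to recover FP outputs.
    return acc.astype(np.float32) * (act_scale * weight_scale / FIXED_SCALE)

# Example usage with a toy 8x4 weight matrix of RF4 codes.
rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(8, 4))
w_scale = 0.3                                     # weight scale factor from compression
y_fp = rf_matmul(rng.standard_normal((2, 8), dtype=np.float32), codes, w_scale)
a_int8 = rng.integers(-127, 128, size=(2, 8), dtype=np.int8)
y_int = rf_matmul(a_int8, codes, w_scale, act_scale=0.05)
```

The point of the sketch is that the stored codes are never duplicated: which branch runs depends only on the data type of the incoming activations, mirroring blocks 56 and 60 of FIG. 5.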
Docket No. AF7863-PCT Example 17 includes the at least one computer readable storage medium of Example 16, wherein the fixed scale factor is associated with an integer range of the integer data type format, and wherein the integer range is [-127, 127]. Example 18 includes the at least one computer readable storage medium of any one of Examples 14 to 17, wherein a first bit-width of the first data type format is less than a second bit-width of the floating point data type format and a third bit-width of the integer data type format. Example 19 includes the at least one computer readable storage medium of Example 18, wherein the first bit-width of the first data type format is 4-bits. Example 20 includes the at least one computer readable storage medium of any one of Examples 14 to 17, wherein the plurality of quantized weights are contained within a quantization range, and wherein the quantization range is [-1, 1]. Example 21 includes a method of compressing a pre-trained artificial intelligence (AI) model, the method comprising determining a weight scale factor for a plurality of source weights in the pre-trained AI model, wherein the plurality of source weights are contained within a source range, converting the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generating an output AI model based on the plurality of quantized weights and the weight scale factor. Example 22 includes a method of decompressing an output artificial intelligence (AI) model, the method comprising determining whether a plurality of input activations are in a floating point data type format or an integer data type format, interpreting, during a floating point matrix multiplication, a first data type format corresponding to a plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpreting, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format. Example 23 includes an apparatus comprising means for performing the method of any of Examples 21 to 23. Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs,
Docket No. AF7863-PCT and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines. Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the computing system within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting. The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated. As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the
Docket No. AF7863-PCT phrases “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C. Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.
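As a final illustrative note on storage before turning to the claims: because the first bit-width discussed in Examples 4, 5, 9, 10, 18 and 19 is 4 bits, two RF codes can share a single byte, halving weight storage relative to INT8 and quartering it relative to FP32. The low-nibble-first packing order in the sketch below is an assumption for illustration; it is not prescribed by the format described above.

```python
# Illustrative 4-bit packing: two RF4 codes per byte (order is an assumption).
import numpy as np

def pack_rf4(codes: np.ndarray) -> np.ndarray:
    flat = codes.astype(np.uint8).ravel()
    if flat.size % 2:                     # pad to an even count of codes
        flat = np.append(flat, 0)
    return (flat[0::2] & 0x0F) | ((flat[1::2] & 0x0F) << 4)

def unpack_rf4(packed: np.ndarray, count: int) -> np.ndarray:
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return np.stack([lo, hi], axis=1).ravel()[:count]

codes = np.random.randint(0, 16, size=9)
assert np.array_equal(unpack_rf4(pack_rf4(codes), 9), codes)
```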
Claims
Docket No. AF7863-PCT CLAIMS We claim: 1. A performance-enhanced computing system comprising: a network controller; a processor coupled to the network controller; and a memory coupled to the processor, the memory including a plurality of compression instructions, which when executed by the processor, cause the processor to: determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range; convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range; and generate an output AI model based on the plurality of quantized weights and the weight scale factor. 2. The computing system of claim 1, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format. 3. The computing system of claim 2, wherein the memory further includes a plurality of decompression instructions, which when executed by the processor, cause the processor to: determine whether a plurality of input activations are in the floating point data type format or the integer data type format; interpret, during a floating point matrix multiplication, the first data type format corresponding to the plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format; and
Docket No. AF7863-PCT interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format. 4. The computing system of claim 2, wherein a first bit-width of the first data type format is less than a third bit-width of the integer data type format. 5. The computing system of claim 4, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits. 6. The computing system of claim 1, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights. 7. At least one computer readable storage medium comprising a plurality of compression instructions, which when executed by a computing system, cause the computing system to: determine a weight scale factor for a plurality of source weights in a pre- trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range; convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range; and generate an output AI model based on the plurality of quantized weights and the weight scale factor. 8. The at least one computer readable storage medium of claim 7, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format.
Docket No. AF7863-PCT 9. The at least one computer readable storage medium of claim 8, wherein a first bit-width of the first data type format is less than a third bit-width of the integer data type format. 10. The at least one computer readable storage medium of claim 9, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits. 11. The at least one computer readable storage medium of claim 7, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights. 12. The at least one computer readable storage medium of any one of claims 7 to 11, wherein the weight scale factor is determined based on a maximum value in the plurality of source weights. 13. The at least one computer readable storage medium of any one of claims 7 to 11, wherein the quantization range is [-1, 1]. 14. At least one computer readable storage medium comprising a plurality of decompression instructions, which when executed by a computing system, cause the computing system to: determine whether a plurality of input activations are in a floating point data type format or an integer data type format; interpret, during a floating point matrix multiplication, a first data type format corresponding to a plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format; and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format.
Docket No. AF7863-PCT 15. The at least one computer readable storage medium of claim 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the floating point matrix multiplication based on a lookup table associated with the first data type format and a weight scale factor associated with the plurality of quantized weights. 16. The at least one computer readable storage medium of claim 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the integer matrix multiplication based on a lookup table associated with the first data type format, a weight scale factor associated with the plurality of quantized weights, a fixed scale factor, and an activation scale factor associated with the plurality of input activations. 17. The at least one computer readable storage medium of claim 16, wherein the fixed scale factor is associated with an integer range of the integer data type format, and wherein the integer range is [-127, 127]. 18. The at least one computer readable storage medium of any one of claims 14 to 17, wherein a first bit-width of the first data type format is less than a second bit-width of the floating point data type format and a third bit-width of the integer data type format. 19. The at least one computer readable storage medium of claim 18, wherein the first bit-width of the first data type format is 4-bits. 20. The at least one computer readable storage medium of any one of claims 14 to 17, wherein the plurality of quantized weights are contained within a quantization range, and wherein the quantization range is [-1, 1].
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2024/027423 WO2025230529A1 (en) | 2024-05-02 | 2024-05-02 | Reinterpretable data type format for accurate and efficient model compression |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2024/027423 WO2025230529A1 (en) | 2024-05-02 | 2024-05-02 | Reinterpretable data type format for accurate and efficient model compression |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025230529A1 true WO2025230529A1 (en) | 2025-11-06 |
Family
ID=97561832
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/027423 Pending WO2025230529A1 (en) | 2024-05-02 | 2024-05-02 | Reinterpretable data type format for accurate and efficient model compression |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025230529A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110009101A (en) * | 2019-04-11 | 2019-07-12 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating quantization neural network |
| US20200167632A1 (en) * | 2018-11-23 | 2020-05-28 | Samsung Electronics Co., Ltd. | Neural network device for neural network operation, method of operating neural network device, and application processor including the neural network device |
| US11556772B2 (en) * | 2017-04-28 | 2023-01-17 | Intel Corporation | Incremental precision networks using residual inference and fine-grain quantization |
| US20230410255A1 (en) * | 2021-01-22 | 2023-12-21 | Qualcomm Incorporated | Decreased quantization latency |
| US20240104346A1 (en) * | 2022-09-15 | 2024-03-28 | Huawei Technologies Co., Ltd. | Method and device for compressing generative pre-trained language models via quantization |