
WO2025230529A1 - Reinterpretable data type format for accurate and efficient model compression - Google Patents

Reinterpretable data type format for accurate and efficient model compression

Info

Publication number
WO2025230529A1
Authority
WO
WIPO (PCT)
Prior art keywords
data type
type format
weights
bit
integer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/027423
Other languages
French (fr)
Inventor
Alexander Kozlov
Dmitry GOROKHOV
Andrey ANUFRIEV
Nikolay LYALYUSHKIN
Yury Gorbachev
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp filed Critical Intel Corp
Priority to PCT/US2024/027423 priority Critical patent/WO2025230529A1/en
Publication of WO2025230529A1 publication Critical patent/WO2025230529A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0495Quantised networks; Sparse networks; Compressed networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Definitions

  • LLM large language model
  • FP32 32-bit floating point
  • FIG. 1 is an illustration of an example of an artificial intelligence (AI) model compression solution according to an embodiment
  • FIG. 2 is an illustration of an example of a decision tree according to an embodiment
  • FIG. 3 is an illustration of an example of an AI model decompression solution according to an embodiment
  • FIG. 4 is a flowchart of an example of a method of compressing a pre-trained AI model according to an embodiment
  • FIG.5 is a flowchart of an example of a method of decompressing an output AI model according to an embodiment
  • FIG. 6 is a block diagram of an example of a performance-enhanced computing system according to an embodiment
  • FIG. 7 is an illustration of an example of a semiconductor package apparatus according to an embodiment
  • FIG. 8 is a block diagram of an example of a processor according to an embodiment
  • FIG.9 is a block diagram of an example of a multi-processor based computing system according to an embodiment.
  • NF4 normal floating point
  • NF4 4-bit normal floating point
  • the data type can be used only with floating-point instructions (e.g., 32-bit floating point/FP32, 16-bit floating point/FP16, 16-bit brain floating point/Bfloat16) when computing dot product operations during inferences.
  • the NF4 data type does not allow the use of 8-bit integer (INT8) instructions, which are more power-efficient, performant, and available in most contemporary hardware (e.g., central processing units/CPUs, integrated and discrete graphics processing units/GPUs, and specialized accelerators such as network processing units/NPUs).
  • Other proposed methods similar to NF4 include palettization, which may cluster and fit weights into a set of floating-point values. Although palettization solutions may improve accuracy, such solutions are typically not efficient during inference for certain platforms.
  • the technology described herein provides a 4-bit reinterpretable floating point (RF4) data type, which is a unique 4-bit data type to store weights that can be interpreted as the floating-point data type or cast to an 8-bit integer data type depending on the available instruction set.
  • RF4 4-bit reinterpretable floating point
  • the proposed data type is performant in various types of hardware (HW) architectures.
  • the proposed data type allows for preserving compressed model accuracy on a level similar to other less compute-efficient data types.
  • the proposed RF4 type enables accurate post-training and training-time weight compression of Deep Learning models and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets.
  • RF4 is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.).
  • a pre-trained AI model 10 e.g., source LLM
  • source weights 12 e.g., weight tensor, group of weights
  • FP floating point
  • a set of model parameters 14 is determined based on the source weights 12, wherein the model parameters 14 include information such as a weight scale factor, a decision tree, a lookup table, and so forth.
  • the weight scale factor is determined based on a maximum value in the plurality of source weights 12.
  • the plurality of source weights 12 are converted into a plurality of quantized weights 16 (16a, 16b) based on the model parameters 14.
  • the quantized weights 16 are in the reinterpretable format (RF) data type and contained within a quantization range (e.g., [-1, 1]), wherein the RF data type is interpretable as quantized weights 16a in an FP data type format and quantized weights 16b in an integer (INT) data type format.
  • An output AI model 18 is generated based on the plurality of quantized weights 16 and the model parameters 14.
  • Configuring the quantized weights 16 in the RF data type enables accurate post-training and training-time weight compression of the output AI model 18 and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets.
  • the RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.).
  • the bit-width of the RF data type format is 4-bits (e.g., RF4).
  • $RF4^{float}_i = RF4^{int8}_i \cdot \frac{1}{127.0}$
  • the quantization process from FP32 to the RF4 type is as follows
  • This scale can be computed per weight tensor or per group of weights (e.g., for each 128 weights). Thus, every weight tensor of RF4 values in the model is accompanied by the scaling factor.
  • FIG.2 shows a decision tree 30 that may be used to convert the source weights 12 (FIG.1) into the quantized weights 16 (FIG.1).
  • x is a quantized weight value in the [-1, 1] quantization range and the decision tree 30 returns one of sixteen values in an RF4 lookup table based on the location of the quantized weight value relative to the $RF4^{float}$ floating point representation of the RF4 data type.
  • a return operation 32 outputs the binary value “1111” if the quantized weight value is greater than the median value (e.g., “0.862204725”) between “0.72440945” and “1”.
  • a return operation 34 outputs the binary value “1110” if the quantized weight value is greater than the median value (e.g., “0.641732285”) between “0.55905512” and “0.72440945”.
  • the quantization process can be computationally expensive, quantization may not have any impact on inference latency since quantization is typically conducted offline during model preparation prior to deployment.
  • FIG.3 shows a decompression solution in which weight dequantization occurs at inference time (e.g., right before a matrix multiplication/MatMul operation).
  • the output AI model 18 includes the reinterpretable quantized weights 16.
  • a determination is made (e.g., based on the instruction set architecture/ISA) as to whether a plurality of input activations 20 are in the FP data type format (e.g., FP32) or the INT data type format (e.g., INT8). If the plurality of input activations 20 are in the FP data type format, the quantized weights 16 in the RF data type format are interpreted as the quantized weights 16a in the FP data type format during an FP matrix multiplication.
  • the FP matrix multiplication may be conducted based on first model parameters 22 including the weight scale factor associated with the plurality of quantized weights 16 and the lookup table associated with the RF data type format. If the plurality of input activations 20 are in the INT data type format, the quantized weights 16 in the RF data type are interpreted as the quantized weights 16b in the INT data type format during an INT matrix multiplication. In such a case, the INT matrix multiplication may be conducted based on the first model parameters 22 as well as second model parameters 24 including a fixed scale factor and an activation scale factor associated with the plurality of input activations 20.
  • the fixed scale factor can be associated with an integer range (e.g., [-127, 127]) of the INT data type format.
  • an integer range e.g., [-127, 127]
  • either the FP matrix multiplication or the INT matrix multiplication generates matrix multiplication results 26.
  • - Floating-point instructions are used for MatMul (dot product).
  • FIG.4 shows a method 40 of compressing a pre-trained AI model.
  • the method 40 may be implemented in one or more modules as a plurality of logic instructions (e.g., compression instructions) stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., in hardware, or any combination thereof.
  • logic instructions e.g., compression instructions
  • RAM random access memory
  • ROM read only memory
  • PROM programmable ROM
  • firmware flash memory
  • hardware implementations may include configurable logic, fixed-functionality logic, or any combination thereof.
  • configurable logic e.g., configurable hardware
  • PLAs programmable logic arrays
  • FPGAs field programmable gate arrays
  • CPLDs complex programmable logic devices
  • general purpose microprocessors
  • fixed-functionality logic e.g., fixed-functionality hardware
  • ASICs application specific integrated circuits
  • the configurable or fixed-functionality logic can be implemented with complementary metal oxide semiconductor (CMOS) logic circuits, transistor-transistor logic (TTL) logic circuits, or other circuits.
  • CMOS complementary metal oxide semiconductor
  • TTL transistor-transistor logic
  • computer program code to carry out operations shown in the method 40 can be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • Illustrated block 42 provides for determining a weight scale factor for a plurality of source weights in a pre-trained model, wherein the plurality of source weights are contained within a source range. In an embodiment, the weight scale factor is determined based on a maximum value in the plurality of source weights.
  • Block 44 converts the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range (e.g., [-1, 1]).
  • Block 44 may also convert the plurality of source weights into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights.
  • the first data type format e.g., RF data type format
  • the first data type format corresponding to the plurality of quantized weights is interpretable as an FP data type format and an INT data type format.
  • a first bit-width of the first data type format may be less than a second bit-width of a second data type format (e.g., FP32) corresponding to the plurality of source weights and less than a third bit-width of the INT data type format (e.g., INT8).
  • the first bit-width of the first data type is 4-bits (e.g., RF4) in one example.
  • Other bit-widths may be used, however, for the RF data type format.
  • Block 46 generates an output AI model based on the plurality of quantized weights and the weight scale factor.
  • the method 40 therefore enhances performance at least to the extent that generating the output AI model based on the plurality of quantized weights and the weight scale factor enables accurate post-training and training-time weight compression of the output AI model and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets.
  • the RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.).
  • FIG.5 shows a method 50 of decompressing an output AI model.
  • the method 50 may be implemented in one or more modules as a plurality of logic instructions (e.g., decompression instructions) stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., in hardware, or any combination thereof.
  • Illustrated processing block 52 provides for determining whether a plurality of input activations are in an FP data type format (e.g., FP16, FP32) or an INT data type format (e.g., INT8).
  • Block 52 may determine the format of the input activations based on an ISA instruction (e.g., lookup instruction) associated with the underlying hardware.
  • ISA instruction e.g., lookup instruction
  • block 56 interprets, during an FP matrix multiplication (e.g., dot product), a first data type format (e.g., RF) corresponding to a plurality of quantized weights as the FP data type format.
  • block 56 conducts the FP matrix multiplication based on a lookup table associated with the first data type format and a weight scale factor associated with the plurality of quantized weights.
  • block 60 interprets, during an INT matrix multiplication (e.g., dot product), the first data type format as the INT data type format.
  • block 60 conducts the INT matrix multiplication based on a lookup table associated with the first data type format, a weight scale factor associated with the plurality of quantized weights, a fixed scale factor (e.g., 127), and an activation scale factor associated with the plurality of input activations.
  • the fixed scale factor can be associated with an integer range (e.g., [-127, 127]) of the INT data type format.
  • a first bit-width (e.g., 4-bits) of the first data type can be less than a second bit-width (e.g., 32-bits) of the FP data type and a third bit-width (e.g., 8-bits) of the INT data type format.
  • the plurality of quantized weights are contained within a quantization range such as, for example, [-1, 1]. The method 50 therefore enhances performance at least to the extent that the RF type format improves accuracy and/or reduces latency during inference operations.
  • FIG. 6 shows a performance-enhanced computing system 280.
  • the system 280 may generally be part of an electronic device/platform having computing functionality (e.g., personal digital assistant/PDA, notebook computer, tablet computer, convertible tablet, server), communications functionality (e.g., smart phone), imaging functionality (e.g., camera, camcorder), media playing functionality (e.g., smart television/TV), wearable functionality (e.g., watch, eyewear, headwear, footwear, jewelry), vehicular functionality (e.g., car, truck, motorcycle), robotic functionality (e.g., autonomous robot), Internet of Things (IoT) functionality, drone functionality, etc., or any combination thereof.
  • computing functionality e.g., personal digital assistant/PDA, notebook computer, tablet computer, convertible tablet, server
  • communications functionality e.g., smart phone
  • imaging functionality e.g., camera, camcorder
  • media playing functionality e.g., smart television/TV
  • wearable functionality e.g., watch, eyewear, headwear, footwear, jewelry
  • vehicular functionality e.g., car, truck, motorcycle
  • the system 280 includes a host processor 282 (e.g., central processing unit/CPU) having an integrated memory controller (IMC) 284 that is coupled to a system memory 286 (e.g., dual inline memory module/DIMM).
  • IMC integrated memory controller
  • an IO module 288 is coupled to the host processor 282.
  • the illustrated IO module 288 communicates with, for example, a display 290 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), and a network controller 292 (e.g., conducting wired and/or wireless communications).
  • the host processor 282 may be combined with the IO module 288, a graphics processor 294, and an AI accelerator 296 into a system on chip (SoC) 298.
  • SoC system on chip
  • the AI accelerator 296, the host processor 282 and/or the SoC 298 executes a plurality of executable program instructions 300 (e.g., compression and/or decompression instructions) retrieved from mass storage 302 and/or the system memory 286 to perform one or more aspects of the method 40 (FIG. 4) and/or the method 50 (FIG.5), already discussed.
  • a plurality of executable program instructions 300 e.g., compression and/or decompression instructions
  • execution of the instructions 300 causes the AI accelerator 296, the host processor 282 and/or the SoC 298 to determine a weight scale factor for a plurality of source weights in a pre-trained AI model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor.
  • a first data type format (e.g., reinterpretable floating point/RF) corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format.
  • execution of the instructions 300 causes the AI accelerator 296, the host processor 282 and/or the SoC to determine whether a plurality of input activations are in the floating point data type format or the integer data type format. If the plurality of input activations are in the floating point data type format, the first data type format is interpreted as the floating point data type format during a floating point matrix multiplication. If the plurality of input activations are in the integer data type format, the first data type format is interpreted as the integer data type format during an integer matrix multiplication.
  • the computing system 280 is therefore considered performance-enhanced at least to the extent that generating the output AI model based on the plurality of quantized weights and the weight scale factor enables accurate post-training and training-time weight compression of the output AI model and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets.
  • the RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.). Additionally, the RF type format improves accuracy and/or reduces latency during inference operations.
  • FIG. 7 shows a semiconductor apparatus 350 (e.g., chip, die, package).
  • the illustrated apparatus 350 includes one or more substrates 352 (e.g., silicon, sapphire, gallium arsenide) and logic 354 (e.g., transistor array and other integrated circuit/IC components) coupled to the substrate(s) 352.
  • the logic 354 implements one or more aspects of the method 40 (FIG.4) and/or the method 50 (FIG. 5), already discussed.
  • the logic 354 may be implemented at least partly in configurable or fixed- functionality hardware.
  • the logic 354 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 352. Thus, the interface between the logic 354 and the substrate(s) 352 may not be an abrupt junction.
  • the logic 354 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 352.
  • the method 40 (FIG. 4) and/or the method 50 (FIG. 5) are incorporated into an INTEL OPENVINO toolkit, which streamlines AI model development and integration of deep learning in domains such as computer vision, large language models, and generative AI.
  • the use of an RF data type format as described herein improves accuracy and/or reduces latency during inference operations.
  • FIG. 8 illustrates a processor core 400 according to one embodiment.
  • the processor core 400 may be the core for any type of processor, such as a micro-processor, an embedded processor, a digital signal processor (DSP), a network processor, or other device to execute code. Although only one processor core 400 is illustrated in FIG.8, a processing element may alternatively include more than one of the processor core 400 illustrated in FIG. 8.
  • the processor core 400 may be a single-threaded core or, for at least one embodiment, the processor core 400 may be multithreaded in that it may include more than one hardware thread context (or “logical processor”) per core.
  • FIG. 8 also illustrates a memory 470 coupled to the processor core 400.
  • the memory 470 may be any of a wide variety of memories (including various layers of memory hierarchy) as are known or otherwise available to those of skill in the art.
  • the memory 470 may include one or more code 413 instruction(s) to be executed by the processor core 400, wherein the code 413 may implement the method 40 (FIG.4) and/or the method 50 (FIG.5), already discussed.
  • the processor core 400 follows a program sequence of instructions indicated by the code 413. Each instruction may enter a front end portion 410 and be processed by one or more decoders 420.
  • the decoder 420 may generate as its output a micro operation such as a fixed width micro operation in a predefined format, or may generate other instructions, microinstructions, or control signals which reflect the original code instruction.
  • the illustrated front end portion 410 also includes register renaming logic 425 and scheduling logic 430, which generally allocate resources and queue the operation corresponding to the convert instruction for execution.
  • the processor core 400 is shown including execution logic 450 having a set of execution units 455-1 through 455-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function.
  • the illustrated execution logic 450 performs the operations specified by code instructions. After completion of execution of the operations specified by the code instructions, back end logic 460 retires the instructions of the code 413. In one embodiment, the processor core 400 allows out of order execution but requires in order retirement of instructions.
  • Retirement logic 465 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like).
  • a processing element may include other elements on chip with the processor core 400.
  • a processing element may include memory control logic along with the processor core 400.
  • the processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic.
  • the processing element may also include one or more caches. Referring now to FIG. 9, shown is a block diagram of a computing system 1000 in accordance with an embodiment.
  • Shown in FIG. 9 is a multiprocessor system 1000 that includes a first processing element 1070 and a second processing element 1080. While two processing elements 1070 and 1080 are shown, it is to be understood that an embodiment of the system 1000 may also include only one such processing element.
  • the system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in FIG. 9 may be implemented as a multi-drop bus rather than a point-to-point interconnect.
  • As shown in FIG. 9, each of processing elements 1070 and 1080 may be multicore processors, including first and second processor cores (i.e., processor cores 1074a and 1074b and processor cores 1084a and 1084b). Such cores 1074a, 1074b, 1084a, 1084b may be configured to execute instruction code in a manner similar to that discussed above in connection with FIG. 8.
  • Each processing element 1070, 1080 may include at least one shared cache 1896a, 1896b.
  • the shared cache 1896a, 1896b may store data (e.g., instructions) that are utilized by one or more components of the processor, such as the cores 1074a, 1074b and 1084a, 1084b, respectively.
  • the shared cache 1896a, 1896b may locally cache data stored in a memory 1032, 1034 for faster access by components of the processor.
  • the shared cache 1896a, 1896b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While shown with only two processing elements 1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor.
  • processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array.
  • additional processing element(s) may include additional processor(s) that are the same as a first processor 1070, additional processor(s) that are heterogeneous or asymmetric to the first processor 1070, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element.
  • accelerators such as, e.g., graphics accelerators or digital signal processing (DSP) units
  • DSP digital signal processing
  • the various processing elements 1070, 1080 may reside in the same die package.
  • the first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078.
  • the second processing element 1080 may include a MC 1082 and P-P interfaces 1086 and 1088.
  • MC’s 1072 and 1082 couple the processors to respective memories, namely a memory 1032 and a memory 1034, which may be portions of main memory locally attached to the respective processors.
  • the first processing element 1070 and the second processing element 1080 may be coupled to an I/O subsystem 1090 via P-P interconnects 1076 and 1086, respectively.
  • the I/O subsystem 1090 includes P-P interfaces 1094 and 1098.
  • I/O subsystem 1090 includes an interface 1092 to couple I/O subsystem 1090 with a high performance graphics engine 1038.
  • bus 1049 may be used to couple the graphics engine 1038 to the I/O subsystem 1090.
  • I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096.
  • the first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.
  • PCI Peripheral Component Interconnect
  • various I/O devices 1014 e.g., biometric scanners, speakers, cameras, sensors
  • the second bus 1020 may be a low pin count (LPC) bus.
  • Various devices may be coupled to the second bus 1020 including, for example, a keyboard/mouse 1012, communication device(s) 1026, and a data storage unit 1019 such as a disk drive or other mass storage device which may include code 1030, in one embodiment.
  • the illustrated code 1030 may implement the method 40 (FIG. 4) and/or the method 50 (FIG.5), already discussed.
  • an audio I/O 1024 may be coupled to second bus 1020 and a battery 1010 may supply power to the computing system 1000. Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of FIG. 9, a system may implement a multi-drop bus or another such communication topology.
  • Example 1 includes a performance-enhanced computing system comprising a network controller, a processor coupled to the network controller, and a memory coupled to the processor, the memory including a plurality of compression instructions, which when executed by the processor, cause the processor to determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor.
  • AI artificial intelligence
  • Example 2 includes the computing system of Example 1, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format.
  • Example 3 includes the computing system of Example 2, wherein the memory further includes a plurality of decompression instructions, which when executed by the processor cause the processor to determine whether a plurality of input activations are in the floating point data type format or the integer data type format, interpret, during a floating point matrix multiplication, the first data type format corresponding to the plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format.
  • Example 4 includes the computing system of Example 2, wherein a first bit-width of the first data type format is less than a third bit-width of the integer data type format.
  • Example 5 includes the computing system of Example 4, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits.
  • Example 6 includes the computing system of Example 1, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights.
  • Example 7 includes at least one computer readable storage medium comprising a plurality of compression instructions, which when executed by a computing system, cause the computing system to determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor.
  • AI artificial intelligence
  • Example 8 includes the at least one computer readable storage medium of Example 7, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format.
  • Example 9 includes the at least one computer readable storage medium of Example 8, wherein a first bit-width of the first data type format is less than a third bit-width of the integer data type format.
  • Example 10 includes the at least one computer readable storage medium of Example 9, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits.
  • Example 11 includes the at least one computer readable storage medium of Example 7, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights.
  • Example 12 includes the at least one computer readable storage medium of any one of Examples 7 to 11, wherein the weight scale factor is determined based on a maximum value in the plurality of source weights.
  • Example 13 includes the at least one computer readable storage medium of any one of Examples 7 to 11, wherein the quantization range is [-1, 1].
  • Example 14 includes at least one computer readable storage medium comprising a plurality of decompression instructions, which when executed by a computing system, cause the computing system to determine whether a plurality of input activations are in a floating point data type format or an integer data type format, interpret, during a floating point matrix multiplication, a first data type format corresponding to a plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format.
  • Example 15 includes the at least one computer readable storage medium of Example 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the floating point matrix multiplication based on a lookup table associated with the first data type format and a weight scale factor associated with the plurality of quantized weights.
  • Example 16 includes the at least one computer readable storage medium of Example 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the integer matrix multiplication based on a lookup table associated with the first data type format, a weight scale factor associated with the plurality of quantized weights, a fixed scale factor, and an activation scale factor associated with the plurality of input activations.
  • Example 17 includes the at least one computer readable storage medium of Example 16, wherein the fixed scale factor is associated with an integer range of the integer data type format, and wherein the integer range is [-127, 127].
  • Example 18 includes the at least one computer readable storage medium of any one of Examples 14 to 17, wherein a first bit-width of the first data type format is less than a second bit-width of the floating point data type format and a third bit-width of the integer data type format.
  • Example 19 includes the at least one computer readable storage medium of Example 18, wherein the first bit-width of the first data type format is 4-bits.
  • Example 20 includes the at least one computer readable storage medium of any one of Examples 14 to 17, wherein the plurality of quantized weights are contained within a quantization range, and wherein the quantization range is [-1, 1].
  • Example 21 includes a method of compressing a pre-trained artificial intelligence (AI) model, the method comprising determining a weight scale factor for a plurality of source weights in the pre-trained AI model, wherein the plurality of source weights are contained within a source range, converting the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generating an output AI model based on the plurality of quantized weights and the weight scale factor.
  • AI artificial intelligence
  • Example 22 includes a method of decompressing an output artificial intelligence (AI) model, the method comprising determining whether a plurality of input activations are in a floating point data type format or an integer data type format, interpreting, during a floating point matrix multiplication, a first data type format corresponding to a plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpreting, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format.
  • Example 23 includes an apparatus comprising means for performing the method of any of Examples 21 to 22. Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips.
  • IC semiconductor integrated circuit
  • Examples of IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like.
  • PLAs programmable logic arrays
  • SoCs systems on chip
  • SSD/NAND controller ASICs
  • signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit.
  • Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
  • Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured.
  • well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments.
  • Coupled may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections.
  • the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
  • a list of items joined by the term “one or more of” may mean any combination of the listed terms.
  • the phrases “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Systems, apparatuses and methods may provide for technology that determines a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the source weights are contained within a source range, converts the source weights into a plurality of quantized weights based on the weight scale factor, wherein the quantized weights are contained within a quantization range, and generates an output AI model based on the plurality of quantized weights and the weight scale factor. In addition, a first data type format corresponding to the quantized weights is interpretable as a floating point data type format and an integer data type format.

Description

REINTERPRETABLE DATA TYPE FORMAT FOR ACCURATE AND EFFICIENT MODEL COMPRESSION

BACKGROUND

A large language model (LLM) is a type of language model notable for the ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by using large amounts of data to learn billions of parameters during training and consuming large computational resources during training and operation (e.g., inference). Weight data used by LLMs may originally be in a relatively high precision format such as, for example, the 32-bit floating point (FP32) format. Execution of LLMs on edge/client devices may be limited due to memory pressure during the loading of weights throughout the inference process. Quantizing the weight data used by LLMs to a lower-precision format such as, for example, 4-bit integer (INT4), can reduce the computational and memory demands of these modern architectures. Conventional quantization approaches, however, may encounter accuracy problems that negate the benefits of quantization. Additionally, the weight representation format may not run efficiently on all instruction sets, which limits data type adoption.

BRIEF DESCRIPTION OF THE DRAWINGS

The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which: FIG. 1 is an illustration of an example of an artificial intelligence (AI) model compression solution according to an embodiment; FIG. 2 is an illustration of an example of a decision tree according to an embodiment; FIG. 3 is an illustration of an example of an AI model decompression solution according to an embodiment; FIG. 4 is a flowchart of an example of a method of compressing a pre-trained AI model according to an embodiment; FIG. 5 is a flowchart of an example of a method of decompressing an output AI model according to an embodiment; FIG. 6 is a block diagram of an example of a performance-enhanced computing system according to an embodiment; FIG. 7 is an illustration of an example of a semiconductor package apparatus according to an embodiment; FIG. 8 is a block diagram of an example of a processor according to an embodiment; and FIG. 9 is a block diagram of an example of a multi-processor based computing system according to an embodiment.

DETAILED DESCRIPTION

Current solutions in the model optimization domain may include methods for large language model (LLM) optimization that are inaccurate or inefficient, as well as post-optimization methods for accuracy improvement that do not focus on LLM optimization or involve relatively long model tuning procedures. For example, generative pre-trained transformer quantization (GPTQ) may quantize model weights in a layer-wise fashion into 4-bit integer (INT4) precision by default using a zero-point and scale factor. A substantial challenge of GPTQ, however, is that the data type (INT4) that is used does not consider the nature of the weight distribution, which mostly corresponds to a unimodal normal distribution. As a result, the compressed models are less accurate despite the complicated compression process. Another alternative is a 4-bit normal floating point (NF4) data type that may contain sixteen ($2^4$) floating point values in the range [-1,1] that are based on the quantiles of the normal distribution.
Although this type allows a more accurate representation of the model weights and may lead to better accuracy after weight compression, one limitation of the NF4 type is that the data type can be used only with floating-point instructions (e.g., 32-bit floating point/FP32, 16-bit floating point/FP16, 16-bit brain floating point/Bfloat16) when computing dot product operations during inference. The NF4 data type does not allow the use of 8-bit integer (INT8) instructions, which are more power-efficient, performant, and available in most contemporary hardware (e.g., central processing units/CPUs, integrated and discrete graphics processing units/GPUs, and specialized accelerators such as network processing units/NPUs). Other proposed methods similar to NF4 include palettization, which may cluster and fit weights into a set of floating-point values. Although palettization solutions may improve accuracy, such solutions are typically not efficient during inference for certain platforms.

The technology described herein provides a 4-bit reinterpretable floating point (RF4) data type, which is a unique 4-bit data type to store weights that can be interpreted as the floating-point data type or cast to an 8-bit integer data type depending on the available instruction set. As a result, the proposed data type is performant in various types of hardware (HW) architectures. In addition, the proposed data type allows for preserving compressed model accuracy on a level similar to other less compute-efficient data types. The proposed RF4 type enables accurate post-training and training-time weight compression of Deep Learning models and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets. RF4 is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.).

Turning now to FIG. 1, a compression solution is shown in which a pre-trained AI model 10 (e.g., source LLM) includes a plurality of source weights 12 (e.g., weight tensor, group of weights) that are in a floating point (FP) data type (e.g., FP32) and contained within a source range. A set of model parameters 14 is determined based on the source weights 12, wherein the model parameters 14 include information such as a weight scale factor, a decision tree, a lookup table, and so forth. In one example, the weight scale factor is determined based on a maximum value in the plurality of source weights 12. The plurality of source weights 12 are converted into a plurality of quantized weights 16 (16a, 16b) based on the model parameters 14. As will be discussed in greater detail, the quantized weights 16 are in the reinterpretable format (RF) data type and contained within a quantization range (e.g., [-1, 1]), wherein the RF data type is interpretable as quantized weights 16a in an FP data type format and quantized weights 16b in an integer (INT) data type format. An output AI model 18 is generated based on the plurality of quantized weights 16 and the model parameters 14. Configuring the quantized weights 16 in the RF data type enables accurate post-training and training-time weight compression of the output AI model 18 and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets. The RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.).
In one example, the bit-width of the RF data type format is 4-bits (e.g., RF4). Thus, considering a single linear (e.g., Fully-Connected) layer that is optimized with the RF4 data type format, the weights of the source and optimized models have the following relationship: $w_{optimized} = \mathrm{RF4}\left(w_{source}/scale\right) \cdot scale$, where $scale$ is estimated during compression and stored in the output AI model 18, and $\mathrm{RF4}(\cdot)$ maps the floating-point value from the range [-1,1] to the nearest value of the RF4 type.

More particularly, the RF4 data type leverages the fact that model weights are normally distributed around zero. First, the [-1,1] interval is considered because any weight tensor can be projected into that interval using the following formula: $w'_i = w_i / \max_j |w_j|$. Then, the $2^k = 16$ values $q_i$ of the data type are estimated using the following equation: $q_i = \frac{1}{2}\left(Q\left(\frac{i}{2^k+1}\right) + Q\left(\frac{i+1}{2^k+1}\right)\right)$, where $Q$ is the quantile function of the normal distribution. This, however, yields symmetrically distributed values with no zero value. To tackle this problem, $2^{k-1}$ negative values and $2^{k-1}+1$ positive values are sampled and one of the two zero values is dropped, which results overall in exactly $2^k$ asymmetrically distributed values in the range [-1,1]. Next, each value is projected into the integer range of [-127, 127] using the equation: $RF4^{int8}_i = \mathrm{round}(q_i \cdot 127)$, where round() is the operation of rounding to the nearest neighbor. Finally, the following set of values is obtained, which represents projections of the NF4 data type into the INT8 data type: $RF4^{int8} = [-127, -88, -67, -50, -36, -23, -12, 0, 10, 20, 31, 43, 56, 71, 92, 127]$. Scaling these values back to floating point precision (e.g., by multiplying by 1/127.0) forms the floating-point representation of the RF4 data type: $RF4^{float} = [-1, -0.69291339, -0.52755906, -0.39370079, -0.28346457, -0.18110236, -0.09448819, 0, 0.07874016, 0.15748031, 0.24409449, 0.33858268, 0.44094488, 0.55905512, 0.72440945, 1]$. Both representations are connected as follows: $RF4^{float}_i = RF4^{int8}_i \cdot \frac{1}{127.0}$.

The quantization process from FP32 to the RF4 type is as follows:
- The weight scale factor that projects weights to [-1,1] is computed as follows: $scale = \max_i |w_i|$. This scale can be computed per weight tensor or per group of weights (e.g., for each 128 weights). Thus, every weight tensor of RF4 values in the model is accompanied by the scaling factor.
- Weights are projected to the [-1,1] range using pre-computed scales: $w'_i = w_i / scale$.
- RF4 4-bit “nibbles” (e.g., half a byte) are determined using a decision tree associated with the RF4 data type format.

FIG. 2 shows a decision tree 30 that may be used to convert the source weights 12 (FIG. 1) into the quantized weights 16 (FIG. 1). In general, x is a quantized weight value in the [-1, 1] quantization range and the decision tree 30 returns one of sixteen values in an RF4 lookup table based on the location of the quantized weight value relative to the $RF4^{float}$ floating point representation of the RF4 data type. For example, a return operation 32 outputs the binary value “1111” if the quantized weight value is greater than the median value (e.g., “0.862204725”) between “0.72440945” and “1”. By contrast, a return operation 34 outputs the binary value “1110” if the quantized weight value is greater than the median value (e.g., “0.641732285”) between “0.55905512” and “0.72440945”.
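To make the quantization concrete, the following Python sketch (not part of the patent text) hardcodes the $RF4^{int8}$ values listed above, derives the floating-point table from them, and converts a group of weights into 4-bit codes plus a scale factor. The names RF4_INT8, RF4_FLOAT, rf4_quantize and rf4_dequantize are illustrative assumptions, and a nearest-level search stands in for the decision tree 30 of FIG. 2, which reaches the same result with explicit midpoint comparisons.

```python
import numpy as np

# INT8 projection of the sixteen RF4 levels, as listed in the description above.
RF4_INT8 = np.array([-127, -88, -67, -50, -36, -23, -12, 0,
                     10, 20, 31, 43, 56, 71, 92, 127], dtype=np.int8)

# Floating-point representation: RF4_float_i = RF4_int8_i * (1 / 127.0).
RF4_FLOAT = RF4_INT8.astype(np.float32) / 127.0

def rf4_quantize(weights):
    """Quantize an FP32 weight group to 4-bit RF4 codes plus one scale factor."""
    scale = float(np.max(np.abs(weights)))   # weight scale factor
    normalized = weights / scale             # projection into the [-1, 1] range
    # Nearest-level search; the decision tree of FIG. 2 produces the same codes
    # by comparing against the midpoints (medians) between adjacent levels.
    codes = np.abs(normalized[:, None] - RF4_FLOAT[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale     # 4-bit "nibbles" (held in uint8) and scale

def rf4_dequantize(codes, scale):
    """Reconstruct approximate FP32 weights from RF4 codes and the scale factor."""
    return RF4_FLOAT[codes] * scale

# Example: quantize a group of 128 weights and check the reconstruction error.
w = np.random.randn(128).astype(np.float32)
codes, s = rf4_quantize(w)
print(np.max(np.abs(w - rf4_dequantize(codes, s))))
```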
Although the quantization process can be computationally expensive, quantization may not have any impact on inference latency since it is typically conducted offline during model preparation prior to deployment.

FIG. 3 shows a decompression solution in which weight dequantization occurs at inference time (e.g., right before a matrix multiplication/MatMul operation). In the illustrated example, the output AI model 18 includes the reinterpretable quantized weights 16. A determination is made (e.g., based on the instruction set architecture/ISA) as to whether a plurality of input activations 20 are in the FP data type format (e.g., FP32) or the INT data type format (e.g., INT8). If the plurality of input activations 20 are in the FP data type format, the quantized weights 16 in the RF data type format are interpreted as the quantized weights 16a in the FP data type format during an FP matrix multiplication. In such a case, the FP matrix multiplication may be conducted based on first model parameters 22 including the weight scale factor associated with the plurality of quantized weights 16 and the lookup table associated with the RF data type format. If the plurality of input activations 20 are in the INT data type format, the quantized weights 16 in the RF data type are interpreted as the quantized weights 16b in the INT data type format during an INT matrix multiplication. In such a case, the INT matrix multiplication may be conducted based on the first model parameters 22 as well as second model parameters 24 including a fixed scale factor and an activation scale factor associated with the plurality of input activations 20. As will be discussed in greater detail, the fixed scale factor can be associated with an integer range (e.g., [-127, 127]) of the INT data type format. In the illustrated example, either the FP matrix multiplication or the INT matrix multiplication generates matrix multiplication results 26. Thus, there are two possible options:
- Floating-point instructions are used for MatMul (dot product). In this case, the quantized weights 16a are mapped to the floating-point range using the lookup table and scaled back to the source range using the weight scaling factor: $y = \sum_i RF4^{float}_{c_i} \cdot scale_w \cdot x_i \approx \sum_i w_i \cdot x_i$.
- Integer instructions are used for MatMul (dot product). In this case, the input activations are quantized to 8 bits (e.g., using dynamic quantization) with an activation scale factor $scale_x$, and the RF4 weights are converted into the INT8 data type using the $RF4^{int8}$ values defined above: $RF4^{int8}_i = RF4^{float}_i \cdot 127.0$. Then, the dot product operation can be computed in 8-bit precision as follows: $y = \sum_i RF4^{int8}_{c_i} \cdot x^{int8}_i \cdot \frac{scale_w \cdot scale_x}{127.0} \approx \sum_i w_i \cdot x_i$. Computing the dot product with INT8 instructions in this manner provides performance gains in terms of latency.
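Continuing the sketch above (and reusing its RF4_INT8/RF4_FLOAT tables and the codes and s produced there), the two MatMul options might be expressed as follows. The function names, the dynamic activation-quantization step, and the use of NumPy in place of real FP or INT8 dot-product instructions are illustrative assumptions rather than the patented implementation.

```python
import numpy as np

FIXED_SCALE = 127.0  # fixed scale factor tied to the [-127, 127] integer range

def rf4_matmul_fp(codes, scale_w, x):
    """FP option: look up RF4 float values, rescale to the source range, FP dot product."""
    w = RF4_FLOAT[codes] * scale_w           # first model parameters: lookup table + weight scale
    return float(np.dot(w, x))

def rf4_matmul_int8(codes, scale_w, x):
    """INT option: reinterpret the same 4-bit codes as INT8 levels and use 8-bit dot products."""
    scale_x = float(np.max(np.abs(x))) / FIXED_SCALE            # activation scale factor (dynamic)
    x_int8 = np.clip(np.round(x / scale_x), -127, 127).astype(np.int8)
    w_int8 = RF4_INT8[codes]                                    # integer interpretation of the weights
    acc = int(np.dot(w_int8.astype(np.int32), x_int8.astype(np.int32)))  # INT8 MACs, wide accumulator
    # Fold the weight scale, the activation scale, and the fixed 1/127 factor back in.
    return acc * scale_w * scale_x / FIXED_SCALE

# Both paths approximate the same dot product; the choice is made once per deployment,
# based on whether the target ISA exposes efficient INT8 instructions.
x = np.random.randn(128).astype(np.float32)
print(rf4_matmul_fp(codes, s, x), rf4_matmul_int8(codes, s, x))
```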
FIG.4 shows a method 40 of compressing a pre-trained AI model. The method 40 may be implemented in one or more modules as a plurality of logic instructions (e.g., compression instructions) stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., in hardware, or any combination thereof. For example, hardware implementations may include configurable logic, fixed-functionality logic, or any combination thereof. Examples of configurable logic (e.g., configurable hardware) include suitably configured programmable logic arrays (PLAs), field programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), and general purpose microprocessors. Examples of fixed-functionality logic (e.g., fixed-functionality hardware) include suitably configured application specific integrated circuits (ASICs), combinational logic circuits, and sequential logic circuits. The configurable or fixed-functionality logic can be implemented with complementary metal oxide semiconductor (CMOS) logic circuits, transistor-transistor logic (TTL) logic circuits, or other circuits. For example, computer program code to carry out operations shown in the method 40 can be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.). Illustrated block 42 provides for determining a weight scale factor for a plurality of source weights in a pre-trained model, wherein the plurality of source weights are contained within a source range. In an embodiment, the weight scale factor is determined based on a maximum value in the plurality of source weights. Block 44 converts the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range (e.g., [-1, 1]). Block 44 may also convert the plurality of source weights into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights. As already noted, the first data type format (e.g., RF data type format) corresponding to the plurality of quantized weights is interpretable as an FP data type format and an INT data type format. Moreover, a first bit-width of the first data type format may be less than a second bit-width of a second data type format (e.g., FP32) corresponding to the plurality of source weights and less than a third bit-width of the INT data type format (e.g., INT8). In one example, the first bit-width of the first data type format is 4-bits (e.g., RF4). Other bit-widths may be used, however, for the RF data type format. Block 46 generates an output AI model based on the plurality of quantized weights and the weight scale factor. The method 40 therefore enhances performance at least to the extent that generating the output AI model based on the plurality of quantized weights and the weight scale factor enables accurate post-training and training-time weight compression of the output AI model and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets. The RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.).
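As an illustrative sketch of blocks 42, 44 and 46 only (the names pack_nibbles, compress_tensor and compress_model are hypothetical, the 4-bit packing layout is an assumption, and tensor sizes are assumed divisible by the group size), each weight tensor could be stored as packed RF4 nibbles accompanied by its per-group scales:

```python
import numpy as np

RF4_FLOAT = np.array([-127, -88, -67, -50, -36, -23, -12, 0,
                      10, 20, 31, 43, 56, 71, 92, 127]) / 127.0

def pack_nibbles(codes: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit RF4 codes into single bytes (two weights per byte)."""
    pairs = codes.reshape(-1, 2).astype(np.uint8)
    return pairs[:, 0] | (pairs[:, 1] << 4)

def compress_tensor(w: np.ndarray, group_size: int = 128):
    """Blocks 42/44/46 for one tensor: per-group scale, nearest RF4 level, packed nibbles."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True)                   # block 42
    codes = np.abs((groups / scale)[..., None] - RF4_FLOAT).argmin(-1)  # block 44
    return pack_nibbles(codes.ravel()), scale.astype(np.float32)        # block 46

def compress_model(weights: dict) -> dict:
    """Output AI model: packed RF4 weights plus per-group scales for every tensor."""
    return {name: compress_tensor(w.ravel()) for name, w in weights.items()}
```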
FIG.5 shows a method 50 of decompressing an output AI model. The method 50 may be implemented in one or more modules as a plurality of logic instructions (e.g., decompression instructions) stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., in hardware, or any combination thereof. Illustrated processing block 52 provides for determining whether a plurality of input activations are in an FP data type format (e.g., FP16, FP32) or an INT data type format (e.g., INT8). Block 52 may determine the format of the input activations based on an ISA instruction (e.g., lookup instruction) associated with the underlying hardware. If it is determined at block 54 that the input activations are in the FP data type format, block 56 interprets, during an FP matrix multiplication (e.g., dot product), a first data type format (e.g., RF) corresponding to a plurality of quantized weights as the FP data type format. In an embodiment, block 56 conducts the FP matrix multiplication based on a lookup table associated with the first data type format and a weight scale factor associated with the plurality of quantized weights. If it is determined at block 58 that the input activations are in the INT data type format, block 60 interprets, during an INT matrix multiplication (e.g., dot product), the first data type format as the INT data type format. In an embodiment, block 60 conducts the INT matrix multiplication based on a lookup table associated with the first data type format, a weight scale factor associated with the plurality of quantized weights, a fixed scale factor (e.g., 127), and an activation scale factor associated with the plurality of input activations. As already noted, the fixed scale factor can be associated with an integer range (e.g., [-127, 127]) of the INT data type format. Moreover, a first bit-width (e.g., 4-bits) of the first data type format can be less than a second bit-width (e.g., 32-bits) of the FP data type format and a third bit-width (e.g., 8-bits) of the INT data type format. In an embodiment, the plurality of quantized weights are contained within a quantization range such as, for example, [-1, 1]. The method 50 therefore enhances performance at least to the extent that the RF data type format improves accuracy and/or reduces latency during inference operations. Turning now to FIG. 6, a performance-enhanced computing system 280 is shown. The system 280 may generally be part of an electronic device/platform having computing functionality (e.g., personal digital assistant/PDA, notebook computer, tablet computer, convertible tablet, server), communications functionality (e.g., smart phone), imaging functionality (e.g., camera, camcorder), media playing functionality (e.g., smart television/TV), wearable functionality (e.g., watch, eyewear, headwear, footwear, jewelry), vehicular functionality (e.g., car, truck, motorcycle), robotic functionality (e.g., autonomous robot), Internet of Things (IoT) functionality, drone functionality, etc., or any combination thereof. In the illustrated example, the system 280 includes a host processor 282 (e.g., central processing unit/CPU) having an integrated memory controller (IMC) 284 that is coupled to a system memory 286 (e.g., dual inline memory module/DIMM). In an embodiment, an IO module 288 is coupled to the host processor 282.
The illustrated IO module 288 communicates with, for example, a display 290 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), and a network controller 292 (e.g., conducting wired and/or wireless communications). The host processor 282 may be combined with the IO module 288, a graphics processor 294, and an AI accelerator 296 into a system on chip (SoC) 298. In an embodiment, the AI accelerator 296, the host processor 282 and/or the SoC 298 executes a plurality of executable program instructions 300 (e.g., compression and/or decompression instructions) retrieved from mass storage 302 and/or the system memory 286 to perform one or more aspects of the method 40 (FIG. 4) and/or the method 50 (FIG.5), already discussed. Thus, during compression, execution of the instructions 300 causes the AI accelerator 296, the host processor 282 and/or the SoC 298 to determine a weight scale factor for a plurality of source weights in a pre-trained AI model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor. In an embodiment, a first data type format (e.g., reinterpretable floating point/RF) corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format. During decompression, execution of the instructions 300 causes the AI accelerator 296, the host processor 282 and/or the SoC 298 to determine whether a plurality of input activations are in the floating point data type format or the integer data type format. If the plurality of input activations are in the floating point data type format, the first data type format is interpreted as the floating point data type format during a floating point matrix multiplication. If the plurality of input activations are in the integer data type format, the first data type format is interpreted as the integer data type format during an integer matrix multiplication. The computing system 280 is therefore considered performance-enhanced at least to the extent that generating the output AI model based on the plurality of quantized weights and the weight scale factor enables accurate post-training and training-time weight compression of the output AI model and provides cross-hardware compatibility when running inference operations on various types of HW that have different instruction sets. The RF data type is highly applicable to the optimization of transformer-based models including LLMs for existing HW accelerators (e.g., CPUs, GPUs, NPUs, etc.). Additionally, the RF data type format improves accuracy and/or reduces latency during inference operations. FIG. 7 shows a semiconductor apparatus 350 (e.g., chip, die, package). The illustrated apparatus 350 includes one or more substrates 352 (e.g., silicon, sapphire, gallium arsenide) and logic 354 (e.g., transistor array and other integrated circuit/IC components) coupled to the substrate(s) 352. In an embodiment, the logic 354 implements one or more aspects of the method 40 (FIG.4) and/or the method 50 (FIG. 5), already discussed. The logic 354 may be implemented at least partly in configurable or fixed-functionality hardware.
In one example, the logic 354 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 352. Thus, the interface between the logic 354 and the substrate(s) 352 may not be an abrupt junction. The logic 354 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 352. In an embodiment, the method 40 (FIG. 4) and/or the method 50 (FIG. 5) are incorporated into an INTEL OPENVINO toolkit, which streamlines AI model development and integration of deep learning in domains such as computer vision, large language models, and generative AI. In such a case, the use of an RF data type format as described herein improves accuracy and/or reduces latency during inference operations. FIG. 8 illustrates a processor core 400 according to one embodiment. The processor core 400 may be the core for any type of processor, such as a micro-processor, an embedded processor, a digital signal processor (DSP), a network processor, or other device to execute code. Although only one processor core 400 is illustrated in FIG.8, a processing element may alternatively include more than one of the processor core 400 illustrated in FIG. 8. The processor core 400 may be a single-threaded core or, for at least one embodiment, the processor core 400 may be multithreaded in that it may include more than one hardware thread context (or “logical processor”) per core. FIG. 8 also illustrates a memory 470 coupled to the processor core 400. The memory 470 may be any of a wide variety of memories (including various layers of memory hierarchy) as are known or otherwise available to those of skill in the art. The memory 470 may include one or more code 413 instruction(s) to be executed by the processor core 400, wherein the code 413 may implement the method 40 (FIG.4) and/or the method 50 (FIG.5), already discussed. The processor core 400 follows a program sequence of instructions indicated by the code 413. Each instruction may enter a front end portion 410 and be processed by one or more decoders 420. The decoder 420 may generate as its output a micro operation such as a fixed width micro operation in a predefined format, or may generate other instructions, microinstructions, or control signals which reflect the original code instruction. The illustrated front end portion 410 also includes register renaming logic 425 and scheduling logic 430, which generally allocate resources and queue the operation corresponding to the convert instruction for execution. The processor core 400 is shown including execution logic 450 having a set of execution units 455-1 through 455-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The illustrated execution logic 450 performs the operations specified by code instructions. After completion of execution of the operations specified by the code instructions, back end logic 460 retires the instructions of the code 413. In one embodiment, the processor core 400 allows out of order execution but requires in order retirement of instructions. Retirement logic 465 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like).
In this manner, the processor core 400 is transformed during execution of the code 413, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 425, and any registers (not shown) modified by the execution logic 450. Although not illustrated in FIG. 8, a processing element may include other elements on chip with the processor core 400. For example, a processing element may include memory control logic along with the processor core 400. The processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic. The processing element may also include one or more caches. Referring now to FIG.9, shown is a block diagram of a computing system 1000 in accordance with an embodiment. Shown in FIG.9 is a multiprocessor system 1000 that includes a first processing element 1070 and a second processing element 1080. While two processing elements 1070 and 1080 are shown, it is to be understood that an embodiment of the system 1000 may also include only one such processing element. The system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in FIG.9 may be implemented as a multi-drop bus rather than point-to-point interconnect. As shown in FIG. 9, each of processing elements 1070 and 1080 may be multicore processors, including first and second processor cores (i.e., processor cores 1074a and 1074b and processor cores 1084a and 1084b). Such cores 1074a, 1074b, 1084a, 1084b may be configured to execute instruction code in a manner similar to that discussed above in connection with FIG.8. Each processing element 1070, 1080 may include at least one shared cache 1896a, 1896b. The shared cache 1896a, 1896b may store data (e.g., instructions) that are utilized by one or more components of the processor, such as the cores 1074a, 1074b and 1084a, 1084b, respectively. For example, the shared cache 1896a, 1896b may locally cache data stored in a memory 1032, 1034 for faster access by components of the processor. In one or more embodiments, the shared cache 1896a, 1896b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While shown with only two processing elements 1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, additional processing element(s) may include additional processor(s) that are the same as a first processor 1070, additional processor(s) that are heterogeneous or asymmetric to the first processor 1070, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the processing elements 1070, 1080 in terms of a spectrum of metrics of merit including architectural, micro architectural, thermal, power consumption characteristics, and the like.
These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 1070, 1080. For at least one embodiment, the various processing elements 1070, 1080 may reside in the same die package. The first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078. Similarly, the second processing element 1080 may include a MC 1082 and P-P interfaces 1086 and 1088. As shown in FIG.9, MCs 1072 and 1082 couple the processors to respective memories, namely a memory 1032 and a memory 1034, which may be portions of main memory locally attached to the respective processors. While the MCs 1072 and 1082 are illustrated as integrated into the processing elements 1070, 1080, for alternative embodiments the MC logic may be discrete logic outside the processing elements 1070, 1080 rather than integrated therein. The first processing element 1070 and the second processing element 1080 may be coupled to an I/O subsystem 1090 via P-P interconnects 1076 and 1086, respectively. As shown in FIG. 9, the I/O subsystem 1090 includes P-P interfaces 1094 and 1098. Furthermore, I/O subsystem 1090 includes an interface 1092 to couple I/O subsystem 1090 with a high performance graphics engine 1038. In one embodiment, bus 1049 may be used to couple the graphics engine 1038 to the I/O subsystem 1090. Alternately, a point-to-point interconnect may couple these components. In turn, I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096. In one embodiment, the first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited. As shown in FIG. 9, various I/O devices 1014 (e.g., biometric scanners, speakers, cameras, sensors) may be coupled to the first bus 1016, along with a bus bridge 1018 which may couple the first bus 1016 to a second bus 1020. In one embodiment, the second bus 1020 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 1020 including, for example, a keyboard/mouse 1012, communication device(s) 1026, and a data storage unit 1019 such as a disk drive or other mass storage device which may include code 1030, in one embodiment. The illustrated code 1030 may implement the method 40 (FIG. 4) and/or the method 50 (FIG.5), already discussed. Further, an audio I/O 1024 may be coupled to second bus 1020 and a battery 1010 may supply power to the computing system 1000. Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of FIG. 9, a system may implement a multi-drop bus or another such communication topology. Also, the elements of FIG.9 may alternatively be partitioned using more or fewer integrated chips than shown in FIG.9.
Additional Notes and Examples: Example 1 includes a performance-enhanced computing system comprising a network controller, a processor coupled to the network controller, and a memory coupled to the processor, the memory including a plurality of compression instructions, which when executed by the processor, cause the processor to determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor. Example 2 includes the computing system of Example 1, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format. Example 3 includes the computing system of Example 2, wherein the memory further includes a plurality of decompression instructions, which when executed by the processor, cause the processor to determine whether a plurality of input activations are in the floating point data type format or the integer data type format, interpret, during a floating point matrix multiplication, the first data type format corresponding to the plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format. Example 4 includes the computing system of Example 2, wherein a first bit-width of the first data type format is less than a third bit-width of the integer data type format. Example 5 includes the computing system of Example 4, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits. Example 6 includes the computing system of Example 1, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights. Example 7 includes at least one computer readable storage medium comprising a plurality of compression instructions, which when executed by a computing system, cause the computing system to determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range, convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generate an output AI model based on the plurality of quantized weights and the weight scale factor.
Example 8 includes the at least one computer readable storage medium of Example 7, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format. Example 9 includes the at least one computer readable storage medium of Example 8, wherein a first bit-width of the first data type format is less than a third bit-width of the integer data type format. Example 10 includes the at least one computer readable storage medium of Example 9, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits. Example 11 includes the at least one computer readable storage medium of Example 7, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights. Example 12 includes the at least one computer readable storage medium of any one of Examples 7 to 11, wherein the weight scale factor is determined based on a maximum value in the plurality of source weights. Example 13 includes the at least one computer readable storage medium of any one of Examples 7 to 11, wherein the quantization range is [-1, 1]. Example 14 includes at least one computer readable storage medium comprising a plurality of decompression instructions, which when executed by a computing system, cause the computing system to determine whether a plurality of input activations are in a floating point data type format or an integer data type format, interpret, during a floating point matrix multiplication, a first data type format corresponding to a plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format. Example 15 includes the at least one computer readable storage medium of Example 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the floating point matrix multiplication based on a lookup table associated with the first data type format and a weight scale factor associated with the plurality of quantized weights. Example 16 includes the at least one computer readable storage medium of Example 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the integer matrix multiplication based on a lookup table associated with the first data type format, a weight scale factor associated with the plurality of quantized weights, a fixed scale factor, and an activation scale factor associated with the plurality of input activations. Example 17 includes the at least one computer readable storage medium of Example 16, wherein the fixed scale factor is associated with an integer range of the integer data type format, and wherein the integer range is [-127, 127].
Example 18 includes the at least one computer readable storage medium of any one of Examples 14 to 17, wherein a first bit-width of the first data type format is less than a second bit-width of the floating point data type format and a third bit-width of the integer data type format. Example 19 includes the at least one computer readable storage medium of Example 18, wherein the first bit-width of the first data type format is 4-bits. Example 20 includes the at least one computer readable storage medium of any one of Examples 14 to 17, wherein the plurality of quantized weights are contained within a quantization range, and wherein the quantization range is [-1, 1]. Example 21 includes a method of compressing a pre-trained artificial intelligence (AI) model, the method comprising determining a weight scale factor for a plurality of source weights in the pre-trained AI model, wherein the plurality of source weights are contained within a source range, converting the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range, and generating an output AI model based on the plurality of quantized weights and the weight scale factor. Example 22 includes a method of decompressing an output artificial intelligence (AI) model, the method comprising determining whether a plurality of input activations are in a floating point data type format or an integer data type format, interpreting, during a floating point matrix multiplication, a first data type format corresponding to a plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format, and interpreting, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format. Example 23 includes an apparatus comprising means for performing the method of any of Examples 21 to 22. Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines. Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured.
In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the computing system within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting. The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated. As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrases “one or more of A, B or C” may mean A; B; C; A and B; A and C; B and C; or A, B and C. Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.

Claims

CLAIMS We claim: 1. A performance-enhanced computing system comprising: a network controller; a processor coupled to the network controller; and a memory coupled to the processor, the memory including a plurality of compression instructions, which when executed by the processor, cause the processor to: determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range; convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range; and generate an output AI model based on the plurality of quantized weights and the weight scale factor. 2. The computing system of claim 1, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format. 3. The computing system of claim 2, wherein the memory further includes a plurality of decompression instructions, which when executed by the processor, cause the processor to: determine whether a plurality of input activations are in the floating point data type format or the integer data type format; interpret, during a floating point matrix multiplication, the first data type format corresponding to the plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format; and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format. 4. The computing system of claim 2, wherein a first bit-width of the first data type format is less than a third bit-width of the integer data type format. 5. The computing system of claim 4, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits. 6. The computing system of claim 1, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights. 7. At least one computer readable storage medium comprising a plurality of compression instructions, which when executed by a computing system, cause the computing system to: determine a weight scale factor for a plurality of source weights in a pre-trained artificial intelligence (AI) model, wherein the plurality of source weights are contained within a source range; convert the plurality of source weights into a plurality of quantized weights based on the weight scale factor, wherein the plurality of quantized weights are contained within a quantization range; and generate an output AI model based on the plurality of quantized weights and the weight scale factor. 8. The at least one computer readable storage medium of claim 7, wherein a first data type format corresponding to the plurality of quantized weights is interpretable as a floating point data type format and an integer data type format. 9.
The at least one computer readable storage medium of claim 8, wherein a first bit-width of the first data type format is less than a third bit-width of the integer data type format. 10. The at least one computer readable storage medium of claim 9, wherein the first bit-width of the first data type format is less than a second bit-width of a second data type format corresponding to the plurality of source weights, and wherein the first bit-width of the first data type format is 4-bits. 11. The at least one computer readable storage medium of claim 7, wherein the plurality of source weights are converted into the plurality of quantized weights further based on a decision tree and a lookup table associated with a first data type format corresponding to the plurality of quantized weights. 12. The at least one computer readable storage medium of any one of claims 7 to 11, wherein the weight scale factor is determined based on a maximum value in the plurality of source weights. 13. The at least one computer readable storage medium of any one of claims 7 to 11, wherein the quantization range is [-1, 1]. 14. At least one computer readable storage medium comprising a plurality of decompression instructions, which when executed by a computing system, cause the computing system to: determine whether a plurality of input activations are in a floating point data type format or an integer data type format; interpret, during a floating point matrix multiplication, a first data type format corresponding to a plurality of quantized weights as the floating point data type format if the plurality of input activations are in the floating point data type format; and interpret, during an integer matrix multiplication, the first data type format as the integer data type format if the plurality of input activations are in the integer data type format. 15. The at least one computer readable storage medium of claim 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the floating point matrix multiplication based on a lookup table associated with the first data type format and a weight scale factor associated with the plurality of quantized weights. 16. The at least one computer readable storage medium of claim 14, wherein the plurality of decompression instructions, when executed, further cause the computing system to conduct the integer matrix multiplication based on a lookup table associated with the first data type format, a weight scale factor associated with the plurality of quantized weights, a fixed scale factor, and an activation scale factor associated with the plurality of input activations. 17. The at least one computer readable storage medium of claim 16, wherein the fixed scale factor is associated with an integer range of the integer data type format, and wherein the integer range is [-127, 127]. 18. The at least one computer readable storage medium of any one of claims 14 to 17, wherein a first bit-width of the first data type format is less than a second bit-width of the floating point data type format and a third bit-width of the integer data type format. 19. The at least one computer readable storage medium of claim 18, wherein the first bit-width of the first data type format is 4-bits. 20. The at least one computer readable storage medium of any one of claims 14 to 17, wherein the plurality of quantized weights are contained within a quantization range, and wherein the quantization range is [-1, 1].
PCT/US2024/027423 2024-05-02 2024-05-02 Reinterpretable data type format for accurate and efficient model compression Pending WO2025230529A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2024/027423 WO2025230529A1 (en) 2024-05-02 2024-05-02 Reinterpretable data type format for accurate and efficient model compression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2024/027423 WO2025230529A1 (en) 2024-05-02 2024-05-02 Reinterpretable data type format for accurate and efficient model compression

Publications (1)

Publication Number Publication Date
WO2025230529A1 true WO2025230529A1 (en) 2025-11-06

Family

ID=97561832

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/027423 Pending WO2025230529A1 (en) 2024-05-02 2024-05-02 Reinterpretable data type format for accurate and efficient model compression

Country Status (1)

Country Link
WO (1) WO2025230529A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110009101A (en) * 2019-04-11 2019-07-12 北京字节跳动网络技术有限公司 Method and apparatus for generating quantization neural network
US20200167632A1 (en) * 2018-11-23 2020-05-28 Samsung Electronics Co., Ltd. Neural network device for neural network operation, method of operating neural network device, and application processor including the neural network device
US11556772B2 (en) * 2017-04-28 2023-01-17 Intel Corporation Incremental precision networks using residual inference and fine-grain quantization
US20230410255A1 (en) * 2021-01-22 2023-12-21 Qualcomm Incorporated Decreased quantization latency
US20240104346A1 (en) * 2022-09-15 2024-03-28 Huawei Technologies Co., Ltd. Method and device for compressing generative pre-trained language models via quantization

