WO2025184890A1 - Reduced latency for mixed-precision quantized machine learning models - Google Patents
- Publication number
- WO2025184890A1 (Application PCT/CN2024/080716)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- quantization
- latency
- operations
- machine learning
- precision
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Definitions
- aspects of the present disclosure relate to machine learning.
- machine learning model architectures have proliferated and have been used to provide solutions for a multitude of prediction problems.
- machine learning models generally rely on a set of model parameters having values that are learned or trained based on training data (which may include labeled data and/or unlabeled data) .
- a large number of such parameters are used to provide better utility.
- Bigger models (e.g., models having more parameters) tend to perform better (e.g., with higher prediction accuracy). However, even comparatively small models generally have a relatively large number of parameters and can have a substantial memory footprint.
- Model size has become particularly problematic in resource-constrained scenarios, where it is desired to deploy a trained model on a device having relatively limited resources (e.g., mobile devices, embedded devices, smart vehicles, and the like) .
- Some conventional approaches to ameliorate such concerns involve parameter quantization.
- parameter quantization is an approximation-based process, which inherently introduces error into the model.
- Certain aspects of the present disclosure provide a processor-implemented method, comprising: accessing a quantization profile for a machine learning model, the quantization profile indicating, for each respective operation of a plurality of operations of the machine learning model, a respective quantization precision of a plurality of quantization precisions; generating a first set of modifications for the quantization profile based on conversion latency for converting tensors among the plurality of quantization precisions, wherein each respective modification of the first set of modifications indicates to increase a respective quantization precision of a respective operation of the plurality of operations; generating a modified quantization profile based on modifying the quantization profile using the first set of modifications; and quantizing the machine learning model in accordance with the modified quantization profile.
- processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.
- FIG. 1 depicts an example workflow for quantizing machine learning models using reduced latency mixed-precision quantization, according to some aspects of the present disclosure.
- FIG. 2 depicts an example workflow to generate quantizer group graphs to facilitate mixed-precision quantization, according to some aspects of the present disclosure.
- FIG. 3 is a flow diagram depicting an example method for reduced latency mixed-precision quantized machine learning models, according to some aspects of the present disclosure.
- FIG. 4 is a flow diagram depicting an example method for generating and evaluating quantization profiles to reduce machine learning model latency, according to some aspects of the present disclosure.
- FIG. 5 is a flow diagram depicting an example method for training conversion latency prediction models, according to some aspects of the present disclosure.
- FIG. 6 is a flow diagram depicting an example method for mixed-precision quantization, according to some aspects of the present disclosure.
- FIG. 7 depicts an example processing system configured to perform various aspects of the present disclosure.
- aspects of the present disclosure provide apparatuses, methods, processing systems, and non-transitory computer-readable mediums for providing improved machine learning.
- Machine learning model quantization can be used to reduce the memory footprint of the model while simultaneously reducing the computational expense of executing the model. For example, because quantizing the parameters and/or activations (or other data being processed by the model) reduces their size (e.g., their bit-width) , the quantized (smaller) data can generally be processed with reduced memory usage, reduced compute cycles, and the like. Quantization often similarly leads to reduced latency of processing data using the model (e.g., reduced time to generate a prediction when input data is provided) .
- In some quantization approaches, the model parameters and/or activations are all quantized to the same quantization precision (e.g., converting floating-point values with a bit-width of thirty-two to integer values with a bit-width of eight or sixteen).
- Lower quantization precision (e.g., smaller bit-widths) generally results in reduced computational expense as compared to higher bit-width quantization.
- lower quantization precision also often results in lower model accuracy (as compared to higher bit-width quantization) .
- The impact that quantization has on model accuracy varies substantially depending on the particular operation or component of the model that is being quantized.
- some operations may suffer substantial accuracy reduction when quantized to a low bit-width, while other operations may result in less (or even no) degradation in accuracy when quantized to the same low bit-width.
- mixed-precision quantization is used to mitigate these concerns.
- mixed-precision quantization enables quantizing different operations or portions of the model to different quantization precisions. This can allow for use of less precise quantization (e.g., lower bit-widths) where the reduced precision does not substantially harm model accuracy, while using more precise quantization (e.g., higher bit-widths) where the reduced precision would more substantially harm accuracy.
- this mix of precision is referred to as a “quantization profile” for the model. That is, the quantization profile for a model may indicate the quantization precision for each operation of the model.
- the profile may indicate that some operations should be quantized such that both the parameters and the activations have a bit-width of eight, while other operations should be quantized such that the parameters have a bit-width of eight and the activations have a bit-width of sixteen.
- Automatic mixed-precision (AMP) techniques may be used to generate such quantization profiles.
- However, AMP techniques generally do not directly optimize the latency itself.
- conventional approaches may seek to optimize (e.g., reduce) proxies such as memory footprint, the latency introduced only by the operations or layers themselves (e.g., ignoring memory access latency, conversion latencies, and the like) , the number of multiply and accumulate (MAC) operations, the number of floating-point operations (FLOPs) , and the like.
- When sequential operations of the model use different quantization precisions, the tensors flowing between the operations are converted accordingly (e.g., from the quantization precision of the first operation to the quantization precision of the second).
- These conversion costs (e.g., latency and compute resources) can be substantial and are generally not captured by such proxies.
- In the present disclosure, techniques are provided to modify or update quantization profiles in order to account for the latencies or other computational resources consumed by converting the tensors between different quantization precisions.
- the modifications are constrained to only use increased precision, rather than introducing decreased precision. That is, when generating the modifications, the system may consider changing a lower precision bit-width or format to a higher precision bit-width or format, but will not change a higher precision bit-width or format to a lower precision bit-width or format. This may ensure that the modified quantization profile is at least as accurate as the original quantization profile. In contrast, if any quantization precisions are reduced for any operations, the resulting model may exhibit reduced accuracy.
- the system uses integer programming (e.g., mixed integer programming (MIP) ) to generate the modifications.
- an objective to minimize execution latency may be defined, as discussed in more detail below, with various constraints to enforce accuracy preservation.
- the machine learning model (s) may be quantized in a way that reduces the inferencing latency during runtime while preserving model accuracy.
- FIG. 1 depicts an example workflow 100 for quantizing machine learning models using reduced latency mixed-precision quantization, according to some aspects of the present disclosure.
- a quantization system 115 accesses a machine learning model 105 and a quantization profile 110 and generates a quantized machine learning model 135.
- “accessing” data may generally include receiving, requesting, retrieving, generating, obtaining, or otherwise gaining access to the data.
- the quantization system 115 may access the machine learning model 105 from one or more other systems (e.g., a training system that trains the machine learning model 105) , or may generate the machine learning model 105 and/or quantization profile 110 locally (e.g., the quantization system 115 may itself train the machine learning model 105 and generate the quantization profile 110) .
- the quantization system 115 may provide relevant information (such as the modified quantization profile 125) to one or more other systems which use the information to generate the quantized machine learning model 135.
- the operations of the quantization system 115 may be combined or distributed across any number of systems, and may be implemented using hardware, software, or a combination of hardware and software.
- the machine learning model 105 is generally representative of a machine learning model that has been trained (e.g., using one or more labeled or unlabeled exemplars) to perform one or more tasks.
- the particular architecture of the machine learning model 105 may vary depending on the particular implementation.
- the machine learning model 105 comprises a neural network-based architecture, such as a feedforward neural network, a convolutional neural network (CNN) , a recursive neural network (RNN) , a multilayer perceptron (MLP) , a long short-term memory (LSTM) model, a transformer-based model, and the like.
- the quantization system 115 accesses (or generates) a quantizer group graph representing the model, as discussed below with reference to FIG. 2.
- the machine learning model 105 is trained such that the model is ready for runtime use in generating inferences or predictions.
- the quantization system 115 (or a dedicated training system) may train the machine learning model 105 until one or more termination criteria are met, such as for a defined number of iterations or epochs, until a defined period of time or computational resources have been spent training, until a defined accuracy preference is reached, and the like.
- the machine learning model 105 is a full-precision model (e.g., having non-quantized parameters, such as weights, that were learned during the training process) .
- the machine learning model 105 comprises a set of operations with corresponding data flows among the operations.
- an “operation” of the machine learning model may generally include any data processing or transformation performed by the machine learning model, such as a layer of a neural network, a convolution operation, application of an activation function, a pooling or normalization operation, a transformer or other attention operation, and the like.
- the quantization profile 110 indicates, for each respective operation of the set of operations of the machine learning model 105, a respective quantization scheme or precision. That is, the quantization profile 110 may indicate, for each operation, what quantization precision should be used.
- the quantization precision of an operation may correspond to the bit-width used to represent the parameters (e.g., weights) of the operation (e.g., parameter bit-widths) , the bit-width used to represent the data tensors that are used as input to the operation and/or used as output from the operation (e.g., activation bit-widths) , and/or the data type (s) used to represent the data tensors that are processed by the operation (e.g., floating point, eight-bit integer, and the like) .
- “activations” or “activation data” may generally include any data that is processed by the model during inferencing (e.g., the input and/or output of one or more operations, or other intermediate tensors generated while processing the input data).
- the quantization profile 110 may indicate that one or more operations should be quantized such that both the parameters of the operation and the tensors processed by the operation have a bit-width of eight (e.g., eight-bit integers) , while one or more other operations should be quantized such that the parameters have a bit-width of eight (e.g., eight-bit integers) and the activations have a bit-width of sixteen (e.g., sixteen-bit integers) .
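- As one non-limiting illustration (the structure, names, and values below are hypothetical and not taken from the disclosure), such a quantization profile could be represented as a simple mapping from each operation to its chosen parameter and activation bit-widths:
```python
# Hypothetical, minimal representation of a mixed-precision quantization profile:
# each operation maps to the bit-widths chosen for its weights and activations.
from dataclasses import dataclass

@dataclass(frozen=True)
class Precision:
    weight_bits: int   # bit-width used to store the operation's parameters
    act_bits: int      # bit-width of the tensors the operation consumes/produces

quantization_profile = {
    "conv1": Precision(weight_bits=8, act_bits=8),    # fully eight-bit operation
    "conv2": Precision(weight_bits=8, act_bits=16),   # eight-bit weights, sixteen-bit activations
    "fc":    Precision(weight_bits=8, act_bits=8),
}

# A conversion is needed wherever adjacent operations disagree on activation bit-width.
edges = [("conv1", "conv2"), ("conv2", "fc")]
conversions = [(u, v) for (u, v) in edges
               if quantization_profile[u].act_bits != quantization_profile[v].act_bits]
print(conversions)  # [('conv1', 'conv2'), ('conv2', 'fc')]
```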
- one or more AMP techniques or algorithms may be used to generate the quantization profile 110.
- the AMP technique (s) generally seek to balance model size (and therefore inferencing latency) with model accuracy.
- the AMP techniques may attempt to find operation (s) that can be quantized with less precision for reduced size and latency without substantially harming accuracy, as compared to operation (s) where less precise quantization results in substantially reduced accuracy.
- the quantization profile 110 is generated based on a defined or preferred maximum accuracy reduction (e.g., where the AMP algorithm attempts to generate a quantization profile 110 that minimizes, or at least substantially reduces, model size without exceeding the maximum accuracy reduction) .
- the quantization system 115 includes a modification component 120 and a quantization component 130. Although depicted as discrete components for conceptual clarity, in aspects, the operations of the depicted components (and others not depicted) may be combined or distributed across any number of components.
- the modification component 120 accesses the machine learning model 105 and the quantization profile 110 and generates a modified quantization profile 125.
- the modified quantization profile 125 may indicate, for each respective operation of the set of operations of the machine learning model 105, a respective quantization scheme or precision.
- the modification component 120 evaluates the machine learning model 105 and the quantization profile 110 to generate a set of modifications (which may be an empty set) to the quantization profile 110. The modification component 120 may then implement these modifications to modify the quantization profile 110 in order to generate the modified quantization profile 125.
- the modification component 120 generates the modification (s) in an effort to reduce the runtime resources consumed by the quantized model (e.g., latency, memory, and the like) .
- the modification component 120 may generate the modifications based on the conversion latency for converting tensors among the various quantization precisions. For example, as discussed above, when sequential operations of the machine learning model 105 use different quantization precisions (as specified in the quantization profile 110) , one or more conversion operations may likely be applied (e.g., to re-quantize the data from the first format to the second) to the data output by the first operation before the data can be used as input to the second operation. These conversions introduce latency that, in some cases, can be substantial. Therefore, in some aspects, the modification component 120 may generate modifications to reduce the number of times that the quantization profile 110 switches between quantization precisions for adjacent operations (e.g., reducing the number of conversions) .
- the modifications generated by the modification component 120 are constrained to refrain from decreasing the quantization precision of any operation. That is, in some aspects, the modification component 120 may determine to increase the precision of an operation (e.g., from eight-bit integer activations to sixteen-bit integer activations) or to leave the precision unchanged, but may not decrease the precision (e.g., from thirty-two-bit floating point to sixteen-bit integer) . By limiting the modifications to only increases in precision, the modification component 120 can ensure that the accuracy of the model is not reduced.
- The initial quantization profile 110, when used to quantize the machine learning model 105, may reduce accuracy of the model (e.g., due to quantization noise caused by the reduced precision of the parameters and/or activations) as compared to the non-quantized machine learning model 105.
- the modification component 120 ensures that the resulting quantized model (generated using the modified quantization profile 125) will be at least as accurate (and possibly more accurate) as the model quantized according to the original quantization profile 110.
- this constraint obviates the use of any accuracy evaluations of the quantized model. That is, the modification component 120 may generate the modifications without evaluating the impact on the model’s accuracy, which substantially reduces the computational complexity of the modification process, as compared to some conventional approaches. For example, many AMP techniques (e.g., those used to generate the quantization profile 110) generally repeatedly use test data to evaluate the model accuracy, while generating the quantization profile 110, in order to ensure the quantization does not reduce accuracy more than a threshold amount.
- the modification component 120 need not access or evaluate test data.
- the quantization system 115 can generate substantially improved models (e.g., the quantized machine learning model 135) using fewer computational resources, while simultaneously ensuring that the generated models have reduced inferencing latency and are at least as accurate as the initially proposed quantized model (e.g., quantized using the initial quantization profile 110) .
- the modification component 120 may generally use a variety of operations or techniques to generate the modifications to the quantization profile 110.
- the modification component 120 may process the quantization profile 110 and/or the machine learning model 105 (e.g., a representation of the architecture, such as a graph representing the operations and data flow, as discussed below in more detail) using one or more heuristic algorithms, search algorithms, machine learning models, and the like.
- the modification component 120 (or another system) may train a machine learning model to evaluate quantization profiles 110 in order to generate a set of modifications that reduce the runtime latency of the model.
- the modification component 120 uses an integer programming (IP) technique or approach, such as a mixed IP (MIP) technique, to generate the modifications.
- the machine learning model 105 is represented using a quantizer group graph, where each node in the graph corresponds to an operation of the machine learning model and each edge in the graph indicates data flow among the operations (e.g., directed edges from the first operation in the sequence through to the last operation in the sequence) .
- the quantizer group graph is denoted as G, having a set of nodes V and edges E.
- In some aspects, the modification component 120 uses integer programming to minimize (or at least reduce) a latency expression subject to a set of constraints. For example, for a quantization profile 110 having two quantization precisions, the modification component 120 may seek to minimize Expression 1 (e.g., an objective of the form Σ_{v ∈ V} [BitOps(v, X)·x_v + BitOps(v, Y)·y_v], where x_v and y_v are binary decision variables indicating whether the operation v is assigned the quantization precision X or Y, respectively), subject to the constraints given in Expressions 2, 3, 4, and 5.
- X and Y are the two quantization precisions
- u and v are respective operations of the machine learning model
- V is the total set of operations (e.g., nodes in the graph)
- E is the data flow among the operations (e.g., the edge set of the graph)
- BitOps(v, X) indicates the operation latency (e.g., the latency of performing the operation v) using the quantization precision X
- BitOps(v, Y) is the operation latency of performing the operation v using the quantization precision Y.
- sol(u) is the quantization precision assigned to the operation u by the quantization profile 110
- f_uv is the conversion latency of converting a tensor between the operations u and v (e.g., based on different quantization precisions)
- γ is a conversion latency hyperparameter (discussed in more detail below)
- C_f is the total conversion latency of the machine learning model 105 when quantized according to the quantization profile 110.
- the quantization precision X is less than the quantization precision Y.
- X may represent quantizing the activations to eight bits, while Y may represent quantizing the activations to sixteen bits.
- For example, the constraints may require that each operation v is assigned exactly one of the two precisions (e.g., x_v + y_v = 1 for all v in V), and that y_v = 1 for each operation v to which the quantization profile 110 assigns the higher precision Y (e.g., such that quantization precision is not decreased for any operation).
- The modification component 120 also causes the conversion latency of any given pair of operations (having an edge connecting the operations) to be considered when generating the modifications. That is, the total conversion latency across the edges E (e.g., the sum of f_uv over each edge (u, v) whose operations are assigned different quantization precisions) may be constrained to be at most γ·C_f.
- f_uv may generally be used to indicate or determine the latency of performing the conversion between the operations u and v.
- a variety of conversion latency formulations may be used.
- the conversion latency is defined dynamically based on the tensor size (s) of the tensor (s) used as output of the operation u and/or input of the operation v (e.g., the tensors that are being converted) .
- f_uv may be proportional (e.g., scaled to a value between zero and one) to the number of elements in the tensor(s) to be converted, where larger tensors (with more elements) have higher conversion latencies than smaller tensors.
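- For illustration only, a sketch of one possible size-based heuristic (the normalization against the largest tensor in the model is an assumption, not a formulation taken from the disclosure):
```python
import math

def conversion_cost(tensor_shape, max_elements):
    """Hypothetical size-proportional conversion latency, scaled to [0, 1]:
    larger tensors (more elements) receive a higher cost."""
    elements = math.prod(tensor_shape)   # e.g., n * c * h * w
    return elements / max_elements
```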
- the conversion latency is defined using a predictor function (e.g., a conversion latency machine learning model) to predict the latency of the conversion. For example, in some cases, additional factors beyond the size of the tensor (s) may impact conversion efficiency. As one example, the tiling strategy of the operation (or of the conversion itself) may affect the latency. In some aspects, tensors may be delineated into smaller sub-tensors (referred to as tiles) to perform operations in parallel on multiple processing units (or on multiple parts of the same unit, such as different threads or cores) . This may generally be referred to as tiling the tensors.
- the conversion may be performed more quickly by tiling a large tensor into a set of sub-tensors, converting each sub-tensor, and aggregating the converted sub-tensors.
- the tiling strategy generally refers to the number of tiles used to delineate the tensor (s) of the given conversion operation.
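- The snippet below is a rough sketch of such a tiled conversion (assumptions: NumPy tensors, a thread pool standing in for multiple processing units, and a simple int8-to-int16 re-quantization; none of these are mandated by the disclosure):
```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def convert_tile(tile, scale):
    """Re-quantize one tile from int8 to int16 under a hypothetical shared scale."""
    return (tile.astype(np.int16) * scale).astype(np.int16)

def tiled_convert(tensor, num_tiles, scale=4):
    """Split the tensor into tiles, convert the tiles in parallel, and reassemble."""
    tiles = np.array_split(tensor, num_tiles)
    with ThreadPoolExecutor(max_workers=num_tiles) as pool:
        converted = list(pool.map(lambda t: convert_tile(t, scale), tiles))
    return np.concatenate(converted)

activations = np.random.randint(-128, 127, size=(1 << 20,), dtype=np.int8)
out = tiled_convert(activations, num_tiles=8)
```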
- f_uv is generated by processing data such as the size of the tensor(s) being converted, and the tiling strategy used to implement the conversion, using the conversion latency model.
- the modification component 120 may train the conversion latency model based on training exemplars indicating the conversion latency of various conversions in one or more model architectures.
- the modification component 120 may determine, for a given set of input parameters (e.g., tensor size and tiling strategy) , the conversion latency of converting the tensor from quantization precision X to precision Y (or vice versa) .
- the conversion latency model is a relatively lightweight regression model (e.g., a linear regression model) .
- the particular operations used to train the conversion latency model may vary depending on the particular implementation.
- the modification component 120 may process the input portion of an exemplar (e.g., the tensor size and tiling strategy) using the model to generate a predicted latency, and this predicted latency can be compared against the label of the exemplar (e.g., the actual conversion latency) to generate a loss. This loss may then be used to update one or more parameters of the regression model (e.g., using backpropagation) .
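- A minimal sketch of fitting such a lightweight predictor is shown below, using an ordinary least-squares linear regression from scikit-learn; the feature choice (tensor size and tile count) follows the description above, while the training data and library choice are illustrative assumptions (the gradient-based update described above would work equally well):
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical exemplars: (tensor size in elements, number of tiles) -> measured latency (ms).
features = np.array([
    [1 << 16, 1], [1 << 16, 4], [1 << 20, 1], [1 << 20, 4], [1 << 22, 8],
], dtype=float)
measured_latency_ms = np.array([0.05, 0.03, 0.80, 0.35, 1.10])

latency_model = LinearRegression().fit(features, measured_latency_ms)

# Predict f_uv for a new conversion given its tensor size and tiling strategy.
predicted_ms = latency_model.predict(np.array([[1 << 21, 4]], dtype=float))[0]
```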
- γ is a hyperparameter.
- The value of γ may generally be set to balance the strength of the conversion latency constraint. For example, in some aspects, γ is given a value between zero and one (inclusive). Assigning a value of zero may result in all operations being assigned to the higher precision Y (e.g., such that the total conversion latency is zero), while a value of one may result in all operations having the same precision as in the initial quantization profile 110.
- The particular value of γ that should be used may vary depending on the particular implementation, architecture, and choice(s) of the designers.
- The modification component 120 may generate multiple sets of modifications to the quantization profile 110 using multiple values for γ. Each set of modifications may then be evaluated to select the best one. For example, the modification component 120 may use each respective set of modifications to generate a respective modified quantization profile 125 (and a resulting respective quantized machine learning model 135). Each quantized model may then be evaluated to determine the latency of processing data using the model. Using such evaluations, the modification component 120 (or another component) can determine which set of modifications resulted in the lowest inferencing latency, and determine to select or use these modifications for the machine learning model 105.
- Using a further constraint (e.g., x_v, y_v ∈ {0, 1}), the modification component 120 ensures that the decision variables x_v and y_v are binary values (e.g., zero or one), facilitating the integer programming process.
- The modification component 120 may therefore seek to find binary values for x_v and y_v that minimize Expression 1, subject to the constraints given in Expressions 2-5.
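- The following is a minimal sketch of this two-precision formulation, assuming the open-source PuLP solver and a simple linearization of the "different precision" indicator on each edge; the function and variable names are hypothetical, and the exact expressions used in the disclosure may differ:
```python
import pulp

def modify_profile(ops, graph_edges, bitops_x, bitops_y, sol, f, gamma):
    """ops: list of operation ids; graph_edges: list of (u, v) data-flow pairs;
    bitops_x/bitops_y: per-operation latency at precisions X and Y; sol: initial
    precision ('X' or 'Y') per operation; f: conversion latency per edge;
    gamma: conversion latency hyperparameter in [0, 1]."""
    prob = pulp.LpProblem("reduce_latency", pulp.LpMinimize)
    x = {v: pulp.LpVariable(f"x_{v}", cat="Binary") for v in ops}  # op v uses precision X
    y = {v: pulp.LpVariable(f"y_{v}", cat="Binary") for v in ops}  # op v uses precision Y
    c = {e: pulp.LpVariable(f"c_{e[0]}_{e[1]}", cat="Binary") for e in graph_edges}

    # Objective: total operation latency under the chosen precisions.
    prob += pulp.lpSum(bitops_x[v] * x[v] + bitops_y[v] * y[v] for v in ops)

    for v in ops:
        prob += x[v] + y[v] == 1          # exactly one precision per operation
        if sol[v] == "Y":
            prob += y[v] == 1             # never decrease precision
    for (u, v) in graph_edges:
        # c_uv must be 1 whenever u and v end up at different precisions.
        prob += c[(u, v)] >= x[u] - x[v]
        prob += c[(u, v)] >= x[v] - x[u]
    # Total conversion latency bounded by gamma * C_f (C_f = latency of the initial profile).
    C_f = sum(f[(u, v)] for (u, v) in graph_edges if sol[u] != sol[v])
    prob += pulp.lpSum(f[e] * c[e] for e in graph_edges) <= gamma * C_f

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {v: "X" if x[v].value() == 1 else "Y" for v in ops}
```
Note how the last constraint mirrors the behavior described above: with gamma equal to zero the solver can keep no conversions (pushing every operation to the higher precision Y), while with gamma equal to one the initial profile remains feasible.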
- a similar integer programming approach can be used to modify quantization profiles 110 having more than two quantization precision candidates.
- u and v are respective operations of the machine learning model
- V is the total set of operations (e.g., nodes in the graph)
- E is the data flow among the operations (e.g., the edge set of the graph)
- BitOps(v, P_i) indicates the operation latency (e.g., the latency of performing the operation v) using the quantization precision P_i (e.g., the i-th quantization precision of the set of candidate precisions).
- d is the number of quantization precision candidates
- x_{v,i} are the decision variables (e.g., variables with values generated by the modification component 120 when performing integer programming). That is, for each operation v in the set of operations V and each precision candidate i, the modification component 120 may generate a corresponding value x_{v,i} indicating whether the i-th precision is used for the operation v.
- L_v is a subset of the plurality of quantization precisions for the operation v such that decreases in quantization precision are disallowed (as discussed in more detail below). Additionally, in some aspects, the conversion latency for an edge is counted only if the activation bit-width of the i-th quantization precision assigned to one operation is different than the activation bit-width of the j-th precision candidate assigned to the other operation, and is otherwise zero.
- f_uv is the conversion latency of converting a tensor between the operations u and v (e.g., based on different quantization precisions)
- γ is the conversion latency hyperparameter (discussed above)
- C_f is the total conversion latency of the machine learning model 105 when quantized according to the quantization profile 110.
- Using one constraint (e.g., Σ_{i=1}^{d} x_{v,i} = 1 for each operation v in V), the modification component 120 ensures that any given operation in the machine learning model 105 is assigned to exactly one quantization precision.
- Using another constraint (e.g., x_{v,i} = 0 for all i in L_v and all v in V), the modification component 120 ensures that x_{v,i} is zero for all precision candidates in L_v for all operations v in V. This ensures that disallowable modifications are not used (e.g., that the modification component 120 does not decrease the precision of any of the operations). For example, in some aspects, L_v is defined such that any defined undesired changes (e.g., from sixteen bit-width quantization to eight bit-width quantization) are prevented. That is, for each operation v, the modification component 120 may determine or define a set L_v that contains all quantization precision candidates that are lower than the quantization precision assigned to the operation v by the initial quantization profile 110. Therefore, using this constraint, the modification component 120 will not assign one of these lower-precision quantization schemes to the operation v.
- As above, the modification component 120 causes the conversion latency of any given pair of operations (having an edge connecting the operations) to be considered when generating the modifications. That is, the total conversion latency across the edges E (e.g., the sum of f_uv over each edge (u, v) whose operations are assigned precisions with different activation bit-widths) may be constrained to be at most γ·C_f.
- The conversion latency f_uv may be defined in various ways, such as using a static and uniform latency value for all conversions, using a dynamic latency determined based on the tensor sizes (e.g., using a fixed mapping or scaling), and/or using a machine learning model to predict the conversion latency based on factors such as the tensor size and/or the tiling strategy.
- The conversion latency hyperparameter γ may be selected by testing various values of γ and evaluating the resulting quantized models to identify the model having the lowest latency.
- Using a further constraint (e.g., x_{v,i} ∈ {0, 1}), the modification component 120 ensures that the decision variables x_{v,i} have binary values (e.g., zero or one) for all operations and precision candidates, facilitating the integer programming process.
- The modification component 120 may therefore seek to find binary values for x_{v,i} that minimize Expression 6 (e.g., an objective of the form Σ_{v ∈ V} Σ_{i=1}^{d} BitOps(v, P_i)·x_{v,i}), subject to the constraints given in Expressions 7-10.
- the particular set of quantization precision candidates that can be evaluated by the quantization system 115 may vary depending on the particular implementation. For example, depending on the hardware support of the device (s) that will use the quantized machine learning model 135 during runtime, the quantization system 115 may restrict the evaluations to those quantization precisions that the device (s) support. In some aspects, the allowable quantization precision candidates may be defined or indicated, such as by a designer or user, when generating the modified quantization profile 125.
- the modification component 120 can then use the generated set of modifications to modify the quantization profile 110 in order to generate the modified quantization profile 125 (e.g., to change the quantization precision of zero or more operations) .
- the modified quantization profile 125 and the machine learning model 105 are then accessed by the quantization component 130.
- the quantization component 130 quantizes the machine learning model 105 based on the modified quantization profile 125 to generate the quantized machine learning model.
- the quantization component 130 may quantize the parameters (e.g., weights) of each operation in the machine learning model 105 to the specific quantization precision (e.g., bit-width, format, and the like) indicated in the modified quantization profile 125 for the operation.
- the quantization component 130 may generate and/or insert conversion operations as applicable (e.g., to convert the tensors that are passed from an operation using a first activation bit-width to an operation using a second activation bit-width) .
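- As a rough, framework-agnostic sketch of these two steps (the function names and symmetric quantization scheme below are hypothetical; a real deployment would use the quantization APIs of the target runtime):
```python
import numpy as np

def quantize_weights(weights, bits):
    """Symmetric integer quantization of a weight tensor to the given bit-width."""
    qmax = (1 << (bits - 1)) - 1
    scale = np.abs(weights).max() / qmax if weights.size else 1.0
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
    dtype = np.int8 if bits <= 8 else np.int16
    return q.astype(dtype), scale

def build_quantized_graph(ops, edges, weights, profile):
    """ops: ordered op names; edges: (u, v) data-flow pairs; weights: dict of float arrays;
    profile: dict of op -> (weight_bits, act_bits). Returns quantized weights plus the
    conversion operations to insert wherever adjacent activation bit-widths differ."""
    quantized = {op: quantize_weights(weights[op], profile[op][0]) for op in ops}
    conversions = [
        {"after": u, "before": v, "from_bits": profile[u][1], "to_bits": profile[v][1]}
        for (u, v) in edges if profile[u][1] != profile[v][1]
    ]
    return quantized, conversions
```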
- the quantized machine learning model 135 may then be deployed for inferencing.
- “deploying” the model may generally include any operations used to prepare or provide the model for runtime use, such as instantiating the model locally, transmitting the quantized model to one or more inferencing systems or devices, and the like.
- the quantized machine learning model 135 may generally exhibit reduced latency with comparable or higher accuracy, as compared to the machine learning model 105 quantized according to the initial quantization profile 110.
- FIG. 2 depicts an example workflow 200 to generate quantizer group graphs to facilitate mixed-precision quantization, according to some aspects of the present disclosure.
- the workflow 200 is performed by a quantization system, such as the quantization system 115 of FIG. 1.
- a machine learning model (e.g., the machine learning model 105 of FIG. 1) can be represented as a directed graph 205.
- each node 210A-F represents an operation performed by the machine learning model (e.g., a convolution operation, a layer, an activation function, and the like)
- the edges (depicted as arrows connecting nodes 210) indicate the data flow of the model.
- each node 210 in the graph 205 is referred to as a quantizer group, where a quantizer group corresponds to a set of linked operations that have the same quantization precision (e.g., a convolution operation followed by a corresponding activation function, such as a rectified linear unit (ReLU) ) .
- the nodes 210 in the graph 205 may be defined as V and the edges may be defined as E.
- the nodes 210 include stippling to indicate which quantization precision is assigned to the node 210 (e.g., by a quantization profile, such as the quantization profile 110 of FIG. 1) .
- the nodes 210B, 210C, 210D, and 210E may use a first quantization precision, while the nodes 210A and 210F use a second quantization precision.
- the data output by the node 210A and consumed by the node 210B, as well as the data output by the node 210E and consumed by the node 210F, may undergo conversion based on the corresponding quantization operations.
- the graph 205 may be modified by grouping the nodes 210 based on the quantization precision assigned to each, forming precision groups 215A and 215B.
- Although the illustrated example depicts precision groups corresponding to two quantization precisions, the quantization system may use any number of quantization precision candidates, as discussed above.
- delineating the nodes 210 into precision groups enables rapid identification of the edge (s) where data conversion may be performed. That is, the quantization system may compute or determine the minimum cut to completely separate the precision groups 215 (e.g., the cut 220) , where each severed edge in the cut 220 corresponds to a conversion.
- the value of the cut 220 indicates the aggregate or total conversion latency of the quantization profile.
- As the quantization profile is modified, the nodes 210 in each precision group 215 may change, resulting in a changed minimum cut 220 and changed conversion latency, as discussed above.
- this graph formulation is used to facilitate the process of modifying the quantization profile (e.g., using machine learning models, integer programming, and the like) .
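- A small sketch of the cut-value computation described above (the graph, precisions, and latencies below are made-up inputs, loosely following the grouping of FIG. 2): the total conversion latency of a profile is simply the sum of f_uv over the edges whose endpoints fall in different precision groups.
```python
def total_conversion_latency(edges, precision_of, f):
    """edges: iterable of (u, v) node pairs; precision_of: dict node -> assigned precision;
    f: dict edge -> conversion latency. Edges crossing precision groups form the cut."""
    cut_edges = [(u, v) for (u, v) in edges if precision_of[u] != precision_of[v]]
    return sum(f[(u, v)] for (u, v) in cut_edges), cut_edges

# Hypothetical graph: nodes A and F in one precision group, B through E in another.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
precision_of = {"A": "int16", "B": "int8", "C": "int8", "D": "int8", "E": "int8", "F": "int16"}
f = {e: 1.0 for e in edges}
latency, cut = total_conversion_latency(edges, precision_of, f)   # latency == 2.0
```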
- FIG. 3 is a flow diagram depicting an example method 300 for reduced latency mixed-precision quantized machine learning models, according to some aspects of the present disclosure.
- the method 300 is performed by a quantization system, such as the quantization system 115 of FIG. 1 and/or the quantization system discussed above with reference to FIG. 2.
- the quantization system accesses a trained machine learning model (e.g., the machine learning model 105 of FIG. 1) .
- the quantization system may itself train the machine learning model, or may access the model from another system (e.g., a dedicated training system) .
- the machine learning model may generally comprise a set or sequence of operations, where each operation may correspond to any data processing transformation or operation, such as a layer of a neural network, a convolution operation, a transformer operation, and the like.
- the machine learning model may be encoded in full (or at least high) precision. That is, in some aspects, the machine learning model is not quantized.
- the quantization system determines a set of quantization precision candidates for the machine learning model.
- the candidates may correspond to the quantization schemes that are supported by the device (s) which will use the quantized machine learning for inferencing.
- The quantization system determines which candidates have hardware support on the inferencing device(s) (e.g., dedicated accelerator hardware that can operate on or use the candidates).
- the candidate quantization precisions are indicated or provided along with the machine learning model (e.g., by the training system or a user) .
- the quantization system determines an initial quantization profile (e.g., the quantization profile 110 of FIG. 1) for the machine learning model.
- the quantization profile generally indicates, for one or more operations of the machine learning model, a quantization precision or scheme.
- the quantization profile may specify the precision (e.g., bit-width and/or data type) of the activations (e.g., data processed by the operation) and/or the parameters (e.g., the weights used to perform the operation) .
- the quantization system accesses or receives the quantization profile from another system (e.g., the training system) .
- the quantization system generates the quantization profile.
- the quantization system (or another system) uses one or more AMP techniques to generate the quantization profile.
- the quantization system determines modification constraints for the quantization profile.
- the constraints may relate to allowable quantization precisions (e.g., indicating that one or more precisions should not be used) , allowable sequences of precision (e.g., sets of precisions that should, or should not, be used for adjacent operations) , allowable precision changes (e.g., indicating that the quantization system should not decrease the precision of any operations, as compared to the initial quantization profile) , and the like.
- the quantization system determines the constraints as defined above in Expressions 2-5 and/or 7-10.
- the quantization system determines the constraints based at least in part on the number of quantization precision candidates determined at block 310 (e.g., using constraints given in Expressions 2-5 when two candidates are used, and constraints given in Expressions 7-10 when three or more candidates are used) .
- the quantization system generates a set of modifications for the initial quantization profile, as discussed above.
- the quantization system processes the initial quantization profile and/or the machine learning model (e.g., a graph comprising nodes and edges representing the sequence of operations) using one or more modification machine learning models to generate the modifications.
- the quantization system may use various techniques such as one or more heuristics-based algorithms, one or more search algorithms, and the like.
- the quantization system may use one or more integer programming operations or techniques to generate the modifications, such as by seeking to minimize, or at least reduce, Expression 1 and/or Expression 6 (using suitable constraints) .
- the quantization system generally generates the modifications based in part on the latency introduced by converting data between precision candidates. That is, to generate the modifications, the quantization system may evaluate the latency that will be caused by converting data whenever two different quantization precisions are used in sequential operations (e.g., between node 210A and node 210B in FIG. 2) .
- the quantization system may use a variety of techniques to determine the conversion costs. For example, in some aspects, the quantization system may use a fixed or uniform latency value for all conversions. In some aspects, the quantization system may determine the conversion latency for each respective pair of operations based on data such as the size of the tensors that are to be converted (e.g., generating a value between zero and one based on the number of elements, where the latency value is directly proportional to the number of elements) . In some aspects, the quantization system may determine the conversion latency based on data such as the tiling strategy used for the conversion (or for the subsequent operation) . For example, the quantization system may process data such as the tiling strategy, the tensor size, and the like using a machine learning model (e.g., a linear regression model) to infer the latency.
- the set of modifications can then be used to generate a modified quantization profile (e.g., the modified quantization profile 125 of FIG. 1) .
- the quantization system outputs a quantized machine learning model (e.g., the quantized machine learning model 135 of FIG. 1) by quantizing the initial machine learning model (accessed at block 305) using the modified quantization profile (generated at block 325) .
- generating the quantized model may generally include quantizing the parameters of each operation as indicated in the modified quantization profile, adding appropriate conversion operations between machine learning model operations (e.g., wherever the quantization precision of sequential operations is different) , and the like.
- the quantized machine learning model may generally have lower runtime latency, as compared to both the original (non-quantized) machine learning model, as well as compared to a version of the machine learning model quantized according to the initial quantization profile. Further, in some aspects, the quantized machine learning model may be at least as accurate as the version of the machine learning model quantized according to the initial quantization profile (and potentially more accurate) . For example, if the quantization system is constrained to refrain from reducing the quantization precision of any operations (e.g., either increasing the precision or leaving the precision unchanged for each operation) , the resulting quantized model will generally exhibit the same (or better) accuracy.
- the quantization system can generate improved machine learning models that are substantially faster during runtime without harming model accuracy.
- the quantized machine learning model may then be deployed for runtime use (e.g., to an inferencing system, by the quantization system itself, and the like) .
- FIG. 4 is a flow diagram depicting an example method 400 for generating and evaluating quantization profiles to reduce machine learning model latency, according to some aspects of the present disclosure.
- the method 400 is performed by a quantization system, such as the quantization system 115 of FIG. 1 and/or the quantization system discussed above with reference to FIGS. 2-3.
- the method 400 provides additional detail for block 325 of FIG. 3.
- The quantization system selects a value for a conversion latency hyperparameter (e.g., γ in Expressions 4 and/or 9).
- the conversion latency hyperparameter may generally be selected to balance the strength or influence of the conversion latency constraint. That is, the conversion latency hyperparameter generally indicates the relative importance of reducing conversion latency as compared to maintaining the initial quantization profile.
- The conversion latency hyperparameter is assigned a value between zero and one (inclusive), where a value of zero causes all operations to be assigned to the higher precision (e.g., such that the total conversion latency is zero), while a value of one causes all operations to have the same precision as in the initial quantization profile (e.g., no changes are made).
- the quantization system can evaluate multiple values for the conversion latency hyperparameter, generating the corresponding mixed quantization precision profiles for each. The quantization system can then decide which value to use for the quantization using a latency evaluation of these profiles on target data (as discussed above) .
- the quantization system may select the value at block 405 using any suitable criteria.
- the quantization system may select the value from a defined range (e.g., between zero and one, or between 0.2 and 0.8) , and may select values within this range according to any suitable criteria (e.g., evaluating N values equally spaced in the range) .
- the hyperparameter range and/or the number (and/or spacing) of values to evaluate may be predefined, or may be specified (e.g., by a user) .
- the quantization system generates a set of modifications for the quantization profile based on the selected value for the conversion latency hyperparameter.
- the hyperparameter may be used as an input to the search algorithm (s) , machine learning model (s) , integer programming constraint (s) , and the like.
- the quantization system quantizes the initial machine learning model (e.g., the model 105 of FIG. 1) using the generated set of modifications, as discussed above.
- the quantization system may modify the initial quantization profile (e.g., the quantization profile 110 of FIG. 1) using the modifications, and quantize the model using the modified quantization profile.
- the quantization system evaluates or determines the inferencing latency of the quantized machine learning model (generated at block 415) .
- the quantization system may process test data (e.g., from the target domain, such as data that is the same or similar to data that will be processed during runtime) using the model, determining the latency (e.g., the time between when input is provided to the model and when the output is available) .
- the quantization system may determine the minimum latency, the maximum latency, the average latency, the median latency, the standard deviation or variance of the latency, and the like.
- the quantization system determines whether at least one additional value for the conversion latency hyperparameter is yet-to-be evaluated. If so, the method 400 returns to block 405 to select a new value. If not, the method 400 continues to block 430.
- Although the illustrated method 400 depicts an iterative process (e.g., selecting and evaluating each value in sequence) for conceptual clarity, in some aspects, the quantization system may evaluate some or all of the alternative values in parallel.
- the quantization system selects a set of modifications for the quantization profile based on the determined latency of each alternative value of the conversion latency hyperparameter. For example, as discussed above, the quantization system may select the value that resulted in the lowest average latency, the smallest latency variance, the value which resulted in the lowest maximum latency, and the like.
- the quantization system may select and provide the corresponding quantized model for inferencing directly (e.g., rather than re-quantizing the model at block 330 of FIG. 3) .
- the quantization system can efficiently find an optimal (or at least improved) balance between conversion latencies and operational latencies, ensuring that the resulting quantized machine learning model performs efficiently and accurately.
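- A compact sketch of this sweep is shown below; the helper functions generate_modifications, quantize, and measure_latency are placeholders standing in for the operations described above, not real APIs:
```python
def select_best_gamma(model, profile, gammas, test_inputs,
                      generate_modifications, quantize, measure_latency):
    """Evaluate each candidate conversion-latency hyperparameter and keep the
    modifications whose quantized model shows the lowest measured latency."""
    best = None
    for gamma in gammas:                                 # e.g., [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
        mods = generate_modifications(model, profile, gamma)
        quantized = quantize(model, profile, mods)
        latency = measure_latency(quantized, test_inputs)   # e.g., average over several runs
        if best is None or latency < best[0]:
            best = (latency, gamma, mods, quantized)
    return best   # (latency, gamma, modifications, quantized model)
```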
- FIG. 5 is a flow diagram depicting an example method 500 for training conversion latency prediction models, according to some aspects of the present disclosure.
- the method 500 is performed by a quantization system, such as the quantization system 115 of FIG. 1 and/or the quantization system discussed above with reference to FIGS. 2-4.
- the quantization system selects a tensor conversion operation from a (quantized) machine learning model architecture. That is, the quantization system identifies a pair of sequential operations that use different quantization precisions for the activation data (e.g., where the output of the first operation is converted to a different precision before being provided as input to the second operation) . Generally, the quantization system may select the conversion using any suitable criteria, as the quantization system may evaluate any number and variety of conversions (from any number and variety of model architectures) to train the conversion latency prediction model.
- the quantization system determines the size of the tensor conversion. That is, the quantization system determines the number of elements of the tensor that is converted by the selected tensor conversion. For example, if the tensor has dimensions (n, c, h, w) , the size of the tensor may be defined as n*c*h*w.
- the quantization system determines the tiling strategy of the conversion. That is, the quantization system may determine whether tiling is used to convert sub-tensors from the tensor in parallel, and/or may determine the number of such tiles or sub-tensors that are processed in parallel.
- the quantization system determines the latency of the selected conversion operation. For example, the quantization system may perform the tensor conversion operation one or more times using one or more tensors and monitor the latency (e.g., to determine the minimum, maximum, average, median, or other latency value) .
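- One way to collect such measurements is an illustrative micro-benchmark like the one below, using NumPy and perf_counter; the specific conversion shown (an int8-to-int16 re-quantization with a hypothetical rescale) is an assumption for illustration:
```python
import time
import numpy as np

def measure_conversion_latency(num_elements, repeats=50):
    """Time an int8 -> int16 re-quantization of a tensor with the given element count
    and report the minimum and median latency in milliseconds."""
    src = np.random.randint(-128, 127, size=(num_elements,), dtype=np.int8)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        _ = (src.astype(np.int16) << 4)   # hypothetical rescale onto the 16-bit grid
        timings.append((time.perf_counter() - start) * 1e3)
    timings.sort()
    return timings[0], timings[len(timings) // 2]

min_ms, median_ms = measure_conversion_latency(1 << 22)
```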
- the quantization system trains a conversion latency machine learning model to predict conversion latency based on the collected data.
- the quantization system may use characteristics such as the tensor size and tiling strategy as input to the model to generate an output prediction, and the output prediction may be compared against the actual latency (determined at block 520) to generate a loss. This loss may then be used to refine the parameter (s) of the model.
- the conversion latency machine learning model comprises a regression model fitted to the conversion latencies.
- the quantization system determines whether one or more termination criteria are met.
- the termination criteria may vary depending on the particular implementation. For example, the quantization system may determine whether additional tensor conversion operations remain to be tested, whether a defined number of epochs or iterations have been performed, whether a defined amount of time and/or resources have been spent training, whether the model accuracy has reached a desired threshold, and the like. If the termination criteria are not met, the method 500 returns to block 505. If the termination criteria are met, the method 500 continues to block 535.
- Although the illustrated example depicts updating the model for each individual sample (e.g., using stochastic gradient descent) for conceptual clarity, in some aspects, the quantization system may update the model using batches of samples (e.g., batch gradient descent).
- the quantization system deploys the conversion latency machine learning model for inferencing.
- the conversion latency machine learning model may be used to predict the conversion latency of various operation sequences (e.g., based on changes in quantization precision) when generating modifications to quantization profiles.
- the conversion latency machine learning model may be used to define the latency constraints, such as using Expressions 4 and/or 9.
- FIG. 6 is a flow diagram depicting an example method 600 for mixed-precision quantization, according to some aspects of the present disclosure.
- the method 600 is performed by a quantization system, such as the quantization system 115 of FIG. 1 and/or the quantization system discussed above with reference to FIGS. 2-5.
- a quantization profile for a machine learning model is accessed, the quantization profile indicating, for each respective operation of a plurality of operations of the machine learning model, a respective quantization precision of a plurality of quantization precisions.
- a first set of modifications is generated for the quantization profile based on conversion latency for converting tensors among the plurality of quantization precisions, wherein each respective modification of the first set of modifications indicates to increase a respective quantization precision of a respective operation of the plurality of operations.
- a modified quantization profile is generated based on modifying the quantization profile using the first set of modifications.
- the machine learning model is quantized in accordance with the modified quantization profile.
- the plurality of quantization precisions comprises one or more of: (i) a set of weight bit-widths, (ii) a set of activation bit-widths, or (iii) a set of data types.
- In some aspects, the first set of modifications is generated based on a first value for a conversion latency hyperparameter.
- In some aspects, the method 600 further comprises generating a second set of modifications for the quantization profile based on a second value for the conversion latency hyperparameter, determining a first latency of the machine learning model quantized according to the quantization profile and the first set of modifications, determining a second latency of the machine learning model quantized according to the quantization profile and the second set of modifications, and selecting the first set of modifications in response to determining that the first latency is less than the second latency.
- the conversion latency comprises a static value for converting the tensors among the plurality of quantization precisions.
- the method 600 further includes determining a respective conversion latency for each respective operation of at least a subset of operations of the plurality of operations based at least in part on a respective tensor size of the respective operation.
- determining the respective conversion latencies comprises processing the respective tensor sizes using a conversion latency machine learning model.
- determining the respective conversion latencies further comprises processing a tiling strategy of the respective operation using the conversion latency machine learning model.
- generating the first set of modifications comprises using integer programming, and the integer programming is constrained to refrain from decreasing quantization precision of any operations of the plurality of operations.
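The overall flow of the method 600 described in the preceding blocks may be illustrated with the following simplified, non-normative sketch. The dict-based profile format, operation names, and helper functions are assumptions for illustration only, not the data structures of the disclosure.

```python
# Hedged sketch of method 600: access a profile, apply increase-only modifications,
# and quantize according to the modified profile.

def apply_modifications(profile, modifications):
    """Each modification only ever raises an operation's precision (bit-width)."""
    modified = dict(profile)
    for op, new_bits in modifications.items():
        assert new_bits >= profile[op], "decreases in precision are disallowed"
        modified[op] = new_bits
    return modified

def quantize_model(model_ops, profile):
    """Stand-in for real quantization: tag each operation with its target bit-width."""
    return [(op, profile[op]) for op in model_ops]

model_ops = ["conv1", "conv2", "fc"]
quantization_profile = {"conv1": 8, "conv2": 8, "fc": 16}   # accessed profile
modifications = {"conv2": 16}     # e.g., chosen to remove an 8-to-16 conversion
modified_profile = apply_modifications(quantization_profile, modifications)
quantized_model = quantize_model(model_ops, modified_profile)
```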
- FIG. 7 depicts an example processing system 700 configured to perform various aspects of the present disclosure, including, for example, the techniques and methods described with respect to FIGS. 1-6.
- the processing system 700 may correspond to a quantization system.
- the processing system 700 may correspond to the quantization system 115 of FIG. 1, and/or the quantization system discussed above with reference to FIGS. 2-6.
- the operations described below with respect to the processing system 700 may be distributed across any number of devices or systems.
- the processing system 700 includes a central processing unit (CPU) 702, which in some examples may be a multi-core CPU. Instructions executed at the CPU 702 may be loaded, for example, from a program memory associated with the CPU 702 or may be loaded from a memory partition (e.g., a partition of a memory 724) .
- the processing system 700 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 704, a digital signal processor (DSP) 706, a neural processing unit (NPU) 708, a multimedia component 710 (e.g., a multimedia processing unit) , and a wireless connectivity component 712.
- An NPU, such as the NPU 708, is generally a specialized circuit configured for implementing the control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs) , deep neural networks (DNNs) , random forests (RFs) , and the like.
- An NPU may sometimes alternatively be referred to as a neural signal processor (NSP) , tensor processing unit (TPU) , neural network processor (NNP) , intelligence processing unit (IPU) , vision processing unit (VPU) , or graph processing unit.
- NPUs, such as the NPU 708, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models.
- a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC) , while in other examples the NPUs may be part of a dedicated neural-network accelerator.
- NPUs may be optimized for training or inference, or in some cases configured to balance performance between both.
- the two tasks may still generally be performed independently.
- NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged) , iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance.
- NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this piece of data through an already trained model to generate a model output (e.g., an inference) .
- the NPU 708 is a part of one or more of the CPU 702, the GPU 704, and/or the DSP 706.
- the wireless connectivity component 712 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., Long-Term Evolution (LTE) ) , fifth generation (5G) connectivity (e.g., New Radio (NR) ) , Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards.
- the wireless connectivity component 712 is further coupled to one or more antennas 714.
- the processing system 700 may also include one or more sensor processing units 716 associated with any manner of sensor, one or more image signal processors (ISPs) 718 associated with any manner of image sensor, and/or a navigation processor 720, which may include satellite-based positioning system components (e.g., GPS or GLONASS) as well as inertial positioning system components.
- the processing system 700 may also include one or more input and/or output devices 722, such as screens, touch-sensitive surfaces (including touch-sensitive displays) , physical buttons, speakers, microphones, and the like.
- one or more of the processors of the processing system 700 may be based on an ARM or RISC-V instruction set.
- the processing system 700 also includes a memory 724, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like.
- the memory 724 includes computer-executable components, which may be executed by one or more of the aforementioned processors of the processing system 700.
- the memory 724 includes a modification component 724A, a quantization component 724B, and a latency component 724C.
- the memory 724 may also include other components, such as an inferencing or generation component to manage the generation of output predictions using trained machine learning models, a training component used to train or update the machine learning model (s) , and the like. Though depicted as discrete components for conceptual clarity in FIG. 7, the illustrated components (and others not depicted) may be collectively or individually implemented in various aspects.
- the memory 724 also includes a set of quantization precisions 724D (e.g., indicating the precision alternatives or candidates that can be used to quantize machine learning models) .
- the quantization precisions 724D may indicate the allowable activation bit-width (s) , the allowable parameter bit-width (s) , the allowable data types or formats (e.g., float, double, integer, and the like) , and the like.
- the memory 724 also includes a set of profile constraints 724E.
- the profile constraints 724E may indicate limits or restrictions on the modification process, such as indicating that the quantization system should not reduce the precision of any operations, as compared to an initial quantization profile for the operations.
- the profile constraints 724E are defined as above in Expressions 2-5 and/or 7-10.
- the memory 724 may also include other data such as training data (e.g., tensor conversion latency data) , model parameters (e.g., for a conversion latency machine learning model) , and the like.
- the processing system 700 further comprises a modification circuit 726, a quantization circuit 727, and a latency circuit 728.
- the depicted circuits, and others not depicted, may be configured to perform various aspects of the techniques described herein.
- the modification component 724A and/or the modification circuit 726 may be used to generate modifications to initial quantization profiles (e.g., to generate the modified quantization profile 125 of FIG. 1) , as discussed above.
- the modification component 724A and/or the modification circuit 726 may use various techniques such as heuristics, search algorithms, integer programming, and the like to generate the modifications based on the specified machine learning model architecture (e.g., the machine learning model 105 of FIG. 1) and an initial quantization profile (e.g., the quantization profile 110 of FIG. 1) .
- the quantization component 724B and/or the quantization circuit 727 may be used to quantize machine learning models (e.g., to generate the quantized machine learning model 135 of FIG. 1) , as discussed above.
- the quantization component 724B and/or the quantization circuit 727 may quantize initial (non-quantized) machine learning models (e.g., the machine learning model 105 of FIG. 1) using modified quantization profile (s) (generated by the modification component 724A and/or the modification circuit 726) to generate the quantized models.
- the latency component 724C and/or the latency circuit 728 may be used to evaluate the runtime latencies of quantized machine learning models, as discussed above. For example, the latency component 724C and/or the latency circuit 728 may evaluate models quantized according to different modified quantization profiles (e.g., generated based on different values for a conversion latency hyperparameter, as discussed above) . Based on these evaluations, the latency component 724C and/or the latency circuit 728 may select which quantized model is to be used for inferencing (e.g., to minimize the average and/or maximum runtime latency) .
- the modification circuit 726, the quantization circuit 727, and the latency circuit 728 may collectively or individually be implemented in other processing devices of the processing system 700, such as within the CPU 702, the GPU 704, the DSP 706, the NPU 708, and the like.
- processing system 700 and/or components thereof may be configured to perform the methods described herein.
- aspects of the processing system 700 may be omitted, such as where the processing system 700 is a server computer or the like.
- the multimedia component 710, the wireless connectivity component 712, the sensor processing units 716, the ISPs 718, and/or the navigation processor 720 may be omitted in other aspects.
- aspects of the processing system 700 may be distributed between multiple devices.
- the processing system 700 may use the quantized model (quantized according to the modified quantization profile) for runtime inferencing.
- the processing system 700 may deploy the quantized model (quantized according to the modified quantization profile) to one or more other systems for runtime inferencing, such as to one or more cloud-based systems, one or more wireless devices (e.g., wearables, smartphones, or edge devices) , and the like. That is, the processing system 700 may quantize the model according to the modified quantization profile, and may then either use the quantized model for inferencing, or may provide the quantized model to one or more other systems for inferencing.
- the processing system 700 may itself generate the initial quantization profile, or may receive the initial quantization profile from another system. Additionally, the processing system 700 may itself train the initial machine learning model, or may receive the trained model from another system. Moreover, the processing system 700 may itself quantize the model based on the modified quantization profile, or may provide the modified quantization profile to another system which performs the quantization. Generally, the operations involved in model training, initial quantization profile generation, modified quantization profile generation, model quantization using the modified quantization profile, and inferencing using the quantized model may be combined or distributed across any number and variety of computing systems.
- a method for machine learning model quantization comprising: accessing a quantization profile for a machine learning model, the quantization profile indicating, for each respective operation of a plurality of operations of the machine learning model, a respective quantization precision of a plurality of quantization precisions; generating a first set of modifications for the quantization profile based on conversion latency for converting tensors among the plurality of quantization precisions, wherein each respective modification of the first set of modifications indicates to increase a respective quantization precision of a respective operation of the plurality of operations; generating a modified quantization profile based on modifying the quantization profile using the first set of modifications; and quantizing the machine learning model in accordance with the modified quantization profile.
- Clause 2 A method according to Clause 1, wherein the plurality of quantization precisions comprises one or more of: (i) a set of weight bit-widths, (ii) a set of activation bit-widths, or (iii) a set of data types.
- Clause 3 A method according to any of Clauses 1-2, wherein the first set of modifications is generated based on a first value for a conversion latency hyperparameter and wherein the method further comprises: generating a second set of modifications for the quantization profile based on a second value for the conversion latency hyperparameter; determining a first latency of the machine learning model quantized according to the quantization profile and the first set of modifications; determining a second latency of the machine learning model quantized according to the quantization profile and the second set of modifications; and selecting the first set of modifications in response to determining that the first latency is less than the second latency.
- Clause 4 A method according to any of Clauses 1-3, wherein the conversion latency comprises a static value for converting the tensors among the plurality of quantization precisions.
- Clause 5 A method according to any of Clauses 1-3, further comprising determining a respective conversion latency for each respective operation of at least a subset of operations of the plurality of operations based at least in part on a respective tensor size of the respective operation.
- Clause 6 A method according to Clause 5, wherein determining the respective conversion latencies comprises processing the respective tensor sizes using a conversion latency machine learning model.
- Clause 7 A method according to Clause 6, wherein determining the respective conversion latencies further comprises processing a tiling strategy of the respective operation using the conversion latency machine learning model.
- Clause 8 A method according to any of Clauses 1-7, wherein generating the first set of modifications comprises using integer programming, and the integer programming is constrained to refrain from decreasing quantization precision of any operations of the plurality of operations.
- Clause 11 A processing system comprising: a memory comprising computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and cause the processing system to perform a method in accordance with any of Clauses 1-10.
- Clause 12 A processing system comprising means for performing a method in accordance with any of Clauses 1-10.
- Clause 13 A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method in accordance with any of Clauses 1-10.
- Clause 14 A computer program product embodied on a computer-readable storage medium comprising code for performing a method in accordance with any of Clauses 1-10.
- an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein.
- the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
- exemplary means “serving as an example, instance, or illustration. ” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
- a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members.
- “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c) .
- determining encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure) , ascertaining, and the like. Also, “determining” may include receiving (e.g., receiving information) , accessing (e.g., accessing data in a memory) , and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.
- the methods disclosed herein comprise one or more steps or actions for achieving the methods.
- the method steps and/or actions may be interchanged with one another without departing from the scope of the claims.
- the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
- the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions.
- the means may include various hardware and/or software component (s) and/or module (s) , including, but not limited to a circuit, an application specific integrated circuit (ASIC) , or processor.
Abstract
Certain aspects of the present disclosure provide techniques and apparatus for improved machine learning. In an example method, a quantization profile for a machine learning model is accessed, the quantization profile indicating, for each respective operation of a plurality of operations of the machine learning model, a respective quantization precision of a plurality of quantization precisions. A set of modifications for the quantization profile is generated based on conversion latency for converting tensors among the plurality of quantization precisions, where each respective modification of the set of modifications indicates to increase a respective quantization precision of a respective operation of the plurality of operations. A modified quantization profile is generated based on modifying the quantization profile using the set of modifications, and the machine learning model is quantized in accordance with the modified quantization profile.
Description
INTRODUCTION
Aspects of the present disclosure relate to machine learning.
A wide variety of machine learning model architectures have proliferated and have been used to provide solutions for a multitude of prediction problems. Though the specific architectures may vary, machine learning models generally rely on a set of model parameters having values that are learned or trained based on training data (which may include labeled data and/or unlabeled data) . In many architectures (e.g., deep learning models) , a large number of such parameters (well into the billions in some cases) are used to provide better utility. Additionally, in many cases, bigger models (e.g., models having more parameters) tend to perform better (e.g., with higher prediction accuracy) and/or tend to be better suited for more complex prediction tasks. However, even comparatively small models generally have a relatively large number of parameters and can have a substantial memory footprint.
Such a large number of parameters inherently incurs a significant memory and/or storage footprint, as well as a similarly vast use of other computing resources. These models similarly often introduce substantial computational latency due to the computing resources relied upon (e.g., due to the number of processing cycles used) . Model size has become particularly problematic in resource-constrained scenarios, where it is desired to deploy a trained model on a device having relatively limited resources (e.g., mobile devices, embedded devices, smart vehicles, and the like) . Some conventional approaches to ameliorate such concerns involve parameter quantization. However, parameter quantization is an approximation-based process, which inherently introduces error into the model.
BRIEF SUMMARY
Certain aspects of the present disclosure provide a processor-implemented method, comprising: accessing a quantization profile for a machine learning model, the quantization profile indicating, for each respective operation of a plurality of operations of the machine learning model, a respective quantization precision of a plurality of quantization precisions; generating a first set of modifications for the quantization profile based on conversion latency for converting tensors among the plurality of quantization precisions, wherein each respective modification of the first set of modifications indicates to increase a respective quantization precision of a respective operation of the plurality of operations; generating a modified quantization profile based on modifying the quantization profile using the first set of modifications; and quantizing the machine learning model in accordance with the modified quantization profile.
Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.
The following description and the related drawings set forth in detail certain illustrative features of one or more aspects.
The appended figures depict certain aspects of the present disclosure and are therefore not to be considered limiting of the scope of this disclosure.
FIG. 1 depicts an example workflow for quantizing machine learning models using reduced latency mixed-precision quantization, according to some aspects of the present disclosure.
FIG. 2 depicts an example workflow to generate quantizer group graphs to facilitate mixed-precision quantization, according to some aspects of the present disclosure.
FIG. 3 is a flow diagram depicting an example method for reduced latency mixed-precision quantized machine learning models, according to some aspects of the present disclosure.
FIG. 4 is a flow diagram depicting an example method for generating and evaluating quantization profiles to reduce machine learning model latency, according to some aspects of the present disclosure.
FIG. 5 is a flow diagram depicting an example method for training conversion latency prediction models, according to some aspects of the present disclosure.
FIG. 6 is a flow diagram depicting an example method for mixed-precision quantization, according to some aspects of the present disclosure.
FIG. 7 depicts an example processing system configured to perform various aspects of the present disclosure.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the drawings. It is contemplated that elements and features of one aspect may be beneficially incorporated in other aspects without further recitation.
Aspects of the present disclosure provide apparatuses, methods, processing systems, and non-transitory computer-readable mediums for providing improved machine learning.
Machine learning model quantization can be used to reduce the memory footprint of the model while simultaneously reducing the computational expense of executing the model. For example, because quantizing the parameters and/or activations (or other data being processed by the model) reduces their size (e.g., their bit-width) , the quantized (smaller) data can generally be processed with reduced memory usage, reduced compute cycles, and the like. Quantization often similarly leads to reduced latency of processing data using the model (e.g., reduced time to generate a prediction when input data is provided) .
In some conventional approaches, model parameters and/or activations are quantized to the same quantization precision (e.g., converting floating-point values with a bit-width of thirty-two to integer values with a bit-width of eight or sixteen) . While lower quantization precision (e.g., smaller bit-widths) results in reduced computational expense (as compared to higher bit-width quantization) , lower quantization precision also often results in lower model accuracy (as compared to higher bit-width quantization) .
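For illustration only (this example is not drawn from the disclosure), the kind of quantization referenced above can be sketched as mapping 32-bit floating-point values to 8-bit integers with a scale factor, which reduces the bit-width at the cost of rounding error:

```python
import numpy as np

def quantize_int8(x):
    # symmetric uniform quantization: scale the float range into [-128, 127]
    max_abs = np.max(np.abs(x))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale)).max()   # approximation (quantization) error
```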
Further, in many cases, the impact on model accuracy that quantization has varies substantially depending on the particular operation or component of the model that is being quantized. That is, some operations (e.g., layers, convolutions, attention mechanisms, or any other operation or transformation performed while processing data using a machine learning model) may suffer substantial accuracy reduction when quantized to a low bit-width, while other operations may result in less (or even no) degradation in accuracy when quantized to the same low bit-width.
Therefore, in many models, a balance is struck between quantization precision and performance degradation. In some solutions, mixed-precision quantization is used to mitigate these concerns. Generally, mixed-precision quantization enables quantizing different operations or portions of the model to different quantization precisions. This can allow for use of less precise quantization (e.g., lower bit-widths) where the reduced precision does not substantially harm model accuracy, while using more precise quantization (e.g., higher bit-widths) where the reduced precision would more substantially harm accuracy. In some aspects, this mix of precision is referred to as a “quantization profile” for the model. That is, the quantization profile for a model may indicate the quantization precision for each operation of the model.
For example, the profile may indicate that some operations should be quantized such that both the parameters and the activations have a bit-width of eight, while other operations should be quantized such that the parameters have a bit-width of eight and the activations have a bit-width of sixteen. There are a variety of automatic mixed-precision (AMP) techniques that can be used to automatically generate such quantization profiles.
However, while such approaches can reduce model size and inferencing latency, AMP techniques generally do not directly optimize the latency itself. For example, conventional approaches may seek to optimize (e.g., reduce) proxies such as memory footprint, the latency introduced only by the operations or layers themselves (e.g., ignoring memory access latency, conversion latencies, and the like) , the number of multiply and accumulate (MAC) operations, the number of floating-point operations (FLOPs) , and the like.
Generally, when adjacent operations of the machine learning model (e.g., adjacent layers) have different quantization precisions, the tensors flowing between the operations are converted accordingly (e.g., from the quantization precision of the first operation to the quantization precision of the second) . These conversion costs (e.g., latency and compute resources) can, in some cases, reduce or even eliminate the latency benefits obtained by quantizing (particularly when many different quantization precisions are used and/or when the precision changes frequently) .
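A toy example of such a conversion (re-quantization) step is sketched below, assuming symmetric integer quantization; the scales, shapes, and helper name are illustrative assumptions. Converting an int8 activation tensor so it can feed a higher-precision operation is exactly the kind of extra pass that adds conversion latency.

```python
import numpy as np

def requantize(q_in, scale_in, out_bits=16):
    real = q_in.astype(np.float32) * scale_in          # dequantize to real values
    qmax = 2 ** (out_bits - 1) - 1
    max_abs = np.max(np.abs(real))
    scale_out = max_abs / qmax if max_abs > 0 else 1.0
    q_out = np.clip(np.round(real / scale_out), -qmax - 1, qmax).astype(np.int16)
    return q_out, scale_out                            # this extra pass is the conversion cost

int8_activations = np.random.randint(-128, 128, size=(2, 8), dtype=np.int8)
int16_activations, s16 = requantize(int8_activations, scale_in=0.05)
```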
In aspects of the present disclosure, techniques are provided to modify or update quantization profiles in order to account for the latencies or other computational resources consumed by converting the tensors between different quantization precisions. In some aspects, the modifications are constrained to only use increased precision, rather than introducing decreased precision. That is, when generating the modifications, the system may consider changing a lower precision bit-width or format to a higher precision bit-width or format, but will not change a higher precision bit-width or format to a lower precision bit-width or format. This may ensure that the modified quantization profile is at least as accurate as the original quantization profile. In contrast, if any quantization precisions are reduced for any operations, the resulting model may exhibit reduced accuracy.
However, even if higher precision operations introduce more latency in some cases, using these higher precisions may enable elimination of one or more conversion operations (or at least switching to more efficient conversion operations) , which may result in overall reduced latency of processing data using the quantized model. In some aspects, the system uses integer programming (e.g., mixed integer programming (MIP) ) to generate the modifications. For example, an objective to minimize execution latency may be defined, as discussed in more detail below, with various constraints to enforce accuracy preservation.
By using aspects of the present disclosure to modify quantization profiles, the machine learning model (s) may be quantized in a way that reduces the inferencing latency during runtime while preserving model accuracy.
Example Workflow for Quantizing Machine Learning Models Using Reduced Latency Mixed-Precision Quantization
FIG. 1 depicts an example workflow 100 for quantizing machine learning models using reduced latency mixed-precision quantization, according to some aspects of the present disclosure.
In the illustrated example, a quantization system 115 accesses a machine learning model 105 and a quantization profile 110 and generates a quantized machine learning model 135. As used herein, “accessing” data may generally include receiving, requesting, retrieving, generating, obtaining, or otherwise gaining access to the data. For example, the quantization system 115 may access the machine learning model 105 from one or more other systems (e.g., a training system that trains the machine learning model 105) , or may generate the machine learning model 105 and/or quantization profile 110 locally (e.g., the quantization system 115 may itself train the machine learning model 105 and generate the quantization profile 110) . Further, although the illustrated example depicts the quantization system 115 generating the quantized machine learning model 135, in some aspects, the quantization system 115 may provide relevant information (such as the modified quantization profile 125) to one or more other systems which use the information to generate the quantized machine learning model 135. Although depicted as a discrete system for conceptual clarity, in aspects, the operations of the quantization system 115 may be combined or distributed across any number of systems, and may be implemented using hardware, software, or a combination of hardware and software.
The machine learning model 105 is generally representative of a machine learning model that has been trained (e.g., using one or more labeled or unlabeled exemplars) to perform one or more tasks. Generally, the particular architecture of the machine learning model 105 may vary depending on the particular implementation. For example, in some aspects, the machine learning model 105 comprises a neural network-based architecture, such as a feedforward neural network, a convolutional neural network (CNN) , a recursive neural network (RNN) , a multilayer perceptron (MLP) , a long short-term memory (LSTM) model, a transformer-based model, and the like. In some aspects, in addition to or instead of receiving the model itself, the quantization system 115 accesses (or generates) a quantizer group graph representing the model, as discussed below with reference to FIG. 2.
In some aspects, the machine learning model 105 is trained such that the model is ready for runtime use in generating inferences or predictions. For example, the quantization system 115 (or a dedicated training system) may train the machine learning model 105 until one or more termination criteria are met, such as for a defined number of iterations or epochs, until a defined period of time or computational resources have been spent training, until a defined accuracy preference is reached, and the like. In some aspects, the machine learning model 105 is a full-precision model (e.g., having non-quantized parameters, such as weights, that were learned during the training process) .
Generally, the machine learning model 105 comprises a set of operations with corresponding data flows among the operations. As used herein, an “operation” of the machine learning model may generally include any data processing or transformation performed by the machine learning model, such as a layer of a neural network, a convolution operation, application of an activation function, a pooling or normalization operation, a transformer or other attention operation, and the like.
In the illustrated example, the quantization profile 110 indicates, for each respective operation of the set of operations of the machine learning model 105, a respective quantization scheme or precision. That is, the quantization profile 110 may indicate, for each operation, what quantization precision should be used. As used herein, the quantization precision of an operation may correspond to the bit-width used to represent the parameters (e.g., weights) of the operation (e.g., parameter bit-widths) , the bit-width used to represent the data tensors that are used as input to the operation and/or used as output from the operation (e.g., activation bit-widths) , and/or the data type (s) used to represent the data tensors that are processed by the operation (e.g., floating point, eight-bit integer, and the like) . As used here, “activations” or “activation data” may generally include any data that is processed by the model during inferencing (e.g., the input and/or output of one or more operations, or intermediate data generated by an operation itself) .
For example, as discussed above, the quantization profile 110 may indicate that one or more operations should be quantized such that both the parameters of the operation and the tensors processed by the operation have a bit-width of eight (e.g., eight-bit integers) , while one or more other operations should be quantized such that the parameters have a bit-width of eight (e.g., eight-bit integers) and the activations have a bit-width of sixteen (e.g., sixteen-bit integers) .
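A minimal, assumed representation of such a quantization profile is sketched below: each operation maps to a precision composed of a parameter (weight) bit-width, an activation bit-width, and a data type. The field and operation names are illustrative and not prescribed by the disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Precision:
    weight_bits: int        # parameter (weight) bit-width
    activation_bits: int    # activation/tensor bit-width
    dtype: str              # e.g., "int" or "float"

quantization_profile = {
    "conv1": Precision(weight_bits=8, activation_bits=8, dtype="int"),
    "attention1": Precision(weight_bits=8, activation_bits=16, dtype="int"),
    "output_layer": Precision(weight_bits=16, activation_bits=16, dtype="int"),
}
```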
In some aspects, as discussed above, one or more AMP techniques or algorithms may be used to generate the quantization profile 110. In some aspects, the AMP technique (s) generally seek to balance model size (and therefore inferencing latency) with model accuracy. For example, the AMP techniques may attempt to find operation (s) that can be quantized with less precision for reduced size and latency without substantially harming accuracy, as compared to operation (s) where less precise quantization results in substantially reduced accuracy. In some aspects, the quantization profile 110 is generated based on a defined or preferred maximum accuracy reduction (e.g., where the AMP algorithm attempts to generate a quantization profile 110 that minimizes, or at least substantially reduces, model size without exceeding the maximum accuracy reduction) .
In the illustrated example, the quantization system 115 includes a modification component 120 and a quantization component 130. Although depicted as discrete components for conceptual clarity, in aspects, the operations of the depicted components (and others not depicted) may be combined or distributed across any number of components. As illustrated, the modification component 120 accesses the machine learning model 105 and the quantization profile 110 and generates a modified quantization profile 125.
In a similar manner to the quantization profile 110, the modified quantization profile 125 may indicate, for each respective operation of the set of operations of the machine learning model 105, a respective quantization scheme or precision. In the illustrated example, the modification component 120 evaluates the machine learning model 105 and the quantization profile 110 to generate a set of modifications (which may be an empty set) to the quantization profile 110. The modification component 120 may then implement these modifications to modify the quantization profile 110 in order to generate the modified quantization profile 125.
Generally, the modification component 120 generates the modification (s) in an effort to reduce the runtime resources consumed by the quantized model (e.g., latency, memory, and the like) . In some aspects, the modification component 120 may generate the modifications based on the conversion latency for converting tensors among the various quantization precisions. For example, as discussed above, when sequential operations of the machine learning model 105 use different quantization precisions (as specified in the quantization profile 110) , one or more conversion operations may likely be applied (e.g., to re-quantize the data from the first format to the second) to the data output by the first operation before the data can be used as input to the second operation. These conversions introduce latency that, in some cases, can be substantial. Therefore, in some aspects, the modification component 120 may generate modifications to reduce the number of times that the quantization profile 110 switches between quantization precisions for adjacent operations (e.g., reducing the number of conversions) .
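The idea of counting (and costing) precision switches between adjacent operations can be illustrated with the following small sketch. The edge list, bit-width map, and uniform per-conversion latency are assumptions introduced only for this example.

```python
def conversion_cost(edges, activation_bits, per_conversion_latency=1.0):
    """edges: iterable of (producer_op, consumer_op); activation_bits: op -> bits."""
    cost = 0.0
    for u, v in edges:
        if activation_bits[u] != activation_bits[v]:
            cost += per_conversion_latency      # a conversion op would be inserted here
    return cost

edges = [("conv1", "conv2"), ("conv2", "fc")]
bits = {"conv1": 8, "conv2": 16, "fc": 16}
total = conversion_cost(edges, bits)            # 1.0: one 8-to-16 boundary
```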
In some aspects, the modifications generated by the modification component 120 are constrained to refrain from decreasing the quantization precision of any operation. That is, in some aspects, the modification component 120 may determine to increase the precision of an operation (e.g., from eight-bit integer activations to sixteen-bit integer activations) or to leave the precision unchanged, but may not decrease the precision (e.g., from thirty-two-bit floating point to sixteen-bit integer) . By limiting the modifications to only increases in precision, the modification component 120 can ensure that the accuracy of the model is not reduced.
That is, as discussed above, the initial quantization profile 110, when used to quantize the machine learning model 105, may reduce accuracy of the model (e.g., due to quantization noise caused by the reduced precision of the parameters and/or activations) as compared to the non-quantized machine learning model 105. By restricting the modifications to only increase (or leave unchanged) the quantization precision of any given operation, the modification component 120 ensures that the resulting quantized model (generated using the modified quantization profile 125) will be at least as accurate (and possibly more accurate) as the model quantized according to the original quantization profile 110.
Advantageously, in addition to preserving model accuracy, this constraint obviates the use of any accuracy evaluations of the quantized model. That is, the modification component 120 may generate the modifications without evaluating the impact on the model’s accuracy, which substantially reduces the computational complexity of the modification process, as compared to some conventional approaches. For example, many AMP techniques (e.g., those used to generate the quantization profile 110) generally repeatedly use test data to evaluate the model accuracy, while generating the quantization profile 110, in order to ensure the quantization does not reduce accuracy more than a threshold amount.
By using the increased precision constraint, in contrast, the modification component 120 need not access or evaluate test data. In this way, the quantization system 115 can generate substantially improved models (e.g., the quantized machine learning model 135) using fewer computational resources, while simultaneously ensuring that the generated models have reduced inferencing latency and are at least as accurate as the initially proposed quantized model (e.g., quantized using the initial quantization profile 110) .
The modification component 120 may generally use a variety of operations or techniques to generate the modifications to the quantization profile 110. For example, in some aspects, the modification component 120 may process the quantization profile 110 and/or the machine learning model 105 (e.g., a representation of the architecture, such as a graph representing the operations and data flow, as discussed below in more detail) using one or more heuristic algorithms, search algorithms, machine learning models, and the like. For example, the modification component 120 (or another system) may train a machine learning model to evaluate quantization profiles 110 in order to generate a set of modifications that reduce the runtime latency of the model.
In some aspects, the modification component 120 uses an integer programming (IP) technique or approach, such as a mixed IP (MIP) technique, to generate the modifications. In some aspects, as discussed below with reference to FIG. 2, the machine learning model 105 is represented using a quantizer group graph, where each node in the graph corresponds to an operation of the machine learning model and each edge in the graph indicates data flow among the operations (e.g., directed edges from the first operation in the sequence through to the last operation in the sequence) . In some aspects the quantizer group graph is denoted as G, having a set of nodes V and edges E.
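A hedged sketch of such a quantizer group graph G = (V, E), using a plain adjacency representation, is shown below. Nodes are operations and directed edges follow the tensor data flow from producers to consumers; the example topology is invented for illustration.

```python
nodes = ["input_quant", "conv1", "relu1", "conv2", "fc"]            # V
edges = [("input_quant", "conv1"), ("conv1", "relu1"),
         ("relu1", "conv2"), ("conv2", "fc")]                       # E

successors = {v: [] for v in nodes}
for u, v in edges:
    successors[u].append(v)     # directed edge u -> v (data flows from u to v)
```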
In some aspects, the modification component 120 uses integer programming to minimize (or at least reduce) a latency expression based on a set of constraints. For example, Expression 1 below (constrained by the constraints given in Expressions 2, 3, 4, and 5) may be used to generate modifications for a quantization profile 110 having two quantization precisions, where the modification component 120 seeks to minimize Expression 1. In Expressions 1-5, X and Y are the two quantization precisions, u and v are respective operations of the machine learning model, V is the total set of operations (e.g., nodes in the graph), E is the data flow among the operations (e.g., the edge set of the graph), BitOps(v, X) indicates the operation latency (e.g., the latency of performing) of the operation v using the quantization precision X, and BitOps(v, Y) is the operation latency of performing the operation v using the quantization precision Y. Further, in Expressions 1-5, x_v and y_v are the decision variables (e.g., variables with values generated by the modification component 120 when performing integer programming), where an assignment of x_v = 1 indicates that the operation v is assigned the quantization precision X and an assignment of y_v = 1 indicates that the operation v is assigned the quantization precision Y. That is, for each operation v in the set of operations V, the modification component 120 may generate corresponding values x_v and y_v to indicate which precision to use. Further, in Expressions 1-5, sol(u) is the quantization precision assigned to the operation u by the quantization profile 110, f_uv is the conversion latency of converting a tensor between the operations u and v (e.g., based on different quantization precisions), α is a conversion latency hyperparameter (discussed in more detail below), and C_f is the total conversion latency of the machine learning model 105, when quantized according to the quantization profile 110.
$$\sum_{v \in V} \bigl( \mathrm{BitOps}(v, X)\, x_v + \mathrm{BitOps}(v, Y)\, y_v \bigr) \qquad (1)$$
As defined in Expression 1, the modification component 120 may attempt to minimize the sum of the operation latencies of operations quantized according to the first quantization precision and operations quantized according to the second quantization precision. Using the constraints defined below in Expressions 2-5, the modification component 120 evaluates the conversion latency when generating values for x_v and y_v for all v in V.
$$x_v + y_v = 1, \quad \forall v \in V \qquad (2)$$
By using Expression 2 to constrain the integer programming (e.g., the process of minimizing, or at least reducing, Expression 1), the modification component 120 may ensure that any given operation in the machine learning model 105 is assigned to exactly one quantization precision (e.g., either the precision X by assigning values x_v = 1 and y_v = 0, or the precision Y by assigning values x_v = 0 and y_v = 1). In some aspects, the quantization precision X is less than the quantization precision Y. For example, X may represent quantizing the activations to eight bits, while Y may represent quantizing the activations to sixteen bits.
$$y_v = 1, \quad \forall v \in V \text{ such that } \mathrm{sol}(v) = Y \qquad (3)$$
By using Expression 3 to constrain the integer programming (e.g., the process of minimizing, or at least reducing, Expression 1), the modification component 120 ensures that y_v = 1 (e.g., that the operation v is assigned the higher quantization precision Y) for any operations that were assigned the higher precision Y by the initial quantization profile 110. That is, Expression 3 ensures that the modification component 120 does not decrease the quantization precision of any operations. Therefore, the only operations for which the modification component 120 may modify the quantization precision are those assigned to quantization precision X (e.g., the modification component 120 may only increase or leave unchanged the precision of each operation) .
$$\sum_{(u, v) \in E} |x_u - x_v|\, f_{uv} \leq \alpha C_f \qquad (4)$$
By using Expression 4 to constrain the integer programming (e.g., the process of minimizing, or at least reducing, Expression 1), the modification component 120 causes the conversion latency of any given pair of operations (having an edge connecting the operations) to be considered when generating the modifications. That is, |x_u - x_v| evaluates to zero if the operations u and v have the same quantization precision, and to 1 if the operations u and v have different quantization precisions.

In Expression 4, f_uv may generally be used to indicate or determine the latency of performing the conversion between operations u and v. In aspects, a variety of conversion latency formulations may be used. For example, in some aspects, the conversion latency is defined using a fixed uniform cost for all data conversions (e.g., f_uv = 1 for all (u, v)). As another example, in some aspects, the conversion latency is defined dynamically based on the tensor size (s) of the tensor (s) used as output of the operation u and/or input of the operation v (e.g., the tensors that are being converted). For example, f_uv may be proportional (e.g., scaled to a value between zero and one) to the number of elements in the tensor (s) to be converted, where larger tensors (with more elements) have higher conversion latencies than smaller tensors.
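One possible (assumed) realization of such a size-proportional conversion latency is sketched below: the element count of the converted tensor is scaled into [0, 1] relative to the largest tensor exchanged anywhere in the model. The specific scaling is an assumption, not a prescription of the disclosure.

```python
def size_proportional_fuv(num_elements, max_elements_in_model):
    # larger tensors (more elements) receive a larger conversion latency f_uv
    return num_elements / max_elements_in_model

f_uv = size_proportional_fuv(num_elements=224 * 224 * 64,
                             max_elements_in_model=224 * 224 * 128)   # 0.5
```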
In some aspects, the conversion latency is defined using a predictor function (e.g., a conversion latency machine learning model) to predict the latency of the conversion. For example, in some cases, additional factors beyond the size of the tensor (s) may impact conversion efficiency. As one example, the tiling strategy of the operation (or of the conversion itself) may affect the latency. In some aspects, tensors may be delineated into smaller sub-tensors (referred to as tiles) to perform operations in parallel on multiple processing units (or on multiple parts of the same unit, such as different threads or cores) . This may generally be referred to as tiling the tensors. For example, the conversion may be performed more quickly by tiling a large tensor into a set of sub-tensors, converting each sub-tensor, and aggregating the converted sub-tensors. As used herein, the tiling strategy generally refers to the number of tiles used to delineate the tensor (s) of the given conversion operation.
In some aspects, therefore, f_uv is generated by processing data such as the size of the tensor (s) being converted, and the tiling strategy used to implement the conversion, using the conversion latency model. For example, in some aspects, the modification component 120 (or another system) may train the conversion latency model based on training exemplars indicating the conversion latency of various conversions in one or more model architectures. For example, the modification component 120 (or another component or system) may determine, for a given set of input parameters (e.g., tensor size and tiling strategy) , the conversion latency of converting the tensor from quantization precision X to precision Y (or vice versa) . In some aspects, the conversion latency model is a relatively lightweight regression model (e.g., a linear regression model) .
The particular operations used to train the conversion latency model may vary depending on the particular implementation. For example, for a regression model, the modification component 120 may process the input portion of an exemplar (e.g., the tensor size and tiling strategy) using the model to generate a predicted latency, and this predicted latency can be compared against the label of the exemplar (e.g., the actual conversion latency) to generate a loss. This loss may then be used to update one or more parameters of the regression model (e.g., using backpropagation) .
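A minimal sketch of such a lightweight conversion latency model, assuming an offline-fitted linear regression over (tensor size, number of tiles) features, is shown below. The feature set, measurements, and use of scikit-learn are illustrative assumptions; the fitted model would then supply f_uv values for the constraint above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[65_536, 4], [262_144, 8], [1_048_576, 16]])   # (tensor size, tiles)
y = np.array([0.9, 3.1, 11.5])                               # measured conversion latencies (ms)

latency_model = LinearRegression().fit(X, y)                 # regression fitted to latencies
predicted_fuv = latency_model.predict([[524_288, 8]])[0]     # predicted latency for a new conversion
```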
In Expression 4, α is a hyperparameter. The value of α may generally be set to balance the strength of the conversion latency constraint. For example, in some aspects, α is given a value between zero and one (inclusive) . Assigning a value of zero may result in all operations being assigned to the higher precision Y (e.g., such that the total conversion latency is zero) , while a value of one may result in all operations having the same precision as in the initial quantization profile 110.
In some aspects, the particular value of α that should be used may vary depending on the particular implementation, architecture, and choice (s) of the designers. In some aspects, the modification component 120 may generate multiple sets of modifications to the quantization profile 110 using multiple values for α. Each set of modifications may then be evaluated to select the best one. For example, the modification component 120 may use each respective set of modifications to generate a respective modified quantization profile 125 (and a resulting respective quantized machine learning model 135) . Each quantized model may then be evaluated to determine the latency of processing data using the model. Using such evaluations, the modification component 120 (or another component) can determine which set of modifications resulted in the lowest inferencing latency, and determine to select or use these modifications for the machine learning model 105.
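The hyperparameter sweep just described can be sketched as follows. The helper callables stand in for the profile-modification, quantization, and latency-measurement steps discussed above; their names and the trivial stand-ins in the usage example are assumptions for illustration.

```python
def sweep_alpha(alphas, generate_modifications, quantize, measure_latency):
    best = None
    for alpha in alphas:
        mods = generate_modifications(alpha)       # e.g., solve the integer program with this alpha
        model = quantize(mods)                     # quantize according to the resulting profile
        latency = measure_latency(model)           # runtime evaluation on the target hardware
        if best is None or latency < best[1]:
            best = (alpha, latency, mods)
    return best

# Example usage with trivial stand-ins (illustrative only):
best_alpha, best_latency, best_mods = sweep_alpha(
    alphas=[0.0, 0.25, 0.5, 0.75, 1.0],
    generate_modifications=lambda a: {"alpha": a},
    quantize=lambda mods: mods,
    measure_latency=lambda model: abs(model["alpha"] - 0.5),   # pretend 0.5 is fastest
)
```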
$$x_v, y_v \in \{0, 1\}, \quad \forall v \in V \qquad (5)$$
By using Expression 5 to constrain the integer programming (e.g., the process of minimizing, or at least reducing, Expression 1), the modification component 120 ensures that the decision variables x_v and y_v are binary values (e.g., zero or one), facilitating the integer programming process.
In some aspects, therefore, the modification component 120 may seek to find binary values for x_v and y_v that minimize Expression 1, using the constraints given in Expressions 2-5.
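A hedged sketch of this two-precision integer program, written against the open-source PuLP solver, is shown below. The toy graph, BitOps latencies, conversion latencies f_uv, and α value are invented for illustration, and the disclosure does not prescribe a particular solver; |x_u - x_v| is linearized with an auxiliary binary variable.

```python
import pulp

ops = ["conv1", "conv2", "fc"]                               # V (toy example)
edges = [("conv1", "conv2"), ("conv2", "fc")]                # E
bitops = {"conv1": {"X": 10.0, "Y": 18.0},                   # BitOps(v, X), BitOps(v, Y)
          "conv2": {"X": 12.0, "Y": 20.0},
          "fc":    {"X": 4.0,  "Y": 7.0}}
sol = {"conv1": "X", "conv2": "Y", "fc": "X"}                # initial quantization profile
f_uv = {("conv1", "conv2"): 5.0, ("conv2", "fc"): 5.0}       # conversion latencies
alpha, C_f = 0.5, 10.0                                       # hyperparameter, total conversion latency

prob = pulp.LpProblem("precision_assignment", pulp.LpMinimize)
x = {v: pulp.LpVariable(f"x_{v}", cat="Binary") for v in ops}     # Expression 5
y = {v: pulp.LpVariable(f"y_{v}", cat="Binary") for v in ops}
diff = {e: pulp.LpVariable(f"d_{e[0]}_{e[1]}", cat="Binary") for e in edges}

# Expression 1: minimize total operation latency.
prob += pulp.lpSum(bitops[v]["X"] * x[v] + bitops[v]["Y"] * y[v] for v in ops)
for v in ops:
    prob += x[v] + y[v] == 1                                 # Expression 2
    if sol[v] == "Y":
        prob += y[v] == 1                                    # Expression 3 (no precision decrease)
# Expression 4, with |x_u - x_v| linearized through the binary variable diff[(u, v)].
for (u, v) in edges:
    prob += diff[(u, v)] >= x[u] - x[v]
    prob += diff[(u, v)] >= x[v] - x[u]
prob += pulp.lpSum(f_uv[e] * diff[e] for e in edges) <= alpha * C_f

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assignment = {v: ("X" if x[v].value() == 1 else "Y") for v in ops}
```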
In some aspects, without loss of generality, a similar integer programming approach can be used to modify quantization profiles 110 having more than two quantization precision candidates. For example, Expression 6 below (constrained by the constraints given in Expressions 7, 8, 9, and 10) may be used to generate modifications for a quantization profile 110 having d quantization precisions, where the modification component 120 seeks to minimize Expression 6. That is, the set of quantization precision candidates may be defined as P, and the modification component 120 may seek to find or generate a value for x_{v, i}, where an assignment of x_{v, i} = 1 indicates that operation v is assigned to the precision candidate P_i, where i ∈ {1, 2, …, d-1, d}.

In Expressions 6-10, u and v are respective operations of the machine learning model, V is the total set of operations (e.g., nodes in the graph), E is the data flow among the operations (e.g., the edge set of the graph), and BitOps(v, P_i) indicates the operation latency (e.g., the latency of performing) of the operation v using the quantization precision P_i (e.g., the i-th quantization precision of the set of candidate precisions). Further, in Expressions 6-10, d is the number of quantization precision candidates, and x_{v, i} is the decision variable (e.g., a variable with a value generated by the modification component 120 when performing integer programming). That is, for each operation v in the set of operations V, the modification component 120 may generate a corresponding value x_{v, i} to indicate which precision to use.
Further, in Expressions 6-10, L_v is a subset of the plurality of quantization precisions such that decreases in quantization precision are disallowed (as discussed in more detail below), and the per-edge conversion cost between a precision candidate P_i and a precision candidate P_j is nonzero only if the activation bit-width of the i-th quantization precision is different than the activation bit-width of the j-th precision candidate, and is otherwise zero. Additionally, f_uv is the conversion latency of converting a tensor between the operations u and v (e.g., based on different quantization precisions), α is the conversion latency hyperparameter (discussed above), and C_f is the total conversion latency of the machine learning model 105, when quantized according to the quantization profile 110.
$$\sum_{v \in V} \sum_{i \in [d]} \mathrm{BitOps}(v, P_i)\, x_{v, i} \qquad (6)$$
As defined in Expression 6, the modification component 120 may attempt to minimize the sum of the operation latencies of operations quantized according to the various quantization precisions. Using the constraints defined below in Expressions 7-10, the modification component 120 evaluates the conversion latency when generating values for x_{v, i} for all v in V and all i in [d] .
$$\sum_{i \in [d]} x_{v, i} = 1, \quad \forall v \in V \qquad (7)$$
By using Expression 7 to constrain the integer programming (e.g., the process of minimizing, or at least reducing, Expression 6), the modification component 120 ensures that any given operation in the machine learning model 105 is assigned to exactly one quantization precision.
$$x_{v, i} = 0, \quad \forall v \in V, \; \forall P_i \in L_v \qquad (8)$$
By using Expression 8 to constrain the integer programming (e.g., the process of minimizing, or at least reducing, Expression 6), the modification component 120 ensures that x_{v, i} is zero for all precision candidates in L_v, for all operations v in V. This ensures that disallowed modifications are not used (e.g., that the modification component 120 does not decrease the precision of any of the operations). For example, in some aspects, L_v may be defined such that any defined undesired changes (e.g., from sixteen bit-width quantization to eight bit-width quantization) are prevented. That is, for each operation v, the modification component 120 may determine or define a set L_v that contains all quantization precision candidates that are lower than the quantization precision assigned to the operation v by the initial quantization profile 110. Therefore, using this constraint, the modification component 120 will not assign one of these lower precision quantization schemes to the operation v.
By using Expression 9 to constrain the integer programming (e.g., the process of minimizing, or at least reducing, Expression 6) , the modification component 120 causes the conversion latency of any given pair of operations (having an edge connecting the operations) to be considered when generating the modifications. That is, |xu, i-xv, j| evaluates to zero if the operations u and v have the same quantization precision, and to one if the operations have different quantization precisions.
In some aspects, as discussed above, the conversion latency fuv may be defined in various ways, such as using a static and uniform latency value for all conversions, using a dynamic latency determined based on the tensor sizes (e.g., using a fixed mapping or scaling) , and/or using a machine learning model to predict the conversion latency based on factors such as the tensor size and/or the tiling strategy. Further, in some aspects, as discussed above, the conversion latency hyperparameter α may be selected by testing various values of α and evaluating the resulting quantized models to identify the model having the lowest latency.
xv, i∈ {0, 1} , for all v∈V and all i∈ [d] (10)
By using Expression 10 to constrain the integer programming (e.g., the process of minimizing, or at least reducing, Expression 6) , the modification component 120 ensures that the decision variables xv, i have binary values (e.g., zero or one) for all operations, facilitating the integer programming process.
In some aspects, therefore, the modification component 120 may seek to find binary values for xv, i, for all operations v and all precision candidates i, that minimize Expression 6, using the constraints given in Expressions 7-10.
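For illustration only, the following Python sketch enumerates candidate precision assignments for a small, hypothetical operation graph and selects the assignment that minimizes an objective analogous to Expression 6, subject to constraints analogous to Expressions 7-10 (no precision decrease and a bound on total conversion latency) . The operation names, latency values, and the brute-force search used in place of a dedicated integer programming solver are assumptions made for this sketch and are not part of the formulation above.

from itertools import product

ops = ["conv1", "conv2", "fc"]
edges = [("conv1", "conv2"), ("conv2", "fc")]
bitops = {"conv1": [4.0, 9.0], "conv2": [6.0, 13.0], "fc": [2.0, 5.0]}  # BitOps(v, P_i)
initial = {"conv1": 1, "conv2": 0, "fc": 1}  # initial profile: precision index per operation
conv_latency = {("conv1", "conv2"): 3.0, ("conv2", "fc"): 3.0}  # f_uv per edge
alpha = 0.5

# Total conversion latency of the initial quantization profile (C_f).
c_f = sum(conv_latency[e] for e in edges if initial[e[0]] != initial[e[1]])

def feasible(assign):
    # Analogue of Expression 8: never decrease precision relative to the initial profile.
    if any(assign[v] < initial[v] for v in ops):
        return False
    # Analogue of Expression 9: bound the total conversion latency by alpha * C_f.
    conv = sum(conv_latency[e] for e in edges if assign[e[0]] != assign[e[1]])
    return conv <= alpha * c_f

def objective(assign):
    # Analogue of Expression 6: sum of operation latencies at the chosen precisions.
    return sum(bitops[v][assign[v]] for v in ops)

# Expressions 7 and 10 hold by construction: each operation receives exactly one
# precision index, standing in for the binary decision variables x_{v,i}.
candidates = (dict(zip(ops, choice)) for choice in product(range(2), repeat=len(ops)))
best = min((a for a in candidates if feasible(a)), key=objective)
print(best, objective(best))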
In some aspects, the particular set of quantization precision candidates that can be evaluated by the quantization system 115 may vary depending on the particular implementation. For example, depending on the hardware support of the device (s) that will use the quantized machine learning model 135 during runtime, the quantization system 115 may restrict the evaluations to those quantization precisions that the device (s) support. In some aspects, the allowable quantization precision candidates may be defined or indicated, such as by a designer or user, when generating the modified quantization profile 125.
As discussed above, the modification component 120 can then use the generated set of modifications to modify the quantization profile 110 in order to generate the modified quantization profile 125 (e.g., to change the quantization precision of zero or more operations) .
As illustrated, the modified quantization profile 125 and the machine learning model 105 are then accessed by the quantization component 130. The quantization component 130 quantizes the machine learning model 105 based on the modified quantization profile 125 to generate the quantized machine learning model. For example, the quantization component 130 may quantize the parameters (e.g., weights) of each operation in the machine learning model 105 to the specific quantization precision (e.g., bit-width, format, and the like) indicated in the modified quantization profile 125 for the operation. Further, the quantization component 130 may generate and/or insert conversion operations as applicable (e.g., to convert the tensors that are passed from an operation using a first activation bit-width to an operation using a second activation bit-width) .
The quantized machine learning model 135 may then be deployed for inferencing. As used herein, “deploying” the model may generally include any operations used to prepare or provide the model for runtime use, such as instantiating the model locally, transmitting the quantized model to one or more inferencing systems or devices, and the like. As discussed above, using the workflow 100, the quantized machine learning model 135 may generally exhibit reduced latency with comparable or higher accuracy, as compared to the machine learning model 105 quantized according to the initial quantization profile 110.
Example Workflow to Generate Quantizer Group Graphs to Facilitate Mixed-Precision Quantization
FIG. 2 depicts an example workflow 200 to generate quantizer group graphs to facilitate mixed-precision quantization, according to some aspects of the present disclosure. In some aspects, the workflow 200 is performed by a quantization system, such as the quantization system 115 of FIG. 1.
As illustrated, a machine learning model (e.g., the machine learning model 105 of FIG. 1) can be represented as a directed graph 205. Specifically, in the graph 205, each node 210A-F (collectively, the nodes 210) represents an operation performed by the
machine learning model (e.g., a convolution operation, a layer, an activation function, and the like) , and the edges (depicted as arrows connecting nodes 210) indicate the data flow of the model. For example, in the illustrated graph 205, the output of the node 210A (corresponding to a first operation in the model) is used as input to the node 210B (corresponding to a second operation) , the output of the node 210B is used as input to nodes 210C and 210E (corresponding to third and fourth operations, respectively) , and so on. In some aspects, each node 210 in the graph 205 is referred to as a quantizer group, where a quantizer group corresponds to a set of linked operations that have the same quantization precision (e.g., a convolution operation followed by a corresponding activation function, such as a rectified linear unit (ReLU) ) .
In some aspects, as discussed above, the nodes 210 in the graph 205 may be defined as V and the edges may be defined as E. In the illustrated example, the nodes 210 include stippling to indicate which quantization precision is assigned to the node 210 (e.g., by a quantization profile, such as the quantization profile 110 of FIG. 1) . For example, as illustrated, the nodes 210B, 210C, 210D, and 210E may use a first quantization precision, while the nodes 210A and 210F use a second quantization precision.
Therefore, as discussed above, the data output by the node 210A and consumed by the node 210B, as well as the data output by the node 210E and consumed by the node 210F, may undergo conversion based on the corresponding quantization operations.
In the illustrated example, as depicted by an arrow 213, the graph 205 may be modified by grouping the nodes 210 based on the quantization precision assigned to each, forming precision groups 215A and 215B. Although two precision groups (corresponding to two quantization precisions) are depicted for conceptual clarity, in aspects, the quantization system may use any number of quantization precision candidates, as discussed above. As illustrated, delineating the nodes 210 into precision groups enables rapid identification of the edge (s) where data conversion may be performed. That is, the quantization system may compute or determine the minimum cut to completely separate the precision groups 215 (e.g., the cut 220) , where each severed edge in the cut 220 corresponds to a conversion.
In some aspects, if each edge has a weight corresponding to the latency of data conversion along the edge, the value of the cut 220 indicates the aggregate or total conversion latency of the quantization profile. By modifying the quantization profile, the nodes 210 in each precision group 215 may change, resulting in a changed minimum cut 220 and changed conversion latency, as discussed above. As discussed above, in some aspects, this graph formulation is used to facilitate the process of modifying the quantization profile (e.g., using machine learning models, integer programming, and the like) .
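As a non-limiting illustration of this graph formulation, the following Python sketch computes the total conversion latency of a quantization profile as the weight of the edges crossing between precision groups (i.e., the value of the cut separating the groups) . The specific edges, precision labels, and latency weights loosely mirror FIG. 2 but are hypothetical values chosen for the example.

# Hypothetical quantizer group graph in the spirit of FIG. 2: directed edges with
# conversion latency weights, and a quantization precision assigned to each node.
edges = {("A", "B"): 2.0, ("B", "C"): 1.5, ("B", "E"): 1.0,
         ("C", "D"): 1.5, ("D", "E"): 1.0, ("E", "F"): 2.0}
precision = {"A": "second", "B": "first", "C": "first",
             "D": "first", "E": "first", "F": "second"}

def conversion_latency(edges, precision):
    # Sum the weights of edges whose endpoints use different precisions; this equals
    # the value of the cut separating the precision groups, i.e., the aggregate
    # conversion latency of the profile.
    return sum(w for (u, v), w in edges.items() if precision[u] != precision[v])

print(conversion_latency(edges, precision))  # edges A-B and E-F cross the cut: 4.0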
Example Method for Reduced Latency Mixed-Precision Quantized Machine Learning Models
FIG. 3 is a flow diagram depicting an example method 300 for reduced latency mixed-precision quantized machine learning models, according to some aspects of the present disclosure. In some aspects, the method 300 is performed by a quantization system, such as the quantization system 115 of FIG. 1 and/or the quantization system discussed above with reference to FIG. 2.
At block 305, the quantization system accesses a trained machine learning model (e.g., the machine learning model 105 of FIG. 1) . For example, as discussed above, the quantization system may itself train the machine learning model, or may access the model from another system (e.g., a dedicated training system) . As discussed above, the machine learning model may generally comprise a set or sequence of operations, where each operation may correspond to any data processing transformation or operation, such as a layer of a neural network, a convolution operation, a transformer operation, and the like. In some aspects, as discussed above, the machine learning model may be encoded in full (or at least high) precision. That is, in some aspects, the machine learning model is not quantized.
At block 310, the quantization system determines a set of quantization precision candidates for the machine learning model. For example, in some aspects, the candidates may correspond to the quantization schemes that are supported by the device (s) which will use the quantized machine learning model for inferencing. In some aspects, the quantization system determines which candidates have hardware support on the inferencing device (s) (e.g., dedicated accelerator hardware that can operate on or use the candidates) . In some aspects, the candidate quantization precisions are indicated or provided along with the machine learning model (e.g., by the training system or a user) .
At block 315, the quantization system determines an initial quantization profile (e.g., the quantization profile 110 of FIG. 1) for the machine learning model. As discussed above, the quantization profile generally indicates, for one or more operations of the machine learning model, a quantization precision or scheme. For example, as discussed above, the quantization profile may specify the precision (e.g., bit-width and/or data type) of the activations (e.g., data processed by the operation) and/or the parameters (e.g., the weights used to perform the operation) .
In some aspects, the quantization system accesses or receives the quantization profile from another system (e.g., the training system) . In some aspects, the quantization system generates the quantization profile. For example, in some aspects, the quantization system (or another system) uses one or more AMP techniques to generate the quantization profile.
At block 320, the quantization system determines modification constraints for the quantization profile. For example, the constraints may relate to allowable quantization precisions (e.g., indicating that one or more precisions should not be used) , allowable sequences of precision (e.g., sets of precisions that should, or should not, be used for adjacent operations) , allowable precision changes (e.g., indicating that the quantization system should not decrease the precision of any operations, as compared to the initial quantization profile) , and the like. In some aspects, at block 320, the quantization system determines the constraints as defined above in Expressions 2-5 and/or 7-10. In some aspects, the quantization system determines the constraints based at least in part on the number of quantization precision candidates determined at block 310 (e.g., using constraints given in Expressions 2-5 when two candidates are used, and constraints given in Expressions 7-10 when three or more candidates are used) .
At block 325, the quantization system generates a set of modifications for the initial quantization profile, as discussed above. For example, in some aspects, the quantization system processes the initial quantization profile and/or the machine learning model (e.g., a graph comprising nodes and edges representing the sequence of operations) using one or more modification machine learning models to generate the modifications. In some aspects, the quantization system may use various techniques such as one or more heuristics-based algorithms, one or more search algorithms, and the like. In some aspects, as discussed above, the quantization system may use one or more integer programming
operations or techniques to generate the modifications, such as by seeking to minimize, or at least reduce, Expression 1 and/or Expression 6 (using suitable constraints) .
In some aspects, as discussed above, the quantization system generally generates the modifications based in part on the latency introduced by converting data between precision candidates. That is, to generate the modifications, the quantization system may evaluate the latency that will be caused by converting data whenever two different quantization precisions are used in sequential operations (e.g., between node 210A and node 210B in FIG. 2) .
Generally, the quantization system may use a variety of techniques to determine the conversion costs. For example, in some aspects, the quantization system may use a fixed or uniform latency value for all conversions. In some aspects, the quantization system may determine the conversion latency for each respective pair of operations based on data such as the size of the tensors that are to be converted (e.g., generating a value between zero and one based on the number of elements, where the latency value is directly proportional to the number of elements) . In some aspects, the quantization system may determine the conversion latency based on data such as the tiling strategy used for the conversion (or for the subsequent operation) . For example, the quantization system may process data such as the tiling strategy, the tensor size, and the like using a machine learning model (e.g., a linear regression model) to infer the latency.
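For illustration, the following Python functions sketch the three conversion-cost strategies described above: a static value, a value scaled by tensor size, and a value predicted from tensor size and tiling strategy. The constants, the normalization bound, and the placeholder regression coefficients are assumptions for this sketch only.

def static_conversion_latency():
    # Fixed, uniform latency used for every conversion.
    return 1.0

def size_scaled_conversion_latency(num_elements, max_elements=1_000_000):
    # Value between zero and one, directly proportional to the number of tensor elements.
    return min(num_elements / max_elements, 1.0)

def predicted_conversion_latency(num_elements, num_tiles, weights=(1.2e-6, 0.05, 0.3)):
    # Simple linear-regression-style estimate over tensor size and tiling strategy;
    # the coefficients here are placeholders rather than fitted values.
    w_size, w_tiles, bias = weights
    return w_size * num_elements + w_tiles * num_tiles + bias

# Example: estimates for converting a tensor with dimensions (1, 64, 56, 56).
n, c, h, w = 1, 64, 56, 56
print(size_scaled_conversion_latency(n * c * h * w))
print(predicted_conversion_latency(n * c * h * w, num_tiles=4))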
The set of modifications can then be used to generate a modified quantization profile (e.g., the modified quantization profile 125 of FIG. 1) .
At block 330, the quantization system outputs a quantized machine learning model (e.g., the quantized machine learning model 135 of FIG. 1) by quantizing the initial machine learning model (accessed at block 305) using the modified quantization profile (generated at block 325) . As discussed above, generating the quantized model may generally include quantizing the parameters of each operation as indicated in the modified quantization profile, adding appropriate conversion operations between machine learning model operations (e.g., wherever the quantization precision of sequential operations is different) , and the like.
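As a simplified illustration of inserting conversion operations at block 330, the following Python sketch walks a sequential list of operations and adds an explicit conversion step wherever adjacent operations use different activation bit-widths. The model structure, operation names, and the convert placeholder are hypothetical stand-ins for the kernels a given runtime would actually provide.

# Hypothetical sequential model: (operation name, activation bit-width taken from
# the modified quantization profile).
model = [("conv1", 8), ("conv2", 8), ("attention", 16), ("fc", 8)]

def insert_conversions(model):
    # Return the operation list with explicit conversion steps added wherever
    # adjacent operations use different activation bit-widths.
    out = [model[0]]
    for (_, prev_bits), (name, bits) in zip(model, model[1:]):
        if prev_bits != bits:
            out.append((f"convert_{prev_bits}to{bits}", bits))
        out.append((name, bits))
    return out

for op in insert_conversions(model):
    print(op)
# conv2 -> attention gains a convert_8to16 step; attention -> fc gains convert_16to8.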
As discussed above, the quantized machine learning model may generally have lower runtime latency, as compared to both the original (non-quantized) machine learning model, as well as compared to a version of the machine learning model quantized
according to the initial quantization profile. Further, in some aspects, the quantized machine learning model may be at least as accurate as the version of the machine learning model quantized according to the initial quantization profile (and potentially more accurate) . For example, if the quantization system is constrained to refrain from reducing the quantization precision of any operations (e.g., either increasing the precision or leaving the precision unchanged for each operation) , the resulting quantized model will generally exhibit the same (or better) accuracy.
In this way, the quantization system can generate improved machine learning models that are substantially faster during runtime without harming model accuracy. As discussed above, the quantized machine learning model may then be deployed for runtime use (e.g., to an inferencing system, by the quantization system itself, and the like) .
Example Method for Generating and Evaluating Quantization Profiles to Reduce Machine Learning Model Latency
FIG. 4 is a flow diagram depicting an example method 400 for generating and evaluating quantization profiles to reduce machine learning model latency, according to some aspects of the present disclosure. In some aspects, the method 400 is performed by a quantization system, such as the quantization system 115 of FIG. 1 and/or the quantization system discussed above with reference to FIGS. 2-3. In some aspects, the method 400 provides additional detail for block 325 of FIG. 3.
At block 405, the quantization system selects a value for a conversion latency hyperparameter (e.g., α in Expressions 4 and/or 9) . In some aspects, as discussed above, the conversion latency hyperparameter may generally be selected to balance the strength or influence of the conversion latency constraint. That is, the conversion latency hyperparameter generally indicates the relative importance of reducing conversion latency as compared to maintaining the initial quantization profile. For example, in some aspects, the conversion latency hyperparameter is assigned a value between zero and one (inclusive) , where a value of zero causes all operations to be assigned to the higher precision (e.g., such that the total conversion latency is zero) , while a value of one causes all operations to have the same precision as in the initial quantization profile (e.g., no changes are made) .
In some aspects, as discussed above, as the overall inferencing latency of a quantized model is a nonlinear (and generally unknown) function of factors such as the
data conversion latencies and the operation latencies themselves, it is often difficult or impossible to determine a good balance between the conversion control and the operation latency minimization (or at least reduction) . Therefore, any given value for the conversion latency hyperparameter may not work equally well for a range of different models. In the illustrated method 400, therefore, the quantization system can evaluate multiple values for the conversion latency hyperparameter, generating the corresponding mixed quantization precision profiles for each. The quantization system can then decide which value to use for the quantization using a latency evaluation of these profiles on target data (as discussed above) .
Generally, the quantization system may select the value at block 405 using any suitable criteria. For example, the quantization system may select the value from a defined range (e.g., between zero and one, or between 0.2 and 0.8) , and may select values within this range according to any suitable criteria (e.g., evaluating N values equally spaced in the range) . In some aspects, the hyperparameter range and/or the number (and/or spacing) of values to evaluate may be predefined, or may be specified (e.g., by a user) .
At block 410, the quantization system generates a set of modifications for the quantization profile based on the selected value for the conversion latency hyperparameter. For example, as discussed above, the hyperparameter may be used as an input to the search algorithm (s) , machine learning model (s) , integer programming constraint (s) , and the like.
At block 415, the quantization system quantizes the initial machine learning model (e.g., the model 105 of FIG. 1) using the generated set of modifications, as discussed above. For example, the quantization system may modify the initial quantization profile (e.g., the quantization profile 110 of FIG. 1) using the modifications, and quantize the model using the modified quantization profile.
At block 420, the quantization system evaluates or determines the inferencing latency of the quantized machine learning model (generated at block 415) . For example, the quantization system may process test data (e.g., from the target domain, such as data that is the same or similar to data that will be processed during runtime) using the model, determining the latency (e.g., the time between when input is provided to the model and when the output is available) . In some aspects, using multiple test inputs, the quantization
system may determine the minimum latency, the maximum latency, the average latency, the median latency, the standard deviation or variance of the latency, and the like.
At block 425, the quantization system determines whether at least one additional value for the conversion latency hyperparameter remains to be evaluated. If so, the method 400 returns to block 405 to select a new value. If not, the method 400 continues to block 430. Although the illustrated method 400 depicts an iterative process (e.g., selecting and evaluating each value in sequence) for conceptual clarity, in some aspects, the quantization system may evaluate some or all of the alternative values in parallel.
At block 430, the quantization system selects a set of modifications for the quantization profile based on the determined latency of each alternative value of the conversion latency hyperparameter. For example, as discussed above, the quantization system may select the value that resulted in the lowest average latency, the smallest latency variance, the value which resulted in the lowest maximum latency, and the like.
In some aspects, if the method 400 is used to evaluate alternative values for the conversion latency hyperparameter, the quantization system may select and provide the corresponding quantized model for inferencing directly (e.g., rather than re-quantizing the model at block 330 of FIG. 3) .
As discussed above, by evaluating multiple alternative values for the conversion latency hyperparameter, the quantization system can efficiently find an optimal (or at least improved) balance between conversion latencies and operational latencies, ensuring that the resulting quantized machine learning model performs efficiently and accurately.
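As a non-limiting sketch of the evaluation loop of the method 400, the following Python function tries several candidate values for the conversion latency hyperparameter, quantizes the model for each, measures the average latency over a set of test inputs, and keeps the value yielding the lowest latency. The generate_modifications, quantize, and run callables are hypothetical placeholders for the operations described above rather than an actual API.

import time

def sweep_conversion_latency_hyperparameter(model, profile, test_inputs,
                                            generate_modifications, quantize, run,
                                            candidate_alphas=(0.2, 0.4, 0.6, 0.8)):
    # generate_modifications(profile, alpha) -> modified profile,
    # quantize(model, profile) -> quantized model, and run(model, x) -> output
    # are assumed to be supplied by the surrounding quantization system.
    best_alpha, best_model, best_latency = None, None, float("inf")
    for alpha in candidate_alphas:
        modified_profile = generate_modifications(profile, alpha)
        quantized = quantize(model, modified_profile)
        # Average wall-clock latency over the test inputs for this alpha value.
        start = time.perf_counter()
        for x in test_inputs:
            run(quantized, x)
        latency = (time.perf_counter() - start) / len(test_inputs)
        if latency < best_latency:
            best_alpha, best_model, best_latency = alpha, quantized, latency
    return best_alpha, best_model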
Example Method for Training Conversion Latency Prediction Models
FIG. 5 is a flow diagram depicting an example method 500 for training conversion latency prediction models, according to some aspects of the present disclosure. In some aspects, the method 500 is performed by a quantization system, such as the quantization system 115 of FIG. 1 and/or the quantization system discussed above with reference to FIGS. 2-4.
At block 505, the quantization system selects a tensor conversion operation from a (quantized) machine learning model architecture. That is, the quantization system
identifies a pair of sequential operations that use different quantization precisions for the activation data (e.g., where the output of the first operation is converted to a different precision before being provided as input to the second operation) . Generally, the quantization system may select the conversion using any suitable criteria, as the quantization system may evaluate any number and variety of conversions (from any number and variety of model architectures) to train the conversion latency prediction model.
At block 510, the quantization system determines the size of the tensor conversion. That is, the quantization system determines the number of elements of the tensor that is converted by the selected tensor conversion. For example, if the tensor has dimensions (n, c, h, w) , the size of the tensor may be defined as n*c*h*w.
At block 515, the quantization system determines the tiling strategy of the conversion. That is, the quantization system may determine whether tiling is used to convert sub-tensors from the tensor in parallel, and/or may determine the number of such tiles or sub-tensors that are processed in parallel.
At block 520, the quantization system determines the latency of the selected conversion operation. For example, the quantization system may perform the tensor conversion operation one or more times using one or more tensors and monitor the latency (e.g., to determine the minimum, maximum, average, median, or other latency value) .
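For illustration, the latency measurement of block 520 might resemble the following sketch, which times a representative tensor conversion (approximated here by a NumPy dtype cast) over several repetitions and records summary statistics. The use of NumPy, the tensor shape, and the repetition count are assumptions for the example only.

import time
import numpy as np

def measure_conversion_latency(shape=(1, 64, 56, 56), repeats=50):
    # Time a representative precision conversion and return summary statistics.
    tensor = np.random.rand(*shape).astype(np.float32)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        tensor.astype(np.float16)  # stand-in for the runtime's actual conversion kernel
        samples.append(time.perf_counter() - start)
    return {"min": min(samples), "max": max(samples), "mean": sum(samples) / len(samples)}

print(measure_conversion_latency())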
At block 525, the quantization system trains a conversion latency machine learning model to predict conversion latency based on the collected data. For example, the quantization system may use characteristics such as the tensor size and tiling strategy as input to the model to generate an output prediction, and the output prediction may be compared against the actual latency (determined at block 520) to generate a loss. This loss may then be used to refine the parameter (s) of the model. In some aspects, as discussed above, the conversion latency machine learning model comprises a regression model fitted to the conversion latencies.
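As an illustrative sketch of block 525, the following Python code fits a simple linear regression from tensor size and tile count to measured conversion latency using a least-squares solve, consistent with the regression model mentioned above. The measurements shown are fabricated placeholders for the latencies that would actually be collected at blocks 505-520.

import numpy as np

# Hypothetical training data: (tensor size in elements, number of parallel tiles)
# paired with a measured conversion latency in milliseconds.
sizes = np.array([200_704, 401_408, 802_816, 1_605_632], dtype=float)
tiles = np.array([1, 2, 4, 4], dtype=float)
latencies = np.array([0.21, 0.38, 0.69, 1.35])

# Design matrix with a bias column; fit the coefficients by least squares.
features = np.column_stack([sizes, tiles, np.ones_like(sizes)])
weights, *_ = np.linalg.lstsq(features, latencies, rcond=None)

def predict_conversion_latency(size, num_tiles):
    # Regression-based conversion latency estimate for a new tensor conversion.
    return float(np.dot(weights, [size, num_tiles, 1.0]))

print(predict_conversion_latency(602_112, 2))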
At block 530, the quantization system determines whether one or more termination criteria are met. Generally, the termination criteria may vary depending on the particular implementation. For example, the quantization system may determine whether additional tensor conversion operations remain to be tested, whether a defined number of epochs or iterations have been performed, whether a defined amount of time
and/or resources have been spent training, whether the model accuracy has reached a desired threshold, and the like. If the termination criteria are not met, the method 500 returns to block 505. If the termination criteria are met, the method 500 continues to block 535. Although the illustrated example depicts updating the model for each individual sample (e.g., using stochastic gradient descent) for conceptual clarity, in some aspects, the quantization system may update the model using batches of samples (e.g., batch gradient descent) .
At block 535, the quantization system deploys the conversion latency machine learning model for inferencing. For example, as discussed above, the conversion latency machine learning model may be used to predict the conversion latency of various operation sequences (e.g., based on changes in quantization precision) when generating modifications to quantization profiles. For example, the conversion latency machine learning model may be used to define the latency constraints, such as using Expressions 4 and/or 9.
Example Method for Mixed-Precision Quantization
FIG. 6 is a flow diagram depicting an example method 600 for mixed-precision quantization, according to some aspects of the present disclosure. In some aspects, the method 600 is performed by a quantization system, such as the quantization system 115 of FIG. 1 and/or the quantization system discussed above with reference to FIGS. 2-5.
At block 605, a quantization profile for a machine learning model is accessed, the quantization profile indicating, for each respective operation of a plurality of operations of the machine learning model, a respective quantization precision of a plurality of quantization precisions.
At block 610, a first set of modifications is generated for the quantization profile based on conversion latency for converting tensors among the plurality of quantization precisions, wherein each respective modification of the first set of modifications indicates to increase a respective quantization precision of a respective operation of the plurality of operations.
At block 615, a modified quantization profile is generated based on modifying the quantization profile using the first set of modifications.
At block 620, the machine learning model is quantized in accordance with the modified quantization profile.
In some aspects, the plurality of quantization precisions comprises one or more of: (i) a set of weight bit-widths, (ii) a set of activation bit-widths, or (iii) a set of data types.
In some aspects, the first set of modifications is generated based on a first value for a conversion latency hyperparameter, and the method 600 further comprises generating a second set of modifications for the quantization profile based on a second value for the conversion latency hyperparameter, determining a first latency of the machine learning model quantized according to the quantization profile and the first set of modifications, determining a second latency of the machine learning model quantized according to the quantization profile and the second set of modifications, and selecting the first set of modifications in response to determining that the first latency is less than the second latency.
In some aspects, the conversion latency comprises a static value for converting the tensors among the plurality of quantization precisions.
In some aspects, the method 600 further includes determining a respective conversion latency for each respective operation of at least a subset of operations of the plurality of operations based at least in part on a respective tensor size of the respective operation.
In some aspects, determining the respective conversion latencies comprises processing the respective tensor sizes using a conversion latency machine learning model.
In some aspects, determining the respective conversion latencies further comprises processing a tiling strategy of the respective operation using the conversion latency machine learning model.
In some aspects, generating the first set of modifications comprises using integer programming, and the integer programming is constrained to refrain from decreasing quantization precision of any operations of the plurality of operations.
In some aspects, the integer programming comprises minimizing ∑v∈V (BitOps (v, X) xv+BitOps (v, Y) yv) and using constraints: xv+yv=1, yv=1 for each operation v assigned the second quantization precision Y by the quantization profile (e.g., where sol (v) =Y) , ∑ (u, v) ∈E|xu-xv|fuv≤αCf, and xv, yv∈ {0, 1} , where X and Y are first and second quantization precisions, respectively, of the plurality of quantization precisions, u and v are operations of the plurality of operations, V is the plurality of operations, E is a set of edges representing data flow among the plurality of operations, BitOps (v, X) is an operation latency of performing the operation v using the first quantization precision X, BitOps (v, Y) is the operation latency of performing the operation v using the second quantization precision Y, xv and yv are decision variables, wherein an assignment of xv=1 indicates that the operation v is assigned the first quantization precision X and an assignment of yv=1 indicates that the operation v is assigned the second quantization precision Y, sol (u) is a quantization precision assigned to an operation u by the quantization profile, fuv is a conversion latency of converting a tensor between the operations u and v, α is a conversion latency hyperparameter, and Cf is a total conversion latency of the machine learning model quantized according to the quantization profile.
In some aspects, the integer programming comprises minimizing ∑v∈V∑i∈ [d] BitOps (v, Pi) xv, i and using constraints: ∑i∈ [d] xv, i=1, xv, i=0 for each i∈Lv, and xv, i∈ {0, 1} , wherein u and v are operations of the plurality of operations, V is the plurality of operations, E is a set of edges representing data flow among the plurality of operations, Pi is an i-th quantization precision of the plurality of quantization precisions, BitOps (v, Pi) is an operation latency of performing the operation v using the quantization precision Pi, d is a number of the plurality of quantization precisions, xv, i is a decision variable, an assignment of xv, i=1 indicates that the operation v is assigned the i-th quantization precision, Lv is a subset of the plurality of quantization precisions such that decreases in quantization precision are disallowed, a conversion indicator for the i-th and j-th precision candidates is equal to one only if an activation bit-width of the i-th quantization precision is different than an activation bit-width of a j-th precision candidate (and is otherwise zero) , fuv is a conversion latency of converting a tensor between operations u and v, α is a conversion latency hyperparameter, and Cf is a total conversion latency of the machine learning model quantized according to the quantization profile.
Example Processing System for Machine Learning
FIG. 7 depicts an example processing system 700 configured to perform various aspects of the present disclosure, including, for example, the techniques and
methods described with respect to FIGS. 1-6. In some aspects, the processing system 700 may correspond to a quantization system. For example, the processing system 700 may correspond to the quantization system 115 of FIG. 1, and/or the quantization system discussed above with reference to FIGS. 2-6. Although depicted as a single system for conceptual clarity, in some aspects, as discussed above, the operations described below with respect to the processing system 700 may be distributed across any number of devices or systems.
The processing system 700 includes a central processing unit (CPU) 702, which in some examples may be a multi-core CPU. Instructions executed at the CPU 702 may be loaded, for example, from a program memory associated with the CPU 702 or may be loaded from a memory partition (e.g., a partition of a memory 724) .
The processing system 700 also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU) 704, a digital signal processor (DSP) 706, a neural processing unit (NPU) 708, a multimedia component 710 (e.g., a multimedia processing unit) , and a wireless connectivity component 712.
An NPU, such as the NPU 708, is generally a specialized circuit configured for implementing the control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs) , deep neural networks (DNNs) , random forests (RFs) , and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP) , tensor processing unit (TPU) , neural network processor (NNP) , intelligence processing unit (IPU) , vision processing unit (VPU) , or graph processing unit.
NPUs, such as the NPU 708, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC) , while in other examples the NPUs may be part of a dedicated neural-network accelerator.
NPUs may be optimized for training or inference, or in some cases configured to balance performance between both. For NPUs that are capable of performing both training and inference, the two tasks may still generally be performed independently.
NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that
involves inputting an existing dataset (often labeled or tagged) , iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance. Generally, optimizing based on a wrong prediction involves propagating back through the layers of the model and determining gradients to reduce the prediction error.
NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this piece of data through an already trained model to generate a model output (e.g., an inference) .
In some implementations, the NPU 708 is a part of one or more of the CPU 702, the GPU 704, and/or the DSP 706.
In some examples, the wireless connectivity component 712 may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., Long-Term Evolution (LTE) ) , fifth generation (5G) connectivity (e.g., New Radio (NR) ) , Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards. The wireless connectivity component 712 is further coupled to one or more antennas 714.
The processing system 700 may also include one or more sensor processing units 716 associated with any manner of sensor, one or more image signal processors (ISPs) 718 associated with any manner of image sensor, and/or a navigation processor 720, which may include satellite-based positioning system components (e.g., GPS or GLONASS) as well as inertial positioning system components.
The processing system 700 may also include one or more input and/or output devices 722, such as screens, touch-sensitive surfaces (including touch-sensitive displays) , physical buttons, speakers, microphones, and the like.
In some examples, one or more of the processors of the processing system 700 may be based on an ARM or RISC-V instruction set.
The processing system 700 also includes a memory 724, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like. In this example, the memory 724 includes computer-executable components, which may be executed by one or more of the aforementioned processors of the processing system 700.
In particular, in this example, the memory 724 includes a modification component 724A, a quantization component 724B, and a latency component 724C. Although not depicted in the illustrated example, the memory 724 may also include other components, such as an inferencing or generation component to manage the generation of output predictions using trained machine learning models, a training component used to train or update the machine learning model (s) , and the like. Though depicted as discrete components for conceptual clarity in FIG. 7, the illustrated components (and others not depicted) may be collectively or individually implemented in various aspects.
As illustrated, the memory 724 also includes a set of quantization precisions 724D (e.g., indicating the precision alternatives or candidates that can be used to quantize machine learning models) . For example, the quantization precisions 724D may indicate the allowable activation bit-width (s) , the allowable parameter bit-width (s) , the allowable data types or formats (e.g., float, double, integer, and the like) , and the like. The memory 724 also includes a set of profile constraints 724E. For example, the profile constraints 724E may indicate limits or restrictions on the modification process, such as indicating that the quantization system should not reduce the precision of any operations, as compared to an initial quantization profile for the operations. In some aspects, the profile constraints 724E are defined as above in Expressions 2-5 and/or 7-10.
Although not depicted in the illustrated example, the memory 724 may also include other data such as training data (e.g., tensor conversion latency data) , model parameters (e.g., for a conversion latency machine learning model) , and the like.
The processing system 700 further comprises a modification circuit 726, a quantization circuit 727, and a latency circuit 728. The depicted circuits, and others not depicted (such as an inferencing circuit) , may be configured to perform various aspects of the techniques described herein.
The modification component 724A and/or the modification circuit 726 (which may correspond to the modification component 120 of FIG. 1) may be used to generate modifications to initial quantization profiles (e.g., to generate the modified quantization profile 125 of FIG. 1) , as discussed above. For example, the modification component 724A and/or the modification circuit 726 may use various techniques such as heuristics, search algorithms, integer programming, and the like to generate the modifications based on the specified machine learning model architecture (e.g., the machine learning model
105 of FIG. 1) and an initial quantization profile (e.g., the quantization profile 110 of FIG. 1) .
The quantization component 724B and/or the quantization circuit 727 (which may correspond to the quantization component 130 of FIG. 1) may be used to quantize machine learning models (e.g., to generate the quantized machine learning model 135 of FIG. 1) , as discussed above. For example, the quantization component 724B and/or the quantization circuit 727 may quantize initial (non-quantized) machine learning models (e.g., the machine learning model 105 of FIG. 1) using modified quantization profile (s) (generated by the modification component 724A and/or the modification circuit 726) to generate the quantized models.
The latency component 724C and/or the latency circuit 728 may be used to evaluate the runtime latencies of quantized machine learning models, as discussed above. For example, the latency component 724C and/or the latency circuit 728 may evaluate models quantized according to different modified quantization profiles (e.g., generated based on different values for a conversion latency hyperparameter, as discussed above) . Based on these evaluations, the latency component 724C and/or the latency circuit 728 may select which quantized model is to be used for inferencing (e.g., to minimize the average and/or maximum runtime latency) .
Though depicted as separate components and circuits for clarity in FIG. 7, the modification circuit 726, the quantization circuit 727, and the latency circuit 728 may collectively or individually be implemented in other processing devices of the processing system 700, such as within the CPU 702, the GPU 704, the DSP 706, the NPU 708, and the like.
Generally, the processing system 700 and/or components thereof may be configured to perform the methods described herein.
Notably, in other aspects, aspects of the processing system 700 may be omitted, such as where the processing system 700 is a server computer or the like. For example, the multimedia component 710, the wireless connectivity component 712, the sensor processing units 716, the ISPs 718, and/or the navigation processor 720 may be omitted in other aspects. Further, aspects of the processing system 700 may be distributed between multiple devices.
In some aspects, as discussed above, the processing system 700 may use the quantized model (quantized according to the modified quantization profile) for runtime inferencing. In some aspects, as discussed above, the processing system 700 may deploy the quantized model (quantized according to the modified quantization profile) to one or more other systems for runtime inferencing, such as to one or more cloud-based systems, one or more wireless devices (e.g., wearables, smartphones, or edge devices) , and the like. That is, the processing system 700 may quantize the model according to the modified quantization profile, and may then either use the quantized model for inferencing, or may provide the quantized model to one or more other systems for inferencing. Similarly, as discussed above, the processing system 700 may itself generate the initial quantization profile, or may receive the initial quantization profile from another system. Additionally, the processing system 700 may itself train the initial machine learning model, or may receive the trained model from another system. Moreover, the processing system 700 may itself quantize the model based on the modified quantization profile, or may provide the modified quantization profile to another system which performs the quantization. Generally, the operations involved in model training, initial quantization profile generation, modified quantization profile generation, model quantization using the modified quantization profile, and inferencing using the quantized model may be combined or distributed across any number and variety of computing systems.
Example Clauses
Implementation examples are described in the following numbered clauses:
Clause 1: A method for machine learning model quantization, comprising: accessing a quantization profile for a machine learning model, the quantization profile indicating, for each respective operation of a plurality of operations of the machine learning model, a respective quantization precision of a plurality of quantization precisions; generating a first set of modifications for the quantization profile based on conversion latency for converting tensors among the plurality of quantization precisions, wherein each respective modification of the first set of modifications indicates to increase a respective quantization precision of a respective operation of the plurality of operations; generating a modified quantization profile based on modifying the quantization profile using the first set of modifications; and quantizing the machine learning model in accordance with the modified quantization profile.
Clause 2: A method according to Clause 1, wherein the plurality of quantization precisions comprises one or more of: (i) a set of weight bit-widths, (ii) a set of activation bit-widths, or (iii) a set of data types.
Clause 3: A method according to any of Clauses 1-2, wherein the first set of modifications is generated based on a first value for a conversion latency hyperparameter and wherein the method further comprises: generating a second set of modifications for the quantization profile based on a second value for the conversion latency hyperparameter; determining a first latency of the machine learning model quantized according to the quantization profile and the first set of modifications; determining a second latency of the machine learning model quantized according to the quantization profile and the second set of modifications; and selecting the first set of modifications in response to determining that the first latency is less than the second latency.
Clause 4: A method according to any of Clauses 1-3, wherein the conversion latency comprises a static value for converting the tensors among the plurality of quantization precisions.
Clause 5: A method according to any of Clauses 1-3, further comprising determining a respective conversion latency for each respective operation of at least a subset of operations of the plurality of operations based at least in part on a respective tensor size of the respective operation.
Clause 6: A method according to Clause 5, wherein determining the respective conversion latencies comprises processing the respective tensor sizes using a conversion latency machine learning model.
Clause 7: A method according to Clause 6, wherein determining the respective conversion latencies further comprises processing a tiling strategy of the respective operation using the conversion latency machine learning model.
Clause 8: A method according to any of Clauses 1-7, wherein generating the first set of modifications comprises using integer programming, and the integer programming is constrained to refrain from decreasing quantization precision of any operations of the plurality of operations.
Clause 9: A method according to Clause 8, wherein the integer programming comprises: minimizing ∑v∈V (BitOps (v, X) xv+BitOps (v, Y) yv) , and using constraints:
xv+yv=1, ∑ (u, v) ∈E|xu-xv|fuv≤αCf, and xv, yv∈ {0, 1} , wherein: X and Y are first and second quantization precisions, respectively, of the plurality of quantization precisions, u and v are operations of the plurality of operations, V is the plurality of operations, E is a set of edges representing data flow among the plurality of operations, BitOps (v, X) is an operation latency of performing the operation v using the first quantization precision X, BitOps(v, Y) is the operation latency of performing the operation v using the second quantization precision Y, xv and yv are decision variables, wherein an assignment of xv=1 indicates that the operation v is assigned the first quantization precision X and an assignment of yv=1 indicates that the operation v is assigned the second quantization precision Y, sol (u) is a quantization precision assigned to an operation u by the quantization profile, fuv is a conversion latency of converting a tensor between the operations u and v, α is a conversion latency hyperparameter, and Cf is a total conversion latency of the machine learning model quantized according to the quantization profile.
Clause 10: A method according to Clause 8, wherein the integer programming comprises: minimizing ∑v∈V∑i∈ [d] BitOps (v, Pi) xv, i, and using constraints: ∑i∈ [d] xv, i=1, xv, i=0 for each i∈Lv, and xv, i∈ {0, 1} , wherein: u and v are operations of the plurality of operations, V is the plurality of operations, E is a set of edges representing data flow among the plurality of operations, Pi is an i-th quantization precision of the plurality of quantization precisions, BitOps (v, Pi) is an operation latency of performing the operation v using the quantization precision Pi, d is a number of the plurality of quantization precisions, xv, i is a decision variable, an assignment of xv, i=1 indicates that the operation v is assigned the i-th quantization precision, Lv is a subset of the plurality of quantization precisions such that decreases in quantization precision are disallowed, a conversion indicator for the i-th and j-th precision candidates is equal to one only if an activation bit-width of the i-th quantization precision is different than an activation bit-width of a j-th precision candidate (and is otherwise zero) , fuv is a conversion latency of converting a tensor between operations u and v, α is a conversion latency hyperparameter, and Cf is a total conversion latency of the machine learning model quantized according to the quantization profile.
Clause 11: A processing system comprising: a memory comprising computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and cause the processing system to perform a method in accordance with any of Clauses 1-10.
Clause 12: A processing system comprising means for performing a method in accordance with any of Clauses 1-10.
Clause 13: A non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform a method in accordance with any of Clauses 1-10.
Clause 14: A computer program product embodied on a computer-readable storage medium comprising code for performing a method in accordance with any of Clauses 1-10.
Additional Considerations
The preceding description is provided to enable any person skilled in the art to practice the various aspects described herein. The examples discussed herein are not limiting of the scope, applicability, or aspects set forth in the claims. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. For example, changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as appropriate. For instance, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. Also, features described with respect to some examples may be combined in some other examples. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to, or other than, the various aspects of the disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of a claim.
As used herein, the word “exemplary” means “serving as an example, instance, or illustration. ” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination with multiples of the same element (e.g., a-a, a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b, b-b-b, b-b-c, c-c, and c-c-c or any other ordering of a, b, and c) .
As used herein, the term “determining” encompasses a wide variety of actions. For example, “determining” may include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure) , ascertaining, and the like. Also, “determining” may include receiving (e.g., receiving information) , accessing (e.g., accessing data in a memory) , and the like. Also, “determining” may include resolving, selecting, choosing, establishing, and the like.
The methods disclosed herein comprise one or more steps or actions for achieving the methods. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims. Further, the various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software component (s) and/or module (s) , including, but not limited to a circuit, an application specific integrated circuit (ASIC) , or processor. Generally, where there are operations illustrated in figures, those operations may have corresponding counterpart means-plus-function components with similar numbering.
The following claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims. Within a claim, reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more. ” Unless specifically stated otherwise, the term “some” refers to one or more. No claim element is to be construed under the provisions of 35 U.S.C. §112 (f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is
recited using the phrase “step for. ” All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.
Claims (20)
- A processing system comprising: one or more memories comprising processor-executable instructions; and one or more processors configured to execute the processor-executable instructions and cause the processing system to: access a quantization profile for a machine learning model, the quantization profile indicating, for each respective operation of a plurality of operations of the machine learning model, a respective quantization precision of a plurality of quantization precisions; generate a first set of modifications for the quantization profile based on conversion latency for converting tensors among the plurality of quantization precisions, wherein each respective modification of the first set of modifications indicates to increase a respective quantization precision of a respective operation of the plurality of operations; generate a modified quantization profile based on modifying the quantization profile using the first set of modifications; and quantize the machine learning model in accordance with the modified quantization profile.
- The processing system of claim 1, wherein the plurality of quantization precisions comprises one or more of: (i) a set of parameter bit-widths, (ii) a set of activation bit-widths, or (iii) a set of data types.
- The processing system of claim 1, wherein the first set of modifications is generated based on a first value for a conversion latency hyperparameter and wherein the one or more processors are configured to further execute the processor-executable instructions and cause the processing system to: generate a second set of modifications for the quantization profile based on a second value for the conversion latency hyperparameter; determine a first latency of the machine learning model quantized according to the quantization profile and the first set of modifications; determine a second latency of the machine learning model quantized according to the quantization profile and the second set of modifications; and select the first set of modifications in response to determining that the first latency is less than the second latency.
- The processing system of claim 1, wherein the conversion latency comprises a static value for converting the tensors among the plurality of quantization precisions.
- The processing system of claim 1, wherein the one or more processors are configured to further execute the processor-executable instructions and cause the processing system to determine a respective conversion latency for each respective operation of at least a subset of operations of the plurality of operations based at least in part on a respective tensor size of the respective operation.
- The processing system of claim 5, wherein, to determine the respective conversion latencies, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to process the respective tensor sizes using a conversion latency machine learning model.
- The processing system of claim 6, wherein, to determine the respective conversion latencies, the one or more processors are configured to further execute the processor-executable instructions and cause the processing system to process a tiling strategy of the respective operation using the conversion latency machine learning model.
- The processing system of claim 1, wherein, to generate the first set of modifications, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to use integer programming, wherein the integer programming is constrained to refrain from decreasing quantization precision of any operations of the plurality of operations.
- The processing system of claim 8, wherein, to use the integer programming, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to:
  minimize ∑_{v∈V} (BitOps(v, X)·x_v + BitOps(v, Y)·y_v), and
  use constraints:
  ∑_{(u,v)∈E} |x_u − x_v|·f_uv ≤ α·C_f, and
  [a further constraint, rendered as an image in the source, is not reproduced here],
  wherein:
  X and Y are first and second quantization precisions, respectively, of the plurality of quantization precisions,
  u and v are operations of the plurality of operations,
  V is the plurality of operations,
  E is a set of edges representing data flow among the plurality of operations,
  BitOps(v, X) is an operation latency of performing the operation v using the first quantization precision X,
  BitOps(v, Y) is the operation latency of performing the operation v using the second quantization precision Y,
  x_v and y_v are decision variables, wherein an assignment of x_v = 1 indicates that the operation v is assigned the first quantization precision X and an assignment of y_v = 1 indicates that the operation v is assigned the second quantization precision Y,
  sol(u) is a quantization precision assigned to an operation u by the quantization profile,
  f_uv is a conversion latency of converting a tensor between the operations u and v,
  α is a conversion latency hyperparameter, and
  C_f is a total conversion latency of the machine learning model quantized according to the quantization profile.
- The processing system of claim 8, wherein, to use the integer programming, the one or more processors are configured to execute the processor-executable instructions and cause the processing system to:
  minimize ∑_{v∈V} ∑_{i∈[d]} BitOps(v, P_i)·x_{v,i}, and
  use constraints:
  [the constraints, rendered as images in the source, are not reproduced here], and
  wherein:
  u and v are operations of the plurality of operations,
  V is the plurality of operations,
  E is a set of edges representing data flow among the plurality of operations,
  P_i is an i-th quantization precision of the plurality of quantization precisions,
  BitOps(v, P_i) is an operation latency of performing the operation v using the quantization precision P_i,
  d is a number of the plurality of quantization precisions,
  x_{v,i} is a decision variable,
  an assignment of x_{v,i} = 1 indicates that the operation v is assigned the i-th quantization precision,
  L_v is a subset of the plurality of quantization precisions such that decreases in quantization precision are disallowed,
  [a symbol not reproduced in the source applies] only if an activation bit-width of the i-th quantization precision is different than an activation bit-width of a j-th precision candidate,
  f_uv is a conversion latency of converting a tensor between operations u and v,
  α is a conversion latency hyperparameter, and
  C_f is a total conversion latency of the machine learning model quantized according to the quantization profile.
- A method for machine learning model quantization, comprising:
  accessing a quantization profile for a machine learning model, the quantization profile indicating, for each respective operation of a plurality of operations of the machine learning model, a respective quantization precision of a plurality of quantization precisions;
  generating a first set of modifications for the quantization profile based on conversion latency for converting tensors among the plurality of quantization precisions, wherein each respective modification of the first set of modifications indicates to increase a respective quantization precision of a respective operation of the plurality of operations;
  generating a modified quantization profile based on modifying the quantization profile using the first set of modifications; and
  quantizing the machine learning model in accordance with the modified quantization profile.
- The method of claim 11, wherein the plurality of quantization precisions comprises one or more of: (i) a set of parameter bit-widths, (ii) a set of activation bit-widths, or (iii) a set of data types.
- The method of claim 11, wherein the first set of modifications is generated based on a first value for a conversion latency hyperparameter and wherein the method further comprises:
  generating a second set of modifications for the quantization profile based on a second value for the conversion latency hyperparameter;
  determining a first latency of the machine learning model quantized according to the quantization profile and the first set of modifications;
  determining a second latency of the machine learning model quantized according to the quantization profile and the second set of modifications; and
  selecting the first set of modifications in response to determining that the first latency is less than the second latency.
- The method of claim 11, wherein the conversion latency comprises a static value for converting the tensors among the plurality of quantization precisions.
- The method of claim 11, further comprising determining a respective conversion latency for each respective operation of at least a subset of operations of the plurality of operations based at least in part on a respective tensor size of the respective operation.
- The method of claim 15, wherein determining the respective conversion latencies comprises processing the respective tensor sizes using a conversion latency machine learning model.
- The method of claim 16, wherein determining the respective conversion latencies further comprises processing a tiling strategy of the respective operation using the conversion latency machine learning model.
- The method of claim 11, wherein generating the first set of modifications comprises using integer programming, and the integer programming is constrained to refrain from decreasing quantization precision of any operations of the plurality of operations.
- A processing system comprising:
  means for accessing a quantization profile for a machine learning model, the quantization profile indicating, for each respective operation of a plurality of operations of the machine learning model, a respective quantization precision of a plurality of quantization precisions;
  means for generating a set of modifications for the quantization profile based on conversion latency for converting tensors among the plurality of quantization precisions, wherein each respective modification of the set of modifications indicates to increase a respective quantization precision of a respective operation of the plurality of operations;
  means for generating a modified quantization profile based on modifying the quantization profile using the set of modifications; and
  means for quantizing the machine learning model in accordance with the modified quantization profile.
- The processing system of claim 19, wherein the means for generating the set of modifications comprise means for using integer programming and wherein the integer programming is constrained to refrain from decreasing quantization precision of any operations of the plurality of operations.
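The latency-estimation claims lend themselves to a short illustration. The sketch below is a minimal, assumed realization of claims 5 to 7, not the disclosed implementation: per-edge tensor sizes (claims 5 and 6), optionally combined with an integer-coded tiling strategy (claim 7), are fed to a small learned regressor standing in for the recited "conversion latency machine learning model". scikit-learn's GradientBoostingRegressor and every function name here are assumptions; the claims do not prescribe a model family.

```python
# Sketch only: a small learned regressor standing in for the "conversion
# latency machine learning model" of claims 6 and 7. Inputs are assumptions:
# tensor sizes in elements, integer-coded tiling strategies, and conversion
# latencies measured by profiling on the target hardware.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_conversion_latency_model(tensor_sizes, tiling_ids, measured_latencies):
    # Stack the two features recited in the claims (tensor size, tiling
    # strategy) into an (n_samples, 2) matrix and fit the regressor.
    features = np.column_stack([tensor_sizes, tiling_ids])
    return GradientBoostingRegressor().fit(features, measured_latencies)

def predict_conversion_latency(model, tensor_size, tiling_id):
    # Estimate f_uv for a single producer/consumer edge.
    return float(model.predict(np.array([[tensor_size, tiling_id]]))[0])
```

Claim 4's alternative would replace this learned estimate with a single static conversion-latency value applied to every edge.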
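Claims 8 to 10 recite an integer-programming formulation. The following is a minimal sketch of the two-precision case of claim 9 under stated assumptions, not the claimed implementation: X is taken to be the higher precision, PuLP serves as a generic ILP solver (the claims do not name one), and the constraints that did not reproduce in the claim text are approximated by requiring exactly one precision per operation and pinning operations the profile already assigns to X, consistent with claim 8's requirement never to decrease precision.

```python
# Illustrative sketch, not the claimed implementation. Assumes a small graph
# of operations, two precisions X (higher) and Y (lower), per-operation
# BitOps latencies, per-edge conversion latencies f_uv, a baseline profile
# `sol`, a hyperparameter alpha, and a baseline conversion latency C_f.
import pulp

def refine_profile(ops, edges, bitops_X, bitops_Y, f, sol, alpha, C_f):
    """Assign X or Y to every op, never decreasing precision, minimizing total
    BitOps latency while keeping conversion latency within alpha * C_f."""
    prob = pulp.LpProblem("conversion_aware_refinement", pulp.LpMinimize)

    # Decision variables of claim 9: x_v = 1 means op v uses precision X,
    # y_v = 1 means op v uses precision Y.
    x = {v: pulp.LpVariable(f"x_{v}", cat="Binary") for v in ops}
    y = {v: pulp.LpVariable(f"y_{v}", cat="Binary") for v in ops}

    # Objective: sum over ops of BitOps(v, X) * x_v + BitOps(v, Y) * y_v.
    prob += pulp.lpSum(bitops_X[v] * x[v] + bitops_Y[v] * y[v] for v in ops)

    for v in ops:
        prob += x[v] + y[v] == 1   # exactly one precision per op (assumed)
        if sol[v] == "X":          # no-decrease constraint of claim 8:
            prob += x[v] == 1      # ops the profile already runs at X stay at X

    # Conversion budget: sum over edges of |x_u - x_v| * f_uv <= alpha * C_f.
    # The absolute value is linearized with z_uv >= |x_u - x_v|; since every
    # feasible z_uv is at least the true difference, the budget also bounds
    # the true conversion cost.
    z = {}
    for (u, v) in edges:
        z[(u, v)] = pulp.LpVariable(f"z_{u}_{v}", lowBound=0, upBound=1)
        prob += z[(u, v)] >= x[u] - x[v]
        prob += z[(u, v)] >= x[v] - x[u]
    prob += pulp.lpSum(f[(u, v)] * z[(u, v)] for (u, v) in edges) <= alpha * C_f

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {v: ("X" if x[v].varValue > 0.5 else "Y") for v in ops}
```

With only two precisions, |x_u − x_v| equals 1 exactly when an edge crosses a precision boundary, so the constraint charges conversion latency only for edges that actually require a conversion.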
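Claims 3 and 13 recite generating modification sets for two values of the conversion latency hyperparameter and keeping whichever yields the lower measured latency. A hypothetical wrapper around a refinement routine such as the one sketched above follows; `refine`, `quantize_with`, and `measure_latency` are assumed stand-ins, not APIs from the disclosure.

```python
# Sketch of the selection step in claims 3 and 13. `refine` could wrap the
# refine_profile() sketch above with a fixed graph; `quantize_with` and
# `measure_latency` stand in for quantizing the model under a modified
# profile and timing it on the target device.
def pick_modifications(model, profile, alphas, refine, quantize_with, measure_latency):
    best_mods, best_latency = None, float("inf")
    for alpha in alphas:                      # e.g. alphas = (0.1, 0.3)
        mods = refine(profile, alpha)         # one modification set per alpha
        candidate = quantize_with(model, profile, mods)
        latency = measure_latency(candidate)  # lower is better
        if latency < best_latency:
            best_mods, best_latency = mods, latency
    return best_mods
```

The claims recite two hyperparameter values; the loop simply generalizes the same comparison to any small candidate set.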
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2024/080716 WO2025184890A1 (en) | 2024-03-08 | 2024-03-08 | Reduced latency for mixed-precision quantized machine learning models |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2024/080716 WO2025184890A1 (en) | 2024-03-08 | 2024-03-08 | Reduced latency for mixed-precision quantized machine learning models |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2025184890A1 (en) | 2025-09-12 |
| WO2025184890A8 (en) | 2025-10-02 |
Family
ID=96989842
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/080716 Pending WO2025184890A1 (en) | 2024-03-08 | 2024-03-08 | Reduced latency for mixed-precision quantized machine learning models |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025184890A1 (en) |
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114503132A (en) * | 2019-09-30 | 2022-05-13 | 亚马逊科技公司 | Debugging and profiling of machine learning model training |
| US20220164666A1 (en) * | 2020-11-20 | 2022-05-26 | Adobe Inc. | Efficient mixed-precision search for quantizers in artificial neural networks |
| CN117529728A (en) * | 2021-04-06 | 2024-02-06 | 高通股份有限公司 | Privacy-aware pruning in machine learning |
| CN117392406A (en) * | 2023-11-07 | 2024-01-12 | 四川大学 | Low-bit-width mixed precision quantization method for single-stage real-time target detection model |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2025184890A8 (en) | 2025-10-02 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN116594748B (en) | Model customization processing method, device, equipment and medium for task | |
| CN113723589A (en) | Hybrid precision neural network | |
| US20230037498A1 (en) | Method and system for generating a predictive model | |
| KR20220054397A (en) | Method and apparatus for predicting kernel tuning parameters | |
| US20240144017A1 (en) | Quantization range estimation for quantized training | |
| Dubhir et al. | Benchmarking of quantization libraries in popular frameworks | |
| WO2025184890A1 (en) | Reduced latency for mixed-precision quantized machine learning models | |
| WO2025101253A1 (en) | Open vocabulary image segmentation | |
| WO2024186332A1 (en) | Mixed-precision quantization in machine learning using model sensitivity and constrained optimization | |
| WO2024073178A1 (en) | Hyperparameter optimization using partitioned machine learning models | |
| US20240046078A1 (en) | Desparsified convolution for sparse activations | |
| WO2025025198A1 (en) | Mixed-precision quantization of machine learning model parameters | |
| US20250356184A1 (en) | Positional embedding generation for machine learning models | |
| JP7107797B2 (en) | Information processing method and information processing system | |
| US20250165301A1 (en) | Efficient execution of machine learning models in heterogeneous processing environments | |
| WO2025189371A1 (en) | Multiple token generation in autoregressive generative artificial intelligence models | |
| US20250165854A1 (en) | Quantization compensation for machine learning models | |
| US20240311622A1 (en) | Selectable data-aware activation functions in neural networks | |
| US20250199767A1 (en) | Dynamic energy saving controller for machine learning hardware accelerators | |
| WO2025227353A1 (en) | Machine learning model multiple adapter support | |
| WO2024197437A9 (en) | Increased accuracy in quantization-aware neural networks using fake quantization nodes | |
| US20240289594A1 (en) | Efficient hidden markov model architecture and inference response | |
| US20230214629A1 (en) | Transformer-based autoregressive language model selection | |
| US20240419963A1 (en) | Power neural network-based workload distribution in distributed computing systems | |
| WO2024227270A1 (en) | Modified convolution parameters to avoid requantizing operations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24927832; Country of ref document: EP; Kind code of ref document: A1 |