US20240378436A1 - Partial Quantization To Achieve Full Quantized Model On Edge Device - Google Patents
- Publication number
- US20240378436A1 (U.S. application Ser. No. 18/479,875)
- Authority
- US
- United States
- Prior art keywords
- updated
- original
- mlm
- quantized
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
A method for partial quantization to achieve full quantized model includes quantizing a plurality of weights and a respective activation function from each of a plurality of respective layers of an original Machine Learning Model (MLM) to generate a quantized MLM comprising a plurality of frozen quantized weights. The plurality of frozen quantized weights are extracted from at least one frozen layer of the layers of the quantized MLM. The plurality of weights are quantized from at least one updated layer of an updated MLM to generate a plurality of updated quantized weights. The respective activation function of the at least one updated layer of the updated MLM is quantized from a difference between the original MLM and the updated MLM, to form a respective quantized activation function. A new quantized MLM is generated from the frozen quantized weights, the updated quantized weights and the respective quantized activation function.
Description
- This disclosure relates generally to neural networks, and more specifically to partial quantization of a neural network to achieve a fully quantized model on an Edge device.
- Neural networks consist of a series of interconnected structures or layers, inspired by the neurons of a human brain. Each layer (e.g., neuron) may include the summation of multiple weighted inputs, which will produce an output if the summation exceeds a respective activation level. A functional model using neural networks may be built with sample data, known as training data, to make predictions or decisions without explicit programming. Such models may also continue to be refined with successive training cycles providing statistical learning. A trained model may then use this adaptive machine learning with a subsequent data set to predict an appropriate response.
- Machine learning is being increasingly used by portable devices such as cell phones and wearable electronic devices. Due to the computing power and storage limitations of these portable devices, quantization may be used to reduce the model's complexity. Typically, training and quantization of a machine learning model requires the portable device to communicate over the internet cloud or with another processor having more computing resources, thus exposing the portable device to potential data security issues and generally limiting operational flexibility.
- The present invention is illustrated by way of example and is not limited by the accompanying figures, in which like references indicate similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
- FIG. 1 is a schematic view of an embodiment of a neuron forming the basis of a neural network.
- FIG. 2 is a graphical view of an embodiment of affine mapping of a real number to an integer number.
- FIG. 3 is a flowchart representation of a method for partial quantization to achieve a fully quantized model, in accordance with an embodiment of the present disclosure.
- FIG. 4 is a flowchart representation of a usage of a partial quantization tool to achieve a fully quantized model, in accordance with an embodiment of the present disclosure.
- FIG. 5 is a flowchart representation of a training and quantization system, in accordance with an embodiment of the present disclosure.
- FIG. 6 is a flowchart representation of a method for partial quantization to achieve full quantized model on Edge device, in accordance with an embodiment of the present disclosure.
- FIG. 7 is a flowchart representation of another method for partial quantization to achieve full quantized model on Edge device, in accordance with an embodiment of the present disclosure.
- FIG. 8 is a flowchart representation of another method for partial quantization to achieve full quantized model on Edge device, in accordance with an embodiment of the present disclosure.
- Embodiments described herein provide for the training and quantization of a Machine Learning Model (MLM) on an Edge device without requiring the resources of a more powerful host machine or a connection to a host or a similar cloud-based computing platform. An Edge device as defined herein includes hardware that controls dataflow at a network boundary, such as a cellular phone, a wearable fitness tracking device, and similar portable devices. Edge devices typically have limited processing power and memory storage compared to host machines.
- Partial training and quantization of an MLM directly on the Edge device avoids the typical requirement of providing a full training dataset on the device, where the full dataset may include proprietary information. Furthermore, by quantizing an MLM directly on the device, security issues associated with the host are avoided (e.g., including malware, data hacking and cloning). Additional personalization of the Edge device may also be realized without the typical data safety concern.
- Partial training and quantization on the device may also facilitate “transfer learning” whereby a device trained by one set of data may be extended or modified to learn with related data by modifying only a subset of the many layers of the MLM. In one non-limiting example of transfer learning, a device is trained to recognize a sedan. Through transfer learning the device is subsequently taught to recognize a truck or a bus. A new MLM with updated layers may thereby be quantized by retaining the quantization of weights associated with frozen layers (e.g., those that have not been updated as a result of transfer learning), by quantizing the weights associated with updated layers, and by quantizing the activation functions based on changes to the scaling and integer zero point between the original MLM and the updated MLM (having the updated layers).
- FIG. 1 shows an example of a neuron, which forms the basis of an MLM. In one example, a neuron may include a plurality of inputs (e.g., tensor inputs) 10a, 10b and 10c (generally 10). Each input 10 is modified by a respective weight 12a, 12b and 12c (generally 12) and summated with a biased summation 14. If the result of the biased summation exceeds a threshold defined by the activation function 16, an output 18 is generated. The example neuron of FIG. 1 may define a single layer from a plurality of interconnected layers of an MLM.
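- As a minimal illustrative sketch of this weighted-summation-and-activation behavior (the function and variable names below are assumptions for illustration, not part of the disclosure):

```python
import numpy as np

def neuron_output(inputs, weights, bias, threshold=0.0):
    """Weighted inputs (10) are combined by a biased summation (14); the
    activation function (16) gates whether an output (18) is produced."""
    s = float(np.dot(inputs, weights)) + bias   # biased summation of the weighted inputs
    return s if s > threshold else 0.0          # output only if the threshold is exceeded

# Three tensor inputs (10a-10c), each modified by a respective weight (12a-12c)
print(neuron_output(np.array([0.5, -1.2, 0.8]), np.array([0.9, 0.1, 0.4]), bias=0.05))
```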
- FIG. 2 shows an example of affine mapping for quantization of a floating point number to an integer number. Specifically, a floating point range 20 includes a floating point minimum (Rmin) 22, a floating point maximum (Rmax) 24 and a floating point zero 26. The affine mapping of FIG. 2 further includes an integer range 30. The integer range 30 includes an integer minimum (Qmin) 32, an integer maximum (Qmax) 34 and an integer zero point (Z) 36. The example of FIG. 2 shows an asymmetric quantization of floating point numbers to integer numbers based on a scaling difference between Rmax 24−Rmin 22 and Qmax 34−Qmin 32, in addition to a potential shift of the floating point zero 26 to an integer zero point 36 within the integer range 30. The affine mapping of FIG. 2 reduces the numbers represented by floating point values to integer values, which require less memory storage and fewer computing resources during subsequent processing of the integer values.
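- A minimal sketch of this asymmetric affine mapping, assuming an unsigned 8-bit integer range of [0, 255] (the helper names are illustrative assumptions):

```python
import numpy as np

def affine_quantize(x, r_min, r_max, q_min=0, q_max=255):
    """Map floats in [r_min, r_max] (Rmin..Rmax) onto integers in [q_min, q_max] (Qmin..Qmax)."""
    scale = (r_max - r_min) / (q_max - q_min)        # scaling difference between the two ranges
    zero_point = int(round(q_min - r_min / scale))   # integer zero point Z
    q = np.clip(np.round(x / scale + zero_point), q_min, q_max).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-0.7, 0.0, 1.3], dtype=np.float32)
q, scale, zp = affine_quantize(x, r_min=-1.0, r_max=2.0)
print(q, affine_dequantize(q, scale, zp))
```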
- FIG. 3 is a flowchart representation of a method for partial quantization to achieve a fully quantized model, in accordance with an embodiment 40 of the present disclosure. In one embodiment, the representative dataset 42 is a subset of a training dataset used to train an original 32-bit floating point (FP32) model 44 to create the original quantized INT8 model 46. Therefore, the model 46 is a pre-quantized model created from the original FP32 model 44. The original quantized INT8 model 46 is already quantized using more than one image before being deployed to the Edge device. The representative dataset 42 may be a subset of data generated or produced on the Edge device itself and used to train the original FP32 model 44 through transfer learning. Specifically, only one image is required from the representative dataset 42 by the original FP32 model 44 and the updated FP32 model 48 to calculate the R1max, R1min, R2max and R2min in the updated layers 52, and ultimately to reconstruct a new quantized model 64.
- It should be appreciated that the example of FIG. 3 uses FP32 and INT8 models for ease of illustration but should not be considered limited to these numeric ranges. The representative dataset 42 may also be used to quantize an on-device training (or updated) FP32 model 48 for transfer learning. The updated model 48 includes frozen layers 50, which remain the same as the corresponding layers of the original FP32 model 44. The updated model 48 further includes updated layers 52, which are different than the corresponding layers of the original FP32 model 44.
- In the embodiment 40, quantized weights are extracted 54 from the frozen layers 50 of the model 46 to form frozen quantized weights. The weights of the updated layers 52 of the updated model 48 are quantized 56 to generate updated quantized weights. With reference to FIG. 2 and FIG. 3, at 60 the floating point maximum (Rmax) and floating point minimum (Rmin) of the range of activation floating point values for the original FP32 model 44 (respectively, R1max and R1min), and for the updated FP32 model 48 (respectively, R2max and R2min) are determined. A Delta-max and a Delta-min value are determined at 62 in accordance with the following equations [1] and [2]:
- Delta-max = R2max − R1max  [1]
- Delta-min = R2min − R1min  [2]
- At 64, a new quantized model is generated with the frozen quantized weights and quantized parameters of activations extracted from 54, the updated quantized weights quantized from 56, and the updated quantized activation functions of the updated model 48, quantized with the Delta-max and Delta-min values from 62. In one embodiment, the quantized parameters of activations include scaling and integer zero point values.
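- A minimal sketch of steps 60 and 62, gathering per-layer activation ranges from a single representative image and forming the Delta values of equations [1] and [2] (the layer representation and helper names are assumptions for illustration):

```python
import numpy as np

def activation_ranges(layers, image):
    """Run one representative image through FP32 layers and record each layer's (Rmin, Rmax)."""
    ranges, x = {}, image
    for name, layer_fn in layers:          # layer_fn maps an input tensor to an activation tensor
        x = layer_fn(x)
        ranges[name] = (float(x.min()), float(x.max()))
    return ranges

def delta_values(original_ranges, updated_ranges, layer_name):
    r1_min, r1_max = original_ranges[layer_name]   # R1min, R1max from the original FP32 model
    r2_min, r2_max = updated_ranges[layer_name]    # R2min, R2max from the updated FP32 model
    return r2_max - r1_max, r2_min - r1_min        # equations [1] and [2]

image = np.linspace(0.0, 1.0, 8).reshape(1, 8)
w_orig, w_upd = np.full((8, 4), 0.1), np.full((8, 4), 0.2)
orig = activation_ranges([("fc", lambda t: np.maximum(t @ w_orig, 0.0))], image)
upd = activation_ranges([("fc", lambda t: np.maximum(t @ w_upd, 0.0))], image)
print(delta_values(orig, upd, "fc"))      # (Delta-max, Delta-min)
```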
- FIG. 4 is a flowchart representation of a usage of a partial quantization tool to achieve a fully quantized model, in accordance with an embodiment 70 of the present disclosure. Referring to FIG. 4, the embodiment 70 further defines the method for reconstructing the new quantized model as shown in embodiment 40 of FIG. 3. Similar to FIG. 3, the partial quantization tool 72 receives inputs from an original FP32 model 44, a quantized INT8 model 46, and an updated FP32 model 48, and generates a new quantized model 74 from the model reconstruction step 64. With reference to FIG. 2 and FIG. 4, the partial quantization tool 72 determines an original scale (S1) and an original integer zero point (Z1) at 76 in accordance with the following equations [3] and [4]:
- S1 = (R1max − R1min) / (Qmax − Qmin)  [3]
- Z1 = Qmin − (R1min / S1)  [4]
- The partial quantization tool 72 determines an updated scale (S2) and an updated integer zero point (Z2) at 78 in accordance with the following equations [5] and [6]:
- S2 = S1 + (Delta-max − Delta-min) / 255  [5]
- Z2 = Z1 + (R1min / S1 − R2min / S2)  [6]
-
- FIG. 5 is a flowchart representation of a training and quantization system, in accordance with an embodiment 80 of the present disclosure. A new dataset is provided to train an updated model 48 with updated layers. In one example, the new dataset 82 is used to train the updated model 48 using a TensorFlow (TF) Lite program 84. With reference to FIG. 3 and FIG. 5, a representative dataset 42 (derived as a subset of the new dataset 82) is used with the original quantized model 46, the updated model 48 and the original floating point model 44 to generate a new quantized model 74, using the partial quantization tool of FIG. 4. In one embodiment, the representative dataset 42, the original quantized model 46, the updated model 48 and the original floating point model 44 are all stored on an Edge device, without connection to other computing resources (e.g., a host or cloud connection). The new quantized model 74 may then be used by a Neural Processing Unit (NPU) for inferencing to perform transfer learning.
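- The disclosure does not mandate a particular training API; as one hedged sketch of the freeze-then-retrain step using Keras-style layer freezing (the model architecture and the use of Keras here are illustrative assumptions, not the TF Lite program 84 itself):

```python
import tensorflow as tf

def prepare_for_transfer_learning(model, num_updated_layers=1):
    """Keep earlier layers frozen (frozen layers 50) and retrain only the last layers (updated layers 52)."""
    for layer in model.layers[:-num_updated_layers]:
        layer.trainable = False      # frozen layers keep the original FP32 weights
    for layer in model.layers[-num_updated_layers:]:
        layer.trainable = True       # updated layers are trained with the new dataset 82
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model = prepare_for_transfer_learning(model)
# model.fit(...) on the data collected on the Edge device would then produce the updated model 48
```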
- FIG. 6 shows an embodiment 90 of a method for partial quantization to achieve full quantized model. With continued reference to FIG. 3 and FIG. 4, at 92 a plurality of weights and a respective activation function of an original MLM 44 are quantized to generate a quantized MLM 46. At 94, frozen quantized weights and quantized activations parameters are extracted 54 from the quantized MLM. At 96, weights from an updated layer 52 of an updated MLM 48 are quantized 56. At 98, a respective activation function of the updated MLM 48 is quantized from a difference between the original MLM 44 and the updated MLM 48. At 100, a new quantized MLM 64 is generated from the frozen weights, the quantized weights from the updated MLM 48 and the quantized activation function of the updated MLM 48.
- FIG. 7 shows an embodiment 110 of a method for partial quantization to achieve full quantized model. With continued reference to FIG. 3 and FIG. 4, at 112 frozen quantized weights and quantized activations parameters are extracted 54 from at least one frozen layer 50 of a quantized MLM 46, generated from an original MLM 44. At 114, weights from an updated layer 52 of an updated MLM 48 are quantized 56. At 116, a respective activation function of the updated MLM 48 is quantized from an updated scale S2 and an updated integer zero point Z2 (see 78 of FIG. 4). At 118, a new quantized MLM 64 is generated from the frozen weights, the quantized weights from the updated MLM 48 and the quantized activation function of the updated MLM 48.
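- A minimal sketch of the reconstruction described for FIG. 6 and FIG. 7: frozen-layer records are copied from the original quantized model, while updated-layer records are rebuilt from the newly quantized weights and the updated activation parameters (the dictionary-based model format and example values are illustrative assumptions):

```python
def reconstruct_quantized_model(quantized_original, updated_quantized_weights,
                                updated_activation_params, updated_layer_names):
    """Merge frozen and updated per-layer quantization records into a new quantized model."""
    new_model = {}
    for name, record in quantized_original.items():
        if name in updated_layer_names:
            new_model[name] = {
                "weights": updated_quantized_weights[name],       # quantized at step 56
                "activation": updated_activation_params[name],    # updated (S2, Z2)
            }
        else:
            new_model[name] = record                              # frozen layer: reuse step 54 extraction
    return new_model

original = {"conv1": {"weights": [12, 250, 3], "activation": (0.0118, 85)},
            "fc":    {"weights": [7, 99, 201], "activation": (0.0118, 85)}}
new_model = reconstruct_quantized_model(original,
                                        updated_quantized_weights={"fc": [5, 110, 190]},
                                        updated_activation_params={"fc": (0.0133, 82)},
                                        updated_layer_names={"fc"})
print(new_model)
```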
- FIG. 8 shows an embodiment 120 of a method for partial quantization to achieve full quantized model. With continued reference to FIG. 3, FIG. 4, and FIG. 5, at 122 frozen quantized weights are extracted 54 from at least one frozen layer 50 of a quantized model 46, generated from an original model 44. At 124, weights from an updated layer 52 of an updated model 48 are quantized 56. At 126, a respective activation function of an updated layer 52 of the updated model 48 is quantized from an updated scale S2 and an updated integer zero point Z2 (see 78 of FIG. 4), wherein the updated scale S2 modifies an original scale S1 used to quantize the original model 44, and the updated integer zero point Z2 modifies an original integer zero point Z1 used to quantize the original model 44. At 128, a new quantized model 64 is generated from the frozen weights, the quantized weights from the updated model 48 and the quantized activation function of the updated model 48.
- As will be appreciated, at least some of the embodiments as disclosed include at least the following. In one embodiment, a method for partial quantization to achieve full quantized model comprises quantizing a plurality of weights and a respective activation function from each of a plurality of respective layers of an original Machine Learning Model (MLM) to generate a quantized MLM comprising a plurality of frozen quantized weights, each layer comprising a summation of a plurality of inputs, each input multiplied by a respective one of the plurality of weights, and the summation gated by the respective activation function. The plurality of frozen quantized weights of at least one frozen layer are extracted from the layers of the quantized MLM. The plurality of weights from at least one updated layer of an updated MLM are quantized to generate a plurality of updated quantized weights, wherein the at least one updated layer of the updated MLM is generated by updating the respective layer of the original MLM. The respective activation function of the at least one updated layer of the updated MLM is quantized from a difference between the original MLM and the updated MLM, to generate a respective quantized activation function. A new quantized MLM is generated from the frozen quantized weights, the updated quantized weights and the respective quantized activation function.
- Alternative embodiments of the method for partial quantization to achieve full quantized model include one of the following features, or any combination thereof. The difference between the original MLM and the updated MLM is determined by determining an updated scale comprising adding a ratio to an original scale of the original MLM, wherein the ratio is determined by a delta-difference between a delta-max and a delta-min, divided by one less than an integer maximum of an integer range used for quantizing the respective activation function of the updated MLM. The delta-max is determined by subtracting an original maximum of an original floating point range of the activation function of each layer of the original MLM from an updated maximum of an updated floating point range of the activation function of each layer of the updated MLM. The delta-min is determined by subtracting an original minimum of an original floating point range of the activation function of each layer of the original MLM from an updated minimum of an updated floating point range of the activation function of each layer of the updated MLM. The difference between the original MLM and the updated MLM is determined by determining an updated integer zero point by adding an offset to an original integer zero point of the original MLM, wherein the offset is determined by subtracting an updated minimum of an updated floating point range of the activation function of each layer of the updated MLM divided by an updated scale, from an original minimum of an original floating point range of the activation function of each layer of the original MLM divided by an original scale of the original MLM. The original integer zero point is determined by subtracting the original minimum of the original floating point range of the activation function of each layer of the original MLM, divided by the original scale, from an integer minimum of an integer range used for quantizing the respective activation function of the quantized MLM. The original scale is determined by subtracting the original minimum of the original floating point range from the original maximum of the original floating point range, divided by the integer minimum of the integer range subtracted from an integer maximum of the integer range. The plurality of weights are quantized from the at least one updated layer of the updated MLM with the original MLM, the quantized MLM, the updated MLM and a representative dataset collected on an Edge device, wherein the representative dataset is a subset of a dataset used to train the updated MLM. The quantized MLM further comprises quantizing the plurality of weights from each one of a plurality of tensor inputs of the original MLM. Generating the quantized MLM further comprises quantizing the plurality of weights from each one of a plurality of channels of the original MLM.
- In another embodiment, a method for partial quantization to achieve full quantized model comprises extracting a plurality of frozen quantized weights from each frozen layer of a quantized Machine Learning Model (MLM), wherein the quantized MLM is generated by quantizing an original MLM. A plurality of updated weights are quantized from an updated MLM to generate a plurality of updated quantized weights, wherein the updated MLM is generated by updating at least one layer of the original MLM to form at least one updated layer. A respective activation function of each updated layer from the updated MLM is quantized from a difference between the original MLM and the updated MLM to generate a quantized activation function, wherein the difference comprises an updated scale and an updated integer zero point. A new quantized MLM is generated from the frozen quantized weights, the updated quantized weights and the respective quantized activation function of each updated layer.
- Alternative embodiments of the method for partial quantization to achieve full quantized model include one of the following features, or any combination thereof. The updated scale is determined by adding a ratio to an original scale of the original MLM, wherein the ratio is determined by a delta-difference between a delta-max and a delta-min, divided by one less than an integer maximum of an integer range used for quantizing the respective quantized activation function of the updated MLM. The delta-max is determined by subtracting an original maximum of an original floating point range of the activation function of each layer of the original MLM from an updated maximum of an updated floating point range of the activation function of each layer of the updated MLM and determining the delta-min by subtracting an original minimum of the original floating point range of the activation function of each layer of the original MLM from an updated minimum of the updated floating point range of the activation function of each layer of the updated MLM. The updated integer zero point is determined by adding an offset to an original integer zero point of the original MLM, wherein the offset is determined by subtracting from an original minimum of an original floating point range of the activation function of each layer of the original MLM divided by an original scale of the original MLM, an updated minimum of an updated floating point range of the activation function of updated layer of the updated MLM divided by an updated scale. The original integer zero point is determined by subtracting from an integer minimum of an integer range used for quantizing the respective activation function of the quantized MLM, a quantized minimum of a quantized floating point range of the activation function of each layer of the quantized MLM, divided by the original scale, and the original scale is determined by subtracting the original minimum of the original floating point range from the original maximum of the original floating point range, divided by the integer minimum of the integer range subtracted from an integer maximum of the integer range.
- In another embodiment, a method for partial quantization to achieve full quantized model comprises extracting a plurality of frozen quantized weights and quantized activations parameters from each frozen layer of a quantized model, wherein the quantized model is generated by quantizing an original model. A plurality of updated weights are quantized from an updated model to generate a plurality of updated quantized weights, wherein the updated model is generated by updating at least one layer of the original model to form at least one updated layer. A respective activation function of each updated layer is quantized from the updated model from a difference between the original model and the updated model to generate a quantized activation function, wherein the difference comprises an updated scale and an updated integer zero point, the updated scale modifying an original scale used to quantize the original model with a ratio proportional to a floating point range difference between the updated model and the original model, and the updated integer zero point modifying an original integer zero point used to quantize the original model with a scaled difference between the floating point range minimums of the updated model and the original model. A new quantized model is generated from the frozen quantized weights and quantized activations parameters, the updated quantized weights and the respective quantized activation function of each updated layer.
- Alternative embodiments of the method for partial quantization to achieve full quantized model include one of the following features, or any combination thereof. The updated model is trained with a new dataset stored on an Edge device. A plurality of weights and a respective activation function from each of a plurality of respective layers of the original model are quantized with a representative dataset, wherein the representative dataset is a subset of the new dataset. The updated model adapts to the new dataset through transfer learning. Inferencing is performed with a neural processing unit and the new quantized model to perform transfer learning on an Edge device.
- Although the invention is described herein with reference to specific embodiments, various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. Any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.
- Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements.
Claims (20)
1. A method for partial quantization to achieve full quantized model comprising:
quantizing a plurality of weights and a respective activation function from each of a plurality of respective layers of an original Machine Learning Model (MLM) to generate a quantized MLM comprising a plurality of frozen quantized weights, each layer comprising a summation of a plurality of inputs, each input multiplied by a respective one of the plurality of weights, and the summation gated by the respective activation function;
extracting the plurality of frozen quantized weights of at least one frozen layer from the layers of the quantized MLM;
quantizing the plurality of weights from at least one updated layer of an updated MLM to generate a plurality of updated quantized weights, wherein the at least one updated layer of the updated MLM is generated by updating the respective layer of the original MLM;
quantizing the respective activation function of the at least one updated layer of the updated MLM from a difference between the original MLM and the updated MLM, to generate a respective quantized activation function; and
generating a new quantized MLM from the frozen quantized weights, the updated quantized weights and the respective quantized activation function.
2. The method of claim 1 further comprising determining the difference between the original MLM and the updated MLM by determining an updated scale comprising adding a ratio to an original scale of the original MLM, wherein the ratio is determined by a delta-difference between a delta-max and a delta-min, divided by one less than an integer maximum of an integer range used for quantizing the respective activation function of the updated MLM.
3. The method of claim 2 further comprising determining the delta-max by subtracting an original maximum of an original floating point range of the activation function of each layer of the original MLM from an updated maximum of an updated floating point range of the activation function of each layer of the updated MLM.
4. The method of claim 2 further comprising determining the delta-min by subtracting an original minimum of an original floating point range of the activation function of each layer of the original MLM from an updated minimum of an updated floating point range of the activation function of each layer of the updated MLM.
5. The method of claim 1 further comprising determining the difference between the original MLM and the updated MLM by determining an updated integer zero point by adding an offset to an original integer zero point of the original MLM, wherein the offset is determined by subtracting an updated minimum of an updated floating point range of the activation function of each layer of the updated MLM divided by an updated scale, from an original minimum of an original floating point range of the activation function of each layer of the original MLM divided by an original scale of the original MLM.
6. The method of claim 5 wherein the original integer zero point is determined by subtracting the original minimum of the original floating point range of the activation function of each layer of the original MLM, divided by the original scale, from an integer minimum of an integer range used for quantizing the respective activation function of the quantized MLM.
7. The method of claim 6 wherein the original scale is determined by subtracting the original minimum of the original floating point range from the original maximum of the original floating point range, divided by the integer minimum of the integer range subtracted from an integer maximum of the integer range.
8. The method of claim 1 further comprising quantizing the plurality of weights from the at least one updated layer of the updated MLM with the original MLM, the quantized MLM, the updated MLM and a representative dataset collected on an Edge device, wherein the representative dataset is a subset of a dataset used to train the updated MLM.
9. The method of claim 1 wherein generating the quantized MLM further comprises quantizing the plurality of weights from each one of a plurality of tensor inputs of the original MLM.
10. The method of claim 1 wherein generating the quantized MLM further comprises quantizing the plurality of weights from each one of a plurality of channels of the original MLM.
11. A method for partial quantization to achieve full quantized model comprising:
extracting a plurality of frozen quantized weights from each frozen layer of a quantized Machine Learning Model (MLM), wherein the quantized MLM is generated by quantizing an original MLM;
quantizing a plurality of updated weights from an updated MLM to generate a plurality of updated quantized weights, wherein the updated MLM is generated by updating at least one layer of the original MLM to form at least one updated layer;
quantizing a respective activation function of each updated layer from the updated MLM from a difference between the original MLM and the updated MLM to generate a quantized activation function, wherein the difference comprises an updated scale and an updated integer zero point; and
generating a new quantized MLM from the frozen quantized weights, the updated quantized weights and the respective quantized activation function of each updated layer.
12. The method of claim 11 wherein the updated scale is determined by adding a ratio to an original scale of the original MLM, wherein the ratio is determined by a delta-difference between a delta-max and a delta-min, divided by one less than an integer maximum of an integer range used for quantizing the respective quantized activation function of the updated MLM.
13. The method of claim 12 further comprising determining the delta-max by subtracting an original maximum of an original floating point range of the activation function of each layer of the original MLM from an updated maximum of an updated floating point range of the activation function of each layer of the updated MLM and determining the delta-min by subtracting an original minimum of the original floating point range of the activation function of each layer of the original MLM from an updated minimum of the updated floating point range of the activation function of each layer of the updated MLM.
14. The method of claim 11 wherein the updated integer zero point is determined by adding an offset to an original integer zero point of the original MLM, wherein the offset is determined by subtracting from an original minimum of an original floating point range of the activation function of each layer of the original MLM divided by an original scale of the original MLM, an updated minimum of an updated floating point range of the activation function of each layer of the updated MLM divided by an updated scale.
15. The method of claim 14 wherein the original integer zero point is determined by subtracting from an integer minimum of an integer range used for quantizing the respective activation function of the quantized MLM, a quantized minimum of a quantized floating point range of the activation function of each layer of the quantized MLM, divided by the original scale, and the original scale is determined by subtracting the original minimum of the original floating point range from the original maximum of the original floating point range, divided by the integer minimum of the integer range subtracted from an integer maximum of the integer range.
16. A method for partial quantization to achieve full quantized model comprising:
extracting a plurality of frozen quantized weights from each frozen layer of a quantized model, wherein the quantized model is generated by quantizing an original model;
quantizing a plurality of updated weights from an updated model to generate a plurality of updated quantized weights, wherein the updated model is generated by updating at least one layer of the original model to form at least one updated layer;
quantizing a respective activation function of each updated layer from the updated model from a difference between the original model and the updated model to generate a quantized activation function, wherein the difference comprises an updated scale and an updated integer zero point, the updated scale modifying an original scale used to quantize the original model with a ratio proportional to a floating point range difference between the updated model and the original model, and the updated integer zero point modifying an original integer zero point used to quantize the original model with a scaled difference between the floating point range minimums of the updated model and the original model; and
generating a new quantized model from the frozen quantized weights, the updated quantized weights and the respective quantized activation function of each updated layer.
17. The method of claim 16 further comprising training the updated model with a new dataset stored on an Edge device.
18. The method of claim 17 further comprising quantizing a plurality of weights and a respective activation function from each of a plurality of respective layers of the original model with a representative dataset, wherein the representative dataset is a subset of the new dataset.
19. The method of claim 16 wherein the updated model adapts to the new dataset through transfer learning.
20. The method of claim 16 further comprising inferencing with a neural processing unit and the new quantized model to perform transfer learning on an Edge device.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310532613.9 | 2023-05-11 | ||
| CN202310532613.9A CN118940792A (en) | 2023-05-11 | 2023-05-11 | Partial quantization to enable fully quantized models on edge devices |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| US20240378436A1 (en) | 2024-11-14 |
Family
ID=93357260
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/479,875 Pending US20240378436A1 (en) | 2023-05-11 | 2023-10-03 | Partial Quantization To Achieve Full Quantized Model On Edge Device |
Country Status (2)
| Country | Link |
|---|---|
| US (1) | US20240378436A1 (en) |
| CN (1) | CN118940792A (en) |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118940792A (en) | 2024-11-12 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12073309B2 (en) | Neural network device and method of quantizing parameters of neural network | |
| US11663483B2 (en) | Latent space and text-based generative adversarial networks (LATEXT-GANs) for text generation | |
| US12475355B2 (en) | Quantizing trained long short-term memory neural networks | |
| US11423282B2 (en) | Autoencoder-based generative adversarial networks for text generation | |
| US11373087B2 (en) | Method and apparatus for generating fixed-point type neural network | |
| CN111461322B (en) | Deep neural network model compression method | |
| US20220383126A1 (en) | Low-Rank Adaptation of Neural Network Models | |
| EP3629250A1 (en) | Parameter-efficient multi-task and transfer learning | |
| KR20190034985A (en) | Method and apparatus of artificial neural network quantization | |
| CN110622178A (en) | Learning neural network structure | |
| JP2019032808A (en) | Mechanical learning method and device | |
| US20220245428A1 (en) | Machine-Learned Attention Models Featuring Omnidirectional Processing | |
| Tchendjou et al. | Fuzzy logic based objective image quality assessment with FPGA implementation | |
| KR20220157619A (en) | Method and apparatus for calculating nonlinear functions in hardware accelerators | |
| CN112561050A (en) | Neural network model training method and device | |
| US10410140B1 (en) | Categorical to numeric conversion of features for machine learning models | |
| WO2019106132A1 (en) | Gated linear networks | |
| WO2022153711A1 (en) | Training apparatus, classification apparatus, training method, classification method, and program | |
| US20250166236A1 (en) | Segmentation free guidance in diffusion models | |
| CN111783936A (en) | Convolutional neural network construction method, device, equipment and medium | |
| US20240378436A1 (en) | Partial Quantization To Achieve Full Quantized Model On Edge Device | |
| CN114021697A (en) | Reinforcement learning-based neural network generation method and system for end-cloud framework | |
| US20240202501A1 (en) | System and method for mathematical modeling of hardware quantization process | |
| US20250139516A1 (en) | Accuracy-preserving deep model compression | |
| CN116796826A (en) | Pulsar search network training method, pulsar search device and pulsar search equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: NXP USA, INC., TEXAS; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; Assignors: BAJAJ, MANISH KUMAR; JIAO, BIN; Reel/Frame: 065103/0579; Effective date: 20230417 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |