
WO2025024035A1 - Calibrating a quantized machine-learning model - Google Patents

Calibrating a quantized machine-learning model

Info

Publication number
WO2025024035A1
Authority
WO
WIPO (PCT)
Prior art keywords
learning model
machine
quantized
normalization function
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/030395
Other languages
French (fr)
Inventor
Nilesh Prasad PANDEY
Marios FOURNARAKIS
Markus Nagel
Chirag Sureshbhai Patel
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qualcomm Inc
Original Assignee
Qualcomm Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qualcomm Inc filed Critical Qualcomm Inc
Publication of WO2025024035A1


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0475 Generative networks
    • G06N 3/048 Activation functions
    • G06N 3/0495 Quantised networks; Sparse networks; Compressed networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Definitions

  • the present disclosure generally relates to quantized machine-learning models.
  • aspects of the present disclosure include systems and techniques for calibrating quantized machine-learning models.
  • Machine-learning models can be trained to perform operations. For example, during training, a machine-learning model may be provided with a training data set and be allowed to perform operations based on the training data set. The model may also be given conditions describing successful performance of the operations.
  • the model may iteratively adapt internal parameters (e.g., weights) of the model to minimize a difference (e.g., an “error”) between the model’s performance of the operations and the described successful performance.
  • After being trained (at a phase of operation that may be referred to as “inference”), the model may perform the operations based on the parameters (e.g., the weights) that were adapted through the training to perform the operations.
  • Generative machine-learning models (e.g., stable diffusion models and language transformers) are examples of such machine-learning models.
  • a neural network is generally a collection of nodes connected by weighted edges that transmit data signals between the nodes.
  • the nodes are commonly organized into layers, with one layer’s output (e.g., a lower layer) feeding the next layer’s input (e.g., a higher layer).
  • Different layers may generally be configured to perform different types of transformations on their inputs, such as convolutional transformations.
  • a node in a layer may receive one or more data signals from inbound edges connected to other nodes, and those data signals may be adjusted by the edge weights. Further, a node may have a bias as an independent input data signal.
  • the node processes all of the input data signals, for example, with a linear or non-linear function and then “activates” based on the output of the function. In some cases, the activation of a node may cause the transmission of further data signals to further connected nodes.
  • the output of the neural network is often referred to as an inference, and can take many forms, such as a numerical output, a classification output, and others.
  • Training a neural network, often referred to as deep learning, generally involves adjusting the values of the edge weights and biases until the output of the neural network meets some task performance objective.
  • While neural networks are powerful machine-learning architectures capable of a wide range of useful tasks, such as recognizing objects in image data, they are likewise highly resource-dependent. For example, neural networks may require significant compute, memory, power, and/or time resources for training and/or for inferencing. These resource requirements may significantly limit the ability to train and deploy neural networks to certain types of devices and for certain use cases.
  • the method includes: obtaining the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determining a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determining a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.
  • an apparatus for calibrating a quantized machine-learning model includes at least one memory and at least one processor (e.g., configured in circuitry) coupled to the at least one memory.
  • the at least one processor is configured to: obtain the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determine a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determine a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.
  • a non-transitory computer-readable medium that has stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: obtain the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determine a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determine a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.
  • an apparatus for calibrating a quantized machine-learning model is provided.
  • the apparatus includes: means for obtaining the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; means for determining a bias associated with at least one output of a normalization function of the quantized machine-learning model; and means for determining a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.
  • Systems and techniques are described for normalizing a vector in a quantized machine-learning model. According to at least one example, a method is provided for normalizing a vector in a quantized machine-learning model.
  • an apparatus for normalizing a vector in a quantized machine-learning model includes at least one memory and at least one processor (e.g., configured in circuitry) coupled to the at least one memory.
  • the at least one processor is configured to: normalize an input vector using a normalization function to generate an intermediate vector; and combine a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model.
  • a non-transitory computer-readable medium has stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: normalize an input vector using a normalization function to generate an intermediate vector; and combine a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model.
  • an apparatus for normalizing a vector in a quantized machine-learning model includes: means for normalizing an input vector using a normalization function to generate an intermediate vector; and means for combining a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model.
  • one or more of the apparatuses described herein is, can be part of, or can include a mobile device (e.g., a mobile telephone or so-called “smart phone”, a tablet computer, or other type of mobile device), an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a vehicle (or a computing device or system of a vehicle), a smart or connected device (e.g., an Internet-of-Things (IoT) device), a wearable device, a personal computer, a laptop computer, a video server, a television (e.g., a network-connected television), a robotics device or system, or other device.
  • each apparatus can include an image sensor (e.g., a camera) or multiple image sensors (e.g., multiple cameras) for capturing one or more images.
  • each apparatus can include one or more displays for displaying one or more images, notifications, and/or other displayable data.
  • each apparatus can include one or more speakers, one or more light-emitting devices, and/or one or more microphones.
  • each apparatus can include one or more sensors.
  • the one or more sensors can be used for determining a location of the apparatuses, a state of the apparatuses (e.g., a tracking state, an operating state, a temperature, a humidity level, and/or other state), and/or for other purposes.
  • FIG. 1 illustrates an example implementation of a system, which may include a central processing unit (CPU), configured to perform one or more of the functions described herein;
  • FIG. 2 is a block diagram illustrating an example machine-learning model, according to various aspects of the present disclosure;
  • FIG. 3 provides two sets of images that show the forward diffusion process (which is fixed) and the reverse diffusion process (which is learned) of a diffusion model;
  • FIG.4 is a diagram illustrating how diffusion data is distributed from initial data to noise using a diffusion model in the forward diffusion direction, in accordance with some aspects of the present disclosure;
  • FIG.5 is a diagram illustrating a U-Net architecture for a diffusion model, in accordance with some aspects of the present disclosure;
  • FIG.6 is a diagram illustrating an example attention block, according to various aspects of the present disclosure;
  • FIG. 7 is a diagram illustrating an example system for calibrating a quantized machine-learning model, according to various aspects of the present disclosure;
  • FIG. 8A is a diagram illustrating a system including a normalization function for normalizing values
  • FIG. 8B is a diagram illustrating a system including a normalization function for normalizing values
  • FIG. 8C is a diagram illustrating a system including a normalization function for normalizing values
  • FIG. 9 is a diagram illustrating a tensor that may be input into a calibration function, according to various aspects of the present disclosure
  • FIG. 10 is a diagram illustrating the tensor of FIG. 9 with example formats of correction factors that may be applied when normalizing the tensor, according to various aspects of the present disclosure
  • FIG. 11 is a flow diagram illustrating an example process for calibrating a quantized machine-learning model, in accordance with aspects of the present disclosure
  • FIG. 12 is a flow diagram illustrating an example process for normalizing a vector in a quantized machine-learning model, in accordance with aspects of the present disclosure
  • FIG. 13 is a block diagram illustrating an example of a deep learning neural network that can be used to implement a perception module and/or one or more validation modules, according to some aspects of the disclosed technology
  • FIG. 14 is a block diagram illustrating an example of a convolutional neural network (CNN), according to various aspects of the present disclosure
  • FIG. 15 is a block diagram illustrating an example computing-device architecture of an example computing device which can implement the various techniques described herein.
  • DETAILED DESCRIPTION Certain aspects of this disclosure are provided below. Some of these aspects may be applied independently and some of them may be applied in combination, as would be apparent to those of skill in the art. In the following description, for the purposes of explanation, specific details are set forth in order to provide a thorough understanding of aspects of the application. However, it will be apparent that various aspects may be practiced without these specific details. The figures and description are not intended to be restrictive. The ensuing description provides example aspects only, and is not intended to limit the scope, applicability, or configuration of the disclosure.
  • Training of machine-learning models may be a computationally intensive process that may take a relatively long time, a large quantity of training data, and many operations. For example, it may take hundreds of hours to train a machine-learning model.
  • Quantization is a method of mapping continuous values to a smaller set of discrete finite values. For example, quantization approximates real-world values (e.g., floating point values) with representative values (e.g., integer values) that limit the precision and range of the original input.
  • quantization may significantly reduce resource usage for both training and inferencing.
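
For concreteness, here is a minimal sketch of uniform affine quantization from floating point onto an 8-bit integer grid; the function names and example values are illustrative assumptions, not taken from the disclosure.

```python
import numpy as np

def affine_quantize(x, n_bits=8):
    # Map floating-point values onto a uniform grid of 2**n_bits integer levels.
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(-x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover an approximation of the original values; precision is lost.
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.7, 0.0, 0.31, 2.5], dtype=np.float32)
q, scale, zero_point = affine_quantize(x)
x_hat = dequantize(q, scale, zero_point)  # close to x, but snapped to the int8 grid
```
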
  • Quantization of trained machine-learning models allows quantized trained machine-learning models to be efficiently deployed on various devices. Quantization may allow a quantized trained machine-learning model to perform operations (e.g., at the inference phase of operation) on a device relatively quickly. Quantization may include changing a data format of the machine-learning model.
  • Post-Training Quantization is a process by which a trained machine-learning model is quantized.
  • a machine-learning model may store parameters (e.g., weights) according to a first format that may allow a high degree of precision.
  • the model may store parameters as floating-point numbers, such as 16-bit floating point numbers, which may be referred to as float16 or FP16, or 32-bit floating point numbers, which may be referred to as float32 or FP32.
  • after training, before a model is deployed onto a device for use by the device, the model may be quantized by, in part, changing the format used to store the parameters of the model from the first format to a second format.
  • the second format may use less memory and/or be less computationally expensive to use.
  • the model may store parameters as integer numbers such as 16-bit integer numbers, which may be referred to as Int16, or 8-bit integer numbers, which may be referred to as Int8. It may be less computationally expensive to store and/or operate using integers than floating point numbers.
  • a device may conserve power and/or processing time when using a quantized trained machine-learning model as compared with using an unquantized trained machine-learning model.
  • Quantization techniques include fixed-precision quantization (FPQ) and mixed-precision quantization (MPQ).
  • MPQ can achieve higher accuracy for the same computational budget because it allocates higher precision to elements (e.g., layers) that are more sensitive to quantization and reduces bitwidth for elements more robust to quantization.
  • the term “quantize,” “quantizing,” and like terms may be used as a verb and may be applied to machine-learning models, and/or layers of a trained machine-learning model.
  • the term “quantize,” “quantizing,” and like terms may refer to quantizing parameters (e.g., weights and/or activations) as stored in the respective machine-learning models and/or layers.
  • a trained machine-learning model may include weights (e.g., numerical values).
  • the weights may be stored in the trained machine-learning model in a first format (e.g., float16). Quantizing the trained machine-learning model may include changing the weights from the first format to a second format (e.g., Int8).
  • the machine-learning model may include activations. Activations may refer to values used as inputs and/or outputs of various functions, operations, and/or layers of the machine-learning model. The activations may have a format. For example, a function may expect to receive values formatted according to a certain format. For instance, the function may read the values from memory according to the certain format.
  • the function may output other values (e.g., processed by the function) according to a particular format (which may or may not be the same as the certain format). For instance, the function may write the other values to memory in the particular format.
  • an unquantized trained machine-learning model may use FP16 to pass values between functions (e.g., outputting values from one function and reading the values by another function). After quantization, the quantized trained machine-learning model may use Int8 to pass the values between functions.
  • the term “quantize,” “quantizing,” and like terms may refer to changing how activations are stored and/or used in a quantized machine-learning model. Quantizing numbers can result in a loss of precision. Quantizing a trained machine-learning model can result in degradation of the model.
  • the term “noise” is used to describe a degree of difference between outputs of an unquantized trained machine-learning model and the trained machine-learning model after being quantized (given the same inputs).
  • a trained machine-learning model may be quantized.
  • the unquantized trained machine-learning model may be provided an input
  • the quantized trained machine-learning model may be provided the same input.
  • Outputs of the models may be compared.
  • a degree of difference between the outputs may be described as noise.
  • Noise may be a way to describe degradation of a quantized machine-learning model.
  • Quantizing normalization functions of a trained machine-learning model may result in degradation of the model (e.g., as evidenced by noise).
  • a normalization function may generate a probability distribution including a number of outputs that sum to 1. The probability distribution may represent the probability of each of outputs.
  • normalization functions include: SoftMax, batch normalization (BatchNorm), layer normalization (LayerNorm), group normalization (GroupNorm), and instance normalization (InstanceNorm). Quantizing a normalization function of a trained machine-learning model may cause the quantized trained machine-learning model to exhibit more noise than quantizing other functions, operations, and/or layers of the trained machine-learning model.
  • quantizing a SoftMax operation (e.g., quantizing the activations, such as the output, of the SoftMax function) of a trained machine-learning model may result in more noise than quantizing inputs to the SoftMax function, queries of a transformer including the SoftMax function, keys of the transformer, and/or values of the transformer.
  • small values (e.g., small relative to the resolution of the output data format and/or characteristics of the output distribution) output by a quantized normalization function may be rounded to zero.
  • if the SoftMax function is quantized, many of those small outputs may be rounded to 0. For example, if the outputs of a SoftMax function are stored as int8 in 255 steps between 0 and 1, values less than 0.001953 may be rounded to zero.
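
A short illustration of this rounding-to-zero effect, assuming round-to-nearest quantization onto 255 steps of 1/255 between 0 and 1 (the input logits are made up for the example):

```python
import numpy as np

def softmax(x):
    # Numerically stable SoftMax; outputs sum to 1 in full precision.
    e = np.exp(x - x.max())
    return e / e.sum()

step = 1.0 / 255  # int8-style grid: 255 steps between 0 and 1
probs = softmax(np.array([-4.0, 0.0, 3.0, 5.0]))
quantized = np.round(probs / step) * step

print(probs[0])      # ~0.000108: small but nonzero probability
print(quantized[0])  # 0.0: below half a step (~0.00196), so it rounds to zero
```
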
  • Systems, apparatuses, methods (also referred to as processes), and computer-readable media are described herein for calibrating a quantized machine learning model.
  • the systems and techniques described herein may obtain a quantized machine-learning model.
  • the systems and techniques may obtain a machine-learning model and quantize the machine-learning model.
  • the systems and techniques may obtain the quantized machine-learning model, which may have been quantized by another system or technique.
  • the unquantized machine-learning model may be associated with a first data format.
  • the unquantized machine-learning model may include and/or store parameters (e.g., weights and activations) according to the first data format.
  • the quantized machine-learning model may be associated with a second data format.
  • the quantized machine-learning model may include and/or store parameters (e.g., weights and activations) according to the second data format.
  • the second data format may be smaller and/or less computationally expensive to use than the first data format. For example, the first data format may be a floating-point data format (e.g., FP16 or FP32) and the second data format may be an integer data format (e.g., int8 or int16).
  • the systems and techniques may determine a bias associated with an output of a normalization function of the quantized machine-learning model. For example, the systems and techniques may provide an input vector (e.g., of a calibration data set) to the normalization function. The normalization function may generate an output (e.g., an output vector). The systems and techniques may compare a sum of values of the output vector to an expected output sum. For example, the normalization function may be a SoftMax function. The expected output sum may be 1.
  • the systems and techniques may determine a bias of the normalization function of the quantized machine-learning model based on the difference between the sum of the values of the output vector and the expected output sum.
  • the systems and techniques may provide a number of input vectors (e.g., from the calibration data set) to the normalization function and determine a number of differences.
  • the systems and techniques may use a statistical technique (e.g., an average) to determine the bias based on the number of differences.
  • the systems and techniques may determine a correction factor based on the bias. For example, the systems and techniques may determine that the bias is the amount by which output vectors of the normalization function should be adjusted to correct the bias.
  • the systems and techniques may determine to spread the adjustment to correct the bias across the vector. For example, the systems and techniques may determine the correction factor to be the bias divided by the number of values in the output vector.
  • the systems and techniques may modify the normalization function of the quantized machine-learning model such that the normalization function applies the correction factor to output vectors (e.g., by adding the correction factor to each value of each output vector generated by the normalization function at inference).
  • in this way, the systems and techniques may calibrate the quantized machine-learning model. Further, the systems and techniques may specifically calibrate the normalization function of the quantized machine-learning model to correct bias added by the quantization of the normalization function. A sketch of this calibration flow is shown below.
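
The flow just described might look like the following minimal Python sketch. The quantizer, the helper names (quantized_softmax, calibrate_bias, corrected_softmax), and the randomly generated calibration set are illustrative assumptions, not the disclosure's implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable SoftMax; in full precision its outputs sum to 1.
    e = np.exp(x - x.max())
    return e / e.sum()

def quantized_softmax(x, step=1.0 / 255):
    # Stand-in for the quantized normalization function: SoftMax outputs
    # rounded onto an int8-style grid of 255 steps between 0 and 1.
    return np.round(softmax(x) / step) * step

def calibrate_bias(calibration_vectors, expected_sum=1.0):
    # Average shortfall between each quantized output sum and the expected
    # output sum (1.0 for SoftMax) over the calibration data set.
    diffs = [expected_sum - quantized_softmax(v).sum() for v in calibration_vectors]
    return float(np.mean(diffs))

def corrected_softmax(x, bias):
    # Spread the correction across the vector: add bias / N to each value.
    y = quantized_softmax(x)
    return y + bias / y.size

rng = np.random.default_rng(0)
calibration_set = [rng.normal(size=16) for _ in range(256)]  # made-up data
bias = calibrate_bias(calibration_set)
```

At inference, corrected_softmax plays the role of the modified normalization function: the bias is measured once at calibration time, so the per-vector correction costs only one add per element.
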
  • FIG. 1 illustrates an example implementation of a system 100, which may include a central processing unit (CPU 102) (which may be a multi-core CPU), configured to perform one or more of the functions described herein.
  • Parameters or variables (e.g., neural signals and synaptic weights), system parameters associated with a computational device (e.g., a neural network with weights), and task information may be stored in a memory block associated with a neural processing unit (NPU 108), in a memory block associated with a CPU 102, in a memory block associated with a graphics processing unit (GPU 104), in a memory block associated with a digital signal processor (DSP 106), in a memory 116, and/or may be distributed across multiple blocks. Instructions executed at the CPU 102 may be loaded from a program memory associated with the CPU 102 or may be loaded from memory 116.
  • the system 100 may also include additional processing blocks tailored to specific functions, such as the GPU 104, the DSP 106, a connectivity engine 118, which may include fifth generation (5G) connectivity, fourth generation long term evolution (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, and the like, and a multimedia processor 112 that may, for example, detect and recognize gestures.
  • the NPU is implemented in the CPU 102, the DSP 106, and/or the GPU 104.
  • the system 100 may also include one or more sensor processor(s) 114, one or more image signal processors (ISP(s) 110), and/or navigation engine 120, which may include a global positioning system.
  • the sensor processor(s) 114 can be associated with or connected to one or more sensors for providing sensor input(s) to the sensor processor(s) 114.
  • the one or more sensors and sensor processor(s) 114 can be provided in, coupled to, or otherwise associated with a same computing device.
  • the system 100 may be implemented as a system on a chip (SoC).
  • the system 100 may be based on an Advanced Reduced Instruction Set Computer (RISC) Machine (ARM) instruction set.
  • the system 100 and/or components thereof may be configured to perform machine learning techniques according to aspects of the present disclosure discussed herein.
  • the system 100 and/or components thereof may be configured to implement a machine-learning model (e.g., a quantized trained machine-learning model) as described herein and/or according to aspects of the present disclosure.
  • Machine learning can be considered a subset of artificial intelligence (AI).
  • ML systems can include algorithms and statistical models that computer systems can use to perform various tasks by relying on patterns and inference, without the use of explicit instructions.
  • one example of an ML system is a neural network (also referred to as an artificial neural network), which may include an interconnected group of artificial neurons (e.g., neuron models).
  • Neural networks may be used for various applications and/or devices, such as image and/or video coding, image analysis and/or computer vision applications, Internet Protocol (IP) cameras, Internet of Things (IoT) devices, autonomous vehicles, service robots, among others.
  • Individual nodes in a neural network may emulate biological neurons by taking input data and performing simple operations on the data. The results of the simple operations performed on the input data are selectively passed on to other neurons. Weight values are associated with each vector and node in the network, and these values constrain how input data is related to output data. For example, the input data of each node may be multiplied by a corresponding weight value, and the products may be summed.
  • the sum of the products may be adjusted by an optional bias, and an activation function may be applied to the result, yielding the node’s output signal or “output activation” (sometimes referred to as a feature map or an activation map).
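
As a minimal sketch of the node computation just described, with ReLU as one illustrative choice of activation function:

```python
import numpy as np

def node_output(inputs, weights, bias):
    # Multiply each input by its weight, sum the products, add the bias,
    # then apply an activation function (ReLU here) to yield the activation.
    z = np.dot(inputs, weights) + bias
    return np.maximum(z, 0.0)

activation = node_output(np.array([0.5, -1.2, 2.0]),
                         np.array([0.8, 0.1, -0.4]),
                         bias=0.2)
```
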
  • the weight values may initially be determined by an iterative flow of training data through the network (e.g., weight values are established during a training phase in which the network learns how to identify particular classes by their typical input data characteristics).
  • Different types of neural networks exist, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), multilayer perceptron (MLP) neural networks, transformer neural networks, diffusion-based neural networks, among others.
  • Convolutional neural networks may include collections of artificial neurons that each have a receptive field (e.g., a spatially localized region of an input space) and that collectively tile an input space.
  • RNNs work on the principle of saving the output of a layer and feeding this output back to the input to help in predicting an outcome of the layer.
  • a GAN is a form of generative neural network that can learn patterns in input data so that the neural network model can generate new synthetic outputs that reasonably could have been from the original dataset.
  • a GAN can include two neural networks that operate together, including a generative neural network that generates a synthesized output and a discriminative neural network that evaluates the output for authenticity.
  • in MLP neural networks, data may be fed into an input layer, and one or more hidden layers provide levels of abstraction to the data. Predictions may then be made on an output layer based on the abstracted data.
  • Deep learning (DL) is one example of a machine learning technique and can be considered a subset of ML. Many DL approaches are based on a neural network, such as an RNN or a CNN, and utilize multiple layers. The use of multiple layers in deep neural networks can permit progressively higher-level features to be extracted from a given input of raw data.
  • the output of a first layer of artificial neurons becomes an input to a second layer of artificial neurons
  • the output of a second layer of artificial neurons becomes an input to a third layer of artificial neurons
  • Layers that are located between the input and output of the overall deep neural network are often referred to as hidden layers.
  • the hidden layers learn (e.g., are trained) to transform an intermediate input from a preceding layer into a slightly more abstract and composite representation that can be provided to a subsequent layer, until a final or desired representation is obtained as the final output of the deep neural network.
  • a neural network is an example of a machine learning system, and can include an input layer, one or more hidden layers, and an output layer.
  • Deep learning networks typically include multiple hidden layers. Each layer of the neural network can include feature maps or activation maps that can include artificial neurons (or nodes). A feature map can include a filter, a kernel, or the like. The nodes can include one or more weights used to indicate an importance of the nodes of one or more of the layers. In some cases, a deep learning network can have a series of many hidden layers, with early layers being used to determine simple and low-level characteristics of an input, and later layers building up a hierarchy of more complex and abstract characteristics. A deep learning architecture may learn a hierarchy of features.
  • the first layer may learn to recognize relatively simple features, such as edges, in the input stream.
  • if presented with auditory data, the first layer may learn to recognize spectral power in specific frequencies.
  • the second layer, taking the output of the first layer as input, may learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data. For instance, higher layers may learn to represent complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases.
  • Deep learning architectures may perform especially well when applied to problems that have a natural hierarchical structure. For example, the classification of motorized vehicles may benefit from first learning to recognize wheels, windshields, and other features.
  • Neural networks may be designed with a variety of connectivity patterns. In feed- forward networks, information is passed from lower to higher layers, with each neuron in a given layer communicating to neurons in higher layers. A hierarchical representation may be built up in successive layers of a feed-forward network, as described above. Neural networks may also have recurrent or feedback (also called top-down) connections. In a recurrent connection, the output from a neuron in a given layer may be communicated to another neuron in the same layer. A recurrent architecture may be helpful in recognizing patterns that span more than one of the input data chunks that are delivered to the neural network in a sequence.
  • FIG. 2 is a block diagram illustrating an example machine-learning model 200, according to various aspects of the present disclosure.
  • the machine-learning model 200 may be a diffusion model.
  • the machine-learning model 200 may generate an output image based on a user prompt. For example, a textual user prompt (e.g., “a person riding a horse”) may be provided to the machine-learning model 200.
  • the machine-learning model 200 may generate text embeddings based on the user prompt (e.g., using a text encoder, such as a frozen Contrastive Vision Language Pre-training (CLIP) encoder). Additionally, the machine-learning model 200 may obtain a latent seed (e.g., Gaussian noise, for example, having a mean of 0 and a variance of 1). The machine-learning model 200 may arrange values of the latent seed into a grid. The machine-learning model 200 may provide the latents and the text embeddings to a U-Net (e.g., a U-Net encoder decoder).
  • the U-Net may generate conditioned latents which may be combined, using a scheduler algorithm, with the Gaussian noise and provided as input to the U-Net.
  • the conditioned outputs may again be combined with the latents and input into the U-Net a number (N) of times.
  • the conditioned latents may be provided to a variational autoencoder decoder which may generate the output image based on the conditioned latents.
  • a diffusion model provides a general-purpose, high quality, model for performing a task (e.g., image generation, depth estimation, optical flow estimation, stereo estimation, etc.) and enables more general applications as well.
  • Diffusion models are latent-variable generative models trained to transform a sample of noise into a sample from a data distribution.
  • a diffusion model can define a Markov chain of diffusion steps to slowly add random noise (e.g., Gaussian noise) to data and then learn to reverse the diffusion process to construct desired data samples from the noise.
  • the diffusion model can successfully perform a particular task (e.g., object classification, depth estimation, etc.) when provided a conditioning image or other conditional input and random noise to perform reverse diffusion for task-specific prediction.
  • a diffusion model can be trained using a forward diffusion process (which is fixed) and a reverse diffusion process (which is learned).
  • a diffusion model can be trained to be able to perform a generative process (e.g., a denoising process).
  • One example goal of a diffusion model is to be able to denoise any arbitrary noise added to input data (e.g., a video).
  • FIG. 3 provides two sets of images 300 that show the forward diffusion process (which is fixed) and the reverse diffusion process (which is learned) of a diffusion model.
  • noise 303 is gradually added to a first set of images 302 at different time steps for a total of T time steps (e.g., making up a Markov chain), producing a sequence of noisy samples X1 through XT.
  • the noise 303 is Gaussian noise.
  • Each time step can correspond to each consecutive image of the first set of images 302 shown in FIG. 3.
  • the initial image X0 of FIG.3 is of a cat.
  • the second set of images 304 shows the reverse diffusion process in which XT is the starting point with a noisy image (e.g., one that has Gaussian noise).
  • the diffusion model can be trained to reverse the diffusion process (e.g., by training a model $p_\theta(x_{t-1} \mid x_t)$ to approximate the reverse transitions).
  • a diffusion model can be trained by finding the reverse Markov transitions that maximize the likelihood of the training data. By traversing backwards along the chain of time steps, the diffusion model can generate the new data. For example, as shown in FIG. 3, the reverse diffusion process proceeds to generate X0 as the image of the cat. In other cases, the input data and output data can vary based on the task for which the diffusion model is trained. As noted above, the diffusion model is trained to be able to denoise or recover the original image X0 in an incremental process as shown in the second set of images 304.
  • the neural network of the diffusion model can be trained to recover $x_{t-1}$ given $x_t$, such as provided in the below example equation: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\; \mu_\theta(x_t, t),\; \sigma_t^2 I\big)$
  • Sampling can be defined as $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z$, where $z \sim \mathcal{N}(0, I)$.
  • FIG. 4 is a diagram 400 illustrating how diffusion data is distributed from initial data to noise using a diffusion model in the forward diffusion direction, in accordance with some aspects. Note that the initial data $q(x_0)$ is detailed in the initial stage of the diffusion process.
  • the example of FIG. 4 illustrates the progression of the data and how it becomes diffused with noise in the forward diffusion process.
  • the diffused data distribution (e.g., as shown in FIG. 4) may be written as $q(x_t) = \int q(x_0)\, q(x_t \mid x_0)\, dx_0$.
  • the model can sample $x_t \sim q(x_t)$ by first sampling $x_0 \sim q(x_0)$ and then sampling $x_t \sim q(x_t \mid x_0)$ (which may be referred to as ancestral sampling).
  • the diffusion kernel takes the input and returns a vector or other data structure as output.
  • a training algorithm can include the following steps:
    1: repeat
    2: $x_0 \sim q(x_0)$
    3: $t \sim \mathrm{Uniform}(\{1, \dots, T\})$
    4: $\epsilon \sim \mathcal{N}(0, I)$
    5: take a gradient descent step on $\nabla_\theta \,\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t\big) \big\|^2$
    6: until converged
  • a sampling algorithm can include the following steps:
    1: $x_T \sim \mathcal{N}(0, I)$
    2: for $t = T, \dots, 1$ do
    3: $z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z = 0$
    4: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t)\big) + \sigma_t z$
    5: end for
    6: return $x_0$
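
A compact Python sketch of the sampling loop above, assuming a hypothetical trained noise predictor eps_model(x, t) and the common choice $\sigma_t^2 = \beta_t$; this is an illustration of the algorithm, not the disclosure's implementation:

```python
import numpy as np

def ddpm_sample(eps_model, betas, shape, rng):
    # Ancestral sampling: start from pure Gaussian noise x_T and apply the
    # learned reverse transition T times to produce a data sample x_0.
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    T = len(betas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        eps = eps_model(x, t)  # predicted noise at step t
        x = (x - (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bars[t - 1]) * eps) \
            / np.sqrt(alphas[t - 1])
        x = x + np.sqrt(betas[t - 1]) * z  # sigma_t * z, with sigma_t^2 = beta_t
    return x
```
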
  • FIG. 5 is a diagram illustrating a U-Net architecture 500 for a diffusion model, in accordance with some aspects.
  • the initial image 502 (e.g., of a cat) is processed by the U-Net architecture 500, which includes a series of residual network (ResNet) blocks and self-attention layers to represent the network $\epsilon_\theta(x_t, t)$.
  • the U-Net architecture 500 also includes fully connected layers 508.
  • time representation 510 can be sinusoidal positional embeddings or random Fourier features.
  • noisy output 506 from the forward diffusion process is also shown.
  • the U-Net architecture 500 includes a contracting path 504 and an expansive path 505 as shown in FIG. 5, which gives it the U-shaped architecture.
  • the contracting path 504 can be a convolutional network that includes repeated convolutional layers (that apply convolutional operations), each followed by a rectified linear unit (ReLU) and a max pooling operation.
  • FIG. 6 is a diagram illustrating an example attention block 600, according to various aspects of the present disclosure.
  • the attention block 600 may be an example of an attention layer of the architecture 500 of FIG.5 and/or of the U-Net of the machine-learning model 200 of FIG. 2.
  • the attention block 600 may combine Query (Q), Key (K), and Value (V) to generate an output.
  • the attention block 600 may perform a matrix multiplication of Q and K, scale the output of the matrix multiplication, optionally mask the output of the scaling, normalize the output of the mask (or of the scaling in cases in which the attention block 600 does not include the mask), and multiply the output of the normalization with V.
  • SoftMax is provided as an example of a normalization function. SoftMax may generate a probability distribution, which may sum to 1. For example, SoftMax may implement the equation $\mathrm{SoftMax}(x)_i = e^{x_i} / \sum_j e^{x_j}$.
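
The matmul-scale-mask-normalize-multiply sequence described above is scaled dot-product attention; the following is a minimal full-precision sketch (the function names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable SoftMax along the last axis; each row sums to 1.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # Matrix-multiply Q and K, scale by sqrt(d_k), optionally mask,
    # normalize with SoftMax, then weight V by the resulting probabilities.
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        # Positions where mask is False get a large negative score and
        # therefore receive ~zero attention weight after SoftMax.
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ V
```
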
  • a machine-learning model (e.g., a diffusion model or a language transformer) may include any number of normalization functions at a respective number of locations within the machine-learning model.
  • SoftMax of the attention block 600 is provided as one example.
  • quantizing a machine-learning model may result in degradation of the machine-learning model (e.g., as evidenced by noise).
  • the quantization of normalization functions of the machine-learning model may be a significant source of the degradation.
  • FIG. 7 is a diagram illustrating an example system for calibrating a quantized machine-learning model 708, according to various aspects of the present disclosure.
  • a system 700 may obtain the quantized machine-learning model 708.
  • the system 700 may obtain a machine-learning model 702 and quantize the machine-learning model 702 to generate the quantized machine-learning model 708.
  • the system 700 may obtain the quantized machine-learning model 708 (e.g., which may have been quantized by another system or technique). In either case, the system 700 may calibrate the quantized machine-learning model 708 to generate a calibrated quantized machine-learning model 714.
  • the machine-learning model 702 may be, or may include, a machine-learning model trained to perform one or more operations and/or to generate one or more outputs.
  • the machine-learning model 702 may be, or may include, a generative machine-learning model.
  • the machine-learning model 702 may be a diffusion model (e.g., as described with regard to FIG. 2, FIG.3, FIG. 4, and FIG.5).
  • the machine-learning model 702 may be, or may include, a language transformer.
  • the machine-learning model 702 may include one or more normalization functions (including, in some examples, a normalization function 704). The normalization function 704 may be, or may include, an activation function (e.g., to provide activations to a further layer of the machine-learning model 702).
  • the normalization function 704 may be configured to generate a probability distribution.
  • the normalization function 704 may be, as examples, a SoftMax function, a batch normalization function (e.g., BatchNorm), a layer normalization function (e.g., LayerNorm), a group normalization function (e.g., GroupNorm), or an instance normalization function (e.g., InstanceNorm).
  • the normalization function 704 may be associated with an expected output. For example, values of the normalization function 704 may be expected to sum to an expected output sum.
  • the machine-learning model 702 may be associated with a first data format.
  • the machine-learning model 702 may be associated with a high-precision data format (e.g., a floating-point data format, such as FP16 or FP32).
  • the machine-learning model 702 may include and/or store parameters according to the first data format.
  • the machine- learning model 702 may include weights.
  • the weights may be stored in the machine-learning model 702 according to the first data format (e.g., the weights may be stored as FP16 or FP32). Further, the machine-learning model 702 may include multiple layers that may have respective activations. The machine-learning model 702 may be configured to store the activations according to the first data format.
  • the normalization function 704 may be configured to output data according to the first data format. For example, the normalization function 704 may be configured to output a vector, and the normalization function 704 may be configured such that each value of the vector may be according to the first data format.
  • a quantization engine 706 may quantize the machine-learning model 702 to generate the quantized machine-learning model 708.
  • the quantization engine 706 may operate according to any suitable quantization techniques including, as examples, post-training quantization (PTQ) techniques, mixed-precision quantization (MPQ) techniques, automatic mixed precision (AMP) techniques, etc.
  • the quantization engine 706 may change a data format of the machine-learning model 702 from the first data format to a second data format.
  • the quantization engine 706 may change a data format of the output of the normalization function 704 from the first data format to the second data format.
  • the quantized machine-learning model 708 may be associated with the second data format.
  • the second data format may have a lower precision than the first data format.
  • the second data format may be an integer format (e.g., int8, int16, etc.).
  • the quantized machine-learning model 708 may include and/or store parameters according to the second data format.
  • the quantized machine-learning model 708 may include weights. The weights may be stored in the quantized machine-learning model 708 according to the second data format. Further, the quantized machine-learning model 708 may include multiple layers that may have respective activations.
  • the quantized machine-learning model 708 may be configured to store the activations according to the second data format.
  • a quantized normalization function 710 may be configured to output data according to the second data format.
  • the quantized normalization function 710 may be configured to output a vector, and the quantized normalization function 710 may be configured such that each value of the vector may be according to the second data format.
  • the machine-learning model 702 may include and/or store parameters according to multiple data formats (including the second data format).
  • the machine-learning model 702 may include some weights stored according to the first data format and some weights stored according to another data format.
  • the machine-learning model 702 may include multiple operations, functions, and/or layers with activations according to other respective data formats.
  • the quantized machine-learning model 708 may include and/or store parameters according to multiple data formats (including the first data format).
  • the machine-learning model 702 includes at least some parameters according to the first format and the quantized machine-learning model 708 includes at least some parameters according to the second format.
  • the normalization function 704 may be configured to output data according to a data format (e.g., the first data format) and the quantized normalization function 710 is configured to output data according to a different data format (e.g., the second data format).
  • the quantized machine-learning model 708 is a quantized version of the machine-learning model 702 and outputs of the quantized normalization function 710 are quantized relative to outputs of the normalization function 704.
  • a calibration engine 712 may calibrate the quantized machine-learning model 708 to generate the calibrated quantized machine-learning model 714.
  • the calibration engine 712 may determine a bias associated with an output of the quantized normalization function 710.
  • the calibration engine 712 may obtain calibration data 718, which may include data representative of the data used to train the machine-learning model 702. In some cases, the calibration data 718 may be a subset of the data used to train the machine-learning model 702.
  • the calibration engine 712 may provide the calibration data 718 to the quantized machine-learning model 708 and observe outputs of the quantized normalization function 710. In providing the calibration data 718 to the quantized machine-learning model 708, the calibration engine 712 may provide one or more input vectors to the quantized normalization function 710. In other cases, the calibration data 718 may include input vectors representative of input vectors that were provided to the normalization function 704 during the training of the machine-learning model 702, and the calibration engine 712 may provide one or more of the input vectors to the quantized normalization function 710.
  • the quantized normalization function 710 may generate one or more outputs (e.g., one or more output vectors) based on the respective one or more input vectors, and the calibration engine 712 may observe the output(s).
  • the calibration engine 712 may compare the output(s) to an expected output.
  • the calibration engine 712 may compare a sum of values of each of the output vectors to an expected output sum.
  • the normalization function 704 may be a SoftMax with an expected output sum of 1.
  • the quantized normalization function 710 may be a batch normalization, a layer normalization, a group normalization, or an instance normalization with an expected output sum that may be determined by the calibration engine 712.
  • the calibration engine 712 may determine a bias of the quantized normalization function 710 based on the difference between the sum of the values of each of the output vectors and the expected output sum.
  • the calibration engine 712 may use a statistical technique (e.g., an average or a median) to determine the bias based on a number of differences between the sums of a number of output vectors and the expected output sum.
  • the calibration engine 712 may determine a correction factor based on the bias. For example, the calibration engine 712 may determine that a calibrated quantized normalization function 716 should adjust its outputs to correct for the bias.
  • the calibration engine 712 may determine to spread the adjustment to correct the bias across the vector. For example, the calibration engine 712 may determine the correction factor to be the bias divided by the number of values in the output vector.
  • the calibration engine 712 may modify the quantized normalization function 710 to generate the calibrated quantized normalization function 716 such that the calibrated quantized normalization function 716 applies the correction factor to output vectors (e.g., by adding the correction factor to each value of each output vector generated by the calibrated quantized normalization function 716 when the calibrated quantized normalization function 716 operates, such as at inference).
  • the calibration engine 712 may calibrate the quantized machine-learning model 708 to generate the calibrated quantized machine-learning model 714.
  • the calibration engine 712 may be calibrating the quantized normalization function 710 to generate the calibrated quantized normalization function 716 to correct bias added by the quantization of the quantized normalization function 710.
  • Calibrating the quantized normalization function 710 may decrease or eliminate degradation of the quantized machine-learning model 708.
  • the calibrated quantized machine-learning model 714 may exhibit less noise than the quantized machine-learning model 708.
  • FIG. 8A is a diagram illustrating a system 802 including a normalization function 806 for normalizing values.
  • the normalization function 806 may be an example of the normalization function 704 of FIG.7.
  • the normalization function 806 may receive an input vector 804 and generate and output an output vector 808.
  • the input vector 804 includes the values [-1, 0, 3, 5]
  • the output vector 808 includes the values [0.002, 0.006, 0.118, 0.874].
  • the output vector 808 may be according to a first data format (e.g., a floating-point data format). Notably, the values of the output vector 808 may sum to 1.
  • FIG. 8B is a diagram illustrating a system 812 including a normalization function 816 for normalizing values.
  • the normalization function 816 may be an example of the quantized normalization function 710 of FIG. 7.
  • the normalization function 816 may receive an input vector 814 (which, according to the example of FIG. 8B, may be the same as the input vector 804).
  • the normalization function 816 may generate and output an output vector 818.
  • the input vector 814 includes the values [-1, 0, 3, 5]
• the output vector 818 includes values that may be represented as [0, 0.003922, 0.117647, 0.87451].
• the output vector 818 may be according to a second data format (e.g., an integer data format).
  • the output vector 818 may store integer values that may represent scaled steps between 0 and 1.
  • the output vector 818 may store an integer value of 0 that may represent 0 and an integer value of 1 that may represent 1/255 (or 0.003922), an integer value of 2 that may represent 2/255 (or 0.007843), etc.
• the values of the output vector 818 do not sum to 1; rather, the values of the output vector 818 sum to 0.996078, which is 0.003922 less than 1. This deficit may be the cause of degradation of a machine-learning model including the normalization function 816. It should be understood that while the output vector 818 of the example of FIG. 8B includes 4 values, other output vectors may include dozens, hundreds, or more values.
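• the numbers of FIGS. 8A and 8B can be checked with a few lines of Python; the floor-based quantizer below is one possible rounding choice and is assumed for illustration (the exact quantized values, and whether the sum lands above or below 1, depend on the quantizer's rounding mode):

```python
import numpy as np

x = np.array([-1.0, 0.0, 3.0, 5.0])
p = np.exp(x) / np.sum(np.exp(x))   # floating-point SoftMax
print(p.round(3))                   # [0.002 0.006 0.118 0.874], sums to ~1.0

scale = 1.0 / 255.0                 # 8-bit grid over [0, 1]
q = np.floor(p / scale) * scale     # one possible rounding choice
print(q.sum())                      # ~0.992 here; FIG. 8B's quantizer yields 0.996078
```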
  • FIG. 8C is a diagram illustrating a system 822 including a normalization function 826 for normalizing values.
  • the normalization function 826 may be an example of the calibrated quantized normalization function 716 of FIG.7.
  • the normalization function 826 may receive an input vector 824 (which, according to the example of FIG.8C, may be the same as the input vector 804 and the input vector 814).
  • the normalization function 826 may generate and output an intermediate vector 828.
  • the input vector 824 may include the values [-1, 0, 3, 5], and the intermediate vector 828 may include values that may be represented as [0, 0.003922, 0.117647, 0.87451].
• the intermediate vector 828 may be according to a second data format (e.g., an integer data format). As such, the intermediate vector 828 may store integer values that may represent scaled steps between 0 and 1. The values of the intermediate vector 828 do not sum to 1; rather, the values of the intermediate vector 828 sum to 0.996078, which is 0.003922 less than 1.
• a calibration engine (e.g., the calibration engine 712 of FIG. 7) may determine that 0.003922 is the bias of the normalization function 826.
• the calibration engine may have provided the normalization function 826 with calibration data and determined the bias through numerous calibration tests.
• further, the calibration engine may modify the system 822 such that, in operation, the system 822 corrects the bias by adding the bias to values of the intermediate vector 828. For example, the calibration engine may modify the system 822 such that the system 822 adds a fraction of the bias to each value of the intermediate vector 828 to generate an output vector 832. According to the example of FIG. 8C, the calibration engine may modify the system 822 such that the system 822 adds 0.003922 / 4 to each value of the intermediate vector 828 (e.g., based on the intermediate vector 828 including 4 values).
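• the arithmetic of FIG. 8C can be verified directly (a minimal sketch; variable names are illustrative):

```python
import numpy as np

intermediate = np.array([0.0, 0.003922, 0.117647, 0.87451])  # intermediate vector 828
bias = 1.0 - intermediate.sum()                   # ~0.003922 (up to rounding)
output = intermediate + bias / intermediate.size  # add bias / 4 to each value
print(output.sum())                               # 1.0, up to float precision
```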
  • FIG.9 is a diagram illustrating a tensor 902 that may be input into a calibration function, according to various aspects of the present disclosure.
  • the tensor 902 is three-dimensional.
• the tensor 902 includes “heads” values in a first dimension (e.g., a height dimension or a z dimension), Ly values in a second dimension (e.g., a width dimension or a y dimension), and Lx values in a third dimension (e.g., a depth dimension or an x dimension).
• a normalization function (e.g., any of the normalization function 704, the quantized normalization function 710, the calibrated quantized normalization function 716, the normalization function 806, the normalization function 816, the normalization function 826, and/or the system 822) may normalize a vector (e.g., a one-dimensional data structure) of values at a time.
  • a machine-learning model may provide one vector of the tensor 902 to a normalization function at a time.
  • the machine-learning model may provide a vector of values defined by one head and one Ly to the normalization function at a time.
  • FIG. 10 is a diagram illustrating the tensor 902 of FIG. 9 with example formats of correction factors that may be applied when normalizing the tensor 902, according to various aspects of the present disclosure.
• a calibration engine (e.g., the calibration engine 712 of FIG. 7) may determine a correction factor matrix 1004 (e.g., a two-dimensional matrix) that may be applied to vectors of the tensor 902.
  • the correction factor matrix 1004 may include “heads” * Ly values (e.g., one value corresponding to each vector of the tensor 902).
• the calibration engine may cause the normalization function to apply a value of the correction factor matrix 1004 to each corresponding vector of the tensor 902 when the normalization function normalizes the corresponding vector.
  • the calibration engine may determine a correction factor vector 1006 (e.g., a one-dimensional vector) that may be applied to vectors of the tensor 902.
  • the correction factor vector 1006 may include “heads” values (e.g., one value corresponding to each layer, in the height dimension, of the vectors of the tensor 902).
  • the calibration engine may cause the normalization function to apply a value of the correction factor vector 1006 to all vectors of the tensor 902 of the same layer when the normalization function normalizes the vectors.
  • the calibration engine may determine a correction factor value 1008 (e.g., a single value) that may be applied to all vectors of the tensor 902.
  • the calibration engine may cause the normalization function to apply the correction factor value 1008 to all vectors of tensor 902 when the normalization function normalizes the vectors.
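• the three granularities described above can be sketched with NumPy broadcasting; the shapes follow the tensor 902 of FIG. 9, and the random values stand in for correction factors that would actually be determined by calibration:

```python
import numpy as np

heads, Ly, Lx = 8, 16, 16
normalized = np.random.rand(heads, Ly, Lx)    # stand-in for normalized vectors of tensor 902

# Per-vector: one value per (head, Ly) vector, as in correction factor matrix 1004.
matrix = np.random.rand(heads, Ly)
out_matrix = normalized + matrix[:, :, None]  # broadcast along the Lx axis

# Per-head: one value per layer in the height dimension, as in correction factor vector 1006.
vector = np.random.rand(heads)
out_vector = normalized + vector[:, None, None]

# Per-tensor: a single scalar, as in correction factor value 1008.
scalar = 0.001
out_scalar = normalized + scalar
```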
• the systems and techniques may determine the bias as a systematic discrepancy between the quantized and unquantized activation vectors of size $N$, according to the equation: $$\epsilon = \frac{1}{N}\sum_{i=1}^{N}\left(F(\mathbf{x})_i - Q\big(F(\mathbf{x})\big)_i\right),$$ where $F$ is the transformation function and $Q(\cdot)$ is the quantization function.
• for some normalization functions or activation layers (e.g., SoftMax), the expected value of the activation may be known in advance.
• the bias correction may have a dimensionality. Activation tensors in deep learning can have different dimensions (e.g., 4D for ConvNets and 3D for Transformers). A user may choose the dimensionality of the bias-correction vector depending on the target hardware and application.
• the bias correction $\epsilon$ can be a scalar applied to the full tensor, or a 1D vector applied along the output-channel dimension of BatchNorm. For SoftMax quantization, a different $\epsilon$ can be calculated for each attention head. In stable diffusion, a different $\epsilon$ can be used per diffusion step, making the correction time-dependent.
• an input to a SoftMax layer may be a 3D tensor $\mathbf{X} \in \mathbb{R}^{h \times L_y \times L_x}$.
• SoftMax may be applied along the final dimension.
• the bias may be applied along the first two dimensions independently: $$\epsilon_{i,j} = \frac{1}{L_x}\sum_{k=1}^{L_x}\left(F(\mathbf{X})_{i,j,k} - Q\big(F(\mathbf{X})\big)_{i,j,k}\right), \quad i \in \{1, \ldots, h\},\; j \in \{1, \ldots, L_y\}.$$
• the correction can be reduced to a per-head value, $\epsilon_i = \frac{1}{L_y L_x}\sum_{j,k}\left(F(\mathbf{X})_{i,j,k} - Q(F(\mathbf{X}))_{i,j,k}\right)$, or to a per-tensor value, $\epsilon = \frac{1}{h L_y L_x}\sum_{i,j,k}\left(F(\mathbf{X})_{i,j,k} - Q(F(\mathbf{X}))_{i,j,k}\right)$.
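• the three reductions above map directly onto axis-wise means; a minimal sketch, assuming `float_out` and `quant_out` hold the unquantized and quantized SoftMax outputs for one tensor:

```python
import numpy as np

def bias_corrections(float_out, quant_out):
    """Reduce the discrepancy between outputs of shape (h, Ly, Lx) along the
    SoftMax (last) axis to per-vector, per-head, and per-tensor corrections."""
    diff = float_out - quant_out
    eps_per_vector = diff.mean(axis=-1)       # shape (h, Ly)
    eps_per_head = diff.mean(axis=(-2, -1))   # shape (h,)
    eps_per_tensor = diff.mean()              # scalar
    return eps_per_vector, eps_per_head, eps_per_tensor
```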
• FIG. 11 is a flow diagram illustrating a process 1100 for calibrating a quantized machine-learning model, in accordance with aspects of the present disclosure.
  • One or more operations of the process 1100 may be performed by a computing device (or apparatus) or a component (e.g., a chipset, codec, etc.) of the computing device.
  • the computing device may be a mobile device (e.g., a mobile phone), a network-connected wearable such as a watch, an extended reality (XR) device such as a virtual reality (VR) device or augmented reality (AR) device, a vehicle or component or system of a vehicle, or other type of computing device.
  • the one or more operations of the process 1100 may be implemented as software components that are executed and run on one or more processors.
  • a computing device may obtain a quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine- learning model is associated with a second data format.
  • system 700 of FIG.7 may obtain quantized machine-learning model 708.
  • Quantized machine-learning model 708 may be associated with a second data format (e.g., int8).
  • one or more parameters (e.g., weights and/or activations) of quantized machine-learning model 708 may be formatted according to the second data format.
  • Quantized machine-learning model 708 may be based on machine-learning model 702.
  • Machine-learning model 702 may be associated with a first data format (e.g., float16). For example, one or more parameters (e.g., weights and/or activations) of machine-learning model 702 may be formatted according to the first data format.
  • the quantized machine-learning model may be, or may include, a diffusion model.
  • the quantized machine-learning model may be, or may include, a language transformer.
  • obtaining the quantized machine-learning model may be, or may include, changing a format associated with a machine-learning model from the first data format to the second data format to generate the quantized machine-learning model.
  • system 700 may obtain machine-learning model 702 and quantize machine- learning model 702 to generate quantized machine-learning model 708.
  • changing the format associated with the machine-learning model may be, or may include, changing parameters of the machine-learning model from being stored according to the first data format to being stored according to the second data format.
  • the parameters may be, or may include, one or both of weights and activations of the machine-learning model.
  • the machine-learning model on which the quantized machine-learning model is based may store parameters of the machine-learning model according to the first data format and the quantized machine-learning model may store parameters of the quantized machine-learning model according to the second data format.
  • the parameters of the machine-learning model may be, or may include, one or both of weights and activations of the machine-learning model and the parameters of the quantized machine-learning model may be, or may include, one or both of weights and activations of the quantized machine-learning model.
  • the first data format may be, or may include, a floating-point number data format
  • the second data format may be, or may include, an integer data format.
  • the integer data format comprises an 8-bit integer data format.
  • Machine-learning model 702 may be associated with a first data format (e.g., float16).
  • one or more parameters (e.g., weights and/or activations) of machine-learning model 702 may be formatted according to the first data format.
  • Quantized machine-learning model 708 may be based on machine-learning model 702.
  • quantized machine-learning model 708 may be a quantized version of machine-learning model 702.
  • Quantized machine-learning model 708 may be associated with a second data format (e.g., int8).
  • one or more parameters (e.g., weights and/or activations) of quantized machine-learning model 708 may be formatted according to the second data format.
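• as an illustrative sketch of the float-to-integer format change described above, parameters might be mapped onto a symmetric 8-bit grid as follows; the function name, the symmetric scheme, and the max-based scale are assumptions for illustration, not the disclosure's prescribed method:

```python
import numpy as np

def quantize_params(weights, num_bits=8):
    """Map floating-point parameters onto a symmetric signed-integer grid."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for int8
    scale = np.max(np.abs(weights)) / qmax         # assumes a nonzero tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                                # dequantize as q * scale
```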
  • the computing device (or one or more components thereof) may determine a bias associated with at least one output of a normalization function of the quantized machine-learning model.
  • determining the bias may include: normalizing at least one input vector using the normalization function of the quantized machine-learning model to generate the at least one output, wherein the at least one output comprises at least one output vector; determining a difference between a sum of values of the at least one output vector and an expected output sum; and determining the bias based on the difference.
  • calibration engine 712 may provide at least one input vector of calibration data 718 to quantized normalization function 710.
  • Quantized normalization function 710 may generate at least one output vector.
  • Calibration engine 712 may determine a difference between a sum of values of the at least one output vector and an expected output sum.
  • Calibration engine 712 may determine the bias based on the difference.
  • the normalization function may be, or may include, an activation function configured to generate a probability distribution.
  • the normalization function may be configured to receive an input vector and generate an output vector.
  • the normalization function may be configured such that values of the output vector sum to an expected output sum.
  • the normalization function may be a SoftMax function that may have an expected output sum of 1.
  • the normalization function may be, or may include, at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.
  • the computing device may determine a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.
  • system 700 may determine a correction factor based on the bias determined at block 1104.
  • the correction factor may be for calibrated quantized machine-learning model 714 to apply when calibrated quantized machine-learning model 714 normalizes vectors.
  • the correction factor may be determined further based on a size of a set of data input to the normalization function. Additionally or alternatively, the correction factor may be determined further based on a size of a set of data output by the normalization function.
• the correction factor may be the bias divided by the number of values in vectors input to the quantized normalization function. Additionally or alternatively, the correction factor may be the bias divided by the number of values in vectors output by the quantized normalization function.
• In some aspects, the correction factor is to be combined with each value of each output vector of the normalization function when the normalization function generates the output vectors. For example, correction factor 830 of FIG. 8C may be applied to each value of intermediate vector 828 to generate output vector 832.
• In some aspects, the correction factor may be, or may include, two or more correction factors to be combined with respective output vectors of the normalization function when the normalization function generates the output vectors.
  • FIG. 12 is a flow diagram illustrating a process 1200 for normalizing a vector in a quantized machine-learning model, in accordance with aspects of the present disclosure.
  • One or more operations of the process 1200 may be performed by a computing device (or apparatus) or a component (e.g., a chipset, codec, etc.) of the computing device.
  • the computing device may be a mobile device (e.g., a mobile phone), a network-connected wearable such as a watch, an extended reality (XR) device such as a virtual reality (VR) device or augmented reality (AR) device, a vehicle or component or system of a vehicle, a desktop computing device, a tablet computing device, a server computer, a robotic device, and/or any other computing device with the resource capabilities to perform the process 1200.
  • the one or more operations of the process 1200 may be implemented as software components that are executed and run on one or more processors.
• a computing device (or one or more components thereof) may normalize an input vector using a normalization function to generate an intermediate vector. For example, normalization function 826 of FIG. 8C may normalize input vector 824 to generate intermediate vector 828.
  • the normalization function may be, or may include, an activation function configured to generate a probability distribution.
• the normalization function may be configured such that values of the intermediate vector sum to an expected output sum.
  • the normalization function may be, or may include, at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.
  • the computing device (or one or more components thereof) may select a vector of a tensor as the input vector.
• the predetermined value may be determined based on a size of the input vector. Additionally or alternatively, the predetermined value may be determined further based on a size of a set of data output by the normalization function. For example, the predetermined value may be the bias divided by the number of values in vectors input to the quantized normalization function. Additionally or alternatively, the predetermined value may be the bias divided by the number of values in vectors output by the quantized normalization function.
• At a block 1204, the computing device (or one or more components thereof) may combine a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model.
• the quantized machine-learning model may be, or may include, a diffusion model. In some aspects, the quantized machine-learning model may be, or may include, a language transformer.
• In some aspects, the computing device (or one or more components thereof) may obtain a tensor. The tensor may be, or may include, the input vector and one or more vectors. The computing device (or one or more components thereof) may normalize the one or more vectors using the normalization function to generate one or more intermediate vectors. For example, the computing device (or one or more components thereof) may receive tensor 902 and normalize tensor 902, one vector at a time.
  • the computing device may combine the predetermined value with each value of the one or more intermediate vectors to generate one or more output vectors.
• the computing device may combine correction factor value 1008 with each value of each intermediate vector normalized based on all vectors of tensor 902.
  • the computing device may combine a predetermined value of one or more predetermined values with each value of each one of the one or more intermediate vectors to generate one or more output vectors.
  • the computing device may combine values of correction factor matrix 1004 with each value of respective ones of intermediate vectors normalized based on respective vectors of tensor 902.
  • the computing device may combine values of correction factor vector 1006 with each value of respective ones of intermediate vectors normalized based on respective vectors of tensor 902.
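• the process 1200 may be pictured with a short sketch; the function names and the per-(head, Ly) correction layout are illustrative assumptions:

```python
import numpy as np

def normalize_and_correct(input_vector, norm_fn, predetermined_value):
    """Block 1202: normalize; block 1204: combine the predetermined value
    with each value of the intermediate vector."""
    intermediate = norm_fn(input_vector)
    return intermediate + predetermined_value

def normalize_tensor(tensor, norm_fn, corrections):
    """Normalize a (heads, Ly, Lx) tensor one (head, Ly) vector at a time,
    combining the corresponding correction with each intermediate vector."""
    out = np.empty(tensor.shape, dtype=float)
    heads, Ly, _ = tensor.shape
    for h in range(heads):
        for j in range(Ly):
            out[h, j] = normalize_and_correct(tensor[h, j], norm_fn, corrections[h, j])
    return out
```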
• a computing device with the computing-device architecture 1500 shown in FIG. 15 can include, or be included in, the components of the system 100 of FIG. 1, machine-learning model 200 of FIG. 2, two sets of images 300 of FIG. 3, U-Net architecture 500 of FIG. 5, attention block 600 of FIG. 6, system 700 of FIG. 7, and/or system 822 of FIG. 8C.
  • the computing device or apparatus can include various components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more cameras, one or more sensors, and/or other component(s) that are configured to carry out the steps of processes described herein.
  • the computing device can include a display, a network interface configured to communicate and/or receive the data, any combination thereof, and/or other component(s).
  • the network interface can be configured to communicate and/or receive Internet Protocol (IP) based data or other type of data.
  • the components of the computing device can be implemented in circuitry.
  • the components can include and/or can be implemented using electronic circuits or other electronic hardware, which can include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), and/or other suitable electronic circuits), and/or can include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.
• the process 1100, the process 1200, and/or other processes described herein are illustrated as logical flow diagrams, the operation of which represents a sequence of operations that can be implemented in hardware, computer instructions, or a combination thereof.
  • the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations.
  • computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types.
  • the order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.
• the process 1100, the process 1200, and/or other processes described herein can be performed under the control of one or more computer systems configured with executable instructions and can be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or combinations thereof.
• the code can be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors.
• the computer-readable or machine-readable storage medium can be non-transitory.
  • various aspects of the present disclosure can use machine-learning models or systems.
  • FIG.13 is an illustrative example of a neural network 1300 (e.g., a deep-learning neural network) that can be used to implement the machine-learning based feature segmentation, implicit-neural-representation generation, rendering, and/or classification described above.
• Neural network 1300 may be an example of, or can implement, machine-learning model 200 of FIG. 2, two sets of images 300 of FIG. 3, U-Net architecture 500 of FIG. 5, machine-learning model 702 of FIG. 7, quantized machine-learning model 708 of FIG. 7, and/or calibrated quantized machine-learning model 714 of FIG. 7.
  • An input layer 1302 includes input data.
  • input layer 1302 can include data representing a prompt, text embeddings (e.g., based on a prompt), and/or a latent seed.
• Neural network 1300 includes multiple hidden layers 1306a, 1306b, through 1306n.
  • the hidden layers 1306a, 1306b, through hidden layer 1306n include “n” number of hidden layers, where “n” is an integer greater than or equal to one.
  • the number of hidden layers can be made to include as many layers as needed for the given application.
  • Neural network 1300 further includes an output layer 1304 that provides an output resulting from the processing performed by the hidden layers 1306a, 1306b, through 1306n.
  • output layer 1304 can provide output images or output language.
• Neural network 1300 may be, or may include, a multi-layer neural network of interconnected nodes. Each node can represent a piece of information. Information associated with the nodes is shared among the different layers and each layer retains information as information is processed. In some cases, neural network 1300 can include a feed-forward network, in which case there are no feedback connections where outputs of the network are fed back into itself. In some cases, neural network 1300 can include a recurrent neural network, which can have loops that allow information to be carried across nodes while reading in input.
• Information can be exchanged between nodes through node-to-node interconnections between the various layers. Nodes of input layer 1302 can activate a set of nodes in the first hidden layer 1306a.
  • each of the input nodes of input layer 1302 is connected to each of the nodes of the first hidden layer 1306a.
  • the nodes of first hidden layer 1306a can transform the information of each input node by applying activation functions to the input node information.
  • the information derived from the transformation can then be passed to and can activate the nodes of the next hidden layer 1306b, which can perform their own designated functions.
  • Example functions include convolutional, up-sampling, data transformation, and/or any other suitable functions.
  • the output of the hidden layer 1306b can then activate nodes of the next hidden layer, and so on.
• the output of the last hidden layer 1306n can activate one or more nodes of the output layer 1304, at which an output is provided.
  • each node or interconnection between nodes can have a weight that is a set of parameters derived from the training of neural network 1300.
• Once neural network 1300 is trained, it can be referred to as a trained neural network, which can be used to perform one or more operations.
  • an interconnection between nodes can represent a piece of information learned about the interconnected nodes.
  • the interconnection can have a tunable numeric weight that can be tuned (e.g., based on a training dataset), allowing neural network 1300 to be adaptive to inputs and able to learn as more and more data is processed.
  • Neural network 1300 may be pre-trained to process the features from the data in the input layer 1302 using the different hidden layers 1306a, 1306b, through 1306n in order to provide the output through the output layer 1304.
  • neural network 1300 can be trained using training data that includes both images and labels, as described above.
  • training images can be input into the network, with each training image having a label indicating the features in the images (for the feature-segmentation machine-learning system) or a label indicating classes of an activity in each image.
• a training image can include an image of a number 2, in which case the label for the image can be [0 0 1 0 0 0 0 0 0 0].
  • neural network 1300 can adjust the weights of the nodes using a training process called backpropagation.
• a backpropagation process can include a forward pass, a loss function, a backward pass, and a weight update. The forward pass, loss function, backward pass, and parameter update are performed for one training iteration.
  • the forward pass can include passing a training image through neural network 1300.
  • the weights are initially randomized before neural network 1300 is trained.
• an image can include an array of numbers representing the pixels of the image. Each number in the array can include a value from 0 to 255 describing the pixel intensity at that position in the array.
  • the array can include a 28 x 28 x 3 array of numbers with 28 rows and 28 columns of pixels and 3 color components (such as red, green, and blue, or luma and two chroma components, or the like).
  • the output will likely include values that do not give preference to any particular class due to the weights being randomly selected at initialization. For example, if the output is a vector with probabilities that the object includes different classes, the probability value for each of the different classes can be equal or at least very similar (e.g., for ten possible classes, each class can have a probability value of 0.1).
• a loss function can be used to analyze error in the output. Any suitable loss function definition can be used, such as a cross-entropy loss. Another example of a loss function includes the mean squared error (MSE), defined as $E_{total} = \sum \frac{1}{2}(\text{target} - \text{output})^2$. The loss can be set to be equal to the value of $E_{total}$.
• The loss (or error) will be high for the first training images since the actual values will be much different than the predicted output. The goal of training is to minimize the amount of loss so that the predicted output is the same as the training label.
  • Neural network 1300 can perform a backward pass by determining which inputs (weights) most contributed to the loss of the network and can adjust the weights so that the loss decreases and is eventually minimized.
  • a derivative of the loss with respect to the weights (denoted as dL/dW, where W are the weights at a particular layer) can be computed to determine the weights that contributed most to the loss of the network.
  • a weight update can be performed by updating all the weights of the filters. For example, the weights can be updated so that they change in the opposite direction of the gradient.
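• the update in the opposite direction of the gradient is commonly written as $w \leftarrow w - \eta \frac{dL}{dW}$, where $\eta$ denotes a learning rate (the symbol is a conventional choice, not taken from the figures).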
  • Neural network 1300 can include any suitable deep network.
• One example includes a convolutional neural network (CNN), which includes an input layer and an output layer, with multiple hidden layers between the input and output layers.
• the hidden layers of a CNN include a series of convolutional, nonlinear, pooling (for downsampling), and fully connected layers.
• Neural network 1300 can include any other deep network other than a CNN, such as an autoencoder, deep belief nets (DBNs), or recurrent neural networks (RNNs), among others.
  • FIG. 14 is an illustrative example of a convolutional neural network (CNN) 1400.
  • the input layer 1402 of the CNN 1400 includes data representing an image or frame.
  • the data can include an array of numbers representing the pixels of the image, with each number in the array including a value from 0 to 255 describing the pixel intensity at that position in the array.
  • the array can include a 28 x 28 x 3 array of numbers with 28 rows and 28 columns of pixels and 3 color components (e.g., red, green, and blue, or luma and two chroma components, or the like).
  • the image can be passed through a convolutional hidden layer 1404, an optional non-linear activation layer, a pooling hidden layer 1406, and fully connected layer 1408 (which fully connected layer 1408 can be hidden) to get an output at the output layer 1410. While only one of each hidden layer is shown in FIG.14, one of ordinary skill will appreciate that multiple convolutional hidden layers, non-linear layers, pooling hidden layers, and/or fully connected layers can be included in the CNN 1400.
  • the output can indicate a single class of an object or can include a probability of classes that best describe the object in the image.
  • the first layer of the CNN 1400 can be the convolutional hidden layer 1404.
  • the convolutional hidden layer 1404 can analyze image data of the input layer 1402. Each node of the convolutional hidden layer 1404 is connected to a region of nodes (pixels) of the input image called a receptive field.
  • the convolutional hidden layer 1404 can be considered as one or more filters (each filter corresponding to a different activation or feature map), with each convolutional iteration of a filter being a node or neuron of the convolutional hidden layer 1404.
  • the region of the input image that a filter covers at each convolutional iteration would be the receptive field for the filter.
• if the input image includes a 28 x 28 array, and each filter (and corresponding receptive field) is a 5 x 5 array, then there will be 24 x 24 nodes in the convolutional hidden layer 1404.
  • Each connection between a node and a receptive field for that node learns a weight and, in some cases, an overall bias such that each node learns to analyze its particular local receptive field in the input image.
  • Each node of the convolutional hidden layer 1404 will have the same weights and bias (called a shared weight and a shared bias).
  • the filter has an array of weights (numbers) and the same depth as the input.
• a filter will have a depth of 3 for an image frame example (according to three color components of the input image).
  • An illustrative example size of the filter array is 5 x 5 x 3, corresponding to a size of the receptive field of a node.
  • the convolutional nature of the convolutional hidden layer 1404 is due to each node of the convolutional layer being applied to its corresponding receptive field.
  • a filter of the convolutional hidden layer 1404 can begin in the top-left corner of the input image array and can convolve around the input image.
  • each convolutional iteration of the filter can be considered a node or neuron of the convolutional hidden layer 1404.
  • the values of the filter are multiplied with a corresponding number of the original pixel values of the image (e.g., the 5x5 filter array is multiplied by a 5x5 array of input pixel values at the top-left corner of the input image array).
  • the multiplications from each convolutional iteration can be summed together to obtain a total sum for that iteration or node.
  • the process is next continued at a next location in the input image according to the receptive field of a next node in the convolutional hidden layer 1404.
  • a filter can be moved by a step amount (referred to as a stride) to the next receptive field.
  • the stride can be set to 1 or any other suitable amount. For example, if the stride is set to 1, the filter will be moved to the right by 1 pixel at each convolutional iteration. Processing the filter at each unique location of the input volume produces a number representing the filter results for that location, resulting in a total sum value being determined for each node of the convolutional hidden layer 1404.
  • the mapping from the input layer to the convolutional hidden layer 1404 is referred to as an activation map (or feature map).
  • the activation map includes a value for each node representing the filter results at each location of the input volume.
  • the activation map can include an array that includes the various total sum values resulting from each iteration of the filter on the input volume.
  • the activation map will include a 24 x 24 array if a 5 x 5 filter is applied to each pixel (a stride of 1) of a 28 x 28 input image.
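• the 24 x 24 figure follows from the standard no-padding output-size relation, $(W - F)/S + 1 = (28 - 5)/1 + 1 = 24$, where $W$ is the input width, $F$ the filter size, and $S$ the stride (stated here for convenience).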
• the convolutional hidden layer 1404 can include several activation maps in order to identify multiple features in an image. The example shown in FIG. 14 includes three activation maps. Using three activation maps, the convolutional hidden layer 1404 can detect three different kinds of features, with each feature being detectable across the entire image.
• In some examples, a non-linear hidden layer can be applied after the convolutional hidden layer 1404.
• the non-linear layer can be used to introduce non-linearity to a system that has been computing linear operations.
  • One illustrative example of a non-linear layer is a rectified linear unit (ReLU) layer.
  • the ReLU can thus increase the non-linear properties of the CNN 1400 without affecting the receptive fields of the convolutional hidden layer 1404.
  • the pooling hidden layer 1406 can be applied after the convolutional hidden layer 1404 (and after the non-linear hidden layer when used).
  • the pooling hidden layer 1406 is used to simplify the information in the output from the convolutional hidden layer 1404.
  • the pooling hidden layer 1406 can take each activation map output from the convolutional hidden layer 1404 and generates a condensed activation map (or feature map) using a pooling function.
  • Max-pooling is one example of a function performed by a pooling hidden layer.
• Other forms of pooling functions can be used by the pooling hidden layer 1406, such as average pooling, L2-norm pooling, or other suitable pooling functions.
• a pooling function (e.g., a max-pooling filter, an L2-norm filter, or other suitable pooling filter) is applied to each activation map included in the convolutional hidden layer 1404.
  • max-pooling can be used by applying a max-pooling filter (e.g., having a size of 2x2) with a stride (e.g., equal to a dimension of the filter, such as a stride of 2) to an activation map output from the convolutional hidden layer 1404.
  • the output from a max- pooling filter includes the maximum number in every sub-region that the filter convolves around.
• each unit in the pooling layer can summarize a region of 2 x 2 nodes in the previous layer (with each node being a value in the activation map).
• For example, four values (nodes) in an activation map will be analyzed by a 2x2 max-pooling filter at each iteration of the filter, with the maximum value from the four values being output as the “max” value. If such a max-pooling filter is applied to an activation filter from the convolutional hidden layer 1404 having a dimension of 24x24 nodes, the output from the pooling hidden layer 1406 will be an array of 12x12 nodes.
• In some examples, an L2-norm pooling filter could also be used.
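• max-pooling with a 2x2 filter and stride 2 can be sketched in a few lines of NumPy (an illustrative sketch assuming even spatial dimensions):

```python
import numpy as np

def max_pool_2x2(activation_map):
    """2x2 max-pooling with stride 2; halves each spatial dimension."""
    h, w = activation_map.shape
    return activation_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.random.rand(24, 24)     # e.g., one 24x24 activation map
print(max_pool_2x2(fmap).shape)   # (12, 12)
```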
• the L2-norm pooling filter includes computing the square root of the sum of the squares of the values in the 2 x 2 region (or other suitable region) of an activation map (instead of computing the maximum values as is done in max-pooling) and using the computed values as an output.
• the pooling function (e.g., max-pooling, L2-norm pooling, or other pooling function) determines whether a given feature is found anywhere in a region of the image and discards the exact positional information. This can be done without affecting results of the feature detection because, once a feature has been found, the exact location of the feature is not as important as its approximate location relative to other features.
  • Max-pooling offers the benefit that there are many fewer pooled features, thus reducing the number of parameters needed in later layers of the CNN 1400.
  • the final layer of connections in the network is a fully-connected layer that connects every node from the pooling hidden layer 1406 to every one of the output nodes in the output layer 1410.
  • the input layer includes 28 x 28 nodes encoding the pixel intensities of the input image
• the convolutional hidden layer 1404 includes 3 x 24 x 24 hidden feature nodes based on application of a 5 x 5 local receptive field (for the filters) to three activation maps
• the pooling hidden layer 1406 includes a layer of 3 x 12 x 12 hidden feature nodes based on application of a max-pooling filter to 2 x 2 regions across each of the three feature maps.
  • the output layer 1410 can include ten output nodes. In such an example, every node of the 3x12x12 pooling hidden layer 1406 is connected to every node of the output layer 1410.
  • the fully connected layer 1408 can obtain the output of the previous pooling hidden layer 1406 (which should represent the activation maps of high-level features) and determines the features that most correlate to a particular class. For example, the fully connected layer 1408 can determine the high-level features that most strongly correlate to a particular class and can include weights (nodes) for the high-level features. A product can be computed between the weights of the fully connected layer 1408 and the pooling hidden layer 1406 to obtain probabilities for the different classes.
• if the CNN 1400 is being used to predict that an object in an image is a person, high values will be present in the activation maps that represent high-level features of people (e.g., two legs are present, a face is present at the top of the object, two eyes are present at the top left and top right of the face, a nose is present in the middle of the face, a mouth is present at the bottom of the face, and/or other features common for a person).
  • Each number in the M-dimensional vector can represent the probability the object is of a certain class.
  • the vector indicates that there is a 5% probability that the image is the third class of object (e.g., a dog), an 80% probability that the image is the fourth class of object (e.g., a human), and a 15% probability that the image is the sixth class of object (e.g., a kangaroo).
  • the probability for a class can be considered a confidence level that the object is part of that class.
  • FIG. 15 illustrates an example computing-device architecture 1500 of an example computing device which can implement the various techniques described herein.
  • the computing device can include a mobile device, a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a video server, a vehicle (or computing device of a vehicle), or other device.
  • the computing-device architecture 1500 may include, implement, or be included in any or all of system 100 of FIG. 1, machine- learning model 200 of FIG. 2, two sets of images 300 of FIG.3, U-Net architecture 500 of FIG. 5, attention block 600 of FIG. 6, system 700 of FIG. 7, calibrated quantized machine-learning model 714 of FIG.7, and/or system 822 of FIG.8.
  • computing-device architecture 1500 includes a processing unit (CPU or processor) 1502 and computing device connection 1512 that couples various computing device components including computing device memory 1510, such as read only memory (ROM) 1508 and random-access memory (RAM) 1506, to processor 1502.
  • Computing-device architecture 1500 can include a cache of high-speed memory connected directly with, in close proximity to, or integrated as part of processor 1502.
  • Computing-device architecture 1500 can copy data from memory 1510 and/or the storage device 1514 to cache 1504 for quick access by processor 1502. In this way, the cache can provide a performance boost that avoids processor 1502 delays while waiting for data.
• Processor 1502 can be configured to perform various actions.
  • Other computing device memory 1510 may be available for use as well. Memory 1510 can include multiple different types of memory with different performance characteristics.
• Processor 1502 can include any general-purpose processor and a hardware or software service, such as service 1 1516, service 2 1518, and service 3 1520 stored in storage device 1514, configured to control processor 1502 as well as a special-purpose processor where software instructions are incorporated into the processor design.
  • Processor 1502 may be a self-contained system, containing multiple cores or processors, a bus, memory controller, cache, etc.
  • a multi-core processor may be symmetric or asymmetric.
• input device 1522 can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech, and so forth.
  • Output device 1524 can also be one or more of a number of output mechanisms known to those of skill in the art, such as a display, projector, television, speaker device, etc.
  • multimodal computing devices can enable a user to provide multiple types of input to communicate with computing-device architecture 1500.
  • Communication interface 1526 can generally govern and manage the user input and computing device output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
  • Storage device 1514 is a non-volatile memory and can be a hard disk or other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, random-access memories (RAMs) 1506, read only memory (ROM) 1508, and hybrids thereof.
  • Storage device 1514 can include services 1516, 1518, and 1520 for controlling processor 1502. Other hardware or software modules are contemplated.
  • Storage device 1514 can be connected to the computing device connection 1512.
  • a hardware module that performs a particular function can include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as processor 1502, connection 1512, output device 1524, and so forth, to carry out the function.
  • the term “substantially,” in reference to a given parameter, property, or condition may refer to a degree that one of ordinary skill in the art would understand that the given parameter, property, or condition is met with a small degree of variance, such as, for example, within acceptable manufacturing tolerances.
  • the parameter, property, or condition may be at least 90% met, at least 95% met, or even at least 99% met.
  • aspects of the present disclosure are applicable to any suitable electronic device (such as security systems, smartphones, tablets, laptop computers, vehicles, drones, or other devices) including or coupled to one or more active depth sensing systems. While described below with respect to a device having or coupled to one light projector, aspects of the present disclosure are applicable to devices having any number of light projectors and are therefore not limited to specific devices.
  • the term “device” is not limited to one or a specific number of physical objects (such as one smartphone, one controller, one processing system and so on). As used herein, a device may be any electronic device with one or more parts that may implement at least some portions of this disclosure.
  • the term “device” is not limited to a specific configuration, type, or number of objects.
• the term “system” is not limited to multiple components or specific aspects. For example, a system may be implemented on one or more printed circuit boards or other substrates and may have movable or static components. While the below description and examples use the term “system” to describe various aspects of this disclosure, the term “system” is not limited to a specific configuration, type, or number of objects.
• Specific details are provided in the description above to provide a thorough understanding of the aspects and examples provided herein. However, it will be understood by one of ordinary skill in the art that the aspects may be practiced without these specific details.
  • the present technology may be presented as including individual functional blocks including functional blocks including devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software. Additional components may be used other than those shown in the figures and/or described herein.
  • circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the aspects in unnecessary detail.
  • well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the aspects.
  • Processes and methods according to the above-described examples can be implemented using computer-executable instructions that are stored or otherwise available from computer- readable media. Such instructions can include, for example, instructions and data which cause or otherwise configure a general-purpose computer, special purpose computer, or a processing device to perform a certain function or group of functions. Portions of computer resources used can be accessible over a network.
  • the computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, source code, etc.
• the term “computer-readable medium” includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing, containing, or carrying instruction(s) and/or data.
  • a computer-readable medium may include a non-transitory medium in which data can be stored and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections.
  • Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash memory, magnetic or optical disks, USB devices provided with non-volatile memory, networked storage devices, any suitable combination thereof, among others.
  • a computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements.
  • a code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents.
  • Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, or the like.
  • the computer-readable storage devices, mediums, and memories can include a cable or wireless signal containing a bit stream and the like.
  • non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.
  • Devices implementing processes and methods according to these disclosures can include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and can take any of a variety of form factors.
  • the program code or code segments to perform the necessary tasks may be stored in a computer-readable or machine-readable medium.
  • a processor(s) may perform the necessary tasks.
  • Typical examples of form factors include laptops, smart phones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on.
  • Functionality described herein also can be embodied in peripherals or add-in cards. Such functionality can also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example.
  • the instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are example means for providing the functions described in the disclosure.
  • Claim language or other language reciting “at least one of” a set and/or “one or more” of a set indicates that one member of the set or multiple members of the set (in any combination) satisfy the claim.
  • claim language reciting “at least one of A and B” or “at least one of A or B” means A, B, or A and B.
  • claim language reciting “at least one of A, B, and C” or “at least one of A, B, or C” means A, B, C, or A and B, or A and C, or B and C, or A and B and C.
  • the language “at least one of” a set and/or “one or more” of a set does not limit the set to the items listed in the set.
  • claim language reciting “at least one of A and B” or “at least one of A or B” can mean A, B, or A and B, and can additionally include items not listed in the set of A and B.
  • Claim language or other language reciting “at least one processor configured to,” “at least one processor being configured to,” or the like indicates that one processor or multiple processors (in any combination) can perform the associated operation(s).
  • claim language reciting “at least one processor configured to: X, Y, and Z” means a single processor can be used to perform operations X, Y, and Z; or that multiple processors are each tasked with a certain subset of operations X, Y, and Z such that together the multiple processors perform X, Y, and Z; or that a group of multiple processors work together to perform operations X, Y, and Z.
• claim language reciting “at least one processor configured to: X, Y, and Z” can mean that any single processor may only perform at least a subset of operations X, Y, and Z.
• Such techniques may be implemented in any of a variety of devices such as general-purpose computers, wireless communication device handsets, or integrated circuit devices having multiple uses including application in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a computer-readable data storage medium including program code including instructions that, when executed, perform one or more of the methods described above.
  • the computer-readable data storage medium may form part of a computer program product, which may include packaging materials.
• the computer-readable medium may include memory or data storage media, such as random-access memory (RAM) such as synchronous dynamic random-access memory (SDRAM), read-only memory (ROM), non-volatile random-access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, magnetic or optical data storage media, and the like.
• the techniques additionally, or alternatively, may be realized at least in part by a computer-readable communication medium that carries or communicates program code in the form of instructions or data structures and that can be accessed, read, and/or executed by a computer, such as propagated signals or waves.
• the program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
  • a general-purpose processor may be a microprocessor; but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
• a processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration). Accordingly, the term “processor,” as used herein may refer to any of the foregoing structure, any combination of the foregoing structure, or any other structure or apparatus suitable for implementation of the techniques described herein.
• Illustrative aspects of the disclosure include:
• Aspect 1. A method for calibrating a quantized machine-learning model, comprising: obtaining the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determining a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determining a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.
• Aspect 2. The method of aspect 1, wherein determining the bias comprises: normalizing at least one input vector using the normalization function of the quantized machine-learning model to generate the at least one output, wherein the at least one output comprises at least one output vector; determining a difference between a sum of values of the at least one output vector and an expected output sum; and determining the bias based on the difference.
• Aspect 3. The method of aspect 2, wherein the correction factor is determined further based on a size of a set of data input to the normalization function.
• Aspect 4. The method of any one of aspects 1 to 3, wherein the correction factor is to be combined with each value of each output vector of the normalization function when the normalization function generates the output vectors.
• Aspect 5. The method of any one of aspects 1 to 4, wherein the correction factor comprises two or more correction factors to be combined with respective output vectors of the normalization function when the normalization function generates the output vectors.
• Aspect 6. The method of any one of aspects 1 to 5, wherein the normalization function comprises an activation function configured to generate a probability distribution.
• Aspect 7. The method of any one of aspects 1 to 6, wherein the normalization function is configured to receive an input vector and generate an output vector.
• Aspect 8. The method of aspect 7, wherein the normalization function is configured such that values of the output vector sum to an expected output sum.
• Aspect 9. The method of any one of aspects 1 to 8, wherein the normalization function comprises at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.
• Aspect 10. The method of any one of aspects 1 to 9, wherein the quantized machine-learning model comprises a diffusion model.
• Aspect 11. The method of any one of aspects 1 to 10, wherein the quantized machine-learning model comprises a language transformer.
• Aspect 12. The method of any one of aspects 1 to 11, wherein obtaining the quantized machine-learning model comprises changing a format associated with a machine-learning model from the first data format to the second data format to generate the quantized machine-learning model.
• Aspect 13. The method of aspect 12, wherein changing the format associated with the machine-learning model comprises changing parameters of the machine-learning model from being stored according to the first data format to being stored according to the second data format.
• Aspect 14. The method of aspect 13, wherein the parameters comprise one or both of weights and activations of the machine-learning model.
• Aspect 15. The method of any one of aspects 1 to 14, wherein the machine-learning model on which the quantized machine-learning model is based stores parameters of the machine-learning model according to the first data format and the quantized machine-learning model stores parameters of the quantized machine-learning model according to the second data format.
• Aspect 16.
• Aspect 17. The method of any one of aspects 1 to 16, wherein the first data format comprises a floating-point number data format and the second data format comprises an integer data format.
• Aspect 18. The method of aspect 17, wherein the integer data format comprises an 8-bit integer data format.
• Aspect 19. A method for normalizing a vector in a quantized machine-learning model, comprising: normalizing an input vector using a normalization function to generate an intermediate vector; and combining a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model.
• Aspect 20. The method of aspect 19, wherein the normalization function comprises an activation function configured to generate a probability distribution.
• Aspect 21. The method of any one of aspects 19 or 20, wherein the normalization function is configured such that values of the intermediate vector sum to an expected output sum.
• Aspect 22. The method of any one of aspects 19 to 21, wherein the normalization function comprises at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.
• The quantized machine-learning model comprises a language transformer.
• Aspect 26. The method of any one of aspects 19 to 21, wherein the normalization function comprises at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.
• Aspect 27. The method of any one of aspects 19 to 26, further comprising: obtaining a tensor, wherein the tensor comprises the input vector and one or more vectors; and normalizing the one or more vectors using the normalization function to generate one or more intermediate vectors.
• Aspect 28. The method of aspect 27, further comprising combining the predetermined value with each value of the one or more intermediate vectors to generate one or more output vectors.
• Aspect 30. An apparatus for calibrating a quantized machine-learning model, the apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory and configured to: obtain the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determine a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determine a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.
• Aspect 31. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by at least one processor, cause the at least one processor to perform operations according to any of aspects 1 to 29.
• Aspect 32. An apparatus for providing virtual content for display, the apparatus comprising one or more means for performing operations according to any of aspects 1 to 29.


Abstract

Systems and techniques are described herein for calibrating a quantized machine-learning model. For instance, a method for calibrating a quantized machine-learning model is provided. The method may include obtaining the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determining a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determining a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.

Description

CALIBRATING A QUANTIZED MACHINE-LEARNING MODEL TECHNICAL FIELD [0001] The present disclosure generally relates to quantized machine-learning models. For example, aspects of the present disclosure include systems and techniques for calibrating quantized machine-learning models. BACKGROUND [0002] Machine-learning models can be trained to perform operations. For example, during training, a machine-learning model may be provided with a training data set and be allowed to perform operations based on the training data set. The model may also be given conditions describing successful performance of the operations. The model may iteratively adapt internal parameters (e.g., weights) of the model to minimize a difference (e.g., an “error”) between the model’s performance of the operations and the described successful performance. After being trained (at a phase of operation that may be referred to as “inference”), the model may perform the operations based on the parameters (e.g., the weights) which were adapted through the training to perform the operations. Generative machine-learning models (e.g., stable diffusion models and language transformers) can be trained to generate content. [0003] One type of machine-learning model that has proven extremely versatile is the artificial neural network model, or neural network for short. A neural network is generally a collection of nodes connected by weighted edges that transmit data signals between the nodes. The nodes are commonly organized into layers, with one layer’s output (e.g., a lower layer) feeding the next layer’s input (e.g., a higher layer). Different layers may generally be configured to perform different types of transformations on their inputs, such as convolutional transformations. A node in a layer may receive one or more data signals from inbound edges connected to other nodes, and those data signals may be adjusted by the edge weights. Further, a node may have a bias as an independent input data signal. The node processes all of the input data signals, for example, with a linear or non-linear function and then “activates” based on the output of the function. In some cases, the activation of a node may cause the transmission of further data signals to further connected nodes. The output of the neural network is often referred to as an inference, and can take many forms, such as a numerical output, a classification output, and others. Training a neural network, often referred to as deep learning, generally involves adjusting the values of the edge weights and biases until the output of the neural network meets some task performance objective. [0004] While neural networks are powerful machine learning model architectures capable of a wide range of useful tasks, such as recognizing objects in image data, they are likewise highly resource dependent. For example, neural networks may require significant compute, memory, power, and/or time resources for training and/or for inferencing. These resource requirements may significantly limit the ability to train and deploy neural networks to certain types of devices and for certain use cases. Thus, there is typically a significant trade-off between neural network task performance and resource usage associated with training and using the neural network. SUMMARY [0005] The following presents a simplified summary relating to one or more aspects disclosed herein. 
Thus, the following summary should not be considered an extensive overview relating to all contemplated aspects, nor should the following summary be considered to identify key or critical elements relating to all contemplated aspects or to delineate the scope associated with any particular aspect. Accordingly, the following summary presents certain concepts relating to one or more aspects relating to the mechanisms disclosed herein in a simplified form to precede the detailed description presented below. [0006] Systems and techniques are described for calibrating a quantized machine-learning model. According to at least one example, a method is provided for calibrating a quantized machine-learning model. The method includes: obtaining the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determining a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determining a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors. [0007] In another example, an apparatus for calibrating a quantized machine-learning model is provided that includes at least one memory and at least one processor (e.g., configured in circuitry) coupled to the at least one memory. The at least one processor is configured to: obtain the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determine a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determine a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors. [0008] In another example, a non-transitory computer-readable medium is provided that has stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: obtain the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determine a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determine a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors. [0009] In another example, an apparatus for calibrating a quantized machine-learning model is provided. 
The apparatus includes: means for obtaining the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; means for determining a bias associated with at least one output of a normalization function of the quantized machine-learning model; and means for determining a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors. [0010] Systems and techniques are described for normalizing a vector in a quantized machine-learning model. According to at least one example, a method is provided for normalizing a vector in a quantized machine-learning model. The method includes: normalizing an input vector using a normalization function to generate an intermediate vector; and combining a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model. [0011] In another example, an apparatus for normalizing a vector in a quantized machine-learning model is provided that includes at least one memory and at least one processor (e.g., configured in circuitry) coupled to the at least one memory. The at least one processor is configured to: normalize an input vector using a normalization function to generate an intermediate vector; and combine a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model. [0012] In another example, a non-transitory computer-readable medium is provided that has stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: normalize an input vector using a normalization function to generate an intermediate vector; and combine a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model. [0013] In another example, an apparatus for normalizing a vector in a quantized machine-learning model is provided. The apparatus includes: means for normalizing an input vector using a normalization function to generate an intermediate vector; and means for combining a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model. 
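Taken together, the two summarized techniques amount to: (1) estimating the bias of a quantized normalization function on a calibration set and deriving a correction factor, and (2) combining that predetermined value with each value of the normalization function’s intermediate output at inference. The following is a minimal sketch of that idea, not the claimed implementation; SoftMax as the normalization function, the `quantized_softmax` callable (assumed to return dequantized output values), and all other names are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    # Numerically stable SoftMax; the reference (unquantized) normalization.
    z = np.exp(x - x.max())
    return z / z.sum()

def calibrate_correction(quantized_softmax, calibration_vectors, expected_sum=1.0):
    # Difference between the expected output sum (1 for SoftMax) and the sum
    # of the quantized outputs, recorded for each calibration vector.
    differences = [expected_sum - quantized_softmax(v).sum()
                   for v in calibration_vectors]
    bias = float(np.mean(differences))      # statistical estimate (average)
    n = len(calibration_vectors[0])
    return bias / n                         # spread the adjustment over the vector

def normalize_with_correction(x, correction):
    intermediate = softmax(x)               # intermediate vector
    return intermediate + correction        # combine the predetermined value
```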
[0014] In some aspects, one or more of the apparatuses described herein is, can be part of, or can include a mobile device (e.g., a mobile telephone or so-called “smart phone”, a tablet computer, or other type of mobile device), an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a vehicle (or a computing device or system of a vehicle), a smart or connected device (e.g., an Internet-of-Things (IoT) device), a wearable device, a personal computer, a laptop computer, a video server, a television (e.g., a network-connected television), a robotics device or system, or other device. In some aspects, each apparatus can include an image sensor (e.g., a camera) or multiple image sensors (e.g., multiple cameras) for capturing one or more images. In some aspects, each apparatus can include one or more displays for displaying one or more images, notifications, and/or other displayable data. In some aspects, each apparatus can include one or more speakers, one or more light-emitting devices, and/or one or more microphones. In some aspects, each Qualcomm Ref. No.2307322WO apparatus can include one or more sensors. In some cases, the one or more sensors can be used for determining a location of the apparatuses, a state of the apparatuses (e.g., a tracking state, an operating state, a temperature, a humidity level, and/or other state), and/or for other purposes. [0015] This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this patent, any or all drawings, and each claim. [0016] The foregoing, together with other features and aspects, will become more apparent upon referring to the following specification, claims, and accompanying drawings. BRIEF DESCRIPTION OF THE DRAWINGS [0017] Illustrative examples of the present application are described in detail below with reference to the following figures: [0018] FIG. 1 illustrates an example implementation of a system, which may include a central processing unit (CPU), configured to perform one or more of the functions described herein; [0019] FIG.2 is a block diagram illustrating an example machine-learning model, according to various aspects of the present disclosure; [0020] FIG. 3 provides two sets of images that show the forward diffusion process (which is fixed) and the reverse diffusion process (which is learned) of a diffusion model; [0021] FIG.4 is a diagram illustrating how diffusion data is distributed from initial data to noise using a diffusion model in the forward diffusion direction, in accordance with some aspects of the present disclosure; [0022] FIG.5 is a diagram illustrating a U-Net architecture for a diffusion model, in accordance with some aspects of the present disclosure; [0023] FIG.6 is a diagram illustrating an example attention block, according to various aspects of the present disclosure; [0024] FIG.7 is a diagram illustrating an example system for calibrating a quantized machine- learning model, according to various aspects of the present disclosure; Qualcomm Ref. No.2307322WO [0025] FIG. 8A is a diagram illustrating a system including a normalization function for normalizing values; [0026] FIG. 8B is a diagram illustrating a system including a normalization function for normalizing values; [0027] FIG. 
8C is a diagram illustrating a system including a normalization function for normalizing values; [0028] FIG. 9 is a diagram illustrating a tensor that may be input into a calibration function, according to various aspects of the present disclosure; [0029] FIG.10 is a diagram illustrating the tensor of FIG.9 with example formats of correction factors that may be applied when normalizing tensor, according to various aspects of the present disclosure; [0030] FIG. 11 is a flow diagram illustrating an example process for calibrating a quantized machine-learning model, in accordance with aspects of the present disclosure; [0031] FIG. 12 is a flow diagram illustrating an example process for normalizing a vector in a quantized machine-learning model, in accordance with aspects of the present disclosure; [0032] FIG. 13 is a block diagram illustrating an example of a deep learning neural network that can be used to implement a perception module and/or one or more validation modules, according to some aspects of the disclosed technology; [0033] FIG. 14 is a block diagram illustrating an example of a convolutional neural network (CNN), according to various aspects of the present disclosure; and [0034] FIG. 15 is a block diagram illustrating an example computing-device architecture of an example computing device which can implement the various techniques described herein. DETAILED DESCRIPTION [0035] Certain aspects of this disclosure are provided below. Some of these aspects may be applied independently and some of them may be applied in combination as would be apparent to those of skill in the art. In the following description, for the purposes of explanation, specific details are set forth in order to provide a thorough understanding of aspects of the application. Qualcomm Ref. No.2307322WO However, it will be apparent that various aspects may be practiced without these specific details. The figures and description are not intended to be restrictive. [0036] The ensuing description provides example aspects only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the exemplary aspects will provide those skilled in the art with an enabling description for implementing an exemplary aspect. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the application as set forth in the appended claims. [0037] The terms “exemplary” and/or “example” are used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” and/or “example” is not necessarily to be construed as preferred or advantageous over other aspects. Likewise, the term “aspects of the disclosure” does not require that all aspects of the disclosure include the discussed feature, advantage, or mode of operation. [0038] As described above, machine-learning models can be trained to perform operations. Training of machine-learning models may be a computationally intensive process that may take a relatively long time, a large quantity of training data, and many operations. For example, it may take hundreds of hours to train a machine-learning model. [0039] Quantization is a method of mapping continuous values to a smaller set of discrete finite values. 
For example, quantization approximates real-world values (e.g., floating point values) with representative values (e.g., integer values) that limit the precision and range of the original input. When applied to machine learning, such as to a neural network, quantization may significantly reduce resource usage for both training and inferencing. For example, performing massive numbers of integer operations during training of or inferencing with a quantized (e.g., reduced-precision) neural network may be significantly more efficient in terms of resource usage as compared to performing floating point operations with an unquantized (e.g., full-precision) neural network processing the same input data. [0040] Quantization of trained machine-learning models allows quantized trained machine-learning models to be efficiently deployed on various devices. Quantization may allow a quantized trained machine-learning model to perform operations (e.g., at the inference phase of operation) on a device relatively quickly. Quantization may include changing a format of parameters (e.g., weights and activations) of a machine-learning model from the format in which the machine-learning model was trained to a different format. Post-Training Quantization (PTQ) is a process by which a trained machine-learning model is quantized. For example, during training, a machine-learning model may store parameters (e.g., weights) according to a first format that may allow a high degree of precision. For example, during training, the model may store parameters as floating-point numbers, such as 16-bit floating point numbers, which may be referred to as float16 or FP16, or 32-bit floating point numbers, which may be referred to as float32 or FP32. According to PTQ, after training, before a model is deployed onto a device for use by the device, the model may be quantized by, in part, changing the format used to store the parameters of the model from the first format to a second format. The second format may use less memory and/or be less computationally expensive to use. For example, after quantization, the model may store parameters as integer numbers such as 16-bit integer numbers, which may be referred to as Int16, or 8-bit integer numbers, which may be referred to as Int8. It may be less computationally expensive to store and/or operate using integers than floating point numbers. Thus, a device may conserve power and/or processing time when using a quantized trained machine-learning model as compared with using an unquantized trained machine-learning model. Accordingly, it may be advantageous to quantize machine-learning models. [0041] Not all elements of a neural network are equally resilient to quantization. That is, certain elements of a neural network may be more sensitive to quantization than others. Consequently, quantizing an entire neural network to a uniform bitwidth (referred to as fixed-precision quantization (FPQ)) may result in reduced resource usage, but may also reduce task performance to an unacceptable level. To resolve this issue, mixed-precision quantization (MPQ) seeks to quantize different elements of a neural network at varying quantization rates, which results in reduced resource usage while maintaining task performance. The various elements of a neural network that may be quantized with MPQ include, for example, individual layers, groups of layers, sub-layers, weight channels, and others. 
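As an illustration of the PTQ format change described above, the following is a minimal sketch of mapping float32 weights onto an int8 grid and back. The symmetric per-tensor scale with a zero-point of 0 is an assumption for simplicity; real PTQ pipelines choose scales and zero-points in various ways:

```python
import numpy as np

def quantize_int8(x, scale):
    # Map float values onto the int8 grid (symmetric, zero-point 0).
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize_int8(q, scale):
    # Recover approximate float values from the int8 representation.
    return q.astype(np.float32) * scale

weights = np.array([-0.52, 0.013, 0.4984], dtype=np.float32)
scale = np.abs(weights).max() / 127.0   # per-tensor scale (an assumption)
q = quantize_int8(weights, scale)
approx = dequantize_int8(q, scale)      # close to, but not equal to, weights
```

The round trip illustrates the loss of precision discussed above: the dequantized values are only approximations of the original floating-point weights.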
Generally, MPQ can achieve higher accuracy for the same computational budget because it allocates higher precision to elements (e.g., layers) that are more sensitive to quantization and reduces bitwidth for elements more robust to quantization. [0042] In various aspects of the present disclosure, the term “quantize,” “quantizing,” and like terms may be used as a verb and may be applied to machine-learning models, and/or layers of a trained machine-learning model. In such cases, the term “quantize,” “quantizing,” and like terms may refer to quantizing parameters (e.g., weights and/or activations) as stored in the respective machine-learning models and/or layers. For example, a trained machine-learning model may include weights (e.g., numerical values). During training, and after training, the weights may be stored in the trained machine-learning model in a first format (e.g., float16). Quantizing the trained machine-learning model may include changing the weights from the first format to a second format (e.g., Int8). [0043] Further, the machine-learning model may include activations. Activations may refer to values used as inputs and/or outputs of various functions, operations, and/or layers of the machine-learning model. The activations may have a format. For example, a function may expect to receive values formatted according to a certain format. For instance, the function may read the values from memory according to the certain format. Further, the function may output other values (e.g., processed by the function) according to a particular format (which may or may not be the same as the certain format). For instance, the function may write the other values to memory in the particular format. For example, an unquantized trained machine-learning model may use FP16 to pass values between functions (e.g., outputting values from one function and reading the values by another function). After quantization, the quantized trained machine-learning model may use Int8 to pass the values between functions. The term “quantize,” “quantizing,” and like terms may refer to changing how activations are stored and/or used in a quantized machine-learning model. [0044] Quantizing numbers can result in a loss of precision. Quantizing a trained machine-learning model can result in degradation of the model. In the present disclosure, the term “noise” is used to describe a degree of difference between outputs of an unquantized trained machine-learning model and the trained machine-learning model after being quantized (given the same inputs). For example, a trained machine-learning model may be quantized. The unquantized trained machine-learning model may be provided an input, and the quantized trained machine-learning model may be provided the same input. Outputs of the models may be compared. A degree of difference between the outputs may be described as noise. Noise may be a way to describe degradation of a quantized machine-learning model. [0045] Quantizing normalization functions of a trained machine-learning model may result in degradation of the model (e.g., as evidenced by noise). A normalization function may generate results that sum to a certain number. For instance, a normalization function may generate a probability distribution including a number of outputs that sum to 1. The probability distribution may represent the probability of each of the outputs. 
Examples of normalization functions include: SoftMax, batch normalization (BatchNorm), layer normalization (LayerNorm), group normalization (GroupNorm), and instance normalization (InstanceNorm). Quantizing a normalization function of a trained machine-learning model may cause the quantized trained machine-learning model to exhibit more noise than quantizing other functions, operations, and/or layers of the trained machine-learning model. For example, quantizing a SoftMax operation (e.g., quantizing the activations, such as the output, of the SoftMax function) of a trained machine-learning model may result in more noise than quantizing inputs to the SoftMax function, queries of a transformer including the SoftMax function, keys of the transformer, and/or values of the transformer. In some cases, small values (e.g., small relative to the resolution of the output data format and/or characteristics of the output distribution) output by a quantized normalization function may be rounded to zero. If the SoftMax function is quantized, many of those outputs may be rounded to 0. For example, if the outputs of a SoftMax function are stored as int8 in 255 steps between 0 and 1, values less than 0.001953 may be rounded to zero. If many outputs of a SoftMax function are rounded to 0, it is possible that the sum of the outputs of the SoftMax function is not 1. The sum of the outputs of the quantized SoftMax function not being 1 may be a cause of degradation in the quantized machine-learning model. [0046] Systems, apparatuses, methods (also referred to as processes), and computer-readable media (collectively referred to herein as “systems and techniques”) are described herein for calibrating a quantized machine-learning model. The systems and techniques described herein may obtain a quantized machine-learning model. In some cases, the systems and techniques may obtain a machine-learning model and quantize the machine-learning model. In other cases, the systems and techniques may obtain the quantized machine-learning model, which may have been quantized by another system or technique. In either case, the unquantized machine-learning model may be associated with a first data format. For example, the unquantized machine-learning model may include and/or store parameters (e.g., weights and activations) according to the first data format. The quantized machine-learning model may be associated with a second data format. For example, the quantized machine-learning model may include and/or store parameters (e.g., weights and activations) according to the second data format. The second data format may be smaller and/or less computationally expensive to use than the first data format. For example, the first data format may be a floating-point data format (e.g., FP16 or FP32) and the second data format may be an integer data format (e.g., int8 or int16). [0047] The systems and techniques may determine a bias associated with an output of a normalization function of the quantized machine-learning model. For example, the systems and techniques may provide an input vector (e.g., of a calibration data set) to the normalization function. The normalization function may generate an output (e.g., an output vector). The systems and techniques may compare a sum of values of the output vector to an expected output sum. For example, the normalization function may be a SoftMax function. The expected output sum may be 1. For example, it may be expected that any output vector of a SoftMax function sums to 1. 
The systems and techniques may determine a bias of the normalization function of the quantized machine-learning model based on the difference between the sum of the values of the output vector and the expected output sum. In some cases, the systems and techniques may provide a number of input vectors (e.g., from the calibration data set) to the normalization function and determine a number of differences. The systems and techniques may use a statistical technique (e.g., an average) to determine the bias based on the number of differences. [0048] The systems and techniques may determine a correction factor based on the bias. For example, the systems and techniques may determine that the bias is the amount by which output vectors of the normalization function should be adjusted to correct the bias. Because the output may be a vector, the systems and techniques may determine to spread the adjustment to correct the bias across the vector. For example, the systems and techniques may determine the correction factor to be the bias divided by the number of values in the output vector. The systems and techniques may modify the normalization function of the quantized machine-learning model such that the normalization function applies the correction factor to output vectors (e.g., by adding the correction factor to each value of each output vector generated by the normalization function at inference). By modifying the normalization function of the quantized machine-learning model, the systems and techniques may be calibrating the quantized machine-learning model. Further, the systems and techniques may be specifically calibrating the normalization function of the quantized machine-learning model to correct bias added by the quantization of the normalization function. [0049] Calibrating the quantized machine-learning model (or the normalization function of the quantized machine-learning model) may decrease or eliminate degradation of the quantized machine-learning model. For example, a calibrated quantized machine-learning model may exhibit less noise than an uncalibrated quantized machine-learning model. [0050] Various aspects of the application will be described with respect to the figures below. [0051] FIG. 1 illustrates an example implementation of a system 100, which may include a central processing unit (CPU 102) (which may be a multi-core CPU), configured to perform one or more of the functions described herein. Parameters or variables (e.g., neural signals and synaptic weights), system parameters associated with a computational device (e.g., neural network with weights), task information, among other information may be stored in a memory block associated with a neural processing unit (NPU 108), in a memory block associated with a CPU 102, in a memory block associated with a graphics processing unit (GPU 104), in a memory block associated with a digital signal processor (DSP 106), in a memory 116, and/or may be distributed across multiple blocks. Instructions executed at the CPU 102 may be loaded from a program memory associated with the CPU 102 or may be loaded from memory 116. 
[0052] The system 100 may also include additional processing blocks tailored to specific functions, such as the GPU 104, the DSP 106, a connectivity engine 118, which may include fifth generation (5G) connectivity, fourth generation long term evolution (4G LTE) connectivity, Wi- Fi connectivity, USB connectivity, Bluetooth connectivity, and the like, and a multimedia processor 112 that may, for example, detect and recognize gestures. In one implementation, the NPU is implemented in the CPU 102, the DSP 106, and/or the GPU 104. The system 100 may also include one or more sensor processor(s) 114, one or more image signal processors (ISP(s) 110), and/or navigation engine 120, which may include a global positioning system. In some examples, the sensor processor(s) 114 can be associated with or connected to one or more sensors for providing sensor input(s) to the sensor processor(s) 114. For example, the one or more sensors and sensor processor(s) 114 can be provided in, coupled to, or otherwise associated with a same computing device. [0053] The system 100 may be implemented as a system on a chip (SoC). The system 100 may be based on an Advanced Reduced Instruction Set Computer (RISC) Machine (ARM) instruction set. The system 100 and/or components thereof may be configured to perform machine learning techniques according to aspects of the present disclosure discussed herein. For example, the system 100 and/or components thereof may be configured to implement a machine-learning Qualcomm Ref. No.2307322WO model (e.g., a quantized trained machine-learning model) as described herein and/or according to aspects of the present disclosure. [0054] Machine learning (ML) can be considered a subset of artificial intelligence (AI). ML systems can include algorithms and statistical models that computer systems can use to perform various tasks by relying on patterns and inference, without the use of explicit instructions. One example of a ML system is a neural network (also referred to as an artificial neural network), which may include an interconnected group of artificial neurons (e.g., neuron models). Neural networks may be used for various applications and/or devices, such as image and/or video coding, image analysis and/or computer vision applications, Internet Protocol (IP) cameras, Internet of Things (IoT) devices, autonomous vehicles, service robots, among others. [0055] Individual nodes in a neural network may emulate biological neurons by taking input data and performing simple operations on the data. The results of the simple operations performed on the input data are selectively passed on to other neurons. Weight values are associated with each vector and node in the network, and these values constrain how input data is related to output data. For example, the input data of each node may be multiplied by a corresponding weight value, and the products may be summed. The sum of the products may be adjusted by an optional bias, and an activation function may be applied to the result, yielding the node’s output signal or “output activation” (sometimes referred to as a feature map or an activation map). The weight values may initially be determined by an iterative flow of training data through the network (e.g., weight values are established during a training phase in which the network learns how to identify particular classes by their typical input data characteristics). 
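The node computation described in the preceding paragraph — inputs multiplied by corresponding weights, the products summed, the sum adjusted by an optional bias, and an activation function applied — can be sketched as follows. ReLU is an assumed example activation; nothing here is specific to the disclosed systems:

```python
import numpy as np

def node_output(inputs, weights, bias):
    # Multiply each input by its corresponding weight value and sum the
    # products; the bias acts as an independent input data signal.
    pre_activation = np.dot(inputs, weights) + bias
    # Apply an activation function to the result (ReLU assumed here),
    # yielding the node's output signal ("output activation").
    return np.maximum(0.0, pre_activation)
```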
[0056] Different types of neural networks exist, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), multilayer perceptron (MLP) neural networks, transformer neural networks, diffusion-based neural networks, among others. For instance, convolutional neural networks (CNNs) are a type of feed- forward artificial neural network. Convolutional neural networks may include collections of artificial neurons that each have a receptive field (e.g., a spatially localized region of an input space) and that collectively tile an input space. RNNs work on the principle of saving the output of a layer and feeding this output back to the input to help in predicting an outcome of the layer. A GAN is a form of generative neural network that can learn patterns in input data so that the neural network model can generate new synthetic outputs that reasonably could have been from Qualcomm Ref. No.2307322WO the original dataset. A GAN can include two neural networks that operate together, including a generative neural network that generates a synthesized output and a discriminative neural network that evaluates the output for authenticity. In MLP neural networks, data may be fed into an input layer, and one or more hidden layers provide levels of abstraction to the data. Predictions may then be made on an output layer based on the abstracted data. [0057] Deep learning (DL) is one example of a machine learning technique and can be considered a subset of ML. Many DL approaches are based on a neural network, such as an RNN or a CNN, and utilize multiple layers. The use of multiple layers in deep neural networks can permit progressively higher-level features to be extracted from a given input of raw data. For example, the output of a first layer of artificial neurons becomes an input to a second layer of artificial neurons, the output of a second layer of artificial neurons becomes an input to a third layer of artificial neurons, and so on. Layers that are located between the input and output of the overall deep neural network are often referred to as hidden layers. The hidden layers learn (e.g., are trained) to transform an intermediate input from a preceding layer into a slightly more abstract and composite representation that can be provided to a subsequent layer, until a final or desired representation is obtained as the final output of the deep neural network. [0058] As noted above, a neural network is an example of a machine learning system, and can include an input layer, one or more hidden layers, and an output layer. Data is provided from input nodes of the input layer, processing is performed by hidden nodes of the one or more hidden layers, and an output is produced through output nodes of the output layer. Deep learning networks typically include multiple hidden layers. Each layer of the neural network can include feature maps or activation maps that can include artificial neurons (or nodes). A feature map can include a filter, a kernel, or the like. The nodes can include one or more weights used to indicate an importance of the nodes of one or more of the layers. In some cases, a deep learning network can have a series of many hidden layers, with early layers being used to determine simple and low-level characteristics of an input, and later layers building up a hierarchy of more complex and abstract characteristics. [0059] A deep learning architecture may learn a hierarchy of features. 
If presented with visual data, for example, the first layer may learn to recognize relatively simple features, such as edges, in the input stream. In another example, if presented with auditory data, the first layer may learn to recognize spectral power in specific frequencies. The second layer, taking the output of the Qualcomm Ref. No.2307322WO first layer as input, may learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data. For instance, higher layers may learn to represent complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases. [0060] Deep learning architectures may perform especially well when applied to problems that have a natural hierarchical structure. For example, the classification of motorized vehicles may benefit from first learning to recognize wheels, windshields, and other features. These features may be combined at higher layers in different ways to recognize cars, trucks, and airplanes. [0061] Neural networks may be designed with a variety of connectivity patterns. In feed- forward networks, information is passed from lower to higher layers, with each neuron in a given layer communicating to neurons in higher layers. A hierarchical representation may be built up in successive layers of a feed-forward network, as described above. Neural networks may also have recurrent or feedback (also called top-down) connections. In a recurrent connection, the output from a neuron in a given layer may be communicated to another neuron in the same layer. A recurrent architecture may be helpful in recognizing patterns that span more than one of the input data chunks that are delivered to the neural network in a sequence. A connection from a neuron in a given layer to a neuron in a lower layer is called a feedback (or top-down) connection. A network with many feedback connections may be helpful when the recognition of a high-level concept may aid in discriminating the particular low-level features of an input. [0062] FIG. 2 is a block diagram illustrating an example machine-learning model 200, according to various aspects of the present disclosure. The machine-learning model 200 may be a diffusion model. The machine-learning model 200 may generate an output image based on a user prompt. For example, a textual user prompt (e.g., “a person riding a horse”) may be provided to the machine-learning model 200. The machine-learning model 200 may generate text embeddings based on the user prompt (e.g., using a text encoder, such as a frozen Contrastive Vision Language Pre-training (CLIP) encoder). Additionally, the machine-learning model 200 may obtain a latent seed (e.g., gaussian noise, for example, having a mean of 0 and a variance of 1). The machine-learning model 200 may arranged values of the latent seed into a grid. The machine-learning model 200 may provide the latents and the text embeddings to a U-Net (e.g., a U-Net encoder decoder). The U-Net may generate conditioned latents which may be combined, using a scheduler algorithm, with the Gaussian noise and provided as input to the U-Net. The Qualcomm Ref. No.2307322WO conditioned outputs may again be combined with the latents and input into the U-Net a number (N) of times. The conditioned latents may be provided to a variational autoencoder decoder which may generate the output image based on the conditioned latents. 
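The text-to-image flow described for FIG. 2 can be sketched as a loop. The component names and the scheduler interface below are hypothetical stand-ins for the text encoder, U-Net, scheduler algorithm, and variational autoencoder decoder, not an actual API:

```python
def generate_image(prompt, text_encoder, unet, scheduler, vae_decoder, num_steps):
    # Encode the textual user prompt into text embeddings.
    embeddings = text_encoder(prompt)
    # Obtain a latent seed of Gaussian noise.
    latents = scheduler.initial_noise()
    # Run the U-Net N times, combining the conditioned latents with noise
    # according to the scheduler algorithm.
    for t in scheduler.timesteps(num_steps):
        noise_estimate = unet(latents, t, embeddings)
        latents = scheduler.step(latents, noise_estimate, t)
    # Decode the conditioned latents into the output image.
    return vae_decoder(latents)
```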
[0063] A diffusion model provides a general-purpose, high quality, model for performing a task (e.g., image generation, depth estimation, optical flow estimation, stereo estimation, etc.) and enables more general applications as well. Diffusion models are latent-variable generative models trained to transform a sample of noise into a sample from a data distribution. For example, a diffusion model can define a Markov chain of diffusion steps to slowly add random noise (e.g., Gaussian noise) to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Once trained, the diffusion model can successfully perform a particular task (e.g., object classification, depth estimation, etc.) when provided a conditioning image or other conditional input and random noise to perform reverse diffusion for task-specific prediction. [0064] Diffusion models (e.g., diffusion-based neural networks) can also be referred to as diffusion probabilistic models. Diffusion models are latent-variable models. For example, a diffusion model defines a Markov chain of diffusion steps to slowly add random noise (e.g., Gaussian noise) to data and then learns to reverse the diffusion process to construct desired data samples from the noise. For instance, a diffusion model can be trained using a forward diffusion process (which is fixed) and a reverse diffusion process (which is learned). A diffusion model can be trained to be able to perform a generative process (e.g., a denoising process). One example goal of a diffusion model is to be able to denoise any arbitrary noise added to input data (e.g., a video). [0065] FIG. 3 provides two sets of images 300 that show the forward diffusion process (which is fixed) and the reverse diffusion process (which is learned) of a diffusion model. As shown in the forward diffusion process of FIG. 3, noise 303 is gradually added to a first set of images 302 at different time steps for a total of T time steps (e.g., making up a Markov chain), producing a sequence of noisy samples X1 through XT. [0066] Diffusion models from a training perspective will take an image and will slowly add noise to the image to destroy the information in the image. In some aspects, the noise 303 is Gaussian noise. Each time step can correspond to each consecutive image of the first set of images 302 shown in FIG. 3. The initial image X0 of FIG. 3 is of a cat. Addition of the noise 303 to each image (corresponding to noisy samples X1 to XT) results in gradual diffusion of the pixels in each image until the final image (corresponding to sample XT) essentially matches the noise distribution. For example, by adding the noise, each data sample X1 through XT gradually loses its distinguishable features as the time step becomes larger, eventually resulting in the final sample XT being equivalent to the target noise distribution, for instance a unit-variance zero-mean Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$. [0067] The second set of images 304 shows the reverse diffusion process in which XT is the starting point with a noisy image (e.g., one that has Gaussian noise). The diffusion model can be trained to reverse the diffusion process (e.g., by training a model $p_\theta(x_{t-1} \mid x_t)$) to generate new data. In some aspects, a diffusion model can be trained by finding the reverse Markov transitions that maximize the likelihood of the training data. By traversing backwards along the chain of time steps, the diffusion model can generate the new data. For example, as shown in FIG. 3, the reverse diffusion process proceeds to generate X0 as the image of the cat. In other cases, the input data and output data can vary based on the task for which the diffusion model is trained. [0068] As noted above, the diffusion model is trained to be able to denoise or recover the original image X0 in an incremental process as shown in the second set of images 304. In some aspects, the neural network of the diffusion model can be trained to recover Xt given Xt-1, such as provided in the below example equation: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\big)$
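A short numerical sketch of the forward process follows. Rather than iterating the kernel above T times, $x_t$ can be drawn in one step using the closed form introduced in the next paragraphs; NumPy is used purely for illustration and the function name is an assumption:

```python
import numpy as np

def forward_diffuse(x0, betas, t, rng=None):
    # alpha_bar_t = product over s <= t of (1 - beta_s); see the definition
    # in the following paragraph.
    rng = rng or np.random.default_rng()
    alpha_bar_t = np.prod(1.0 - betas[: t + 1])
    eps = rng.standard_normal(x0.shape)
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```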
For example, as shown in FIG.3, the reverse diffusion process proceeds to generate X0 as the image of the cat. In other cases, the input data and output data can vary based on the task for which the diffusion model is trained. [0068] As noted above, the diffusion model is trained to be able to denoise or recover the original image X0 in an incremental process as shown in the second set of images 304. In some aspects, the neural network of the diffusion model can be trained to recover Xt given Xt-1, such as provided in the below example equation: ^^ ^ ^^௧| ^^௧ି^ ^ ൌ ^^൫ ^^௧; ^1 െ ^^௧ ^^௧ି^, ^^௧ ^^൯ [0069]
Define $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, so that $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)$. [0070] Sampling can be defined as
$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. [0071] In some aspects, the $\beta_t$ (variance
schedule) is designed such that $\bar{\alpha}_T \approx 0$ and $q(x_T \mid x_0) \approx \mathcal{N}(x_T;\ \mathbf{0},\ \mathbf{I})$. [0072] The diffusion model runs in an iterative manner to incrementally generate the input image X0. In one example, the model may have twenty steps. However, in other examples, the number of steps can vary. [0073] FIG. 4 is a diagram 400 illustrating how diffusion data is distributed from initial data to noise using a diffusion model in the forward diffusion direction, in accordance with some aspects. Note that the initial data q(X0) is detailed in the initial stage of the diffusion process. An illustrative example of the data q(X0) is the initial image of the cat shown in FIG. 3. As the diffusion model iterates and iteratively adds sampled noise to the data from t = 0 to t = T, as shown in FIG. 4, the data becomes noisier and may ultimately result in pure noise (e.g., at q(XT)). The example of FIG. 4 illustrates the progression of the data and how it becomes diffused with noise in the forward diffusion process. [0074] In some aspects, the diffused data distribution (e.g., as shown in FIG. 4) can be as follows: $q(x_t) = \int q(x_0, x_t)\,dx_0 = \int q(x_0)\,q(x_t \mid x_0)\,dx_0$. [0075] In the
above diffused data distribution, $q(x_0, x_t)$ represents the joint distribution, $q(x_0)$ represents the input data distribution, and $q(x_t \mid x_0)$ is the diffusion kernel. In this regard, the model can sample $x_t \sim q(x_t)$ by first sampling $x_0 \sim q(x_0)$ and then sampling $x_t \sim q(x_t \mid x_0)$ (which may be referred to as ancestral sampling). The diffusion kernel takes the input and returns a vector or other data structure as output. [0076] The following is a summary of a training algorithm and a sampling algorithm for a diffusion model. A training algorithm can include the following steps:
1: repeat
2: $x_0 \sim q(x_0)$
3: $t \sim \mathrm{Uniform}(\{1, \dots, T\})$
4: $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
5: Take gradient descent step on $\nabla_{\phi}\left\lVert \epsilon - \epsilon_{\phi}\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\right\rVert^{2}$
6: until converged
[0077] A sampling algorithm can include the following steps:
1: $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
2: for $t = T, \dots, 1$ do
3: $z \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
4: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_{\phi}(x_t, t)\Big) + \sigma_t z$
5: end for
6: return $x_0$
[0078] FIG. 5 is a diagram illustrating a U-Net architecture 500 for a diffusion model, in accordance with some aspects. The initial image 502 (e.g., of a cat) is provided to the U-Net architecture 500, which includes a series of residual network (ResNet) blocks and self-attention layers to represent the network $\epsilon_{\phi}(x_t, t)$. The U-Net architecture 500 also includes fully connected layers 508. In some cases, time representation 510 can be sinusoidal positional embeddings or random Fourier features. Noisy output 506 from the forward diffusion process is also shown. [0079] The U-Net architecture 500 includes a contracting path 504 and an expansive path 505 as shown in FIG. 5, which gives it the U-shaped architecture. The contracting path 504 can be a convolutional network that includes repeated convolutional layers (that apply convolutional operations), each followed by a rectified linear unit (ReLU) and a max pooling operation. When images are being processed (e.g., the image 502) during the contracting path 504, the spatial information of the image 502 is reduced as features are generated. The expansive path 505 combines the features and spatial information through a sequence of up-convolutions and concatenations with high-resolution features from the contracting path 504. Some of the layers can be self-attention layers, which leverage global interactions between semantic features at the end of the encoder to explicitly model full contextual information. [0080] FIG. 6 is a diagram illustrating an example attention block 600, according to various aspects of the present disclosure. The attention block 600 may be an example of an attention layer of the architecture 500 of FIG. 5 and/or of the U-Net of the machine-learning model 200 of FIG. 2. In general, the attention block 600 may combine Query (Q), Key (K), and Value (V) to generate an output. More specifically, the attention block 600 may perform a matrix multiplication of Q and K, scale the output of the matrix multiplication, optionally mask the output of the scaling, normalize the output of the mask (or of the scaling in cases in which the attention block 600 does not include the mask), and multiply the output of the normalization with V. SoftMax is provided as an example of a normalization function. SoftMax may generate a probability distribution, which may sum to 1. For example, SoftMax may implement the equation: $\mathrm{SoftMax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$ for $j = 1, \dots, K$. The sum of all outputs of the SoftMax function is 1.
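The effect motivating the disclosed calibration can be reproduced numerically: storing SoftMax outputs on a 255-step 8-bit grid rounds small probabilities (below roughly 0.00196, half a step) to zero, so the quantized outputs no longer sum to 1. A minimal demonstration, illustrative only:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def quantize_outputs_8bit(p):
    # Store outputs in 255 steps between 0 and 1; values below half a step
    # round to zero, as discussed above.
    return np.round(p * 255.0) / 255.0

scores = np.linspace(0.0, 6.0, 512)   # many small attention scores
p = softmax(scores)
q = quantize_outputs_8bit(p)
print(p.sum())                        # 1.0
print(q.sum())                        # noticeably less than 1.0
```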
[0081] A machine-learning model (e.g., a diffusion model or a language transformer) may include any number of normalization functions at a respective number of locations within the machine-learning model. SoftMax of the attention block 600 is provided as one example.

[0082] As described previously, quantizing a machine-learning model may result in degradation of the machine-learning model (e.g., as evidenced by noise). The quantization of normalization functions of the machine-learning model may be a significant source of the degradation. For example, quantizing a SoftMax function may cause outputs of the SoftMax function to no longer sum to 1.

[0083] FIG. 7 is a diagram illustrating an example system for calibrating a quantized machine-learning model 708, according to various aspects of the present disclosure. For example, a system 700 may obtain the quantized machine-learning model 708. In some cases, the system 700 may obtain a machine-learning model 702 and quantize the machine-learning model 702 to generate the quantized machine-learning model 708. In other cases, the system 700 may obtain the quantized machine-learning model 708 (e.g., which may have been quantized by another system or technique). In either case, the system 700 may calibrate the quantized machine-learning model 708 to generate a calibrated quantized machine-learning model 714.

[0084] The machine-learning model 702 may be, or may include, a machine-learning model trained to perform one or more operations and/or to generate one or more outputs. The machine-learning model 702 may be, or may include, a generative machine-learning model. For example, the machine-learning model 702 may be a diffusion model (e.g., as described with regard to FIG. 2, FIG. 3, FIG. 4, and FIG. 5). As another example, the machine-learning model 702 may be, or may include, a language transformer.

[0085] In various aspects, the machine-learning model 702 may include one or more normalization functions (including, in some examples, a normalization function 704). The normalization function 704 may be, or may include, an activation function (e.g., to provide activations to a further layer of the machine-learning model 702). The normalization function 704 may be configured to generate a probability distribution. The normalization function 704 may be, as examples, a SoftMax function, a batch normalization function (e.g., BatchNorm), a layer normalization function (e.g., LayerNorm), a group normalization function (e.g., GroupNorm), or an instance normalization function (e.g., InstanceNorm). The normalization function 704 may be associated with an expected output. For example, values of the normalization function 704 may be expected to sum to an expected output sum. For example, in cases in which the normalization function 704 generates a probability distribution, it may be expected that values of a vector output by the normalization function 704 sum to 1. For instance, in cases in which the normalization function 704 is a SoftMax function, it may be expected that values of an output of the normalization function 704 sum to 1.

[0086] The machine-learning model 702 may be associated with a first data format. For example, the machine-learning model 702 may be associated with a high-precision data format (e.g., a floating-point data format, such as FP16 or FP32). The machine-learning model 702 may include and/or store parameters according to the first data format. For example, the machine-learning model 702 may include weights.
The weights may be stored in the machine-learning model 702 according to the first data format (e.g., the weights may be stored as FP16 or FP32). Further, the machine-learning model 702 may include multiple layers that may have respective activations. The machine-learning model 702 may be configured to store the activations according to the first data format. The normalization function 704 may be configured to output data according to the first data format. For example, the normalization function 704 may be configured to output a vector, and the normalization function 704 may be configured such that each value of the vector may be according to the first data format.

[0087] A quantization engine 706 may quantize the machine-learning model 702 to generate the quantized machine-learning model 708. The quantization engine 706 may operate according to any suitable quantization techniques including, as examples, post-training quantization (PTQ) techniques, mixed-precision quantization (MPQ) techniques, automatic mixed precision (AMP) techniques, etc. In quantizing the machine-learning model 702 to generate the quantized machine-learning model 708, the quantization engine 706 may change a data format of the machine-learning model 702 from the first data format to a second data format. In quantizing the machine-learning model 702, the quantization engine 706 may change a data format of the output of the normalization function 704 from the first data format to the second data format.

[0088] As the machine-learning model 702 was associated with the first data format, the quantized machine-learning model 708 may be associated with the second data format. In various aspects, the second data format may have a lower precision than the first data format. For example, the second data format may be an integer format (e.g., int8, int16, etc.). The quantized machine-learning model 708 may include and/or store parameters according to the second data format. For example, the quantized machine-learning model 708 may include weights. The weights may be stored in the quantized machine-learning model 708 according to the second data format. Further, the quantized machine-learning model 708 may include multiple layers that may have respective activations. The quantized machine-learning model 708 may be configured to store the activations according to the second data format. A quantized normalization function 710 may be configured to output data according to the second data format. For example, the quantized normalization function 710 may be configured to output a vector, and the quantized normalization function 710 may be configured such that each value of the vector may be according to the second data format.

[0089] In some aspects, the machine-learning model 702 may include and/or store parameters according to multiple data formats (including the second data format). For example, the machine-learning model 702 may include some weights stored according to the first data format and some weights stored according to another data format. Additionally, or alternatively, the machine-learning model 702 may include multiple operations, functions, and/or layers with activations according to other respective data formats. Similarly, the quantized machine-learning model 708 may include and/or store parameters according to multiple data formats (including the first data format).
The machine-learning model 702 includes at least some parameters according to the first data format and the quantized machine-learning model 708 includes at least some parameters according to the second data format. For example, the normalization function 704 may be configured to output data according to a data format (e.g., the first data format) and the quantized normalization function 710 is configured to output data according to a different data format (e.g., the second data format). The quantized machine-learning model 708 is a quantized version of the machine-learning model 702, and outputs of the quantized normalization function 710 are quantized relative to outputs of the normalization function 704.

[0090] A calibration engine 712 may calibrate the quantized machine-learning model 708 to generate the calibrated quantized machine-learning model 714. The calibration engine 712 may determine a bias associated with an output of the quantized normalization function 710. For example, the calibration engine 712 may obtain calibration data 718, which may include data representative of the data used to train the machine-learning model 702. In some cases, the calibration data 718 may be a subset of the data used to train the machine-learning model 702. In some cases, the calibration engine 712 may provide the calibration data 718 to the quantized machine-learning model 708 and observe outputs of the quantized normalization function 710. In providing the calibration data 718 to the quantized machine-learning model 708, the calibration engine 712 may provide one or more input vectors to the quantized normalization function 710. In other cases, the calibration data 718 may include input vectors representative of input vectors that were provided to the normalization function 704 during the training of the machine-learning model 702, and the calibration engine 712 may provide one or more of the input vectors to the quantized normalization function 710. In any case, the quantized normalization function 710 may generate one or more outputs (e.g., one or more output vectors) based on the respective one or more input vectors, and the calibration engine 712 may observe the output(s).

[0091] The calibration engine 712 may compare the output(s) to an expected output. For example, the calibration engine 712 may compare a sum of values of each of the output vectors to an expected output sum. For example, the normalization function 704 may be a SoftMax with an expected output sum of 1. As additional examples, the quantized normalization function 710 may be a batch normalization, a layer normalization, a group normalization, or an instance normalization with an expected output sum that may be determined by the calibration engine 712. The calibration engine 712 may determine a bias of the quantized normalization function 710 based on the difference between the sum of the values of each of the output vectors and the expected output sum. The calibration engine 712 may use a statistical technique (e.g., an average or a median) to determine the bias based on the differences between the sums of a number of output vectors and the expected output sum.

[0092] The calibration engine 712 may determine a correction factor based on the bias. For example, the calibration engine 712 may determine that a calibrated quantized normalization function 716 should adjust outputs of the calibrated quantized normalization function 716 to correct for the bias.
Because the output of the quantized normalization function 710 may be a vector, the calibration engine 712 may determine to spread the adjustment to correct the bias across the vector. For example, the calibration engine 712 may determine the correction factor to be the bias divided by the number of values in the output vector. The calibration engine 712 may modify the quantized normalization function 710 to generate the calibrated quantized normalization function 716 such that the calibrated quantized normalization function 716 applies the correction factor to output vectors (e.g., by adding the correction factor to each value of each output vector generated by the calibrated quantized normalization function 716 when the calibrated quantized normalization function 716 operates, such as at inference). By modifying the calibrated quantized normalization function 716, the calibration engine 712 may calibrate the quantized machine-learning model 708 to generate the calibrated quantized machine-learning model 714. In other words, the calibration engine 712 calibrates the quantized normalization function 710 to generate the calibrated quantized normalization function 716, correcting bias added by the quantization of the quantized normalization function 710.

[0093] Calibrating the quantized normalization function 710 may decrease or eliminate degradation of the quantized machine-learning model 708. For example, the calibrated quantized machine-learning model 714 may exhibit less noise than the quantized machine-learning model 708.

[0094] FIG. 8A is a diagram illustrating a system 802 including a normalization function 806 for normalizing values. The normalization function 806 may be an example of the normalization function 704 of FIG. 7.

[0095] The normalization function 806 may receive an input vector 804 and generate and output an output vector 808. In the example illustrated in FIG. 8A, the input vector 804 includes the values [-1, 0, 3, 5], and the output vector 808 includes the values [0.002, 0.006, 0.118, 0.874]. The output vector 808 may be according to a first data format (e.g., a floating-point data format). Notably, the values of the output vector 808 may sum to 1.

[0096] FIG. 8B is a diagram illustrating a system 812 including a normalization function 816 for normalizing values. The normalization function 816 may be an example of the quantized normalization function 710 of FIG. 7.

[0097] The normalization function 816 may receive an input vector 814 (which, according to the example of FIG. 8B, may be the same as the input vector 804). The normalization function 816 may generate and output an output vector 818. In the example illustrated in FIG. 8B, the input vector 814 includes the values [-1, 0, 3, 5], and the output vector 818 includes values that may be represented as [0, 0.003922, 0.117647, 0.87451]. The output vector 818 may be according to a second data format (e.g., an integer data format). As such, the output vector 818 may store integer values that may represent scaled steps between 0 and 1. For example, the output vector 818 may store an integer value of 0 that may represent 0, an integer value of 1 that may represent 1/255 (or 0.003922), an integer value of 2 that may represent 2/255 (or 0.007843), etc. Notably, the values of the output vector 818 do not sum to 1; rather, the values of the output vector 818 sum to 0.996078, which is 0.003922 less than 1.
This bias may be a cause of degradation of a machine-learning model that includes the normalization function 816. It should be understood that while the output vector 818 of the example of FIG. 8B includes 4 values, other output vectors may include dozens, hundreds, or more values. The bias illustrated by the example of FIG. 8B may be exacerbated in cases in which the output vector includes more values.

[0098] FIG. 8C is a diagram illustrating a system 822 including a normalization function 826 for normalizing values. The normalization function 826 may be an example of the calibrated quantized normalization function 716 of FIG. 7.

[0099] The normalization function 826 may receive an input vector 824 (which, according to the example of FIG. 8C, may be the same as the input vector 804 and the input vector 814). The normalization function 826 may generate and output an intermediate vector 828. In the example illustrated in FIG. 8C, the input vector 824 may include the values [-1, 0, 3, 5], and the intermediate vector 828 may include values that may be represented as [0, 0.003922, 0.117647, 0.87451]. The intermediate vector 828 may be according to a second data format (e.g., an integer data format). As such, the intermediate vector 828 may store integer values that may represent scaled steps between 0 and 1. The values of the intermediate vector 828 do not sum to 1; rather, the values of the intermediate vector 828 sum to 0.996078, which is 0.003922 less than 1.

[0100] Prior to the deployment of the system 822, a calibration engine (e.g., the calibration engine 712 of FIG. 7) may determine that 0.003922 is the bias of the normalization function 826. For example, the calibration engine may have provided the normalization function 826 with calibration data and determined the bias through numerous calibration tests.

[0101] Further, the calibration engine may modify the system 822 such that, in operation, the system 822 corrects the bias by adding the bias to values of the intermediate vector 828. For example, the calibration engine may modify the system 822 such that the system 822 adds a fraction of the bias to each value of the intermediate vector 828 to generate an output vector 832. According to the example of FIG. 8C, the calibration engine may modify the system 822 such that the system 822 adds 0.003922 / 4 to each value of the intermediate vector 828 (e.g., based on the intermediate vector 828 including 4 values). Notably, after applying a correction factor 830 (i.e., the bias divided by the number of values of the intermediate vector 828) to the intermediate vector 828, the values of the output vector 832 sum to 1. Thus, the calibration engine may modify the system 822 to decrease or eliminate degradation of the quantized machine-learning model including the system 822. For example, a machine-learning model including the system 822 may exhibit less noise than a machine-learning model including the system 812.

[0102] FIG. 9 is a diagram illustrating a tensor 902 that may be input into a calibration function, according to various aspects of the present disclosure. The tensor 902 is three-dimensional. In particular, the tensor 902 includes "heads" values in a first dimension (e.g., a height dimension or a z dimension), Ly values in a second dimension (e.g., a width dimension or a y dimension), and Lx values in a third dimension (e.g., a depth dimension or an x dimension).
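The effect illustrated in FIGs. 8A-8C can be reproduced numerically. The sketch below is illustrative only: it quantizes SoftMax outputs to 255 uniform steps using truncation, so the exact quantized values (and therefore the exact bias) depend on the chosen rounding mode and may differ slightly from the figures' illustrative numbers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def quantize_unsigned(p, n_bits=8):
    """Represent probabilities as integer multiples of 1/(2**n_bits - 1),
    returning the dequantized (representable) values."""
    steps = (1 << n_bits) - 1              # 255 steps between 0 and 1
    return np.floor(p * steps) / steps     # truncation; other roundings differ

p = softmax(np.array([-1.0, 0.0, 3.0, 5.0]))   # sums to 1 in floating point
p_q = quantize_unsigned(p)                      # quantized; sums to less than 1
bias = 1.0 - p_q.sum()                          # expected sum minus observed sum
p_cal = p_q + bias / p_q.size                   # spread the bias across the vector
print(p_q.sum(), p_cal.sum())                   # e.g., 0.992157 -> 1.0
```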
[0103] A normalization function (e.g., any of the normalization function 704, the quantized normalization function 710, the calibrated quantized normalization function 716, the normalization function 806, the normalization function 816, the normalization function 826, and/or the system 822) may operate on a vector (e.g., a one-dimensional data structure). A machine-learning model may provide one vector of the tensor 902 to a normalization function at a time. For example, the machine-learning model may provide a vector of values defined by one head and one Ly to the normalization function at a time.

[0104] FIG. 10 is a diagram illustrating the tensor 902 of FIG. 9 with example formats of correction factors that may be applied when normalizing the tensor 902, according to various aspects of the present disclosure. As a first example, a calibration engine (e.g., 712 in FIG. 7) may determine a correction factor matrix 1004 (e.g., a two-dimensional matrix) that may be applied to vectors of the tensor 902. For example, the correction factor matrix 1004 may include "heads" * Ly values (e.g., one value corresponding to each vector of the tensor 902). The calibration engine may cause the normalization function to apply a value of the correction factor matrix 1004 to each corresponding vector of the tensor 902 when the normalization function normalizes the corresponding vector.

[0105] As a second example, the calibration engine may determine a correction factor vector 1006 (e.g., a one-dimensional vector) that may be applied to vectors of the tensor 902. For example, the correction factor vector 1006 may include "heads" values (e.g., one value corresponding to each layer, in the height dimension, of the vectors of the tensor 902). The calibration engine may cause the normalization function to apply a value of the correction factor vector 1006 to all vectors of the tensor 902 of the same layer when the normalization function normalizes the vectors.

[0106] As a third example, the calibration engine may determine a correction factor value 1008 (e.g., a single value) that may be applied to all vectors of the tensor 902. The calibration engine may cause the normalization function to apply the correction factor value 1008 to all vectors of the tensor 902 when the normalization function normalizes the vectors.

[0107] The systems and techniques may determine bias as a systematic discrepancy between the quantized and unquantized activation vectors, according to the equation:

$$\mathrm{Bias}(y; \hat{y}) = \mathbb{E}\big[T(y)\big] - \mathbb{E}\big[T(\hat{y})\big],$$

where $T$ is a transformation function and $\hat{y} = Q(y)$ is the quantized activation, $Q(\cdot)$ being the quantization function.

[0108] For certain normalization functions (or activation layers, e.g., of normalization functions), such as BatchNorm or SoftMax, the expected value of the activation may be known in advance. For example:

$$y = \mathrm{BatchNorm}(x) = \gamma\, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$

such that the empirical mean $\mathbb{E}[T(y)] = \beta$, with $T(x) = x$.

[0109] As another example:

$$y = \mathrm{SoftMax}(x), \quad \text{such that } \mathbb{E}\Big[\sum\nolimits_j y_j\Big] = 1, \text{ with } T(y) = \sum\nolimits_j y_j.$$

[0110] If the bias is large, the bias could affect quantized accuracy. To address this, the systems and techniques may determine and/or apply a bias correction term, $c$. For example:

$$y_{\mathrm{corrected}} = Q(y) + c, \quad \text{where } c = \mathrm{Bias}(y; \hat{y}).$$

[0111] The correction term may lead to an unbiased activation after quantization:

$$\mathbb{E}\big[T(y_{\mathrm{corrected}})\big] = \mathbb{E}\big[T(y)\big].$$

[0112] Further, the bias correction may have a dimensionality. Activation tensors in deep learning can have different dimensions (e.g., 4D for ConvNets and 3D for Transformers). A user may choose the dimensionality of the bias correction vector depending on the target hardware and application. For example, $c$ can be a scalar applied to the full tensor, or $c$ can be a 1D vector applied along the output channel dimensions of BatchNorm.

[0113] For SoftMax quantization, a different $c$ can be calculated for each attention head. In stable diffusion, a different $c$ can be used per diffusion step, making the correction time-dependent.

[0114] When implementing the bias correction in hardware, assuming asymmetrically quantized activations, the bias correction term can be readily added to the existing offset:

$$\hat{y} + c = s \cdot \left[\mathrm{clamp}\Big(\Big\lfloor \frac{y}{s} \Big\rceil + z;\; 0,\; 2^b - 1\Big) - z\right] + c = s \cdot \mathrm{clamp}\Big(\Big\lfloor \frac{y}{s} \Big\rceil + z;\; 0,\; 2^b - 1\Big) - (s \cdot z - c),$$

where $s \cdot z - c$ is the offset.

[0115] As an example, an input to a softmax layer may be a 3D tensor $X \in \mathbb{R}^{\mathrm{heads} \times L_y \times L_x}$. SoftMax may be applied along the final dimension (red in the figure). The bias may be applied along the first two dimensions independently:

$$c_{i,j} = \frac{1 - \mathbb{E}_x\Big[\sum_k \mathrm{SoftMax}(X)_{i,j,k}\Big]}{L_x}, \quad i \in \{1, \ldots, \mathrm{heads}\},\; j \in \{1, \ldots, L_y\}.$$

[0116] The dimensionality of the bias correction can be reduced:

Per-head: $c_i = \dfrac{L_y - \mathbb{E}_x\Big[\sum_j \sum_k \mathrm{SoftMax}(X)_{i,j,k}\Big]}{L_y\, L_x}$

Per-tensor: $c = \dfrac{\mathrm{heads} \cdot L_y - \mathbb{E}_x\Big[\sum_i \sum_j \sum_k \mathrm{SoftMax}(X)_{i,j,k}\Big]}{\mathrm{heads} \cdot L_y \cdot L_x}$
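The per-row, per-head, and per-tensor corrections can be computed from calibration outputs as in the following sketch. This is a hedged illustration, not the disclosed implementation: the function name is hypothetical, and it assumes that quantized SoftMax outputs observed over calibration inputs are stacked along a leading batch dimension, so that the expectation $\mathbb{E}_x$ becomes a mean over that dimension.

```python
import numpy as np

def correction_factors(probs_q, granularity="per_row"):
    """Bias-correction factors for quantized SoftMax outputs.

    probs_q: array of shape (batch, heads, Ly, Lx) holding quantized
    SoftMax outputs collected over calibration inputs; the expectation
    over inputs is approximated by the mean over the batch dimension.
    """
    _, heads, Ly, Lx = probs_q.shape
    row_sums = probs_q.sum(axis=-1).mean(axis=0)   # E_x[sum_k .], shape (heads, Ly)
    if granularity == "per_row":
        # c_{i,j}: one factor per (head, row), as in the equation of [0115]
        return (1.0 - row_sums) / Lx
    if granularity == "per_head":
        # c_i: one factor per attention head
        return (Ly - row_sums.sum(axis=-1)) / (Ly * Lx)
    # per-tensor: a single scalar for the whole tensor
    return (heads * Ly - row_sums.sum()) / (heads * Ly * Lx)
```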
[0117] FIG. 11 is a flow diagram illustrating a process 1100 for calibrating a quantized machine-learning model, in accordance with aspects of the present disclosure. One or more operations of the process 1100 may be performed by a computing device (or apparatus) or a component (e.g., a chipset, codec, etc.) of the computing device. The computing device may be a mobile device (e.g., a mobile phone), a network-connected wearable such as a watch, an extended reality (XR) device such as a virtual reality (VR) device or augmented reality (AR) device, a vehicle or component or system of a vehicle, or other type of computing device. The one or more operations of the process 1100 may be implemented as software components that are executed and run on one or more processors.

[0118] At a block 1102, a computing device (or one or more components thereof) may obtain a quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format. For example, system 700 of FIG. 7 may obtain quantized machine-learning model 708. Quantized machine-learning model 708 may be associated with a second data format (e.g., int8). For example, one or more parameters (e.g., weights and/or activations) of quantized machine-learning model 708 may be formatted according to the second data format. Quantized machine-learning model 708 may be based on machine-learning model 702. Machine-learning model 702 may be associated with a first data format (e.g., float16). For example, one or more parameters (e.g., weights and/or activations) of machine-learning model 702 may be formatted according to the first data format.

[0119] In some aspects, the quantized machine-learning model may be, or may include, a diffusion model. In some aspects, the quantized machine-learning model may be, or may include, a language transformer.

[0120] In some aspects, obtaining the quantized machine-learning model may be, or may include, changing a format associated with a machine-learning model from the first data format to the second data format to generate the quantized machine-learning model. For example, in some aspects, system 700 may obtain machine-learning model 702 and quantize machine-learning model 702 to generate quantized machine-learning model 708. In some aspects, changing the format associated with the machine-learning model may be, or may include, changing parameters of the machine-learning model from being stored according to the first data format to being stored according to the second data format. In some aspects, the parameters may be, or may include, one or both of weights and activations of the machine-learning model.

[0121] In some aspects, the machine-learning model on which the quantized machine-learning model is based may store parameters of the machine-learning model according to the first data format and the quantized machine-learning model may store parameters of the quantized machine-learning model according to the second data format. In some aspects, the parameters of the machine-learning model may be, or may include, one or both of weights and activations of the machine-learning model and the parameters of the quantized machine-learning model may be, or may include, one or both of weights and activations of the quantized machine-learning model.
In some aspects, the first data format may be, or may include, a floating-point number data format, and the second data format may be, or may include, an integer data format. In some aspects, the integer data format comprises an 8-bit integer data format. Machine-learning model 702 may be associated with a first data format (e.g., float16). For example, one or more parameters (e.g., weights and/or activations) of machine-learning model 702 may be formatted according to the first data format. Quantized machine-learning model 708 may be based on machine-learning model 702. For example, quantized machine-learning model 708 may be a quantized version of machine-learning model 702. Quantized machine-learning model 708 may be associated with a second data format (e.g., int8). For example, one or more parameters (e.g., weights and/or activations) of quantized machine-learning model 708 may be formatted according to the second data format.

[0122] At a block 1104, the computing device (or one or more components thereof) may determine a bias associated with at least one output of a normalization function of the quantized machine-learning model. For example, system 700 of FIG. 7 may determine a bias associated with quantized normalization function 710 of quantized machine-learning model 708.

[0123] In some aspects, determining the bias may include: normalizing at least one input vector using the normalization function of the quantized machine-learning model to generate the at least one output, wherein the at least one output comprises at least one output vector; determining a difference between a sum of values of the at least one output vector and an expected output sum; and determining the bias based on the difference. For example, calibration engine 712 may provide at least one input vector of calibration data 718 to quantized normalization function 710. Quantized normalization function 710 may generate at least one output vector. Calibration engine 712 may determine a difference between a sum of values of the at least one output vector and an expected output sum. Calibration engine 712 may determine the bias based on the difference.

[0124] In some aspects, the normalization function may be, or may include, an activation function configured to generate a probability distribution. In some aspects, the normalization function may be configured to receive an input vector and generate an output vector. In some aspects, the normalization function may be configured such that values of the output vector sum to an expected output sum. For example, the normalization function may be a SoftMax function that may have an expected output sum of 1. In some aspects, the normalization function may be, or may include, at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.
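As a sketch of blocks 1102-1104 (illustrative only; the callable and data-set names below are hypothetical stand-ins for the quantized normalization function 710 and the calibration data 718):

```python
import numpy as np

def determine_bias(quantized_norm_fn, calibration_vectors, expected_sum=1.0):
    """Estimate the bias of a quantized normalization function.

    Each calibration vector is normalized, the output sum is compared
    with the expected sum, and the bias is a statistic (here, the mean)
    of the observed differences.
    """
    diffs = [expected_sum - np.asarray(quantized_norm_fn(v)).sum()
             for v in calibration_vectors]
    return float(np.mean(diffs))

# A correction factor can then spread the bias over an output vector of
# length n: correction = bias / n (see block 1106).
```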
[0126] In some aspects, the correction factor may be determined further based on a size of a set of data input to the normalization function. Additionally or alternatively, the correction factor may be determined further based on a size of a set of data output by the normalization function. For example, the correction factor may be the bias divided by the number of values in vectors input to the quantized normalization function. Additionally or alternatively, the correction factor may be the bias divided by the number of values in vectors output by the quantized normalization function.

[0127] In some aspects, the correction factor is to be combined with each value of each output vector of the normalization function when the normalization function generates the output vectors. For example, correction factor 830 of FIG. 8 may be applied to each value of intermediate vector 828 to generate output vector 832.

[0128] In some aspects, the correction factor may be, or may include, two or more correction factors to be combined with respective output vectors of the normalization function when the normalization function generates the output vectors. For example, the correction factor of block 1106 may be correction factor vector 1006 or correction factor matrix 1004 including multiple values to be combined with multiple respective output vectors.

[0129] FIG. 12 is a flow diagram illustrating a process 1200 for normalizing a vector in a quantized machine-learning model, in accordance with aspects of the present disclosure. One or more operations of the process 1200 may be performed by a computing device (or apparatus) or a component (e.g., a chipset, codec, etc.) of the computing device. The computing device may be a mobile device (e.g., a mobile phone), a network-connected wearable such as a watch, an extended reality (XR) device such as a virtual reality (VR) device or augmented reality (AR) device, a vehicle or component or system of a vehicle, a desktop computing device, a tablet computing device, a server computer, a robotic device, and/or any other computing device with the resource capabilities to perform the process 1200. The one or more operations of the process 1200 may be implemented as software components that are executed and run on one or more processors.

[0130] At a block 1202, a computing device (or one or more components thereof) may normalize an input vector using a normalization function to generate an intermediate vector. For example, normalization function 826 of FIG. 8 may normalize input vector 824 to generate intermediate vector 828.

[0131] In some aspects, the normalization function may be, or may include, an activation function configured to generate a probability distribution. In some aspects, the normalization function may be configured such that values of the intermediate vector sum to an expected output sum. In some aspects, the normalization function may be, or may include, at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.

[0132] In some aspects, the computing device (or one or more components thereof) may select a vector of a tensor as the input vector. In some aspects, the predetermined value may be determined based on a size of the input vector. Additionally or alternatively, the predetermined value may be determined further based on a size of a set of data output by the normalization function.
For example, the predetermined value may be the bias divided by the number of values in vectors input to the quantized normalization function. Additionally or alternatively, the predetermined value may be the bias divided by the number of values in vectors output by the quantized normalization function.

[0133] At a block 1204, the computing device (or one or more components thereof) may combine a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model.

[0134] In some aspects, the quantized machine-learning model may be, or may include, a diffusion model. In some aspects, the quantized machine-learning model may be, or may include, a language transformer.

[0135] In some aspects, the computing device (or one or more components thereof) may obtain a tensor. The tensor may be, or may include, the input vector and one or more vectors. The computing device (or one or more components thereof) may normalize the one or more vectors using the normalization function to generate one or more intermediate vectors. For example, the computing device (or one or more components thereof) may receive tensor 902. The computing device (or one or more components thereof) may normalize tensor 902, one vector at a time.

[0136] In some aspects, the computing device (or one or more components thereof) may combine the predetermined value with each value of the one or more intermediate vectors to generate one or more output vectors. For example, the computing device (or one or more components thereof) may combine correction factor value 1008 with each value of each intermediate vector normalized based on all vectors of tensor 902.

[0137] In some aspects, the computing device (or one or more components thereof) may combine a predetermined value of one or more predetermined values with each value of each one of the one or more intermediate vectors to generate one or more output vectors. For example, the computing device (or one or more components thereof) may combine values of correction factor matrix 1004 with each value of respective ones of intermediate vectors normalized based on respective vectors of tensor 902. As another example, the computing device (or one or more components thereof) may combine values of correction factor vector 1006 with each value of respective ones of intermediate vectors normalized based on respective vectors of tensor 902.

[0138] In some examples, as noted previously, the methods described herein (e.g., the process 1100 of FIG. 11, the process 1200 of FIG. 12, and/or other methods described herein) can be performed, in whole or in part, by a computing device or apparatus. In one example, one or more of the methods can be performed by system 100 of FIG. 1, machine-learning model 200 of FIG. 2, two sets of images 300 of FIG. 3, U-Net architecture 500 of FIG. 5, attention block 600 of FIG. 6, system 700 of FIG. 7, calibrated quantized machine-learning model 714 of FIG. 7, system 822 of FIG. 8, or by another system or device. In another example, one or more of the methods (e.g., the process 1100 of FIG. 11, the process 1200 of FIG. 12, and/or other methods described herein) can be performed, in whole or in part, by the computing-device architecture 1500 shown in FIG. 15.
For instance, a computing device with the computing-device architecture 1500 shown in FIG. 15 can include, or be included in, the components of the system 100 of FIG. 1, machine-learning model 200 of FIG. 2, two sets of images 300 of FIG. 3, U-Net architecture 500 of FIG. 5, attention block 600 of FIG. 6, system 700 of FIG. 7, calibrated quantized machine-learning model 714 of FIG. 7, or system 822 of FIG. 8, and can implement the operations of process 1100, process 1200, and/or other processes described herein. In some cases, the computing device or apparatus can include various components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more cameras, one or more sensors, and/or other component(s) that are configured to carry out the steps of processes described herein. In some examples, the computing device can include a display, a network interface configured to communicate and/or receive the data, any combination thereof, and/or other component(s). The network interface can be configured to communicate and/or receive Internet Protocol (IP) based data or other type of data.

[0139] The components of the computing device can be implemented in circuitry. For example, the components can include and/or can be implemented using electronic circuits or other electronic hardware, which can include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), and/or other suitable electronic circuits), and/or can include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.

[0140] The process 1100, the process 1200, and/or other processes described herein are illustrated as logical flow diagrams, the operation of which represents a sequence of operations that can be implemented in hardware, computer instructions, or a combination thereof. In the context of computer instructions, the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.

[0141] Additionally, the process 1100, the process 1200, and/or other processes described herein can be performed under the control of one or more computer systems configured with executable instructions and can be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or combinations thereof. As noted above, the code can be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable or machine-readable storage medium can be non-transitory.

[0142] As noted above, various aspects of the present disclosure can use machine-learning models or systems.
[0143] FIG. 13 is an illustrative example of a neural network 1300 (e.g., a deep-learning neural network) that can be used to implement the machine-learning based feature segmentation, implicit-neural-representation generation, rendering, and/or classification described above. Neural network 1300 may be an example of, or can implement, machine-learning model 200 of FIG. 2, two sets of images 300 of FIG. 3, U-Net architecture 500 of FIG. 5, machine-learning model 702 of FIG. 7, quantized machine-learning model 708 of FIG. 7, and/or calibrated quantized machine-learning model 714 of FIG. 7.

[0144] An input layer 1302 includes input data. In one illustrative example, input layer 1302 can include data representing a prompt, text embeddings (e.g., based on a prompt), and/or a latent seed. Neural network 1300 includes multiple hidden layers 1306a, 1306b, through 1306n. The hidden layers 1306a, 1306b, through hidden layer 1306n include "n" number of hidden layers, where "n" is an integer greater than or equal to one. The number of hidden layers can be made to include as many layers as needed for the given application. Neural network 1300 further includes an output layer 1304 that provides an output resulting from the processing performed by the hidden layers 1306a, 1306b, through 1306n. In one illustrative example, output layer 1304 can provide output images or output language.

[0145] Neural network 1300 may be, or may include, a multi-layer neural network of interconnected nodes. Each node can represent a piece of information. Information associated with the nodes is shared among the different layers and each layer retains information as information is processed. In some cases, neural network 1300 can include a feed-forward network, in which case there are no feedback connections where outputs of the network are fed back into itself. In some cases, neural network 1300 can include a recurrent neural network, which can have loops that allow information to be carried across nodes while reading in input.

[0146] Information can be exchanged between nodes through node-to-node interconnections between the various layers. Nodes of input layer 1302 can activate a set of nodes in the first hidden layer 1306a. For example, as shown, each of the input nodes of input layer 1302 is connected to each of the nodes of the first hidden layer 1306a. The nodes of first hidden layer 1306a can transform the information of each input node by applying activation functions to the input node information. The information derived from the transformation can then be passed to and can activate the nodes of the next hidden layer 1306b, which can perform their own designated functions. Example functions include convolutional, up-sampling, data transformation, and/or any other suitable functions. The output of the hidden layer 1306b can then activate nodes of the next hidden layer, and so on. The output of the last hidden layer 1306n can activate one or more nodes of the output layer 1304, at which an output is provided. In some cases, while nodes (e.g., node 1308) in neural network 1300 are shown as having multiple output lines, a node has a single output and all lines shown as being output from a node represent the same output value.

[0147] In some cases, each node or interconnection between nodes can have a weight that is a set of parameters derived from the training of neural network 1300.
Once neural network 1300 is trained, it can be referred to as a trained neural network, which can be used to perform one or more operations. For example, an interconnection between nodes can represent a piece of information learned about the interconnected nodes. The interconnection can have a tunable numeric weight that can be tuned (e.g., based on a training dataset), allowing neural network 1300 to be adaptive to inputs and able to learn as more and more data is processed.

[0148] Neural network 1300 may be pre-trained to process the features from the data in the input layer 1302 using the different hidden layers 1306a, 1306b, through 1306n in order to provide the output through the output layer 1304. In an example in which neural network 1300 is used to identify features in images, neural network 1300 can be trained using training data that includes both images and labels, as described above. For instance, training images can be input into the network, with each training image having a label indicating the features in the images (for the feature-segmentation machine-learning system) or a label indicating classes of an activity in each image. In one example using object classification for illustrative purposes, a training image can include an image of a number 2, in which case the label for the image can be [0 0 1 0 0 0 0 0 0 0].

[0149] In some cases, neural network 1300 can adjust the weights of the nodes using a training process called backpropagation. As noted above, a backpropagation process can include a forward pass, a loss function, a backward pass, and a weight update. The forward pass, loss function, backward pass, and parameter update is performed for one training iteration. The process can be repeated for a certain number of iterations for each set of training images until neural network 1300 is trained well enough so that the weights of the layers are accurately tuned.

[0150] For the example of identifying objects in images, the forward pass can include passing a training image through neural network 1300. The weights are initially randomized before neural network 1300 is trained. As an illustrative example, an image can include an array of numbers representing the pixels of the image. Each number in the array can include a value from 0 to 255 describing the pixel intensity at that position in the array. In one example, the array can include a 28 x 28 x 3 array of numbers with 28 rows and 28 columns of pixels and 3 color components (such as red, green, and blue, or luma and two chroma components, or the like).

[0151] As noted above, for a first training iteration for neural network 1300, the output will likely include values that do not give preference to any particular class due to the weights being randomly selected at initialization. For example, if the output is a vector with probabilities that the object includes different classes, the probability value for each of the different classes can be equal or at least very similar (e.g., for ten possible classes, each class can have a probability value of 0.1). With the initial weights, neural network 1300 is unable to determine low-level features and thus cannot make an accurate determination of what the classification of the object might be. A loss function can be used to analyze error in the output. Any suitable loss function definition can be used, such as a cross-entropy loss. Another example of a loss function includes the mean squared error (MSE), defined as

$$E_{\mathrm{total}} = \sum \frac{1}{2}\big(\mathrm{target} - \mathrm{output}\big)^2.$$

The loss can be set to be equal to the value of $E_{\mathrm{total}}$.
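As a small illustration of the MSE loss above and the weight update described for backpropagation (the numeric values below are illustrative only, not from the disclosure):

```python
import numpy as np

def mse_loss(target, output):
    """E_total = sum over outputs of (1/2) * (target - output)^2."""
    return 0.5 * np.sum((np.asarray(target) - np.asarray(output)) ** 2)

loss = mse_loss([0, 0, 1, 0], [0.1, 0.2, 0.6, 0.1])  # one-hot target vs. prediction
w, eta, dL_dW = 0.3, 0.01, 1.7   # a weight, a learning rate, and its gradient
w = w - eta * dL_dW              # move the weight opposite the gradient
```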
[0152] The loss (or error) will be high for the first training images since the actual values will be much different than the predicted output. The goal of training is to minimize the amount of loss so that the predicted output is the same as the training label. Neural network 1300 can perform a backward pass by determining which inputs (weights) most contributed to the loss of the network and can adjust the weights so that the loss decreases and is eventually minimized. A derivative of the loss with respect to the weights (denoted as dL/dW, where W are the weights at a particular layer) can be computed to determine the weights that contributed most to the loss of the network. After the derivative is computed, a weight update can be performed by updating all the weights of the filters. For example, the weights can be updated so that they change in the opposite direction of the gradient. The weight update can be denoted as

$$w = w_i - \eta \frac{dL}{dW},$$

where $w$ denotes a weight, $w_i$ denotes the initial weight, and $\eta$ denotes a learning rate. The learning rate can be set to any suitable value, with a high learning rate including larger weight updates and a lower value indicating smaller weight updates.

[0153] Neural network 1300 can include any suitable deep network. One example includes a convolutional neural network (CNN), which includes an input layer and an output layer, with multiple hidden layers between the input and output layers. The hidden layers of a CNN include a series of convolutional, nonlinear, pooling (for downsampling), and fully connected layers. Neural network 1300 can include any other deep network other than a CNN, such as an autoencoder, deep belief nets (DBNs), recurrent neural networks (RNNs), among others.

[0154] FIG. 14 is an illustrative example of a convolutional neural network (CNN) 1400. The input layer 1402 of the CNN 1400 includes data representing an image or frame. For example, the data can include an array of numbers representing the pixels of the image, with each number in the array including a value from 0 to 255 describing the pixel intensity at that position in the array. Using the previous example from above, the array can include a 28 x 28 x 3 array of numbers with 28 rows and 28 columns of pixels and 3 color components (e.g., red, green, and blue, or luma and two chroma components, or the like). The image can be passed through a convolutional hidden layer 1404, an optional non-linear activation layer, a pooling hidden layer 1406, and a fully connected layer 1408 (which fully connected layer 1408 can be hidden) to get an output at the output layer 1410. While only one of each hidden layer is shown in FIG. 14, one of ordinary skill will appreciate that multiple convolutional hidden layers, non-linear layers, pooling hidden layers, and/or fully connected layers can be included in the CNN 1400. As previously described, the output can indicate a single class of an object or can include a probability of classes that best describe the object in the image.

[0155] The first layer of the CNN 1400 can be the convolutional hidden layer 1404. The convolutional hidden layer 1404 can analyze image data of the input layer 1402. Each node of the convolutional hidden layer 1404 is connected to a region of nodes (pixels) of the input image called a receptive field. The convolutional hidden layer 1404 can be considered as one or more filters (each filter corresponding to a different activation or feature map), with each convolutional iteration of a filter being a node or neuron of the convolutional hidden layer 1404. For example, the region of the input image that a filter covers at each convolutional iteration would be the receptive field for the filter. In one illustrative example, if the input image includes a 28×28 array, and each filter (and corresponding receptive field) is a 5×5 array, then there will be 24×24 nodes in the convolutional hidden layer 1404. Each connection between a node and a receptive field for that node learns a weight and, in some cases, an overall bias such that each node learns to analyze its particular local receptive field in the input image. Each node of the convolutional hidden layer 1404 will have the same weights and bias (called a shared weight and a shared bias). For example, the filter has an array of weights (numbers) and the same depth as the input. A filter will have a depth of 3 for an image frame example (according to three color components of the input image).
An illustrative example size of the filter array is 5 x 5 x 3, corresponding to a size of the receptive field of a node.

[0156] The convolutional nature of the convolutional hidden layer 1404 is due to each node of the convolutional layer being applied to its corresponding receptive field. For example, a filter of the convolutional hidden layer 1404 can begin in the top-left corner of the input image array and can convolve around the input image. As noted above, each convolutional iteration of the filter can be considered a node or neuron of the convolutional hidden layer 1404. At each convolutional iteration, the values of the filter are multiplied with a corresponding number of the original pixel values of the image (e.g., the 5x5 filter array is multiplied by a 5x5 array of input pixel values at the top-left corner of the input image array). The multiplications from each convolutional iteration can be summed together to obtain a total sum for that iteration or node. The process is next continued at a next location in the input image according to the receptive field of a next node in the convolutional hidden layer 1404. For example, a filter can be moved by a step amount (referred to as a stride) to the next receptive field. The stride can be set to 1 or any other suitable amount. For example, if the stride is set to 1, the filter will be moved to the right by 1 pixel at each convolutional iteration. Processing the filter at each unique location of the input volume produces a number representing the filter results for that location, resulting in a total sum value being determined for each node of the convolutional hidden layer 1404.

[0157] The mapping from the input layer to the convolutional hidden layer 1404 is referred to as an activation map (or feature map). The activation map includes a value for each node representing the filter results at each location of the input volume. The activation map can include an array that includes the various total sum values resulting from each iteration of the filter on the input volume. For example, the activation map will include a 24 x 24 array if a 5 x 5 filter is applied to each pixel (a stride of 1) of a 28 x 28 input image. The convolutional hidden layer 1404 can include several activation maps in order to identify multiple features in an image. The example shown in FIG. 14 includes three activation maps. Using three activation maps, the convolutional hidden layer 1404 can detect three different kinds of features, with each feature being detectable across the entire image.

[0158] In some examples, a non-linear hidden layer can be applied after the convolutional hidden layer 1404. The non-linear layer can be used to introduce non-linearity to a system that has been computing linear operations. One illustrative example of a non-linear layer is a rectified linear unit (ReLU) layer. A ReLU layer can apply the function f(x) = max(0, x) to all of the values in the input volume, which changes all the negative activations to 0. The ReLU can thus increase the non-linear properties of the CNN 1400 without affecting the receptive fields of the convolutional hidden layer 1404.

[0159] The pooling hidden layer 1406 can be applied after the convolutional hidden layer 1404 (and after the non-linear hidden layer when used). The pooling hidden layer 1406 is used to simplify the information in the output from the convolutional hidden layer 1404.
For example, the pooling hidden layer 1406 can take each activation map output from the convolutional hidden layer 1404 and generate a condensed activation map (or feature map) using a pooling function. Max-pooling is one example of a function performed by a pooling hidden layer. Other forms of pooling functions can be used by the pooling hidden layer 1406, such as average pooling, L2-norm pooling, or other suitable pooling functions. A pooling function (e.g., a max-pooling filter, an L2-norm filter, or other suitable pooling filter) is applied to each activation map included in the convolutional hidden layer 1404. In the example shown in FIG. 14, three pooling filters are used for the three activation maps in the convolutional hidden layer 1404.

[0160] In some examples, max-pooling can be used by applying a max-pooling filter (e.g., having a size of 2x2) with a stride (e.g., equal to a dimension of the filter, such as a stride of 2) to an activation map output from the convolutional hidden layer 1404. The output from a max-pooling filter includes the maximum number in every sub-region that the filter convolves around. Using a 2x2 filter as an example, each unit in the pooling layer can summarize a region of 2×2 nodes in the previous layer (with each node being a value in the activation map). For example, four values (nodes) in an activation map will be analyzed by a 2x2 max-pooling filter at each iteration of the filter, with the maximum value from the four values being output as the "max" value. If such a max-pooling filter is applied to an activation filter from the convolutional hidden layer 1404 having a dimension of 24x24 nodes, the output from the pooling hidden layer 1406 will be an array of 12x12 nodes.

[0161] In some examples, an L2-norm pooling filter could also be used. The L2-norm pooling filter includes computing the square root of the sum of the squares of the values in the 2×2 region (or other suitable region) of an activation map (instead of computing the maximum values as is done in max-pooling) and using the computed values as an output.

[0162] The pooling function (e.g., max-pooling, L2-norm pooling, or other pooling function) determines whether a given feature is found anywhere in a region of the image and discards the exact positional information. This can be done without affecting results of the feature detection because, once a feature has been found, the exact location of the feature is not as important as its approximate location relative to other features. Max-pooling (as well as other pooling methods) offers the benefit that there are many fewer pooled features, thus reducing the number of parameters needed in later layers of the CNN 1400.

[0163] The final layer of connections in the network is a fully-connected layer that connects every node from the pooling hidden layer 1406 to every one of the output nodes in the output layer 1410. Using the example above, the input layer includes 28 x 28 nodes encoding the pixel intensities of the input image, the convolutional hidden layer 1404 includes 3×24×24 hidden feature nodes based on application of a 5×5 local receptive field (for the filters) to three activation maps, and the pooling hidden layer 1406 includes a layer of 3×12×12 hidden feature nodes based on application of a max-pooling filter to 2×2 regions across each of the three feature maps. Extending this example, the output layer 1410 can include ten output nodes.
[0163] The final layer of connections in the network is a fully-connected layer that connects every node from the pooling hidden layer 1406 to every one of the output nodes in the output layer 1410. Using the example above, the input layer includes 28 x 28 nodes encoding the pixel intensities of the input image, the convolutional hidden layer 1404 includes 3 x 24 x 24 hidden feature nodes based on application of a 5 x 5 local receptive field (for the filters) to three activation maps, and the pooling hidden layer 1406 includes a layer of 3 x 12 x 12 hidden feature nodes based on application of a max-pooling filter to 2 x 2 regions across each of the three feature maps. Extending this example, the output layer 1410 can include ten output nodes. In such an example, every node of the 3 x 12 x 12 pooling hidden layer 1406 is connected to every node of the output layer 1410.

[0164] The fully connected layer 1408 can obtain the output of the previous pooling hidden layer 1406 (which should represent the activation maps of high-level features) and determine the features that most correlate to a particular class. For example, the fully connected layer 1408 can determine the high-level features that most strongly correlate to a particular class and can include weights (nodes) for the high-level features. A product can be computed between the weights of the fully connected layer 1408 and the pooling hidden layer 1406 to obtain probabilities for the different classes. For example, if the CNN 1400 is being used to predict that an object in an image is a person, high values will be present in the activation maps that represent high-level features of people (e.g., two legs are present, a face is present at the top of the object, two eyes are present at the top left and top right of the face, a nose is present in the middle of the face, a mouth is present at the bottom of the face, and/or other features common for a person).

[0165] In some examples, the output from the output layer 1410 can include an M-dimensional vector (in the prior example, M = 10). M indicates the number of classes that the CNN 1400 has to choose from when classifying the object in the image. Other example outputs can also be provided. Each number in the M-dimensional vector can represent the probability the object is of a certain class. In one illustrative example, if a 10-dimensional output vector representing ten different classes of objects is [0 0 0.05 0.8 0 0.15 0 0 0 0], the vector indicates that there is a 5% probability that the image is the third class of object (e.g., a dog), an 80% probability that the image is the fourth class of object (e.g., a human), and a 15% probability that the image is the sixth class of object (e.g., a kangaroo). The probability for a class can be considered a confidence level that the object is part of that class.
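To make the interpretation in paragraph [0165] concrete, the following sketch selects the most probable class from the 10-dimensional output vector of the example above. The class list is a hypothetical placeholder except for the three classes named in the text; NumPy is assumed for illustration.

```python
import numpy as np

# Hypothetical 10-class label list; only "dog" (third), "human" (fourth), and
# "kangaroo" (sixth) come from paragraph [0165] -- the rest are placeholders.
classes = ["class 1", "class 2", "dog", "human", "class 5",
           "kangaroo", "class 7", "class 8", "class 9", "class 10"]
output_vector = np.array([0, 0, 0.05, 0.8, 0, 0.15, 0, 0, 0, 0])  # M = 10

best = int(np.argmax(output_vector))
print(f"Predicted class: {classes[best]} "
      f"(confidence {output_vector[best]:.0%})")   # human (confidence 80%)
```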
[0166] FIG. 15 illustrates an example computing-device architecture 1500 of an example computing device which can implement the various techniques described herein. In some examples, the computing device can include a mobile device, a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a video server, a vehicle (or computing device of a vehicle), or other device. For example, the computing-device architecture 1500 may include, implement, or be included in any or all of system 100 of FIG. 1, machine-learning model 200 of FIG. 2, two sets of images 300 of FIG. 3, U-Net architecture 500 of FIG. 5, attention block 600 of FIG. 6, system 700 of FIG. 7, calibrated quantized machine-learning model 714 of FIG. 7, and/or system 822 of FIG. 8.

[0167] The components of computing-device architecture 1500 are shown in electrical communication with each other using connection 1512, such as a bus. The example computing-device architecture 1500 includes a processing unit (CPU or processor) 1502 and computing device connection 1512 that couples various computing device components, including computing device memory 1510, such as read-only memory (ROM) 1508 and random-access memory (RAM) 1506, to processor 1502.

[0168] Computing-device architecture 1500 can include a cache of high-speed memory connected directly with, in close proximity to, or integrated as part of processor 1502. Computing-device architecture 1500 can copy data from memory 1510 and/or the storage device 1514 to cache 1504 for quick access by processor 1502. In this way, the cache can provide a performance boost that avoids processor 1502 delays while waiting for data. These and other modules can control or be configured to control processor 1502 to perform various actions. Other computing device memory 1510 may be available for use as well. Memory 1510 can include multiple different types of memory with different performance characteristics. Processor 1502 can include any general-purpose processor and a hardware or software service, such as service 1 1516, service 2 1518, and service 3 1520 stored in storage device 1514, configured to control processor 1502, as well as a special-purpose processor where software instructions are incorporated into the processor design. Processor 1502 may be a self-contained system, containing multiple cores or processors, a bus, memory controller, cache, etc. A multi-core processor may be symmetric or asymmetric.

[0169] To enable user interaction with the computing-device architecture 1500, input device 1522 can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, a keyboard, a mouse, motion input, speech, and so forth. Output device 1524 can also be one or more of a number of output mechanisms known to those of skill in the art, such as a display, projector, television, speaker device, etc. In some instances, multimodal computing devices can enable a user to provide multiple types of input to communicate with computing-device architecture 1500. Communication interface 1526 can generally govern and manage the user input and computing device output. There is no restriction on operating on any particular hardware arrangement, and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.

[0170] Storage device 1514 is a non-volatile memory and can be a hard disk or other types of computer-readable media that can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, random-access memories (RAMs) 1506, read-only memory (ROM) 1508, and hybrids thereof. Storage device 1514 can include services 1516, 1518, and 1520 for controlling processor 1502. Other hardware or software modules are contemplated. Storage device 1514 can be connected to the computing device connection 1512. In one aspect, a hardware module that performs a particular function can include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as processor 1502, connection 1512, output device 1524, and so forth, to carry out the function.

[0171] The term “substantially,” in reference to a given parameter, property, or condition, may refer to a degree that one of ordinary skill in the art would understand that the given parameter, property, or condition is met with a small degree of variance, such as, for example, within acceptable manufacturing tolerances. By way of example, depending on the particular parameter,
property, or condition that is substantially met, the parameter, property, or condition may be at least 90% met, at least 95% met, or even at least 99% met.

[0172] Aspects of the present disclosure are applicable to any suitable electronic device (such as security systems, smartphones, tablets, laptop computers, vehicles, drones, or other devices) including or coupled to one or more active depth sensing systems. While described below with respect to a device having or coupled to one light projector, aspects of the present disclosure are applicable to devices having any number of light projectors and are therefore not limited to specific devices.

[0173] The term “device” is not limited to one or a specific number of physical objects (such as one smartphone, one controller, one processing system, and so on). As used herein, a device may be any electronic device with one or more parts that may implement at least some portions of this disclosure. While the below description and examples use the term “device” to describe various aspects of this disclosure, the term “device” is not limited to a specific configuration, type, or number of objects. Additionally, the term “system” is not limited to multiple components or specific aspects. For example, a system may be implemented on one or more printed circuit boards or other substrates and may have movable or static components. While the below description and examples use the term “system” to describe various aspects of this disclosure, the term “system” is not limited to a specific configuration, type, or number of objects.

[0174] Specific details are provided in the description above to provide a thorough understanding of the aspects and examples provided herein. However, it will be understood by one of ordinary skill in the art that the aspects may be practiced without these specific details. For clarity of explanation, in some instances the present technology may be presented as including individual functional blocks, including devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software. Additional components may be used other than those shown in the figures and/or described herein. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the aspects in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the aspects.

[0175] Individual aspects may be described above as a process or method which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination can correspond to a return of the function to the calling function or the main function.
[0176] Processes and methods according to the above-described examples can be implemented using computer-executable instructions that are stored or otherwise available from computer-readable media. Such instructions can include, for example, instructions and data which cause or otherwise configure a general-purpose computer, special-purpose computer, or a processing device to perform a certain function or group of functions. Portions of computer resources used can be accessible over a network. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, source code, etc.

[0177] The term “computer-readable medium” includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing, containing, or carrying instruction(s) and/or data. A computer-readable medium may include a non-transitory medium in which data can be stored and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections. Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash memory, magnetic or optical disks, USB devices provided with non-volatile memory, networked storage devices, any suitable combination thereof, among others. A computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, or the like.

[0178] In some aspects, the computer-readable storage devices, mediums, and memories can include a cable or wireless signal containing a bit stream and the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.

[0179] Devices implementing processes and methods according to these disclosures can include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and can take any of a variety of form factors. When implemented in software, firmware, middleware, or microcode, the program code or code segments to perform the necessary tasks (e.g., a computer-program product) may be stored in a computer-readable or machine-readable medium. A processor(s) may perform the necessary tasks. Typical examples of form factors include laptops, smart phones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. Functionality described herein also can be embodied in peripherals or add-in cards. Such functionality can also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example.
[0180] The instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are example means for providing the functions described in the disclosure.

[0181] In the foregoing description, aspects of the application are described with reference to specific aspects thereof, but those skilled in the art will recognize that the application is not limited thereto. Thus, while illustrative aspects of the application have been described in detail herein, it is to be understood that the inventive concepts may be otherwise variously embodied and employed, and that the appended claims are intended to be construed to include such variations, except as limited by the prior art. Various features and aspects of the above-described application may be used individually or jointly. Further, aspects can be utilized in any number of environments and applications beyond those described herein without departing from the broader spirit and scope of the specification. The specification and drawings are, accordingly, to be regarded as illustrative rather than restrictive. For the purposes of illustration, methods were described in a particular order. It should be appreciated that in alternate aspects, the methods may be performed in a different order than that described.

[0182] One of ordinary skill will appreciate that the less than (“<”) and greater than (“>”) symbols or terminology used herein can be replaced with less than or equal to (“≤”) and greater than or equal to (“≥”) symbols, respectively, without departing from the scope of this description.

[0183] Where components are described as being “configured to” perform certain operations, such configuration can be accomplished, for example, by designing electronic circuits or other hardware to perform the operation, by programming programmable electronic circuits (e.g., microprocessors, or other suitable electronic circuits) to perform the operation, or any combination thereof.

[0184] The phrase “coupled to” refers to any component that is physically connected to another component either directly or indirectly, and/or any component that is in communication with another component (e.g., connected to the other component over a wired or wireless connection, and/or other suitable communication interface) either directly or indirectly.

[0185] Claim language or other language reciting “at least one of” a set and/or “one or more” of a set indicates that one member of the set or multiple members of the set (in any combination) satisfy the claim. For example, claim language reciting “at least one of A and B” or “at least one of A or B” means A, B, or A and B. In another example, claim language reciting “at least one of A, B, and C” or “at least one of A, B, or C” means A, B, C, or A and B, or A and C, or B and C, or A and B and C. The language “at least one of” a set and/or “one or more” of a set does not limit the set to the items listed in the set. For example, claim language reciting “at least one of A and B” or “at least one of A or B” can mean A, B, or A and B, and can additionally include items not listed in the set of A and B.

[0186] Claim language or other language reciting “at least one processor configured to,” “at least one processor being configured to,” or the like indicates that one processor or multiple processors (in any combination) can perform the associated operation(s).
For example, claim language reciting “at least one processor configured to: X, Y, and Z” means a single processor can be used to perform operations X, Y, and Z; or that multiple processors are each tasked with a certain subset of operations X, Y, and Z such that together the multiple processors perform X, Y, and Z; or that a group of multiple processors work together to perform operations X, Y, and Z. In another example, claim language reciting “at least one processor configured to: X, Y, and Z” can mean that any single processor may only perform at least a subset of operations X, Y, and Z.

[0187] The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, firmware, or combinations thereof. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

[0188] The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques may be implemented in any of a variety of devices such as general-purpose computers, wireless communication device handsets, or integrated circuit devices having multiple uses including application in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a computer-readable data storage medium including program code including instructions that, when executed, perform one or more of the methods described above. The computer-readable data storage medium may form part of a computer program product, which may include packaging materials. The computer-readable medium may include memory or data storage media, such as random-access memory (RAM) such as synchronous dynamic random-access memory (SDRAM), read-only memory (ROM), non-volatile random-access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, magnetic or optical data storage media, and the like. The techniques additionally, or alternatively, may be realized at least in part by a computer-readable communication medium that carries or communicates program code in the form of instructions or data structures and that can be accessed, read, and/or executed by a computer, such as propagated signals or waves.

[0189] The program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
Such a processor may be configured to perform any of the techniques described in this disclosure. A general-purpose processor may be a microprocessor; but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration). Accordingly, the term “processor,” as used herein, may refer to any of the foregoing structure, any combination of the foregoing structure, or any other structure or apparatus suitable for implementation of the techniques described herein.

[0190] Illustrative aspects of the disclosure include the following (a non-limiting code sketch illustrating Aspects 1 to 3 and Aspect 19 is provided after Aspect 32):

[0191] Aspect 1. A method for calibrating a quantized machine-learning model, the method comprising: obtaining the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determining a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determining a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.

[0192] Aspect 2. The method of aspect 1, wherein determining the bias comprises: normalizing at least one input vector using the normalization function of the quantized machine-learning model to generate the at least one output, wherein the at least one output comprises at least one output vector; determining a difference between a sum of values of the at least one output vector and an expected output sum; and determining the bias based on the difference.

[0193] Aspect 3. The method of aspect 2, wherein the correction factor is determined further based on a size of a set of data input to the normalization function.

[0194] Aspect 4. The method of any one of aspects 1 to 3, wherein the correction factor is to be combined with each value of each output vector of the normalization function when the normalization function generates the output vectors.

[0195] Aspect 5. The method of any one of aspects 1 to 4, wherein the correction factor comprises two or more correction factors to be combined with respective output vectors of the normalization function when the normalization function generates the output vectors.

[0196] Aspect 6. The method of any one of aspects 1 to 5, wherein the normalization function comprises an activation function configured to generate a probability distribution.

[0197] Aspect 7. The method of any one of aspects 1 to 6, wherein the normalization function is configured to receive an input vector and generate an output vector.

[0198] Aspect 8. The method of aspect 7, wherein the normalization function is configured such that values of the output vector sum to an expected output sum.

[0199] Aspect 9. The method of any one of aspects 1 to 8, wherein the normalization function comprises at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.

[0200] Aspect 10.
The method of any one of aspects 1 to 9, wherein the quantized machine-learning model comprises a diffusion model.

[0201] Aspect 11. The method of any one of aspects 1 to 10, wherein the quantized machine-learning model comprises a language transformer.

[0202] Aspect 12. The method of any one of aspects 1 to 11, wherein obtaining the quantized machine-learning model comprises changing a format associated with a machine-learning model from the first data format to the second data format to generate the quantized machine-learning model.

[0203] Aspect 13. The method of aspect 12, wherein changing the format associated with the machine-learning model comprises changing parameters of the machine-learning model from being stored according to the first data format to being stored according to the second data format.

[0204] Aspect 14. The method of aspect 13, wherein the parameters comprise one or both of weights and activations of the machine-learning model.

[0205] Aspect 15. The method of any one of aspects 1 to 14, wherein the machine-learning model on which the quantized machine-learning model is based stores parameters of the machine-learning model according to the first data format and the quantized machine-learning model stores parameters of the quantized machine-learning model according to the second data format.

[0206] Aspect 16. The method of aspect 15, wherein the parameters of the machine-learning model comprise one or both of weights and activations of the machine-learning model and wherein the parameters of the quantized machine-learning model comprise one or both of weights and activations of the quantized machine-learning model.

[0207] Aspect 17. The method of any one of aspects 1 to 16, wherein the first data format comprises a floating-point number data format and the second data format comprises an integer data format.

[0208] Aspect 18. The method of aspect 17, wherein the integer data format comprises an 8-bit integer data format.

[0209] Aspect 19. A method for normalizing a vector in a quantized machine-learning model, the method comprising: normalizing an input vector using a normalization function to generate an intermediate vector; and combining a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model.

[0210] Aspect 20. The method of aspect 19, wherein the normalization function comprises an activation function configured to generate a probability distribution.

[0211] Aspect 21. The method of any one of aspects 19 or 20, wherein the normalization function is configured such that values of the intermediate vector sum to an expected output sum.

[0212] Aspect 22. The method of any one of aspects 19 to 21, wherein the normalization function comprises at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.

[0213] Aspect 23. The method of any one of aspects 19 to 22, wherein the quantized machine-learning model comprises a diffusion model.

[0214] Aspect 24. The method of any one of aspects 19 to 23, wherein the quantized machine-learning model comprises a language transformer.

[0215] Aspect 25. The method of any one of aspects 19 to 24, further comprising selecting a vector of a tensor as the input vector.

[0216] Aspect 26.
The method of any one of aspects 19 to 25, wherein the predetermined value is determined based on a size of the input vector.

[0217] Aspect 27. The method of any one of aspects 19 to 26, further comprising: obtaining a tensor, wherein the tensor comprises the input vector and one or more vectors; and normalizing the one or more vectors using the normalization function to generate one or more intermediate vectors.

[0218] Aspect 28. The method of aspect 27, further comprising combining the predetermined value with each value of the one or more intermediate vectors to generate one or more output vectors.

[0219] Aspect 29. The method of any one of aspects 27 or 28, further comprising combining a predetermined value of one or more predetermined values with each value of each one of the one or more intermediate vectors to generate one or more output vectors.

[0220] Aspect 30. An apparatus for calibrating a quantized machine-learning model, the apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory and configured to: obtain the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format; determine a bias associated with at least one output of a normalization function of the quantized machine-learning model; and determine a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.

[0221] Aspect 31. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by at least one processor, cause the at least one processor to perform operations according to any of aspects 1 to 29.

[0222] Aspect 32. An apparatus for calibrating a quantized machine-learning model, the apparatus comprising one or more means for performing operations according to any of aspects 1 to 29.
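The following minimal sketch illustrates one possible reading of Aspects 1 to 3 (calibration) and Aspect 19 (inference-time use), assuming a SoftMax normalization function, NumPy, and a toy rounding model of quantization. The names quantized_softmax and calibrate_correction, the grid scale, and the strategy of averaging the sum deficit over calibration vectors and spreading it across the vector size are illustrative assumptions, not the definitive claimed implementation.

```python
import numpy as np

def quantized_softmax(x, scale=1.0 / 256):
    """Toy stand-in for a quantized SoftMax: outputs are rounded to a grid,
    so the resulting values may no longer sum exactly to the expected sum of 1."""
    y = np.exp(x - x.max())
    y /= y.sum()
    return np.round(y / scale) * scale

def calibrate_correction(calibration_inputs, expected_sum=1.0):
    """Per Aspects 1-3: measure the difference between each quantized output
    sum and the expected output sum, then spread the average bias over the
    size of the normalized dimension to obtain a per-element correction."""
    biases = [quantized_softmax(v).sum() - expected_sum for v in calibration_inputs]
    n = calibration_inputs.shape[1]   # size of the data input to the function
    return -np.mean(biases) / n       # value to combine with each output element

calibration_vectors = np.random.randn(64, 128)   # hypothetical calibration set
correction = calibrate_correction(calibration_vectors)

# Inference-time use per Aspect 19: combine the predetermined value with each
# value of the intermediate vector to generate the output vector.
output_vector = quantized_softmax(np.random.randn(128)) + correction
```

In a real deployment, the rounding stand-in would be replaced by the model's actual quantized SoftMax, and the correction factor would be stored with the quantized model for use whenever vectors are normalized.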

Claims

WHAT IS CLAIMED IS:

1. A method for calibrating a quantized machine-learning model, the method comprising:
obtaining the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format;
determining a bias associated with at least one output of a normalization function of the quantized machine-learning model; and
determining a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.

2. The method of claim 1, wherein determining the bias comprises:
normalizing at least one input vector using the normalization function of the quantized machine-learning model to generate the at least one output, wherein the at least one output comprises at least one output vector;
determining a difference between a sum of values of the at least one output vector and an expected output sum; and
determining the bias based on the difference.

3. The method of claim 2, wherein the correction factor is determined further based on a size of a set of data input to the normalization function.

4. The method of claim 1, wherein the correction factor is to be combined with each value of each output vector of the normalization function when the normalization function generates the output vectors.

5. The method of claim 1, wherein the correction factor comprises two or more correction factors to be combined with respective output vectors of the normalization function when the normalization function generates the output vectors.

6. The method of claim 1, wherein the normalization function comprises an activation function configured to generate a probability distribution.

7. The method of claim 1, wherein the normalization function is configured to receive an input vector and generate an output vector.

8. The method of claim 7, wherein the normalization function is configured such that values of the output vector sum to an expected output sum.

9. The method of claim 1, wherein the normalization function comprises at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.

10. The method of claim 1, wherein the quantized machine-learning model comprises a diffusion model.

11. The method of claim 1, wherein the quantized machine-learning model comprises a language transformer.

12. The method of claim 1, wherein obtaining the quantized machine-learning model comprises changing a format associated with a machine-learning model from the first data format to the second data format to generate the quantized machine-learning model.

13. The method of claim 12, wherein changing the format associated with the machine-learning model comprises changing parameters of the machine-learning model from being stored according to the first data format to being stored according to the second data format.

14. The method of claim 13, wherein the parameters comprise one or both of weights and activations of the machine-learning model.
15. The method of claim 1, wherein the machine-learning model on which the quantized machine-learning model is based stores parameters of the machine-learning model according to the first data format and the quantized machine-learning model stores parameters of the quantized machine-learning model according to the second data format.

16. The method of claim 15, wherein the parameters of the machine-learning model comprise one or both of weights and activations of the machine-learning model and wherein the parameters of the quantized machine-learning model comprise one or both of weights and activations of the quantized machine-learning model.

17. The method of claim 1, wherein the first data format comprises a floating-point number data format and the second data format comprises an integer data format.

18. The method of claim 17, wherein the integer data format comprises an 8-bit integer data format.

19. A method for normalizing a vector in a quantized machine-learning model, the method comprising:
normalizing an input vector using a normalization function to generate an intermediate vector; and
combining a predetermined value with each value of the intermediate vector to generate an output vector, wherein the predetermined value is determined to correct a bias of the normalization function of the quantized machine-learning model.

20. The method of claim 19, wherein the normalization function comprises an activation function configured to generate a probability distribution.

21. The method of claim 19, wherein the normalization function is configured such that values of the intermediate vector sum to an expected output sum.

22. The method of claim 19, wherein the normalization function comprises at least one of: a SoftMax function; a batch normalization function; a layer normalization function; a group normalization function; or an instance normalization function.

23. The method of claim 19, wherein the quantized machine-learning model comprises a diffusion model.

24. The method of claim 19, wherein the quantized machine-learning model comprises a language transformer.

25. The method of claim 19, further comprising selecting a vector of a tensor as the input vector.

26. The method of claim 19, wherein the predetermined value is determined based on a size of the input vector.

27. The method of claim 19, further comprising:
obtaining a tensor, wherein the tensor comprises the input vector and one or more vectors; and
normalizing the one or more vectors using the normalization function to generate one or more intermediate vectors.

28. The method of claim 27, further comprising combining the predetermined value with each value of the one or more intermediate vectors to generate one or more output vectors.

29. The method of claim 27, further comprising combining a predetermined value of one or more predetermined values with each value of each one of the one or more intermediate vectors to generate one or more output vectors.

30. An apparatus for calibrating a quantized machine-learning model, the apparatus comprising:
at least one memory; and
at least one processor coupled to the at least one memory and configured to:
obtain the quantized machine-learning model, wherein a machine-learning model on which the quantized machine-learning model is based is associated with a first data format and the quantized machine-learning model is associated with a second data format;
determine a bias associated with at least one output of a normalization function of the quantized machine-learning model; and
determine a correction factor based on the bias, wherein the correction factor is to be applied by the normalization function of the quantized machine-learning model when normalizing vectors.
PCT/US2024/030395 2023-07-24 2024-05-21 Calibrating a quantized machine-learning models Pending WO2025024035A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GR20230100607 2023-07-24

Publications (1)

Publication Number Publication Date
WO2025024035A1 true WO2025024035A1 (en) 2025-01-30

Family

ID=91585899

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/030395 Pending WO2025024035A1 (en) 2023-07-24 2024-05-21 Calibrating a quantized machine-learning models

Country Status (1)

Country Link
WO (1) WO2025024035A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230118505A1 (en) * 2021-10-18 2023-04-20 Samsung Electronics Co., Ltd. Method and apparatus for neural network operation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
J. YU ET AL.: "NN-LUT: neural approximation of non-linear operations for efficient transformer inference", PROCEEDINGS OF THE 59TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, 10 July 2022 (2022-07-10), pages 577 - 582, XP059501991, ISBN: 979-8-4007-0103-0, DOI: 10.1145/3489517.3530505 *
N. P. PANDEY ET AL.: "Softmax Bias Correction for Quantized Generative Models", 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), IEEE, 2 October 2023 (2023-10-02), pages 1445 - 1450, XP034503724, DOI: 10.1109/ICCVW60793.2023.00157 *
SEHOON KIM ET AL: "I-BERT: Integer-only BERT Quantization", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 June 2021 (2021-06-08), XP081978445 *
YANG LIN ET AL: "FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 17 February 2023 (2023-02-17), XP091441534 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24734530

Country of ref document: EP

Kind code of ref document: A1