WO2025166361A1

WO2025166361A1 - Mutual alignment vector quantization

Info

Publication number: WO2025166361A1
Application number: PCT/US2025/014343
Authority: WO
Inventors: Siyuan Qiao; Basil MUSTAFA
Original assignee: DeepMind Technologies Ltd; Gdm Holding LLC
Current assignee: DeepMind Technologies Ltd; Gdm Holding LLC
Priority date: 2024-02-01
Filing date: 2025-02-03
Publication date: 2025-08-07
Anticipated expiration: 2026-08-01

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training an encoder neural network to generate discrete latent representations of data items by performing both a forward and a backward function during training.

Description

MUTUAL ALIGNMENT VECTOR QUANTIZATION

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Application No. 63/548,843, filed February 1, 2024, the disclosure of which is incorporated herein by reference.

BACKGROUND

This specification relates to processing inputs, e.g., images, videos, or audio signals, using machine learning models.

As one example, neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to another layer in the network, e.g., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of weights.

SUMMARY

This specification describes a system implemented as computer programs on one or more computers that trains an encoder neural network that can be used to generate discrete representations of data items, e.g., images, audio signals, or videos.

In one aspect, there is described a method of training an encoder neural network having a set of encoder network parameters and of updating a codebook of latent embedding vectors stored in a memory. The method includes receiving a set of one or more training data items; for each training data item: processing the training data item through the encoder neural network in accordance with current values of the encoder network parameters of the encoder neural network to generate a training encoder output that comprises, for each of one or more latent variables, a respective training encoded vector; performing a forward function, comprising: for each latent variable, determining, for each latent embedding vector in the codebook, a respective similarity score between the latent embedding vector and the respective training encoded vector for the latent variable; and selecting, for each latent variable, a latent embedding vector that is nearest to the training encoded vector for the latent variable according to the respective similarity scores; performing a backward function, comprising: for each latent variable, generating a soft latent embedding vector by computing a respective weight for each of the latent embedding vectors based on the respective similarity scores for the latent embedding vectors and computing a weighted sum of the latent embedding vectors in accordance with the respective weights; and generating a training output for a training task from the selected latent embedding vectors generated by performing the forward function; and updating the current values of the encoder network parameters and the latent embedding vectors using a loss that is based on, for each training data item, the training output for the training task and the soft latent embedding vectors generated by performing the backward function.

In another aspect, there is described another method of training an encoder neural network having a set of encoder network parameters and of updating a codebook of latent embedding vectors stored in a memory. The method includes receiving a set of one or more training data items; for each training data item: processing the training data item through the encoder neural network to generate a training encoder output; performing a forward function on the training encoder output to generate a hard selection of one or more latent embedding vectors from the codebook, wherein the hard selection is used to generate a training output for a training task; performing a backward function on the training encoder output, different from the forward function, to generate a soft selection of latent embedding vectors from the codebook, wherein the soft selection is used to update at least one of the encoder neural network parameters and the codebook of latent embedding vectors; generating the training output for the training task using the hard selection of latent embedding vectors; and updating at least one of the encoder neural network parameters and the codebook of latent embedding vectors using a loss based on the training output and the soft selection of latent embedding vectors.

In some implementations of the above aspect a “hard” selection is a selection of a single one of the latent embedding vectors in the codebook. In some implementations of the above aspect a “soft” selection is a combination of two or more latent embedding vectors from the codebook.

In another aspect, there is described another method of training an encoder neural network having a set of encoder network parameters and of updating a codebook of latent embedding vectors stored in a memory. The method includes receiving a set of one or more training data items; for each training data item: processing the training data item through the encoder neural network to generate a training encoder output; performing a forward function on the training encoder output to obtain a latent embedding vector from the codebook, wherein the latent embedding vector is used to generate a training output for a training task; performing a backward function on the training encoder output, different from the forward function, to obtain a combined representation of a plurality of latent embedding vectors from the codebook, wherein the combined representation is used to update at least one of the encoder neural network parameters and the codebook of latent embedding vectors; generating the training output for the training task using the discrete representation; and updating at least one of the encoder neural network parameters and the codebook of latent embedding vectors using a loss based on the training output and the combined representation.

In another aspect, there is described a method that includes obtaining a new data item; processing the new data item through an encoder neural network to generate an encoder output that comprises, for each of one or more latent variables, a respective encoded vector; performing a forward function, comprising: for each latent variable, determining, for each latent embedding vector in a codebook of latent vectors, a respective similarity score between the latent embedding vector and the respective encoded vector for the latent variable; and selecting, for each latent variable, a latent embedding vector that is nearest to the encoded vector for the latent variable according to the respective similarity scores; and generating a compressed representation of the new data item that identifies the selected latent embedding vectors for the latent variables, wherein the encoder neural network has been trained and the latent vectors in the codebook have been learned by performing the respective operations of any one of the above aspects.

In another aspect, there is described a method that includes obtaining a data item; processing the data item through an encoder neural network to generate an encoder output that comprises, for each of one or more latent variables, a respective encoded vector; for each latent variable, determining, for each latent embedding vector in a codebook of latent vectors, a respective similarity score between the latent embedding vector and the respective encoded vector for the latent variable; and selecting, for each latent variable, a latent embedding vector that is nearest to the training encoded vector for the latent variable according to the respective similarity scores; and generating a compressed representation of the new data item that identifies the selected latent embedding vectors for the latent variables, wherein the encoder neural network and the codebook have been trained using a loss function that is based on both (a) a representation derived from selecting a single latent embedding vector from the codebook for a latent variable, and (b) a representation derived from a combination of multiple latent embedding vectors from the codebook for the latent variable.

In another aspect, there is described a method that includes obtaining a new data item; processing the new data item through an encoder neural network to generate an encoder output that comprises, for each of one or more latent variables, a respective encoded vector; for each latent variable, determining, for each latent embedding vector in a codebook of latent vectors, a respective similarity score between the latent embedding vector and the respective encoded vector for the latent variable; and selecting, for each latent variable, a latent embedding vector that is nearest to the encoded vector for the latent variable according to the respective similarity scores; and generating a compressed representation of the new data item that identifies the selected latent embedding vectors for the latent variables, wherein (a) an output of a forward function generated by selecting a single latent embedding vector from the codebook for a latent variable, and (b) an output of a backward function derived from a combination of multiple latent embedding vectors from the codebook for the latent variable.

In another aspect, there is described a system of one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform the respective operations of any of the above methods. In yet another aspect, there is described one or more computer readable media, e.g., one or more non-transitory computer readable media, storing instructions that when executed by one or more computers cause the one or more computers to perform the respective operations of any of the above methods.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.

The techniques described in this specification jointly train an encoder neural network and learn a codebook of embedding vectors so that discrete representations generated after the joint training effectively represent the corresponding data items.

The discrete representations generated from a given data item may collectively have a smaller size than the data item they represent (e.g., as measured by their collective number of bits, which may be lower than the number of bits of the data item); that is, the discrete representations may be compressed representations of the data item.
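The size comparison above can be made concrete with a short worked calculation. The following is a minimal sketch, not part of the described method: the image resolution, the number of latent variables, and the codebook size are all hypothetical values chosen for illustration, and the helper names are assumptions.

```python
def compressed_size_bits(num_latents: int, codebook_size: int) -> int:
    """Bits needed to store one codebook index per latent variable."""
    # Smallest number of bits that can address every codebook entry.
    bits_per_index = (codebook_size - 1).bit_length()
    return num_latents * bits_per_index


def raw_image_size_bits(height: int, width: int,
                        channels: int = 3, bits_per_channel: int = 8) -> int:
    """Bits in an uncompressed image with the given shape."""
    return height * width * channels * bits_per_channel


# Hypothetical setting: a 256x256 RGB image represented by a 16x16 grid
# of latent variables, each pointing into a codebook of 1024 vectors.
raw_bits = raw_image_size_bits(256, 256)
compressed_bits = compressed_size_bits(num_latents=16 * 16, codebook_size=1024)
ratio = raw_bits / compressed_bits
```

Under these assumed numbers the identifiers occupy 2,560 bits against 1,572,864 bits for the raw pixels. The codebook itself is not counted, since it is shared across all data items.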

Optionally, the compressed representations can be used for a computing task of decompression, i.e., regenerating the data item, or generating an estimate of the data item. However, they may alternatively or additionally be used as the input for another computational task. Because of the reduced size of the discrete representations compared to the data item, the computing task may be performed, in a given time, using hardware having reduced computing power (e.g., measured as floating point operations per second) and/or memory.

The joint training of the encoder and the codebook improves the efficiency of the compression for a given data item distribution. In some implementations, the training process jointly trains the encoder network and the codebook based on a loss function which includes a term which encourages the generation of discrete representations which can be used to perform the computing task with high quality (according to a quality metric), thus encouraging the discrete representations to preserve information in the data items which is useful for the computing task.

In particular, unlike other approaches to learning discrete representations, the described techniques use inconsistent feedforward and backward functions throughout the training. In other words, the described techniques use a “hard” selection generated by a forward function to compute task outputs during training while updating the parameters of the encoder neural network and the codebook using a “soft” selection generated by a backward function.

A “hard” selection is a selection of a single one of the vectors in the codebook. A “soft” selection, on the other hand, is a combination of two or more vectors in the codebook, i.e., so that the resulting “combined” vector may not match any single one of the vectors in the codebook.

This training update updates the entire codebook globally, unlike other updating techniques that perform only a forward function and then perform straight-through gradient copy and therefore only update the latent embedding vectors that each encoder output selects. As a result, these “global” updates result in improved training efficiency, which is important to minimize any gap brought by vector quantization, i.e., minimize information loss due to quantizing the continuous representation instead of directly learning continuous representations.

Moreover, making updates using the backward function results in faster training and stronger performance after training than the straight-through gradient copy technique. In other words, relative to existing techniques, the described techniques reduce training time and reduce the amount of computational resources required to perform the training, e.g., the amount of memory and processor cycles required, while yielding improved representation quality after training. In some cases, the described techniques incorporate auxiliary losses to ensure that the quality of the training persists despite the inconsistent use of the forward and backward functions.

For example, the system can make use of a consistency loss that minimizes the gap between forward function and backward function outputs so that updating the encoder parameters using the output of the backward function can also lower the task loss even though the task outputs are computed using the output of the forward function.

As another example, the system can make use of a coverage loss that maximizes the codebook utilization rate so that vector quantization can effectively take advantage of scaling up codebook sizes.
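The codebook utilization rate mentioned above can be measured directly from the hard selections. The sketch below is only a diagnostic metric under the assumption that utilization means the fraction of codebook entries selected at least once over a batch; it is not the coverage loss itself, whose form is not given in this passage, and the function name is hypothetical.

```python
def utilization_rate(selected_indices, codebook_size):
    """Fraction of codebook entries chosen at least once.

    `selected_indices` is a flat list of codebook indices produced by the
    hard selection over a batch (an assumed bookkeeping format).
    """
    return len(set(selected_indices)) / codebook_size


# Five selections that touch only three of eight codebook entries.
batch_selection = [0, 0, 3, 1, 3]
rate = utilization_rate(batch_selection, codebook_size=8)
```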

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A shows an example inference system.

FIG. 1B shows an example training system.

FIG. 2 is a flow diagram of an example process for training the encoder neural network.

FIG. 3 is a flow diagram of an example process for updating the current values of the network parameters of the encoder neural network.

FIG. 4 is a flow diagram of an example process for updating the codebook of latent embedding vectors.

FIG. 5 is a flow diagram of an example process for generating a discrete representation after training.

FIG. 6 shows an example of the performance of the described techniques.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

FIG. 1A shows an example inference system 100. The inference system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.

The inference system uses an encoder neural network 110 to generate discrete representations 112 of data items 102. For example, each data item can be, e.g., an image 104 (that is, one or more intensity values associated with each of an array of pixels, e.g., a 2-dimensional array), an audio signal 106, or a video 108 (that is, an ordered sequence of video frames which are images). An audiovisual item with an audio track and a video is here considered a type of video. In the case of an image or video, the inference system forms the discrete representations 112 based on pixel-level data in the image or video. The image 104 or video 108 may be one captured from the real world (by a camera device), and the audio signal may be one captured from the real world (e.g., by a microphone).

The representations 112 are referred to as “discrete” because the representation of a given data item 102 represents the data item 102 as a collection of latent embedding vectors selected from a discrete codebook 120 (also referred to as a “set” or a “vocabulary”) of latent embedding vectors, i.e., a codebook that includes a fixed number of embedding vectors.

Discrete representations 112 can also be referred to as quantized or compressed representations.

For example, the discrete representation 112 can identify, for each of one or more latent variables, a respective latent embedding vector from the codebook 120.

This is in contrast to a “continuous” representation of a data item 102, where the only constraint on the embedding vectors in the representation is that each numeric value in each embedding vector be representable in the numerical format used by the system 100 when processing inputs through the neural network 110.

For example, when the data item 102 is an image 104, the set of latent variables can include multiple latent variables and each latent variable can correspond to a spatial region within the image 104.

As another example, when the data item 102 is a video 106, the set of latent variables can include multiple latent variables and each latent variable can correspond to a spatial region within one of the video frames of the video or a spatio-temporal region within multiple video frames of the video 106. As another example, when the data item 102 is an audio signal 108, the set of latent variables can include one or more latent variables and each latent variable can correspond to a time window within the audio signal 108.

In particular, the encoder neural network 110 is configured to process an input data item 102 in accordance with current values of encoder network parameters of the encoder neural network 110 to generate an encoder output that includes, for each of the one or more latent variables, a respective encoded vector.

Generally, the encoder neural network 110 can have any appropriate architecture, e.g., can be a Transformer neural network, a vision Transformer (ViT) neural network, a convolutional neural network, e.g., a ResNet, a recurrent neural network, a state space model (SSM), and so on. That is, the encoder neural network 110 can have any architecture that is appropriate for mapping a data item 102 of a corresponding type to an output that includes a respective vector for each of the latent variables. As specific examples, when the data item 102 is an image or a video, the encoder neural network 110 can be a ViT, a convolutional neural network, or a neural network that includes both attention layers and convolutional layers. As another specific example, when the data item 102 is an audio signal, the encoder neural network 110 can be a recurrent neural network or a convolutional neural network. Other types of neural network architectures can also be used.

To generate a discrete representation 112 of a data item 102, the system 100 processes the data item 102 through the encoder neural network 110 to generate an encoder output that includes, for each of the one or more latent variables, a respective encoded vector.

The system 100 then performs a forward function 140.

Performing the forward function 140 includes, for each latent variable, determining, for each latent embedding vector in the codebook 120, a respective similarity score between the latent embedding vector and the respective encoded vector for the latent variable. Performing the forward function 140 also includes selecting, for each latent variable, a latent embedding vector 142 that is nearest to the encoded vector for the latent variable according to the respective similarity scores, e.g., the latent embedding vector that is most similar to the encoded vector for the latent variable according to the respective similarity scores.

As a result, the output of the forward function 140 is the discrete representation 112 that identifies a respective selected latent embedding vector 142 from the codebook 120 for each latent variable. For example, the discrete representation 112 can include an identifier for each selected latent embedding vector 142 or can include the selected latent embedding vectors 142 themselves. In the case that the discrete representation 112 includes an identifier, the identifiers can be compact identifiers, e.g., integers, that can be stored with minimal memory and transmitted using minimal bandwidth, thereby yielding a discrete representation 112 that is a compressed representation of the data item 102.
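As an illustration of the identifier-based form of the discrete representation, the following minimal sketch serializes a list of selected codebook indices to bytes and later recovers the corresponding latent embedding vectors by lookup. The 16-bit little-endian wire format, the toy codebook, and the helper names `pack_indices`, `unpack_indices`, and `lookup` are assumptions made for this example only; the encoding of identifiers is otherwise left open.

```python
import struct


def pack_indices(indices):
    """Serialize codebook indices as little-endian unsigned 16-bit integers.

    Assumes the codebook has at most 65,536 entries.
    """
    return struct.pack(f"<{len(indices)}H", *indices)


def unpack_indices(payload):
    """Inverse of pack_indices: recover the list of integer identifiers."""
    count = len(payload) // 2
    return list(struct.unpack(f"<{count}H", payload))


def lookup(codebook, indices):
    """Map each identifier back to its latent embedding vector."""
    return [codebook[i] for i in indices]


# Toy codebook of three 2-dimensional latent embedding vectors.
codebook = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
selected = [2, 0, 0, 1]  # one identifier per latent variable

payload = pack_indices(selected)
restored = lookup(codebook, unpack_indices(payload))
```

Here four latent variables occupy eight bytes, and a receiver holding the same codebook can rebuild the selected vectors exactly.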

Prior to the inference system 100 using the encoder neural network 110 to generate discrete representations, a training system trains the encoder neural network 110 and learns the codebook 120 so that the discrete representations effectively represent the corresponding data items.

In particular, unlike during inference when the system performs only the forward function 140, during training, the training system performs both the forward function 140 and a backward function that generates a “soft” latent embedding vector for each latent variable.

Performing the training is described in more detail below.

After the encoder neural network 110 has been trained, representations 112 generated by the trained encoder neural network 110 can be used, e.g., by the inference system 100 or by another system that receives the representations 112 from the inference system 100, to perform one or more downstream tasks.

For example, the discrete representations 112 generated by the encoder neural network 110 can be used to train a generative neural network that generates data items conditioned on discrete representations. One example of such a neural network is an image generation neural network that generates images conditioned on discrete representations. That is, the image generation neural network can receive an input that includes a discrete representation and generate an output image conditioned on the discrete representation. For example, the generated image can be a reconstruction of an image that would be represented by the discrete representation. As another example, the input to the image generation neural network can include another conditioning input, and the generated image can be a reconstruction of an image that would be represented by the discrete representation, but modified as indicated by the other conditioning input. Alternatively, the input to the image generation neural network can include only another type of conditioning input, and the image generation neural network can generate a discrete representation of the output image. In particular, the image generation neural network can generate a sequence of tokens from the codebook 120, e.g., conditioned on another sequence of tokens from the vocabulary or on another sequence representing a different type of conditioning input or both, and then an “inverter” or “decoder” neural network can decode the sequence of tokens to generate an image or other data item.

As another example, the discrete representations 112 can be used for data item compression, e.g., so that the discrete representations 112 are used to later reconstruct an input data item by a reconstruction neural network.

As yet another example, the discrete representations 112 can be used as a representation of the data item 102, e.g., a representation of an image or video in visual understanding tasks, e.g., image (or video)-text retrieval tasks, image (or video, or audio, or audiovisual) classification tasks (that is, inferring which of a set of classes an image (or video, or audio item) belongs to), image (or video) captioning tasks, segmentation (that is, identifying portions of an image, or video, or audio item having a certain property), generating control data for a real-world electromechanical agent (e.g. robot) based on an image or video of a real-world environment with which the agent interacts, and visual question answering tasks.

In some cases, as will be described in more detail below, the discrete representations 112 can be used to perform multi-modal tasks by a multi-modal generative neural network.

FIG. 1B shows an example training system 150. The training system 150 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.

The training system 150 trains the encoder neural network 110 described above.

In some implementations, the encoder neural network 110 includes an initial, pretrained backbone neural network and an additional neural network.

In these implementations, the system 150 trains the additional neural network while either holding the pre-trained backbone neural network fixed or fine-tuning the pretrained backbone.

For example, the backbone neural network can generate an initial feature representation of the input data item and the additional neural network can project or update the initial feature representation. For example, the backbone neural network can be one of the neural networks described above while the additional neural network includes a linear projection layer, one or more Transformer neural network layers, or both.

In some other implementations, the system 150 trains the entire encoder neural network 110 “from scratch.”

The system 150 can train the encoder neural network 110 jointly with the codebook 120, i.e., can learn the latent embedding vectors in the codebook 120 jointly with the training of the encoder neural network 110.

Generally, during the training, the system 150 receives a set of one or more training data items 130.

For each training data item 130, the system 150 processes the training data item 130 through the encoder neural network 110 in accordance with current values of the encoder network parameters of the encoder neural network 110 to generate a training encoder output that includes, for each of one or more latent variables, a respective training encoded vector.

The system 150 then performs the forward function 140.

As described above, performing the forward function 140 includes, for each latent variable, determining, for each latent embedding vector in the codebook, a respective similarity score between the latent embedding vector and the respective training encoded vector for the latent variable and selecting, for each latent variable, a latent embedding vector 142 that is nearest to the training encoded vector for the latent variable according to the respective similarity scores.

Unlike at inference, the system also performs a backward function 154.

Performing the backward function includes, for each latent variable, generating a soft latent embedding vector 152 by computing a respective weight for each of the latent embedding vectors based on the respective similarity scores for the latent embedding vectors and computing a weighted sum of the latent embedding vectors in accordance with the respective weights.

The system 150 then generates a respective training output 160 for each of one or more training tasks from the selected latent embedding vectors 142 generated by performing the forward function 140.

The system 150 then updates the current values of the encoder network parameters and the latent embedding vectors using a loss 170 that is based on, for each training data item 130, the training output 160 for the training task and the soft latent embedding vectors 152 generated by performing the backward function 154. Generally, the loss 170 includes one or more task losses for one or more training tasks and, optionally, one or more auxiliary losses, e.g., a coverage loss, a consistency loss, or both.

To compute the task losses, for each training task, the system 150 generally processes the output of the forward function using one or more additional components 180 for the training task to generate an output 160 for the training task, and then computes the task loss using the task output.

The system can then backpropagate gradients of the task loss through the additional components 180 in order to train the encoder neural network 110, as is described below. Optionally, some or all of the additional components 180 can be trained jointly with the encoder neural network 110.

One example of a training task is a reconstruction task that requires reconstructing the data item 130. In this example, the additional components 180 for the reconstruction task include a decoder neural network that generates a reconstruction of the training data item 130 from the selected latent vectors 142. The task loss may be calculated using the training output 160 and the training data item 130.

Another example of a training task is a contrastive learning task that computes a contrastive loss using the forward function output. In this example, each training data item 130 is one of a pair of data items and the additional components 180 include (i) a first component that generates a contrastive representation of the training data item 130 from the selected latent vectors 142 and (ii) a second component that generates a contrastive representation of the other data item in the same pair as the training data item 130.

Another example of a training task is a sequence generation task that requires generating an output sequence conditioned on the selected latent embedding vectors. Examples of such tasks include image captioning and visual question answering tasks. In this example, the additional components 180 include a component, e.g., a generative neural network, that generates the output sequence conditioned on the selected latent vectors 142.

FIG. 2 is a flow diagram of an example process 200 for training the encoder neural network. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the training system 150 of FIG. 1B, appropriately programmed, can perform the process 200.

The system receives a set of one or more training data items (step 202).

The system then performs steps 204-210 for each of the training data items.

The system processes the training data item through the encoder neural network in accordance with current values of the encoder network parameters of the encoder neural network to generate a training encoder output (step 204). As described above, the training encoder output includes, for each of one or more latent variables, a respective training encoded vector.

The system performs a forward function (step 206).

As part of performing the forward function, the system determines, for each latent variable and for each latent embedding vector in the codebook, a respective similarity score between the latent embedding vector and the respective training encoded vector for the latent variable.

For example, the similarity score can be equal to the dot product between the latent embedding vector and the training encoded vector for the latent variable. In other words, the system can determine, for each latent embedding vector in the codebook, the respective similarity score between the latent embedding vector and the respective training encoded vector for the latent variable by computing a dot product between the latent embedding vector and the respective training encoded vector.

Thus, as a result, the system has, for each latent variable, a respective similarity score for each of the latent embedding vectors. The system then selects, for each latent variable, the latent embedding vector that is nearest to the training encoded vector for the latent variable according to the respective similarity scores for the latent variable.

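Thus, when the similarity score is the dot product, the training encoder output Q includes q training encoded vectors that each have dimensionality d, and the codebook C includes c latent embedding vectors, the operations of the forward function to generate a matrix O of the selected latent embedding vectors can be represented as:

$M = QC^{\top}, \qquad O_i = C_{j_i^*}, \quad j_i^* = \arg\max_{j} M_{ij}$,

where M is the q × c matrix of similarity scores, $O_i$ is the i-th row of O, and $C_j$ is the j-th latent embedding vector in the codebook.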
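A minimal NumPy sketch of this forward function, assuming dot-product similarity and a codebook stored as a c × d array, is shown below. The function name `forward_function` and the toy values are illustrative and do not come from the specification.

```python
import numpy as np

def forward_function(Q, C):
    """Hard selection of one codebook vector per latent variable.

    Q: (q, d) training encoded vectors; C: (c, d) codebook.
    """
    M = Q @ C.T                 # (q, c) dot-product similarity scores
    idx = np.argmax(M, axis=1)  # index of the highest-scoring codebook row
    O = C[idx]                  # (q, d) selected latent embedding vectors
    return M, idx, O

# Toy inputs (illustrative values only).
Q = np.array([[1.0, 0.0], [0.0, 2.0], [0.4, 0.6]])
C = np.array([[1.0, 0.0], [0.0, 1.0]])
M, idx, O = forward_function(Q, C)
```

Here each row of `O` is an exact copy of a codebook row, which is what makes the selection "hard".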

The system performs a backward function (step 208).

As part of performing the backward function, the system generates, for each latent variable, a soft latent embedding vector by computing a respective weight for each of the latent embedding vectors based on the respective similarity scores for the latent embedding vectors and computing a weighted sum of the latent embedding vectors in accordance with the respective weights.

That is, the system determines a weighted sum of the latent embedding vectors in which each of the latent embedding vectors is weighted by a weight that is based on the similarity score for the latent embedding vector.

For example, the system can determine the weights by applying a softmax function with a specified temperature to the respective similarity scores for the latent embedding vectors.

In this example, a matrix S of the soft latent embedding vectors can be expressed as:

$S = \mathrm{Softmax}(M/\sigma)\,C$, where $\sigma$ is the temperature.
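As a hedged illustration, the soft latent embedding vectors can be sketched in NumPy as follows, assuming dot-product similarity and a softmax applied independently to each row of M. The function name `soft_embeddings` and its `temperature` argument are illustrative names.

```python
import numpy as np

def soft_embeddings(Q, C, temperature=1.0):
    """Return softmax weights W (q, c) and soft embeddings S = W @ C (q, d)."""
    M = Q @ C.T                                          # (q, c) similarity scores
    logits = M / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    W = np.exp(logits)
    W = W / W.sum(axis=1, keepdims=True)                 # each row sums to 1
    S = W @ C                                            # weighted sum of codebook rows
    return W, S
```

With a very small temperature the weights concentrate on the highest-scoring codebook row, so S approaches the hard selection; with a large temperature the weights approach a uniform average over the codebook.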

In some implementations, the system holds the temperature fixed during training, e.g., the specified temperature can be a hyperparameter, or anneals the temperature during training according to a fixed schedule.

In some other implementations, the specified temperature is learned during the training of the encoder neural network. That is, the system can update the specified temperature, e.g., by computing gradients of task and, optionally, auxiliary losses with respect to the specified temperature during the training and applying an optimizer to the computed gradients.

Generally, the value of the temperature controls how widely a training signal spreads over nearby latent embedding vectors. Intuitively speaking, larger codebook sizes will result in denser latent embedding vectors in the embedding space which might need a smaller temperature to limit spreading the gradients. To remove the requirement for tuning temperature for different codebook sizes, the system can set the temperature as a learnable parameter and learn which temperature gives lower training losses as training progresses.

In some cases, rather than compute a weighted sum, the system can perform a different combination of the vectors in the codebook that makes use of the weights. For example, the system can add randomly sampled noise to each of the weights before performing the weighted sum, can add a respective learned bias to each of the weights before performing the weighted sum, or can apply a different transformation to the weights before using the weights to compute the sum.

Thus, while performing the forward function yields a "hard" selection of one of the latent embedding vectors for each of the latent variables, performing the backward function yields a "soft" combination of all of the latent embedding vectors for each of the latent variables.

In some cases, to further improve computational efficiency during training, the system can perform the forward function, the backward function, or both using only a proper subset, i.e., using less than all, of the codebook vectors. For example, the system can randomly sample the subset or can perform a pre-filtering step on the codebook to select the subset for a given input data item or a given batch of input data items.
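One way the random-sampling variant could look is sketched below in NumPy. This is an assumption-laden illustration: the helper names, the uniform sampling without replacement, and the choice to map results back to global codebook indices are not specified in the text.

```python
import numpy as np

def sample_codebook_subset(C, subset_size, rng):
    """Uniformly sample distinct codebook row indices (hypothetical helper)."""
    return rng.choice(C.shape[0], size=subset_size, replace=False)

def select_within_subset(Q, C, subset_idx):
    """Run the hard selection against only the sampled codebook rows."""
    M_sub = Q @ C[subset_idx].T         # (q, subset_size) similarity scores
    local = np.argmax(M_sub, axis=1)    # position inside the subset
    return subset_idx[local]            # map back to global codebook indices
```

Scoring against the subset reduces the similarity computation from q × c to q × subset_size dot products per step.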

The system generates a training output for a training task from the selected latent embedding vectors generated by performing the forward function (step 210). That is, while the system performs both the forward and backward function, the system uses only the output of the forward function to generate the training output for the training task.

As described above, the training task can be any of a variety of training tasks, e.g., a reconstruction task, a contrastive learning task, and so on.

The system updates (i) the current values of the encoder network parameters and (ii) the latent embedding vectors using a loss that is based on, for each training data item, the training output for the training task and the soft latent embedding vectors generated by performing the backward function (step 212).

That is, although the system does not use the soft latent embedding vectors generated by performing the backward function to determine the task output, the system does use the soft latent embedding vectors to train the encoder neural network.

FIG. 3 is a flow diagram of an example process 300 for updating the current values of the network parameters of the encoder neural network. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the training system 150 of FIG. 1B, appropriately programmed, can perform the process 300.

The system determines a gradient of the task loss with respect to the selected latent embedding vectors, i.e., the output of the forward function (step 302). For example, the system can determine this gradient through backpropagation. That is, the system determines dL/dO, where L is the task loss and O is a matrix of the selected latent embedding vectors generated by performing the forward function.

The system determines a gradient of the soft latent embedding vectors with respect to the training encoder output (step 304). That is, the system determines dS/dQ, where S is the set of soft latent embedding vectors and Q is the training encoder output.

The system determines a gradient of the task loss with respect to the current values of the encoder network parameters from (i) the gradient of the task loss with respect to the selected latent embedding vectors and (ii) the gradient of the soft latent embedding vectors with respect to the training encoder output (step 306).

In particular, the system determines a gradient of the task loss with respect to the training encoder output from (i) the gradient of the task loss with respect to the selected latent embedding vectors and (ii) the gradient of the soft latent embedding vectors with respect to the training encoder output. For example, the gradient of the task loss can be the product of (i) and (ii). That is, the system sets dL/dQ to be equal to (dL/dO)(dS/dQ).

The system then determines the gradient of the task loss with respect to the current values of the network parameters by backpropagating the gradient of the task loss with respect to the training encoder output through the encoder neural network.
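The product of step 306 can be evaluated without materializing the Jacobian dS/dQ. The following NumPy sketch assumes dot-product similarity and softmax weights with a fixed temperature; the function names and the argument `G` (holding dL/dO, with the same shape as O) are illustrative and not taken from the specification.

```python
import numpy as np

def softmax_rows(X):
    """Row-wise softmax with max subtraction for numerical stability."""
    X = X - X.max(axis=1, keepdims=True)
    E = np.exp(X)
    return E / E.sum(axis=1, keepdims=True)

def encoder_output_grad(Q, C, G, temperature=1.0):
    """Return dL/dQ = (dL/dO)(dS/dQ) for S = softmax(Q C^T / temperature) C.

    Q: (q, d) encoder output, C: (c, d) codebook, G: (q, d) holding dL/dO.
    """
    W = softmax_rows((Q @ C.T) / temperature)   # (q, c) softmax weights
    A = G @ C.T                                 # (q, c) gradient w.r.t. the weights
    # Backpropagate through the row-wise softmax and the temperature scaling.
    dM = W * (A - (W * A).sum(axis=1, keepdims=True)) / temperature
    return dM @ C                               # (q, d) since M = Q C^T
```

The result has the same shape as Q and can be passed to ordinary backpropagation through the encoder neural network in place of a gradient through the non-differentiable hard selection.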

The system updates the current values of the network parameters using the gradient of the task loss (step 308). For example, the system can apply an optimizer, e.g., stochastic gradient descent, Adam, Adafactor, or another appropriate optimizer, to the current values of the network parameters using the gradient of the task loss to generate updated values of the network parameters.

FIG. 4 is a flow diagram of an example process 400 for updating the latent embedding vectors. For convenience, the process 400 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the training system 150 of FIG. 1B, appropriately programmed, can perform the process 400.

The system determines a gradient of the soft latent embedding vectors with respect to the latent embedding vectors (step 402). That is, the system determines dS/dC, where S is the set of soft latent embedding vectors and C is the codebook of latent embedding vectors.

The system determines the gradient of the task loss with respect to the latent embedding vectors from (i) the gradient of the task loss with respect to the selected latent embedding vectors and (ii) the gradient of the soft latent embedding vectors with respect to the latent embedding vectors (step 404). For example, the gradient of the task loss with respect to the latent embedding vectors can be equal to the product of (i) and (ii). That is, the system sets dL/dC to be equal to (dL/dO)(dS/dC).

In some implementations, to improve training quality, the system can also use an auxiliary loss when updating the latent embedding vectors.
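The codebook gradient of step 404 can be sketched in the same way. This NumPy example again assumes dot-product similarity and a fixed-temperature softmax; `codebook_grad` and `G` (holding dL/dO) are hypothetical names. The codebook enters S twice, once as the vectors being averaged and once through the similarity scores, so the gradient has two terms.

```python
import numpy as np

def softmax_rows(X):
    """Row-wise softmax with max subtraction for numerical stability."""
    X = X - X.max(axis=1, keepdims=True)
    E = np.exp(X)
    return E / E.sum(axis=1, keepdims=True)

def codebook_grad(Q, C, G, temperature=1.0):
    """Return dL/dC = (dL/dO)(dS/dC) for S = softmax(Q C^T / temperature) C.

    Q: (q, d) encoder output, C: (c, d) codebook, G: (q, d) holding dL/dO.
    """
    W = softmax_rows((Q @ C.T) / temperature)   # (q, c) softmax weights
    A = G @ C.T                                 # (q, c) gradient w.r.t. the weights
    dM = W * (A - (W * A).sum(axis=1, keepdims=True)) / temperature
    direct = W.T @ G          # path through the weighted sum S = W C
    via_scores = dM.T @ Q     # path through the similarity scores M = Q C^T
    return direct + via_scores                  # (c, d)
```

Because the softmax weights are strictly positive, every codebook row generally receives some gradient, not only the rows picked by the hard selection.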

For example, the auxiliary loss can include a consistency loss that encourages consistency between the selected latent embedding vectors and the soft latent embedding vectors. That is, when the selected latent embedding vectors and the soft latent embedding vectors are not consistent, computing the gradient of the task loss with respect to the training encoder output as described above in step 306 can result in a poor learning signal being provided to the encoder neural network because (dL/dO)(dS/dQ) is not equal to (dL/dS)(dS/dQ).

Similarly, when the selected latent embedding vectors and the soft latent embedding vectors are not consistent, computing the gradient of the task loss with respect to the codebook as described above in step 404 can result in a poor learning signal being provided to the codebook because (dL/dO)(dS/dC) is not equal to (dL/dS)(dS/dC).

Including a consistency loss can reduce the discrepancy and improve the quality of the learning signal that is provided to the encoder neural network and the codebook.

In some cases, rather than minimizing the discrepancy between S and O directly, the system minimizes the difference between O and Q, which tightens the differences between O, Q, and S at the same time.

As a particular example of this, the consistency loss can measure a negative of a sum of or an average of respective first nearest similarity scores for each of the latent variables. The respective first nearest similarity score for a given latent variable is a similarity score between the training encoded vector for the given latent variable and the latent embedding vector that is nearest to the training encoded vector for the given latent variable according to the respective similarity scores.

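That is, the consistency loss L_consistency can measure:

$-\sum_{i=1}^{q} \max_{j} M_{ij}$ or $-\frac{1}{q}\sum_{i=1}^{q} \max_{j} M_{ij}$,

where $M_{ij}$ is the similarity score between the i-th training encoded vector and the j-th latent embedding vector.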

As a particular example, the consistency loss can be equal to or directly proportional to a constant minus the sum or the average of the respective first nearest similarity scores.

For example, when the constant is 1, the consistency loss can be equal to:

$L_{\text{consistency}} = 1 - \frac{1}{q}\sum_{i=1}^{q} \max_{j} M_{ij}$.

This loss moves each training encoded vector towards its selected latent embedding by maximizing their alignment computed by M.
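A short NumPy sketch of such a consistency loss follows. It uses the averaged form with the constant set to 1 and dot-product similarity. Treating the vectors as unit-normalized, so that the loss is non-negative, is an added assumption that the text does not state, and the function name is illustrative.

```python
import numpy as np

def consistency_loss(Q, C):
    """1 minus the mean, over encoded vectors, of the best similarity score.

    Q: (q, d) training encoded vectors; C: (c, d) codebook.
    Assumes (not stated in the text) unit-norm rows, so scores lie in [-1, 1].
    """
    M = Q @ C.T                      # (q, c) similarity scores
    nearest = M.max(axis=1)          # best score for each encoded vector
    return 1.0 - nearest.mean()
```

The loss is zero when every encoded vector coincides with a codebook row and grows as encoded vectors drift away from their selected embeddings.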

As another example, the auxiliary loss can include a coverage loss that encourages, for each latent embedding vector, a similarity between the latent embedding vector and a closest training encoded vector to the latent embedding vector according to the respective similarity scores.

The coverage loss is designed to maximize the codebook utilization rate so that vector quantization can effectively take advantage of scaling up codebook sizes. The gradient of the task loss above results in the codebook being updated globally so that, in theory, the entire codebook is updated toward a lower task loss, which already encourages a high utilization rate. In practice, however, latent embeddings that are far away from all the encoded vectors receive too little gradient, particularly when limited by the Softmax weights and a low temperature. To avoid this, the system uses the coverage loss to pull all latent embeddings into the regions of the space with a high density of encoded vectors. This pulling force should be light because, once a latent embedding enters the high-density areas, the task loss gradient will pick it up and update it towards a lower task loss L.

To achieve this, the system can use a coverage loss that measures a negative of a sum of or an average of respective second nearest similarity scores for each of the latent embedding vectors. The respective second nearest similarity score for a given latent embedding vector is a similarity score between the latent embedding vector and a training encoded vector that is nearest to the latent embedding vector according to the respective similarity scores.

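That is, the coverage loss L_coverage can measure:

$-\sum_{j=1}^{c} \max_{i} M_{ij}$ or $-\frac{1}{c}\sum_{j=1}^{c} \max_{i} M_{ij}$,

where $M_{ij}$ is the similarity score between the i-th training encoded vector and the j-th latent embedding vector.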

As a particular example, the coverage loss can be equal to or directly proportional to a constant minus the sum or the average of the respective second nearest similarity scores.

For example, when the constant is 1, the coverage loss can be equal to:

$L_{\text{coverage}} = 1 - \frac{1}{c}\sum_{j=1}^{c} \max_{i} M_{ij}$.
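The coverage loss mirrors the consistency loss with the roles of the two sets of vectors swapped: the maximum is taken over encoded vectors for each codebook entry. The NumPy sketch below uses the averaged form with the constant set to 1. The function name is illustrative, and unit-norm vectors are an added assumption.

```python
import numpy as np

def coverage_loss(Q, C):
    """1 minus the mean, over codebook rows, of the best similarity score.

    Q: (q, d) training encoded vectors; C: (c, d) codebook.
    Assumes (not stated in the text) unit-norm rows, so scores lie in [-1, 1].
    """
    M = Q @ C.T                      # (q, c) similarity scores
    nearest = M.max(axis=0)          # best score for each codebook row
    return 1.0 - nearest.mean()
```

For example, with a single encoded vector [1, 0] and a codebook containing [1, 0] and [0, 1], the second codebook row has no nearby encoded vector, so the loss is 0.5 rather than 0, and its gradient pulls that row toward the encoded vector.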

As a particular example, the auxiliary loss can be a combination, e.g., a sum, an average, or a weighted sum, of the consistency loss and the coverage loss. In these cases, the system can determine a gradient of the auxiliary loss with respect to the latent embedding vectors (step 406).

The system can then update the latent embedding vectors (step 408).

When the auxiliary loss is not used, the system can update the latent embedding vectors using the gradient of the task loss with respect to the latent embedding vectors. For example, the system can apply an optimizer, e.g., stochastic gradient descent, Adam, Adafactor, or another appropriate optimizer, to the latent embedding vectors and the gradient of the task loss with respect to the latent embedding vectors to generate updated latent embedding vectors.

When the auxiliary loss is used, the system can update the latent embedding vectors using the gradient of the task loss with respect to the latent embedding vectors and the gradient of the auxiliary loss with respect to the latent embedding vectors. For example, the system can determine a combined gradient, e.g., by determining a sum or a weighted sum of the two gradients, and then apply an optimizer, e.g., stochastic gradient descent, Adam, Adafactor, or another appropriate optimizer, to the latent embedding vectors and the combined gradient to generate updated latent embedding vectors.
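As a simple illustration of step 408, the sketch below combines the two gradients with a weighted sum and applies one plain stochastic gradient descent step. The names `update_codebook`, `aux_weight`, and `lr`, and their default values, are hypothetical. Adam or Adafactor would replace the final line with their own update rules.

```python
import numpy as np

def update_codebook(C, task_grad, aux_grad=None, aux_weight=1.0, lr=0.1):
    """One SGD step on the codebook using the task and (optional) auxiliary gradients."""
    if aux_grad is None:
        grad = task_grad                            # auxiliary loss not used
    else:
        grad = task_grad + aux_weight * aux_grad    # weighted sum of the two gradients
    return C - lr * grad
```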

FIG. 5 is a flow diagram of an example process 500 for performing inference using the trained encoder neural network. For convenience, the process 500 will be described as being performed by a system of one or more computers located in one or more locations. For example, an inference system, e.g., the inference system 100 of FIG. 1A, appropriately programmed, can perform the process 500.

The system obtains a new data item (step 502).

The system processes the new data item through the encoder neural network to generate an encoder output that includes, for each of the one or more latent variables, a respective encoded vector (step 504).

The system performs the forward function (step 506). As described above, performing the forward function includes, for each latent variable, determining, for each latent embedding vector in the codebook of latent vectors, a respective similarity score between the latent embedding vector and the respective encoded vector for the latent variable. Performing the forward function also includes selecting, for each latent variable, a latent embedding vector that is nearest to the encoded vector for the latent variable according to the respective similarity scores.

After training, the system does not perform the backward function.

The system generates a compressed representation of the new data item that identifies the selected latent embedding vectors for the latent variables (step 508). For example, the compressed representation can include a respective identifier, e.g., a respective integer or other compact identifier, for each of the selected latent embedding vectors that uniquely identifies the embedding vector among the other embedding vectors in the codebook, and therefore does not need to include the actual selected latent embedding vectors.
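The following NumPy sketch illustrates one possible form of such a compressed representation, assuming dot-product similarity and integer identifiers equal to codebook row indices. The names `compress`, `decompress`, and `E` (the encoder output for the new data item) are illustrative only.

```python
import numpy as np

def compress(E, C):
    """Return one integer codebook index per latent variable.

    E: (q, d) encoded vectors for the new data item; C: (c, d) codebook.
    """
    ids = np.argmax(E @ C.T, axis=1)
    # Store the indices in the smallest unsigned integer type that can hold c - 1.
    return ids.astype(np.min_scalar_type(C.shape[0] - 1))

def decompress(ids, C):
    """Recover the selected latent embedding vectors by codebook lookup."""
    return C[ids]
```

Any consumer that holds the same codebook can recover the selected vectors from the indices alone, which is why the vectors themselves need not be stored or transmitted.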

Optionally, the system can then perform a downstream task using the compressed representation of the new data item (step 510).

For example, the downstream task can be a task performed by a generative model. As a particular example, the generative model can be a multi-modal generative model.

For example, the system can process an input that includes the compressed representation and, optionally, other tokens of other modalities using the multi-modal generative model to generate an output for the downstream task.

The multi-modal generative model can be, e.g., a multi-modal language model neural network or a different neural network that processes an input sequence of tokens to generate an output sequence of tokens. As a particular example of the above, the generative neural network can be an auto-regressive neural network that generates the tokens in the output sequence auto-regressively, i.e., one after another.

One example of such a neural network is a decoder-only Transformer neural network. Examples of such neural networks include Gemini (described in arXiv:2403.05530), Gemma (described in arXiv:2403.08295), and PaliGemma (described in arXiv:2412.03555).

Other examples of such neural networks are neural networks that include both recurrent and attention layers. Examples of such neural networks include Recurrent Gemma (described in arXiv:2404.07839) and Griffin (described in arXiv:2402.19427).

Other examples of such a neural network include recurrent neural networks. Examples of such neural networks include Hawk (described in arXiv:2402.19427).

In some cases, the generative neural network can perform a multi-modal processing task that requires processing multi-modal data. In general, multi-modal data is a combination of two or more different types of data, e.g., two or more of audio data, image data, text data, or graph data. As one example the multi-modal data may comprise audio-visual data, comprising a combination of pixels of an image or of video and audio data representing values of a digitized audio waveform. As another example the multi-modal data may comprise a combination of i) text data representing text in a natural language and ii) pixels of an image or of video or audio data representing values of an audio waveform. Optionally, but not necessarily, the different types of data may represent the same or overlapping objects using the different modalities (types), and when processing multi-modal data the data may be mapped into a common embedding space.

As a particular example, the task is a multi-modal processing task that requires processing both text and image inputs. That is, the target output to be generated by the computer vision neural network for a given image depends on one or more outputs generated by the text processing neural network for one or more corresponding text inputs (and vice versa). Examples of such tasks include open-vocabulary image classification, open-vocabulary object detection, image captioning, text-based image search, image-based retrieval, and so on.

In particular, the neural network is capable of receiving network inputs and generating network outputs for multiple different machine learning tasks. Generally, two machine learning tasks are different if they have different desired outputs for the inputs received for the tasks. For example, two image classification tasks can be different if the object categories into which each task requires classifying input images are different. As another example, two robot learning tasks can be different if the two tasks require generating outputs defining actions to be performed by a robot to reach two different goals.

In practice, for any of these examples, the task to be performed by the neural network can be defined by (at least a part of) the network input, e.g., that is in the form of a prompt or a request, received by the neural network. In other words, the neural network will be able to perform any of these tasks when an appropriate prompt or request is received.

FIG. 6 shows an example 600 of the performance of the described techniques.

In particular, the example 600 shows the performance of the described techniques (denoted as "MaVQ" in the Figure) when training a Contrastive Captioners ("CoCa") neural network relative to (i) a baseline technique (first row) that does not use quantization and directly trains the neural network using continuous representation and (ii) three other techniques (L2, Gumbel, EMA) that use vector quantization during training. The Contrastive Captioners neural network is a neural network that processes images to generate representations of images and is trained jointly with a text processing neural network on a task loss that includes (i) a contrastive loss and (ii) a captioning loss. Contrastive Captioners are described in more detail in Yu, et al., CoCa: Contrastive Captioners are Image-Text Foundation Models, arXiv:2205.01917.

Thus, because the first row does not use quantization and all approaches train the same neural network on the same training data, the first row is a performance target for the remaining approaches due to the inherent information loss during quantization.

For each approach, the Figure shows the batch size used during the training, the codebook size, the batch coverage (a number that represents the fraction of vectors in the codebook that are updated when training on a given batch), and the results on three different downstream tasks: accuracy on image classification on the ImageNet data set, accuracy on image to text retrieval on the COCO data set, and accuracy on text to image retrieval on the COCO data set.

As can be seen from the results, the described techniques are the only ones that allow the CoCa neural network to be trained from scratch to achieve performance that approximates that of the neural network that is trained using continuous representations. For example, this may be because training using the inconsistent forward and backward functions and, in some cases, the auxiliary losses, allows the codebook size to be effectively scaled up to improve performance.

In this specification, the term "configured" is used in relation to computing systems and environments, as well as computer program components. A computing system or environment is considered "configured" to perform specific operations or actions when it possesses the necessary software, firmware, hardware, or a combination thereof, enabling it to carry out those operations or actions during operation. For instance, configuring a system might involve installing a software library with specific algorithms, updating firmware with new instructions for handling data, or adding a hardware component for enhanced processing capabilities. Similarly, one or more computer programs are "configured" to perform particular operations or actions when they contain instructions that, upon execution by a computing device or hardware, cause the device to perform those intended operations or actions.

The embodiments and functional operations described in this specification can be implemented in various forms, including digital electronic circuitry, software, firmware, computer hardware (encompassing the disclosed structures and their structural equivalents), or any combination thereof. The subject matter can be realized as one or more computer programs, essentially modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by or to control the operation of a computing device or hardware. The storage medium can be a storage device such as a hard drive or solid-state drive (SSD), a storage medium, a random or serial access memory device, or a combination of these. Additionally or alternatively, the program instructions can be encoded on a transmitted signal, such as a machine-generated electrical, optical, or electromagnetic signal, designed to carry information for transmission to a receiving device or system for execution by a computing device or hardware. Furthermore, implementations may leverage emerging technologies like quantum computing or neuromorphic computing for specific applications, and may be deployed in distributed or cloud-based environments where components reside on different machines or within a cloud infrastructure.

The term "computing device or hardware" refers to the physical components involved in data processing and encompasses all types of devices and machines used for this purpose. Examples include processors or processing units, computers, multiple processors or computers working together, graphics processing units (GPUs), tensor processing units (TPUs), and specialized processing hardware such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). In addition to hardware, a computing device or hardware may also include code that creates an execution environment for computer programs. This code can take the form of processor firmware, a protocol stack, a database management system, an operating system, or a combination of these elements. Embodiments may particularly benefit from utilizing the parallel processing capabilities of GPUs, in a General-Purpose computing on Graphics Processing Units (GPGPU) context, where code specifically designed for GPU execution, often called kernels or shaders, is employed. Similarly, TPUs excel at running optimized tensor operations crucial for many machine learning algorithms. By leveraging these accelerators and their specialized programming models, the system can achieve significant speedups and efficiency gains for tasks involving artificial intelligence and machine learning, particularly in areas such as computer vision, natural language processing, and robotics.

A computer program, also referred to as software, an application, a module, a script, code, or simply a program, can be written in any programming language, including compiled or interpreted languages, and declarative or procedural languages. It can be deployed in various forms, such as a standalone program, a module, a component, a subroutine, or any other unit suitable for use within a computing environment. A program may or may not correspond to a single file in a file system and can be stored in various ways. This includes being embedded within a file containing other programs or data (e.g., scripts within a markup language document), residing in a dedicated file, or distributed across multiple coordinated files (e.g., files storing modules, subprograms, or code segments). A computer program can be executed on a single computer or across multiple computers, whether located at a single site or distributed across multiple sites and interconnected through a data communication network. The specific implementation of the computer programs may involve a combination of traditional programming languages and specialized languages or libraries designed for GPGPU programming or TPU utilization, depending on the chosen hardware platform and desired performance characteristics.

In this specification, the term "engine" broadly refers to a software-based system, subsystem, or process designed to perform one or more specific functions. An engine is typically implemented as one or more software modules or components installed on one or more computers, which can be located at a single site or distributed across multiple locations. In some instances, one or more dedicated computers may be used for a particular engine, while in other cases, multiple engines may operate concurrently on the same one or more computers. Examples of engine functions within the context of AI and machine learning could include data pre-processing and cleaning, feature engineering and extraction, model training and optimization, inference and prediction generation, and post-processing of results. The specific design and implementation of engines will depend on the overall architecture and the distribution of computational tasks across various hardware components, including CPUs, GPUs, TPUs, and other specialized processors.

The processes and logic flows described in this specification can be executed by one or more programmable computers running one or more computer programs to perform functions by operating on input data and generating output. Additionally, graphics processing units (GPUs) and tensor processing units (TPUs) can be utilized to enable concurrent execution of aspects of these processes and logic flows, significantly accelerating performance. This approach offers significant advantages for computationally intensive tasks often found in AI and machine learning applications, such as matrix multiplications, convolutions, and other operations that exhibit a high degree of parallelism. By leveraging the parallel processing capabilities of GPUs and TPUs, significant speedups and efficiency gains compared to relying solely on CPUs can be achieved. Alternatively or in combination with programmable computers and specialized processors, these processes and logic flows can also be implemented using specialized processing hardware, such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs), for even greater performance or energy efficiency in specific use cases.

Computers capable of executing a computer program can be based on general-purpose microprocessors, special-purpose microprocessors, or a combination of both. They can also utilize any other type of central processing unit (CPU). Additionally, graphics processing units (GPUs), tensor processing units (TPUs), and other machine learning accelerators can be employed to enhance performance, particularly for tasks involving artificial intelligence and machine learning. These accelerators often work in conjunction with CPUs, handling specialized computations while the CPU manages overall system operations and other tasks. Typically, a CPU receives instructions and data from read-only memory (ROM), random access memory (RAM), or both. The elements of a computer include a CPU for executing instructions and one or more memory devices for storing instructions and data. The specific configuration of processing units and memory will depend on factors like the complexity of the AI model, the volume of data being processed, and the desired performance and latency requirements. Embodiments can be implemented on a wide range of computing platforms, from small embedded devices with limited resources to large-scale data center systems with high-performance computing capabilities. The system may include storage devices like hard drives, SSDs, or flash memory for persistent data storage.

Computer-readable media suitable for storing computer program instructions and data encompass all forms of non-volatile memory, media, and memory devices. Examples include semiconductor memory devices such as read-only memory (ROM), solid-state drives (SSDs), and flash memory devices; hard disk drives (HDDs); optical media; and optical discs such as CDs, DVDs, and Blu-ray discs. The specific type of computer-readable media used will depend on factors such as the size of the data, access speed requirements, cost considerations, and the desired level of portability or permanence.

To facilitate user interaction, embodiments of the subject matter described in this specification can be implemented on a computing device equipped with a display device, such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display, for presenting information to the user. Input can be provided by the user through various means, including a keyboard, touchscreens, voice commands, gesture recognition, or other input modalities depending on the specific device and application. Additional input methods can include acoustic, speech, or tactile input, while feedback to the user can take the form of visual, auditory, or tactile feedback. Furthermore, computers can interact with users by exchanging documents with a user's device or application. This can involve sending web content or data in response to requests or sending and receiving text messages or other forms of messages through mobile devices or messaging platforms. The selection of input and output modalities will depend on the specific application and the desired form of user interaction.

Machine learning models can be implemented and deployed using machine learning frameworks, such as TensorFlow or JAX. These frameworks offer comprehensive tools and libraries that facilitate the development, training, and deployment of machine learning models.

Embodiments of the subject matter described in this specification can be implemented within a computing system comprising one or more components, depending on the specific application and requirements. These may include a back-end component, such as a back-end server or cloud-based infrastructure; an optional middleware component, such as a middleware server or application programming interface (API), to facilitate communication and data exchange; and a front-end component, such as a client device with a user interface, a web browser, or an app, through which a user can interact with the implemented subject matter. For instance, the described functionality could be implemented solely on a client device (e.g., for on-device machine learning) or deployed as a combination of front-end and back-end components for more complex applications. These components, when present, can be interconnected using any form or medium of digital data communication, such as a communication network like a local area network (LAN) or a wide area network (WAN) including the Internet. The specific system architecture and choice of components will depend on factors such as the scale of the application, the need for real-time processing, data security requirements, and the desired user experience.

The computing system can include clients and servers that may be geographically separated and interact through a communication network. The specific type of network, such as a local area network (LAN), a wide area network (WAN), or the Internet, will depend on the reach and scale of the application. The client-server relationship is established through computer programs running on the respective computers and designed to communicate with each other using appropriate protocols. These protocols may include HTTP, TCP/IP, or other specialized protocols depending on the nature of the data being exchanged and the security requirements of the system. In certain embodiments, a server transmits data or instructions to a user's device, such as a computer, smartphone, or tablet, acting as a client. The client device can then process the received information, display results to the user, and potentially send data or feedback back to the server for further processing or storage. This allows for dynamic interactions between the user and the system, enabling a wide range of applications and functionalities.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
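
As a non-limiting illustration, the forward function and the backward function described in this specification can be sketched in plain Python as follows. The sketch assumes a dot-product similarity score and a softmax with a fixed temperature; the function names `forward_select` and `backward_soft`, the toy codebook, and the temperature value are hypothetical and chosen only for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors (assumed similarity score)."""
    return sum(a * b for a, b in zip(u, v))


def forward_select(encoded, codebook):
    """Forward function: score every codebook vector against the encoded
    vector and return the index of the highest-scoring one plus all scores."""
    scores = [dot(encoded, e) for e in codebook]
    index = max(range(len(codebook)), key=scores.__getitem__)
    return index, scores


def backward_soft(scores, codebook, temperature=1.0):
    """Backward function: softmax weights over the similarity scores and the
    weighted sum of all codebook vectors (the soft latent embedding vector)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(codebook[0])
    soft = [sum(w * e[d] for w, e in zip(weights, codebook)) for d in range(dim)]
    return soft, weights


# Hypothetical toy data: three 2-dimensional codebook vectors, one latent variable.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
encoded = [0.9, 0.2]

index, scores = forward_select(encoded, codebook)
soft, weights = backward_soft(scores, codebook, temperature=0.1)
print("selected index:", index)
print("soft vector:", soft)
```

In this sketch the selected vector `codebook[index]` would be passed to the training task, while the soft vector is the quantity through which gradients would be taken. The sketch computes only the two outputs and does not itself perform any gradient computation.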

Claims

1. A method of training an encoder neural network having a set of encoder network parameters and of updating a codebook of latent embedding vectors stored in a memory, wherein the method comprises: receiving a set of one or more training data items; for each training data item: processing the training data item through the encoder neural network in accordance with current values of the encoder network parameters of the encoder neural network to generate a training encoder output that comprises, for each of one or more latent variables, a respective training encoded vector; performing a forward function, comprising: for each latent variable, determining, for each latent embedding vector in the codebook, a respective similarity score between the latent embedding vector and the respective training encoded vector for the latent variable; and selecting, for each latent variable, a latent embedding vector that is nearest to the training encoded vector for the latent variable according to the respective similarity scores; performing a backward function, comprising: for each latent variable, generating a soft latent embedding vector by computing a respective weight for each of the latent embedding vectors based on the respective similarity scores for the latent embedding vectors and computing a weighted sum of the latent embedding vectors in accordance with the respective weights; and generating a training output for a training task from the selected latent embedding vectors generated by performing the forward function; and updating the current values of the encoder network parameters and the latent embedding vectors using a loss that is based on, for each training data item, the training output for the training task and the soft latent embedding vectors generated by performing the backward function.

2. The method of claim 1, wherein the training data items comprise images.

3. The method of any preceding claim, wherein the training data items comprise audio signals.

4. The method of any preceding claim, wherein the training data items comprise videos.

5. The method of any preceding claim, wherein the loss comprises a task loss for the training task that is based on the training outputs for the training data items, and wherein updating the current values of the encoder network parameters comprises: determining a first gradient of the task loss with respect to the selected latent embedding vectors; determining a second gradient of the soft latent embedding vectors with respect to the training encoder output; determining a third gradient of the task loss with respect to the current values of the encoder network parameters from the first gradient and the second gradient; and updating the current values of the network parameters using the third gradient.

6. The method of claim 5, wherein updating the latent embedding vectors comprises: determining a fourth gradient of the soft latent embedding vectors with respect to the latent embedding vectors; determining a fifth gradient of the task loss with respect to the latent embedding vectors from the first gradient and the fourth gradient; and updating the latent embedding vectors using the fifth gradient.

7. The method of any preceding claim, wherein the loss comprises an auxiliary loss for the latent embedding vectors, and wherein updating the latent embedding vectors comprises: determining a sixth gradient of the auxiliary loss with respect to the latent embedding vectors; and updating the latent embedding vectors using the sixth gradient.

8. The method of claim 7, wherein the auxiliary loss comprises a consistency loss that encourages consistency between the selected latent embedding vectors and the soft latent embedding vectors.

9. The method of claim 7 or claim 8, wherein the auxiliary loss comprises a coverage loss that encourages, for each latent embedding vector, a similarity between the latent embedding vector and a closest training embedding vector to the latent embedding vector according to the respective similarity scores.

10. The method of any preceding claim, wherein determining, for each latent embedding vector in the codebook, a respective similarity score between the latent embedding vector and the respective training encoded vector for the latent variable comprises computing a dot product between the latent embedding vector and the respective training encoded vector.

11. The method of any preceding claim, wherein computing a respective weight for each of the latent embedding vectors based on the respective similarity scores for the latent embedding vectors comprises: applying a softmax function with a specified temperature to the respective similarity scores for the latent embedding vectors.

12. The method of claim 11, wherein the specified temperature is a hyperparameter.

13. The method of claim 11, wherein the specified temperature is learned during the training of the encoder neural network.

14. The method of any preceding claim, wherein generating a training output for a training task from the selected latent embedding vectors generated by performing the forward function comprises: processing the selected latent embedding vectors generated by performing the forward function using a decoder neural network to generate a decoder output; and generating the training output for the training task from the decoder output.

15. The method of claim 14, further comprising: updating the decoder neural network using a task loss that is based on the training outputs for the training task.

16. The method of any preceding claim, wherein the training task comprises a reconstruction task.

17. The method of any preceding claim, wherein the training task comprises a contrastive learning task.

18. The method of any preceding claim, wherein the training task comprises a sequence generation task conditioned on the selected latent embedding vectors.

19. The method of any preceding claim, further comprising: after the training, compressing new data items using the encoder neural network and the codebook of latent embedding vectors.

20. The method of any preceding claim, further comprising: after the training, generating quantized representations of new data items using the encoder neural network and the codebook of latent embedding vectors and using the quantized representations for a downstream task.

21. The method of claim 20, wherein the downstream task is a task performed by a generative model.

22. The method of claim 21, wherein the generative model is a multi-modal generative model.

23. The method of any preceding claim, wherein selecting, for each latent variable, a latent embedding vector that is nearest to the training encoded vector for the latent variable according to the respective similarity scores comprises: selecting, for each latent variable, a latent embedding vector that is nearest to the training encoded vector for the latent variable according to the respective similarity scores for providing a discrete latent representation as a compressed representation of the training data item, wherein the discrete latent representation comprises, for each latent variable, an identifier of the nearest latent embedding vector to the training encoded vector for the latent variable.

24. A method performed by one or more computers, the method comprising: obtaining a new data item; processing the training data item through an encoder neural network to generate an encoder output that comprises, for each of one or more latent variables, a respective encoded vector; performing a forward function, comprising: for each latent variable, determining, for each latent embedding vector in a codebook of latent vectors, a respective similarity score between the latent embedding vector and the respective encoded vector for the latent variable; and selecting, for each latent variable, a latent embedding vector that is nearest to the encoded vector for the latent variable according to the respective similarity scores; generating a compressed representation of the new data item that identifies the selected latent embedding vectors for the latent variables, wherein the encoder neural network has been trained and the latent vectors in the codebook have been learned by performing the respective operations of any one of claims 1-23.

25. A method performed by one or more computers, the method comprising: obtaining a new data item; processing the new data item through an encoder neural network to generate an encoder output that comprises, for each of one or more latent variables, a respective encoded vector; for each latent variable, determining, for each latent embedding vector in a codebook of latent vectors, a respective similarity score between the latent embedding vector and the respective encoded vector for the latent variable; and selecting, for each latent variable, a latent embedding vector that is nearest to the encoded vector for the latent variable according to the respective similarity scores; and generating a compressed representation of the new data item that identifies the selected latent embedding vectors for the latent variables, wherein (a) an output of a forward function generated by selecting a single latent embedding vector from the codebook for a latent variable, and (b) an output of a backward function derived from a combination of multiple latent embedding vectors from the codebook for the latent variable.

26. The method of claim 24 or 25, further comprising: performing a downstream task using the compressed representation of the new data item.

27. The method of claim 26, wherein the downstream task is a task performed by a generative model.

28. The method of claim 27, wherein the generative model is a multi-modal generative model.

29. The method of any preceding claim, when dependent on claim 8, wherein the consistency loss measures a negative of a sum of or an average of respective first nearest similarity scores for each of the latent embedding vectors, wherein the respective first nearest similarity score is a similarity score between the training encoded vector for the latent embedding vector and the latent embedding vector that is nearest to the training encoded vector for the latent variable according to the respective similarity scores.

30. The method of claim 29, wherein the consistency loss is equal to or directly proportional to a constant minus the sum or the average of the respective first nearest similarity scores.

31. The method of any preceding claim, when dependent on claim 9, wherein the coverage loss measures a negative of a sum of or an average of respective second nearest similarity scores for each of the latent embedding vectors, wherein the respective second nearest similarity score is a similarity score between the latent embedding vector and a training encoded vector that is nearest to the latent embedding vector according to the respective similarity scores.

32. The method of claim 31, wherein the coverage loss is equal to or directly proportional to a constant minus the sum or the average of the respective second nearest similarity scores.

33. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective method of any one of claims 1-32.

34. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations of the respective method of any one of claims 1-32.

35. A system comprising: one or more storage devices storing a respective compressed representation for each of a set of new data items, wherein the respective compressed representation of each of the new data items has been generated by performing operations comprising: obtaining the new data item; processing the training data item through an encoder neural network in accordance to generate an encoder output that comprises, for each of one or more latent variables, a respective encoded vector; performing a forward function, comprising: for each latent variable, determining, for each latent embedding vector in a codebook of latent vectors, a respective similarity score between the latent embedding vector and the respective training encoded vector for the latent variable; and selecting, for each latent variable, a latent embedding vector that is nearest to the training encoded vector for the latent variable according to the respective similarity scores; and generating a compressed representation of the new data item that identifies the selected latent embedding vectors for the latent variables, wherein the encoder neural network has been trained and the latent vectors in the codebook have been learned by performing the respective operations of any one of claims 1-32.

36. The system of claim 35, further comprising one or more computers coupled to the one or more storage devices, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: performing a downstream task using the compressed representations of the new data items.

37. The system of claim 36, wherein the downstream task is a task performed by a generative model.

38. The system of claim 37, wherein the generative model is a multi-modal generative model.
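
As a non-limiting illustration of one possible reading of the consistency loss and coverage loss of claims 29 to 32, and of the compressed representation of claims 23 to 25, the following plain Python sketch computes each quantity from a matrix of similarity scores. It assumes dot-product similarity, an average rather than a sum, and a constant of 1.0; the function names, the constant, and the toy data are hypothetical and are not taken from the claims.

```python
def dot(u, v):
    """Dot product of two equal-length vectors (assumed similarity score)."""
    return sum(a * b for a, b in zip(u, v))


def score_matrix(encoded_vectors, codebook):
    """scores[i][k] is the similarity between encoded vector i and codebook vector k."""
    return [[dot(z, e) for e in codebook] for z in encoded_vectors]


def consistency_loss(scores, constant=1.0):
    """Constant minus the average, over encoded vectors, of the score of the
    nearest codebook vector (row-wise maximum)."""
    nearest = [max(row) for row in scores]
    return constant - sum(nearest) / len(nearest)


def coverage_loss(scores, constant=1.0):
    """Constant minus the average, over codebook vectors, of the score of the
    nearest encoded vector (column-wise maximum)."""
    nearest = [max(col) for col in zip(*scores)]
    return constant - sum(nearest) / len(nearest)


def compress(scores):
    """Compressed representation: for each encoded vector, the identifier
    (index) of its nearest codebook vector."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Hypothetical toy data: two encoded vectors and a three-entry codebook.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
encoded_vectors = [[1.0, 0.0], [0.0, 1.0]]

scores = score_matrix(encoded_vectors, codebook)
loss_consistency = consistency_loss(scores)
loss_coverage = coverage_loss(scores)
identifiers = compress(scores)
print(loss_consistency, loss_coverage, identifiers)
```

In this toy example every encoded vector coincides with a codebook vector, so the consistency term is zero, while the third codebook vector has no nearby encoded vector and contributes a nonzero coverage term.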