WO2025166250A1 - Learning visual representations using self-attention and denoising - Google Patents
Learning visual representations using self-attention and denoising
- Publication number
- WO2025166250A1 (PCT/US2025/014146)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- patch
- processing
- neural network
- representation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- This specification relates to processing data using machine learning models.
- Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input.
- Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer.
- Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
- This specification describes systems and methods, implemented as computer programs on one or more computers in one or more locations, for training image generation neural network systems.
- a neural network trained as described herein learns good image representations using self-attention, e.g. to predict image patches autoregressively, and by training on a denoising task.
- In one aspect there is described a computer-implemented method of training a neural network to represent an image.
- the method comprises, for a plurality of training images, processing each image patch of a plurality of image patches of the training image to generate a sequence of patch embeddings, one for each of the image patches.
- the sequence of patch embeddings is processed using an image representation neural network comprising a plurality of self-attention layers, to generate a sequence of latent variables.
- Processing an image patch involves processing patch embeddings for previous image patches in the sequence, using the image representation neural network, to generate a current latent variable for a current patch embedding representing a current patch. That is, in implementations, the image representation neural network processes the patch embeddings autoregressively.
- a corrupted patch embedding is generated that is an embedding of a corrupted version of the current image patch.
- the current latent variable and the corrupted patch embedding are processed using a denoising patch decoder neural network to generate pixel values for an output image.
- the output image can represent either the noise in the corrupted version of the current image patch, or a reconstruction of the current image patch.
- the denoising patch decoder neural network and the image representation neural network are trained using a training objective function that depends on a difference, either between the output image and the (previously added) noise in the version of the current image patch, or between the output image and the current image patch.
- a computer-implemented method of training a neural network to represent an image involves, for a plurality of training images, processing pixel values of each image patch of a plurality of image patches of the training image to generate a sequence of patch embeddings, one for each of the image patches.
- the sequence of patch embeddings is processed using an image representation neural network comprising a plurality of self-attention layers, to generate a sequence of latent variables by, for each image patch, processing patch embeddings for previous image patches in the sequence, using the image representation neural network, to generate a current latent variable for a current patch embedding representing a current image patch, and processing the current latent variable using a patch decoder neural network to generate pixel values for an output image that represents a reconstruction of the current image patch.
- the patch decoder neural network and the image representation neural network are then trained using a training objective function that depends on a difference between the output image and the current image patch.
- the techniques described in this specification may be employed in such a method.
- the current patch embedding, two-dimensional relative position encoding for the current image patch, the patch embeddings for the previous image patches in the sequence, and the two-dimensional relative position encodings for the previous image patches can be processed using the image representation neural network to generate the current latent variable for the current patch embedding.
- More particularly, relative rotational positional embeddings as described later may be applied in one or each of the self-attention layers of the image representation neural network.
- a computer-implemented method involves receiving an input sequence of embeddings, each embedding derived from, e.g., a respective region of a two-dimensional data item.
- the input sequence is processed using a neural network, e.g. a Transformer-based neural network, to generate a network output.
- the neural network comprises a plurality of layers including a plurality of attention layers.
- Each attention layer is configured to apply query-key-value attention over an attention layer input to generate an attention layer output.
- Applying the query-key-value attention involves applying a query transformation, a key transformation, and a value transformation to the attention layer input to obtain respective query, key, and value vectors. Applying the query-key-value attention also involves applying each query vector to each key vector to determine a respective weight for each value vector. The value vectors weighted by the respective weights are combined to determine the attention layer output.
- Applying one of the query vectors to one of the key vectors to determine a respective weight for one of the value vectors involves determining a query rotation matrix for the query vector that separately represents each dimension of a two dimensional position of a (corresponding) query region of the two dimensional data item and determining a key rotation matrix for the key vector that separately represents each dimension of a two dimensional position of a (corresponding) key region of the two dimensional data item.
- the query rotation matrix, the key rotation matrix, the query vector, and the key vector are multiplicatively combined to determine the respective weight value dependent on a relative distance in two dimensions in the training image between the position of the query region and the position of the key region.
- the method is performed on a combination of a host processor and one or more hardware accelerators.
- the network output comprises an output sequence, and the input sequence and the output sequence each comprises a sequence of tokens.
- the method can involve loading a set of weights for the neural network from the host processor into memory of the one or more hardware accelerators, and processing the input sequence using the plurality of attention layers, implemented on the one or more hardware accelerators, to auto-regressively generate the network output by generating each token in the output sequence conditioned on a current input sequence that includes at least some of the tokens that precede the particular token in the output sequence.
- the processing can involve maintaining a KV cache for some or all of the plurality of attention layers on the one or more hardware accelerators, the KV cache comprising stored keys and values of said some or all of the plurality of attention layers for use in applying the query-key-value attention.
- the processing can also involve performing the steps of determining the query rotation matrix, determining the key rotation matrix, and multiplicatively combining the query rotation matrix, the key rotation matrix, the query vector, and the key vector, to determine the respective weight value, on the one or more hardware accelerators.
- a computer-implemented method of performing an image processing task involves obtaining an image and providing pixels of the image to a trained image representation neural network, comprising a plurality of self-attention layers, to generate a representation of the image.
- the representation of the image is processed to perform the image processing task.
- the image representation neural network has been trained by determining a set of patch embeddings that represents a set of image patches that tile a training image, processing the patch embeddings to generate a sequence of latent variables that encodes information across multiple image patches, training the image representation neural network to use the latent variables to reconstruct corrupted image patches (i.e. versions of the image patches with noise added) using an image patch reconstruction objective, such as the above described training objective function.
- a computer-implemented method of performing an image processing task involves obtaining an image, and processing pixel values of each image patch of a plurality of image patches of the image to generate a sequence of patch embeddings, one for each of a corresponding sequence of image patches.
- the sequence of patch embeddings is processed using an image processing neural network, e.g. comprising a plurality of (self)-attention layers, to generate a neural network output that represents a result of performing an image processing task on the image. Examples of some image processing tasks that can be performed are described later.
- Determining the sequence of image patches involves dividing the image into a set of coarse blocks that tile the image, and dividing each coarse block into a set of image patches that tile the coarse block.
- the sequence of image patches is constructed by, for each coarse block in a raster scan sequence of the coarse blocks, determining the sequence of image patches within the coarse block as a raster scan sequence of the image patches within that block, and determining the sequence of image patches across the coarse blocks as a raster scan sequence of the coarse blocks within the image.
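- As an illustration of this nested ordering, the following sketch enumerates patch coordinates; the grid and coarse block sizes are illustrative assumptions rather than values fixed by the specification.

```python
# Minimal sketch: order the patches of an image that has been divided into coarse
# blocks, visiting the coarse blocks in raster order and, within each block,
# visiting the patches in raster order.
def nested_raster_order(patches_per_side: int, block_side: int) -> list[tuple[int, int]]:
    """Return (row, col) patch coordinates in nested raster-scan order."""
    order = []
    for block_row in range(0, patches_per_side, block_side):      # coarse blocks, raster order
        for block_col in range(0, patches_per_side, block_side):
            for r in range(block_row, block_row + block_side):    # patches within a block, raster order
                for c in range(block_col, block_col + block_side):
                    order.append((r, c))
    return order

# Example: a 4 x 4 grid of patches grouped into 2 x 2 coarse blocks (cf. FIG. 7A).
print(nested_raster_order(4, 2))
```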
- Implementations of the described systems and methods use a combination of an attention-based autoregressive model, such as a model with a decoder-only Transformer architecture, and a denoising patch decoder, to implement a generative model that is able to generate reconstructions of image patches.
- the described techniques can generate image patch reconstructions that include fine detail, and do not require labelled training data. Further, training the system to generate image patch reconstructions using a denoising objective as described results in the system internally learning how to generate good representations of images, e.g. images from the training distribution. These representations can be used in downstream image processing tasks to enable such tasks to be performed more effectively.
- the trained image representation neural network of the system can be used as a pre-trained image representation neural network in an image processing system, to perform a wide range of image processing tasks.
- it can be used to provide a pre-trained image representation neural network for a multimodal machine learning model such as a visual language model (VLM).
- the system is trained using a denoising diffusion model objective, which can produce good image representations.
- the quality of the image representations can be further improved by tailoring a noise schedule of the denoising diffusion model objective used to train the image representation neural network.
- the described techniques can use a relative positional encoding to encode the relative positions of the image patches within an image from which they were derived.
- the patch embeddings can be processed in conjunction with their relative positional encodings.
- the relative positions are encoded using rotation matrices that separately encode the x- and y- locations of a patch, and that are applied to (transformed) query and key vectors in the attention mechanism of one or more of, e.g. each of the attention neural network layers. This can facilitate the training and can improve the image representations generated by the trained image representation neural network.
- FIG. 1 shows an example system for training a neural network to represent an image.
- FIG. 2 is a flow diagram of an example process for training a neural network to represent an image.
- FIGs. 3A-D illustrate example noise schedules for training on a denoising task.
- FIG. 4 illustrates a particular example implementation of the system of FIG. 1.
- FIG. 5 shows an example of an image processing system.
- FIG. 6 is a flow diagram of an example process for using the image processing system of FIG. 5 to perform an image processing task.
- FIGs. 7A-B illustrate a nested raster scan ordering of image patch processing.
- FIGs. 8A-B illustrate the performance of trained image representation neural networks.
- FIG. 1 shows a system 100 for training a neural network to represent an image.
- the system of FIG. 1 can be implemented as computer programs on one or more computers in one or more locations.
- the system 100 comprises an image representation neural network 120.
- the image representation neural network 120 is configured to process an input sequence, specifically a sequence of patch embeddings 112, to generate an output sequence, specifically a sequence of latent variables 122.
- the image representation neural network 120 comprises a plurality of self-attention layers.
- the image representation neural network 120 may have a so-called decoder-only Transformer neural network architecture.
- a decoder-only Transformer neural network architecture typically comprises one or more Transformer layers or “blocks” each including a self-attention neural network layer followed by a feedforward neural network layer, and often one or more normalization layers.
- a causal neural network can be a neural network that includes one or more causally masked self-attention layers, i.e. so that for each position in the input sequence the self-attention neural network layer(s) see only previous positions in the input sequence (only the patch embeddings for previous image patches in the sequence).
- the system includes a local or remote data store of training images 110.
- the training images can be any images, e.g. images of the real world captured by a camera or other image sensor.
- the training images can be augmented, e.g. by randomly cropping, resizing, and/or flipping.
- Each training image is processed to generate a sequence of patch embeddings 112, one for each of a plurality of image patches of the training image.
- the image patches may be obtained by segmenting the training image into nonoverlapping patches, i.e. the image patches can tile the training image.
- the patch embedding for a patch may be generated by processing pixel values of the patch, e.g. using one or more patch embedding neural network layers, e.g. a linear neural network layer (not shown in FIG. 1).
- a 2D P × P pixel image patch with C color channels can be flattened into a P²C-dimensional vector that is processed by the patch embedding neural network layer(s) to generate the patch embedding.
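- A concrete sketch of this flattening and linear patch embedding, using NumPy; the patch size, channel count, and embedding width are illustrative assumptions.

```python
import numpy as np

P, C, D = 16, 3, 768                                # assumed patch size, channels, embedding width
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(P * P * C, D))     # weights of a linear patch-embedding layer
b = np.zeros(D)

def embed_patch(patch: np.ndarray) -> np.ndarray:
    """Flatten a P x P x C patch into a P^2*C vector and apply the linear embedding."""
    return patch.reshape(-1) @ W + b

patch = rng.random((P, P, C))                       # pixel values of one image patch
embedding = embed_patch(patch)
print(embedding.shape)                              # (768,)
```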
- the sequence of patch embeddings corresponds to a fixed order of image patches such as a raster scan order, a nested raster scan order (in which the image is divided into blocks within each of which a raster scan order is used), or a round robin order (the first patch of each respective block followed by the second patch of each block, and so forth).
- the sequence of patch embeddings corresponds to a random order of image patches.
- the corrupted patch embedding is an embedding of a corrupted version of the image patch.
- the corrupted version of the current image patch can be obtained by adding noise to the image patch.
- the corrupted patch embedding can be obtained by processing the corrupted version of the current image patch using the patch embedding neural network layer(s), e.g. the linear layer.
- the image representation neural network 120 is configured to process the sequence of patch embeddings 112 to generate the sequence of latent variables 122. In implementations the image representation neural network 120 processes the sequence of patch embeddings 112 in parallel to generate the sequence of latent variables 122. That is, in implementations, references to “previous” and “current” patch embeddings and latent variables are to positions in the respective sequences rather than to particular times.
- the system 100 also includes a denoising patch decoder neural network 130, configured to process a current latent variable 122 and a corrupted patch embedding 114, to generate pixel values for an output image 132.
- the output image 132, which has the same number of pixels as the current image patch, can represent either the noise in the corrupted version of the current image patch, or a reconstruction of a current image patch from which the corrupted patch embedding was generated.
- the denoising patch decoder neural network 130 may have any architecture e.g. comprising one or more attention neural network layers, e.g. it may comprise a single or multilayer Transformer architecture; or one or more fully connected layers, e.g. an MLP (Multilayer Perceptron) optionally with a residual connection; or one or more linear layers.
- Where the denoising patch decoder neural network 130 comprises a Transformer architecture, in general it does not use causal attention masking.
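- As one possible sketch of the MLP-with-residual option mentioned above, the following (assumed) decoder maps the current latent variable and the corrupted patch embedding to pixel values; the layer widths and the exact way the two inputs are combined are illustrative assumptions.

```python
import numpy as np

def mlp_denoising_decoder(latent, corrupted_embedding, params):
    """Assumed sketch: an MLP with a residual connection that maps the current
    latent variable plus the corrupted patch embedding to predicted pixel values."""
    h = np.concatenate([latent, corrupted_embedding])          # condition on both inputs
    x = np.tanh(h @ params["W1"] + params["b1"])               # hidden layer
    x = x + h @ params["W_res"]                                # residual path (assumed)
    return x @ params["W_out"] + params["b2"]                  # pixel values of the output patch

D, H, P, C = 768, 1024, 16, 3
rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(scale=0.02, size=(2 * D, H)), "b1": np.zeros(H),
    "W_res": rng.normal(scale=0.02, size=(2 * D, H)),
    "W_out": rng.normal(scale=0.02, size=(H, P * P * C)), "b2": np.zeros(P * P * C),
}
out = mlp_denoising_decoder(rng.random(D), rng.random(D), params)
print(out.shape)   # (P*P*C,), reshaped to a P x P x C output patch
```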
- the denoising patch decoder neural network 130 and the image representation neural network 120 are trained using a training engine 140, as described below.
- FIG. 2 is a flow diagram of an example process for training a neural network, such as the image representation neural network 120, to represent an image.
- the process of FIG. 2 may be implemented by one or more computers in one or more locations; for convenience the process is described with reference to FIG. 1.
- the process involves obtaining a plurality of training images 110 (step 200) and, for each training image, processing each of a plurality or sequence of image patches of the training image to generate a sequence of patch embeddings, one for each of the image patches (step 202).
- the sequence of patch embeddings is processed using the image representation neural network 120, to generate a corresponding sequence of latent variables (step 204).
- a start of sequence token may be prepended to the sequence of patch embeddings, e.g. comprising a predetermined or learnable embedding.
- the final patch embedding of the sequence of patch embeddings may be omitted.
- the one-position offset induced by the start-of-sequence token, when combined with causal attention masking, can ensure that the patch generated at a current position (see below) only receives information from the previous patches.
- a corrupted patch embedding 114 is generated from a corrupted, e.g. noisy, version of the image patch (step 206).
- the current latent variable 122 and the corrupted patch embedding 114 are processed using the denoising patch decoder neural network 130 to generate pixel values for the output image 132 (step 208).
- processing the current latent variable 122 and the corrupted patch embedding 114 can involve providing the current latent variable and the corrupted patch embedding as inputs to an attention neural network layer of the denoising patch decoder neural network 130 to generate the pixel values for the output image 132.
- the output image 132 can represent either the noise in the corrupted version of the current image patch, or a reconstruction of the current image patch. That is, the system can predict the previously added noise (that could then be subtracted from the corrupted version of the current image patch to recover the current image patch), or the current image patch.
- the denoising patch decoder neural network 130 and the image representation neural network 120 are trained using a training objective function that depends on a difference, either between the output image 132 and the (previously added) noise in the corrupted version of the current image patch, or between the output image 132 and the current image patch (step 210).
- training a neural network involves backpropagating gradients of an objective function to adjust the trainable parameters, e.g. weights, of the neural network.
- the training can use any appropriate gradient descent optimization algorithm, e.g. Adam, or another optimization algorithm.
- some implementations of the method can randomly sample a plurality of diffusion process time step values from a diffusion training distribution (for a particular current image patch). Each diffusion process time step value can then be used to generate a respective corrupted patch embedding, and a respective value of the training objective function can be determined, as described above, for each respective corrupted patch embedding. Then the denoising patch decoder neural network and the image representation neural network can be trained using each respective value of the training objective function, separately or combined.
- Once trained, part or all of the image representation neural network 120 can be used to generate a representation of an image.
- Where the image representation neural network 120 comprises a decoder-only Transformer, it has been found that the best performing layer is situated roughly in the middle of the Transformer stack.
- the output can be used to perform an image processing task, which may be referred to as a “downstream” task. This can, but need not, involve further training on the specific task.
- the system learns representations by learning to perform denoising.
- the denoising patch decoder neural network 130 performs a denoising task as the output image 132 represents either a reconstruction of the current image patch or noise to be subtracted from the corrupted version of the current image patch to reconstruct the image patch.
- this process is conditioned on previous uncorrupted image patches via the image representation neural network 120.
- a training objective function that depends on a difference between the output image and the current image patch, i.e. training the system so that the output image 132 predicts the current image patch rather than the noise, is advantageous.
- a noise schedule, γ(s), where s is a diffusion process time step value, can be used to determine a noise-defining value that defines a level of noise to add to the current image patch to generate the corrupted version of the current image patch.
- the noise-defining value, γ(s), can have an inverse relationship with the level of noise, i.e. so that the level of noise added decreases as the noise-defining value, γ(s), increases.
- the noise schedule, γ(s), defines that the level of noise added to the current image patch decreases (monotonically) as the diffusion process time step value s decreases.
- the variation can be linear, i.e. a linear noise schedule can be used; or nonlinear, e.g. a quadratic schedule, or a cosine schedule. As described below, some noise schedules can be particularly useful when training the image representation neural network 120 for use in a downstream task.
- the noise schedule can define that the level of noise added to the current image patch, for randomly sampled diffusion process time step values (from the diffusion training distribution), prioritizes adding a high level of noise over adding a relatively lower level of noise.
- that is, the noise-defining value, γ(s), is closer to 0 than to a maximum value that it can take, e.g. 1 (where the level of noise added to the current image patch, e.g. √(1 − γ(s)), increases as the noise-defining value, γ(s), decreases). This can result in better performance, after training, in downstream image processing tasks.
- the noise schedule can be treated as a hyperparameter and optimized for the particular task(s). This can be done by training with a range of different noise schedules, and then choosing a noise schedule that gives an optimum performance according to some metric of the task(s) such as the value of an objective function for the task(s). This can be facilitated by using a noise schedule parameterized by one or more hyperparameters and varying the hyperparameters to vary the noise schedule.
- the corrupted version of the current image patch can be obtained by adding noise to the current image patch at a noise level defined by the noise-defining value.
- the corrupted patch embedding can then be generated as an embedding of the corrupted version of the current image patch.
- the noise schedule may be, but need not be, defined by a closed-form expression.
- the noise-defining value, which defines a level of noise to add to the current image patch to generate the corrupted version of the current image patch, can be sampled directly from the noise schedule, γ(s).
- the noise schedule, γ(s), may be determined as, or be dependent upon, s^(a−1)·(1 − s)^(b−1).
- the hyperparameters a and b can be varied to vary the noise schedule and may be optimized according to the particular downstream image processing task or tasks to be performed using the image representation neural network 120 after training.
- FIG. 3 shows some example values of a and b, showing the Beta distributions in FIG. 3A and the corresponding noise schedules in FIG. 3B.
- FIG. 3C illustrates example noise schedules 300, 302 that sample mainly from high noise regions, as illustrated in FIG. 3D (which shows the probability p(γ)), by comparison with some other noise schedules 304, 306 (e.g. noise schedule 306 samples 80% from [0.1, 0.6]).
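- A small sketch of how a noise schedule of this form might be sampled in practice; the values of a and b below are illustrative assumptions chosen so that γ(s) stays in (0, 1) and low values of γ (i.e. high noise) are favoured.

```python
import numpy as np

def sample_noise_defining_value(a: float, b: float, rng) -> float:
    """Assumed sketch: draw a diffusion time step s ~ U(0, 1) and map it through a
    schedule gamma(s) proportional to s**(a-1) * (1-s)**(b-1)."""
    s = rng.uniform()                               # diffusion process time step value
    gamma = s ** (a - 1) * (1 - s) ** (b - 1)       # noise-defining value gamma(s)
    return float(gamma)

rng = np.random.default_rng(0)
# With a = 1, b = 3 this gives gamma(s) = (1 - s)**2, which concentrates sampled
# gamma values near 0, i.e. prioritizes adding a high level of noise.
gammas = [sample_noise_defining_value(1.0, 3.0, rng) for _ in range(5)]
print(gammas)
```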
- in the described system, representation learning is driven by a combination of autoregressive prediction and denoising, where the denoising is particularly beneficial with a noise schedule that prioritizes adding a high level of noise over adding a relatively lower level of noise.
- the corrupted version of the current image patch can be generated by randomly sampling a diffusion process time step value, s, from a diffusion training distribution.
- the diffusion training distribution can be, e.g. a uniform distribution over a range of possible time step values, e.g. [0,1].
- the noise schedule, γ(s), can be used to map the diffusion process time step value to the noise-defining value, γ(s).
- the noise schedule comprises a function, γ, that maps the diffusion process time step value, s, to the noise-defining value, γ(s).
- the function may be, but need not be, defined by a closed-form expression.
- the noise level can be defined by, or depend on, the noise-defining value as √(1 − γ(s)).
- the corrupted version of the current image patch, x^s, may be determined as x^s = √(γ(s))·x^0 + √(1 − γ(s))·ε, where x^0 is the uncorrupted image patch (dropping the subscript t that indexes the patch), and where, e.g., ε ∼ N(0, I) comprises Gaussian noise.
- An example training objective (loss) function that depends on the difference between the output image 132 and the current image patch can be determined as a squared error, e.g. ℒ = ‖x̂^0_t − x^0_t‖², where x^0_t is the current image patch, x̂^0_t denotes the output image 132, and the patch index t has been included for clarity. A consequence of the training process is that this is averaged over the noise added to the corrupted image patches according to the noise schedule.
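- Putting the corruption step and this patch-reconstruction loss together, a minimal sketch; it assumes the variance-preserving corruption given above and a plain squared-error loss, and the decoder call is a placeholder rather than the trained denoising patch decoder.

```python
import numpy as np

def corrupt_patch(x0: np.ndarray, gamma: float, rng) -> tuple[np.ndarray, np.ndarray]:
    """Assumed corruption: x^s = sqrt(gamma) * x^0 + sqrt(1 - gamma) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    xs = np.sqrt(gamma) * x0 + np.sqrt(1.0 - gamma) * eps
    return xs, eps

def patch_reconstruction_loss(x0: np.ndarray, x0_hat: np.ndarray) -> float:
    """Squared-error loss between the current image patch and the decoder's output image."""
    return float(np.mean((x0_hat - x0) ** 2))

rng = np.random.default_rng(0)
x0 = rng.random((16, 16, 3))                        # current image patch (illustrative size)
xs, eps = corrupt_patch(x0, gamma=0.2, rng=rng)     # low gamma -> a high level of noise
x0_hat = xs                                         # placeholder for the denoising decoder's prediction
print(patch_reconstruction_loss(x0, x0_hat))
```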
- in the training objective function that depends on the difference between the output image 132 and the (previously added) noise in the corrupted version of the current image patch, the current image patch may be substituted by the corrupted version of the current image patch, i.e. the corrupted patch x^s may be used in place of x^0.
- FIG. 4 illustrates a particular example implementation of the system of FIG. 1.
- the image representation neural network 120 comprises a causal decoder-only Transformer neural network as previously described. This takes in a set of d-dimensional inputs, one for each image patch embedding, and generates a corresponding set of d-dimensional outputs, one for each of the inputs.
- An ordering can be defined for the set of inputs by an order, such as a raster scan order, of the image patches; and/or the ordering of the inputs can be defined by including a position encoding for each input.
- the set of inputs can define a sequence of inputs.
- the denoising patch decoder neural network 130 processes the current latent variable output of the image representation neural network 120 for step t, f(x_{<t}), and the embedding of the noise corrupted patch for step t.
- the image representation neural network 120 does not encode the final patch, and the current latent variable representation of this patch is generated by processing the previous patches in the sequence, f(x_{<t}).
- the patch decoder processes the current latent variable representation of the previous patches in the sequence, and the corrupted patch embedding for the corrupted (“noise added”) final patch.
- a one position offset can conveniently be generated by prepending a start-of-sequence token to the image patch embeddings (not explicitly shown in FIG. 4).
- generating the current latent variable for the current patch embedding also involves encoding a two-dimensional position of each of the image patches in the training image to determine a two- dimensional relative position encoding that defines a relative position of the image patch in the training image. Then the current patch embedding, two-dimensional relative position encoding for the current image patch, the patch embeddings for the previous image patches in the sequence, and the two-dimensional relative position encodings for the previous image patches, can be processed using the image representation neural network 120 to generate the current latent variable for the current patch embedding.
- two-dimensional relative position encoding is described later.
- the image representation neural network 120 comprises a plurality of self-attention layers that are used to process the patch embeddings to generate the sequence of latent variables.
- this involves using one or more of, e.g. each of, the self-attention layers to apply query-key-value (QKV) attention over an attention layer input derived from the current patch embedding and previous patch embeddings (broadly, representing the previously generated latent variables), to generate an attention layer output for generating the current latent variable.
- QKV attention comprises applying a query transformation, a key transformation, and a value transformation to the attention layer input to obtain respective query, key, and value vectors, applying each query vector to each key vector (using a similarity function) to determine a respective weight for each value vector, and combining the value vectors weighted by the respective weights to determine the attention layer output.
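- A minimal NumPy sketch of this QKV computation, here with the causal masking described earlier for the image representation neural network; dimensions and weights are illustrative assumptions.

```python
import numpy as np

def causal_qkv_attention(x, Wq, Wk, Wv):
    """Single-head QKV self-attention over a sequence x of shape (T, d),
    with a causal mask so each position attends only to previous positions."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                     # query, key, value vectors
    scores = q @ k.T / np.sqrt(q.shape[-1])              # similarity of each query to each key
    T = x.shape[0]
    scores = np.where(np.tril(np.ones((T, T), bool)), scores, -np.inf)   # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ v                                   # weighted combination of values

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.random((T, d))                                   # e.g. patch embeddings
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
print(causal_qkv_attention(x, Wq, Wk, Wv).shape)         # (T, d)
```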
- one or more normalization layers, e.g. layer normalization layers, may be included; a self-attention layer may include a skip connection; and multi-headed self-attention may be used.
- the image representation neural network 120 comprises a Transformer neural network.
- a Transformer neural network can be characterized by having a succession of self-attention neural network layers.
- Each attention neural network layer has an attention layer input for each element of the input and is configured to apply an attention mechanism over the attention layer input to generate an attention layer output for each element of the input.
- the attention mechanism computes a similarity between a query and a set of key-value pairs.
- the query and the set of key-value pairs are determined from the attention layer input.
- an input embedding may be used to determine a query vector and a set of key-value vector pairs, that are used to generate an updated embedding comprising a weighted sum of the values, weighted by a similarity function of the query to each respective key.
- the similarity function may comprise, e.g., a dot product, cosine similarity, or other similarity measure; the query, keys, and values may all be vectors.
- the attention mechanism may be configured to apply each of a query transformation, e.g. defined by a matrix W_Q, a key transformation, e.g. defined by a matrix W_K, and a value transformation, e.g. defined by a matrix W_V, to the attention layer input to obtain the query, key, and value vectors.
- for the two-dimensional relative position encoding, applying one of the query vectors to one of the key vectors to determine a respective weight for one of the value vectors involves determining a query rotation matrix for the query vector that separately represents each dimension of a two dimensional position of a corresponding query image patch in the training image. This can also involve determining a key rotation matrix for the key vector that separately represents each dimension of a two dimensional position of a corresponding key image patch in the training image.
- the query rotation matrix, the key rotation matrix, the query vector, and the key vector may then be multiplicatively combined to determine the respective weight value.
- the weight value is dependent on a relative distance in two dimensions in the training image between the position of the corresponding query image patch and the position of the corresponding key image patch.
- the multiplicative combining involves multiplying the matrices together, using a transposed version of a matrix where appropriate.
- relative position is encoded by an angle of rotation
- the two dimensional position is encoded as two angles of rotation (each according to a respective parameter θ that is a non-zero constant that defines a speed of angular rotation).
- the two dimensional position of the corresponding query image patch may be defined as (m_x, m_y) and the two dimensional position of the corresponding key image patch may be defined as (n_x, n_y).
- the multiplicative combination of the query rotation matrix and the key rotation matrix may then depend on a combination of m_x − n_x and m_y − n_y, i.e. a distance apart of the corresponding query image patch and the corresponding key image patch in two dimensions.
- That is, the product of the query rotation matrix, the key rotation matrix, the query vector, and the key vector will represent terms having the general form (m − n)θ, where θ is an angle of rotation.
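- A minimal sketch of one way such two-dimensional rotary position encodings could be applied to the (transformed) query and key vectors, splitting the channels so that one half is rotated according to the x position of the patch and the other half according to the y position; the split and the rotation speeds θ are illustrative assumptions rather than details fixed by the specification.

```python
import numpy as np

def rotate_pairs(vec: np.ndarray, position: float, thetas: np.ndarray) -> np.ndarray:
    """Rotate consecutive channel pairs of vec by angles position * thetas."""
    out = vec.copy()
    angles = position * thetas
    cos, sin = np.cos(angles), np.sin(angles)
    out[0::2] = cos * vec[0::2] - sin * vec[1::2]
    out[1::2] = sin * vec[0::2] + cos * vec[1::2]
    return out

def apply_2d_rope(vec: np.ndarray, x_pos: int, y_pos: int) -> np.ndarray:
    """Assumed 2D scheme: the first half of the channels encodes the x position,
    the second half the y position, each via rotations."""
    half = vec.shape[0] // 2
    thetas = 1.0 / (10000.0 ** (np.arange(half // 2) / (half // 2)))  # rotation speeds per pair
    return np.concatenate([
        rotate_pairs(vec[:half], x_pos, thetas),
        rotate_pairs(vec[half:], y_pos, thetas),
    ])

# The inner product of a rotated query and a rotated key depends only on the
# relative offsets (m_x - n_x, m_y - n_y) between the query and key patches.
rng = np.random.default_rng(0)
q, k = rng.random(8), rng.random(8)
weight = apply_2d_rope(q, x_pos=3, y_pos=1) @ apply_2d_rope(k, x_pos=1, y_pos=1)
print(weight)
```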
- the above described techniques can be applied independently of the denoising patch decoder neural network 130.
- the current latent variable 122 can be processed using a patch decoder neural network to generate the pixel values for an output image that represents a reconstruction of the current image patch.
- the patch decoder neural network and the image representation neural network 120 can then be trained using a training objective function that depends on a difference between the output image and the current image patch, e.g. an MSE (Mean Square Error) loss.
- a patch decoder neural network may comprise, e.g., a linear neural network layer.
- relative rotational positional embeddings as described above may be applied in one or each of the self-attention layers of the image representation neural network 120.
- Where a self-attention layer implements QKV attention, this can involve determining a query rotation matrix for the query vector that separately represents each dimension of a two dimensional position of a corresponding query image patch in the training image, and determining a key rotation matrix for the key vector that separately represents each dimension of a two dimensional position of a corresponding key image patch in the training image.
- a weight value for the QKV self-attention can then be determined from a product of the query rotation matrix, the key rotation matrix, the query vector, and the key vector. The weight value is dependent on a relative distance in two dimensions in the training image between the position of the corresponding query image patch and the position of the corresponding key image patch.
- the above described 2D relative position encoding techniques can be used in any neural network that processes an input sequence, e.g. a Transformer-based neural network, such as a neural network that processes a sequence of input tokens to generate a sequence of output tokens. This can involve implementing the above described 2D relative position encoding techniques on one or more hardware accelerators.
- a hardware accelerator is specialized hardware that is used to accelerate neural network computations, such as a GPU (Graphics Processing Unit) or TPU (Tensor Processing Unit). Typically it includes hardware to perform matrix multiplications; it can include a set of one or more multiply accumulate units (MACs).
- the described techniques can be combined with maintaining, by a host processor, at least part of a KV cache on the hardware accelerator(s). Such a combination can help make best use of limited memory bandwidth.
- the KV cache can store keys and values of said some or all of the plurality of attention layers for use in applying the query-key-value attention.
- the processing can also involve performing the steps of determining the query rotation matrix, determining the key rotation matrix, and multiplicatively combining the query rotation matrix, the key rotation matrix, the query vector, and the key vector, to determine the respective weight value, on the one or more hardware accelerators.
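- A sketch of autoregressive decoding with a per-layer KV cache, as might be held in accelerator memory; the cache structure, shapes, and single-head attention are illustrative assumptions.

```python
import numpy as np

class KVCache:
    """Assumed per-layer cache of keys and values, appended to at each decoding step
    so earlier keys and values need not be recomputed."""
    def __init__(self, d: int):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def append(self, k: np.ndarray, v: np.ndarray):
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

def decode_step(x_t, cache: KVCache, Wq, Wk, Wv):
    """One autoregressive step: the new token attends over all cached keys/values."""
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    cache.append(k, v)
    scores = cache.keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache.values

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
cache = KVCache(d)
for _ in range(4):                       # generate four tokens autoregressively
    out = decode_step(rng.random(d), cache, Wq, Wk, Wv)
print(out.shape, cache.keys.shape)       # (8,) (4, 8)
```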
- the image representation neural network 120 can comprise a vision transformer (ViT) neural network architecture (backbone), as described in Dosovitskiy et al, arXiv:2010.11929, 2021.
- the vision transformer backbone can have a ViT-B/16, ViT-B/32, ViT-L/16, ViT-L/32 or ViT-H/14 architecture.
- an image patch can be 16 x 16 pixels.
- the vision transformer can be pretrained, e.g. using image augmentation such as random cropping, resizing, and horizontal flipping, e.g. following a similar approach to that described in He et al. arXiv:2111.06377, 2021 (there for masked autoencoders).
- FIG. 5 shows an example of an image processing system 500, configured to perform an image processing task.
- the image processing system of FIG. 5 can be implemented as computer programs on one or more computers in one or more locations.
- an image 502 is provided to part or all of the (pre-)trained image representation neural network 120, which processes pixels of the image 502 to generate a representation 504 of the image.
- the image representation 504 can be obtained from the trained image representation neural network 120.
- the image representation neural network 120 can be used to process a sequence of image patches representing the (complete) image and one or more of the sequence of latent variables generated by the image representation neural network 120 can be used as the image representation 504.
- a feature representation from one of the self-attention neural network layers of the image representation neural network 120 can be used for the image representation.
- the output from a self-attention neural network layer can be processed by a linear neural network layer, or by pooling (e.g. max-pooling, mean-pooling, or summing), or aggregated by another attention layer, or processed in some other way to obtain the image representation 504.
- the representation 504 can be obtained by combining the latent variables for each patch, e.g. using mean pooling.
- alternatively, a last token, i.e. a last latent variable of the sequence of latent variables, can be used as the image representation 504.
- this last latent variable acts as a summary of the preceding tokens, i.e. as a global image descriptor.
- the corresponding input token is the patch embedding of the final image patch, which is not used during the (pre-) training.
- the image representation can be derived from just an initial part of the image representation neural network 120, i.e. comprising the input to the image representation neural network and one or more subsequent self-attention layers.
- the image representation 504 can be derived from a self-attention layer in the middle third of a sequence of the self-attention layers between the input and output of the image representation neural network 120.
- the dimensionality of the image representation 504 can be adjusted as needed, e.g. by projection.
- the representation of the image may be obtained in any of these ways or in some other way from part or all of the trained image representation neural network. It will be appreciated that in such applications, i.e. when used in inference, the denoising patch decoder neural network 130 is not needed.
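- As a small illustration of obtaining the image representation 504 by mean pooling the per-patch latent variables and projecting to a desired dimensionality; all shapes are illustrative assumptions.

```python
import numpy as np

def image_representation(latents: np.ndarray, W_proj: np.ndarray) -> np.ndarray:
    """Mean-pool the (num_patches, d) latent variables into a single d-vector,
    then project it to the dimensionality expected by the downstream task."""
    pooled = latents.mean(axis=0)
    return pooled @ W_proj

rng = np.random.default_rng(0)
latents = rng.random((196, 768))                 # e.g. one latent per 16x16 patch of a 224x224 image
W_proj = rng.normal(scale=0.02, size=(768, 256)) # optional projection to 256 dimensions
print(image_representation(latents, W_proj).shape)   # (256,)
```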
- the representation 504 of the image is processed by a neural network head 506, i.e. one or more additional linear and/or non-linear neural network layers, to generate an output 508 that represents the result of a particular image processing task performed on the image 502.
- the neural network head 506 can generate the output 508 in any appropriate manner according to the task to be performed, according to known techniques.
- the output 508 can define a value for each of a plurality of categories for a classification task; the output 508 can define one or more bounding boxes for an object detection task; and so forth.
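- A minimal sketch of a neural network head for a classification task, here a single linear layer followed by a softmax; the representation width and class count are illustrative assumptions.

```python
import numpy as np

def linear_classification_head(representation, W, b):
    """Map the image representation 504 to per-class logits, then probabilities."""
    logits = representation @ W + b
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
d, num_classes = 768, 1000                 # e.g. ImageNet-style classification (assumed)
W = rng.normal(scale=0.02, size=(d, num_classes))
b = np.zeros(num_classes)
print(linear_classification_head(rng.random(d), W, b).argmax())
```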
- the image processing system 500 can be trained, i.e. fine-tuned, to perform the particular image processing task, e.g. using supervised training. This can, but need not, involve training, i.e. updating the trainable parameters, e.g. weights, of the part of the image representation neural network 120 used in the image processing system 500.
- Fine-tuning of the used part of the image representation neural network 120 can, but need not, use a causal attention mask as described above. Omitting causal masking enables attention between one image patch and any other image patch of an image, which can be useful as in general image data, representing pixels of an image, does not have an inherent ordering. Empirically it has been found that when using a causal attention mask it is advantageous to use the last latent variable (token) as the global image descriptor.
- training the image processing system 500 involves training the system using a supervised learning objective function appropriate to the task to be learned. Any suitable training data can be used according to the image processing task(s) to be performed. There are many public databases of suitable training data; as one example the VTAB (Visual Task Adaptation Benchmark, Zhai et al. 2020, arXiv:1910.04867) includes training data that can be used for many of the example tasks described later.
- FIG. 6 is a flow diagram of an example process for using the image processing system 500 to perform an image processing task.
- the process of FIG. 6 may be implemented by one or more computers in one or more locations.
- the process receives the image 502 (step 602). Pixels of the image 502 are processed by an image representation neural network that comprises part or all of the image representation neural network 120 (that has been pre-trained as previously described), to generate the representation 504 of the image (step 604). The representation 504 of the image is then processed to perform a particular image processing task (step 606).
- FIGS. 7A and 7B illustrate two examples of the construction of a sequence of image patches by, for each coarse block in a raster scan sequence of the coarse blocks, determining the sequence of image patches within each coarse block as a raster scan sequence of the image patches within the coarse block; and determining the sequence of image patches across the coarse blocks as a raster scan sequence of the coarse blocks within the image.
- the example of FIG. 7A illustrates a square coarse block size of 2 x 2 image patches;
- FIG. 7B illustrates a square coarse block size of 4 x 4 image patches.
- the image can be divided into coarse blocks each containing multiple image patches, i.e. each coarse block is subdivided into smaller image patches.
- the coarse blocks are visited in raster order and, within each coarse block, the image patches are visited in raster order.
- FIG. 8A illustrates the performance of the image representation neural network 120 trained as described above (with causal masking), and using a single linear layer classification head to process a mean pooling image representation 504. The figure illustrates performance on an ImageNet classification task, with percentage accuracy on the y-axis and training length (in arbitrary units) on the x-axis.
- Curve 700A shows the performance of the image representation neural network 120 when trained using a denoising patch decoder neural network 130, as described.
- Curve 702A shows the performance of the image representation neural network 120 when trained using a patch decoder neural network with an MSE loss.
- FIG. 8B shows corresponding respective curves 700B, 702B, with percentage accuracy on the y-axis and patch size (in pixels) on the x-axis. Whilst accuracy decreases for both curves as patch size increases (and hence computational cost decreases), the reduction in accuracy is less for the image representation neural network 120 trained using the denoising patch decoder neural network 130.
- an image may be any still or moving image, i.e. the image may be part of a video, in 2D or 3D, and may be a monochrome, color or hyperspectral image, i.e. comprising monochrome or color pixels.
- an “image” includes a point cloud e.g. from a LIDAR system, and a “pixel” includes a point of the point cloud.
- the image, during or after (pre-)training and/or fine tuning may, e.g. have been captured by a camera or other image sensor from the real world, and objects in the image or video may comprise physical objects, represented by the image or video.
- performing an image processing task using the (pre-)trained image representation neural network 120 involves providing an image to an image representation neural network that has been trained by performing a method as described above. Pixels of the image are processed using the trained image representation neural network (i.e. using part or all of the previously described image representation neural network) to generate a representation of the image, in particular of features of the image. The representation of the image is then processed, by an image processing system that includes the image representation neural network, to perform the image processing task. As described above, the image processing system may, e.g., comprise the image representation neural network and a neural network head configured to generate an output appropriate to the image processing task(s) to be performed. In general, techniques for processing image representations to perform a wide range of image processing tasks are well known. A video processing task can be performed by processing the representations of multiple images representing a sequence of video frames.
- the image processing task may comprise one or more of: image classification, image segmentation, e.g. semantic segmentation or instance segmentation; depth prediction; keypoint prediction; pose estimation, e.g. 3D pose estimation; surface normal estimation, e.g. by determining a vector in 2D or 3D; or object detection, including object tracking.
- any prediction task may be performed, e.g. by determining a scalar or vector value for the image or for regions of the image, e.g. using a head neural network.
- Other types of task may be performed in the same way, e.g. a curvature or other shape estimation task, a task that involves identifying aspects of an image using color, a counting task that involves counting objects or objects of a particular type, a task that involves understanding spatial relationships between objects or object attributes, and so forth.
- the image representation 504 may be processed to assign a categorical value defining a category for the image or regions of the image, or a value representing a probability that the image or a region of the image belongs to a particular category.
- the category may represent an object or type of object or (for video) an action or type of action.
- the values may identify a type or category of object and in an instance segmentation task the values may (also) distinguish between different instances of the same category of object. More generally a value can distinguish between an object (or action) and image background, and the image representation can be used, e.g., to perform an object localization, detection, orientation detection, or tracking task, e.g. for gesture recognition, i.e. recognition of gestures that are performed by entities depicted in a video.
- the image representation 504 may be processed to determine data representing one or more bounding boxes or other location data for an object or type of object in the processed image or, for moving images, location data for an action or type of action in the processed image.
- location data may comprise, e.g. data defining coordinates of a bounding box or region for one or more objects represented in the image.
- a bounding box or region may be defined in two, three or more dimensions (time counting as a dimension).
- Such location data may contribute to higher level tasks, e.g. to object tracking across video frames.
- object segmentation may be used to segment medical images, to label patches of an input medical image in accordance with whether they show a region of a human or animal body in which a particular medical condition is present.
- An object segmentation may be used to provide an input to a control system of a mechanical agent, such as a robot or vehicle operating in a real-world environment.
- the detected objects may be, e.g., obstacles or paths upon which the mechanical agent can move, and may be used by the control system e.g. to make decisions on how to accomplish a task performed by the robot, or for controlling the direction or speed of movement of the agent.
- the image representation 504 can be used to perform an agent control task, where the patch embeddings, representing an observation of an environment, e.g. a real world environment, are processed to generate an output that defines an action to be performed by the agent, in particular to perform a task.
- the agent can be a mechanical agent, e.g. a robot or vehicle, controlled to perform actions in the real world environment, in response to the observations, to perform the task, e.g. to manipulate an object or to navigate in the environment.
- the agent can be, e.g., a real-world or simulated robot; as some other examples the agent can be a control system to control one or more machines in an industrial facility.
- the image representation 504 may be used to provide an input to a control system of a mechanical agent, such as a robot or vehicle operating in a real-world environment.
- the control system may provide an output that controls the operation of the robot or vehicle to perform a task such as manipulating an object in the environment or moving in the environment.
- the image representation may be used to, e.g. to detect objects for the robot to manipulate, or obstacles or paths upon which the mechanical agent can move, and may be used by the control system e.g. to make decisions on how to accomplish a task performed by the robot, or for controlling the direction or speed of movement of the agent.
- the image representation 504 may be processed to obtain a scalar patch value representing an estimated depth value for a region of the image or for an object in the image such as a nearest object in the image, e.g. a distance of the region or object in a depth or z-direction from an x-y image plane or camera viewpoint.
- the image representation may define a depth distribution, e.g. a probability distribution over discrete depth value buckets.
- the depth values can define a depth map for the image.
- the patch embeddings may identify keypoints in the image.
- the image representation 504 may be mapped to a 3D surface, e.g. of a human body or face. Or the image representation may be used to estimate a 6D pose representing translation and orientation components of an object in the image, e.g. in quaternion form. The image representation can be used to estimate the pose of one or more objects in the image.
- the image representation 504 may be used to obtain a vector in, e.g., three dimensions defining a surface normal.
- the image representation can be used to obtain a surface normal map for one or more objects in the image, e.g. for use in an augmented reality or other application.
- the image representation 504 may be processed to generate text represented in the image, e.g. by generating tokens representing characters, wordpieces, or words, for text represented in the image, i.e. to perform an optical character recognition (OCR) task.
- the image representations 504 of multiple frames may be processed to provide output data that predicts one or more properties of or events that involve objects in the image. For example a position and/or velocity of one or more objects in the video may be determined, or a collision (of an agent) with an object in the video or a collision between two objects in the video may be predicted.
- Multimodal models, e.g. visual language models
- Some tasks that can be performed using the trained image representation neural network include multimodal tasks that involve processing a combination of an image and natural or computer language to generate an output that performs the image processing task.
- the output can be generated from a neural network that processes both the image representation 504 and token embeddings representing text or spoken words in a natural or computer language.
- the image processing task to be performed may be specified by the words in the natural or computer language.
- An example involves generating an output that requires reasoning, e.g. spatio-temporal reasoning, to respond to a natural language query input, e.g. relating to a moving image (video).
- a query may require predictive reasoning (“what will happen next”), counterfactual reasoning (“what would happen in a different circumstance”), explanatory reasoning (“why did something happen”), or causal reasoning generally.
- the image representation can be used to detect objects in the video frames and provide information relating to the detected objects in response to a query.
- the query may comprise, for example, a request for a prediction of a future event or state relating to one or more of the objects (e.g. “will objects X and Y collide?”), or a request for conditional or counterfactual information relating to one or more of the objects (e.g.
- the output may, for example, be in the form of a yes/no answer, or may define a probability distribution over a set of possible answers; or the response may define the location of an object. Such systems can be used to predict whether or not two objects will collide, or how this may be avoided.
- the output may be used e.g. to provide a warning and/or to control motion of one or more of the objects.
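- A small sketch of how a distribution over a set of possible answers might be obtained from answer logits and used to trigger a warning; the candidate answers, the yes/no framing and the threshold are assumptions made for illustration and are not part of this specification.

```python
import numpy as np

# Hypothetical: a head produces one logit per candidate answer ("yes", "no"),
# which is turned into a probability distribution and then a decision.
candidate_answers = ["yes", "no"]

def answer_distribution(answer_logits: np.ndarray) -> dict:
    logits = answer_logits - answer_logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return dict(zip(candidate_answers, probs))

def collision_warning(answer_logits: np.ndarray, threshold: float = 0.5) -> bool:
    """Raise a warning if the probability of "yes, the objects will collide" exceeds threshold."""
    return answer_distribution(answer_logits)["yes"] > threshold

print(collision_warning(np.array([2.0, -1.0])))  # True
```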
- Such a computer implemented method can be performed by a multimodal machine learning model such as a visual language model (VLM).
- a multimodal machine learning model has a multimodal input configured to receive a first multimodal input of a first modality and a second multimodal input of a second, different modality.
- a “modality” refers to a type of data, and thus a multimodal machine learning model is one that can process multiple different types of data.
- the first multimodal input may comprise, e.g. a text input to receive a sequence of text or audio data representing values of an audio waveform, e.g. instantaneous amplitude data or time-frequency domain data; or non-image sensor data representing an environment with which an agent controlled by the multimodal machine learning model interacts.
- the second multimodal input may receive an image or video.
- the multimodal machine learning model may be configured to jointly process encoded versions of the first and second multimodal inputs.
- a computer-implemented method of performing an image processing task can involve obtaining an image and a string of words, i.e. text, in a natural or computer language, and processing pixels of the image using an image representation neural network trained as described above, to generate a representation of the image. That is, the image representation neural network for the image processing task can be part or all of the image representation neural network 120 trained by the above-described method.
- the text can be processed using a text encoder to generate a sequence of token embeddings representing the text.
- the representation of the image and the sequence of token embeddings can be processed using a visual language model to perform an image processing task defined by the text.
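- Read purely as a sketch, the pipeline described in the preceding paragraphs can be wired up as below; the component names and signatures are hypothetical, and the lambdas are toy stand-ins for the trained image representation neural network, text encoder and visual language model.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class MultimodalPipeline:
    image_encoder: Callable[[np.ndarray], np.ndarray]  # pixels -> image representation
    text_encoder: Callable[[str], np.ndarray]           # text -> token embeddings
    vlm: Callable[[np.ndarray, np.ndarray], str]        # (representation, embeddings) -> answer

    def run(self, image: np.ndarray, text: str) -> str:
        image_representation = self.image_encoder(image)
        token_embeddings = self.text_encoder(text)
        return self.vlm(image_representation, token_embeddings)

# Toy stand-ins so the sketch runs end to end.
pipeline = MultimodalPipeline(
    image_encoder=lambda img: img.reshape(-1, 16).mean(axis=-1),  # fake patch features
    text_encoder=lambda txt: np.zeros((len(txt.split()), 8)),     # fake token embeddings
    vlm=lambda rep, toks: "a placeholder caption",
)
print(pipeline.run(np.zeros((64, 64)), "Generate a caption"))
```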
- the text may be received, e.g., as a series of encoded characters, e.g. UTF-8 encoded characters; such “characters” can include Chinese and other similar characters, as well as logograms, syllabograms and the like.
- the multimodal machine learning model can include a text encoder that processes a sequence of text to represent the text as a series of text tokens from a vocabulary of text tokens, e.g. that each represent words, wordpieces or characters in a natural or computer language.
- the computer language may be any formal language used to communicate with a computer, e.g. a markup language, or a command or configuration language, or a data exchange language such as JSON, or a programming language.
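- A toy word-level tokenizer illustrating the mapping from text to token ids; real text encoders typically use learned wordpiece or byte-level vocabularies, and the vocabulary below is invented for the example.

```python
# Hypothetical vocabulary; unknown words map to a dedicated <unk> token.
vocab = {"<unk>": 0, "detect": 1, "a": 2, "person": 3, "generate": 4, "caption": 5}

def tokenize(text: str) -> list:
    """Maps whitespace-separated words to integer token ids."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("Generate a caption"))  # [4, 2, 5]
```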
- the model output may comprise any form of output appropriate to the machine learning task performed by the multimodal machine learning model.
- the model output may comprise text in a natural or computer language that defines a result of the task, e.g. for tasks such as image captioning, visual question answering, or object detection or instance segmentation.
- the model output may comprise data defining an image, video or audio object, e.g. in a generative task; or the model output may comprise non-textual action selection data for selecting an action to be performed by an agent controlled by the model.
- the model output may also or instead define an intermediate step to be performed during the task, e.g. a call to a software API for a software tool that is used when performing the task; the multimodal input may then receive an output from the software tool that is used to generate a final model output that performs the task.
- Some example multimodal machine learning models with which the techniques described herein may be used include: Flamingo (Alayrac et al., arXiv:2204.14198); ALIGN (Jia et al., arXiv:2102.05918); PaLI (Chen et al., arXiv:2209.06794) and PaLI-X (Chen et al., arXiv:2305.18565); and Gemini (“Gemini: A Family of Highly Capable Multimodal Models”, Gemini Team, Google). These references also include indications of training datasets that may be used to train the respective models.
- a few examples of training datasets that can be used to fine-tune the system 500 for a particular visual task or tasks are: the Visual Genome dataset for Visual Question Answering (Krishna et al., arXiv:1602.07332); Objects365 (Shao et al., “Objects365: A large-scale, high-quality dataset for object detection”, IEEE/CVF International Conference on Computer Vision, pages 8430-8439); Open Images V4 (Kuznetsova et al., arXiv:1811.00982); and the SBU dataset (Ordonez et al., “Im2Text: Describing Images Using 1 Million Captioned Photographs”, NeurIPS, 2011).
- a particular task that is to be performed by the multimodal machine learning model can be described by part or all of the sequence of text in the multimodal input to the model.
- a prompt might specify “Generate a caption”, “Generate a description”, “Answer the following question: [about the image or video]”, “What is the object in the top left corner?”, or “Detect a person”.
- a prompt may define “Take the knife out of the drawer”, or “Q: What action should the robot take to take the knife out of the drawer?”.
- such a prompt may give one or more examples of a task to be performed.
- a multimodal machine learning model can be trained on multiple natural and/or computer languages and the prompt may then specify a language to use.
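- The sketch below shows how a prompt might assemble a task description, optional few-shot examples and a language instruction into a single text input; the exact prompt format a given model expects depends on how it was trained, so the strings below are illustrative only.

```python
def build_prompt(task: str, examples=None, language: str = "English") -> str:
    """Assembles a text prompt specifying the task, optional worked examples,
    and the language in which to answer. Illustrative format only."""
    parts = []
    if examples:
        parts += [f"Example: {inp} -> {out}" for inp, out in examples]
    parts.append(f"Answer in {language}. {task}")
    return "\n".join(parts)

print(build_prompt("What is the object in the top left corner?",
                   examples=[("Detect a person", "person at (12, 40)")]))
```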
- the task may comprise a still or moving image generation task.
- a training data item for such a system may comprise an image or video and a sequence of text that describes the image or video.
- the model output may comprise data for an image or video, e.g. image data defining values for pixels of a still or moving image, and the sequence of text in the multimodal input to the model may describe or characterize the image or video to be generated.
- the term “configured” is used in relation to computing systems and environments, as well as computer program components.
- a computing system or environment is considered “configured” to perform specific operations or actions when it possesses the necessary software, firmware, hardware, or a combination thereof, enabling it to carry out those operations or actions during operation.
- configuring a system might involve installing a software library with specific algorithms, updating firmware with new instructions for handling data, or adding a hardware component for enhanced processing capabilities.
- one or more computer programs are “configured” to perform particular operations or actions when they contain instructions that, upon execution by a computing device or hardware, cause the device to perform those intended operations or actions.
- the embodiments and functional operations described in this specification can be implemented in various forms, including digital electronic circuitry, software, firmware, computer hardware (encompassing the disclosed structures and their structural equivalents), or any combination thereof.
- the subject matter can be realized as one or more computer programs, essentially modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by or to control the operation of a computing device or hardware.
- the storage medium can be a storage device such as a hard drive or solid-state drive (SSD), a random or serial access memory device, or a combination of these.
- the program instructions can be encoded on a transmitted signal, such as a machine-generated electrical, optical, or electromagnetic signal, designed to carry information for transmission to a receiving device or system for execution by a computing device or hardware.
- implementations may leverage emerging technologies like quantum computing or neuromorphic computing for specific applications, and may be deployed in distributed or cloud-based environments where components reside on different machines or within a cloud infrastructure.
- computing device or hardware refers to the physical components involved in data processing and encompasses all types of devices and machines used for this purpose. Examples include processors or processing units, computers, multiple processors or computers working together, graphics processing units (GPUs), tensor processing units (TPUs), and specialized processing hardware such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs).
- a computing device or hardware may also include code that creates an execution environment for computer programs. This code can take the form of processor firmware, a protocol stack, a database management system, an operating system, or a combination of these elements.
- Embodiments may particularly benefit from utilizing the parallel processing capabilities of GPUs in a General-Purpose computing on Graphics Processing Units (GPGPU) context, where code specifically designed for GPU execution, often called kernels or shaders, is employed.
- TPUs excel at running optimized tensor operations crucial for many machine learning algorithms.
- the system can achieve significant speedups and efficiency gains for tasks involving artificial intelligence and machine learning, particularly in areas such as computer vision, natural language processing, and robotics.
- a computer program also referred to as software, an application, a module, a script, code, or simply a program, can be written in any programming language, including compiled or interpreted languages, and declarative or procedural languages. It can be deployed in various forms, such as a standalone program, a module, a component, a subroutine, or any other unit suitable for use within a computing environment.
- a program may or may not correspond to a single file in a file system and can be stored in various ways. This includes being embedded within a file containing other programs or data (e.g., scripts within a markup language document), residing in a dedicated file, or distributed across multiple coordinated files (e.g., files storing modules, subprograms, or code segments).
- a computer program can be executed on a single computer or across multiple computers, whether located at a single site or distributed across multiple sites and interconnected through a data communication network.
- the specific implementation of the computer programs may involve a combination of traditional programming languages and specialized languages or libraries designed for GPU programming or TPU utilization, depending on the chosen hardware platform and desired performance characteristics.
- engine broadly refers to a software-based system, subsystem, or process designed to perform one or more specific functions.
- An engine is typically implemented as one or more software modules or components installed on one or more computers, which can be located at a single site or distributed across multiple locations. In some instances, one or more dedicated computers may be used for a particular engine, while in other cases, multiple engines may operate concurrently on the same one or more computers. Examples of engine functions within the context of AI and machine learning could include data preprocessing and cleaning, feature engineering and extraction, model training and optimization, inference and prediction generation, and post-processing of results. The specific design and implementation of engines will depend on the overall architecture and the distribution of computational tasks across various hardware components, including CPUs, GPUs, TPUs, and other specialized processors.
- This approach offers significant advantages for computationally intensive tasks often found in AI and machine learning applications, such as matrix multiplications, convolutions, and other operations that exhibit a high degree of parallelism.
- Computers capable of executing a computer program can be based on general-purpose microprocessors, special-purpose microprocessors, or a combination of both. They can also utilize any other type of central processing unit (CPU). Additionally, graphics processing units (GPUs), tensor processing units (TPUs), and other machine learning accelerators can be employed to enhance performance, particularly for tasks involving artificial intelligence and machine learning. These accelerators often work in conjunction with CPUs, handling specialized computations while the CPU manages overall system operations and other tasks. Typically, a CPU receives instructions and data from read-only memory (ROM), random access memory (RAM), or both.
- the essential elements of a computer include a CPU for executing instructions and one or more memory devices for storing instructions and data.
- the choice of processing units and memory will depend on factors like the complexity of the AI model, the volume of data being processed, and the desired performance and latency requirements.
- Embodiments can be implemented on a wide range of computing platforms, from small embedded devices with limited resources to large-scale data center systems with high-performance computing capabilities.
- the system may include storage devices like hard drives, SSDs, or flash memory for persistent data storage.
- Computer-readable media suitable for storing computer program instructions and data encompass all forms of non-volatile memory, media, and memory devices. Examples include semiconductor memory devices such as read-only memory (ROM), solid-state drives (SSDs), and flash memory devices; hard disk drives (HDDs); and optical media such as CDs, DVDs, and Blu-ray discs.
- the specific type of computer-readable media used will depend on factors such as the size of the data, access speed requirements, cost considerations, and the desired level of portability or permanence.
- embodiments of the subject matter described in this specification can be implemented on a computing device equipped with a display device, such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display, for presenting information to the user.
- Input can be provided by the user through various means, including a keyboard, touchscreens, voice commands, gesture recognition, or other input modalities, depending on the specific device and application.
- Additional input methods can include acoustic, speech, or tactile input, while feedback to the user can take the form of visual, auditory, or tactile feedback.
- computers can interact with users by exchanging documents with a user's device or application. This can involve sending web content or data in response to requests or sending and receiving text messages or other forms of messages through mobile devices or messaging platforms. The selection of input and output modalities will depend on the specific application and the desired form of user interaction.
- Machine learning models can be implemented and deployed using machine learning frameworks, such as TensorFlow or JAX. These frameworks offer comprehensive tools and libraries that facilitate the development, training, and deployment of machine learning models.
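- For illustration, a minimal JAX training step for a toy linear model is shown below; it demonstrates the kind of define-loss, compute-gradients, update-parameters loop such frameworks provide, and is unrelated to the specific networks described in this specification.

```python
import jax
import jax.numpy as jnp

def predict(params, x):
    w, b = params
    return x @ w + b

def loss_fn(params, x, y):
    # Mean squared error between predictions and targets.
    return jnp.mean((predict(params, x) - y) ** 2)

@jax.jit
def train_step(params, x, y, lr=0.1):
    grads = jax.grad(loss_fn)(params, x, y)
    # Gradient-descent update applied leaf-wise over the parameter tree.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = (jnp.zeros((3, 1)), jnp.zeros((1,)))
x = jnp.ones((8, 3))
y = jnp.ones((8, 1))
params = train_step(params, x, y)
print(loss_fn(params, x, y))
```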
- Embodiments of the subject matter described in this specification can be implemented within a computing system comprising one or more components, depending on the specific application and requirements. These may include a back-end component, such as a back-end server or cloud-based infrastructure; an optional middleware component, such as a middleware server or application programming interface (API), to facilitate communication and data exchange; and a front-end component, such as a client device with a user interface, a web browser, or an app, through which a user can interact with the implemented subject matter.
- the described functionality could be implemented solely on a client device (e.g., for on- device machine learning) or deployed as a combination of front-end and back-end components for more complex applications.
- These components, when present, can be interconnected using any form or medium of digital data communication, such as a communication network like a local area network (LAN) or a wide area network (WAN), including the Internet.
- the specific system architecture and choice of components will depend on factors such as the scale of the application, the need for real-time processing, data security requirements, and the desired user experience.
- the computing system can include clients and servers that may be geographically separated and interact through a communication network.
- the specific type of network such as a local area network (LAN), a wide area network (WAN), or the Internet, will depend on the reach and scale of the application.
- the client-server relationship is established through computer programs running on the respective computers and designed to communicate with each other using appropriate protocols. These protocols may include HTTP, TCP/IP, or other specialized protocols depending on the nature of the data being exchanged and the security requirements of the system.
- a server transmits data or instructions to a user's device, such as a computer, smartphone, or tablet, acting as a client.
- the client device can then process the received information, display results to the user, and potentially send data or feedback back to the server for further processing or storage. This allows for dynamic interactions between the user and the system, enabling a wide range of applications and functionalities.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Image Processing (AREA)
- Image Analysis (AREA)
Abstract
Systems, methods, and computer program code are described for training image generation neural network systems to generate good image representations using self-attention. Embodiments of the systems predict image patch embeddings autoregressively and train on a denoising task, in particular using a diffusion model objective. Image processing systems that use the generated image representations are also described.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463548772P | 2024-02-01 | 2024-02-01 | |
| US63/548,772 | 2024-02-01 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025166250A1 true WO2025166250A1 (fr) | 2025-08-07 |
Family
ID=94869852
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/014146 Pending WO2025166250A1 (fr) | 2024-02-01 | 2025-01-31 | Apprentissage de représentations visuelles à l'aide d'une auto-attention et d'un débruitage |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025166250A1 (fr) |
-
2025
- 2025-01-31 WO PCT/US2025/014146 patent/WO2025166250A1/fr active Pending
Non-Patent Citations (17)
| Title |
|---|
| "Gemini: A Family of Highly Capable Multimodal Models, Gemini Team", GOOGLE |
| ALAYRAC ET AL., ARXIV:2204.14198 |
| CHEN ET AL., ARXIV:2209.06794 |
| CHEN ET AL., ARXIV:2305.18565 |
| DOSOVITSKIY ET AL., ARXIV:2010.11929, 2021 |
| EL-NOUBY ALAAELDIN ET AL: "Scalable Pre-training of Large Autoregressive Image Models", 16 January 2024 (2024-01-16), XP093268873, Retrieved from the Internet <URL:https://arxiv.org/pdf/2401.08541> [retrieved on 20250410] * |
| HE ET AL., ARXIV:2111.06377, 2021 |
| JIA ET AL., ARXIV:2102.05918 |
| KAY ET AL., ARXIV:1705.06950 |
| KRISHNA ET AL., ARXIV:1602.07332 |
| KUZNETSOVA ET AL., ARXIV:1811.00982 |
| ORDONEZ ET AL.: "Im2Text: Describing Images Using 1 Million Captioned Photographs", NEURIPS, 2011 |
| SHAO ET AL.: "Objects365: A large-scale, high-quality dataset for object detection", IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, pages 8430 - 8439 |
| SHARMA ET AL.: "Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning", ACL, 2018 |
| SU JIANLIN ET AL: "RoFormer: Enhanced transformer with Rotary Position Embedding", NEUROCOMPUTING, vol. 568, 8 November 2023 (2023-11-08), AMSTERDAM, NL, pages 127063, XP093269676, ISSN: 0925-2312, Retrieved from the Internet <URL:https://arxiv.org/pdf/2104.09864> [retrieved on 20250410], DOI: 10.1016/j.neucom.2023.127063 * |
| WEI CHEN ET AL: "Diffusion Models as Masked Autoencoders", 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 1 October 2023 (2023-10-01), pages 16238 - 16248, XP034513743, [retrieved on 20240115], DOI: 10.1109/ICCV51070.2023.01492 * |
| ZHAI ET AL.: "Visual Task Adaptation Benchmark", ARXIV: 1910.04867, 2020 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11341364B2 (en) | Using simulation and domain adaptation for robotic control | |
| CN110622169A (zh) | 用于视频中的动作识别的神经网络系统 | |
| CN110062934A (zh) | 使用神经网络确定图像中的结构和运动 | |
| JP7512416B2 (ja) | 少数ショット類似性決定および分類のためのクロストランスフォーマニューラルネットワークシステム | |
| JP7618844B2 (ja) | 時空間上のアテンションを使用したビデオシーケンスからの物体表現の教師なし学習 | |
| US20230053618A1 (en) | Recurrent unit for generating or processing a sequence of images | |
| WO2025104314A1 (fr) | Apprentissage de réseaux neuronaux de traitement d'image à l'aide d'un alignement intermodal | |
| JP2023535502A (ja) | 半教師付きキーポイントベースモデル | |
| US20250259068A1 (en) | Training object discovery neural networks and feature representation neural networks using self-supervised learning | |
| WO2023086198A1 (fr) | Robustifier la nouvelle synthèse de vue du modèle de champ de radiance neuronal (nerf) pour les données éparses | |
| Kim et al. | Acceleration of actor-critic deep reinforcement learning for visual grasping in clutter by state representation learning based on disentanglement of a raw input image | |
| Kim et al. | Acceleration of actor-critic deep reinforcement learning for visual grasping by state representation learning based on a preprocessed input image | |
| WO2025166250A1 (fr) | Apprentissage de représentations visuelles à l'aide d'une auto-attention et d'un débruitage | |
| CN120858378A (zh) | 使用点轨迹将图像动画化 | |
| KR20250009301A (ko) | 기하학적 표현 학습을 위한 스케치 변환 방법 및 장치 | |
| CN116868203A (zh) | 利用自适应梯度裁剪的神经网络 | |
| US20250348980A1 (en) | Processing multi-modal inputs using denoising neural networks | |
| WO2025245138A1 (fr) | Distillation en plusieurs étapes de modèles de diffusion par mise en correspondance de moments | |
| US20250363783A1 (en) | Training image generation neural networks to generate semantically similar images | |
| US20250245873A1 (en) | Generative interactive environments | |
| US20250336101A1 (en) | Systems and methods for generating multimodal data using a single-tower architecture with a data generation subsystem | |
| WO2025255143A1 (fr) | Génération de réseaux neuronaux de tâches à partir de réseaux neuronaux plus grands à l'aide de projections | |
| WO2025109182A1 (fr) | Représentations d'apprentissage et génération de nouvelles vues d'éléments de données à l'aide de modèles de diffusion | |
| WO2025104203A1 (fr) | Classification d'éléments de données multimédia à l'aide d'un modèle de réseau de neurones artificiels génératif | |
| WO2025235892A1 (fr) | Reconstruction de scène à l'aide de modèles génératifs multi-vues |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 25709512 Country of ref document: EP Kind code of ref document: A1 |