WO2024236063A1 - Performing image processing tasks based on demonstration examples - Google Patents
Performing image processing tasks based on demonstration examples
- Publication number
- WO2024236063A1 (PCT/EP2024/063423)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- representation
- training
- task
- contextualized
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
Definitions
- FIG. 1 shows an example of an image processing system 100 that may be implemented as one or more computer programs on one or more computers in one or more locations, for performing an image processing task.
- the image processing system 100 is configured to perform a particular image processing task of a range of possible image processing tasks that the image processing system can perform, using a set of task examples 102, or task “prompts”, that demonstrate the particular image processing task. That is, the system is enabled to perform the particular image processing task by a form of in-context learning.
- the particular image processing task is performed on a task image 104 comprising a plurality of task image pixels.
- the particular image processing task that is performed can be a dense image processing task or an image-level task.
- image encoder 110 may comprise a residual neural network encoder (ResNet), e.g. as described in He et al., "Deep residual learning for image recognition", Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- the image processing system 100 also comprises a memory, M, 120 configured to store keys and corresponding values. These can be obtained from the set of task examples 102, as described later.
- Each task example can comprise an example image comprising a plurality of example image pixels and a set of labels, e.g. pixel labels, each pixel label having a label value.
- the memory 120 stores image feature vector keys and corresponding local label values. More particularly the memory 120 stores example image feature vectors and corresponding local label values for the images in the set of task examples.
- the example image feature vectors define keys, i.e. the image feature keys, for accessing the corresponding local label values.
- the image processing system 100 includes a query-key-value attention mechanism 130.
- the query-key-value attention mechanism 130 is configured to apply a query vector for a spatial location in the task image 104, derived from the spatial representation 112 of the task image obtained from the image encoder 110, to keys and values 122 in the memory 120, i.e. to the image feature vector keys and corresponding local label values.
- the query result comprises a predicted local label value for the spatial location, from which a set of pixel labels 132 for pixels of the task image can be obtained, either directly, or indirectly e.g. by upsampling.
- a query-key-value attention mechanism can be a mechanism that maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors.
- the output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function (e.g. a similarity function), such as a dot product or scaled dot product, of the query with the corresponding key.
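As an illustration of the readout just described, the following sketch (in Python/NumPy) compares a query to a set of keys with a scaled dot product, normalizes the scores with a softmax, and returns the corresponding weighted sum of values. The array shapes, function name, and temperature parameter are illustrative assumptions, not part of the patent text.

```python
import numpy as np

def attention_readout(query, keys, values, temperature=1.0):
    """Query-key-value readout: softmax-weighted sum of values.

    query: (D,) query vector; keys: (M, D) stored keys; values: (M, V) stored values.
    """
    # Compatibility (scaled dot product) of the query with every key.
    scores = keys @ query / (temperature * np.sqrt(query.shape[-1]))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over the M memory entries
    return weights @ values                  # weighted sum of the stored values
```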
- FIG. 2 is a flow diagram of an example process for configuring an image processing system including a trained image encoder, e.g. the image processing system 100 of FIG. 1, to perform a particular image processing task of a range of possible image processing tasks.
- the process of FIG. 2 may be implemented by one or more computers in one or more locations.
- the process receives a set of task examples 102.
- the set of task examples can demonstrate a particular image processing task, i.e. each task example can be an example of the particular image processing task.
- Each task example includes an example image comprising a plurality of example image pixels and a set of labels, e.g. a set of pixel labels in which each pixel label has a label value. That is, in some implementations there can be a pixel label for each pixel of the example image.
- the label value can be, e.g., a categorical label in which the label value corresponds to one of a predetermined set of categories, e.g. for classifying the pixel, e.g. for a task such as semantic segmentation.
- the label value can be a (continuous) scalar value, e.g. an image depth value; or a vector value, e.g. a vector defining a surface normal.
- the example image in the task example is processed using the (trained) image encoder 110 to generate the spatial representation 112 of the example image (step 202).
- this spatial representation comprises an example image feature vector for each of the plurality of spatial locations in the example image.
- each spatial location may comprise a patch of the image, such as a 16x16 pixel patch or a patch of some other size or shape.
- the example image feature vectors and corresponding local label values can then be stored in the memory 120 (step 206).
- the example image feature vectors define keys for accessing the corresponding local label values. In some implementations just a subset of these may be stored in the memory 120.
- the image encoder 110 can process an example image x_i that has H x W patches, to generate a spatially flattened feature map comprising HW example image feature vectors, k_i^j, each of D dimensions, where j indexes a patch and i indexes the example image.
- the feature vectors k_i^j may be normalized, e.g. L2 normalized.
- the local label value l_i^j for a patch may be obtained by averaging the pixel labels y_i^j of the patch.
- the pixel labels may comprise, e.g., a one-hot vector of class labels, or a scalar, say denoting depth in the example of monocular depth estimation.
- the memory 120 may then store a set of keys and values obtained from the training examples. There is no requirement for all the example images to be the same size, nor for all the patches to be the same size. Nonetheless it can be useful to store the same number of features for each example image.
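A minimal sketch of populating the memory from a set of task examples follows. It assumes the encoder returns an (H*W, D) array of patch features, that pixel labels are one-hot class vectors, and that patches are 16x16 pixels (as in one of the examples above); these assumptions and all names are illustrative.

```python
import numpy as np

def build_memory(example_images, example_pixel_labels, encoder, patch=16):
    """Store (key, value) pairs: L2-normalised patch features as keys,
    patch-averaged pixel labels as values."""
    keys, values = [], []
    for image, labels in zip(example_images, example_pixel_labels):
        feats = encoder(image)                                        # (H*W, D) patch features
        feats = feats / np.linalg.norm(feats, axis=-1, keepdims=True) # optional L2 normalisation
        h, w, c = labels.shape                                        # (H*patch, W*patch, C) one-hot labels
        H, W = h // patch, w // patch
        patched = labels.reshape(H, patch, W, patch, c)
        local_labels = patched.mean(axis=(1, 3)).reshape(H * W, c)    # average pixel labels per patch
        keys.append(feats)
        values.append(local_labels)
    return np.concatenate(keys), np.concatenate(values)
```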
- a proper subset, i.e., less than all, of the example image feature vectors and corresponding local label values is sampled, and the sampled subset is stored in the memory 120.
- Subsampling of this type can be used depending on the number of training examples and the available size of memory 120, e.g. depending on whether the data from the training examples will fit in the memory 120.
- the sampling may comprise random sampling e.g. according to a sampling distribution.
- the sampling distribution is determined so as to prioritize sampling the most salient patches within an example image. For example where pixel values comprise class labels (categorical values) this may be done by increasing the relative likelihood of sampling patches containing class labels that appear less frequently in the example image, e.g. by up-weighting these patches. Also or instead this may be done by decreasing the relative likelihood of sampling patches that do not contain any valid pixel labels.
- a class score for each patch j may be determined as 1/K_c if class c is in patch j and 0 otherwise, where K_c is a count of how many patches class c appears in. Then a predetermined number of features per image, n_features_per_image, may be stored in the memory 120, by choosing the features k_i^j that have the lowest final scores, where the final score for a patch combines its class score with a value sampled from a uniform distribution in the range [0,1] and with a constant, e.g. 10^6, that deprioritizes patches with no class label.
- the predetermined number of features per image, n_features_per_image, may be determined as memory_size / (N x num_augmentations), where memory_size is a size of the memory 120 in terms of numbers of features, N is a number of images in the set of training examples, and num_augmentations is a number of augmentations applied to each example image (where image augmentation is used to increase the number of example images).
- an ordered list may be created by randomly ordering the patches in an example image, placing all patches with no valid class labels after any patch with a valid class label, and then taking the first n_features_per_image patches from the list.
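The exact scoring rule for saliency-weighted subsampling is only partly recoverable from the text above; the sketch below shows one plausible variant consistent with it, favouring patches whose classes are rare in the image and heavily deprioritising patches with no valid label. All names and constants are assumptions.

```python
import numpy as np

def sample_patch_indices(local_labels, n_keep, rng=None, no_label_penalty=1e6):
    """Pick n_keep patch indices from (P, C) patch-level label vectors."""
    rng = rng or np.random.default_rng(0)
    present = local_labels > 0                                  # which classes appear in each patch
    patches_per_class = np.maximum(present.sum(axis=0), 1)      # K_c: patches containing class c
    class_score = (present / patches_per_class).sum(axis=1)     # higher for patches with rare classes
    u = rng.uniform(size=local_labels.shape[0])                 # random tie-breaking in [0, 1]
    final_score = u / np.maximum(class_score, 1e-12)            # low final score = high priority
    final_score[~present.any(axis=1)] += no_label_penalty       # push label-free patches to the end
    return np.argsort(final_score)[:n_keep]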
- FIG. 3 is a flow diagram of an example process for using the image processing system 100 of FIG. 1 to perform a particular image processing task.
- the process of FIG. 3 may be implemented by one or more computers in one or more locations.
- a task image 104 comprising a plurality of task image pixels is received for processing according to the particular image processing task defined by the set of task examples 102.
- the task image 104 is processed using the trained image encoder 110 to generate the spatial representation 112 of the task image, in particular comprising a task image feature vector for each of the plurality of spatial locations in the task image (step 302).
- each spatial location may correspond to a region or “patch” comprising multiple pixels, although in some implementations each spatial location may correspond to a single pixel.
- the process then performs a set of steps, labelled below as steps 306 to 310, for each of the plurality of spatial locations in the task image (step 304).
- a query vector for the spatial location is obtained from the task image feature vector corresponding to the spatial location (step 306).
- the task image feature vector can be used as the query vector.
- the query vector is then applied to the image feature vector keys and local label values in the memory 120, using the query-key-value attention mechanism, to obtain a predicted local label value for the spatial location as the query result (step 308).
- the predicted local label values for the spatial locations may be collected together, e.g. concatenated, for processing.
- a set of pixel labels for the task image pixels is then determined from the predicted local label value for each of the spatial locations in the task image (step 310).
- the particular image processing task is performed by obtaining the set of pixel labels for the task image pixels. That is, in implementations the set of pixel labels for the task image pixels constitutes a result of the particular image processing task.
- determining the set of pixel labels for the task image pixels from the predicted local label value may comprise, for each of the spatial locations in the task image, upsampling the predicted local label value to a resolution of the task image.
- the set of pixel labels for the task image pixels can be the predicted local label value for each of the spatial locations in the task image.
- applying the query vector to the image feature vector keys and local label values in the memory 120, using the query-key-value attention mechanism, to obtain a predicted local label value for the spatial location involves determining a value of a similarity function of the query vector and each of a plurality of the image feature vector keys.
- the similarity function, in particular a value of the similarity function, can define a similarity metric that measures a similarity between a query vector and an image feature vector key.
- the predicted local label value e.g. for a spatial location, can then be determined as the query result from a weighted sum of the local label values in the memory, each weighted by the value of the similarity function for a corresponding image feature vector key.
- This process defines a cross-attention operation, CA(·).
- the method applies cross-attention between the query vector and the keys and values stored in the memory.
- the memory entries may be indexed j = 1, ..., M, where M is the number of key-value pairs stored in the memory.
- the cross-attention, CA(·), is limited to a set of k nearest neighbors for each query. That is, applying the query vector to the image feature vector keys and local label values in the memory to obtain the predicted local label value for the spatial location can involve identifying a set of k nearest neighbor image feature vector keys to the query vector according to a similarity metric.
- the similarity metric can measure a similarity between a query vector and an image feature vector key; it may be, but need not be, the above described compatibility (similarity) function.
- the set of k nearest neighbors is determined using an approximate nearest neighbor search. As one example, the open source ScaNN library (Guo et al., arXiv: 1908.10396, 2020) can be used to return the top-k nearest neighbors for a query and scores for the similarity that can be used as attention logits.
- example values are k ≈ 100. Implementations of the system, particularly with smaller memory sizes, can be fast enough for real time video processing.
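The following sketch ties the pieces of the inference path together: encode the task image, cross-attend each patch feature (as a query) to the k most similar memory keys, and upsample the predicted patch labels to pixel resolution. A brute-force top-k search stands in for ScaNN or another approximate nearest-neighbour library, nearest-neighbour upsampling stands in for whatever upsampling an implementation uses, and k, the temperature, and the patch size are illustrative.

```python
import numpy as np

def predict_pixel_labels(task_image, encoder, mem_keys, mem_values,
                         grid_hw, patch=16, k=100, temperature=0.1):
    """Dense prediction via k-NN-restricted cross-attention over the memory."""
    H, W = grid_hw
    queries = encoder(task_image)                                 # (H*W, D) patch features
    queries = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    sims = queries @ mem_keys.T                                   # (H*W, M); mem_keys assumed L2-normalised
    k = min(k, mem_keys.shape[0])
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]           # k nearest keys per query
    preds = []
    for q in range(queries.shape[0]):
        logits = sims[q, topk[q]] / temperature                   # similarity scores as attention logits
        w = np.exp(logits - logits.max()); w /= w.sum()
        preds.append(w @ mem_values[topk[q]])                     # weighted sum of local label values
    preds = np.stack(preds).reshape(H, W, -1)
    # Upsample patch-level predictions to pixel resolution (nearest-neighbour for brevity).
    return np.repeat(np.repeat(preds, patch, axis=0), patch, axis=1)
```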
- the image processing task comprises a dense image processing (prediction) task, i.e. an image processing (prediction) task that involves assigning a value to each region or pixel of an image.
- the particular image processing task to be performed is defined by the set of task examples 102.
- the image processing system 100 is configured to perform a particular image processing task by storing the example image feature vectors and corresponding local label values from the task examples in the memory 120, without any adjustment of learnable parameters, e.g. neural network or other weights, of the image processing system. That is, image processing system 100 can perform in-context learning. Provided that the particular image processing task is not changed there is no need to update the information stored in the memory 120.
- the image processing system 100 can also perform image-level processing tasks, such as classification. This can be done by combining, e.g. spatially averaging or otherwise pooling, the example image feature vectors for each of the plurality of spatial locations in an example image. For each of the example images the combined, e.g. spatially averaged, example image feature vectors, and corresponding image-level label values, can be stored in the memory 120 as respective keys and values.
- a query vector for the task image can similarly be obtained by combining, e.g. pooling, the task image feature vectors for the spatial locations in the task image, and the query vector can then be used to query memory 120, using a query-key-value attention mechanism as described above, to determine an image-level label for the task image.
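For the image-level variant, a short sketch is given below: the patch features are pooled into a single query that attends over per-image (pooled feature, image label) pairs in the memory. Mean pooling, the temperature, and the function name are assumptions.

```python
import numpy as np

def classify_image(task_image, encoder, image_keys, image_labels, temperature=0.1):
    """Image-level prediction by attending over pooled per-image keys and labels."""
    q = encoder(task_image).mean(axis=0)          # spatially average the patch features
    q = q / np.linalg.norm(q)
    logits = image_keys @ q / temperature         # similarity to each stored pooled key
    w = np.exp(logits - logits.max()); w /= w.sum()
    return w @ image_labels                       # weighted combination of image-level labels
```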
- a value update neural network 116, e.g. a multilayer perceptron (MLP), is configured to process the average feature vector 114, e.g. the key, to generate a corresponding value vector 118 for the training image.
- Memory 140 stores the corresponding value vector 118 with the average feature vector 114, and the average feature vector 114 defines a key for the corresponding value vector 118.
- the image encoder 110 can process a training image x_i that has H x W patches to generate a spatially flattened feature map comprising HW training image feature vectors, h_i^j, each of D dimensions, where j indexes a spatial location or patch.
- a D-dimensional spatially averaged feature vector for the training image 114, i.e. the key, may be determined as k_i = (1/(HW)) Σ_j h_i^j, where the subscript i indexes the training image.
- the corresponding D-dimensional value vector 118 for the training image x_i may be determined as v_i = p_θ(k_i), where p_θ(·) denotes the value update neural network 116.
- the value update neural network 116 comprises learnable parameters, e.g. weights, θ.
- θ denotes a set of learnable parameters including, inter alia, those of the image encoder 110 and those of the value update neural network 116.
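Populating the pre-training memory can then be sketched as below, with the encoder and value update network treated as opaque callables. Names and shapes are illustrative.

```python
import numpy as np

def populate_pretraining_memory(training_images, encoder, value_mlp):
    """Per training image, store the spatially averaged features k_i as the key
    and v_i = p_theta(k_i), the value update network output, as the value."""
    keys, values = [], []
    for x in training_images:
        h = encoder(x)                      # (H*W, D) patch features
        k_i = h.mean(axis=0)                # spatial average -> key (D,)
        v_i = value_mlp(k_i)                # value update network output -> value
        keys.append(k_i); values.append(v_i)
    return np.stack(keys), np.stack(values)
```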
- the image processing system 500 includes a memory access subsystem 150 that is configured to process a query vector 142 for a spatial location in a training image, in particular by applying a query-key-value attention mechanism as previously described to the keys and values in memory 140, to obtain a predicted value vector 152 for the spatial location.
- the predicted value vector 152 for the spatial location is used as the query result.
- q = e(x) is the spatial representation of a training image x obtained from the image encoder 110 (comprising a feature vector for each of the plurality of spatial locations in the image), and q^i denotes the training image feature vector for the spatial location or patch i of the training image. That is, the training image feature vector q^i is used as a query vector 142 to cross-attend to the keys 114 and values 118 stored in memory 140 to compute the query result.
- Some implementations of the image processing system 500 also include a linear neural network layer 160 that is used to process a combination of the predicted value vector 152 for a spatial location and of the query vector for the spatial location to generate a local contextualized representation 162 for the spatial location.
- the image processing system 500 is configured to use the local contextualized representation for each of the plurality of spatial locations in a training image to obtain a contextualized representation 172 for the training image from (comprising) the local contextualized representation for each of the spatial locations. That is, the image processing system 500 is configured to determine a contextualized representation 172 for the training image from (comprising) the local contextualized representations.
- Some implementations of the image processing system 500 include an attention neural network 170 for determining an attention-pooled representation of the training image from the local contextualized representations 162 for each of the plurality of spatial locations, for obtaining the contextualized representation 172 comprising the local contextualized representation for each of the spatial locations.
- determining the attention-pooled representation can involve determining a soft mask for each of the plurality of spatial locations, i.e. in which a value of the soft mask for each of the spatial locations defines a respective weight for the spatial location.
- the local contextualized representations for each of the plurality of spatial locations can be combined weighted by the respective weight for the spatial location defined by the soft mask to obtain the attention-pooled representation.
- Determining a soft mask for each of the plurality of spatial locations can involve processing the local contextualized representation for each of the plurality of spatial locations using the attention neural network 170 to generate the respective weight for each of the spatial locations.
- Using the attention neural network 170 to attend over the local contextualized representations 162 in this way can help the system to learn to select the most distinctive part of each training image for the contextualized representation 172 during training.
- An attention-pooled representation, c_i, of training image i, i.e. the (D-dimensional) contextualized representation 172, can be determined in this way.
- the attention-pooled representation c_i can be used instead of the value vector v_i, and the value head, i.e. the value update neural network 116, can be omitted.
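A sketch of the contextualization step during pre-training follows: each patch query cross-attends to the per-image keys and values in the memory, the query and retrieved value are combined and passed through the linear layer to give local contextualized representations, which are then attention-pooled with a learned soft mask. The additive combination with weight beta and the single-vector attention head are assumptions beyond what the text specifies (which only requires a weighted combination and a soft mask); all parameter names are illustrative.

```python
import numpy as np

def contextualize_and_pool(patch_queries, mem_keys, mem_values,
                           linear_w, attn_w, beta=1.0, temperature=0.1):
    """Local contextualization followed by soft-mask attention pooling."""
    sims = patch_queries @ mem_keys.T / temperature               # (P, M) attention logits
    w = np.exp(sims - sims.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    retrieved = w @ mem_values                                    # predicted value vector per patch
    local_ctx = (patch_queries + beta * retrieved) @ linear_w     # (P, D') local contextualized reps
    mask_logits = local_ctx @ attn_w                              # (P,) soft-mask logits
    mask = np.exp(mask_logits - mask_logits.max()); mask /= mask.sum()
    return mask @ local_ctx                                       # attention-pooled representation
```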
- FIG. 6 is a flow diagram of an example process for training an image encoder, e.g. the image encoder 110 of FIG. 1.
- the process of FIG. 6 may be implemented by one or more computers in one or more locations.
- the image encoder is incorporated in an image processing system such as image processing system 500, and the image processing system including the image encoder is trained so that the image encoder can be used in the image processing system 100 to perform a particular image processing task defined by the set of task examples 102.
- This training of the image encoder may be referred to as pre-training, to distinguish from the incontext learning used to perform a particular image processing task. For convenience the training will be described with reference to image processing system 500.
- the training involves populating the memory 140 with keys 114 and values 118 as described above, obtained from a first plurality of training images in a set of training examples, and using the stored keys and values to process a second plurality of training images in the set of training examples, to train the image processing system 500.
- the first and second pluralities of images can overlap, i.e. one of the second plurality of training images may be used to train the system as described below, and afterwards used as one of the first plurality of training images, to update the keys and values stored in the memory.
- a batch of training images in the set of training examples can be used to train the image processing system 500 based on keys and values computed from previous batches of training images.
- the training encourages the representation of an image generated by the image encoder 110 to be expressed as a combination of representations of similar example images. This facilitates use of the image encoder 110 in the image processing system 100, in which an image processing task is performed by combining local label values of similar example images.
- the process involves obtaining a set of training examples (step 600), each training example comprising at least a training image.
- the training examples can include an image label (described later).
- the training image is processed using the image encoder 110 to generate a spatial representation of the training image comprising a training image feature vector for each of the plurality of spatial locations in the training image (step 602).
- the training image feature vectors are (spatially) averaged over the plurality of spatial locations in the training image, e.g. as described above, to determine the average feature vector 114 for the training image (step 604).
- the average feature vector 114 is processed using the value update neural network 116 to generate the corresponding value vector 118 for the training image (step 606).
- the average feature vector 114 and the corresponding value vector 118 are stored in memory 140 (step 608), the average feature vector 114 defining a key for the corresponding value vector 118.
- Training the image processing system 500 can involve, for each of the second plurality of training images, processing the training image using the image encoder 110 to generate the spatial representation of the training image comprising a training image feature vector for each of the plurality of spatial locations in the training image (step 610).
- the training can also involve, for each of the plurality of spatial locations in the training image, obtaining the query vector 142 for the spatial location from the training image feature vector, q^i, corresponding to the spatial location (step 612).
- the query vector 142 is applied to the average feature vector keys 114 and value vectors 118 in memory 140, using the memory access subsystem 150, to obtain the predicted value vector 152 for the spatial location as the query result v^i (step 614).
- the local contextualized representation 162 is determined for each of the plurality of spatial locations in the training image (step 616).
- the local contextualized representation 162 for a spatial location in the training image comprises a combination of the query vector for the spatial location, i.e. q^i, and the predicted value vector 152 for the spatial location, v^i.
- the contextualized representation 172 for the training image can comprise, i.e. be derived from, the local contextualized representation 162 for each of the plurality of spatial locations in the training image.
- determining the local contextualized representation 162 for a spatial location in the training image involves determining a weighted combination of the query vector 142 for the spatial location, i.e. q^i, and the predicted value vector 152 for the spatial location, v^i.
- the weighted combination can be processed using the linear neural network layer 160 to generate the local contextualized representation 162.
- the local contextualized representation 162, c^i, can be determined as c^i = f_θ(q^i + β v^i), i.e. a weighted combination of q^i and v^i processed by the linear layer, where f_θ(·) denotes the linear neural network layer 160, β is a weight of the combination, and the subscript θ denotes that the previously described set of learned parameters includes the learnable parameters of the linear neural network layer(s) 160.
- the contextualized representation 172 for a training image comprises an attention-pooled representation, c t , of a training image, determined as described above, although a concatenation of the local contextualized representations for each of the plurality of spatial locations in the training image can be used instead.
- the image processing system 500 in particular the image encoder 110, is trained using the contextualized representations 172 for each of the second plurality of the training images (step 618). In general the training also involves updating the learnable parameters of the value update neural network 116.
- training the image processing system can involve backpropagating gradients of an objective function, e.g. a loss function, to update learnable parameters, e.g. weights, of the system.
- the training can update the set of learnable parameters denoted θ, i.e. the parameters of the image encoder 110 and the value update neural network 116, and of the linear neural network layer 160 and/or attention neural network 170 where implemented.
- the updating can use any appropriate gradient descent optimization algorithm, e.g. Adam or another optimization algorithm.
- the image processing system is trained using, i.e. based on the value of, a self-supervised learning objective function dependent on the contextualized representation 172 for a training image.
- the self-supervised learning objective function may comprise a contrastive learning objective function of the type used in SimCLR (Chen et al., arXiv:2002.05709, 2020), such as a temperature-scaled cross-entropy loss.
- the self-supervised learning objective function may comprise an objective function of the type used in BYOL (Grill et al., arXiv: 2006.07733, 2020), that involves minimizing a least squares objective.
- the training can use a self-predictive method that involves minimizing a self-supervised learning objective function representing a difference between two neural network outputs from two different augmented views of the same input.
- a target neural network can define a learning target, where some or all of the parameters of the target neural network are obtained from a moving average, e.g. an exponential moving average, of parameters of an online neural network that is trained by backpropagation (implementing a stop gradient at the output of the target neural network).
- the image processing system 500 with learnable parameters 0 can be considered to be the online neural network
- a version of the image processing system 500 with the same architecture but with parameters 0' determined by a moving average of the parameters 0 can be considered to be the target neural network.
- the exponential moving average can be determined as θ' ← μθ' + (1 − μ)θ, with 0 ≤ μ ≤ 1.
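This update can be expressed directly in code; the sketch below applies it parameter-wise, assuming the parameters are held as dictionaries of arrays (no gradient flows through the target, i.e. a stop-gradient is implied).

```python
def ema_update(target_params, online_params, mu=0.99):
    """Target-network update: theta' <- mu * theta' + (1 - mu) * theta."""
    return {name: mu * target_params[name] + (1.0 - mu) * online_params[name]
            for name in target_params}
```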
- the training can also use a supervised learning objective function, as described further later.
- FIG. 7 shows a training system 700 for training the image processing system 500 using the contextualized representation of a training image and a self-supervised learning objective function.
- the training system 700 of FIG. 7 may be implemented as one or more computer programs on one or more computers in one or more locations.
- the training system 700 comprises a first, “target” version of the image processing system 500’, and a second, “online” version of the image processing system 500.
- the learnable parameters, e.g. weights, of the image processing system 500’ are obtained from a moving average, e.g. an exponential moving average, of the parameters of the image processing system 500.
- the image processing system 500’ is configured to process a first transformed view 702’ of a training image to generate the contextualized representation 172’ of the first transformed view.
- the image processing system 500 is configured to process a second transformed view 702 of a training image to generate the contextualized representation 172 of the second transformed view.
- the first and second transformed views of the training image can be obtained as described below.
- the training system 700 can include a first neural network, e.g. comprising a first projection neural network 710’; and a second neural network, e.g. comprising a second projection neural network 710 and a prediction neural network 720.
- the first and second neural networks are respectively configured to process the contextualized representations of the first and second transformed views, to generate respective first and second representations of the contextualized representations.
- the first and second projection neural networks 710’, 710 are configured to process the contextualized representations of the first and second transformed views respectively, to generate respective first and second projected representations 712’, 712 of the contextualized representations.
- the first and second projection neural networks 710’, 710 can be helpful in projecting the contextualized representations into a space that facilitates use of the self-supervised learning objective function.
- the first and second projection neural networks can reduce a dimensionality of the contextualized representations.
- the learnable parameters, e.g. weights, of the first projection neural network 710’ are obtained from a moving average, e.g. an exponential moving average, of the parameters of the second projection neural network 710.
- the prediction neural network 720 is configured to process an input, e.g. the second projected representation 712, to generate a predicted representation 722 that is a prediction of a target generated using the image processing system 500’, e.g. a prediction of the first representation 712’.
- the above described second representation of the contextualized representations of the second transformed view comprises the predicted representation 722.
- FIG. 8 is a flow diagram of an example process for training the image processing system 500 using the contextualized representation of a training image and a self-supervised learning objective function. The process of FIG. 8 may be implemented by one or more computers in one or more locations, e.g. by the training system 700 of FIG. 7
- the process determines first and second transformed, e.g. “augmented”, versions of the training image 702’, 702.
- the transformed versions of the training data item may generally include e.g. random crops or distortion of the training data item.
- transformed views of the training data item may be obtained by transformations including one or more of random cropping of the image; flipping the image; brightness, color, saturation, hue or contrast jittering; color dropping; brightness, saturation, hue or contrast adjusting; Gaussian blurring; and solarization.
- Random cropping may comprise selecting a random patch of the image, optionally after increasing a size of the image by a scale factor; optionally the patch may then be re-sized.
- Flipping the image or video may involve applying a horizontal or vertical flip to the image.
- Color jittering may comprise changing one or more of the brightness, contrast, saturation and hue of some or all pixels of the image by a random offset.
- Color dropping may comprise converting the image or video to greyscale.
- Gaussian blurring may comprise applying a Gaussian blurring kernel to the image or video; other types of kernel may be used for other types of filtering.
- Solarization may comprise applying a solarizing color transform to the image; other color transforms may be used. Other transforms are possible such as rotation, or cutting out part of the image (e.g. by setting pixels of a random patch to a uniform value).
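A minimal sketch of generating a transformed view follows; two independent calls give the two views of a training image. Only a few of the listed transforms (random crop, horizontal flip, colour dropping) are shown, and the crop size and probabilities are illustrative assumptions.

```python
import numpy as np

def augment(image, rng=None):
    """Return one randomly transformed view of an (H, W, 3) image."""
    rng = rng or np.random.default_rng()
    h, w, _ = image.shape
    ch, cw = int(0.8 * h), int(0.8 * w)                      # random crop to 80% of each side
    top, left = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
    view = image[top:top + ch, left:left + cw]
    if rng.random() < 0.5:                                   # random horizontal flip
        view = view[:, ::-1]
    if rng.random() < 0.2:                                   # colour dropping (greyscale)
        grey = view.mean(axis=-1, keepdims=True)
        view = np.repeat(grey, 3, axis=-1)
    return view
```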
- the process determines the contextualized representation 172’ of the first transformed view of the training image, and processes this using the first neural network, i.e. the first projection neural network 710’, to generate the first representation 712’ (step 802).
- the contextualized representation of the second transformed view of the training image is determined and processed using the second neural network, i.e. the second projection neural network 710, to generate the second representation 712 (step 804, performed before or after step 802).
- the image processing system 500 is trained using an objective function that depends on a metric (of similarity or of difference) that measures a difference between the first and second representations (step 806).
- this involves updating learnable parameters of the second neural network using the objective function.
- the learnable parameters of the first neural network can be updated based on the corresponding parameters of the second neural network, e.g. using a moving average such as an exponential moving average of corresponding parameters of the second neural network.
- the training includes processing the contextualized representation 172’ of the first transformed view using the first projection neural network 710’ to generate the first representation, which comprises the first projected representation 712’.
- the training can also include processing the contextualized representation 172 of the second transformed view using the second projection neural network 710 to generate the second projected representation 712, and processing the second projected representation 712 using the prediction neural network to generate the second representation comprising the predicted representation 722.
- the objective function can depend on a metric (of similarity or of difference) that measures a difference between the first projected representation 712’ and the predicted representation 722.
- the training can then also involve updating parameters of the second projection neural network 710 and the prediction neural network 720, using the objective function.
- the learnable parameters of the first projection neural network 710’ can be updated based on the corresponding parameters of the second projection neural network 710, using a moving average.
- processing the contextualized representation of the first transformed view using the first neural network can involve determining the above described attention-pooled representation from the contextualized representation of the first transformed view and processing the attention-pooled representation of the first transformed view using the first neural network.
- determining the attention-pooled representation of the first transformed view can comprise applying an attention operation over the local contextualized representation 162 for each of the plurality of spatial locations in the first transformed view.
- Processing the contextualized representation of the second transformed view using the second neural network e.g. the second projection neural network, can comprise determining the attention-pooled representation from the contextualized representation of the second transformed view and processing the attention-pooled representation of the second transformed view using the second neural network. Determining the attention-pooled representation of the second transformed view can comprise applying an attention operation over the local contextualized representation 162 for each of the plurality of spatial locations in the second transformed view.
- the self-supervised learning objective function (loss function), L_SSL, may be defined as a temperature-scaled contrastive cross-entropy loss in which:
- c_i denotes the attention-pooled representation of training image i.
- a prime superscript denotes values that have been obtained from a neural network whose learnable parameters are a moving average of the corresponding parameters θ, so that, e.g., z’_j denotes the first projected representation 712’ obtained from the first transformed view of training image j.
- the sum in the loss is taken over different, i.e. contrasting, examples, e.g. over the first transformed views of other training images indexed by k (k ≠ j), e.g. from other training images in a batch of training images (which are assumed to be different).
- the self-supervised loss, L_SSL, is a form of contrastive cross-entropy loss.
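One form of temperature-scaled contrastive cross-entropy consistent with this description (the patent’s exact expression is not reproduced here) is sketched below: the online prediction for image j is pulled towards the target-branch projection of the same image and pushed away from the target-branch projections of the other images in the batch. Names and the temperature are assumptions.

```python
import numpy as np

def contrastive_loss(predicted, targets, j, tau=0.1):
    """NT-Xent-style loss for one image.

    predicted: (D,) online prediction (722) for training image j;
    targets: (N, D) target-branch projections (712') for all images in the batch.
    """
    p = predicted / np.linalg.norm(predicted)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = t @ p / tau                          # similarities to all target projections
    m = logits.max()
    log_sum_exp = m + np.log(np.exp(logits - m).sum())
    return log_sum_exp - logits[j]                # -log softmax probability of the positive pair
```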
- one or more of the training examples comprises the training image and an image label, y_i, for the training image, e.g. a classification label for a classification task.
- the image label y_i can have one or more dimensions.
- the image label may be stored in memory 140 with the average feature vector 114 for the training image and the corresponding value vector 118.
- the training may then involve (for one or more of the above described second plurality of the training images) querying memory 140 using the contextualized representation of the training image to retrieve a predicted label for the training image.
- the image processing system 500 can then be trained using a supervised objective function, L_SL, to minimize an error between the image label for the training image, y_i, and the predicted label for the training image, ŷ_i.
- the supervised objective function may comprise a cross-entropy loss, L_CE(y_i, ŷ_i).
- querying the memory using the contextualized representation of the training image to retrieve the predicted label for the training image can involve determining an attention-pooled representation from the contextualized representation of the training image, and querying the memory using the attention-pooled representation. Determining the attention-pooled representation can involve applying an attention operation over the local contextualized representation for each of the plurality of spatial locations in the training image, e.g. as described above. Querying the memory using the attention-pooled representation may then comprise determining a value of a similarity function of the attention-pooled representation and each of a plurality of the average feature vector keys 114. The predicted label for the training image may be determined as a weighted combination of the image labels stored in the memory for the plurality of the average feature vector keys.
- Each of the image labels may be weighted according to the value of the similarity function for a respective one of the plurality of the average feature vector keys.
- the self-supervised learning objective function, L_SSL, and the supervised learning objective function, L_SL, can be combined in a weighted combination.
- the supervised objective function may comprise a sum of a loss determined from each of these views. For example a total loss may be determined as L_SSL + α Σ_j L_CE^(j), where α is a weight, e.g. α ≈ 1, and j indexes the transformed views.
- FIG. 9 illustrates, schematically, the operation of a particular example implementation of the above described contrastive training process.
- First and second transformed (augmented) views 902a, b, of a training image 900 are generated, and each is processed by the image encoder 110 to generate a set of training image feature vectors (query vector 142, q^i), one for each spatial location or patch i of each augmented view of the training image 904a, b. These are used to query the spatially averaged training image feature vectors and corresponding value vectors stored in memory 140 to determine a respective set of local contextualized representations 906a, b for the first and second views.
- a respective attention-pooled representation 908a, b is determined for each of the sets of local contextualized representations, and these are used to evaluate a self-supervised learning objective function that is used to train the system, in particular the image encoder 110.
- FIG. 10 shows a graph illustrating adaption of an implementation of the image processing system 100 to a semantic segmentation task. More particularly the graph shows the mean IoU (Intersection over Union) metric for an image segmentation task on the y-axis against adaption time allowed for the model on the x-axis, for images from the PASCAL Visual Object Classes challenge dataset (Everingham et al., International Journal of Computer Vision 111(1): 98-136, 2015).
- the graph compares the adaption times of an implementation of the image processing system 100, curve 1000 (varying the compute by varying the number of example images stored); a system comprising a frozen image encoder with a specially adapted head in which only the head is trained, curve 1002; and a system comprising an image encoder with a specially adapted head in which the entire system is trained end-to-end, curve 1004. It can be seen that the image processing system 100 requires less compute for equivalent performance or, put differently, that the image processing system 100 learns much faster than the other approaches; and that it can achieve better final performance than some other approaches.
- An image processed by a system as described above may be a monochrome or color, still or moving image (e.g. video), in 2D or in 3D.
- One advantage of the described techniques is that, in implementations, they can be fast enough to apply to real-time video at, say, 30fps.
- Such an image may have been captured from a real-world environment, e.g. by a camera or other image sensor.
- a dense image processing task, in particular a dense prediction task, performed by the system may be a prediction task that relates to one or more real-world objects represented in the image.
- an “image” includes a point cloud, e.g. from a LIDAR system, and a “pixel” includes a point of the point cloud.
- any type of dense prediction task may be performed by the image processing system 100, by giving examples of the image processing task to the system as a set of task examples.
- the above described (pre)training process does not rely on having examples of any particular image processing task that the trained image processing system might perform.
- the training process populates the memory 140 using spatially averaged training image feature vectors rather than explicitly attempting, say, to match up similar regions amongst the training images.
- the (pre)training does not make any assumption about the nature of the task(s) to be performed, and the trained image processing system can be adapted to perform a variety of tasks using in-context learning.
- the trained image processing system 100 may be used to perform an image processing task comprising one or more of: image segmentation, e.g. semantic segmentation or instance segmentation; depth prediction; keypoint prediction; pose estimation, e.g. 3D pose estimation; surface normal estimation, e.g. by determining a vector in 2D or 3D; or object detection, including object tracking.
- any dense prediction task may be performed.
- Many other types of task may be performed in the same way, e.g. a curvature or other shape estimation task, a task that involves identifying aspects of an image using color, and so forth.
- the pixel labels for the task image pixels may each have a categorical value defining a category for the pixel, or a value representing a probability that the pixel belongs to a particular category.
- the category may represent an object or type of object or (for video) an action or type of action.
- the pixel labels may identify a type or category of object and in an instance segmentation task the pixel labels may (also) distinguish between different instances of the same category of object.
- the set of pixel labels for the task image pixels can locate categories or instances of objects or actions in an image. More generally a pixel label can distinguish between an object (or action) and image background, and the set of pixel labels for the task image pixels can, e.g., perform an object localization, detection, or tracking task, e.g. for gesture recognition.
- object segmentation may be used to segment medical images, to label pixels of an input medical image in accordance with whether they show a region of a human or animal body in which a particular medical condition is present.
- An object segmentation may be used to provide an input to a control system of a mechanical agent, such as a robot or vehicle operating in a real-world environment.
- the detected objects may be, e.g., obstacles or paths upon which the mechanical agent can move, and may be used by the control system e.g. to make decisions on how to accomplish a task performed by the robot, or for controlling the direction or speed of movement of the agent.
- the pixel labels for the task image pixels may comprise a scalar value representing an estimated depth value for the pixel, e.g. a distance of the pixel (of an object) in a depth or z-direction from an x-y image plane or camera viewpoint.
- the pixel labels may each define a depth distribution, e.g. a probability distribution over discrete depth value buckets.
- the set of pixel labels for the task image pixels can define a depth map for the task image.
- the pixel labels for the task image pixels may identify keypoints in the task image, e.g. by labelling a pixel as a keypoint or as one of multiple keypoints.
- the set of pixel labels for the task image pixels can thus label keypoints in the task image, e.g. one or more keypoints of an object in the image.
- keypoints may define landmarks of an object represented in the image, e.g. the positions of body joints for a human.
- the pixel labels for the task image pixels may map the task image pixels to a 3D surface, e.g. of a human body or face.
- the pixel labels may estimate a 6D pose representing translation and orientation components of an object in the task image, e.g. in quaternion form.
- the set of pixel labels for the task image pixels can estimate the pose of one or more objects in the task image.
- the pixel labels for the task image pixels may comprise a vector in, e.g., three dimensions defining the surface normal.
- the set of pixel labels for the task image pixels can provide a surface normal map for one or more objects in the task image, e.g. for use in an augmented reality or other application.
- Implementations of the trained image processing system can also be used to perform image-level image processing tasks, such as classification, as previously described.
- the described techniques may be extended to audio signal processing, with time domain or time-frequency domain samples of an audio waveform used in place of image pixels, and using an audio signal encoder in place of an image encoder.
- the described techniques may be otherwise unchanged; the types of task performed may correspond to those described above, e.g. audio signal segmentation (semantic or instance), audio object detection (e.g. detecting particular sounds, or words, e.g. hotword detection), and so forth.
- This specification uses the term “configured” in connection with systems and computer program components.
- a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions.
- one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
- Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus.
- the computer storage medium can be a machine- readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- data processing apparatus refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- the apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
- engine is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions.
- an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
- the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
- Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
- the central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser.
- a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
- Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
- Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework.
- Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client.
- Data generated at the user device e.g., a result of the user interaction, can be received at the server from the device.
Abstract
Computer-implemented methods, systems, and software for image processing. A particular image processing task to be performed is defined by a set of task examples that demonstrate the task. A memory stores keys and values based on the task examples, and a task image on which the particular image processing task is to be performed is processed using an image encoder to obtain a task image feature vector for each of a plurality of spatial locations in the task image. The task image feature vectors for the spatial locations are used to obtain query vectors that are applied to the memory using a query-key-value attention mechanism, to obtain predicted local label values that, in turn, provide a result for the image processing task.
Description
PERFORMING IMAGE PROCESSING TASKS BASED ON DEMONSTRATION EXAMPLES
BACKGROUND
[01] This specification relates to image processing using machine learning models.
[02] Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
SUMMARY
[03] This specification describes systems and methods, implemented as computer programs on one or more computers in one or more locations, that are able to perform multiple different image processing tasks. A task is defined by prompting with examples demonstrating the task. The different image processing tasks, which are generally so-called dense prediction tasks, can be performed using the same trained system without updating any learnable parameters of the system to adapt it to a specific task. That is, the described systems exhibit in-context learning and can learn to perform a task based only on the examples in a prompt.
[04] In one aspect there is described a computer-implemented method and system for performing a particular image processing task using an image processing system.
[05] In implementations the particular image processing task is defined by a set of task examples that demonstrate the task. A memory stores example image feature vectors and corresponding local label values for the set of task examples, these defining keys and values. An image on which the particular image processing task is to be performed, i.e. a “task image”, is processed using a trained image encoder to obtain a task image feature vector for each of a plurality of spatial locations in the task image. The task image feature vector for a spatial location is used to obtain a query vector that is applied to the memory using a query-key-value attention mechanism, to obtain a predicted local label value for the spatial location.
The predicted local label values for the spatial locations in the task image are used to obtain a result for the particular image processing task.
[06] In another aspect there is described a computer-implemented method of training an image encoder for the image processing system, so that it learns representations that are particularly useful for the described technique. In implementations this involves determining local contextualized representations in which the representation for a spatial location in a training image comprises a combination of the query vector for the spatial location and a predicted value vector for the spatial location. The predicted value vector for the spatial location is obtained by applying the query vector to average feature vector keys and value vectors in a memory, these having been obtained from a set of training images. During training this encourages the image encoder to learn representations that can minimize a self-supervised learning objective by attending to similar (nearest neighbor) images. This in turn facilitates the image encoder in solving dense prediction tasks based on a memory storing keys and values derived from task examples.
[07] In some implementations the training method uses an attention-pooled representation of the training image. In combination with the self-supervised learning objective this encourages the image encoder to learn representations that maximally distinguish one training image from others (e.g. in the same batch), i.e. representations that are based on the most distinctive part of an image.
[08] There is further described software to implement the described systems and methods, e.g. one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of a method as described herein.
[09] The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages.
[010] The described techniques enable a system to be configured to perform a wide variety of dense prediction image processing tasks, by providing the system with prompts comprising demonstration examples of the task. The type of task to be performed is indicated by the demonstrations, i.e. by examples of input images and corresponding task outputs. The described techniques do not rely on assumptions about the nature of the task. For example, in addition to existing types of task, were a new type of dense prediction task to be identified as useful, the system should be able to perform that task simply by being given examples of the task.
[011] This alleviates the need for task-specific models, providing a generalist system that can assist with many different tasks in a fast and data-efficient manner. In some cases a relatively small number of examples of a task, e.g. < 100 examples, can be sufficient to
configure the system to perform a particular task. Further, the performance of the system in such tasks can approach that of a system specifically trained to perform just the single, particular task and can outperform other approaches such as fine-tuning.
[012] A generalist model of this type can have a wide range of applications. For example the system can be used to adapt to a particular environment or distribution of images. In, say, an autonomous or semi-autonomous vehicle the system could quickly adapt to new weather conditions, and could potentially adapt to previously unseen conditions. Implementations of the system can adapt to previously unforeseen or unencountered conditions, e.g. in real-time, just by providing the system with some examples (which may have been generated synthetically).
[013] Surprisingly, the way in which memory and attention are used in implementations of the described approach enables good performance on dense prediction tasks.
Implementations of the system also scale well as the number of parameters, e.g. weights, of the image encoder is increased, and as the number of training examples increases. Thus the system performance can be increased by training larger image processing systems using larger datasets, providing a straightforward route to achieving better performance on more complex tasks.
[014] In general implementations of the described techniques provide improvements in generality, data efficiency, and adaptation speed by comparison with other techniques.
[015] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[016] FIG. 1 shows an example of an image processing system.
[017] FIG. 2 is a flow diagram of an example process for configuring an image processing system.
[018] FIG. 3 is a flow diagram of an example process for using the image processing system.
[019] FIG. 4 illustrates the operation of an example of the process of FIG. 3.
[020] FIG. 5 shows an example of a system for training an image encoder.
[021] FIG. 6 is a flow diagram of an example process for training an image encoder.
[022] FIG. 7 shows a training system for training an image processing system.
[023] FIG. 8 is a flow diagram of an example process for training an image processing system.
[024] FIG. 9 illustrates, schematically, the operation of a contrastive training process.
[025] FIG. 10 compares the performance of an image processing system as described herein to that of other image processing systems.
[026] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
[027] This specification generally describes techniques for image processing. More particularly it describes examples of systems that are able to perform a range of different image processing tasks without modification, i.e. without any learned parameter updates (fine-tuning or otherwise). A system as described herein can effectively “learn” from a few examples of the particular image processing task to be performed, which may be provided in a “prompt”. More particularly implementations of the system can perform so-called dense prediction tasks, which require pixel-level labelling. Implementations of the system are agnostic to the form of the labels and can, e.g., perform both pixel-level classification and regression tasks.
[028] Implementations of the system use an image encoder but do not rely on a particular type of training for the image encoder. Nonetheless the specification also describes techniques for training the image encoder so that it performs particularly well when used as described. In general this involves using attention across multiple training images. In some implementations (but not necessarily) this also involves using attention within a training image.
[029] FIG. 1 shows an example of an image processing system 100 that may be implemented as one or more computer programs on one or more computers in one or more locations, for performing an image processing task.
[030] The image processing system 100 is configured to perform a particular image processing task of a range of possible image processing tasks that the image processing system can perform, using a set of task examples 102, or task “prompts”, that demonstrate the particular image processing task. That is, the system is enabled to perform the particular image processing task by a form of in-context learning. The particular image processing task is performed on a task image 104 comprising a plurality of task image pixels. The particular
image processing task that is performed can be a dense image processing task or an image-level task.
[031] The image processing system 100 comprises a (trained) image encoder 110 configured to process an image to generate a spatial representation 112 of the image. In implementations the spatial representation comprises a feature vector for each of a plurality of spatial locations in the image, e.g. for each of a set of regions, such as “patches” that tile the image.
[032] In general any type of image encoder may be used as image encoder 110. In implementations the (trained) image encoder 110 comprises a trained image encoding neural network. Such an image encoding neural network may have any suitable architecture and may include, e.g., one or more feed forward neural network layers, one or more recurrent neural network layers, one or more convolutional neural network layers, one or more attention neural network layers, or one or more normalization layers.
[033] As one example image encoder 110 may comprise a residual neural network encoder (ResNet), e.g. as described in He et al., “Deep residual learning for image recognition”, Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[034] As another example image encoder 110 may comprise a transformer neural network. A transformer network is typically a neural network characterized by having a succession of self-attention neural network layers. A self-attention neural network layer has an attention layer input for each element of the input and is configured to apply an attention mechanism over the attention layer input to generate an attention layer output for each element of the input; there are many different attention mechanisms that can be used. In some implementations the image encoder 110 comprises a vision transformer (ViT), e.g. as described in Dosovitskiy et al., arXiv:2010.11929.
[035] The image processing system 100 also comprises a memory, M, 120 configured to store keys and corresponding values. These can be obtained from the set of task examples 102, as described later. Each task example can comprise an example image comprising a plurality of example image pixels and a set of labels, e.g. pixel labels, each pixel label having a label value. The memory 120 stores image feature vector keys and corresponding local label values. More particularly the memory 120 stores example image feature vectors and corresponding local label values for the images in the set of task examples. The example image feature vectors define keys, i.e. the image feature keys, for accessing the corresponding local label values.
[036] The image processing system 100 includes a query-key-value attention mechanism 130. The query-key-value attention mechanism 130 is configured to apply a query vector for a spatial location in the task image 104, derived from the spatial representation 112 of the task image obtained from the image encoder 110, to keys and values 122 in the memory 120, i.e. to the image feature vector keys and corresponding local label values. The query result comprises a predicted local label value for the spatial location, from which a set of pixel labels 132 for pixels of the task image can be obtained, either directly, or indirectly e.g. by upsampling.
[037] In general a query-key-value attention mechanism can be a mechanism that maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function (e.g. a similarity function), such as a dot product or scaled dot product, of the query with the corresponding key.
[038] FIG. 2 is a flow diagram of an example process for configuring an image processing system including a trained image encoder, e.g. the image processing system 100 of FIG. 1, to perform a particular image processing task of a range of possible image processing tasks. The process of FIG. 2 may be implemented by one or more computers in one or more locations.
[039] More particularly FIG. 2 shows a flow diagram of an example process for receiving a set of task examples 102, and for using these to store keys and values in the memory 120. In some applications of the techniques described herein suitable keys and values will have been previously stored in the memory 120, and thus the process of FIG. 2 is optional.
[040] Referring to FIG. 2, at step 200 the process receives a set of task examples 102. The set of task examples can demonstrate a particular image processing task, i.e. each task example can be an example of the particular image processing task.
[041] Each task example includes an example image comprising a plurality of example image pixels and a set of labels, e.g. a set of pixel labels in which each pixel label has a label value. That is, in some implementations there can be a pixel label for each pixel of the example image. As one example the label value can be, e.g., a categorical label in which the label value corresponds to one of a predetermined set of categories, e.g. for classifying the pixel, e.g. for a task such as semantic segmentation. As another example the label value can be a (continuous) scalar value, e.g. an image depth value; or a vector value, e.g. a vector defining a surface normal direction. The combination of the example image and the set of pixel labels defines an example of a particular image processing task to be performed.
[042] Implementations of the described techniques can work in a regime with a limited number of training examples, e.g. in the range $10^2$ to $10^4$ examples, and can be particularly useful in a few-shot regime, e.g. < 100 training examples. The training examples are used to populate the memory 120, and thus performance can be traded against processing speed, by trading memory size (which depends on the number of training examples) against memory query time (which depends on the memory size).
[043] For each task example, the example image in the task example is processed using the (trained) image encoder 110 to generate the spatial representation 112 of the example image (step 202). In implementations this spatial representation comprises an example image feature vector for each of the plurality of spatial locations in the example image. As described above, each spatial location may comprise a patch of the image, such as a 16x16 pixel patch or a patch of some other size or shape.
[044] A local label value for each of the spatial locations in the example image is obtained from the set of pixel labels (step 204). Obtaining the local label value for each of the spatial locations in the example image, from the set of pixel labels, may comprise, for each of the spatial locations, averaging the pixel labels over a region of the example image corresponding to the spatial location. Alternatively the local label value can be obtained by taking a maximum, minimum, or median value of the pixel labels for a spatial location, or by determining the local label value from the set of pixel labels in some other way.
[045] The example image feature vectors and corresponding local label values can then be stored in the memory 120 (step 206). The example image feature vectors define keys for accessing the corresponding local label values. In some implementations just a subset of these may be stored in the memory 120.
[046] As one particular example, the image encoder 110 can process an example image $x_i$ that has $H \times W$ patches, to generate a spatially flattened feature map comprising $HW$ example image feature vectors, $k_i^j$, each of $D$ dimensions, where $j$ indexes a patch. This may be represented as $k_i = f_\theta(x_i)$, where $f_\theta(\cdot)$ denotes the image encoder 110 comprising learned parameters, e.g. weights, $\theta$. Optionally the features of $k_i$ may be normalized, e.g. L2 normalized. The local label value $l_i^j$ for a patch may be obtained by averaging the pixel labels $y_i^j$ of the patch. The pixel labels may comprise, e.g., a one-hot vector of class labels, or a scalar, say denoting depth in the example of monocular depth estimation. The memory 120 may then store a set of keys and values $\mathcal{M} = \{k_i^j, l_i^j\}$ obtained from the training examples.
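Merely as an illustrative sketch of the configuration step above, the following Python/NumPy code populates a key-value memory from task examples by encoding each example image into patch features (the keys) and averaging the pixel labels over each patch (the values). The function names `build_memory` and `encode`, and the fixed 16-pixel patch size, are illustrative assumptions rather than elements required by the system described above.

```python
import numpy as np

def build_memory(task_examples, encode, patch=16):
    """Populate the memory with (key, value) pairs from task examples.

    `encode` is assumed to map an image of shape (h*patch, w*patch, 3) to a
    (h*w, D) array of patch feature vectors; `labels` holds one label vector
    per pixel, shape (h*patch, w*patch, L).
    """
    keys, values = [], []
    for image, labels in task_examples:
        feats = encode(image)                      # (h*w, D) patch features -> keys
        h, w = image.shape[0] // patch, image.shape[1] // patch
        # Average the pixel labels over each patch to obtain one local label value.
        patch_labels = labels.reshape(h, patch, w, patch, -1).mean(axis=(1, 3))
        keys.append(feats)
        values.append(patch_labels.reshape(h * w, -1))
    return np.concatenate(keys), np.concatenate(values)
```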
[047] There is no requirement for all the example images to be the same size, nor is there a requirement for all the patches to be the same size. Nonetheless it can be useful to store the same number of features for each example image.
[048] For any particular image processing task the process of FIG. 2 need only be performed once, to store keys and values for the task in the memory 120. The stored keys and values can be reused to process multiple different task images.
[049] In some implementations a proper subset, i.e., less than all, of the example image feature vectors and corresponding local label values is sampled, and the sampled subset is stored in the memory 120. Subsampling of this type can be used depending on the number of training examples and the available size of memory 120, e.g. depending on whether the data from the training examples will fit in the memory 120.
[050] The sampling, where used, may comprise random sampling e.g. according to a sampling distribution. There are many possible sampling strategies. In some implementations the sampling distribution is determined so as to prioritize sampling the most salient patches within an example image. For example where pixel labels comprise class labels (categorical values) this may be done by increasing the relative likelihood of sampling patches containing class labels that appear less frequently in the example image, e.g. by up-weighting these patches. Also or instead this may be done by decreasing the relative likelihood of sampling patches that do not contain any valid pixel labels.
[051] As one particular example, a class score for each patch $j$ may be determined as $s^j = \sum_c \frac{1}{K_c}\,\mathbb{1}(\text{class } c \text{ is in patch } j)$, where $\mathbb{1}(\cdot)$ is 1 if class $c$ is in patch $j$ and 0 otherwise, and where $K_c$ is a count of how many patches class $c$ appears in, so that patches containing rarer classes receive higher class scores. Then a predetermined number of features per image, n_features_per_image, may be stored in the memory 120, by choosing the features $k_i^j$ that have the lowest final scores, where the final score for a patch combines its class score with a value sampled from a uniform distribution in the range $[0, 1]$, and where $C$ is a constant, e.g. $10^6$, added to the final score of patches with no class label so as to deprioritize them. The predetermined number of features per image, n_features_per_image, may be determined as the size of the memory in terms of numbers of features divided by $N \cdot$ num_augmentations, where $N$ is a number of images in the set of training examples and num_augmentations is a number of augmentations applied to each example image (where image augmentation is used to increase the number of example images).
[052] As another particular example, to decrease a relative likelihood of sampling patches that do not contain any valid pixel labels, an ordered list may be created by randomly ordering each patch in an example image and then placing all patches with no valid class labels after any patch with a valid class label, and then taking the first n_features_per_image entries from the list.
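Merely as an illustrative sketch of one plausible realization of the per-patch sampling described above, the following Python/NumPy code scores patches so that rare-class patches are kept preferentially, a uniform random term is used for tie-breaking, and unlabeled patches are deprioritized with a large constant. The exact combination of class score and random value, and the sign convention, are assumptions, since only the ingredients of the scoring rule are given above.

```python
import numpy as np

def select_patches(patch_classes, n_keep, no_label=-1, c_penalty=1e6, rng=None):
    """Pick the n_keep most salient patches of one example image.

    patch_classes: integer class id per patch, shape (num_patches,); patches
    with no valid label carry `no_label`.
    """
    rng = np.random.default_rng() if rng is None else rng
    valid = patch_classes != no_label
    # Count how many patches each class appears in; rarer classes -> higher class score.
    classes, counts = np.unique(patch_classes[valid], return_counts=True)
    count_of = dict(zip(classes.tolist(), counts.tolist()))
    class_score = np.array([1.0 / count_of[c] if c in count_of else 0.0
                            for c in patch_classes])
    # Lower final score = kept preferentially; random term breaks ties.
    final_score = rng.uniform(0.0, 1.0, size=len(patch_classes)) - class_score
    final_score[~valid] += c_penalty          # deprioritize patches with no label
    return np.argsort(final_score)[:n_keep]   # indices of the kept patches
```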
[053] FIG. 3 is a flow diagram of an example process for using the image processing system 100 of FIG. 1 to perform a particular image processing task. The process of FIG. 3 may be implemented by one or more computers in one or more locations.
[054] Referring to FIG. 3, at step 300, a task image 104 comprising a plurality of task image pixels is received for processing according to the particular image processing task defined by the set of task examples 102. The task image 104 is processed using the trained image encoder 110 to generate the spatial representation 112 of the task image, in particular comprising a task image feature vector for each of the plurality of spatial locations in the task image (step 302). In general each spatial location corresponds to a region or “patch” comprising multiple pixels, although in some implementations each spatial location may correspond to a single pixel.
[055] The process then performs a set of steps, labelled below as steps 306 to 310, for each of the plurality of spatial locations in the task image (step 304).
[056] For each spatial location a query vector for the spatial location is obtained from the task image feature vector corresponding to the spatial location (step 306). For example the task image feature vector can be used as the query vector.
[057] The query vector is then applied to the image feature vector keys and local label values in the memory 120, using the query-key-value attention mechanism, to obtain a predicted local label value for the spatial location as the query result (step 308). The predicted local label values for the spatial locations may be collected together, e.g. concatenated, for processing.
[058] A set of pixel labels for the task image pixels is then determined from the predicted local label value for each of the spatial locations in the task image (step 310). In implementations the particular image processing task is performed by obtaining the set of pixel labels for the task image pixels. That is, in implementations the set of pixel labels for the task image pixels constitutes a result of the particular image processing task.
[059] In some implementations determining the set of pixel labels for the task image pixels from the predicted local label value may comprise, for each of the spatial locations in the task image, upsampling the predicted local label value to a resolution of the task image. In some
other implementations the set of pixel labels for the task image pixels can be the predicted local label value for each of the spatial locations in the task image.
[060] In implementations applying the query vector to the image feature vector keys and local label values in the memory 120, using the query-key-value attention mechanism, to obtain a predicted local label value for the spatial location, involves determining a value of a similarity function of the query vector and each of a plurality of the image feature vector keys. The similarity function, in particular a value of the similarity function, can define a similarity metric that measures a similarity between a query vector and an image feature vector key. The predicted local label value, e.g. for a spatial location, can then be determined as the query result from a weighted sum of the local label values in the memory, each weighted by the value of the similarity function for a corresponding image feature vector key. This process defines a cross-attention operation, $CA(\cdot)$.
[061] In implementations, therefore, the method applies cross-attention between the query vector and the keys and values stored in the memory. The cross-attention may be defined as $\hat{l}^i = CA(q^i, k^j, l^j)$ where $\hat{l}^i$ denotes a predicted local label value for a spatial location in the task image indexed by $i$; $q^i$ denotes the query vector for the spatial location $i$, from the task image; $k^j$ denotes the image feature vector key stored in the memory 120 and indexed by $j$; and $l^j$ denotes the local label value stored in the memory 120 and indexed by $j$. In this example a single integer $j$ is used to index into the memory 120, $\mathcal{M} = \{k^j, l^j\}$, which has $|\mathcal{M}|$ entries, $j = 1, \dots, |\mathcal{M}|$, not distinguishing between entries from different example images.
[062] As a particular example, cross-attention, $CA(\cdot)$, specified by the operations below may be used:
$s^j = \langle q^i, k^j \rangle / \beta$, $a^j = \mathrm{softmax}(s^j)$, $\hat{l}^i = \sum_j a^j\, l^j$
where $\langle \cdot , \cdot \rangle$ denotes a compatibility (similarity) function such as a dot product, cosine, or other compatibility (similarity) function; and where $\beta$ denotes an optional temperature scaling.
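Merely as an illustrative sketch, the following Python/NumPy code implements the cross-attention prediction described above for a batch of task-image patch queries, together with a simple nearest-neighbour upsampling of the per-patch predictions back to pixel resolution. The function names, the dot-product similarity, and the example temperature value are illustrative assumptions.

```python
import numpy as np

def cross_attend(queries, keys, values, beta=0.1):
    """Predict local label values by attending from task-image patches to memory.

    queries: (P, D) task-image patch features; keys: (N, D) memory keys;
    values: (N, L) memory label values. Returns (P, L) predicted local labels.
    """
    scores = queries @ keys.T / beta                  # similarity logits s^j
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)           # a^j = softmax(s^j)
    return attn @ values                              # weighted sum of label values

def to_pixel_labels(patch_labels, h, w, patch=16):
    """Upsample per-patch predictions to pixel resolution (nearest neighbour)."""
    grid = patch_labels.reshape(h, w, -1)
    return np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)
```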
[063] In some implementations the cross-attention, $CA(\cdot)$, is limited to a set of $k$ nearest neighbors for each query. That is, applying the query vector to the image feature vector keys and local label values in the memory to obtain the predicted local label value for the spatial location can involve identifying a set of $k$ nearest neighbor image feature vector keys to the query vector according to a similarity metric. The similarity metric can measure a similarity between a query vector and an image feature vector key; it may be, but need not be, the above described compatibility (similarity) function. In some implementations the set of $k$ nearest neighbors is determined using an approximate nearest neighbor search. As one example, the open source ScaNN library (Guo et al., arXiv:1908.10396, 2020) can be used to return the top-$k$ nearest neighbors for a query and scores for the similarity that can be used as attention logits.
[064] Merely as an illustration, for a memory 120 that has a size of order $10^6$ to $10^7$ entries, corresponding to of order $10^5$ example images, example values are $k < 100$ and $\beta < 1$. Implementations of the system, particularly with memory sizes towards the lower end of this range, can be fast enough for real time video processing.
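Merely as an illustrative sketch of the $k$-nearest-neighbor variant, the following Python/NumPy code restricts the attention for one query to its $k$ most similar keys. An exact top-$k$ search via `argpartition` is shown for clarity; in practice an approximate nearest-neighbor index (such as the ScaNN library mentioned above) could supply the neighbors and similarity scores instead. Function name and parameter defaults are illustrative.

```python
import numpy as np

def knn_cross_attend(query, keys, values, k=50, beta=0.1):
    """Cross-attention limited to the k most similar memory entries for one query.

    query: (D,), keys: (N, D), values: (N, L); requires k < N.
    """
    sims = keys @ query                          # dot-product similarity to every key
    top = np.argpartition(-sims, k)[:k]          # indices of the k best keys (exact)
    logits = sims[top] / beta
    logits -= logits.max()
    weights = np.exp(logits)
    weights /= weights.sum()                     # softmax over the k neighbours only
    return weights @ values[top]                 # (L,) predicted local label value
```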
[065] In implementations the image processing task comprises a dense image processing (prediction) task, i.e. an image processing (prediction) task that involves assigning a value to each region or pixel of an image. The particular image processing task to be performed is defined by the set of task examples 102. The image processing system 100 is configured to perform a particular image processing task by storing the example image feature vectors and corresponding local label values from the task examples in the memory 120, without any adjustment of learnable parameters, e.g. neural network or other weights, of the image processing system. That is, image processing system 100 can perform in-context learning. Provided that the particular image processing task is not changed there is no need to update the information stored in the memory 120.
[066] The image processing system 100 can also perform image-level processing tasks, such as classification. This can be done by combining, e.g. spatially averaging or otherwise pooling, the example image feature vectors for each of the plurality of spatial locations in an example image. For each of the example images the combined, e.g. spatially averaged, example image feature vectors, and corresponding image-level label values, can be stored in the memory 120 as respective keys and values. A query vector for the task image can similarly be obtained by combining, e.g. pooling, the task image feature vectors for the spatial locations in the task image, and the query vector can then be used to query memory 120, using a query-key-value attention mechanism as described above, to determine an image-level label for the task image.
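Merely as an illustrative sketch of the image-level variant, the following Python/NumPy code pools the patch features of each example image into a single key, pools the task-image features into a single query, and attends to stored image-level labels. The function name and the dot-product similarity with a temperature are illustrative assumptions.

```python
import numpy as np

def image_level_predict(task_feats, example_feats, example_labels, beta=0.1):
    """Image-level prediction (e.g. classification) by pooling patch features.

    task_feats: (P, D) patch features of the task image; example_feats: list of
    (P_i, D) arrays, one per example image; example_labels: (N, C) image-level
    label values (e.g. one-hot class vectors), one row per example image.
    """
    keys = np.stack([f.mean(axis=0) for f in example_feats])   # pooled keys, (N, D)
    query = task_feats.mean(axis=0)                            # pooled query, (D,)
    logits = keys @ query / beta
    logits -= logits.max()
    w = np.exp(logits)
    w /= w.sum()
    return w @ example_labels                                  # (C,) image-level label
```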
[067] FIG. 4 illustrates the operation of an example of the process of FIG. 3, for a case when the particular image processing task to be performed is a semantic segmentation task. The top row of FIG. 4 shows example images from a set of task examples 102 on the left, and an example of a task image 104 on the right. The lower row of FIG. 4 shows pixel-level semantic segmentation labels, for the example images on the left, and for the task image 104 on the right. The links 400, 402 between the images illustrate the cross-attention process, locating patches in the example images that are similar to a patch in the task image (link 400), and aggregating their corresponding local label values (link 402).
[068] FIG. 5 shows an example of an image processing system 500 for training the image encoder 110 of FIG. 1. The image processing system 500 of FIG. 5 may be implemented as one or more computer programs on one or more computers in one or more locations.
[069] The image processing system 500 is configured to receive and process a training image 502. More particularly the image encoder 110 processes the training image to generate a spatial representation of the training image comprising a feature vector for each of a plurality of spatial locations in the image, as previously described.
[070] The image processing system 500 also includes a memory 140 that is configured to store keys and corresponding values. Memory 140 is different from memory 120 because it is configured to store different information. Memory 140 is not needed after training. The training image feature vectors are used to determine the stored keys and values as described below.
[071] In implementations the training image feature vectors are (spatially) averaged over the plurality of spatial locations in the training image to determine an average feature vector 114 for the training image. The average feature vector 114 defines a key that is stored in memory 140.
[072] A value update neural network 116, e.g. a multilayer perceptron (MLP), is configured to process the average feature vector 114, e.g. the key, to generate a corresponding value vector 118 for the training image. Memory 140 stores the corresponding value vector 118 with the average feature vector 114, and the average feature vector 114 defines a key for the corresponding value vector 118.
[073] As one particular example, the image encoder 110 can process a training image $x_i$ that has $H \times W$ patches to generate a spatially flattened feature map comprising $HW$ training image feature vectors, $h_i^j$, each of $D$ dimensions, where $j$ indexes a spatial location or patch. This may be represented as $h_i = f_\theta(x_i)$ where, as before, $f_\theta(\cdot)$ denotes the image encoder 110 comprising learnable parameters, e.g. weights, $\theta$. Then a $D$-dimensional spatially averaged feature vector for the training image 114, i.e. the key, may be determined as $k_i = \frac{1}{HW}\sum_j h_i^j$, where the subscript $i$ indexes the training image. The corresponding $D$-dimensional value vector 118 for the training image $x_i$ may be determined as $v_i = \phi_\theta(k_i)$, where $\phi_\theta(\cdot)$ denotes the value update neural network 116 comprising learnable parameters, e.g. weights, $\theta$. Here $\theta$ denotes a set of learnable parameters including, inter alia, those of the image encoder 110 and those of the value update neural network 116.
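Merely as an illustrative sketch of how a training-memory entry can be computed, the following Python/NumPy code averages the patch features of one training image to form the key and passes the key through a small MLP standing in for the value update network $\phi_\theta$. The helper `make_value_mlp` with random (untrained) weights is purely illustrative; in the system described above these weights would be learned.

```python
import numpy as np

def training_memory_entry(patch_feats, value_mlp):
    """Compute one (key, value) pair for the training memory.

    patch_feats: (H*W, D) features h_i^j from the image encoder for one training
    image; value_mlp: a callable standing in for the value update network.
    """
    key = patch_feats.mean(axis=0)          # spatially averaged feature vector k_i
    value = value_mlp(key)                  # corresponding value vector v_i
    return key, value

def make_value_mlp(d, hidden=256, seed=0):
    """Toy stand-in for the value update MLP (weights would be learned in practice)."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=d ** -0.5, size=(d, hidden))
    w2 = rng.normal(scale=hidden ** -0.5, size=(hidden, d))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2
```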
[074] The image processing system 500 includes a memory access subsystem 150 that is configured to process a query vector 142 for a spatial location in a training image, in particular by applying a query-key-value attention mechanism as previously described to the keys and values in memory 140, to obtain a predicted value vector 152 for the spatial location. The predicted value vector 152 for the spatial location is used as the query result.
[075] In implementations the query-key-value attention mechanism comprises a cross-attention mechanism, e.g. defined as $\hat{v}^i = CA(q^i, k_j, v_j)$, where $\hat{v}^i$ is the query result, where the superscript indexes the spatial location and the subscript indexes training images. Here $q = f_\theta(x)$ is the spatial representation of a training image $x$ obtained from the image encoder 110 (comprising a feature vector for each of the plurality of spatial locations in the image), and $q^i$ denotes the training image feature vector for the spatial location or patch $i$ of the training image. That is, the training image feature vector $q^i$ is used as a query vector 142 to cross-attend to the keys 114 and values 118 stored in memory 140 to compute the query result.
[076] Some implementations of the image processing system 500 also include a linear neural network layer 160 that is used to process a combination of the predicted value vector 152 for a spatial location and of the query vector for the spatial location to generate a local contextualized representation 162 for the spatial location.
[077] The image processing system 500 is configured to use the local contextualized representation for each of the plurality of spatial locations in a training image to obtain a contextualized representation 172 for the training image from (comprising) the local contextualized representation for each of the spatial locations. That is, the image processing system 500 is configured to determine a contextualized representation 172 for the training image from (comprising) the local contextualized representations.
[078] Some implementations of the image processing system 500 include an attention neural network 170 for determining an attention-pooled representation of the training image from the local contextualized representations 162 for each of the plurality of spatial locations, for obtaining the contextualized representation 172 comprising the local contextualized representation for each of the spatial locations.
[079] In some implementations determining the attention-pooled representation can involve determining a soft mask for each of the plurality of spatial locations, i.e. in which a value of the soft mask for each of the spatial locations defines a respective weight for the spatial location. The local contextualized representations for each of the plurality of spatial locations can be combined, weighted by the respective weight for the spatial location defined by the soft mask, to obtain the attention-pooled representation.
[080] Determining a soft mask for each of the plurality of spatial locations can involve processing the local contextualized representation for each of the plurality of spatial locations using the attention neural network 170 to generate the respective weight for each of the spatial locations. Using the attention neural network 170 to attend over the local contextualized representations 162 in this way can help the system to learn to select the most distinctive part of each training image for the contextualized representation 172 during training.
[081] As a particular example, the soft mask, $m_i$, for a training image $i$ can be determined as $m_i = \mathrm{softmax}(a_\theta(c_i))$, where $c_i$ denotes a set of the local contextualized representations for the spatial locations in training image $i$, e.g. a concatenation of the ($D$-dimensional) local contextualized representations for each of the plurality of spatial locations in training image $i$; $j$ denotes the, e.g., $H \cdot W$ spatial locations or patches; and where the soft mask, $m_i$, defines the value of the mask, or weight, $m_i^j$, for each spatial location $j$. Here $a_\theta$ denotes the attention neural network 170, which is configured to process the set of local contextualized representations in accordance with learnable parameters, e.g. weights, of the attention neural network 170, to generate a corresponding set of values of the mask (to which a softmax is applied). Here the subscript $\theta$ denotes that the previously described set of learnable parameters includes the learnable parameters of the attention neural network 170.
[082] An attention-pooled representation, $\bar{c}_i$, of training image $i$, i.e. the ($D$-dimensional) contextualized representation 172, can be determined as $\bar{c}_i = \sum_j m_i^j\, a_\theta^{val}(c_i)^j$, where $a_\theta^{val}(\cdot)$ represents an additional value head of the attention neural network 170, i.e. an additional set of outputs of the attention neural network 170 that provides a $D$-dimensional value for each of the set of local contextualized representations $c_i^j$. In some other implementations the $c_i^j$ can be used instead of $a_\theta^{val}(c_i)^j$, and the value head can be omitted.
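Merely as an illustrative sketch of the attention pooling described above, the following Python/NumPy code computes the soft mask from per-location logits and takes the mask-weighted sum over an optional value head. The two callables passed in stand in for the attention network $a_\theta$ and its value head; their names are illustrative.

```python
import numpy as np

def attention_pool(local_reps, attn_logit_fn, attn_value_fn=None):
    """Attention-pool local contextualized representations into one vector.

    local_reps: (P, D) local contextualized representations c_i^j for one image;
    attn_logit_fn: maps (P, D) -> (P,) mask logits; attn_value_fn (optional):
    maps (P, D) -> (P, D) value-head outputs.
    """
    logits = attn_logit_fn(local_reps)
    logits -= logits.max()
    mask = np.exp(logits)
    mask /= mask.sum()                                 # soft mask m_i over locations
    vals = attn_value_fn(local_reps) if attn_value_fn else local_reps
    return mask @ vals                                 # (D,) attention-pooled representation
```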
[083] FIG. 6 is a flow diagram of an example process for training an image encoder, e.g. the image encoder 110 of FIG. 1. The process of FIG. 6 may be implemented by one or more computers in one or more locations.
[084] During the training the image encoder is incorporated in an image processing system such as image processing system 500, and the image processing system including the image encoder is trained so that the image encoder can be used in the image processing system 100 to perform a particular image processing task defined by the set of task examples 102. This training of the image encoder may be referred to as pre-training, to distinguish from the incontext learning used to perform a particular image processing task. For convenience the training will be described with reference to image processing system 500.
[085] In broad terms the training involves populating the memory 140 with keys 114 and values 118 as described above, obtained from a first plurality of training images in a set of training examples, and using the stored keys and values to process a second plurality of training images in the set of training examples, to train the image processing system 500.
[086] The first and second pluralities of images can overlap, i.e. one of the second plurality of training images may be used to train the system as described below, and afterwards used as one of the first plurality of training images, to update the keys and values stored in the memory. For example, a batch of training images in the set of training examples can be used to train the image processing system 500 based on keys and values computed from previous batches of training images.
[087] In general the training encourages the representation of an image generated by the image encoder 110 to be expressed as a combination of representations of similar example images. This facilitates use of the image encoder 110 in the image processing system 100, in which an image processing task is performed by combining local label values of similar example images.
[088] The process involves obtaining a set of training examples (step 600), each training example comprising at least a training image. In some implementations, but not necessarily, the training examples can include an image label (described later).
[089] For each of the first plurality of the training images the training image is processed using the image encoder 110 to generate a spatial representation of the training image comprising a training image feature vector for each of the plurality of spatial locations in the training image (step 602). The training image feature vectors are (spatially) averaged over the plurality of spatial locations in the training image, e.g. as described above, to determine the average feature vector 114 for the training image (step 604).
[090] The average feature vector 114 is processed using the value update neural network 116 to generate the corresponding value vector 118 for the training image (step 606). The average feature vector 114 and the corresponding value vector 118 are stored in memory 140 (step 608), the average feature vector 114 defining a key for the corresponding value vector 118.
[091] Training the image processing system 500 can involve, for each of the second plurality of training images, processing the training image using the image encoder 110 to generate the spatial representation of the training image comprising a training image feature vector for each of the plurality of spatial locations in the training image (step 610).
[092] The training can also involve, for each of the plurality of spatial locations in the training image, obtaining the query vector 142 for the spatial location from the training image feature vector, $q^i$, corresponding to the spatial location (step 612). The query vector 142 is applied to the average feature vector keys 114 and value vectors 118 in memory 140, using the memory access subsystem 150, to obtain the predicted value vector 152 for the spatial location as the query result $\hat{v}^i$ (step 614).
[093] The local contextualized representation 162 is determined for each of the plurality of spatial locations in the training image (step 616). In implementations the local contextualized representation 162 for a spatial location in the training image comprises a combination of the query vector for the spatial location, i.e. $q^i$, and the predicted value vector 152 for the spatial location, $\hat{v}^i$. Thus the contextualized representation 172 for the training image can comprise, i.e. be derived from, the local contextualized representation 162 for each of the plurality of spatial locations in the training image.
[094] In some implementations determining the local contextualized representation 162 for a spatial location in the training image involves determining a weighted combination of the
query vector 142 for the spatial location, i.e. $q^i$, and the predicted value vector 152 for the spatial location, $\hat{v}^i$. The weighted combination can be processed using the linear neural network layer 160 to generate the local contextualized representation 162.
[095] For example the local contextualized representation 162, $c^i$, can be determined as $c^i = g_\theta\big((q^i + \lambda\,\hat{v}^i)/\lVert q^i + \lambda\,\hat{v}^i\rVert\big)$, where $g_\theta(\cdot)$ denotes the linear neural network layer 160 and the subscript $\theta$ denotes that the previously described set of learned parameters includes the learnable parameters of the linear neural network layer(s) 160; where $\lVert \cdot \rVert$ denotes the L2 norm; and where $\lambda$ is a weighting parameter with $\lambda < 1$, and in implementations $\lambda < 0.5$.
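Merely as an illustrative sketch of the combination described above, the following Python/NumPy code forms the weighted combination of a patch query and its predicted value, L2-normalises it, and applies a linear projection. The placement of the normalisation before the linear layer, the function name, and the example value of the weighting parameter are assumptions for illustration only.

```python
import numpy as np

def local_contextualized(query, pred_value, linear_w, lam=0.5):
    """Combine a patch query q^i with its predicted value v^i from the memory.

    query, pred_value: (D,) vectors; linear_w: (D, D) weights standing in for the
    linear layer; lam: weighting parameter (< 1).
    """
    mix = query + lam * pred_value
    mix /= np.linalg.norm(mix) + 1e-8      # L2-normalise the weighted combination
    return mix @ linear_w                  # local contextualized representation c^i
```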
[096] In some implementations the contextualized representation 172 for a training image comprises an attention-pooled representation, ct, of a training image, determined as described above, although a concatenation of the local contextualized representations for each of the plurality of spatial locations in the training image can be used instead.
[097] The image processing system 500, in particular the image encoder 110, is trained using the contextualized representations 172 for each of the second plurality of the training images (step 618). In general the training also involves updating the learnable parameters of the value update neural network 116.
[098] In general training the image processing system can involve backpropagating gradients of an objective function, e.g. a loss function, to update learnable parameters, e.g. weights, of the system. For example the training can update the set of learnable parameters denoted 0, i.e. the parameters of the image encoder 110 and the value update neural network 116, and of the linear neural network layer 160 and/or attention neural network 170 where implemented. The updating can use any appropriate gradient descent optimization algorithm, e.g. Adam or another optimization algorithm.
[099] In some implementations the image processing system is trained using, i.e. based on the value of, a self-supervised learning objective function dependent on the contextualized representation 172 for a training image.
[0100] As one example the self-supervised learning objective function may comprise a contrastive learning objective function of the type used in SimCLR (Chen et al., arXiv:2002.05709, 2020), such as a temperature-scaled cross-entropy loss. As another example the self-supervised learning objective function may comprise an objective function
of the type used in BYOL (Grill et al., arXiv:2006.07733, 2020), which involves minimizing a least squares objective.
[0101] In general the training can use a self-predictive method that involves minimizing a self-supervised learning objective function representing a difference between two neural network outputs from two different augmented views of the same input. For such implementations a target neural network can define a learning target, where some or all of the parameters of the target neural network are obtained from a moving average, e.g. an exponential moving average, of parameters of an online neural network that is trained by backpropagation (implementing a stop gradient at the output of the target neural network). In this case the image processing system 500 with learnable parameters $\theta$ can be considered to be the online neural network, and a version of the image processing system 500 with the same architecture but with parameters $\theta'$ determined by a moving average of the parameters $\theta$ can be considered to be the target neural network. As an example the exponential moving average can be determined as $\theta' \leftarrow \mu\theta' + (1-\mu)\theta$ with $0 < \mu < 1$.
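Merely as an illustrative sketch of the exponential moving average update described above, the following Python code updates a dictionary of target-network parameters from the corresponding online-network parameters; the function name and the example value of $\mu$ are illustrative.

```python
def ema_update(target_params, online_params, mu=0.99):
    """Update target-network parameters as an exponential moving average.

    target_params, online_params: dicts of name -> array. Returns the updated
    target parameters; gradients never flow through this update (stop gradient).
    """
    return {name: mu * target_params[name] + (1.0 - mu) * online_params[name]
            for name in target_params}
```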
[0102] The training can also use a supervised learning objective function, as described further later.
[0103] FIG. 7 shows a training system 700 for training the image processing system 500 using the contextualized representation of a training image and a self-supervised learning objective function. The training system 700 of FIG. 7 may be implemented as one or more computer programs on one or more computers in one or more locations.
[0104] The training system 700 comprises a first, “target” version of the image processing system 500’, and a second, “online” version of the image processing system 500. The learnable parameters, e.g. weights, of the image processing system 500’ are obtained from a moving average, e.g. an exponential moving average, of the parameters of the image processing system 500. The image processing system 500’ is configured to process a first transformed view 702’ of a training image to generate the contextualized representation 172’ of the first transformed view. The image processing system 500 is configured to process a second transformed view 702 of a training image to generate the contextualized representation 172 of the second transformed view. The first and second transformed views of the training image can be obtained as described below.
[0105] The training system 700 can include a first neural network, e.g. comprising a first projection neural network 710’; and a second neural network, e.g. comprising a second projection neural network 710 and a prediction neural network 720. The first and second
neural networks are respectively configured to process the contextualized representations of the first and second transformed views, to generate respective first and second representations of the contextualized representations.
[0106] In more detail, in implementations the first and second projection neural networks 710’, 710 are configured to process the contextualized representations of the first and second transformed views respectively, to generate respective first and second projected representations 712’, 712 of the contextualized representations. The first and second projection neural networks 710’, 710 can be helpful in projecting the contextualized representations into a space that facilitates use of the self-supervised learning objective function. In some implementations the first and second projection neural networks can reduce a dimensionality of the contextualized representations.
[0107] The learnable parameters, e.g. weights, of the first projection neural network 710’ are obtained from a moving average, e.g. an exponential moving average, of the parameters of the second projection neural network 710.
[0108] The prediction neural network 720 is configured to process an input, e.g. the second projected representation 712, to generate a predicted representation 722 that is a prediction of a target generated using the image processing system 500’, e.g. a prediction of the first representation 712’. The above described second representation of the contextualized representations of the second transformed view comprises the predicted representation 722.
[0109] FIG. 8 is a flow diagram of an example process for training the image processing system 500 using the contextualized representation of a training image and a self-supervised learning objective function. The process of FIG. 8 may be implemented by one or more computers in one or more locations, e.g. by the training system 700 of FIG. 7.
[0110] At step 800 the process determines first and second transformed, e.g. “augmented”, versions of the training image 702’, 702. The transformed versions of the training data item may generally include, e.g., random crops or distortions of the training data item.
[0111] As some particular examples, transformed views of the training data item may be obtained by transformations including one or more of random cropping of the image; flipping the image; brightness, color, saturation, hue or contrast jittering; color dropping; brightness, saturation, hue or contrast adjusting; Gaussian blurring; and solarization. Random cropping may comprise selecting a random patch of the image, optionally after increasing a size of the image by a scale factor; optionally the patch may then be re-sized. Flipping the image or video may involve applying a horizontal or vertical flip to the image. Color jittering may comprise changing one or more of the brightness, contrast, saturation and
hue of some or all pixels of the image by a random offset. Color dropping may comprise converting the image or video to greyscale. Gaussian blurring may comprise applying a Gaussian blurring kernel to the image or video; other types of kernel may be used for other types of filtering. Solarization may comprise applying a solarizing color transform to the image; other color transforms may be used. Other transforms are possible such as rotation, or cutting out part of the image (e.g. by setting pixels of a random patch to a uniform value). [0112] The process determines the contextualized representation 172’ of the first transformed view of the training image, and processes this using the first neural network, i.e. the first projection neural network 710’, to generate the first representation 712’ (step 802). Similarly the contextualized representation of the second transformed view of the training image is determined and processed using the second neural network, i.e. the second projection neural network 710, to generate the second representation 712 (step 804, performed before or after step 802).
[0113] The image processing system 500 is trained using an objective function that depends on a metric (of similarity or of difference) that measures a difference between the first and second representations (step 806).
[0114] In implementations this involves updating learnable parameters of the second neural network using the objective function. The learnable parameters of the first neural network can be updated based on the corresponding parameters of the second neural network, e.g. using a moving average such as an exponential moving average of corresponding parameters of the second neural network.
[0115] In implementations the training includes processing the contextualized representation 172’ of the first transformed view using the first projection neural network 710’ to generate the first representation, which comprises the first projected representation 712’. The training can also include processing the contextualized representation 172 of the second transformed view using the second projection neural network 710 to generate the second projected representation 712, and processing the second projected representation 712 using the prediction neural network to generate the second representation comprising the predicted representation 722.
[0116] The objective function can depend on a metric (of similarity or of difference) that measures a difference between the first projected representation 712’ and the predicted representation 722.
[0117] Where the training system 700 includes projection neural networks the training can then also involve updating parameters of the second projection neural network 710 and the
prediction neural network 720, using the objective function. The learnable parameters of the first projection neural network 710’ can be updated based on the corresponding parameters of the second projection neural network 710, using a moving average.
[0118] In implementations processing the contextualized representation of the first transformed view using the first neural network, e.g. the first projection neural network can involve determining the above described attention-pooled representation from the contextualized representation of the first transformed view and processing the attention- pooled representation of the first transformed view using the first neural network. As previously described, determining the attention-pooled representation of the first transformed view can comprise applying an attention operation over the local contextualized representation 162 for each of the plurality of spatial locations in the first transformed view. [0119] Processing the contextualized representation of the second transformed view using the second neural network, e.g. the second projection neural network, can comprise determining the attention-pooled representation from the contextualized representation of the second transformed view and processing the attention-pooled representation of the second transformed view using the second neural network. Determining the attention-pooled representation of the second transformed view can comprise applying an attention operation over the local contextualized representation 162 for each of the plurality of spatial locations in the second transformed view.
[0120] As one particular example, the self-supervised learning objective function (loss function), ℒ_SSL, may be defined as:

$$\mathcal{L}_{SSL} = -\log \frac{\exp\big(\langle q_\theta(z_i),\, z'_j\rangle\big)}{\exp\big(\langle q_\theta(z_i),\, z'_j\rangle\big) + \sum_{k \neq j}\exp\big(\langle q_\theta(z_i),\, z'_k\rangle\big)}$$

Here, as above, c_i denotes the attention-pooled representation for transformed view i of a training image. The subscripts i and j refer to, respectively, the first and second transformed views of a particular training image; z_i = p_θ(c_i) denotes the second projected representation 712, p_θ(·) denotes the second projection neural network 710 (with learnable parameters θ), and q_θ(·) denotes the prediction neural network 720. The prime superscript denotes values that have been obtained from a neural network with learnable parameters that are a moving average of the corresponding parameters θ, so that, e.g., z′_j denotes the first projected representation 712′ for the second transformed view, j, of a particular training image. The sum is taken over different, i.e. contrasting, examples, e.g. over the first transformed views of other training images indexed by k (k ≠ j), e.g. from other training images in a batch of training images (which are assumed to be different). The self-supervised loss, ℒ_SSL, is a form of contrastive cross-entropy loss.
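To make the loss concrete, the following sketch evaluates a contrastive cross-entropy of this form for a single (i, j) pair; it assumes unit-normalized representations and no temperature term, and the function and array names are illustrative only.

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-8) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_ssl_loss(pred_i: np.ndarray, target_views: np.ndarray, j: int) -> float:
    """Cross-entropy of the predicted representation q_theta(z_i) against the
    moving-average-projected representations z'_k of all candidate views, with
    view j of the same training image as the positive."""
    pred_i = l2_normalize(pred_i)
    target_views = l2_normalize(target_views)
    logits = target_views @ pred_i                     # one similarity per candidate view
    log_probs = logits - np.log(np.exp(logits).sum())  # log softmax over candidates
    return float(-log_probs[j])

rng = np.random.default_rng(0)
pred = rng.normal(size=128)              # q_theta(z_i) from the prediction head
targets = rng.normal(size=(32, 128))     # z'_k for the candidate views in a batch
loss = contrastive_ssl_loss(pred, targets, j=7)
```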
[0121] In some implementations one or more of the training examples comprises the training image and an image label, y_i, for the training image, e.g. a classification label for a classification task. In general the image label y_i can have one or more dimensions. For each of the first plurality of the training images the image label may be stored in memory 140 with the average feature vector 114 for the training image and the corresponding value vector 118.
[0122] The training may then involve (for one or more of the above described second plurality of the training images) querying memory 140 using the contextualized representation of the training image to retrieve a predicted label for the training image.
[0123] The image processing system 500 can then be trained using a supervised objective function, ℒ_SL, to minimize an error between the image label for the training image, y_i, and the predicted label for the training image, ŷ_i. Merely as an example, the supervised objective function may comprise a cross-entropy loss, ℒ_CE(ŷ_i, y_i).
[0124] In some implementations querying the memory using the contextualized representation of the training image to retrieve the predicted label for the training image can involve determining an attention-pooled representation from the contextualized representation of the training image, and querying the memory using the attention-pooled representation. Determining the attention-pooled representation can involve applying an attention operation over the local contextualized representation for each of the plurality of spatial locations in the training image, e.g. as described above. Querying the memory using the attention-pooled representation may then comprise determining a value of a similarity function of the attention-pooled representation and each of a plurality of the average feature vector keys 114. The predicted label for the training image may be determined as a weighted combination of the image labels stored in the memory for the plurality of the average feature vector keys. Each of the image labels may be weighted according to the value of the similarity function for a respective one of the plurality of the average feature vector keys. As a particular example memory 140 can be queried using the attention-pooled representation, c_i, of the training image, using cross attention to determine the predicted label according to ŷ_i = CA(c_i, k_j, y_j), where CA(·) denotes a cross-attention operation with query c_i, keys k_j (the stored average feature vector keys), and values y_j (the stored image labels).
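As an illustrative sketch of this supervised readout, assuming softmax-weighted cross attention as the similarity function; `predict_label`, `cross_entropy`, and the one-hot stored-label format are assumptions, not the only possible choices.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def predict_label(c_i: np.ndarray, keys: np.ndarray, stored_labels: np.ndarray) -> np.ndarray:
    """Cross-attention readout: weight each stored image label by the similarity of
    the attention-pooled query c_i to the corresponding average-feature-vector key."""
    weights = softmax(keys @ c_i)             # similarity of c_i to each key
    return weights @ stored_labels            # weighted combination -> predicted label

def cross_entropy(y_hat: np.ndarray, y: np.ndarray, eps: float = 1e-9) -> float:
    return float(-(y * np.log(y_hat + eps)).sum())

rng = np.random.default_rng(0)
keys = rng.normal(size=(1000, 256))                       # average feature vector keys in memory
stored = np.eye(10)[rng.integers(0, 10, size=1000)]       # one-hot image labels stored with the keys
c = rng.normal(size=256)                                  # attention-pooled representation of a training image
y_true = np.eye(10)[3]
loss = cross_entropy(predict_label(c, keys, stored), y_true)
```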
[0125] In some implementations the self-supervised learning objective function, ℒ_SSL, and the supervised learning objective function, ℒ_SL, can be combined in a weighted combination. Where first and second transformed views of a training image are determined, the supervised objective function may comprise a sum of a loss determined from each of these views. For example a total loss may be determined as

$$\mathcal{L} = \mathcal{L}^{i}_{SSL} + \mathcal{L}^{j}_{SSL} + \alpha\big(\mathcal{L}^{i}_{CE} + \mathcal{L}^{j}_{CE}\big)$$

where α is a weight, e.g. α < 1.
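A one-line sketch of the weighted combination under the reconstruction above, with per-view losses assumed to have been computed already; `alpha` and the argument names are illustrative.

```python
def total_loss(ssl_i: float, ssl_j: float, ce_i: float, ce_j: float, alpha: float = 0.5) -> float:
    """Combine the per-view self-supervised losses with the per-view supervised
    cross-entropy losses, down-weighting the supervised term by alpha < 1."""
    return ssl_i + ssl_j + alpha * (ce_i + ce_j)

loss = total_loss(ssl_i=2.3, ssl_j=2.1, ce_i=0.7, ce_j=0.9, alpha=0.5)
```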
[0126] FIG. 9 illustrates, schematically, the operation of a particular example implementation of the above described contrastive training process.
[0127] First and second transformed (augmented) views 902a, b, of a training image 900 are generated, and each is processed by the image encoder 110 to generate a set of training image feature vectors 904a, b (query vectors 142, q_i), one for each spatial location or patch i of each augmented view of the training image. These are used to query the spatially averaged training image feature vectors and corresponding value vectors stored in memory 140 to determine a respective set of local contextualized representations 906a, b for the first and second views. A respective attention-pooled representation 908a, b is determined for each of the sets of local contextualized representations, and these are used to evaluate a self-supervised learning objective function that is used to train the system, in particular the image encoder 110.
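Purely as a schematic sketch of the flow in FIG. 9, the following toy code wires the pieces together with random stand-ins for the image encoder and the memory contents; the fixed 0.5/0.5 combination of query and retrieved value and all names are simplifying assumptions, not the claimed implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, LOCATIONS, MEM_SIZE = 64, 49, 100          # toy sizes: 7x7 grid, 100 memory slots

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_view(_view) -> np.ndarray:
    """Stand-in image encoder: one query vector per spatial location (patch)."""
    return rng.normal(size=(LOCATIONS, DIM))

def contextualize(queries, mem_keys, mem_values) -> np.ndarray:
    """Cross-attend each per-location query into the memory and combine it with the
    retrieved value vector to form local contextualized representations."""
    attn = softmax(queries @ mem_keys.T, axis=-1)
    retrieved = attn @ mem_values
    return 0.5 * queries + 0.5 * retrieved

def attention_pool(local_reps, w_attn) -> np.ndarray:
    weights = softmax(local_reps @ w_attn, axis=-1)
    return weights @ local_reps

mem_keys = rng.normal(size=(MEM_SIZE, DIM))      # spatially averaged feature vectors (keys)
mem_values = rng.normal(size=(MEM_SIZE, DIM))    # corresponding value vectors
w_attn = rng.normal(size=DIM)

view_a, view_b = encode_view("augmented view 1"), encode_view("augmented view 2")
c_a = attention_pool(contextualize(view_a, mem_keys, mem_values), w_attn)
c_b = attention_pool(contextualize(view_b, mem_keys, mem_values), w_attn)
# c_a and c_b would then feed the projection/prediction heads and the SSL loss above.
```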
[0128] FIG. 10 shows a graph illustrating adaptation of an implementation of the image processing system 100 to a semantic segmentation task. More particularly the graph shows the mean IoU (Intersection over Union) metric for an image segmentation task on the y-axis against adaptation time allowed for the model on the x-axis, for images from the PASCAL Visual Object Classes challenge dataset (Everingham et al., International Journal of Computer Vision 111(1): 98-136, 2015). The graph compares the adaptation times of an implementation of the image processing system 100, curve 1000 (varying the compute by varying the number of example images stored); a system comprising a frozen image encoder with a specially adapted head in which only the head is trained, curve 1002; and a system comprising an image encoder with a specially adapted head in which the entire system is trained end-to-end, curve 1004. It can be seen that the image processing system 100 requires less compute for equivalent performance or, put differently, that the image processing system 100 learns much faster than the other approaches; and that it can achieve better final performance than some other approaches.
[0129] An image processed by a system as described above, in particular the task image, an example image, or a training image, may be a monochrome or color, still or moving image (e.g. video), in 2D or in 3D. One advantage of the described techniques is that, in implementations, they can be fast enough to apply to real-time video at, say, 30fps.
[0130] Such an image may have been captured from a real-world environment, e.g. by a camera or other image sensor. A dense image processing task, in particular a dense prediction task performed by the system, may be a prediction task that relates to one or more real-world objects represented in the image.
[0131] As defined herein “image” includes a point cloud e.g. from a LIDAR system; and a “pixel” includes a point of the point cloud.
[0132] In general any type of dense prediction task may be performed by the image processing system 100, by giving examples of the image processing task to the system as a set of task examples. It will be appreciated that the above described (pre)training process does not rely on having examples of any particular image processing task that the trained image processing system might perform. For example the training process populates the memory 140 using spatially averaged training image feature vectors rather than explicitly attempting, say, to match up similar regions amongst the training images. The (pre)training does not make any assumption about the nature of the task(s) to be performed, and the trained image processing system can be adapted to perform a variety of tasks using in-context learning.
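For illustration only, a sketch of how the (pre)training memory might be populated from spatially averaged training image feature vectors; `image_encoder` and `value_update_net` are random stand-ins for the trained encoder and the value update neural network, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for the image encoder: a [H', W', dim] spatial feature map."""
    return rng.normal(size=(14, 14, 256))

def value_update_net(avg_feature: np.ndarray) -> np.ndarray:
    """Stand-in for the value update neural network."""
    return np.tanh(avg_feature)

memory_keys, memory_values = [], []
for image in (rng.normal(size=(224, 224, 3)) for _ in range(8)):   # first plurality of training images
    features = image_encoder(image)              # per-location training image feature vectors
    avg = features.mean(axis=(0, 1))             # spatial average defines the key
    memory_keys.append(avg)
    memory_values.append(value_update_net(avg))  # corresponding value vector
memory_keys = np.stack(memory_keys)              # [num_stored, dim]
memory_values = np.stack(memory_values)
```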
[0133] As some particular examples, the trained image processing system 100 may be used to perform an image processing task comprising one or more of: image segmentation, e.g. semantic segmentation or instance segmentation; depth prediction; keypoint prediction; pose estimation, e.g. 3D pose estimation; surface normal estimation, e.g. by determining a vector in 2D or 3D; or object detection, including object tracking. In general any dense prediction task may be performed. Many other types of task may be performed in the same way, e.g. a curvature or other shape estimation task, a task that involves identifying aspects of an image using color, and so forth.
[0134] As one example, in an image segmentation task the pixel labels for the task image pixels may each have a categorical value defining a category for the pixel, or a value representing a probability that the pixel belongs to a particular category. The category may represent an object or type of object or (for video) an action or type of action. For example in a semantic segmentation task the pixel labels may identify a type or category of object and in an instance segmentation task the pixel labels may (also) distinguish between different instances of the same category of object. Thus the set of pixel labels for the task image pixels can locate categories or instances of objects or actions in an image. More generally a pixel label can distinguish between an object (or action) and image background, and the set of pixel
labels for the task image pixels can, e.g., perform an object localization, detection, or tracking task, e.g. for gesture recognition.
[0135] Merely as some illustrative examples, object segmentation may be used to segment medical images, e.g. to label pixels of an input medical image in accordance with whether they show a region of a human or animal body in which a particular medical condition is present. An object segmentation may be used to provide an input to a control system of a mechanical agent, such as a robot or vehicle operating in a real-world environment. The detected objects may be, e.g., obstacles or paths upon which the mechanical agent can move, and may be used by the control system e.g. to make decisions on how to accomplish a task performed by the robot, or for controlling the direction or speed of movement of the agent.
[0136] As another example, in a (monocular) depth prediction task the pixel labels for the task image pixels may comprise a scalar value representing an estimated depth value for the pixel, e.g. a distance of the pixel (for an object) in a depth or z-direction from an x-y image plane or camera viewpoint. Or the pixel labels may each define a depth distribution, e.g. a probability distribution over discrete depth value buckets. The set of pixel labels for the task image pixels can define a depth map for the task image.
[0137] As another example, in a keypoint prediction task the pixel labels for the task image pixels may identify keypoints in the task image, e.g. by labelling a pixel as a keypoint or as one of multiple keypoints. The set of pixel labels for the task image pixels can thus label keypoints in the task image, e.g. one or more keypoints of an object in the image. For example keypoints may define landmarks of an object represented in the image, e.g. the positions of body joints for a human.
[0138] As another example, in a pose estimation task the pixel labels for the task image pixels may map the task image pixels to a 3D surface, e.g. of a human body or face. Or the pixel labels may estimate a 6D pose representing translation and orientation components of an object in the task image, e.g. in quaternion form. The set of pixel labels for the task image pixels can estimate the pose of one or more objects in the task image.
[0139] As another example, in a surface normal estimation task the pixel labels for the task image pixels may comprise a vector in, e.g., three dimensions defining the surface normal. The set of pixel labels for the task image pixels can provide a surface normal map for one or more objects in the task image, e.g. for use in an augmented reality or other application.
[0140] Implementations of the trained image processing system can also be used to perform image-level image processing tasks, such as classification, as previously described.
[0141] In principle the described techniques may be extended to audio signal processing, with time domain or time-frequency domain samples of an audio waveform used in place of image pixels, and using an audio signal encoder in place of an image encoder. The described techniques may be otherwise unchanged; the types of task performed may correspond to those described above, e.g. audio signal segmentation (semantic or instance), audio object detection (e.g. detecting particular sounds, or words, e.g. hotword detection), and so forth.
[0142] This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
[0143] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine- readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
[0144] The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution
environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
[0145] A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
[0146] In this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
[0147] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers. [0148] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both,
one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few. [0149] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
[0150] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
[0151] Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
[0152] Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework.
[0153] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an
app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
[0154] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
[0155] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
[0156] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0157] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
Claims
1. A computer-implemented method of performing a particular image processing task of a range of possible image processing tasks, the particular image processing task being defined by a set of task examples that demonstrate the particular image processing task, the method comprising: receiving a task image comprising a plurality of task image pixels for processing according to the particular image processing task; processing the task image using a trained image encoder to generate a spatial representation of the task image comprising a task image feature vector for each of a plurality of spatial locations in the task image, and for each of the plurality of spatial locations in the task image: obtaining a query vector for the spatial location from the task image feature vector corresponding to the spatial location, and applying the query vector to image feature vector keys and corresponding local label values in a memory, using a query-key-value attention mechanism, to obtain a predicted local label value for the spatial location as the query result, the memory storing example image feature vectors and corresponding local label values for the set of task examples, the example image feature vectors defining keys for accessing the corresponding local label values; and determining a set of pixel labels for the task image pixels from the predicted local label value for each of the spatial locations in the task image, wherein the particular image processing task is performed by obtaining the set of pixel labels for the task image pixels.
2. The method of claim 1, further comprising: receiving a set of task examples, each task example comprising an example image comprising a plurality of example image pixels and a set of pixel labels, each pixel label having a label value, wherein the combination of the example image and the set of pixel labels defines an example of the particular image processing task to be performed; and for each task example: processing the example image using the trained image encoder to generate the spatial representation of the example image comprising the example image feature vector for each of the plurality of spatial locations in the example image, and
obtaining a local label value for each of the spatial locations in the example image from the set of pixel labels; sampling a subset of the example image feature vectors and corresponding local label values; and storing the sampled subset in the memory.
3. The method of claim 2, comprising randomly sampling the subset of the example image feature vectors and corresponding local label values according to a sampling distribution.
4. The method of any one of claims 1-3, wherein obtaining the local label value for each of the spatial locations in the example image from the set of pixel labels comprises, for each of the spatial locations, averaging the pixel labels over a region of the example image corresponding to the spatial location; and wherein determining the set of pixel labels for the task image pixels from the predicted local label value comprises, for each of the spatial locations in the task image, upsampling the predicted local label value to a resolution of the task image.
5. The method of any one of claims 1-4, wherein applying the query vector to the image feature vector keys and local label values in the memory to obtain the predicted local label value for the spatial location comprises: identifying a set of k nearest neighbor image feature vector keys to the query vector according to a similarity metric that measures a similarity between a query vector and an image feature vector key.
6. The method of any one of claims 1-5, wherein applying the query vector to the image feature vector keys and local label values in the memory to obtain a predicted local label value for the spatial location comprises: determining a value of a similarity function of the query vector and each of a plurality of the image feature vector keys, wherein the similarity function defines a similarity metric that measures a similarity between a query vector and an image feature vector key; and determining the predicted local label value as the query result from a weighted sum of the local label values in the memory each weighted by the value of the similarity function for a corresponding image feature vector key.
7. The method of any one of claims 1-6, wherein the image processing task comprises a dense image processing task that involves assigning a value to each pixel of an image, and wherein the particular image processing task to be performed is defined by the set of task examples, the method further comprising configuring an image processing system to perform the particular image processing task by storing the example image feature vectors and corresponding local label values from the task examples in the memory without any adjustment of learnable parameters of the image processing system.
8. A computer-implemented method of training an image processing system to perform any particular image processing task of a range of possible image processing tasks, the image processing system comprising: an image encoder configured to process an image to generate a spatial representation of the image, the spatial representation comprising a feature vector for each of a plurality of spatial locations in the image; a memory configured to store keys and corresponding values; and a memory access subsystem configured to process a query vector by applying a query-key-value attention mechanism to the keys and values in the memory to obtain a query result; the method comprising: obtaining a set of training examples, each training example comprising at least a training image; and for each of a first plurality of the training images: processing the training image using the image encoder to generate the spatial representation of the training image comprising a training image feature vector for each of the plurality of spatial locations in the training image, averaging the training image feature vectors over the plurality of spatial locations in the training image to determine an average feature vector for the training image; processing the average feature vector using a value update neural network to generate a corresponding value vector for the training image; storing the average feature vector and the corresponding value vector in the memory, wherein the average feature vector defines a key for the corresponding value vector; and for each of a second plurality of the training images:
processing the training image using the image encoder to generate the spatial representation of the training image comprising a training image feature vector for each of the plurality of spatial locations in the training image, and for each of the plurality of spatial locations in the training image: obtaining a query vector for the spatial location from the training image feature vector corresponding to the spatial location, and applying the query vector to the average feature vector keys and value vectors in the memory using the memory access subsystem to obtain a predicted value vector for the spatial location as the query result; and determining a local contextualized representation for each of the plurality of spatial locations in the training image, the local contextualized representation for a spatial location in the training image comprising a combination of the query vector for the spatial location and the predicted value vector for the spatial location, and obtaining a contextualized representation for the training image from the local contextualized representation for each of the plurality of spatial locations in the training image; and training the image processing system using the contextualized representations for each of the second plurality of the training images.
9. The method of claim 8, wherein determining the local contextualized representation for a spatial location in the training image comprises determining a weighted combination of the query vector for the spatial location and the predicted value vector for the spatial location and processing the weighted combination using a linear neural network layer to generate the local contextualized representation.
10. The method of claim 8 or 9, wherein training the image processing system using the contextualized representations for each of the second plurality of the training images comprises training the image processing system using a self-supervised learning objective function dependent on the contextualized representation for a training image.
11. The method of any one of claims 8-10, wherein training the image processing system using the contextualized representation for a training image comprises: determining first and second transformed views of the training image;
determining the contextualized representation of the first transformed view of the training image, and processing the contextualized representation using a first neural network to generate a first representation; determining the contextualized representation of the second transformed view of the training image, and processing the contextualized representation using a second neural network to generate a second representation; and training the image processing system using an objective function that depends on a metric that measures a difference between the first and second representations.
12. The method of claim 11, wherein the first neural network comprises a first projection neural network, wherein the second neural network comprises a second projection neural network and a prediction neural network; the method further comprising: processing the contextualized representation of the first transformed view using the first projection neural network to generate the first representation, wherein the first representation comprises a first projected representation; and processing the contextualized representation of the second transformed view using the second projection neural network to generate a second projected representation, and processing the second projected representation using the prediction neural network to generate the second representation, wherein the second representation comprises a predicted representation; and wherein the objective function depends on a metric that measures a difference between the first projected representation and the predicted representation; and wherein the training includes updating parameters of the second neural network based on a value of the objective function and updating parameters of the first neural network based on the parameters of the second neural network.
13. The method of claim 11 or 12, wherein processing the contextualized representation of the first transformed view using the first neural network comprises: determining an attention-pooled representation from the contextualized representation of the first transformed view and processing the attention-pooled representation using the first neural network, wherein determining the attention-pooled representation comprises applying an attention operation over the local contextualized representation for each of the plurality of spatial locations in the first transformed view; and/or
wherein processing the contextualized representation of the second transformed view using the second neural network comprises: determining an attention-pooled representation from the contextualized representation of the second transformed view and processing the attention-pooled representation using the second neural network, wherein determining the attention-pooled representation comprises applying an attention operation over the local contextualized representation for each of the plurality of spatial locations in the second transformed view.
14. The method of any one of claims 8-13, wherein each training example comprises the training image and an image label for the training image; the method further comprising: for each of the first plurality of the training images, storing the image label in the memory with the average feature vector for the training image and the corresponding value vector; and for each of the second plurality of the training images: querying the memory using the contextualized representation of the training image to retrieve a predicted label for the training image; and training the image processing system to minimize an error between the image label for the training image and the predicted label for the training image.
15. The method of claim 14, wherein querying the memory using the contextualized representation of the training image to retrieve the predicted label for the training image comprises: determining an attention-pooled representation from the contextualized representation of the training image and querying the memory using the attention-pooled representation, wherein determining the attention-pooled representation comprises applying an attention operation over the local contextualized representation for each of the plurality of spatial locations in the training image.
16. The method of claim 13 or 15, wherein determining the attention-pooled representation by applying an attention operation over the local contextualized representation for each of the plurality of spatial locations comprises:
determining a soft mask for each of the plurality of spatial locations, where a value of the soft mask for each of the spatial locations defines a respective weight for the spatial location; and combining the local contextualized representation for each of the plurality of spatial locations weighted by the respective weight for the spatial location defined by the soft mask to obtain the attention-pooled representation.
17. The method of claim 16, wherein determining a soft mask for each of the plurality of spatial locations comprises processing the local contextualized representation for each of the plurality of spatial locations using an attention neural network to generate the respective weight for each of the spatial locations.
18. The method of any one of claims 15-17 when dependent on claim 15, wherein querying the memory using the attention-pooled representation comprises: determining a value of a similarity function of the attention-pooled representation and each of a plurality of the average feature vector keys; and determining the predicted label for the training image as a weighted combination of the image labels stored in the memory for the plurality of the average feature vector keys, each weighted according to the value of the similarity function for a respective one of the plurality of the average feature vector keys.
19. The method of any preceding claim, wherein the particular image processing task comprises one or more of: image segmentation, depth prediction, keypoint prediction, pose estimation, surface normal estimation, and object detection.
20. A method of configuring an image processing system including a trained image encoder to perform a particular image processing task of a range of possible image processing tasks, the method comprising: receiving a set of task examples that demonstrate the particular image processing task, each task example comprising an example image comprising a plurality of example image pixels and a set of pixel labels, each pixel label having a label value, wherein the combination of the example image and the set of pixel labels defines an example of the particular image processing task to be performed; and for each task example:
processing the example image using a trained image encoder to generate the spatial representation of the example image comprising an example image feature vector for each of a plurality of spatial locations in the example image, and obtaining a local label value for each of the spatial locations in the example image from the set of pixel labels; sampling a subset of the example image feature vectors and corresponding local label values; and storing the sampled subset in a memory of the image processing system.
21. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective method of any one of claims 1-20.
22. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations of the respective method of any one of claims 1-20.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363502802P | 2023-05-17 | 2023-05-17 | |
| US63/502,802 | 2023-05-17 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024236063A1 (en) | 2024-11-21 |
Family
ID=91184866
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2024/063423 (pending, published as WO2024236063A1) | Performing image processing tasks based on demonstration examples | 2023-05-17 | 2024-05-15 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024236063A1 (en) |
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210383171A1 (en) * | 2020-06-05 | 2021-12-09 | Adobe Inc. | Unified referring video object segmentation network |
Non-Patent Citations (4)
| Title |
|---|
| Dosovitskiy et al., arXiv:2010.11929 |
| Everingham et al., International Journal of Computer Vision, 111(1): 98-136, 2015 |
| Grill et al., arXiv:2006.07733 |
| He et al., "Deep residual learning for image recognition", Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016 |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119646250A (en) * | 2024-12-06 | 2025-03-18 | 齐鲁工业大学(山东省科学院) | A method for refining pseudo labels for person re-identification based on attention mechanism |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP3616130B1 (en) | Using simulation and domain adaptation for robotic control | |
| CN110730970B (en) | Methods and systems for optimizing strategy controllers | |
| JP7681762B2 (en) | Self-supervised representation learning using bootstrapped latent representations | |
| US11951622B2 (en) | Domain adaptation using simulation to simulation transfer | |
| US10943352B2 (en) | Object shape regression using wasserstein distance | |
| EP4172861B1 (en) | Semi-supervised keypoint based models | |
| US11163989B2 (en) | Action localization in images and videos using relational features | |
| JP7512416B2 (en) | A Cross-Transform Neural Network System for Few-Shot Similarity Determination and Classification | |
| US20250259068A1 (en) | Training object discovery neural networks and feature representation neural networks using self-supervised learning | |
| US11481649B2 (en) | Adapting a base classifier to novel classes | |
| WO2025019583A1 (en) | Training vision-language neural networks for real-world robot control | |
| US12488579B2 (en) | Aligning entities using neural networks | |
| WO2021146340A1 (en) | Systems and methods for stream recognition | |
| WO2024236063A1 (en) | Performing image processing tasks based on demonstration examples | |
| CN112200210B (en) | System and method for adapting a base classifier to a novel class | |
| US20240346824A1 (en) | Action localization in videos using learned queries | |
| CN112699759B (en) | Method and related device for training gender identification model | |
| WO2025259680A1 (en) | Generating data items based on a multidimensional reward model | |
| WO2025166250A1 (en) | Learning visual representations using self-attention and denoising | |
| CN119850905A (en) | Target identification method, device, equipment and storage medium | |
| CN119173879A (en) | Contrastive Learning with Positive Pseudo-Labels |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24726946; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024726946; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2024726946; Country of ref document: EP; Effective date: 20251112 |