MEDIA DATA ITEM CLASSIFICATION USING A GENERATIVE NEURAL NETWORK MODEL
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to U.S. Provisional Application No. 63/600,578, filed on November 17, 2023. The disclosure of the prior application is considered part of and is incorporated by reference in the disclosure of this application.
BACKGROUND
[0002] This specification relates to processing data using machine learning models.
[0003] Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.
[0004] Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.
SUMMARY
[0005] This specification describes a neural network system, implemented as computer programs on one or more computers in one or more locations, and a method, that select, from a set of candidate captions, an appropriate caption (the "most" appropriate caption, according to a definition of "appropriate" proposed here) to describe a given media data item. The system and method can be used even in a zero-shot case, i.e. when the most appropriate caption for the media item was not included in any of the training items (training examples) in the training data which was used to train the neural network system.
[0006] In general terms, the present disclosure proposes that a caption is chosen from a plurality of candidate captions using both a respective posterior probability of each candidate caption given the media data item, and a respective prior probability for each candidate caption.
[0007] A first aspect of the disclosure proposes, for example, a computer-implemented method for assigning one of a set of candidate captions to a media data item using a generative neural network. Each of the candidate captions may be a text comprising one or more "text tokens" selected from a vocabulary. The assigned candidate caption is a description of features of a media data item, e.g. where the media data item comprises an image, each candidate caption may describe (one or more) objects that are represented in the image. The method comprises using the generative neural network to process, for each candidate caption in the set of candidate captions, i) the respective candidate caption to generate an unconditioned value for the caption indicative of the likelihood (prior probability) of the respective candidate caption, and ii) the media data item and the respective candidate caption to generate a conditioned value indicative of a likelihood of the respective candidate caption describing the media data item (i.e. the posterior probability of the caption being correct, given the image). The most appropriate candidate caption (e.g. the candidate caption that most accurately describes the media data item) is assigned to the media data item based on both the generated likelihoods (i.e. both the prior and posterior probabilities).
[0008] A second aspect of the disclosure is a method of training a generative neural network using a training dataset which comprises a plurality of training items. Each training item comprises a media data item and a corresponding caption. The generative neural network processes the training items to predict corresponding indicative values for the posterior and prior probabilities of the caption. The predicted indicative values for the posterior and prior probabilities are then used to evaluate two loss functions (corresponding to the multimodal and unimodal manners of operation used during inference), and the model parameters of the generative neural network are then updated to reduce an objective function based on a function (e.g. sum) of the unimodal loss function and the multimodal loss function.
[0009] In a third aspect, the disclosure provides one or more computer storage media (tangible recording media) storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method of the first or second aspect.
[00010] In a fourth aspect, the disclosure provides a system comprising one or more computers and one or more storage devices communicatively coupled to the one or more computers. The one or more storage devices may store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective method of the first or second aspect.
[00011] The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages.
[00012] A captioning system is provided which is able to provide higher quality captioning, e.g. of still or moving images or sounds (samples of audio data), than known captioning systems, particularly in the case of zero-shot classification (i.e. without examples belonging to the most appropriate candidate caption being in the training data). In particular, the captioning system uses the information gain from unconditionally to conditionally generating the captions to reduce the text bias for classification, which is effective even in zero-shot classification tasks. This enables the captioning system to achieve a higher classification accuracy than some known captioning systems.
BRIEF DESCRIPTION OF THE DRAWINGS
[00013] Fig. 1 shows schematically an example method for choosing a caption for a media data item.
[00014] Fig. 2 is a flow diagram of an example method for choosing a caption for a media data item.
[00015] Fig. 3 is a diagram of a training method for a captioning system.
[00016] Fig. 4 is a flow diagram of a method of training a generative neural network.
[00017] Fig. 5 is a table showing quantitative results for two exemplary classifiers and two classifiers according to two comparative examples.
[00018] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
[00019] Classification is an important application of machine learning models. In a classification task, a set of class labels is defined and a machine learning model assigns the most appropriate label to a given media data item (such as an item of image data (still or moving image) or an item of sound data). Each class label corresponds to a category in which the media item falls. "Captioning" may be considered an example of classification in which there is a respective category for each possible caption, and each possible caption is a possible label for the media data item. The caption is a description of the media data item as a sequence of text tokens selected from a vocabulary (e.g. a natural language text sequence of tokens, such as letters, selected from an alphabet, or a computer language). A challenging classification task is "zero-shot" classification, which means that the machine learning model can assign, during inference, labels to samples belonging to classes that were not present in the training data used to produce the machine learning model.
[00020] Some captioning systems, such as CLIP (A. Radford et al., "Learning transferable visual models from natural language supervision", in International Conference on Machine Learning, pp. 8748-8763, 2021), use a text encoder to encode possible captions, and a media data item encoder (e.g. an image encoder) to encode the image, and choose the caption for which the respective encoded caption is most similar to the encoded image.
[00021] Alternatively, a generative machine learning model may also be used for such a task, i.e. a model which has been trained using a generative training objective to model a joint probability of media data items and captions describing the media data items (i.e. not necessarily the particular captions forming the labels for the classification). Such a generative machine learning model may be converted into a zero-shot classifier by using a maximum likelihood estimation approach, e.g. by using the model to predict the posterior probability (or "conditioned likelihood") for each of the (unseen) labels (possible captions) given a particular media data item and selecting the label with the highest posterior likelihood as the most appropriate label. However, it has been realized that the performance of such a generative zero-shot classifier is typically poor because the predictions of generative models for modeling a joint probability of media data items and captions are often strongly biased by the learned caption likelihood (i.e. the posterior probability predicted by the model is biased by the likelihood (prior probability) of the captions in the language domain and insufficiently grounded on the media data item inputs).
[00022] This specification describes a system, implemented as computer programs on one or more computers in one or more locations, and a method that select, from a set of candidate captions, the most "appropriate" caption to describe a given media data item (even in a zero-shot approach, i.e. without examples belonging to the most appropriate candidate caption being in the training data).
[00023] According to a first aspect, there is provided a computer-implemented method for assigning one of a set of candidate captions to a media data item using a generative neural network. For example, the media data item may comprise a (still or moving) image (e.g. an (RGB) image captured from the real world by a camera, a medical X-ray image, a frame of a video stream or the like; that is, pixel-level data) and/or sound data (e.g. audio or speech signals recorded from the real world using a microphone). Each of the candidate captions may be a text, e.g. a computer language or a natural language text sequence, comprising one or more "text tokens" (letters and/or words or parts of words) selected from a vocabulary. More specifically, each candidate caption is a description of features of a media data item (correctly) described by the candidate caption. In one example, where the media data item is an image, each candidate caption may describe (one or more) objects that are represented in the image.
[00024] The generative neural network may be a neural network model (having a plurality of model parameters) which has been trained using a generative training objective to model a joint probability of media data items and captions describing the media data items. The generative neural network may be configured to have two "manners of operation" (operating modes), a "multimodal" manner of operation and a "unimodal" manner of operation. When the network input to the generative neural network comprises both a media data item and a caption, the generative neural network processes the network input according to the multimodal manner of operation. In the multimodal manner of operation, the generative neural network processes the media data item and the input caption to predict how likely the input caption is given the media data item, i.e. the generative neural network generates (or predicts) a conditioned value for the media data item and the caption which is indicative of a likelihood of the caption describing the media data item, i.e. the posterior probability of the input caption given the media data item. When the network input to the generative neural network comprises only a caption (i.e. no media data item; in this case the input information is provided in only one modality, namely as text input), the generative neural network processes the network input according to the unimodal manner of operation. In the unimodal manner of operation, the generative neural network processes the input caption to generate (or predict) an unconditioned value for the caption indicative of a likelihood of the caption, i.e. the prior probability of the input caption.
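By way of illustration only, the two manners of operation described in paragraph [00024] can be summarised as a small interface sketch in Python. The class and method names (GenerativeCaptionScorer, conditioned_logprob, unconditioned_logprob) are illustrative assumptions introduced here, not names used by the specification or by any particular library.

    # Minimal sketch of the two operating modes, assuming captions are token sequences.
    from typing import Protocol, Sequence

    class GenerativeCaptionScorer(Protocol):
        def conditioned_logprob(self, media_item: object, caption_tokens: Sequence[str]) -> float:
            """Multimodal manner of operation: returns log P(caption | media data item)."""
            ...

        def unconditioned_logprob(self, caption_tokens: Sequence[str]) -> float:
            """Unimodal manner of operation: returns log P(caption)."""
            ...

Any concrete model exposing these two entry points could be plugged into the selection and training sketches given later in this description.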
[00025] The method comprises using the generative neural network to process, for each candidate caption in the set of candidate captions, i) the respective candidate caption to generate an unconditioned value for the caption indicative of the likelihood (prior probability) of the respective candidate caption, and ii) the media data item and the respective candidate caption to generate a conditioned value indicative of a likelihood of the respective candidate caption describing the media data item (i.e. the posterior probability of the caption being correct, given the image). In broad terms, the amount of discrepancy between the predicted priors and corresponding posteriors is a measure of the amount of information that the model gained by the provision of the input image. For example, the predicted prior and posterior probabilities allow determining the pointwise mutual information for the input image and each candidate caption. This, in turn, means that the predicted prior and posterior probabilities can be used to reduce the aforementioned caption bias of the network. Thus, the most appropriate candidate caption (e.g. the candidate caption that most accurately describes the media data item) is assigned to the media data item based on both the generated likelihoods (i.e. both the prior and posterior probabilities of each candidate caption).
[00026] As noted above, the method uses the information gain from unconditionally to conditionally generating the captions with the aim of reducing the text bias for zero-shot classification tasks. Many possibilities exist for using the generated likelihoods (prior and posterior probabilities) to reduce the caption bias of the network. In one possibility, for each candidate caption, a difference measure is derived based on the log-likelihood of the generated posterior probability and the log-likelihood of the generated prior probability of the respective candidate caption. In this case, the most appropriate candidate caption may be assigned to the media data item based on the derived difference measures. Typically, the assigned candidate caption is the one for which the difference measure is highest (but this can be different depending on implementation details).
[00027] The difference measure between the log-likelihoods of the posterior and prior probabilities for a particular candidate caption T_i may satisfy

    log P(T_i | x) - α log P(T_i)

where x is the media data item, log P(T_i | x) is the generated conditioned value, log P(T_i) is the generated unconditioned value, and α is a factor with a value between zero and one (typically the value of α is larger than zero and no greater than one, i.e. selected from the half-open interval (0, 1]). Thus, in one example, i.e. when α = 1, the difference measure is the difference between the log-likelihood of the posterior probability and the corresponding log-likelihood of the prior probability. In broad terms, the factor α adjusts the degree of the text bias removal, and its numerical value may be selected (or optimised) based on the application. In some implementations, α has a numerical value between 0.7 and 0.9.
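As a concrete illustration of the selection rule in paragraph [00027], the following is a minimal Python sketch. The function name select_caption, the toy log-likelihood values and the choice α = 0.8 are illustrative assumptions, not values taken from the specification.

    def select_caption(cond_logps, uncond_logps, alpha=0.8):
        # cond_logps[i]   -- generated conditioned value log P(T_i | x)
        # uncond_logps[i] -- generated unconditioned value log P(T_i)
        # alpha           -- bias-removal factor from the half-open interval (0, 1]
        scores = [c - alpha * u for c, u in zip(cond_logps, uncond_logps)]
        best = max(range(len(scores)), key=scores.__getitem__)
        return best, scores

    # Toy numbers only: three candidate captions.
    best, scores = select_caption(
        cond_logps=[-12.1, -9.4, -15.0],
        uncond_logps=[-10.0, -11.3, -14.2],
        alpha=0.8)
    print(best, scores)   # index of the assigned candidate caption, plus all scores

Subtracting α log P(T_i) penalises captions that the model would have produced with high probability even without seeing the media data item, which is the text-bias removal described above.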
[00028] In general, any known and suitable architecture may be used to implement the generative neural network. In one implementation, the generative neural network may comprise a media data item encoder coupled to a text decoder. The media data item encoder receives the media data item of the network input and processes the media data item to generate an encoded media data item, e.g. a vector in a latent space of the network (the vector may comprise a plurality of numerical values). The encoded media data item is typically lower-dimensional than the media data item. In implementations where the media data items comprise images, the media data item encoder (which processes pixel-level data from the image(s)) may comprise a (conventional) image encoder, e.g. a vision transformer model. In implementations where the media data items comprise sound, the media data item encoder may process audio samples of the sound, e.g. captured with a microphone. The text decoder may receive the encoded media data item from the media data item encoder and the caption of the network input. The text decoder may be configured to have a multimodal manner of operation and a unimodal manner of operation. In the multimodal manner of operation, the text decoder processes the encoded media data item (provided by the media data item encoder) and the caption of the network input to generate the conditioned value indicative of the likelihood of the caption describing the media data item. In the unimodal manner of operation, the text decoder processes the caption of the network input to generate the unconditioned value indicative of the (prior) likelihood of the caption.
[00029] An example process 100 carried out by a media captioning system is illustrated in Fig. 1. As shown, the media captioning system receives a media data item 101 (in this example, an image, denoted x), and the image 101 is encoded by a media data item encoder 103 (in this case an image encoder) to produce an encoded media data item (denoted x̂).
[00030] The media captioning system is also operative to receive, at different times, a plurality of candidate captions 102, labelled T_1, T_2, T_3, and so on. The candidate captions may be generated by a known candidate caption generation system (e.g. one used in one of the known captioning systems referred to elsewhere in this document, e.g. the one used in the CLIP system); alternatively, see below for a discussion of how the candidate captions can be generated using outputs of the text decoder 104. The candidate text captions may comprise text in one or more natural languages, or text in a computer language, or both. The computer language may be any formal language used to communicate with a computer, e.g. a markup language, or a command or configuration language, or a data exchange language such as JSON, or a programming language. The text captions may alternatively be a series of encoded characters, e.g. UTF-8 encoded characters; such "characters" can include Chinese and other similar characters, as well as logograms, syllabograms and the like. Labelling the captions by an integer index variable i, each caption T_i is composed of N_i text tokens (the "token length" of the i-th caption), denoted t_n^i, where n is an integer index variable in the range 1, ..., N_i.
[00031] The media captioning system also comprises a text decoder 104 which, at different times, receives each candidate caption, and generates a corresponding output. For a given candidate caption, the process may be performed autoregressively, e.g. in each of a number of time steps N_i labelled by the variable n, the text decoder 104 receives the tokens t_{1:n-1}^i, and generates a corresponding n-th output. Note that for n = 1, the text decoder may receive a default or randomly-set input, e.g. denoted [s] (which can be considered as t_{1:0}^i).
[00032] The text decoder 104 may be configured to be operated in two manners of operation.
[00033] In a first "multimodal" manner of operation, the text decoder 104 is configured to receive in each time step the encoded media data item (in the case that the media data item is an image, this is a "visual input" to the text decoder, based on x̂), and to process the tokens t_{1:n-1}^i to predict (i.e. generate an estimated value for) P(t_n^i | t_{1:n-1}^i, x), which is the likelihood of the text token t_n^i given the preceding n-1 text tokens t_{1:n-1}^i (i.e. t_1^i, t_2^i, ..., t_{n-1}^i) and the image x. P(t_1^i | t_{1:0}^i, x) is a likelihood for the first text token t_1^i given the image x. This process is carried out for each of n = 1, ..., N_i time steps, to obtain respective values P(t_n^i | t_{1:n-1}^i, x), from which the conditioned value

    log P(T_i | x) = Σ_{n=1}^{N_i} log P(t_n^i | t_{1:n-1}^i, x)

is obtained.
[00034] In a second "unimodal" manner of operation, the text decoder 104 is configured to process the tokens t_{1:n-1}^i to predict (i.e. generate an estimated value for) P(t_n^i | t_{1:n-1}^i), which is the "a priori" likelihood of the text token t_n^i given the preceding n-1 text tokens t_{1:n-1}^i (i.e. without knowledge of the image x). P(t_1^i | t_{1:0}^i) is an a priori likelihood for the first text token t_1^i. Note that, in the unimodal manner of operation, the text decoder 104 does not receive an input based on the media data item 101. It may instead optionally receive a default visual input ∅ which is not informative about the media data item 101. This process is carried out for each of n = 1, ..., N_i time steps to obtain N_i respective values P(t_n^i | t_{1:n-1}^i), from which the unconditioned value

    log P(T_i) = Σ_{n=1}^{N_i} log P(t_n^i | t_{1:n-1}^i)

is obtained.
[00035] This process is carried out for each of the candidate captions. A candidate caption is then selected for which log P(T_i | x) - α log P(T_i) is highest. Here α is a variable taking a positive real value, e.g. in the range 0 to 1. The value α may be chosen (e.g. by trial and error) depending upon the technical application, e.g. to maximise a measure of captioning quality.
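The token-level sums of paragraphs [00033] and [00034] can be illustrated with the following Python sketch. The helper names and the toy stand-in decoder are assumptions made here for illustration; a real implementation would call the text decoder 104 in its two manners of operation instead of toy_decoder.

    import math

    def caption_logprob(tokens, next_token_logprobs, image=None):
        # Sums per-token log-likelihoods to obtain log P(T) (image=None, unimodal
        # manner of operation) or log P(T | x) (multimodal manner of operation).
        total = 0.0
        for n, tok in enumerate(tokens):
            prefix = tuple(tokens[:n])   # t_{1:n-1}; the empty prefix plays the role of [s]
            total += next_token_logprobs(prefix, image)[tok]
        return total

    def ig_score(tokens, next_token_logprobs, image, alpha=0.8):
        cond = caption_logprob(tokens, next_token_logprobs, image)    # log P(T | x)
        uncond = caption_logprob(tokens, next_token_logprobs, None)   # log P(T)
        return cond - alpha * uncond

    # Toy stand-in for the text decoder: a distribution over a four-word vocabulary
    # that becomes slightly sharper for words present in a toy image annotation.
    VOCAB = ["a", "dog", "cat", "photo"]
    def toy_decoder(prefix, image):
        weights = {w: (2.0 if image is not None and w in image else 1.0) for w in VOCAB}
        z = sum(weights.values())
        return {w: math.log(v / z) for w, v in weights.items()}

    print(ig_score(["a", "dog"], toy_decoder, image={"dog"}))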
[00036] While, as shown in Fig. 1, the media captioning system may comprise a media data item encoder coupled to a single text decoder having the multimodal and the unimodal manner of operation, in other implementations the generative neural network comprises a first neural network configured to generate the posterior probabilities (i.e. the conditioned values) and a second neural network configured to generate the prior probabilities (i.e. the unconditioned values). For example, the first neural network may comprise the media data item encoder coupled to a first text decoder having the multimodal manner of operation, and the second neural network may comprise a second text decoder having the unimodal manner of operation.
[00037] Furthermore, in both forms of the media captioning system, the candidate captions may be generated during the performance of the method 100. As noted, a first input [s] (which may be denoted t_{1:0}^i) to the text decoder 104 may be chosen randomly or as a default value, and used to generate t_1^i based on P(t_1^i | t_{1:0}^i, x) (and optionally also P(t_1^i | t_{1:0}^i)), for example as the value of t_1^i which maximises P(t_1^i | t_{1:0}^i, x), or as a text token selected randomly from a probability distribution over the possible values of t_1^i based on P(t_1^i | t_{1:0}^i, x). Each subsequent text token t_n^i may be generated based on P(t_n^i | t_{1:n-1}^i, x) (and optionally P(t_n^i | t_{1:n-1}^i)), for example as the value of t_n^i which maximises P(t_n^i | t_{1:n-1}^i, x), or as a text token selected randomly from a probability distribution over the possible values of t_n^i based on P(t_n^i | t_{1:n-1}^i, x).
[00038] Fig. 2 shows a method 200 of which the method 100 of Fig. 1 is an example. The method 200 can be carried out by one or more computers in one or more locations, to assign one of a set of candidate captions to a media data item.
[00039] In step 201, a generative model (such as the generative model shown in Fig. 1 comprising the image encoder 103 and the text decoder 104) is used to process, for each of a set of candidate captions {T_i}, the candidate caption T_i to generate an unconditioned value (e.g. log P(T_i)) indicative of the likelihood of the candidate caption T_i.
[00040] In step 202, the generative model is used to process, for each of the set of candidate captions, the candidate caption T_i and the media data item (e.g. image x), to generate a conditioned value (e.g. log P(T_i | x)) indicative of the likelihood of the candidate caption T_i correctly describing the media data item (e.g. image x).
[00041] In step 203, one of the candidate captions {T_i} is chosen based on the generated conditioned values (e.g. log P(T_i | x)) and the generated unconditioned values (e.g. log P(T_i)), e.g. as the T_i which maximises log P(T_i | x) - α log P(T_i), or some other function which increases in magnitude as log P(T_i | x) and log P(T_i) diverge.
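Steps 201 to 203 of the method 200 can be combined into a single routine, sketched below under the interface assumption introduced earlier (a scorer object exposing conditioned_logprob and unconditioned_logprob); the function name assign_caption is an illustrative choice, not part of the specification.

    def assign_caption(scorer, media_item, candidate_captions, alpha=0.8):
        # Score every candidate caption and return the one with the highest
        # debiased score.
        scores = []
        for caption in candidate_captions:
            uncond = scorer.unconditioned_logprob(caption)            # step 201: log P(T_i)
            cond = scorer.conditioned_logprob(media_item, caption)    # step 202: log P(T_i | x)
            scores.append(cond - alpha * uncond)                      # step 203: difference measure
        best = max(range(len(scores)), key=scores.__getitem__)
        return candidate_captions[best]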
[00042] According to a second aspect of the disclosure, there is provided a method of training the generative neural network. The generative neural network is trained using a training dataset which comprises a plurality of training items. Each training item comprises a media data item and a corresponding caption. The generative neural network processes the training items to predict the corresponding indicative values for the posterior and prior probabilities (i.e. to predict, for each training item, the unconditioned value indicative of a likelihood of the caption and the conditioned value indicative of a likelihood of the caption describing the media data item). The predicted indicative values for the posterior and prior probabilities are then used to evaluate two loss functions corresponding to the multimodal and unimodal manners of operation used during inference. More specifically, a unimodal loss function is evaluated based on the generated prior probabilities (i.e. based on the generated unconditioned values), and a multimodal loss function is evaluated based on the generated posterior probabilities (i.e. based on the generated conditioned values). The model parameters of the generative neural network are then updated to reduce an objective function based on a sum of the unimodal loss function and the multimodal loss function. Conventional methods (e.g. gradient descent algorithms) may be used to find appropriate parameter updates based on the sum of the unimodal loss function and the multimodal loss function. By updating the network parameters based on the unimodal and multimodal loss functions, the performance of the trained network is improved compared to (conventional) training using only the multimodal loss function.
[00043] As noted above, the captions typically comprise a sequence of one or more text tokens. The generated indicative values for the posterior and prior probabilities (i.e. the unconditioned value indicative of a likelihood of the caption and the conditioned value indicative of a likelihood of the caption describing the media data item) may each be generated, by the generative neural network, as a product of factors, where each factor is indicative of a likelihood of autoregressively predicting a corresponding token. Thus, in this case, the unimodal and the multimodal loss functions can be used to train the generative neural network to predict the words of the caption in an autoregressive way.
[00044] The method of training the generative neural network may optionally take as a starting point a pre-trained text decoder model. That is, the method of training the generative neural network can be used to fine-tune the text decoder model. Alternatively, the text decoder model may initially be defined by random, or default, parameters.
[00045] A possible system for performing the training method is illustrated in Fig. 3. This system is operative to train the text decoder 104 of the generative model of Fig. 1. Fig. 3 assumes, for simplicity, that the media data items are images, but media data items of any of the other formats discussed here may be used instead. It is assumed that the image encoder 103 of the generative model is "frozen". The image encoder 103 may be one obtained by any known method for training an image encoder of a known visual language model, such as the examples given below.
[00046] The system of Fig. 3 employs a training database 302 of images {x_i} and corresponding correct captions {T_i}. Note that in the case of Fig. 3, the integer index variable i labels the images in the database (rather than candidate captions as in Fig. 1). Each caption T_i is composed of N_i text tokens (the "token length" of the i-th caption), denoted t_n^i, where n is an integer index variable in the range 1, ..., N_i.
[00047] The image encoder 103 is used to modify the training database to replace the images {x_i} by corresponding encoded images {x̂_i}, and thus form the training database 303.
[00048] A training engine 300 comprises a text decoder input formation unit 304. This selects a batch of B training items from the training database 303.
[00049] For each training item (x̂_i, T_i), the input formation unit causes the text decoder 104 to process the caption T_i auto-regressively in each of two manners of operation.
[00050] In the first manner of operation, the text decoder 104 receives no visual input, or a default visual input ∅ which is not informative about x_i. In a first time step, the text decoder 104 processes an input denoted t_{1:0}^i (e.g. set randomly or as a default) to generate P(t_1^i | t_{1:0}^i). In each of the subsequent time steps n = 2, ..., N_i, the text decoder 104 processes an input t_{1:n-1}^i to generate P(t_n^i | t_{1:n-1}^i).
[00051] In the second manner of operation, the text decoder 104 receives the visual input x̂_i. In a first time step, the text decoder 104 also receives an input denoted t_{1:0}^i (e.g. set randomly or as a default), and generates P(t_1^i | t_{1:0}^i, x_i). In each of the subsequent time steps n = 2, ..., N_i, the text decoder 104 processes an input t_{1:n-1}^i to generate P(t_n^i | t_{1:n-1}^i, x_i).
[00052] A loss calculation unit 305 of the training engine calculates an objective function L which is a function (e.g. weighted sum) of a unimodal loss function based on the values P(t_n^i | t_{1:n-1}^i), and a multimodal loss function based on the values P(t_n^i | t_{1:n-1}^i, x_i).
[00053] In one implementation, the unimodal loss function may satisfy

    L_unimodal = -(1/B) Σ_{i=1}^{B} Σ_{n=1}^{N_i} log P(t_n^i | t_{1:n-1}^i)

where i is an index of a training item of a training batch, B is a training batch size, N_i is a text token length of the caption of the i-th training item, t_n^i is the n-th token of the caption of the i-th training item and log P(t_n^i | t_{1:n-1}^i) is the log-likelihood of the token t_n^i given the tokens t_{1:n-1}^i.
[00054] In one implementation, the multimodal loss function satisfies

    L_multimodal = -(1/B) Σ_{i=1}^{B} Σ_{n=1}^{N_i} log P(t_n^i | t_{1:n-1}^i, x_i)

where x_i is the media data item of the i-th training item, and log P(t_n^i | t_{1:n-1}^i, x_i) is the log-likelihood of the token t_n^i given the tokens t_{1:n-1}^i and the media data item x_i.
[00055] The objective function L may satisfy:

    L = λ_1 L_multimodal + λ_2 L_unimodal

where λ_1 and λ_2 are weights of the objective function. While the numerical values of λ_1 and λ_2 may be application specific, λ_1 may be larger than λ_2. In broad terms, this is because the text decoder is supervised by both loss functions while the media data item encoder is only supervised by the multimodal loss function. In some implementations, λ_1 is equal to 1.5 and λ_2 is equal to 0.5.
[00056] This process is repeated iteratively, i.e. the model parameters defining the text decoder 104 are iteratively updated based on a plurality of batches of the training items. This is carried out until a termination criterion is reached. For example, the termination criterion may be that there is an iteration in which the value of the multimodal loss function is reduced by an amount which is below a threshold. Alternatively, the termination criterion may be that a predetermined number of iterations has been carried out, or that the iterative process has been running on the computer system which performs it for a predetermined amount of time.
[00057] Fig. 4 shows a method 400 of training a generative neural network having a plurality of model parameters. The method 400 can be carried out by one or more computers in one or more locations. For example, the system 300 of Fig. 3, suitably programmed, can perform the method 400.
[00058] In a step 401, a plurality of training items (training examples) are provided. For example, these may be a batch of training items from the training database 301. Each training item comprises a media data item and a corresponding caption.
[00059] In step 402, the generative neural network processes, for each training item (e.g. (x_i, T_i)), only the caption of the training item T_i to generate a respective unconditioned value (e.g. log P(T_i)) indicative of a likelihood of the caption.
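The combination of the two losses in paragraphs [00053] to [00055] can be illustrated with the short Python sketch below. It assumes that the per-token log-likelihoods of the two passes (unimodal and multimodal) have already been computed by the text decoder; the function name batch_loss and the toy numbers are illustrative assumptions only.

    def batch_loss(uncond_token_logps, cond_token_logps, lam1=1.5, lam2=0.5):
        # uncond_token_logps[i][n] -- log P(t_n^i | t_{1:n-1}^i)        (unimodal pass)
        # cond_token_logps[i][n]   -- log P(t_n^i | t_{1:n-1}^i, x_i)   (multimodal pass)
        # Returns the weighted objective L = lam1 * L_multimodal + lam2 * L_unimodal.
        B = len(uncond_token_logps)
        l_uni = -sum(sum(caption) for caption in uncond_token_logps) / B
        l_multi = -sum(sum(caption) for caption in cond_token_logps) / B
        return lam1 * l_multi + lam2 * l_uni

    # Toy batch of two captions (token-level log-likelihoods are made up).
    print(batch_loss(
        uncond_token_logps=[[-2.3, -1.9, -2.8], [-1.7, -2.2]],
        cond_token_logps=[[-1.1, -0.9, -1.5], [-0.8, -1.0]]))

In a full training loop, the value returned by such a function would be reduced by updating the text decoder parameters, e.g. by gradient descent, while the (frozen) media data item encoder is left unchanged.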
[00060] In step 403, a unimodal loss function (e.g. L_unimodal) is evaluated based on the generated unconditioned values.
[00061] In step 404, the generative neural network processes, for each training item, the media data item and the caption of the training item to generate a respective conditioned value (e.g. log P(T_i | x_i)) indicative of a likelihood of the caption describing the media data item.
[00062] As explained above with reference to Fig. 3, the processing of the media data item can be carried out in two steps: first by processing the media data item of the training item by a media data item encoder (e.g. the image encoder 103), and then by processing the caption of the training item by the text decoder 104 conditioned on the encoded media data item.
[00063] Also, as explained above, in the case that the media data item encoder is frozen during the process 400 (as it is during training by the system 300), since the same training item may be used in multiple batches (e.g. as part of different updates to the generative model), the processing of the media data item of the training item by the media data item encoder can be carried out in advance (e.g. to generate the training database 303), e.g. before the selection of each batch of training items. This reduces the computational operations required to encode the media data item.
[00064] In step 405, a multimodal loss function (e.g. L_multimodal) is evaluated based on the generated conditioned values.
[00065] In step 406, current values of the model parameters of the generative neural network (e.g. just a text decoder part of the generative neural network) are updated to reduce an objective function, e.g. L, based on a (weighted) sum of the unimodal loss function and the multimodal loss function.
[00066] The process 400 may be performed repeatedly, selecting on each occasion a different set (and possibly a different number B) of training items from the training database 303. Thus, the text decoder 104 is iteratively trained.
[00067] Turning to Fig. 5, experimental results are shown comparing the performance of examples of the present disclosure to two known classifiers: (a) a captioner using the CLIP classifier referenced above (which uses a text encoder and an image encoder, rather than an image encoder and a text decoder as in a captioner based on a generative model); and (b) a known generative captioner ("captioner"), in which captions are chosen for which the corresponding log P(T_i | x) is highest. The two examples of the present disclosure are shown for comparison: (c) the known generative captioner of (b) but selecting each caption as the one which maximises log P(T_i | x) - α log P(T_i) ("captioner + IG eval"); and (d) a model trained according to Fig. 4, using an objective function L based on a sum of the unimodal loss function and the multimodal loss function, and selecting captions as the one which maximises log P(T_i | x) - α log P(T_i). This is referred to as "IG captioner" in Fig. 5.
[00068] The models used in all four examples substantially consist of standard transformer blocks. The number of layers for the image encoder was the same for all models. The text encoder for example (a) had the same number of layers as the text decoder of examples (b)-(d). Note that, for (c), the standard captioner does not directly predict log P(T_i), so these values for the experiment were obtained using a zero-intensity image (denoted 0) as the input image, i.e. based on log P(T_i | 0).
[00069] All models were trained on the LAION-5B dataset (C. Schuhmann et al., "An open large-scale dataset for training next generation image-text models", 2022). The horizontal axis of Fig. 5 shows the zero-shot top-1 ImageNet classification accuracy. Solid lines show results using a ViT-B model as the image encoder, while hashed lines show results using a ViT-L model as the image encoder. As shown (comparing lines (b) and (c)), using an information gain (IG) factor increased the classification accuracy by 9.0%/7.9%. Using also the training procedure of Fig. 4 (i.e. using example (d) rather than example (c)) further improved the performance by 10.75/10.2, and resulted in a performance superior to (a).
[00070] Applications of the present technique are now described. The method described above for assigning a candidate caption to a media data item may be employed by a media captioning system, for processing a media data item which is a still or moving (video) image, or a sound sample (i.e. audio data representing sound amplitudes at respective times). In general, if the candidate caption comprises text, this may be provided as speech representing the text, e.g. for the benefit of an unsighted person who would not be able to read the caption as text.
[00071] An image (that is, an item of image data) comprises at least one numerical value for each pixel of an array having at least two dimensions. It may be a still image (e.g. with three color values for each pixel) or a moving image (i.e. a sequence of image frames, with one or more numerical values for each pixel for each image frame). In some implementations a still or moving (video) image processed by the visual encoder neural network, either during or after training, or both, may be an image that has been captured by a camera, i.e., that has been captured from the real world. Elements of the image data may comprise monochrome or color pixels of the image or video. The image may be a 2D or 3D image. As defined herein, an "image" includes a point cloud, e.g. from a LIDAR system, and a "pixel" includes a point of the point cloud. Similarly, references to a moving image or video include a time sequence of point clouds. Objects in the image or video may comprise objects, e.g. physical objects, represented by the image or video.
[00072] As one example, the media captioning system can be used to perform a still or moving image classification task (zero-shot), for example to classify an image into one of a plurality of classes, e.g., as a pickup truck, car, or van. Each candidate caption may describe the content of the image as belonging to a different respective class, e.g. "this is a photograph of a pickup truck", and so forth. A similar approach may be used to classify actions in moving images, e.g. gestures.
[00073] This approach may also be used to retrieve an image from an image database (or a sound from a sound database), by comparing a user-specified text with a respective caption assigned by the media captioning system to each image (or sound) in the database, to find the image with the closest match, and extracting that image (or sound).
[00074] As another example, a caption assigned by the media captioning system to a target image may be compared to captions assigned by the media captioning system to images in a database, e.g. to determine a value representing a similarity between the target image and the images in the database, e.g. as part of an image search task, to identify and retrieve one or more images that are similar to the target image.
[00075] Optionally, the media captioning system may be configured to receive the media data item in a form which comprises, in addition to the image(s) and/or sound(s) of the media data item, an associated text item, such that the caption assigned by the present method is additionally dependent on the associated text item. In this case, the media captioning system may operate as a multi-modal language processing system for performing multimodal tasks that involve processing a combination of an image (or sound) and text to generate an output that depends upon both the image and/or sound and the text.
[00076] Some example multimodal machine learning models with which the techniques described herein may be used include: Flamingo (Alayrac et al., arXiv:2204.14198); ALIGN (Jia et al., arXiv:2102.05918); PaLI (Chen et al., arXiv:2209.06794); and PaLI-X (Chen et al., arXiv:2305.18565).
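For the zero-shot classification use described in paragraph [00072], one candidate caption can be built per class name from a prompt template. The following short Python sketch illustrates this under the same scorer interface assumed earlier; the template wording, the whitespace tokenisation and the function name classify_image are illustrative assumptions rather than features of the specification.

    def classify_image(scorer, image, class_names, alpha=0.8,
                       template="this is a photograph of a {}"):
        # Build one candidate caption per class label and return the label whose
        # caption obtains the highest debiased score.
        captions = [template.format(name).split() for name in class_names]
        scores = [scorer.conditioned_logprob(image, c) - alpha * scorer.unconditioned_logprob(c)
                  for c in captions]
        return class_names[max(range(len(class_names)), key=scores.__getitem__)]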
[00077] One example of a multimodal task is a visual grounding task that takes as input an image and text in a natural language and that generates a caption that identifies or locates the most relevant object or region in the image, where relevancy depends upon the text.
[00078] Another example involves generating an output that requires reasoning, e.g. spatio-temporal reasoning, to respond to a natural language query input, e.g. relating to a moving image (video). For example, such a query may require predictive reasoning ("what will happen next"), counterfactual reasoning ("what would happen in a different circumstance"), explanatory reasoning ("why did something happen"), or causal reasoning generally. For example, the trained neural networks can be used to detect objects in the video frames and provide information relating to the detected objects in response to a query. The query may comprise, for example, a request for a prediction of a future event or state relating to one or more of the objects (e.g. "will objects X and Y collide?"), or a request for conditional or counterfactual information relating to one or more of the objects (e.g. "what event would [not] happen if object X is modified, moved or absent?"), or a request for analysis of the video frames to determine a property or characteristic of one or more of the objects (e.g. "how many objects of type Z are moving?"). The caption produced by the captioning system may, for example, be in the form of a yes/no answer, or a more extended answer. Such systems can be used to predict whether or not two objects will collide, or how this may be avoided. The output may be useful by itself or it may be used to provide a warning and/or to control motion of one or more of the objects.
[00079] More generally, the captions produced by the captioning system based on a media data item (either containing or not containing text) may be used to perform an agent control task, if image(s) in the media data item are captured in a real-world environment containing the agent, and the caption defines an action to be performed by the agent, in particular to perform a task (e.g. specified by text in the media data item). The agent can be a mechanical agent, e.g. a robot or vehicle, controlled to perform actions in the real-world environment, in response to the observations, to perform the task, e.g. to manipulate an object or to navigate in the environment. Thus the agent can be, e.g., a real-world or simulated robot; as some other examples the agent can be a control system to control one or more machines in an industrial facility.
[00080] Some examples of multimodal machine learning models controlling an agent, and with which the techniques described herein may be used, are described in: PaLM-E (Driess et al., arXiv:2303.03378); RT-1 (Brohan et al., arXiv:2212.06817); and RT-2 (Brohan et al., arXiv:2307.15818).
[00081] This specification uses the term "configured" in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
[00082] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
[00083] The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
[00084] A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
[00085] In this specification the term "engine" is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
[00086] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
[00087] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
[00088] Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD-ROM and DVD-ROM disks.
[00089] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
[00090] Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
[00091] Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework.
[00092] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
[00093] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
[00094] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.
[00095] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[00096] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.