WO2025068600A1

WO2025068600A1 - Training diffusion neural networks by backpropagating differentiable rewards

Info

Publication number: WO2025068600A1
Application number: PCT/EP2024/077489
Authority: WO
Inventors: Kevin Jordan SWERSKY; Paul Adrian VICOL; David James Fleet; Kevin Stefan Clark
Original assignee: DeepMind Technologies Ltd
Current assignee: DeepMind Technologies Ltd
Priority date: 2023-09-28
Filing date: 2024-09-30
Publication date: 2025-04-03
Anticipated expiration: 2026-03-28

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a diffusion neural network using a differentiable reward function.

Description

DeepMind Technologies Limited F&R Ref.: 45288-0370WO1 PCT Application
TRAINING DIFFUSION NEURAL NETWORKS BY BACKPROPAGATING DIFFERENTIABLE REWARDS
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to U.S. Provisional Application No. 63/541,287, filed on September 28, 2023. The disclosure of the prior application is considered part of and is incorporated by reference in the disclosure of this application.
BACKGROUND
This specification relates to processing data using machine learning models. As one example, neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to another layer in the network, e.g., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of weights.
SUMMARY
This specification describes a system implemented as computer programs on one or more computers in one or more locations that generates an output data item conditioned on a conditioning input using a diffusion neural network. Generally, the conditioning input characterizes one or more desired properties for the data item, i.e., characterizes one or more properties that the final data item generated by the system should have. More specifically, this specification describes how a system can train the diffusion neural network through reinforcement learning using a differentiable reward function, e.g., after the diffusion neural network has already been trained on an objective that does not make use of the reward function. Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.
Diffusion models have revolutionized generative modeling for continuous data, achieving impressive results across modalities including images, videos, and audio. However, for many use cases, modeling the training data distribution (e.g. diverse images from a large data set) exactly does not align with the model’s desired behavior after training (e.g. generating aesthetically pleasing output). To overcome this mismatch, this specification proposes an efficient approach for gradient-based reward fine-tuning based on differentiating through the diffusion sampling process. That is, a training system can perform the described training approach after “pre-training” the diffusion neural network, e.g., on a large data set of diverse data items, in order to align the diffusion neural network to enable the desired behavior after training, i.e., to cause the diffusion neural network to generate data items that have the qualities measured by the reward function after training. In some of the described approaches, the system backpropagates the reward through the full sampling chain. However, this may result in a training process that consumes an excessive amount of memory, is computationally inefficient, or both. To keep memory and compute costs low during the training, this specification describes that the system may adopt any of a variety of modifications. As one example, this specification describes how the system can use gradient checkpointing to reduce the compute costs of the training while minimizing memory overhead. As another example, the system can optimize a proper subset of the network parameters rather than the full set of network parameters to improve the computational efficiency of the training process. As another example, this specification describes how the system can backpropagate through only the last K steps of sampling to compute the gradient.
In particular, this can not only improve the computational efficiency of the training process, but can also avoid exploding gradients that occur when backpropagating through the full sampling chain. In other words, using the “truncated” gradient can perform substantially better given the same number of training steps than computing the full gradient both in terms of computational efficiency and resulting model quality. As yet another example, this specification describes how the system can further improve efficiency by introducing lower-variance gradient estimates computed over sets of noise samples. By using these noise samples to reduce the variance of the gradients, the system can stabilize the training process, resulting in fewer training iterations being required (thereby improving the computational efficiency of the training process), in a higher-quality trained model, or both. Moreover, as the described techniques leverage gradients, they are substantially more efficient than reinforcement learning based fine-tuning baselines. The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG.1 is a diagram of an example training system.
FIG.2 is a flow diagram of an example process for training the diffusion neural network.
FIG.3 is a flow diagram of an example process for determining a variance reduction term of the loss function.
FIG.4 shows an example of the operation of the system when the diffusion neural network is used to generate images.
FIG.5 shows an example of the performance of the described techniques when fine-tuned on a reward function that measures an aesthetic score.
Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
This specification describes a system implemented as computer programs on one or more computers in one or more locations that trains a diffusion neural network that is used to generate an output data item conditioned on a conditioning input. Generally, the conditioning input characterizes one or more desired properties for the data item, i.e., characterizes one or more properties that the final data item generated by the system should have. The system can be configured to generate any of a variety of output data items conditioned on any of a variety of conditioning inputs. For example, the system can be configured to generate audio data, e.g., a waveform of audio or a spectrogram, e.g., a mel-spectrogram or a spectrogram where the frequencies are in a different scale, of the audio. In this example, the conditioning input can be text or features of text that the audio should represent, i.e., so that the system serves as a text-to-speech machine learning model that converts text or features of the text to audio data for an utterance of the text being spoken. As another example, the conditioning input can identify a desired speaker for the audio, i.e., so that the system generates audio data that represents speech by the desired speaker. As another example, the conditioning input can characterize properties of a song or other piece of music, e.g., lyrics, genre, and so on, so that the system generates a piece of music that has the properties characterized by the conditioning input. As another example, the conditioning input can specify a classification for the audio data into a class from a set of possible classes, so that the system generates audio data that belongs to the class.
For example, the classes can represent types of musical instruments or other audio emitting devices, i.e., so that the system generates audio that is emitted by the corresponding class, or types of animals, i.e., so that the system generates audio that represents noises generated by the corresponding animal, and so on. As another particular example, the data item can be an image, such that the system can perform conditional image generation by generating the intensity values of the pixels of the image. In general, the conditioning input can specify one or more characteristics for the image. In this particular example, the conditioning input can be a sequence of text and the output data item can be an image that describes the text, i.e., the conditioning input can be a caption for the output image. As yet another particular example, the conditioning input can be an object detection input that specifies one or more bounding boxes and, optionally, a respective type of object that should be depicted in each bounding box. As yet another particular example, the conditioning input can specify an object class from a plurality of object classes to which an object depicted in the output image should belong. As another example, the conditioning input can specify one or more images. For example, the conditioning input can specify an image at a first resolution and the output data item can include the image at a second, higher resolution. For example, the conditioning input can specify an image and the output data item can comprise a de-noised, enhanced, stylized, or otherwise edited version of the image. As yet another particular example, the conditioning input can specify an image including a target entity for detection, e.g. a tumor, and the output data item can comprise the image without the target entity, e.g., to facilitate detection of the target entity by comparing the images.
As yet another particular example, the conditioning input can be a segmentation that assigns each of a plurality of pixels of the output image to a category from a set of categories, e.g., that assigns to each pixel a respective one of the categories. As yet another example, the conditioning input can be a different type of structured input, e.g., a mesh or a graph that specifies properties of the image to be generated. More generally, the conditioning input can include one or more different types of inputs of one or more different modalities, e.g., only text, only one or more images, both text and one or more images, and so on. As yet another example, the output data item can be a video. Again, the conditioning input can specify one or more characteristics for the video. As a particular example, the conditioning input can include text and the output data item can be a video described by the text. As yet another particular example, the conditioning input can include one or more images and the output data item can be a video that completes the one or more images, e.g., a video starting from the one or more images. More generally, the task of generating the output data item can be any task that outputs continuous data conditioned on a conditioning input. For example, the output can be an output of a different sensor, e.g., a lidar point cloud, a radar point cloud, an electrocardiogram reading, and so on, and the conditioning input can represent the type of data that should be measured by the sensor. Where a discrete output is desired, this can be obtained, e.g., by thresholding the outputs generated by the diffusion neural network. In some applications, the output data item can be used in a control task to control an action of a mechanical agent acting in a real-world environment to perform a mechanical task. For example, the output data item can be processed by a policy neural network of the agent to select one or more actions to be performed by the agent as part of the task.
The agent may then perform the one or more actions. The output data item (e.g., image) can, for example, characterize a state of the real-world environment that is predicted to be obtained by the agent performing the one or more actions. The conditioning input can, e.g., specify a state of the real-world environment and the one or more actions. As another example, the conditioning input can specify a state of the real-world environment and the output data item can be used to select one or more actions to be performed by the mechanical agent to perform a task (i.e. the diffusion neural network can represent an action selection policy). In any of the above examples, the output data item generated using the diffusion neural network can either be an output data item in the output space, i.e., so that the values in the output data item are the values of a data item of the appropriate type, e.g., values of image pixels, amplitude values of an audio signal, and so on, or an output data item in a latent space, i.e., so that the values in the output data item are values in a latent representation of an output data item in the output space. When the output data item is generated in a latent space, the system can generate a final output data item in output space by processing the output data item in the latent space using a decoder neural network, e.g., one that has been pre-trained in an auto-encoder framework. During training, the system can use an encoder neural network, e.g., one that has been pre-trained jointly with the decoder in the auto-encoder framework, to encode target data items in the output space to generate target outputs for the diffusion neural network in the latent space. FIG.1 is a diagram of an example training system 100.
The training system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented. The system 100 is a system that trains a diffusion neural network 110 that is used to generate an output (final) data item 112 given a conditioning input 102. In particular, the system 100 trains the diffusion neural network 110 so that the output (final) data item 112 has the one or more desired properties characterized by the conditioning input 102. More specifically, the system 100 performs “fine-tuning,” i.e., further training, of the diffusion neural network 110 using a reward function 120. In other words, prior to being trained using the reward function 120, the diffusion neural network has been trained, by the system 100 or another training system, on a different objective, e.g., one that does not make use of the reward function 120. In general, the diffusion neural network can have been trained conventionally, using any diffusion model objective. As one example, the diffusion neural network can have been trained on a set of training data items on a diffusion score matching objective or a variant thereof. The reward function 120 can be any appropriate differentiable reward function that maps an input that includes (i) a data item or (ii) a latent representation of a data item to a reward score 122. Optionally, the reward function input can also include the conditioning input 102 or a representation of the conditioning input 102. For example, the reward function 120 can include one or more trained reward machine learning models, e.g., neural networks. As one example, the reward function 120 can include a machine learning model that maps at least a portion of the reward input to a score that represents an aesthetic quality of the output data item.
As a particular example of a reward function that represents aesthetic quality, an aesthetic predictor model can be trained on a data set that includes multiple data items that have each been assigned an aesthetic score that measures the aesthetic quality of the data item. That is, the predictor can have been trained, e.g., using a mean squared error or a mean absolute error loss, to predict the aesthetic scores for the data items in the data set. As a more specific example, the aesthetic predictor model can include (i) a pre-trained data item embedding model, e.g., trained through contrastive learning or other representation learning technique, and (ii) one or more output layers, e.g., fully-connected layers, that process the embedding generated by the data item embedding model to generate the aesthetic score. In this example, during the training, the system can hold the embedding model fixed and only train the output layers. As another example, the reward function 120 can include a machine learning model that maps at least a portion of the reward input to a score that represents a predicted quality of the output data item as would be rated by a human user. For example, the reward function can be a reward model that has been trained to model human preferences, e.g., on an objective function that trains using human preferences between pairs of data items. One example of such a model is the Human Preference Score v2 model, described in Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. As another example, the reward function 120 can include a machine learning model trained to perform a data item detection or recognition task, so that the reward function penalizes the output data item for including a particular class. 
For example, when the output data items are images, the model can be an object detection model, e.g., an open-vocabulary object detection model. In this example, the system can pass images generated by the diffusion model through a pre-trained object detection model, together with a set of queries Q that should be excluded from the generated images. As the reward, the system can use the sum of scores for the localized objects corresponding to all of the queries, the sum of the areas of their bounding boxes, or the sum of both. As another example, the reward function 120 can include a reward that causes the diffusion model to generate adversarial examples. That is, the system can fine-tune the diffusion model such that output data items generated based on a prompt for a class y are classified as a different class y′ by a pre-trained classifier for data items of the particular type. For example, as the reward, the system can use the negative cross-entropy to the target class by the pre-trained classifier. As another example, the reward function 120 can include one or more hard-coded differentiable reward functions. For example, one hard-coded reward function can include a function that measures the compressibility of the output data item. For example, the compressibility reward function can pass the output data item through differentiable compression ($c(\cdot)$) and decompression ($d(\cdot)$) algorithms to obtain a reconstruction of the output data item, and then output, as the reward score, a value based on an error, e.g., the Euclidean distance, between the original and reconstructed images, e.g., the error or a negative of the error, e.g. $-\lVert x - d(c(x)) \rVert_2^2$. When there are multiple reward models in the reward function 120, the final reward score 122 can be a sum or a weighted sum of the reward scores generated by the models.
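The compressibility reward and the weighted combination of several rewards described above can be sketched as follows. This is a minimal illustration, not the specification's implementation: the pair-averaging codec (`toy_compress`, `toy_decompress`), the `brightness_reward` function, and the weights are hypothetical stand-ins chosen only because they are simple and differentiable.

```python
def toy_compress(x):
    # Hypothetical differentiable "compression": average adjacent pairs.
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]

def toy_decompress(codes):
    # Hypothetical "decompression": repeat every code twice.
    return [c for c in codes for _ in range(2)]

def compressibility_reward(x, compress=toy_compress, decompress=toy_decompress):
    # Negative squared Euclidean error between x and its reconstruction.
    x_hat = decompress(compress(x))
    return -sum((a - b) ** 2 for a, b in zip(x, x_hat))

def brightness_reward(x):
    # Hypothetical second reward: the mean value of the data item.
    return sum(x) / len(x)

def combined_reward(x, reward_fns, weights):
    # Weighted sum of the scores produced by the individual reward functions.
    return sum(w * fn(x) for fn, w in zip(reward_fns, weights))

item = [1.0, 3.0, 2.0, 2.0]
score = combined_reward(item, [compressibility_reward, brightness_reward], [1.0, 0.5])
```

Setting every weight to 1 recovers the plain sum of the reward scores.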
Generally, the system 100 performs the further training using a set of conditioning inputs 102. For each conditioning input 102, the system 100 uses the diffusion neural network 110 to generate a final representation 106 of a data item 112. The system 100 then uses the final representation 106 to generate a reward input 118 to the reward function 120. For example, the reward input 118 can include the data item 112 generated from the final representation 106 and, optionally, the conditioning input 102. The system 100 then applies the reward function 120 to the reward input 118 to generate a reward score 122 for the conditioning input 102. The system then uses reward scores 122 to train 130 the diffusion neural network 110, i.e., to update the network parameters 150 of the diffusion neural network 110. The parameters of the diffusion neural network 110 generally include the weights and, in some cases, the biases of the layers of the diffusion neural network 110. In some implementations, by performing the further training, the system 100 updates all of the network parameters 150 of the diffusion neural network 110. In some other implementations, the diffusion neural network 110 has a first set of network parameters and a second set of network parameters and, as part of the (further) training of the diffusion neural network 110, the system 100 updates the first set of network parameters while holding the second set of network parameters fixed. For example, prior to the further training, the system 100 or another training system can have trained an instance of the diffusion neural network 110 that does not include the first set of network parameters (i.e., that includes only the second set of network parameters). For example, the training system can have trained the instance of the diffusion neural network 110 on a score matching objective.
In this example, the system 100 can, during the further training, hold the second set of network parameters fixed to pre-trained values determined by training the instance of the diffusion neural network 110 that does not include the first set of network parameters, e.g., on the score matching objective. For example, the system 100 can use a low-rank adaptation (LoRA) technique (Hu et al., arXiv:2106.09685, 2021) when performing the further training. In this case, for each of one or more weight matrices that are included in the second set of network parameters, the first set of network parameters include a low-rank factorization of an update weight matrix that can be used to update the weight matrix. The low-rank approximation technique can be performed on multiple different weight matrices to update corresponding different layers of the diffusion neural network. The system can use the low-rank approximation to approximate an update to the update weight matrix during each training update of the diffusion neural network, e.g., by optimizing a product of two smaller matrices in order to reduce the dimensionality of the calculation required to compute the change in weights required by the update. More specifically, performing a low-rank approximation refers to breaking up the update weight matrix into a product of two smaller matrices that, when multiplied together, can recover the values of the update weight matrix with high fidelity. In particular, the low-rank decomposition can represent $W_0 + \Delta W \approx W_0 + BA$, where $W_0$ is a weight matrix in the second set of network parameters, $\Delta W$ is the update weight matrix corresponding to $W_0$, and the product $BA$ approximates $\Delta W$.
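The low-rank update can be illustrated with a small pure-Python sketch. The dimensions `d`, `k`, `r`, the matrix values, and the helper names are assumptions made for illustration. Initializing `B` to zeros follows the convention of the cited LoRA paper, so the effective weight starts out equal to the frozen matrix.

```python
def matmul(a, b):
    # Plain-list matrix product of an (m x n) and an (n x p) matrix.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matadd(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def effective_weight(w0, b, a):
    # W_0 + B A: W_0 stays frozen, only B and A would receive gradient updates.
    return matadd(w0, matmul(b, a))

def full_param_count(d, k):
    return d * k

def lora_param_count(d, k, r):
    return r * (d + k)

d, k, r = 4, 3, 1                                   # illustrative sizes, r << min(d, k)
W0 = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # frozen weights
B = [[0.0] for _ in range(d)]                       # d x r, zero-initialized
A = [[0.1, 0.2, 0.3]]                               # r x k

W_start = effective_weight(W0, B, A)                # equals W0 because B is zero
B[0][0] = 1.0                                       # pretend one training step changed B
W_after = effective_weight(W0, B, A)
```

With these sizes the trainable factors hold 7 values against 12 for the full matrix; the saving grows with the size of the weight matrix for a fixed small rank.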
For example, the second set of network parameters can include a set of parameters of an initial diffusion neural network that is to be fine-tuned (and that are held fixed during the fine tuning), and the first set of parameters can include a set of parameters that are added to the initial diffusion neural network and that are adjusted during the training (fine tuning). In this case, the rank of a matrix refers to the number of linearly independent vectors, e.g., the number of columns or rows within the matrix decomposition $\Delta W$ that do not contain correlated data. The chosen rank constrains the dimensions of the two smaller matrices and therefore the dimensionality of the update. For example, in the case in which B is a matrix of dimension d × r and A has dimension r × k, where r must be the same to enable the matrix multiplication, the rank r can be a value much less than the minimum of d and k, e.g., r ≪ min(d, k). Thus, during training, the system learns the weights in matrices B and A instead of directly learning the weights in $\Delta W$. The diffusion neural network 110 can be any appropriate diffusion neural network that is configured to receive an input that includes a current (noisy) representation of an image and a conditioning input and to generate a denoising output. In some implementations, the diffusion neural network 110 performs a diffusion process in output space, e.g., pixel space when the data items are images. In this example, when the data items are images, the data items (“representations”) operated on and generated by the diffusion neural network 110 have values for each pixel that specify color values, e.g., RGB values or another color encoding scheme. Examples of such diffusion neural networks include Imagen.
In some other implementations, the diffusion neural network 110 performs a diffusion process in latent space, e.g., in a latent space that is lower-dimensional than the output space. That is, the data items (“representations”) operated on by the diffusion neural network 110 are latent representations and the values in the representations are learned, latent values, e.g., rather than color values when the data items are images. Examples of such diffusion neural networks include MobileDiffusion, as described in arXiv:2311.16567. In these implementations, during training, the diffusion neural network 110 can be associated with an encoder to encode training data items into the latent space and, after training and to generate new output data items, a decoder neural network that receives an input that includes a latent representation of a data item and decodes the latent representation to reconstruct the data item. Performing the further training is described in more detail below. After the training, the system 100 or another inference system can use the diffusion neural network 110 to generate new data items 112 conditioned on new conditioning inputs 102. The diffusion neural network 110 can have any appropriate architecture that allows the neural network to map a diffusion input that includes an input data item that has the same dimensionality as the output data item 112 to a denoising output that also has the same dimensionality as the output data item 112. For example, when the output data item is an audio signal or an image, the diffusion neural network 110 can be a convolutional neural network, e.g., a U-Net or other architecture that maps one input of a given dimensionality to an output of the same dimensionality.
As another example, the diffusion neural network 110 can be a Transformer neural network that processes the diffusion input through a set of self-attention layers to generate the denoising output. The neural network 110 can be conditioned on the conditioning input 102 in any of a variety of ways. As one example, the system 100 can use an encoder neural network to generate one or more embeddings that represent the conditioning input 102 and the diffusion neural network 110 can include one or more cross-attention layers that each cross-attend into the one or more embeddings. An embedding, as used in this specification, is an ordered collection of numerical values, e.g., a vector of floating point values or other types of values. For example, when the conditioning input is text, the system can use a text encoder neural network, e.g., a Transformer neural network, to generate a fixed or variable number of text embeddings that represent the conditioning input. When the conditioning input is an image, the system can use an image encoder neural network, e.g., a convolutional neural network or a vision Transformer neural network, to generate a set of embeddings that represent the image. When the conditioning input is audio, the system can use, e.g., an audio encoder neural network, e.g., an audio encoder neural network that has been trained jointly with a decoder neural network as part of a neural audio codec, to generate one or more embeddings that encode the audio. When the conditioning input is a scalar value, the system can use, e.g., an embedding matrix to map the scalar value or a one-hot representation of the scalar value to an embedding. In some cases, the conditioning input 102 includes multiple different types of inputs, e.g., two or more of text, images, bound values, or context embeddings.
In some of these cases, the system 100 can generate one or more initial embeddings for each of the different types of inputs, i.e., using an appropriate encoder neural network as described above, and then process the initial embeddings for all of the different types of inputs using a Transformer encoder neural network to update each of the initial embeddings to generate a set of final embeddings. The one or more cross-attention layers within the diffusion neural network 110 can then cross-attend into the set of final embeddings. In others of these cases, different cross-attention layers within the diffusion neural network 110 can cross-attend into embeddings of different types of conditioning inputs. In yet others of these cases, the system 100 can concatenate the initial embeddings of the different types of inputs along the sequence dimension and then the one or more cross-attention layers can cross-attend into the concatenated set of embeddings. As another example, the diffusion neural network 110 can include one or more other types of neural network layers that are conditioned on the one or more embeddings. Examples of such layers include Feature-wise Linear Modulation (FiLM) layers, layers with conditional gated activation functions, and so on. The diffusion input at any given updating iteration can also include data defining a noise level for the iteration. Generally, each updating iteration has a corresponding time step t and the noise level for the iteration depends on the time step. For example, the noise level can be a decreasing function of the time step t. Examples of such functions include a linear function, a cosine function, and a sigmoid function. In these cases, data identifying the noise level, the time step, or both can be embedded using an appropriate neural network, e.g., a multi-layer perceptron (MLP), and used to condition the diffusion neural network 110 as described above for the conditioning input.
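The three families of noise-level functions named above can be sketched as follows. The exact parameterizations are assumptions: the specification only names the families, so the normalization to the range [0, 1], the indexing by iteration `i`, and the `sharpness` parameter are illustrative choices.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def noise_level(i, num_steps, kind="linear", sharpness=10.0):
    """Noise level for updating iteration i (0-based), decreasing from 1 to 0."""
    u = i / (num_steps - 1)          # fraction of the sampling chain completed
    if kind == "linear":
        return 1.0 - u
    if kind == "cosine":
        return math.cos(0.5 * math.pi * u)
    if kind == "sigmoid":
        # Sigmoid rescaled so that it starts at 1 and ends at 0.
        lo, hi = _sigmoid(-0.5 * sharpness), _sigmoid(0.5 * sharpness)
        return (_sigmoid(sharpness * (0.5 - u)) - lo) / (hi - lo)
    raise ValueError(f"unknown schedule kind: {kind}")

schedules = {kind: [noise_level(i, 10, kind) for i in range(10)]
             for kind in ("linear", "cosine", "sigmoid")}
```

Each list in `schedules` could then be embedded, e.g., by an MLP, and supplied to the network alongside the conditioning embeddings.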
FIG.2 is a flow diagram of an example process 200 for training a diffusion neural network using reward scores. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the training system 100 depicted in FIG.1, appropriately programmed in accordance with this specification, can perform the process 200. The system initializes a representation of a data item (step 202). For example, the system can initialize the representation by sampling the values in the representation from a distribution, e.g., a Gaussian distribution. The system receives a conditioning input c characterizing one or more desired properties for the data item (step 204). For example, the system can sample the conditioning input c from a set of conditioning inputs $p_c$. The system then updates the representation of the data item to generate a final representation of the data item (step 206). In particular, the system generates the final representation over a plurality of sampling iterations (T iterations in the example Algorithm given later). At each of the plurality of sampling iterations, the system processes a diffusion input for the sampling iteration that includes the representation of the data item and a representation of the conditioning input using the diffusion neural network to generate a denoising output for the sampling iteration (step 208). Generally, the denoising output defines an estimate of the final representation given the current representation. In some implementations, the denoising output is an estimate of the noise component of the current representation, i.e., the noise that needs to be combined with, e.g., added to or subtracted from, the final representation to generate the current representation.
In some other implementations, the denoising output is an estimate of the final representation given the current representation, i.e., an estimate of the data item that would result from removing the noise component of the current representation. In yet other implementations, the system parametrizes the denoising output differently, e.g., using a v-parameterization (Salimans and Ho, arXiv:2202.00512, 2022, Section 4 and Appendix D) or another appropriate parameterization.
The system then updates the representation of the data item using the denoising output (step 210). In some implementations, the system uses the denoising output as the final denoising output for the sampling iteration. In some other implementations, the system also generates one or more additional denoising outputs for the sampling iteration.
For example, the system can make use of classifier-free guidance. In this example, the system processes a second diffusion input for the sampling iteration that includes the representation of the data item but not the representation of the conditioning input using the diffusion neural network to generate an unconditional denoising output for the sampling iteration. For example, the second diffusion input can include the representation of the data item and a predetermined representation that indicates unconditional sampling.
The system then updates the representation of the data item using the denoising output and the unconditional denoising output. In particular, the system can generate a final denoising output by combining the denoising output and the unconditional denoising output in accordance with a guidance weight w for the sampling iteration and update the representation of the data item using the final denoising output. For example, the system can set the final denoising output equal to (1 + w) × the denoising output − w × the unconditional denoising output.
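By way of a minimal, hedged sketch, the guidance combination above can be expressed element-wise over flat lists of values. The function name `guided_denoising_output` and the list-based representation are assumptions made only for this illustration; an actual implementation would typically operate on tensors.

```python
def guided_denoising_output(cond_output, uncond_output, w):
    """Return (1 + w) * cond_output - w * uncond_output, element-wise.

    cond_output:   denoising output computed with the conditioning input.
    uncond_output: denoising output computed without the conditioning input.
    w:             guidance weight for the sampling iteration.
    """
    if len(cond_output) != len(uncond_output):
        raise ValueError("denoising outputs must have the same length")
    return [(1.0 + w) * c - w * u for c, u in zip(cond_output, uncond_output)]


# With w = 0 the conditional output is returned unchanged; larger w moves the
# result further away from the unconditional output.
final_output = guided_denoising_output([1.0, 2.0], [0.5, 1.0], w=2.0)
```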
That is, the final denoising output can be determined from a difference between the first denoising output scaled by (1+w) and the sum of the one or more additional denoising outputs scaled by w.
For example, at each iteration other than the last, the system can generate an estimate of the final representation using the (final) denoising output and then apply a diffusion sampler to the estimate to generate the updated representation. The system can use any appropriate diffusion sampler, e.g., the DDPM (Denoising Diffusion Probabilistic Model) sampler, the DDIM (Denoising Diffusion Implicit Model) sampler, or another appropriate sampler. DDPMs are, for example, discussed in Ho et al., arXiv:2006.11239. For the last iteration, the estimate can be the updated representation or the system can use the sampler.
The system generates a reward input from the final representation after the last of the plurality of sampling iterations (step 212). When the representation of the data item is a representation in the output space, the system can directly include the final representation in the reward input. Optionally, the reward input can also include other information, e.g., the conditioning input. When the representation of the data item is a latent representation in a latent space, the system can process the final representation using a decoder neural network to generate the final data item and include the final data item in the reward input.
The system applies a reward function to the reward input to generate a reward score for the final data item that measures a quality of the final data item (step 214). For example, the reward function can be any of the reward functions described above with reference to FIG. 1 or any other appropriate differentiable reward function.
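Steps 202 through 214 can be sketched, for illustration only, as the following toy sampling loop with a deterministic DDIM-style update. Everything named here is hypothetical: the cosine signal schedule `alpha_bar`, the stand-in denoiser (which simply assumes that the clean data item equals the conditioning input), and the squared-distance reward are assumptions chosen so that the example is self-contained, and are not the network or reward function of the specification.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Assumed cosine signal level: 1.0 at t = 0 (clean), small at t = num_steps."""
    return math.cos(0.5 * math.pi * t / (num_steps + 1)) ** 2


def ddim_step(x_t, eps, t, num_steps):
    """One deterministic DDIM-style update from time step t to t - 1."""
    ab_t, ab_prev = alpha_bar(t, num_steps), alpha_bar(t - 1, num_steps)
    # Estimate of the final representation implied by the noise prediction.
    x0_hat = [(x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
              for x, e in zip(x_t, eps)]
    # Re-noise the estimate to the (lower) noise level of time step t - 1.
    return [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
            for x0, e in zip(x0_hat, eps)]


def make_toy_denoiser(num_steps):
    """Hypothetical stand-in for the diffusion network: predicts the noise
    assuming the clean data item equals the conditioning input."""
    def denoiser(x_t, t, cond):
        ab_t = alpha_bar(t, num_steps)
        return [(x - math.sqrt(ab_t) * c) / math.sqrt(1.0 - ab_t)
                for x, c in zip(x_t, cond)]
    return denoiser


def sample(denoiser, cond, num_steps, rng):
    x = [rng.gauss(0.0, 1.0) for _ in cond]      # step 202: Gaussian initialization
    for t in range(num_steps, 0, -1):            # step 206: T sampling iterations
        eps = denoiser(x, t, cond)               # step 208: denoising output
        x = ddim_step(x, eps, t, num_steps)      # step 210: sampler update
    return x


def toy_reward(x_final, cond):
    """Toy differentiable reward: negative squared distance to the condition."""
    return -sum((x - c) ** 2 for x, c in zip(x_final, cond))


cond = [0.5, -1.0, 2.0]                          # step 204: conditioning input
x_final = sample(make_toy_denoiser(10), cond, 10, random.Random(0))
score = toy_reward(x_final, cond)                # steps 212-214: reward score
```

Because the signal level at time step 0 is 1 in this sketch, the last update returns the estimate of the final representation directly, which corresponds to the option described above for the last iteration.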
As a particular example, the reward score 122 ($r(\cdot)$) for the conditioning input 102 ($c$) can be denoted $r(\mathrm{sample}(\theta, c, x_T), c)$, where $\mathrm{sample}(\theta, c, x_T)$ denotes the final data item, $x_T$ denotes the initial representation of the data item, and $\theta$ denotes the parameters of the diffusion network that are being updated during the training (e.g. including the LoRA parameters). Later, in the Algorithm, this is given as $r(x_0, c)$, where $x_0$ denotes the final data item.
The system trains the diffusion neural network using a loss function that includes a first term that measures the reward score for the final data item (step 216). For example, the first term can be the reward score, e.g., $r(\mathrm{sample}(\theta, c, x_T), c)$, or the negative of the reward score.
As part of the training, the system can backpropagate gradients of the first term through the reward function and through a subset of the sampling iterations in order to determine a gradient of the first term with respect to the network parameters of the diffusion neural network. For example, the gradient of the first term with respect to the network parameters can be backpropagated through the subset of the sampling iterations in order to determine the gradient, $g$, of the first term with respect to the network parameters of the diffusion neural network, $\nabla_\theta\, r(\mathrm{sample}(\theta, c, x_T), c)$ (i.e. back through multiple calls of the diffusion model neural network in the sampling chain, analogously to backpropagation through time).
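The following toy sketch illustrates, under strong simplifying assumptions, what backpropagating a reward gradient through the sampling chain involves. The "network" is a single scalar parameter `theta` whose denoising output is `theta * x`, each sampling iteration is `x <- x - step_size * denoising_output`, and the reward is a negative squared distance to a target; all of these are hypothetical stand-ins, and the derivatives are written out by hand rather than obtained from an automatic differentiation framework. The parameter `k` selects how many of the latest sampling iterations the gradient is propagated through.

```python
def sample_with_trace(theta, x_init, num_steps, step_size=0.1):
    """Run the toy sampling chain and keep the input of every iteration."""
    xs = [x_init]
    for _ in range(num_steps):
        x = xs[-1]
        denoising_output = theta * x             # toy stand-in for the network
        xs.append(x - step_size * denoising_output)
    return xs                                    # xs[-1] is the final representation


def reward(x_final, target=1.0):
    """Toy differentiable reward."""
    return -(x_final - target) ** 2


def reward_grad_wrt_theta(theta, x_init, num_steps, k, step_size=0.1, target=1.0):
    """Gradient of the reward w.r.t. theta through the last k sampling iterations.

    Inputs to earlier iterations are treated as constants, so k = num_steps
    gives the full gradient and k = 1 uses only the last iteration.
    """
    xs = sample_with_trace(theta, x_init, num_steps, step_size)
    grad_x = -2.0 * (xs[-1] - target)            # d reward / d final representation
    grad_theta = 0.0
    for i in range(num_steps - 1, num_steps - 1 - k, -1):
        x_in = xs[i]                             # input to sampling iteration i
        grad_theta += grad_x * (-step_size * x_in)   # direct dependence on theta
        grad_x *= 1.0 - step_size * theta        # pass gradient to the step input
    return grad_theta


full_gradient = reward_grad_wrt_theta(0.5, 2.0, num_steps=5, k=5)
last_step_gradient = reward_grad_wrt_theta(0.5, 2.0, num_steps=5, k=1)
```

In this sketch the loop body mirrors the two stages described in this specification: the gradient with respect to the reward input is computed first and is then carried backwards through the selected sampling iterations, accumulating a contribution to the parameter gradient at each one.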
For example, the system can backpropagate the gradient of the first term through the reward function to determine a gradient of the first term with respect to the reward input, and then backpropagate the gradient of the first term with respect to the reward input through the subset of the sampling iterations to determine the gradient with respect to the network parameters as described above.
When the representation of the data item is a latent representation in a latent space, because the system used the decoder neural network to generate the input to the reward function, training includes backpropagating gradients of the first term through the reward function, through the decoder neural network, and through a subset (i.e. some or all) of the sampling iterations.
For example, the system can perform the above steps in parallel for multiple different conditioning inputs, and the loss function can be a sum of or average of respective terms for each of the conditioning inputs. As another example, the loss function can include additional terms in addition to the first term(s) for the conditioning input(s), e.g., regularization term(s). As a particular example of an additional term, the loss function can include an additional term that reduces the variance of the updates and that is computed based on reward scores generated for data items that are generated from noised versions of the output data item. This additional term is described in more detail below with reference to FIG. 3.
In some cases, the subset of sampling iterations through which the system backpropagates gradients can include all of the sampling iterations, i.e., the subset is not a so-called “proper subset”. However, backpropagating through all of the sampling iterations can be computationally expensive. The system can use any of a variety of techniques to account for this and make the training more computationally efficient.
As one example, the subset of the sampling iterations can exclude one or more earliest iterations of the plurality of sampling iterations. In other words, the subset can include all of the sampling iterations except the one or more earliest iterations. As a particular example, the subset of the sampling iterations can include only a plurality of the latest iterations of the plurality of sampling iterations. That is, the subset of sampling iterations can include only the last K sampling iterations, where K is an integer greater than one. As yet another particular example, the subset of the sampling iterations can include only the last iteration of the plurality of sampling iterations.
As another example, in addition to or instead of not including certain sampling iterations in the subset, the system can make use of gradient checkpointing during training. In this example, to backpropagate gradients through a particular sampling iteration, the system retrieves, from memory, the representation of the final data item as of the particular sampling iteration. That is, when performing the particular sampling iteration as part of step 206, the system can store the input representation to the particular sampling iteration in memory. However, the system can refrain from persisting in memory the intermediate activations of the diffusion neural network when generating the denoising output(s) for the particular sampling iteration, i.e., can refrain from keeping the intermediate activations in memory after the sampling iteration has been completed. The system can then re-compute (re-materialize) the intermediate activations of the diffusion neural network for the particular sampling iteration using the retrieved representation, i.e., by re-processing the diffusion input(s) that include the retrieved representation using the diffusion neural network.
The system then computes gradients of the first term for the particular sampling iteration using the intermediate activations, e.g., by backpropagating the current gradient through the diffusion neural network using the intermediate activations. Thus, by refraining from persisting in memory the intermediate activations, the system can reduce the amount of memory required to compute the gradients. However, by still persisting in memory the input representation for the sampling iteration, the system can ensure that the latency associated with the training is not excessively increased, i.e., because the input representation does not need to be re-computed.
As a result of the training, e.g., of backpropagating the gradients, the system has a respective gradient with respect to each parameter in at least a subset of the parameters of the diffusion neural network. The system can then apply an optimizer, e.g., SGD, Adam, RMSProp, and so on, to these gradients to update the values of the (at least a) subset of the parameters. By repeatedly performing this training for different sets of one or more conditioning inputs, the system effectively “fine-tunes” the diffusion neural network to generate outputs that result in increased reward scores.
FIG. 3 is a flow diagram of an example process 300 for determining a variance reduction term of the loss function. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the training system 100 depicted in FIG. 1, appropriately programmed in accordance with this specification, can perform the process 300. The variance reduction term is also referred to as a “second” term in the loss function in this specification. The system can perform the process 300 at each of one or more noise iterations.
The system samples noise for the noise iteration (step 302). For example, the system can sample the values in the noise from the same distribution used to initialize the representation of the data item, e.g., a Gaussian distribution.
The system applies the noise to the final representation to generate a noisy representation (step 304).
The system processes an input that includes the noisy representation using the diffusion neural network to generate a denoising output for the noise iteration (step 306).
The system then updates the noisy representation using the denoising output for the noise iteration to generate an updated noisy representation that is an estimate of the final representation (step 308).
The system generates a new reward input from the updated noisy representation (step 310). When the representation of the data item is a representation in the output space, the system can directly include the updated noisy representation in the new reward input. Optionally, the reward input can also include other information, e.g., the conditioning input. When the representation of the data item is a latent representation in a latent space, the system can process the updated noisy representation using a decoder neural network to generate a noisy data item and include the noisy data item in the new reward input.
The system applies the reward function to the new reward input to generate a new reward score (step 312).
Generally, the loss function described above includes a second term that measures the new reward scores for the one or more noise iterations. When there are a plurality of noise iterations, the second term can measure a combination of, e.g., an average of, the new reward scores.
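Steps 302 through 312 can be sketched, purely for illustration, as follows. The names `noise_iteration_reward` and `second_term`, the `signal_level` parameter that controls how much noise is applied, the noise-prediction form of the denoiser, and the toy denoiser and reward used in the usage lines are all assumptions introduced for this example; the final representation is passed in as a plain list of numbers and is treated as a fixed input.

```python
import math
import random


def noise_iteration_reward(x_final, denoiser, reward_fn, signal_level, rng):
    """One noise iteration: noise the final representation, denoise once, score."""
    noise = [rng.gauss(0.0, 1.0) for _ in x_final]                 # step 302
    a, s = math.sqrt(signal_level), math.sqrt(1.0 - signal_level)
    x_noisy = [a * x + s * n for x, n in zip(x_final, noise)]      # step 304
    eps_hat = denoiser(x_noisy)                                    # step 306
    # Updated noisy representation: an estimate of the final representation.
    x_est = [(xn - s * e) / a for xn, e in zip(x_noisy, eps_hat)]  # step 308
    return reward_fn(x_est)                                        # steps 310-312


def second_term(x_final, denoiser, reward_fn, num_noise_iterations,
                signal_level=0.9, seed=0):
    """Average of the new reward scores over the noise iterations."""
    rng = random.Random(seed)
    scores = [noise_iteration_reward(x_final, denoiser, reward_fn, signal_level, rng)
              for _ in range(num_noise_iterations)]
    return sum(scores) / len(scores)


# Usage with hypothetical stand-ins: a denoiser that predicts zero noise and a
# reward equal to the negative squared norm of its input.
value = second_term(
    x_final=[1.0, -2.0],
    denoiser=lambda x_noisy: [0.0] * len(x_noisy),
    reward_fn=lambda x: -sum(v * v for v in x),
    num_noise_iterations=2,
)
```

Note that the same final representation is reused for every noise iteration, so each additional reward score costs one extra call to the denoiser rather than a full sampling chain.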
When the loss function includes the second term, training the neural network on the loss function can include backpropagating gradients of the second term through the reward function and into the noise iterations but not through any of the sampling iterations. That is, the system can insert a stop gradient to prevent gradients of the second term from being computed with respect to the final representation or any representations generated at any preceding sampling iterations.
An example technique for training the diffusion neural network $\epsilon_\theta$ having parameters $\theta$ when the diffusion sampler is DDIM and there are T sampling iterations is shown in Table 1.

Table 1
In particular, in Table 1, DRaFT refers to a version of the described techniques where the subset of sampling iterations includes all of the sampling iterations, DRaFT-K refers to a version of the described techniques where the subset of sampling iterations includes only the last K sampling iterations, and DRaFT-LV refers to a version of the described techniques where the subset includes only the last sampling iteration and the second term is included in the loss function. In Table 1, the “DRaFT-LV then…” portion of the algorithm refers to the computation of the second term described above. However, while the example of Table 1 shows the second term only being used with DRaFT-LV, in practice, the second term can also be used with either DRaFT or DRaFT-K.
FIG. 4 shows an example 400 of the described techniques when the diffusion neural network is used to generate images from conditioning inputs that include text sequences. In the example 400, the system receives a conditioning input (“prompt”) 402 “majestic lion.” The system then initializes a representation 404 x_T and processes the conditioning input 402 using the diffusion neural network 110 to update the representation across multiple sampling iterations until arriving at a final representation x_0 406. In particular, in the example 400, the representations are in the output space, so each representation x is an image, with different representations having different amounts of noise.
When DRaFT-LV is used, the system can then generate multiple updated noisy representations 408 using x_0. For example, the system can noise the same final representation x_0 n times without regenerating a new final representation, and hence without incurring the computational cost of T updating steps, i.e. sampling iterations, that each involve, e.g., a call to the diffusion neural network. The reward gradient, g, i.e.
a gradient of the first term of the loss function (that measures the reward score), of these noisy representations can be summed to obtain the gradient of the first term of the loss function that is backpropagated for training the diffusion neural network. This can be a particularly efficient approach. For example, using n = 2 is twice as efficient as DRaFT whilst only using 10% more compute.
For all variants, the system then scores the generated representation(s) using the reward function r 120. In the example 400, the reward function is composed of multiple individual reward functions that measure various properties of the representations, e.g., aesthetic score, human quality score, compressibility, and object detection as described above.
The system then backpropagates gradients through various ones of the sampling iterations, depending on whether the technique being used is DRaFT-LV, DRaFT, or DRaFT-K, in order to determine the final gradient that will be used to update the network parameters. Moreover, in the example 400, the system is using a Low-Rank Adaptation (LoRA) training technique to only update a subset of the network parameters of the model.
In some implementations, after training, the system can make further modifications to the values of the network parameters of the neural network before using the diffusion neural network to generate new data items. For example, the system can train multiple instances of the diffusion neural network, each with a different reward function using the LoRA training technique.
In this example, the system can generate a new instance of the diffusion neural network by computing, for each network parameter in the first subset that is updated through the LoRA training, a weighted sum of the values of the network parameter across the multiple instances, with the weight for each instance being determined by how strongly the property (or properties) measured by the corresponding reward function should be reflected in new data items generated after training.
As another example, the system can train a single instance of the diffusion neural network on a reward function using the LoRA training technique. In this example, the system can generate a new instance of the diffusion neural network by computing, for each network parameter in the first subset that is updated through the LoRA training, a weighted sum of the value of the network parameter after the fine-tuning and the pre-trained value of the network parameter, with the weight for each being determined by how strongly the property (or properties) measured by the reward function should be reflected in new data items generated after training.
FIG. 5 shows an example 500 of the performance of the described techniques when fine-tuned on a reward function that measures an aesthetic score. In particular, as can be seen from the example 500, variants of the described techniques outperform two existing techniques (ReFL and DDPO) as well as the pre-trained model (Stable Diffusion) and prompt engineering across a range of reward queries (training conditioning inputs).
This specification uses the term “configured” in connection with systems and computer program components.
For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, e.g., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network. 
In this specification, the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, the index database can include multiple collections of data, each of which may be organized and accessed differently. Similarly, in this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers. Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few. Computer readable media suitable for storing computer program instructions and data include all forms of non volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser. 
Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, e.g., inference, workloads. Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework or a JAX framework.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products. Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims

CLAIMS
1. A method performed by one or more computers, the method comprising: initializing a representation of a data item; receiving a conditioning input characterizing one or more desired properties for the data item; updating the representation of the data item to generate a final representation of the data item, the updating comprising: at each of a plurality of sampling iterations: processing a diffusion input for the sampling iteration that comprises the representation of the data item and a representation of the conditioning input using a diffusion neural network to generate a denoising output for the sampling iteration; and updating the representation of the data item using the denoising output; generating a reward input from the final representation after the last of the plurality of sampling iterations; applying a reward function to the reward input to generate a reward score for the final data item that measures a quality of the final data item; and training the diffusion neural network using a loss function that comprises a first term that measures the reward score for the final data item, the training comprising backpropagating gradients of the first term through the reward function and through a subset of the sampling iterations.
2. The method of claim 1, wherein the diffusion neural network has been pre-trained on a diffusion model training objective that does not use the reward function.
3. The method of any preceding claim, wherein the reward input comprises the conditioning input.
4. The method of any preceding claim, wherein the subset of the sampling iterations comprises all of the plurality of sampling iterations.
5. The method of any one of claims 1-3, wherein the subset of the sampling iterations is a proper subset that includes less than all of the plurality of iterations.
6. The method of claim 5, wherein the subset of the sampling iterations does not include one or more earliest iterations of the plurality of sampling iterations.
7. The method of claim 5 or claim 6, wherein the subset of the sampling iterations includes only a plurality of the latest iterations of the plurality of sampling iterations.
8. The method of any preceding claim, wherein backpropagating gradients through a particular sampling iteration comprises: retrieving the representation of the final data item as of the particular sampling iteration from memory; re-computing intermediate activations of the diffusion neural network for the particular sampling iteration using the retrieved representation; and computing gradients of the first term for the particular sampling iteration using the intermediate activations.
9. The method of any preceding claim, wherein the diffusion neural network comprises a first set of network parameters and a second set of network parameters, and wherein training the diffusion neural network comprises: updating the first set of network parameters while holding the second set of network parameters fixed.
10. The method of claim 9, wherein updating the first set of network parameters while holding the second set of network parameters fixed comprises: holding the second set of network parameters fixed to pre-trained values determined by training an instance of the diffusion neural network that does not include the first set of network parameters on a score matching objective.
11. The method of any preceding claim, the updating further comprising: at each of the plurality of sampling iterations: processing a second diffusion input for the sampling iteration that comprises the representation of the data item but not the representation of the conditioning input using the diffusion neural network to generate an unconditional denoising output for the sampling iteration; and updating the representation of the data item using the denoising output and the conditional denoising output.
12. The method of claim 11, wherein updating the representation of the data item using the denoising output and the conditional denoising output comprises: generating a final denoising output by combining the denoising output and the unconditional denoising output in accordance with a guidance weight for the sampling iteration; and updating the representation of the data item using the final denoising output.
13. The method of claim 11 or claim 12, wherein the second diffusion input comprises the representation of the data item and a predetermined representation that indicates unconditional sampling.
14. The method of any preceding claim, wherein the representation of the data item is a latent representation in a latent space.
15. The method of claim 14, wherein generating a reward input from the final representation after the last of the plurality of sampling iterations comprises: processing the reward input using a decoder neural network to generate the final data item; and including the final data item in the reward input, wherein the training comprises backpropagating gradients of the first term through the reward function, through the decoder neural network, and through a subset of the sampling iterations.
16.
The method of any preceding claim, wherein the reward function comprises a plurality of reward models and wherein the reward score is a combination of respective initial reward scores generated by each of the plurality of reward models by processing at least a portion of the reward input. 17. The method of any preceding claim, wherein the data item is an image. 18. The method of any preceding claim, wherein the data item is audio data representing an audio signal. DeepMind Technologies Limited F&R Ref.: 45288-0370WO1 PCT Application 19. The method of any preceding claim, wherein the data item is a video comprising a plurality of video frames. 20. The method of any preceding claim, wherein the conditioning input comprises a text prompt. 21. The method of any preceding claim, wherein the conditioning input comprises an image. 22. The method of any preceding claim, wherein the conditioning input comprises audio data representing an audio signal. 23. The method of any preceding claim, further comprising: at each of one or more noise iterations: sampling noise; applying the noise to the final representation to generate a noisy representation; processing an input that comprises the noisy representation using the diffusion neural network to generate a denoising output for the noise iteration; updating the noisy representation using the denoising output for the noise iteration to generate an updated noisy representation that is an estimate of the final representation; generating a new reward input from the updated noisy representation; and applying the reward function to the new reward input to generate a new reward score; and wherein loss function comprises a second term that measures the new reward scores for the one or more noise iterations. 24. The method of claim 23, wherein there are a plurality of noise iterations and wherein the second term measures an average of the new reward scores. 25. 
The method of claim 23 or claim 24, the training comprising backpropagating gradients of the second term through the reward function and into the noise iterations but not DeepMind Technologies Limited F&R Ref.: 45288-0370WO1 PCT Application through any of the sampling iterations. 26. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the operations of the respective method of any preceding claim. 27. One or more computer storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1-25.
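
The following is an informal illustration only, not part of the claims. It is a minimal sketch of the training signal recited in claim 1, with the truncation of claims 5-7, under strong simplifying assumptions: the data item is a single scalar, the "diffusion neural network" is a hypothetical one-parameter linear function, and the reward is a hypothetical squared-error score. None of the names (`denoise`, `sample`, `reward`, `reward_grad_truncated`, `train_step`), the update rule, or the numeric values come from the specification. The gradient is written out by hand in reverse mode so that it is explicit which sampling iterations the gradient flows through.

```python
def denoise(theta, x, c):
    """Hypothetical stand-in for the diffusion neural network (one parameter)."""
    return theta * (x - c)


def sample(theta, x0, c, num_steps, step_size):
    """Run the sampling iterations and keep every intermediate representation."""
    xs = [x0]
    for _ in range(num_steps):
        x = xs[-1]
        # Update the representation using the denoising output.
        xs.append(x - step_size * denoise(theta, x, c))
    return xs


def reward(x, c):
    """Hypothetical differentiable reward: highest when x matches the condition c."""
    return -(x - c) ** 2


def reward_grad_truncated(theta, x0, c, num_steps, step_size, k):
    """Reward and its gradient w.r.t. theta, backpropagated through the last k steps.

    Iterations earlier than the last k are treated as constants, so they
    contribute nothing to the gradient.
    """
    xs = sample(theta, x0, c, num_steps, step_size)
    g_x = -2.0 * (xs[-1] - c)  # d reward / d x_T
    g_theta = 0.0
    for t in range(num_steps - 1, num_steps - 1 - k, -1):
        x_t = xs[t]  # stored representation as of iteration t
        # Step t computes x_{t+1} = x_t - step_size * theta * (x_t - c).
        g_theta += g_x * (-step_size * (x_t - c))  # direct dependence on theta
        g_x *= 1.0 - step_size * theta             # propagate to x_t
    return reward(xs[-1], c), g_theta


def train_step(theta, x0, c, num_steps, step_size, k, lr):
    """One gradient-ascent step on the reward (i.e. descent on loss = -reward)."""
    _, g_theta = reward_grad_truncated(theta, x0, c, num_steps, step_size, k)
    return theta + lr * g_theta


# Usage: 4 sampling iterations, gradients through only the last 2.
theta = 0.5
new_theta = train_step(theta, x0=2.0, c=1.0, num_steps=4, step_size=0.5, k=2, lr=0.1)
```

In this sketch, setting `k` equal to `num_steps` sends gradients through every sampling iteration, while a smaller `k` skips the earliest iterations. The backward loop reads each stored `xs[t]` and recomputes the local derivative from it, which loosely mirrors the retrieve-and-recompute pattern of claim 8; a real network would recompute its intermediate activations at that point.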