WO2025166268A1 - Extracting responses from language model neural networks by scoring response tokens - Google Patents
Extracting responses from language model neural networks by scoring response tokens
- Publication number
- WO2025166268A1 (PCT/US2025/014169)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- output
- tokens
- sequence
- response
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/33295—Natural language query formulation in dialogue systems
- G06F16/3344—Query execution using natural language analysis
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
- G06F40/56—Natural language generation
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/047—Probabilistic or stochastic networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N3/09—Supervised learning
- G06N5/041—Abduction
Definitions
- This specification describes a system implemented as computer programs on one or more computers in one or more locations that performs a machine learning task on a network input to generate a network output for the machine learning task.
- the machine learning task can be any machine learning task.
- the machine learning task can be a task that operates on a network input that is an input sequence, i.e., a collection of multiple elements, to generate a network output for the network input.
- the neural network is a neural network that is configured to perform an image processing task, i.e., receive an input image and to process the input image to generate a network output for the input image.
- the task can be image embedding generation and the output generated by the neural network can be a numeric embedding of the input image.
- the task can be object detection and the output generated by the neural network can identify locations in the input image at which particular types of objects are depicted.
- the task can be image segmentation and the output generated by the neural network can assign each pixel of the input image to a category from a set of categories.
- the neural network is a neural network that is configured to perform an image generation task, where the input is a conditioning input and the output is a sequence of intensity values for the pixels of an image.
- the task may be a neural machine translation task.
- the input to the neural network is a sequence of text, e.g., a sequence of words, phrases, characters, or word pieces, in one language
- the output generated by the neural network may be a translation of the sequence of text into another language, i.e., a sequence of text in the other language that is a translation of the input sequence of text.
- the vocabulary for the input tokens may be words, wordpieces or characters of the first language
- the vocabulary for the output tokens may be words, wordpieces or characters of the other language.
- the task may be a multi-lingual machine translation task, where a single neural network is configured to translate between multiple different source language - target language pairs.
- the source language text may be augmented with an identifier that indicates the target language into which the neural network should translate the source language text.
- the vocabulary of tokens can include any of a variety of tokens that represent text symbols or other symbols.
- the vocabulary of tokens can include one or more characters, sub-words, words, punctuation marks, numbers, or other symbols that appear in a corpus of natural language text and/or computer code.
- the vocabulary of tokens can include tokens that can represent data other than text.
- the vocabulary of tokens can include image tokens that represent a discrete set of image patch embeddings of an image that can be generated by an image encoder neural network based on processing the image patches of the image.
- the vocabulary of tokens can include video tokens that represent spatial-temporal dynamics of a video that can be generated by a video encoder neural network based on processing the video frames of the video.
- the vocabulary of tokens can include audio tokens that represent code vectors in a codebook of a quantizer, e.g., a residual vector quantizer.
- the language model neural network can generate text sequences, i.e., each output sequence generated by the language model neural network is a sequence of text tokens from a vocabulary of text tokens that includes, e.g., one or more of characters, subwords, words, punctuation marks, numbers, or other symbols that appear in natural language text.
- the input tokens may represent words, wordpieces or characters in a first natural language and the output tokens may represent instructions in a computer programming or markup language, or instructions for controlling an application program to perform a task, e.g., build a data item such as an image or web page.
- the task may be an audio processing task.
- for example, where the input represents a spoken utterance, the output generated by the neural network may be a score for each of a set of pieces of text, each score representing an estimated likelihood that the piece of text is the correct transcript for the utterance.
- the output generated by the neural network can indicate whether a particular word or phrase ("hotword") was spoken in the utterance.
- the output generated by the neural network can be a classification of the spoken utterance into one of a plurality of categories, for example an identity of the natural language in which the utterance was spoken.
- the task can be a natural language processing or understanding task, e.g., an entailment task, a paraphrase task, a textual similarity task, a sentiment task, a sentence completion task, a grammaticality task, and so on, that operates on a sequence of text in some natural language.
- the task can be a text to speech task, where the input is text in a natural language or features of text in a natural language and the network output is a spectrogram, a waveform, or other data defining audio of the text being spoken in the natural language.
- the machine learning task is a multi-modal processing task that requires processing multi-modal data.
- multi-modal data is a combination of two or more different types of data, e.g., two or more of audio data, image data, text data, or graph data.
- the multi-modal data may comprise audio-visual data, comprising a combination of pixels of an image or of video and audio data representing values of a digitized audio waveform.
- the multi-modal data may comprise a combination of i) text data representing text in a natural language and ii) pixels of an image or of video or audio data representing values of an audio waveform.
- the different types of data may represent the same or overlapping objects using the different modalities (types), and when processing multimodal data the data may be mapped into a common embedding space.
- the task is a multi-modal processing task that requires processing both text and image inputs, so that the neural network includes both a computer vision neural network and a text processing neural network. That is, the target output to be generated by the computer vision neural network for a given image depends on one or more outputs generated by the text processing neural network for one or more corresponding text inputs (and vice versa).
- Examples of such tasks include open-vocabulary image classification, open-vocabulary object detection, image captioning, text-based image search, image-based retrieval, and so on.
- the multi-modal processing task may correspond to any of the tasks previously described for any of the types of data making up the multi-modal combination.
- an accuracy of the previously described tasks may be increased when the task is applied to multi-modal data combining the data for which the task has been previously described and another type of data.
- detection or classification of an object or event may be improved when data of multiple different types (modalities) is processed.
- the task to be performed by the neural network can be defined by (at least a part of) the network input, e.g., that is in the form of a prompt or a request, received by the neural network.
- the neural network will be able to perform any of these tasks when an appropriate prompt or request is received.
- FIG. 1 shows an example system 100.
- the system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
- the system 100 includes a response generation system 102 that receives a query input 110 from a user device 104 and processes the query input 110 using a language model system 106 and a response processing system 108 to generate a response 114 to the query input 110 for a machine learning task, e.g., one of the tasks described above.
- the language model system 106 includes a language model configured to process the query input 110 to generate multiple candidate output sequences 112, as described in further detail below with reference to FIG. 2.
- system 102 can use the language model of the language model system 106 as a conversational agent as part of a communication session with a user via the user device 104.
- the communication session can include communications between the user device 104 and a dialogue agent.
- the dialogue agent can be a conversational agent (e.g. a chatbot) that is configured to generate a response to a user query.
- the system 100 can output the response 114 during the communication session. For example, the system 100 can display the response 114 on a user interface of the user device 104.
- the query input 110 includes a sequence of input tokens that can represent a query submitted by a user. That is, the sequence of input tokens can represent input text, an image, an audio signal, or a combination thereof.
- the response 114 is a sequence of output tokens from a vocabulary of tokens that represents a response to the query.
- a token can represent data of one or more modalities
- the output sequence can include tokens representing data of one or more modalities.
- the input sequence can include tokens representing text, images, or video
- the output sequence can include tokens representing text.
- the language model can generate text sequences, i.e., each response 114 generated by the language model is a sequence of text tokens from a vocabulary of text tokens that includes, e.g., one or more of characters, sub-words, words, punctuation marks, numbers, or other symbols that appear in natural language text.
- the query input 110 can include multiple tokens that represent a query (“I have 3 apples, my dad has 2 more apples than me, how many apples do we have in total?”).
- the system 102 is configured to process the query input 110 to generate the response 114 (“We have 8 apples in total”).
- the query input 110 can include the user’s current input and previous information from the conversation (e.g., one or more previous user queries, one or more previous system responses, or a combination thereof).
- the system 102 is configured to use the language model system 106 to process the query input 110 to generate multiple candidate output sequences 112 by allowing the language model to “reason” through multiple decoding paths.
- Each candidate output sequence 112 includes a sequence of output tokens from the vocabulary of output tokens that represent a candidate response to the query.
- the system can “non-greedily” select an output token for a first decoding step in order to allow the language model to generate the multiple candidate output sequences 112, as described in further detail below with reference to FIG. 2.
- the system 102 is then configured to process the candidate output sequences 112 using the response processing system 108 to generate the response 114 based on determining a confidence score for each of the candidate output sequences 112, as described in further detail below with reference to FIG. 3.
- the system 100 can generate more accurate responses based on the language model following a chain of reasoning for the particular reasoning task. Additionally, by generating multiple candidate decoding paths in parallel and selecting an output sequence of a candidate decoding path based on a confidence score, the system 100 can leverage the language model’s intrinsic reasoning capability to generate a response to a query without manual prompting.
- the language model can be tasked with generating a response to a reasoning problem provided by a user in the communication session.
- the language model can be tasked with generating a number in response to a mathematical problem provided by the user.
- the language model system 106 is configured to process a query input 110 to generate the candidate output sequences 112 by performing multiple decoding steps.
- the language model 202 can be an auto-regressive neural network that is configured to generate the multiple candidate output sequences 112. That is, at each decoding step, the language model 202 is configured to generate a respective output token for each candidate output sequence 112.
- the system uses the language model 202 to process a current input sequence and the respective output tokens generated at any preceding decoding steps to generate a score distribution over the vocabulary of output tokens.
- the current input sequence can include one or more input tokens of the sequence of input tokens of the query input 110.
- the system 100 selects the output token using the score distribution.
- the first score is the score in the score distribution generated by processing the current input sequence for the corresponding decoding step at which the system selected the response token (e.g., the score corresponding to the top-1 token).
- the second score can be the second highest score in the score distribution generated by processing the current input sequence for the corresponding decoding step at which the system selected the response token (e.g., the score corresponding to the top-2 token).
- the system 108 can then combine each score difference for each of the response tokens to determine the confidence score. That is, the confidence score can be the average of the score differences for all of the identified answer tokens. For example, in the case where the system 108 identifies the answer tokens as being the number “60,” the system can average the score differences for each of the output tokens of the answer tokens (e.g., “6” and “0”).
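- As an illustration only, the score-difference computation described above can be sketched as follows. The function name and the assumption that the per-step probability distributions and the positions of the response tokens are already available are hypothetical, not part of this specification.

```python
# Minimal sketch of the per-path confidence score: for each response (answer)
# token, take the gap between the top-1 and top-2 probabilities at the
# decoding step where that token was selected, then average the gaps.
from typing import List, Sequence


def path_confidence(step_probs: Sequence[Sequence[float]],
                    answer_steps: Sequence[int]) -> float:
    """step_probs[t]: probability distribution over the vocabulary at decoding
    step t; answer_steps: the decoding steps at which the identified response
    tokens were selected."""
    gaps: List[float] = []
    for t in answer_steps:
        top1, top2 = sorted(step_probs[t], reverse=True)[:2]
        gaps.append(top1 - top2)
    # A larger average gap means the model was more decisive about its answer
    # tokens along this decoding path.
    return sum(gaps) / len(gaps) if gaps else 0.0
```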
- the system can output the second candidate output sequence 112 or the fourth candidate output sequence 112 as the response 114.
- the system can aggregate the confidence scores for the candidate answer across the multiple reasoning paths to generate an aggregated confidence score for the candidate answer.
- the system can then select the candidate answer having the highest aggregated confidence score as the final response to the query. For example, the system can determine the aggregated confidence score for a candidate answer a that appears in k reasoning paths as a sum of the confidence scores of the corresponding response tokens of each of those candidate output sequences, as shown by Equation 2: $\tilde{\Delta}_a = \sum_{k} \Delta_{k,a}$, where $\Delta_{k,a}$ denotes the confidence score for answer a along the k-th decoding path.
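- A minimal sketch of this aggregation step, assuming each decoding path has already been reduced to a (candidate answer, confidence score) pair; the helper name is illustrative only.

```python
# Sum the per-path confidence scores of every path that produced the same
# candidate answer, then pick the answer with the largest aggregated score.
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_answer_scores(paths: List[Tuple[str, float]]) -> Dict[str, float]:
    totals: Dict[str, float] = defaultdict(float)
    for answer, confidence in paths:
        totals[answer] += confidence
    return dict(totals)


# For example, three of four paths agree on "60", so "60" wins even though the
# single "5" path is individually fairly confident.
scores = aggregate_answer_scores([("60", 0.9), ("60", 0.7), ("5", 0.8), ("60", 0.2)])
best_answer = max(scores, key=scores.get)  # "60"
```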
- FIG. 4 is a flow diagram of an example process for generating a response to a query input using a language model.
- the process 400 will be described as being performed by a system of one or more computers located in one or more locations.
- a system e g., the system 102 of FIG. 1, appropriately programmed, can perform the process 400.
- the system can receive a query input including a sequence of input tokens (402).
- the input tokens can represent input text, an image, an audio signal, or a combination thereof.
- the system can receive the query input from a user during a communication session with a dialogue agent (e.g., a chatbot).
- the communication session includes communications between a user device of the user and the dialogue agent.
- the system can process the query input using a language model neural network to generate multiple candidate output sequences (404).
- the language model neural network can be an auto-regressive neural network that generates each candidate output sequence by generating a respective output token at each of a sequence of decoding steps.
- Each decoding step can include processing a current input sequence that includes the sequence of input tokens and the respective output tokens generated at any preceding decoding steps to generate a score distribution over the vocabulary of output tokens.
- the system can select an output token using the score distribution over the vocabulary of output tokens.
- For a first candidate output sequence, the system selects the output token with the highest score in the score distribution generated by processing the current input sequence for the first decoding step. For each other candidate output sequence, the system selects a respective other output token that is different than the output token with the highest score in the score distribution. That is, the system selects an output token having a k-th highest score in the score distribution generated by processing the current input sequence for the first decoding step, where k has a different value for each candidate output sequence.
- At each subsequent decoding step, the system can select the output token having the highest score in the score distribution for each output sequence. That is, each subsequent decoding step is performed in parallel for each candidate output sequence. In some examples, each subsequent decoding step is performed by a respective set of hardware accelerators.
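- The following sketch illustrates the branching strategy of step 404: branch on the k highest-scoring tokens at the first decoding step, then continue each branch greedily. The `score_distribution` callable stands in for a call to the language model and is a hypothetical interface, not an API defined by this specification.

```python
from typing import Callable, List, Sequence


def branch_then_decode_greedily(
    input_tokens: Sequence[int],
    score_distribution: Callable[[Sequence[int]], Sequence[float]],
    num_branches: int,
    max_steps: int,
    eos_token: int,
) -> List[List[int]]:
    # First decoding step: take the k = num_branches highest-scoring tokens,
    # one per candidate decoding path.
    first_scores = score_distribution(list(input_tokens))
    first_tokens = sorted(range(len(first_scores)),
                          key=lambda i: first_scores[i], reverse=True)[:num_branches]
    candidates: List[List[int]] = []
    # The branches are independent, so each loop iteration could run in
    # parallel, e.g., on its own set of hardware accelerators.
    for first_token in first_tokens:
        output = [first_token]
        # Subsequent decoding steps: greedily select the highest-scoring token.
        while len(output) < max_steps and output[-1] != eos_token:
            scores = score_distribution(list(input_tokens) + output)
            output.append(max(range(len(scores)), key=lambda i: scores[i]))
        candidates.append(output)
    return candidates
```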
- the system can identify, as response tokens, a subset of the output tokens in the candidate output sequence (406).
- the system can prompt the language model to generate a final output sequence of output tokens conditioned on the query input and the candidate output sequence.
- the system can then identify the response tokens based on mapping the subset of the output tokens to a subset of the final output sequence. For example, the system can identify the response tokens based on the subset of output tokens being a last occurring span of output tokens within the output sequence that includes only output tokens of a predetermined type.
- the predetermined type can be output tokens that represent a number.
- the predetermined type can be an option of multiple options. That is, the query input includes input tokens that represent each option of the multiple options.
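- As a sketch of one way to identify a numeric answer span, the decoded candidate text can be scanned for its last occurring run of digits; the regular expression and function name below are illustrative, not part of this specification.

```python
import re
from typing import Optional


def last_number_span(decoded_text: str) -> Optional[str]:
    """Return the last occurring number in the decoded candidate output, used
    as the answer span when the expected answer type is a number."""
    matches = re.findall(r"\d+(?:\.\d+)?", decoded_text)
    return matches[-1] if matches else None


# e.g. last_number_span("3 apples plus 5 apples, so the total is 8") -> "8"
```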
- the system can determine, from scores assigned by the language model neural network while generating the response tokens, a confidence score for the candidate output sequence (408). In particular, for each response token, the system can determine a score difference between a first score for the response token in the score distribution generated by processing the current input sequence for the decoding step at which the response token was selected and a second score for another output token of the vocabulary of output tokens. The second score is the second highest score in the score distribution generated by processing the current input sequence for the decoding step at which the response token was selected. The system can then combine each score difference for each of the response tokens to determine the confidence score.
- the system can select one of the candidate output sequences based on the confidence scores for the response tokens (410).
- the system can generate a response to the query input from the selected candidate output sequence (412). For example, the system can output the response to the query input during the communication session by the dialogue agent. The system can display the response to the query input on a user interface of the user device.
- FIG. 5 is a diagram of the results of implementing chain-of-thought prompting for different language models to generate a response to a query input.
- the graph of FIG. 5 illustrates the performance of generating a response to a query input computed using an accuracy metric that measures the quality of generated responses for different language models (e.g., the PaLM instruction-tuned model “IT,” and the PaLM-2 pretrained models “xs,” “s,” “m,” and “l”).
- the generated responses are based on the number of top-k tokens for decoding paths used by the particular language model, where top-k refers to the k-th highest scoring token of the output tokens in the score distribution.
- each model improves in terms of accuracy when a relatively higher number of top-k tokens is selected for decoding paths. Therefore, with relatively larger values of k, each model can generate a response with increased accuracy, which demonstrates that in some cases the correct chain-of-thought paths may be generated by a language model, but the system may rank the correct chain-of-thought path relatively lower in comparison to other paths during the decoding stage.
- Table 9 below compares the results of implementing the chain-of-thought prompting (“COT-DECODING”) and strictly performing greedy decoding (“GREEDY”) according to an accuracy metric.
- Tables 10-12 below compare the results of implementing the chain-of-thought prompting (“COT-DECODING”) and strictly performing greedy decoding (“GREEDY”) according to an accuracy metric for the particular tasks of Tables 1-7 above.
- the system can generate a response to a query more accurately in comparison with other conventional techniques.
- FIG. 6 is a diagram of the results of implementing different chain-of-thought prompting methods to generate a response to a query input.
- the graph of FIG. 6 illustrates the performance of generating a response to a query input computed using aggregated-path chain-of-thought decoding (as described in FIG. 3), maximum-path chain-of-thought decoding, zero-shot chain-of-thought prompting, few-shot chain-of-thought prompting, and greedy decoding.
- the graph shows that the performance of each response generation technique improves in terms of accuracy based on performing aggregated chain-of-thought decoding. Therefore, with relatively larger values of k, the system can generate a response with increased accuracy, which demonstrates that maximizing a confidence score of the decoding path based on selecting among multiple candidate output sequences, not merely the highest-scoring candidate output sequence, improves accuracy in comparison to other techniques, such as zero-shot chain-of-thought prompting, few-shot chain-of-thought prompting, and greedy decoding.
- Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus.
- the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- data processing apparatus refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- the apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code.
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
- the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations.
- the index database can include multiple collections of data, each of which may be organized and accessed differently.
- the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions.
- an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
- Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser.
- a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
- Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
- Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework.
- Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client.
- Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Methods, systems, and apparatus for generating a response to a query input. In one aspect, a method includes receiving a query input including a sequence of input tokens and processing the query input using a language model neural network to generate multiple candidate output sequences. Each candidate output sequence includes a sequence of output tokens from a vocabulary of output tokens. For each candidate output sequence, the method further includes identifying, as response tokens, a subset of the output tokens in the candidate output sequence and determining, from scores assigned by the language model neural network while generating the response tokens, a confidence score for the candidate output sequence. The method further includes selecting one of the candidate output sequences based on the confidence scores for the response tokens and generating a response to the query input from the selected candidate output sequence.
Description
EXTRACTING RESPONSES FROM LANGUAGE MODEL NEURAL
NETWORKS BY SCORING RESPONSE TOKENS
BACKGROUND
[1] This specification relates to processing inputs using neural networks to generate output sequences.
[2] Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
SUMMARY
[3] This specification describes a system implemented as computer programs on one or more computers in one or more locations that generates a response to a query input by processing the query input using a language model neural network.
[4] According to a first aspect there is provided a computer-implemented method that comprises receiving a query input comprising a sequence of input tokens and processing the query input using a language model neural network to generate multiple candidate output sequences, where each candidate output sequence includes a sequence of output tokens from a vocabulary of output tokens.
[5] For example, the sequence of input tokens can represent input text, an image, an audio signal, or a combination thereof.
[6] The method further comprises, for each candidate output sequence, identifying a subset of the output tokens in the candidate output sequence as response tokens and determining a confidence score for the candidate output sequence from scores assigned by the language model neural network.
[7] The method further comprises selecting one of the candidate output sequences based on the confidence scores for the response tokens and generating a response to the query input from the selected candidate output sequence.
[8] In some implementations, the query input is received from a user during a communication session with a dialogue agent, and the method further includes outputting, by the dialogue agent, the response to the query input during the communication session.
[9] In some implementations, the communication session includes communications between a user device of the user and the dialogue agent.
[10] In some implementations, outputting, by the dialogue agent, the response to the query input includes displaying the response to the query input on a user interface of the user device.
[11] In some implementations, the language model is an auto-regressive neural network that generates each candidate output sequence by generating a respective output token at each of a sequence of decoding steps, including, at each decoding step: processing a current input sequence comprising the sequence of input tokens and the respective output tokens generated at any preceding decoding steps to generate a score distribution over the vocabulary of output tokens, and selecting an output token using the score distribution over the vocabulary of output tokens.
[12] In some implementations, processing the query input using a language model neural network to generate a plurality of candidate output sequences includes, at a first decoding step of the sequence of decoding steps: for a first candidate output sequence of the plurality of candidate output sequences, selecting the output token having the highest score in the score distribution generated by processing the current input sequence for the first decoding step, and for each other candidate output sequence of the plurality of candidate output sequences, selecting a respective other output token that is different than the output token having the highest score in the score distribution generated by processing the current input sequence for the first decoding step.
[13] In some implementations, selecting the respective other output token that is different than the output token having the highest score further includes: selecting an output token having a k-th highest score in the score distribution generated by processing the current input sequence for the first decoding step, where k has a different value for each candidate output sequence.
[14] In some implementations, processing the query input using a language model neural network to generate a plurality of candidate output sequences includes, at each subsequent decoding step of the sequence of decoding steps: for each output sequence of the plurality of candidate output sequences, selecting the output token having the highest score in the score distribution generated by processing the current input sequence for the subsequent decoding step.
[15] In some implementations, each subsequent decoding step is performed in parallel for each candidate output sequence of the multiple output sequences.
[16] In some implementations, each subsequent decoding step for each candidate output sequence of the plurality of output sequences is performed by a respective set of hardware accelerators.
[17] In some implementations, determining, from scores assigned by the language model neural network while generating the response tokens, the confidence score for the candidate output sequence further includes: for each response token, determining a score difference between: (i) a first score for the response token in the score distribution generated by processing the current input sequence for the decoding step at which the response token was selected, and (ii) a second score for another output token of the vocabulary of output tokens, wherein the second score is the second highest score in the score distribution generated by processing the current input sequence for the decoding step at which the response token was selected, and combining each score difference for each of the response tokens to determine the confidence score.
[18] In some implementations, identifying, as response tokens, the subset of the output tokens in the candidate output sequence includes: prompting the language model neural network to generate a final output sequence of output tokens conditioned on the query input and the candidate output sequence, and identifying the response tokens based on mapping the subset of the output tokens to a subset of the final output sequence.
[19] In some implementations, the method includes identifying the response tokens based on the subset of output tokens being a last occurring span of output tokens within the output sequence that comprises only output tokens of a predetermined type.
[20] In some implementations, the predetermined type is output tokens that represent a number.
[21] In some implementations, the predetermined type is output tokens that represent an option of a plurality of options, where the query input comprises input tokens that represent each option of the plurality of options.
[22] Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.
[23] Chat bots (e.g., dialogue agents) are computer software that generate natural language responses to user queries. For example, the chat bots can be large language model (LLM)-based chat bots that generate responses based on outputs generated by one or more LLMs.
[24] In order to generate a response, some conventional systems can implement chain-of-thought prompting (e.g., few-shot prompting, zero-shot prompting, etc.). In particular,
chain-of-thought prompting includes prompting the language model by dividing the generation task into smaller steps using one or more prompts in order to generate a particular decoding path. In this way, the system can prompt the language model through a chain of reasoning for a particular task. However, chain-of-thought prompting requires selecting the one or more prompts and iteratively calling the language model, which can result in relatively large variance in responses and in increased latency due to the multiple calls to the language model.
[25] On the other hand, some other conventional systems can implement greedy decoding techniques to generate a response to a query by selecting each output token of a particular output sequence based on a score distribution. In particular, for each decoding step, the system can “greedily” select the output token from the vocabulary with the highest score from the score distribution generated during the particular decoding step. However, greedy decoding techniques do not allow the language model to reason through responses as part of the decoding path. Rather, these techniques cause the language model to directly generate a response without a chain of reasoning, which can result in decreased accuracy and decreased confidence in the accuracy of the response generated by the language model.
[26] In contrast, the system described uses a language model to generate multiple output sequences of multiple decoding paths in parallel by leveraging the language model’s intrinsic reasoning abilities, such that the language model can reason through responses to generate the output sequence without chain-of-thought prompting and without iteratively calling the model, resulting in decreased latency and increased accuracy in generating responses.
[27] In particular, the system can generate alternative candidate decoding paths using the language model by non-greedily selecting an output token at a particular decoding step. For example, at a first decoding step, the system can generate multiple alternative candidate paths by selecting the tokens having the k highest scores in the score distribution, instead of greedily selecting the output token with the highest score. For subsequent decoding steps, the system can implement greedy decoding for each of the alternative candidate decoding paths, such that the candidate decoding paths have different current input sequences at each subsequent decoding step. In another example, the system can generate the multiple alternative decoding paths by non-greedily selecting the output token at one or more decoding steps other than the first decoding step. In this way, the system allows the language model to effectively “reason” through multiple candidate decoding paths in order for the system to generate responses more confidently and more accurately. Additionally,
the system can improve the performance of an already trained language model neural network without requiring any additional training of the language model. Instead of training a relatively large language model to perform a particular task, which can be computationally expensive and, in some cases, not feasible for large models, the system can instead leverage the logits of the language model to perform the chain-of-thought reasoning in order to generate a response without having to update the parameters of the language model.
[28] As an example, a user can provide a query to the chat bot, such as a math problem, and the system can generate multiple different candidate output sequences as answers to the query. In particular, the system generates a strictly greedy decoding path, where the system selects the output token with the highest score at each decoding step, along with one or more alternative candidate decoding paths. The system then selects a candidate decoding path based on generating confidence scores for the identified response tokens of each of the output sequences corresponding to the decoding paths, and the system processes the selected output sequence to generate the response to the math problem.
[29] Overall, by generating multiple candidate decoding paths in parallel and selecting an output sequence of a candidate decoding path based on a confidence score, the system can leverage the language model’s intrinsic reasoning capability to generate a response to a query with increased confidence and decreased latency.
[30] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[31] FIG. 1 shows an example response generation system.
[32] FIG. 2 shows the operations of an example language model system.
[33] FIG. 3 shows the operations of an example response processing system.
[34] FIG. 4 is a flow diagram of an example process for generating a response to a query input using a language model.
[35] FIG. 5 is a diagram of the results of implementing chain-of-thought prompting for different language models to generate a response to a query input.
[36] FIG. 6 is a diagram of the results of implementing different chain-of-thought prompting methods to generate a response to a query input.
[37] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
[38] This specification describes a system implemented as computer programs on one or more computers in one or more locations that performs a machine learning task on a network input to generate a network output for the machine learning task.
[39] The machine learning task can be any machine learning task. For example, the machine learning task can be a task that operates on a network input that is an input sequence, i.e., a collection of multiple elements, to generate a network output for the network input.
[40] Some examples of machine learning tasks that the system can be configured to perform follow.
[41] In some cases, the neural network is a neural network that is configured to perform an image processing task, i.e., receive an input image and to process the input image to generate a network output for the input image. As another example, the task can be image embedding generation and the output generated by the neural network can be a numeric embedding of the input image. As yet another example, the task can be object detection and the output generated by the neural network can identify locations in the input image at which particular types of objects are depicted. As yet another example, the task can be image segmentation and the output generated by the neural network can assign each pixel of the input image to a category from a set of categories. In some other cases, the neural network is a neural network that is configured to perform an image generation task, where the input is a conditioning input and the output is a sequence of intensity value inputs for the pixels of an image.
[42] As one example, the task may be a neural machine translation task. For example, if the input to the neural network is a sequence of text, e.g., a sequence of words, phrases, characters, or word pieces, in one language, the output generated by the neural network may be a translation of the sequence of text into another language, i.e., a sequence of text in the other language that is a translation of the input sequence of text. The vocabulary for the input tokens may be words, wordpieces or characters of the first language, and the vocabulary for the output tokens may be words, wordpieces or characters of the other language. As a particular example, the task may be a multi-lingual machine translation task, where a single neural network is configured to translate between multiple different source language - target language pairs. In this example, the source language text may be augmented with an identifier that indicates the target language into which the neural network should translate the source language text.
[43] The vocabulary of tokens can include any of a variety of tokens that represent text symbols or other symbols. For example, the vocabulary of tokens can include one or more characters, sub-words, words, punctuation marks, numbers, or other symbols that appear in a corpus of natural language text and/or computer code. Additionally, the vocabulary of tokens can include tokens that can represent data other than text. For example, the vocabulary of tokens can include image tokens that represent a discrete set of image patch embeddings of an image that can be generated by an image encoder neural network based on processing the image patches of the image. As another example, the vocabulary of tokens can include video tokens that represent spatial-temporal dynamics of a video that can be generated by a video encoder neural network based on processing the video frames of the video. As another example, the vocabulary of tokens can include audio tokens that represent code vectors in a codebook of a quantizer, e.g., a residual vector quantizer. As an example, the language model neural network can generate text sequences, i.e., each output sequence generated by the language model neural network is a sequence of text tokens from a vocabulary of text tokens that includes, e.g., one or more of characters, subwords, words, punctuation marks, numbers, or other symbols that appear in natural language text.
[44] Some implementations may be used for automatic code generation. For example, the input tokens may represent words, wordpieces or characters in a first natural language and the output tokens may represent instructions in a computer programming or markup language, or instructions for controlling an application program to perform a task, e.g., build a data item such as an image or web page.
[45] As another example, the task may be an audio processing task. For example, if the input to the neural network is a sequence representing a spoken utterance, the output generated by the neural network may be a score for each of a set of pieces of text, each score representing an estimated likelihood that the piece of text is the correct transcript for the utterance. As another example, if the input to the neural network is a sequence representing a spoken utterance, the output generated by the neural network can indicate whether a particular word or phrase ("hotword") was spoken in the utterance. As another example, if the input to the neural network is a sequence representing a spoken utterance, the output generated by the neural network can be a classification of the spoken utterance
into one of a plurality of categories, for example an identity of the natural language in which the utterance was spoken.
[46] As another example, the task can be a natural language processing or understanding task, e.g., an entailment task, a paraphrase task, a textual similarity task, a sentiment task, a sentence completion task, a grammaticality task, and so on, that operates on a sequence of text in some natural language.
[47] As another example, the task can be a text to speech task, where the input is text in a natural language or features of text in a natural language and the network output is a spectrogram, a waveform, or other data defining audio of the text being spoken in the natural language.
[48] In some cases, the machine learning task is a multi-modal processing task that requires processing multi-modal data. In general, multi-modal data is a combination of two or more different types of data, e.g., two or more of audio data, image data, text data, or graph data. As one example the multi-modal data may comprise audio-visual data, comprising a combination of pixels of an image or of video and audio data representing values of a digitized audio waveform. As another example the multi-modal data may comprise a combination of i) text data representing text in a natural language and ii) pixels of an image or of video or audio data representing values of an audio waveform. Optionally, but not necessarily, the different types of data may represent the same or overlapping objects using the different modalities (types), and when processing multimodal data the data may be mapped into a common embedding space.
[49] As a particular example, the task is a multi-modal processing task that requires processing both text and image inputs, so that the neural network includes both a computer vision neural network and a text processing neural network. That is, the target output to be generated by the computer vision neural network for a given image depends on one or more outputs generated by the text processing neural network for one or more corresponding text inputs (and vice versa). Examples of such tasks include open-vocabulary image classification, open-vocabulary object detection, image captioning, text-based image search, image-based retrieval, and so on.
[50] More generally, the multi-modal processing task may correspond to any of the tasks previously described for any of the types of data making up the multi-modal combination. For example, an accuracy of the previously described tasks may be increased when the task is applied to multi-modal data combining the data for which the task has been previously described and another type of data. For example detection or classification of
an object or event may be improved when data of multiple different types (modalities) is processed.
[51] In practice, for any of these examples, the task to be performed by the neural network can be defined by (at least a part of) the network input, e.g., that is in the form of a prompt or a request, received by the neural network. In other words, the neural network will be able to perform any of these tasks when an appropriate prompt or request is received.
[52] FIG. 1 shows an example system 100. The system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
[53] The system 100 includes a response generation system 102 that receives a query input 110 from a user device 104 and processes the query input 110 using a language model system 106 and a response processing system 108 to generate a response 114 to the query input 110 for a machine learning task, e.g., one of the tasks described above. The language model system 106 includes a language model configured to process the query input 110 to generate multiple candidate output sequences 112, as described in further detail below with reference to FIG. 2.
[54] In particular, system 102 can use the language model of the language model system 106 as a conversational agent as part of a communication session with a user via the user device 104. The communication session can include communications between the user device 104 and a dialogue agent. The dialogue agent can be a conversational agent (e.g., a chatbot) that is configured to generate a response to a user query. The system 100 can output the response 114 during the communication session. For example, the system 100 can display the response 114 on a user interface of the user device 104.
[55] The query input 110 includes a sequence of input tokens that can represent a query submitted by a user. That is, the sequence of input tokens can represent input text, an image, an audio signal, or a combination thereof.
[56] The response 114 is a sequence of output tokens from a vocabulary of tokens that represents a response to the query. As used in this specification, a token can represent data of one or more modalities, and the output sequence can include tokens representing data of one or more modalities. As a particular example, the input sequence can include tokens representing text, images, or video, and the output sequence can include tokens representing text. In particular, the language model can generate text sequences, i.e., each response 114 generated by the language model is a sequence of text tokens from a vocabulary of text
tokens that includes, e.g., one or more of characters, sub-words, words, punctuation marks, numbers, or other symbols that appear in natural language text.
[57] For example, as shown in FIG. 1, the query input 110 can include multiple tokens that represent a query (“I have 3 apples, my dad has 2 more apples than me, how many apples do we have in total?”). In this case, the system 102 is configured to process the query input 110 to generate the response 114 (“We have 8 apples in total”). In another example, such as the case where a user is submitting the query input 110 to a conversational agent, the query input 110 can include the user’s current input and previous information from the conversation (e.g., one or more previous user queries, one or more previous system responses, or a combination thereof).
[58] In general, the system 102 is configured to use the language model system 106 to process the query input 110 to generate multiple candidate output sequences 112 by allowing the language model to “reason” through multiple decoding paths. Each candidate output sequence 112 includes a sequence of output tokens from the vocabulary of output tokens that represent a candidate response to the query. In particular, the system can “non-greedily” select an output token for a first decoding step in order to allow the language model to generate the multiple candidate output sequences 112, as described in further detail below with reference to FIG. 2. The system 102 is then configured to process the candidate output sequences 112 using the response processing system 108 to generate the response 114 based on determining a confidence score for each of the candidate output sequences 112, as described in further detail below with reference to FIG. 3.
[59] Advantageously, by allowing the language model to generate multiple candidate decoding paths based on non-greedily selecting at least one of the output tokens of the output sequence, the system 100 can generate more accurate responses based on the language model following a chain of reasoning for the particular reasoning task. Additionally, by generating multiple candidate decoding paths in parallel and selecting an output sequence of a candidate decoding path based on a confidence score, the system 100 can leverage the language model’s intrinsic reasoning capability to generate a response to a query without manual prompting.
[60] The language model can be any appropriate language model neural network with any appropriate neural network architecture. As a particular example, the language model neural network can be an auto-regressive Transformer-based neural network that includes multiple attention blocks that each apply a self-attention operation and an output
subnetwork that processes an output of the last attention block to generate a score distribution.
[61] The Transformer-based neural network can include a sequence of attention blocks, and, during the processing of a given input sequence, each attention block in the sequence receives a respective input hidden state for each input token in the given input sequence. The attention block then updates each of the hidden states at least in part by applying self-attention to generate a respective output hidden state for each of the input tokens. The input hidden states for the first attention block are embeddings of the input tokens in the input sequence and the input hidden states for each subsequent attention block are the output hidden states generated by the preceding attention block. The output subnetwork can process the output hidden state generated by the last attention block in the sequence for the last input token in the input sequence to generate the score distribution.
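As a rough illustration only (not the patented architecture), the output subnetwork described above can be viewed as a projection of the final token's last hidden state onto the vocabulary followed by a softmax. In the sketch below, the function name, the single unembedding matrix, and the array shapes are illustrative assumptions.

```python
# Hedged sketch: mapping the last hidden state to a score distribution over the
# vocabulary. The helper name, shapes, and single projection matrix are assumptions.
import numpy as np

def output_score_distribution(last_hidden_state, unembedding_matrix):
    # last_hidden_state: hidden state of the final input token after the last
    # attention block, shape (d_model,).
    # unembedding_matrix: learned projection to the vocabulary, shape (vocab_size, d_model).
    logits = unembedding_matrix @ last_hidden_state
    shifted = logits - logits.max()              # numerically stable softmax
    probabilities = np.exp(shifted)
    return probabilities / probabilities.sum()   # score distribution over output tokens
```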
[62] In particular, the language model can be a multi-modal language model having a particular architecture configured to process multi-modal data (e.g., PaLM 2, Gemma, PaliGemma, Gemini, etc.).
[63] In some examples, the system or another system can pre-train the language model on a language modeling task, e.g., a task that requires predicting, given a current sequence of text tokens, the next token that follows the current sequence in the training data. Optionally, the system or another system can then fine-tune the language model on one or more fine-tuning tasks, e.g., through one or more of supervised fine-tuning, instruction tuning, reinforcement learning from human feedback, reinforcement learning from AI feedback, and so on.
[64] For example, the language model can be tasked with generating a response to a reasoning problem provided by a user in the communication session. As another example, the language model can be tasked with generating a number in response to a mathematical problem provided by the user.
[65] In one example, the language model can be tasked with selecting an option from a set of options provided by a user via the user device 104 (e.g., a multiple-choice question). As another example, the language model can be tasked with generating an answer to a question that is posed by the user about a visual input. As another example, the language model can be tasked with generating a text caption for a visual input. As another example, the language model can be tasked with detecting objects in an input image.
[66] FIG. 2 shows the operations of an example language model system. The language model system 106 is an example of a system implemented as computer programs on one or
more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
[67] The language model system 106 is configured to process a query input 110 to generate the candidate output sequences 112 by performing multiple decoding steps.
[68] The language model system 106 includes a language model 202 configured to process the query input 110 to generate a score distribution over a vocabulary of output tokens (e.g., the language model described in FIG. 1), an initial decoding engine 204 configured to perform a first decoding step, and a greedy decoding engine 206 configured to perform subsequent decoding steps to generate the candidate output sequences 112. Each candidate output sequence 112 includes a sequence of output tokens from the vocabulary of output tokens, where each output token can correspond to a particular decoding step of the multiple decoding steps.
[69] In particular, the language model 202 can be an auto-regressive neural network that is configured to generate the multiple candidate output sequences 112. That is, at each decoding step, the language model 202 is configured to generate a respective output token for each candidate output sequence 112. In particular, the system uses the language model 202 to process a current input sequence and the respective output tokens generated at any preceding decoding steps to generate a score distribution over the vocabulary of output tokens. The current input sequence can include one or more input tokens of the sequence of input tokens of the query input 110. The system 100 then selects the output token using the score distribution.
[70] In particular, for the first decoding step, the language model 202 can process the current input sequence to generate the corresponding score distribution for the first decoding step. The initial decoding engine 204 is then configured to select multiple output tokens based on their respective scores from the score distribution. Each output token is referred to as a top-k token, where top-k refers to the k-th highest scoring token of the output tokens in the score distribution. For example, the top-1 token refers to the output token with the first highest score, the top-2 token refers to the output token with the second highest score, and the top-3 token refers to the output token with the third highest score.
[71] In particular, the initial decoding engine 204 selects the output token that has the highest score in the corresponding score distribution (e.g., “top-1”). For example, as shown in FIG. 2, the initial decoding engine 204 selects the top-1 output token (e.g., the output token with the highest score) that represents a chain of thought that starts with the number “5.” Additionally, the initial decoding engine 204 is configured to select one or more
respective other output tokens that are different than the output token with the highest score. For example, as shown in FIG. 2, the initial decoding engine 204 can select the other k output tokens (e.g., “top-2,” “top-3,” “top-4,” “top-5,” etc.) that can each correspond to different chains of thought. Here, k has a different value for each candidate output sequence 112.
[72] At each subsequent decoding step, the language model system 106 is configured to process the current input sequence and the selected output tokens to generate a next output token for each output sequence (e.g., each chain of thought) using the greedy decoding engine 206. That is, each subsequent decoding step is performed in parallel for each candidate output sequence 112. In some examples, each subsequent decoding step for each candidate output sequence is performed by a respective set of hardware accelerators. In other words, the described techniques are optimized for execution in parallel, e.g., across multiple parallel processing devices, e.g., hardware accelerators, or using batch processing on a single hardware accelerator, because after the first decoding step is performed, the candidate output sequences can be generated in parallel without requiring any communication between the respective devices that are generating the different output sequences. That is, the subsequent decoding steps can be performed independently and in parallel for the different candidate output sequences.
[73] For example, for the second decoding step, the language model system 106 is configured to process the current input sequence and the top-1 output token for a first chain of thought to generate a second output token for the first candidate output sequence 112 (“5 apples”) while processing the current input sequence and the top-2 output token for a second chain of thought to generate a second output token for the second candidate output sequence 112 (“I have 3 apples, my dad has 2 more apples than me, so he has 5 apples. 3+5=8. We have 8 apples in total”).
[74] In particular, the greedy decoding engine 206 is configured to “greedily” select the next output token for each output sequence. In particular, the greedy decoding engine 206 greedily selects the next output token for the output sequence by selecting the output token having the highest score in the score distribution generated by processing the current input sequence for the subsequent decoding step. For example, as shown in FIG. 2, the greedy decoding engine 206 can select the next output token representing “apples” instead of the output token representing “\n” based on the output token representing “apples” having the highest score of the score distribution for the particular decoding step. In some examples, the system 100 can begin greedy decoding after the second decoding step to allow for an even higher number of alternative chains of thought.
[75] Thus, by selecting the highest scoring output token along with alternative output tokens for the first decoding step and performing greedy decoding for the subsequent decoding steps, the system is able to explore alternative chains of thought for the language model 202 that resemble chain-of-thought prompting without the latency of iteratively calling the language model 202, ultimately relying on the language model’s intrinsic reasoning ability to generate multiple different candidate output sequences for the response to the query.
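The branch-then-greedy decoding flow of FIG. 2 can be summarized with the minimal sketch below. It is not the patented implementation: the score_distribution helper (a single forward pass of the trained language model over the current input sequence, returning a probability distribution over the vocabulary of output tokens), the eos_id stopping condition, and the fixed max_steps budget are assumptions made for illustration.

```python
# Hedged sketch of top-k branching at the first decoding step followed by greedy
# decoding at subsequent steps; helper names and stopping conditions are assumptions.
import numpy as np

def generate_candidate_sequences(query_tokens, k, max_steps, eos_id, score_distribution):
    # First decoding step: keep the top-1 token and the next k-1 highest-scoring
    # alternatives instead of keeping only the single highest-scoring token.
    first_scores = np.asarray(score_distribution(list(query_tokens)))
    branch_tokens = np.argsort(first_scores)[::-1][:k]

    candidates = []
    for first_token in branch_tokens:
        sequence = [int(first_token)]
        # Subsequent decoding steps: ordinary greedy decoding, independently per branch.
        for _ in range(max_steps - 1):
            scores = score_distribution(list(query_tokens) + sequence)
            next_token = int(np.argmax(scores))
            sequence.append(next_token)
            if next_token == eos_id:
                break
        candidates.append(sequence)
    return candidates
```

Because each branch depends only on its own prefix after the first decoding step, the inner loop can run independently per branch, for example batched on a single accelerator or dispatched to a respective set of hardware accelerators, mirroring the parallel execution described above.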
[77] FIG. 3 shows the operations of an example response processing system. The response processing system 108 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
[78] The response processing system 108 is configured to process the candidate output sequences using a processing engine 302 to generate the response 114. The processing engine 302 is configured to select one of the candidate output sequences 112 based on a confidence score and to generate the response 114 from the selected candidate output sequence 112. The confidence score measures a confidence of the language model 202 in generating the candidate output sequence 112 based on the chain of reasoning.
[79] In particular, for each candidate output sequence 112, the system 108 is configured to identify a subset of the output tokens as response tokens. That is, the response tokens represent a portion of the candidate response to the query. For example, for a candidate output sequence 112 that represents a particular response (“We have 5 apples in total”), the system 108 can use the processing engine 302 to identify the subset of output tokens that represent the number “5” as response tokens.
[80] In some examples, the system can identify the response tokens based on the subset of output tokens being the last occurring span of output tokens within the output sequence that includes only output tokens of a predetermined type. For example, the predetermined type can be a number. That is, the system 108 can identify the answer tokens as the last numerical value in the candidate output sequence 112 for a mathematical reasoning task. In another example, the predetermined type can be output tokens that represent an option of multiple options. In this case, the query input 110 can include input tokens that represent each option of the multiple options (e.g., a multiple-choice question).
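As one way this heuristic might be realized, the hedged sketch below scans a candidate output sequence for the last run of tokens whose decoded text is purely numeric; the detokenize helper (mapping output token ids back to text pieces) is an assumed utility rather than a component of the described system.

```python
# Hedged sketch: identify the last occurring span of numeric output tokens as the
# response tokens. The detokenize helper is an assumption for illustration.
import re

def extract_last_number_span(candidate_tokens, detokenize):
    """Return (start, end) positions of the last run of output tokens whose decoded
    text consists only of digits, or None if the candidate contains no number."""
    pieces = [detokenize([token]).strip() for token in candidate_tokens]
    spans, start = [], None
    for position, piece in enumerate(pieces):
        if piece and re.fullmatch(r"[0-9]+", piece):
            if start is None:
                start = position                # a numeric run begins here
        elif start is not None:
            spans.append((start, position))     # a numeric run just ended
            start = None
    if start is not None:
        spans.append((start, len(pieces)))
    return spans[-1] if spans else None
```

For the example above, the returned span would cover the token representing “5”; for an answer that is split across multiple digit tokens, the span covers all of them.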
[81] In some examples, the system can generate the final output sequence by prompting the language model 202 with the query input 110 and the selected candidate output sequence 112. For example, a user can submit a prompt to the language model 202 asking the language model 202 to provide the response (e.g., the query, the candidate output sequence, followed by “so the answer is ...”), and the language model can generate a final output sequence that represents “8 apples.”
[82] Based on the final output sequence, the system can identify the response tokens based on mapping the subset of output tokens to a subset of the final output sequence. For example, the system 100 can compare the candidate output sequence (“I have 3 apples, my dad has 2 more apples than me, so he has 5 apples. 3+5=8. We have 8 apples in total”) to the final output sequence (“8 apples”), and the system can identify the matching subset of response tokens (“8 apples”) from the candidate output sequence. In some examples, the system can identify the subset of response tokens based on their token identifier corresponding to the vocabulary of output tokens.
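The mapping from the final output sequence back to the candidate output sequence can be approximated by a token-id match, as in the hedged sketch below; matching the answer tokens by token identifier and keeping the last occurrence is an assumption about how the comparison might be performed.

```python
# Hedged sketch: map the answer tokens of the final output sequence (e.g., "8 apples")
# back to their positions in the candidate output sequence by matching token ids.
def map_final_answer_to_candidate(candidate_tokens, final_answer_tokens):
    """Return (start, end) of the last occurrence of the final answer tokens within the
    candidate output sequence, or None if no occurrence is found."""
    span = None
    window = len(final_answer_tokens)
    for start in range(len(candidate_tokens) - window + 1):
        if list(candidate_tokens[start:start + window]) == list(final_answer_tokens):
            span = (start, start + window)   # keep the last (most recent) match
    return span
```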
[83] The system 108 can then determine a confidence score for each candidate output sequence 112 based on the scores assigned by the language model 202 to the identified response tokens.
[84] In particular, the system can determine the confidence score based on the scores corresponding to the top-2 tokens generated at a decoding step t, as shown below by Equation 1:

(1)    Δ = (1/n) Σ_{x_t ∈ answer tokens} ( p(x_t^1 | x_{<t}) − p(x_t^2 | x_{<t}) )

where n represents the number of identified answer tokens, x_t is the output token generated at the particular decoding step t, x_t^1 is the top-1 token at the particular decoding step t, and x_t^2 is the top-2 token at the particular decoding step t. That is, for each of the response tokens for the candidate output sequence 112, the system 108 determines a score difference between the first score p(x_t^1 | x_{<t}) for the response token and a second score p(x_t^2 | x_{<t}) for another output token, given x_t being part of the answer tokens.
[85] The first score is the score in the score distribution generated by processing the current input sequence for the corresponding decoding step at which the system selected the response token (e.g., the score corresponding to the top-1 token). The second score can be the second highest score in the score distribution generated by processing the current input sequence for the corresponding decoding step at which the system selected the response token (e.g., the score corresponding to the top-2 token). The system 108 can then combine each score difference for each of the response tokens to determine the confidence score. That is, the confidence score can be the average of the score differences for all relevant x_t
tokens. For example, in the case where the system 108 identifies the answer tokens as being the number “60,” the system can average the score differences for each of the output tokens of the answer tokens (e.g., “6” and “0”).
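A hedged sketch of the Equation 1 computation is shown below. The score_distribution helper is the same assumed forward-pass utility used in the decoding sketch above, and in practice the per-step distributions would typically be cached during decoding rather than recomputed.

```python
# Hedged sketch of Equation 1: average the difference between the score of the selected
# response token and the second-highest score, over the identified answer tokens.
import numpy as np

def answer_confidence(query_tokens, candidate_tokens, answer_span, score_distribution):
    start, end = answer_span                     # positions of the response tokens
    margins = []
    for t in range(start, end):
        prefix = list(query_tokens) + list(candidate_tokens[:t])
        scores = np.asarray(score_distribution(prefix))
        first = scores[candidate_tokens[t]]      # first score: the selected response token
        second = np.sort(scores)[-2]             # second score: second highest in the distribution
        margins.append(first - second)
    return float(np.mean(margins))               # e.g., averaged over the tokens for "6" and "0"
```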
[86] In the case where the system 100 prompts the language model 202 to identify the answer tokens that are the same as the tokens of the final output sequence, the system can determine the confidence score of the answer from the original decoding path (e.g., the candidate output sequence 112).
[87] The system 108 can then select one of the candidate output sequences 112 based on the respective confidence scores for the response tokens. For example, selecting one of the candidate output sequences 112 can be based on a confidence score threshold. As shown in FIG. 3, the confidence score can be measured based on a certainty range, where the system assigns each of the identified answer tokens a color gradient corresponding to the certainty range. The system can then output the selected candidate output sequence 112 as the response 114 to the query input 110.
[88] In this example, the second candidate output sequence (“I have 3 apples, my dad has 2 more apples than me, so he has 5 apples. 3+5=8. We have 8 apples in total”) and the fourth candidate output sequence (“You have 3 apples, your dad has 2 more apples than you so he has 5 apples. 3+5=8. You have 8 apples in total”) have the highest confidence scores based on the certainty metric. As such, the system can output the second candidate output sequence 112 or the fourth candidate output sequence 112 as the response 114.
[89] Further examples of using the top-k decoding paths to generate a response to a query input are shown in Tables 1-7 below.
Table 5
Table 7
[90] In some examples, rather than selecting the maximum scoring candidate answer across the reasoning paths, when a candidate answer occurs in more than one reasoning path, the system can aggregate the confidence scores for the candidate answer across the multiple reasoning paths to generate an aggregated confidence score for the candidate answer. The system can then select the candidate answer having the highest aggregated confidence score as the final response to the query. For example, the system can determine the aggregated confidence score for a candidate answer a that appears in k reasoning paths based on a sum of the confidence scores of the corresponding tokens of each of the candidate output sequences, as shown below by Equation 2:
(2)    Δ̃_a = Σ_k Δ_{k,a}

where Δ̃_a is the aggregated confidence score for the candidate answer a, and where Δ_{k,a} is the confidence score for the k-th decoding path whose output tokens include the candidate answer a.
[91] An example of generating the aggregated confidence score by aggregating confidence scores across reasoning paths is shown in Table 8 below. In particular,
for the correct answer 18, the system can identify that the answer is represented in four candidate output sequences (the decoding paths k = 0, 6, 8, and 9), each with corresponding confidence scores (Δ = 0.994 (k = 0), Δ = 0.911 (k = 6), Δ = 0.584 (k = 8), and Δ = 0.999 (k = 9)). As such, the aggregated confidence score of the candidate answer 18 is approximately 3.5, while the incorrect answer options 14, 16, 20, 10 have a much lower confidence score. Thus, in some cases, by generating an aggregated confidence score based on multiple candidate output sequences according to the corresponding confidence scores, the system can generate the response with increased reliability and increased accuracy.
Table 8
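To make Equation 2 and the Table 8 example concrete, the hedged sketch below sums the per-path confidence scores for each candidate answer and returns the answer with the largest total. The helper name, the use of the decoded answer text as the aggregation key, and the lower-scoring incorrect answer in the usage example are illustrative assumptions.

```python
# Hedged sketch of Equation 2: aggregate confidence scores for a candidate answer across
# the decoding paths whose output tokens include that answer.
from collections import defaultdict

def select_answer_by_aggregated_confidence(per_path_answers):
    """per_path_answers: one (candidate answer text, confidence score) pair per decoding
    path. Returns the answer with the highest aggregated confidence score."""
    totals = defaultdict(float)
    for answer, confidence in per_path_answers:
        totals[answer] += confidence             # sum over paths containing the answer
    return max(totals, key=totals.get)

# Worked values from the Table 8 description: the answer "18" appears in decoding paths
# k = 0, 6, 8, and 9 with confidence scores 0.994, 0.911, 0.584, and 0.999 (sum ~3.5).
# The single 0.2 score for "14" is a hypothetical lower-scoring path for illustration.
example = [("18", 0.994), ("18", 0.911), ("18", 0.584), ("18", 0.999), ("14", 0.2)]
assert select_answer_by_aggregated_confidence(example) == "18"
```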
[92] FIG. 4 is a flow diagram of an example process for generating a response to a query input using a language model. For convenience, the process 400 will be described as being performed by a system of one or more computers located in one or more locations. For example, a system, e.g., the system 102 of FIG. 1, appropriately programmed, can perform the process 400.
[93] The system can receive a query input including a sequence of input tokens (402). The input tokens can represent input text, an image, an audio signal, or a combination thereof. For example, the system can receive the query input from a user during a communication session with a dialogue agent (e.g., a chatbot). The communication session includes communications between a user device of the user and the dialogue agent.
[94] The system can process the query input using a language model neural network to generate multiple candidate output sequences (404). The language model neural network
can be an auto-regressive neural network that generates each candidate output sequence by generating a respective output token at each of a sequence of decoding steps.
[95] Each decoding step can include processing a current input sequence that includes the sequence of input tokens and the respective output tokens generated at any preceding decoding steps to generate a score distribution over the vocabulary of output tokens. The system can select an output token using the score distribution over the vocabulary of output tokens.
[96] In particular, at a first decoding step, the system selects the output token with the highest score in the score distribution generated by processing the current input sequence for the first decoding step. For each other candidate output sequence, the system selects a respective other output token that is different than the output token with the highest score in the score distribution. That is, the system selects an output token having a k-th highest score in the score distribution generated by processing the current input sequence for the first decoding step, where k has a different value for each candidate output sequence.
[97] At each subsequent decoding step, the system can select the output token having the highest score in the score distribution for each output sequence. That is, each subsequent decoding step is performed in parallel for each candidate output sequence. In some examples, each subsequent decoding step is performed by a respective set of hardware accelerators.
[98] For each candidate output sequence, the system can identify, as response tokens, a subset of the output tokens in the candidate output sequence (406).
[99] In particular, the system can prompt the language model to generate a final output sequence of output tokens conditioned on the query input and the candidate output sequence. The system can then identify the response tokens based on mapping the subset of the output tokens to a subset of the final output sequence. For example, the system can identify the response tokens based on the subset of output tokens being a last occurring span of output tokens within the output sequence that includes only output tokens of a predetermined type. In some examples, the predetermined type can be output tokens that represent a number. In some other examples, the predetermined type can be an option of multiple options. That is, the query input includes input tokens that represent each option of the multiple options.
[100] The system can determine, from scores assigned by the language model neural network while generating the response tokens, a confidence score for the candidate output sequence (408).
[101] In particular, for each response token, the system can determine a score difference between a first score for the response token in the score distribution generated by processing the current input sequence for the decoding step at which the response token was selected and a second score for another output token of the vocabulary of output tokens. The second score is the second highest score in the score distribution generated by processing the current input sequence for the decoding step at which the response token was selected. The system can then combine each score difference for each of the response tokens to determine the confidence score.
[102] The system can select one of the candidate output sequences based on the confidence scores for the response tokens (410).
[103] The system can generate a response to the query input from the selected candidate output sequence (412). For example, the system can output the response to the query input during the communication session by the dialogue agent. The system can display the response to the query input on a user interface of the user device.
[104] FIG. 5 is a diagram of the results of implementing chain-of-thought prompting for different language models to generate a response to a query input.
[105] The graph of FIG. 5 illustrates the performance of generating a response to a query input computed using an accuracy metric that measures the quality of generated responses for different language models (e.g., the PaLM instruction-tuned model “IT,” and PaLM-2 pretrained models “xs,” “s,” “m,” and “l”). In particular, the generated responses are based on the number of top-k tokens for decoding paths used by the particular language model, where top-k refers to the k-th highest scoring token of the output tokens in the score distribution.
[106] In particular, the graph shows that the performance of each model improves in terms of accuracy based on selecting a relatively higher number of top-k tokens for decoding paths. Therefore, with relatively larger numbers of top-k tokens, each model can generate a response with increased accuracy, which demonstrates that in some cases, the correct chain-of-thought paths may be generated by a language model, but the system may rank the correct chain-of-thought path relatively lower in comparison to other paths during the decoding stage. Table 9 below compares the results of implementing the chain-of-thought prompting (“COT-DECODING”) and strictly performing greedy decoding (“GREEDY”) according to an accuracy metric.
Table 9
[107] Additionally, Tables 10-12 below compare the results of implementing the chain-of-thought prompting (“COT-DECODING”) and strictly performing greedy decoding (“GREEDY”) according to an accuracy metric for the particular tasks of Tables 1-7 above.
Table 12
[108] Thus, by implementing the described response generation techniques using a language model and chain-of-thought prompting, the system can generate a response to a query more accurately in comparison with other conventional techniques.
[109] FIG. 6 is a diagram of the results of implementing different chain-of-thought prompting methods to generate a response to a query input.
[110] The graph of FIG. 6 illustrates the performance of generating a response to a query input computed using aggregated path chain-of-thought decoding (as described in FIG. 3), maximum-path chain-of-thought decoding, zero-shot chain-of-thought prompting, few-shot chain-of-thought prompting, and greedy decoding.
[111] In particular, the graph shows that the performance of each response generation technique improves in terms of accuracy based on performing aggregated chain-of-thought
decoding. Therefore, with relatively larger numbers of top-k tokens, the system can generate a response with increased accuracy, which demonstrates that maximizing a confidence score of the decoding path based on selecting multiple candidate output sequences, not merely the highest-scoring candidate output sequences, improves accuracy in comparison to other techniques, such as zero-shot chain-of-thought prompting, few-shot chain-of-thought prompting, and greedy decoding.
[112] This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
[113] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
[114] The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates
an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
[115] A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
[116] In this specification, the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, the index database can include multiple collections of data, each of which may be organized and accessed differently.
[117] Similarly, in this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
[118] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
[119] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
[120] Computer readable media suitable for storing computer program instructions and data include all forms of non volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
[121] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
[122] Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
[123] Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework.
[124] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
[125] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
[126] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the
combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
[127] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[128] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
[129] What is claimed is:
Claims
1. A computer-implemented method, the method comprising: receiving a query input comprising a sequence of input tokens; processing the query input using a language model neural network to generate a plurality of candidate output sequences, each candidate output sequence comprising a sequence of output tokens from a vocabulary of output tokens; for each candidate output sequence: identifying, as response tokens, a subset of the output tokens in the candidate output sequence; and determining, from scores assigned by the language model neural network while generating the response tokens, a confidence score for the candidate output sequence; selecting one of the candidate output sequences based on the confidence scores for the response tokens; and generating a response to the query input from the selected candidate output sequence.
2. The method of claim 1, wherein the sequence of input tokens represent input text, an image, an audio signal, or a combination thereof.
3. The method of claim 1, wherein the query input is received from a user during a communication session with a dialogue agent, and wherein the method further comprises: outputting, by the dialogue agent, the response to the query input during the communication session.
4. The method of claim 3, wherein the communication session comprises communications between a user device of the user and the dialogue agent.
5. The method of claim 4, wherein outputting, by the dialogue agent, the response to the query input comprises: displaying the response to the query input on a user interface of the user device.
6. The method of any preceding claim, wherein the language model neural network is an auto-regressive neural network that generates each candidate output sequence by generating a respective output token at each of a sequence of decoding steps, comprising, at each decoding step: processing a current input sequence comprising the sequence of input tokens and the respective output tokens generated at any preceding decoding steps to generate a score distribution over the vocabulary of output tokens; and selecting an output token using the score distribution over the vocabulary of output tokens.
7. The method of claim 6, wherein processing the query input using a language model neural network to generate a plurality of candidate output sequences comprises, at a first decoding step of the sequence of decoding steps: for a first candidate output sequence of the plurality of candidate output sequences, selecting the output token having the highest score in the score distribution generated by processing the current input sequence for the first decoding step; and for each other candidate output sequence of the plurality of candidate output sequences, selecting a respective other output token that is different than the output token having the highest score in the score distribution generated by processing the current input sequence for the first decoding step.
8. The method of claim 7, wherein selecting the respective other output token that is different than the output token having the highest score further comprises: selecting an output token having a k-th highest score in the score distribution generated by processing the current input sequence for the first decoding step, wherein k has a different value for each candidate output sequence.
9. The method of claim 7 or claim 8, wherein processing the query input using a language model neural network to generate a plurality of candidate output sequences comprises, at each subsequent decoding step of the sequence of decoding steps: for each output sequence of the plurality of candidate output sequences, selecting the output token having the highest score in the score distribution generated by processing the current input sequence for the subsequent decoding step.
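A minimal sketch of the branching strategy of claims 7-9 (illustrative only): the candidates diverge only at the first decoding step, where candidate k is forced to start from the k-th highest-scoring token, and every candidate then continues greedily. `lm_scores` is the same hypothetical per-step scoring function as above.

```python
import numpy as np

def branch_then_greedy(lm_scores, input_tokens, num_candidates, eos_id, max_steps=64):
    """Diverge at step one (candidate k starts from the k-th best token), then decode greedily."""
    first_scores = np.asarray(lm_scores(list(input_tokens)))
    first_tokens = np.argsort(first_scores)[::-1][:num_candidates]  # top tokens, best first
    candidates = []
    for first_token in first_tokens:
        output = [int(first_token)]
        while len(output) < max_steps and output[-1] != eos_id:
            scores = np.asarray(lm_scores(list(input_tokens) + output))
            output.append(int(np.argmax(scores)))  # greedy after the first step
        candidates.append(output)
    return candidates
```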
10. The method of any one of claims 6-9, wherein each subsequent decoding step is performed in parallel for each candidate output sequence of the plurality of output sequences.
11. The method of any one of claims 6-10, wherein each subsequent decoding step for each candidate output sequence of the plurality of output sequences is performed by a respective set of hardware accelerators.
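Because the candidates do not interact after the first decoding step, their greedy continuations can run concurrently, one worker (or one set of accelerators) per candidate, as claims 10 and 11 describe. A thread-pool sketch, assuming a hypothetical `decode_one` callable that runs one candidate's continuation:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_candidates_in_parallel(decode_one, first_tokens):
    """Run each candidate's greedy continuation concurrently, one worker per candidate."""
    with ThreadPoolExecutor(max_workers=max(1, len(first_tokens))) as pool:
        return list(pool.map(decode_one, first_tokens))
```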
12. The method of any one of claims 6-11, wherein determining, from scores assigned by the language model neural network while generating the response tokens, the confidence score for the candidate output sequence further comprises: for each response token, determining a score difference between:
(i) a first score for the response token in the score distribution generated by processing the current input sequence for the decoding step at which the response token was selected, and
(ii) a second score for another output token of the vocabulary of output tokens, wherein the second score is the second highest score in the score distribution generated by processing the current input sequence for the decoding step at which the response token was selected; and combining each score difference for each of the response tokens to determine the confidence score.
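A minimal sketch of the margin-based confidence in claim 12 (illustrative only): for each response token, the score the model gave the selected token is compared with the second-highest score in that step's distribution, and the per-token gaps are combined, here by summing, which is just one possible combination.

```python
import numpy as np

def confidence_from_margins(step_scores, selected_tokens, response_positions):
    """Sum, over the response tokens, of (score of the selected token) minus
    (second-highest score in that decoding step's distribution)."""
    total = 0.0
    for pos in response_positions:
        scores = np.asarray(step_scores[pos], dtype=float)
        first = float(scores[selected_tokens[pos]])    # score of the chosen response token
        second = float(np.partition(scores, -2)[-2])   # second-highest score at that step
        total += first - second
    return total
```

A large positive total indicates that the model preferred the response tokens by a wide margin over their nearest alternatives.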
13. The method of any preceding claim, wherein identifying, as response tokens, the subset of the output tokens in the candidate output sequence comprises: prompting the language model neural network to generate a final output sequence of output tokens conditioned on the query input and the candidate output sequence; and identifying the response tokens based on mapping the subset of the output tokens to a subset of the final output sequence.
14. The method of any one of claims 1-12, further comprising: identifying the response tokens based on the subset of output tokens being a last occurring span of output tokens within the output sequence that comprises only output tokens of a predetermined type.
15. The method of claim 14, wherein the predetermined type is output tokens that represent a number.
16. The method of claim 14, wherein the predetermined type is output tokens that represent an option of a plurality of options, wherein the query input comprises input tokens that represent each option of the plurality of options.
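A minimal sketch of the identification rule in claims 14-15 (illustrative only): find the last run of output tokens that detokenize to numeric text, e.g. a final numeric answer at the end of a longer worked solution. `detokenize` is a hypothetical stand-in for the tokenizer's decode function; a similar scan over a fixed set of option strings would cover claim 16.

```python
import re

def last_numeric_span(output_tokens, detokenize):
    """Indices of the last contiguous run of tokens whose text is numeric."""
    is_numeric = [bool(re.fullmatch(r"[0-9.,\-]+", detokenize([t]).strip()))
                  for t in output_tokens]
    end = None
    for i in range(len(output_tokens) - 1, -1, -1):
        if is_numeric[i] and end is None:
            end = i + 1                        # exclusive end of the last numeric run
        elif not is_numeric[i] and end is not None:
            return range(i + 1, end)           # run starts just after this token
    return range(0, end) if end is not None else range(0, 0)
```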
17. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising: receiving a query input comprising a sequence of input tokens; processing the query input using a language model neural network to generate a plurality of candidate output sequences, each candidate output sequence comprising a sequence of output tokens from a vocabulary of output tokens; for each candidate output sequence: identifying, as response tokens, a subset of the output tokens in the candidate output sequence; and determining, from scores assigned by the language model neural network while generating the response tokens, a confidence score for the candidate output sequence; selecting one of the candidate output sequences based on the confidence scores for the response tokens; and generating a response to the query input from the selected candidate output sequence.
18. The system of claim 17, wherein the sequence of input tokens represents input text, an image, an audio signal, or a combination thereof.
19. One or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: receiving a query input comprising a sequence of input tokens; processing the query input using a language model neural network to generate a plurality of candidate output sequences, each candidate output sequence comprising a sequence of output tokens from a vocabulary of output tokens; for each candidate output sequence: identifying, as response tokens, a subset of the output tokens in the candidate output sequence; and determining, from scores assigned by the language model neural network while generating the response tokens, a confidence score for the candidate output sequence; selecting one of the candidate output sequences based on the confidence scores for the response tokens; and generating a response to the query input from the selected candidate output sequence.
20. The one or more non-transitory computer-readable storage media of claim 19, wherein the sequence of input tokens represents input text, an image, an audio signal, or a combination thereof.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463548821P | 2024-02-01 | 2024-02-01 | |
| US63/548,821 | 2024-02-01 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025166268A1 (en) | 2025-08-07 |
Family
ID=96591493
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/014169 (Pending) | Extracting responses from language model neural networks by scoring response tokens | 2024-02-01 | 2025-01-31 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025166268A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150332673A1 (en) * | 2014-05-13 | 2015-11-19 | Nuance Communications, Inc. | Revising language model scores based on semantic class hypotheses |
| US20220139380A1 (en) * | 2020-10-30 | 2022-05-05 | Microsoft Technology Licensing, Llc | Internal language model for e2e models |
| US20230029590A1 (en) * | 2021-07-28 | 2023-02-02 | Google Llc | Evaluating output sequences using an auto-regressive language model neural network |
| US20230214633A1 (en) * | 2021-12-30 | 2023-07-06 | Naver Corporation | Neural ranking model for generating sparse representations for information retrieval |
| US20230244934A1 (en) * | 2022-01-31 | 2023-08-03 | Deepmind Technologies Limited | Augmenting machine learning language models using search engine results |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25749515; Country of ref document: EP; Kind code of ref document: A1 |