
WO2025048818A1 - Machine-learned output using interleaved unimodal models - Google Patents

Machine-learned output using interleaved unimodal models Download PDF

Info

Publication number
WO2025048818A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
unimodal
output
representation
multimodal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2023/031640
Other languages
French (fr)
Inventor
Victoria ZAYATS
Feifan CHEN
Dirk Ryan Padfield
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Priority to PCT/US2023/031640 priority Critical patent/WO2025048818A1/en
Publication of WO2025048818A1 publication Critical patent/WO2025048818A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0475 Generative networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/096 Transfer learning

Definitions

  • the present disclosure relates generally to processing multimodal input. More particularly, the present disclosure relates to using interleaved unimodal models to process multimodal input and generate unimodal or multimodal output.
  • Large language models (LLMs) have shown impressive capabilities in generating novel and/or arbitrary text and can be controlled or guided with textual prompts.
  • However, LLMs are fundamentally text-based models, which limits their usage to applications involving text.
  • One example aspect of the present disclosure is directed to a method for generating an output from a machine-learned model.
  • the method can include receiving, by a processor, a multimodal input from one or more input sources and providing, by the processor, a first input portion having a first modality from the multimodal input to a first unimodal model associated with the first modality.
  • the method can also include providing, by the processor, a second input portion having a second modality from the multimodal input to a second unimodal model associated with the second modality and providing, by the processor, a first representation from the first unimodal model to the second unimodal model, the first representation representing an output of a layer of the first unimodal model.
  • the method can further include providing, by the processor, a second representation from the second unimodal model to the first unimodal model, the second representation representing an output of a layer of the second unimodal model and processing, by the processor, the second input portion using the second unimodal model based at least in part on the first representation.
  • the method can also include processing, by the processor, the first input portion using the first unimodal model based at least in part on the second representation and generating, by the processor, an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model.
  • the computing system can include one or more processors and a non-transitory, computer-readable medium.
  • the computer-readable medium can include a machine-learned model comprising a first unimodal model associated with a first modality and a second unimodal model associated with a second modality and instructions that, when executed by the one or more processors, cause the one or more processors to perform operations.
  • the operations can include receiving a multimodal input from one or more input sources, providing a first input portion having the first modality from the multimodal input to the first unimodal model, and providing a second input portion having the second modality from the multimodal input to the second unimodal model.
  • the operations can also include providing a first representation from the first unimodal model to the second unimodal model, the first representation representing an output of a layer of the first unimodal model, providing a second representation from the second unimodal model to the first unimodal model, the second representation representing an output of a layer of the second unimodal model, and processing the second input portion using the second unimodal model based at least in part on the first representation.
  • the operations can further include processing the first input portion using the first unimodal model based at least in part on the second representation and generating an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model.
  • Another example aspect of the present disclosure is directed to a non-transitory, computer-readable medium that can include a machine-learned model comprising a first unimodal model associated with a first modality and a second unimodal model associated with a second modality and instructions that, when executed by one or more processors, cause the one or more processors to perform operations.
  • the operations can include receiving a multimodal input from one or more input sources, providing a first input portion having the first modality from the multimodal input to the first unimodal model, and providing a second input portion having the second modality from the multimodal input to the second unimodal model.
  • the operations can also include providing a first representation from the first unimodal model to the second unimodal model, the first representation representing an output of a transformer layer of the first unimodal model, providing a second representation from the second unimodal model to the first unimodal model, the second representation representing an output of a layer of the second unimodal model, and processing the second input portion using the second unimodal model based at least in part on the first representation.
  • the operations can further include processing the first input portion using the first unimodal model based at least in part on the second representation and generating an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model.
  • Figure 2 depicts a flow chart diagram of an example method to perform multimodal output according to example embodiments of the present disclosure.
  • Figure 3A depicts a block diagram of an example computing system that performs multimodal output according to example embodiments of the present disclosure.
  • Figure 3B depicts a block diagram of an example computing device that performs multimodal output according to example embodiments of the present disclosure.
  • Figure 3C depicts a block diagram of an example computing device that performs multimodal output according to example embodiments of the present disclosure.
  • the present disclosure is directed to a multi-modal model that combines a set of unimodal models that are interleaved, or interconnected, for use in receiving multimodal input and outputting unimodal or multimodal outputs.
  • two or more unimodal models can be combined together using one or more modality adapter layers such as, for example, a cross-attention layer.
  • the proposed approach utilizes multiple pretrained unimodal networks, with each network being interconnected with the others (e.g., N networks being interconnected) and each network being associated with a single input modality (text, speech, images, video, time-series data, tabular data, and the like).
  • Using multimodal adapters (which can include both downsampling and upsampling blocks and learned retokenization blocks), representations from a network for one modality can be provided to other networks for use in processing inputs while being able to keep the other networks frozen.
  • Each pre-trained model retains its own capabilities and performance while enabling enhanced operation due to the combined model, such as combining two modalities through a bi-directional cross-attention layer, enabling the flow of information not just from one modality to the other but also in the reverse direction.
  • the bi-directional flow of information between the multiple unimodal models allows for the propagation of information from one modality to another and back while heavily relying on uni-modal pretrained representation.
  • the resulting overall model can make predictions based on any singular modality or combination of modalities as input while being able to generate an output in any of the modalities with minimal information loss from other modalities.
  • the use of a multimodal model enables computing systems to have an “all-in- one” solution that can take multimodal input and generate unimodal or multimodal output, without the need to access individual models separately to obtain only unimodal output (e.g., providing a text caption to a text processing model to receive a summary of the caption). This saves time and computing resources (e.g., bandwidth to access different models, time to access different storage locations of separate models, processing in parallel instead of in series, and the like).
  • the combination of unimodal models in the multimodal model can learn from each other unimodal model, resulting in a more robust unimodal or multimodal output that can be generated in parallel, instead of having to spend time and processing capability on using a series of unimodal inputs separately to obtain an output that does not have the cross-attention capabilities of the multimodal model.
  • the proposed technique for combining multiple unimodal models is flexible, enabling the extension of the technique to any number of unimodal models for any different combination of input modalities.
  • This enables the same multiple pretrained models to be reused in different combinations to service different combinations of input modalities.
  • the proposed technique prevents the need to train a new model for each possible combination. This reduces the amount of training cycles required, which represents a savings of computational resources such as reduced usage of processor cycles, memory usage, network bandwidth, etc.
  • Figure 1 depicts a block diagram of an example multimodal model 100 according to example embodiments of the present disclosure. While the example multimodal model 100 shown in Figure 1 illustrates a two-modality model that includes an audio language model (“AudioLM”) and a text processing model (“LLM”), it is to be understood that more than two modalities can be used in the multimodal model 100 and that any other models associated with other modalities, such as image input, video input, statistical input, time-series input, tabular input, and the like can be included in the multimodal model 100 in addition to or in replacement of the AudioLM and/or the LLM.
  • the at least two unimodal models 105 and 110 can be pre-trained models that have been trained specifically for the respective modality associated with each of the at least two unimodal models 105 and 110.
  • an LLM can be trained on a large training set of unlabeled or labeled text data.
  • the at least two unimodal models 105 and 110 can include one or more transformer layers 125 and 130.
  • Transformer layers are given as an example. Other types of layers (e.g., feed-forward layers) can be used in addition or alternatively to Transformer layers.
  • Transformer layers 125 and 130 can iteratively perform various functions on the modality-specific inputs 115 and 120, such as encoding input tokens to contextualize the tokens of the modality-specific inputs 115 and 120, which represent which parts of the input are relevant to one another.
  • Output encodings can then be passed to other transformer layers 125 and 130 to further (iteratively) encode the outputs.
  • the transformer layers 125 and 130 can also be decoding layers, which take encoding outputs from encoding layers and use the incorporated contextual information to generate an output sequence.
  • Each transformer layer 125 and 130 can include self-attention mechanisms to enhance certain portions of input data while diminishing other portions of the input data.
  • the self-attention mechanisms can allow the individual transformer layer to access information from any input sequence element, including far-away tokens.
  • each of the models 105 and 110 can include both an encoder portion and a decoder portion.
  • the cross-model connections can be present in both the encoder portion and the decoder portion, only the encoder portion, or only the decoder portion.
  • one or both of the models 105 and 110 can be decoder-only models.
  • the transformer layers 125 and 130 can provide outputs to multimodal adapters 135 and 140.
  • These multimodal adapters 135 and 140 can include downsampling blocks, upsampling blocks, and/or learned retokenization blocks.
  • Multimodal adapters 135 and 140 receive the representations from other unimodal models, process the data so that it can be used by the unimodal model the multimodal adapter is associated with, and then input the data (e.g., using cross-attention layers) into transformer layers 125 and 130 of the unimodal model the multimodal adapter is associated with.
  • Cross-attention layers can enable transformer layers 125 and 130 to have outside data, such as encodings from other unimodal networks, be included in the encoding or decoding of representations of inputs.
  • the multimodal adapters 135 and 140 can be bi-directional. In addition to receiving data from another unimodal network, the multimodal adapters 135 and 140 can send data from the current unimodal network to transformer layers 125 and 130 of other multimodal networks. This allows for propagation of data to and from each unimodal network while still relying primarily on the unimodal models 105 and 110.
  • the use of the interleaved architecture (or “zipper” architecture) enables multiple levels of information or encodings to be shared between the models 105 and 110.
  • Although Figure 1 illustrates the cross-model connections occurring between each layer of the models 105 and 110, the cross-model connections can be more sparse.
  • cross-model connections can occur between every third layer, or other periodicity, or at random intervals.
  • Although Figure 1 illustrates the cross-model connections occurring between layers of the same rank or depth, the cross-model connections can occur between layers of different rank or depth.
  • a connection between a layer of depth two in model 105 can connect to a layer of depth three in model 110, or vice versa.
  • Although Figure 1 illustrates only a single cross-model connection occurring at each layer, it is possible that multiple cross-model connections can occur at a single layer.
  • one layer in model 105 can connect (e.g., via adapter layers) to multiple layers in model 110, or vice versa.
  • Although Figure 1 illustrates the cross-model connections as being symmetrical, it is possible that the cross-model connections can be asymmetrical. For example, there may be more cross-model connections extending from model 105 to 110 than there are cross-model connections extending from model 110 to 105, or vice versa.
  • Although Figure 1 illustrates the models 105 and 110 as having the same number of layers, it is possible that the models 105 and 110 have different numbers of layers. For example, model 105 may be significantly deeper than model 110, or vice versa.
  • When unimodal models 105 and 110 send and receive data via multimodal adapters 135 and 140, the unimodal models 105 and 110 can be frozen, or configured so that weights internal to each of the unimodal models 105 and 110 are unchanged during training by the inclusion of data from other unimodal models in calculations made by the unimodal model.
  • weights of the cross-attention layers in the multimodal adapters 135 and 140 may be changed, but the weights in the transformer layers 125 and 130 remain unchanged.
  • One example multi-modal output is audio-video outputs.
  • One example audio-video output is a video completion output.
  • Another example multi-modal output is interleaved multi-modal output such as interleaved text-audio outputs.
  • Another example multiple-modal output is a text and image output.
  • the text and image output can include an image that is a graphical depiction of a map.
  • the text and image output can include an image with a textual caption.
  • Figure 2 depicts a flow chart diagram of an example method to perform multimodal output according to example embodiments of the present disclosure.
  • Figure 2 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement.
  • the various steps of the method 200 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure.
  • a computing system receives a multimodal input from one or more input sources.
  • the multimodal input can include two or more input portions, each being associated with different modalities (e.g., a first modality and a second modality), such as input text, input speech data, input images, and the like.
  • the input sources can be databases, user inputs, a tokenizer that parses input data into learnable tokens, and other data sources.
  • the computing system provides a first input portion having a first modality from the multimodal input to a first unimodal model associated with the first modality and provides a second input portion having a second modality from the multimodal input to a second unimodal model associated with the second modality.
  • the first input portion and the second input portion are two portions of the multimodal input.
  • the multimodal input may include other portions in addition to the first input portion and the second input portion.
  • the first input portion is provided to the first unimodal model that is associated with the modality of the first input portion.
  • the first input portion can include tokenized speech data and can therefore be provided to a unimodal model that is configured to process speech data and output, for example, a transcript of the data or novel speech data.
  • the second input portion is provided to the second unimodal model that is associated with the modality of the second input portion.
  • the computing system provides a first representation from the first unimodal model to the second unimodal model and a second representation from the second unimodal model to the first unimodal model, the representations representing an output of a layer of their respective unimodal model, such as a transformer layer.
  • each unimodal model includes one or more transformer layers that perform processing on inputs to that transformer layer. The outputs of this processing are then provided to other transformer layers in the unimodal model. The output can also be provided to a different unimodal model with a different modality, such as the second unimodal model.
  • a module such as a multimodal adapter can be used to receive the output from the transformer layer.
  • the multimodal adapter can include components such as downsampling blocks, upsampling blocks, learned retokenization blocks, and other components that can perform further processing on the output from the transformer layer before providing the transformer layer output to the second unimodal model.
  • the multimodal adapter provides the first representation to the second unimodal model using a cross-attention layer of the second unimodal model and can provide the second representation to the first unimodal model using a cross-attention layer of the first unimodal model.
  • the multimodal adapter is “bi-directional,” or has the ability to both receive data from the transformer layer of the first unimodal model for use by the second unimodal model and also to provide information from a cross-attention layer or transformer layer of the second unimodal model to the first unimodal model for use in processing by the first unimodal model.
  • the computing system processes the first and the second input portions using the second unimodal model and the first unimodal model, respectively, based at least in part on the representations. For example, based on input from a transformer layer from each of the first unimodal model and the second unimodal model (e.g., the first representation and the second representation), the other respective unimodal model can process its own input with the representation provided to it used as a condition or other cross-attention factor in the process.
  • the second unimodal model is frozen during this processing (e.g., the weights of the second unimodal model are not updated based on any feedback from the processing), but the cross-attention layer is not frozen, so that the second unimodal model can still have a layer that “learns” from other input modalities.
  • the first unimodal model can be frozen during this processing.
  • the computing system generates an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model.
  • the output includes an output in the modality of the model used for generating output, such as generating output in the second modality if only the second unimodal model is used for output.
  • a speech generation model can generate speech data based on a transcript generated in part based on language processing performed by the LLM.
  • the output can include a first output portion and a second output portion. Because of the sharing of representations between models, multimodal outputs can be generated.
  • This output can have the first output portion be in a first modality of one of the unimodal models (e.g., speech) and be output by that unimodal model.
  • The second output portion can be in a second modality of another of the unimodal models (e.g., text) and be output by that other unimodal model.
  • Figure 3A depicts a block diagram of an example computing system 300 that performs multimodal output according to example embodiments of the present disclosure.
  • the system 300 includes a user computing device 302, a server computing system 330, and a training computing system 350 that are communicatively coupled over a network 380.
  • the user computing device 302 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
  • the user computing device 302 can store or include one or more multimodal models 320.
  • the multimodal models 320 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models.
  • Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks, or other forms of neural networks.
  • Some example machine-learned models can leverage an attention mechanism such as self-attention.
  • some example machine-learned models can include multi-headed self-attention models (e.g., transformer models).
  • Example multimodal models 320 are discussed with reference to Figure 1.
  • the one or more multimodal models 320 can be received from the server computing system 330 over network 380, stored in the user computing device memory 314, and then used or otherwise implemented by the one or more processors 312.
  • the user computing device 302 can implement multiple parallel instances of a single multimodal model 320 (e.g., to perform parallel multimodal output across multiple instances of multimodal processing).
  • the user computing device 302 can also include one or more user input components 322 that receives user input.
  • the user input component 322 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus).
  • the touch-sensitive component can serve to implement a virtual keyboard.
  • Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.
  • the server computing system 330 includes one or more processors 332 and a memory 334.
  • the one or more processors 332 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 334 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 334 can store data 336 and instructions 338 which are executed by the processor 332 to cause the server computing system 330 to perform operations.
  • the server computing system 330 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 330 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
  • the server computing system 330 can store or otherwise include one or more multimodal models 340.
  • the models 340 can be or can otherwise include various machine-learned models.
  • Example machine-learned models include neural networks or other multi-layer non-linear models.
  • Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.
  • Some example machine-learned models can leverage an attention mechanism such as self-attention.
  • some example machine-learned models can include multi-headed self-attention models (e.g., transformer models).
  • Example models 340 are discussed with reference to Figure 1.
  • the user computing device 302 and/or the server computing system 330 can train the models 320 and/or 340 via interaction with the training computing system 350 that is communicatively coupled over the network 380.
  • the training computing system 350 can be separate from the server computing system 330 or can be a portion of the server computing system 330.
  • the training computing system 350 includes one or more processors 352 and a memory 354.
  • the one or more processors 352 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 354 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 354 can store data 356 and instructions 358 which are executed by the processor 352 to cause the training computing system 350 to perform operations.
  • the training computing system 350 includes or is otherwise implemented by one or more server computing devices.
  • performing backwards propagation of errors can include performing truncated backpropagation through time.
  • the model trainer 360 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
  • the model trainer 360 includes computer logic utilized to provide desired functionality.
  • the model trainer 360 can be implemented in hardware, firmware, and/or software controlling a general purpose processor.
  • the model trainer 360 includes program files stored on a storage device, loaded into a memory and executed by one or more processors.
  • the model trainer 360 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.
  • the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an upscaled image data output.
  • the machine-learned model(s) can process the image data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be text or natural language data.
  • the machine-learned model(s) can process the text or natural language data to generate an output.
  • the machine-learned model(s) can process the natural language data to generate a language encoding output.
  • the machine-learned model(s) can process the text or natural language data to generate a latent text embedding output.
  • the machine-learned model(s) can process the text or natural language data to generate a translation output.
  • the machine-learned model(s) can process the text or natural language data to generate a classification output.
  • the machine-learned model(s) can process the text or natural language data to generate a textual segmentation output.
  • the machine-learned model(s) can process the text or natural language data to generate a semantic intent output.
  • the machine-learned model(s) can process the text or natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.).
  • the machine-learned model(s) can process the text or natural language data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be speech data.
  • the machine-learned model(s) can process the speech data to generate an output.
  • the machine-learned model(s) can process the speech data to generate a speech recognition output.
  • the machine-learned model(s) can process the speech data to generate a speech translation output.
  • the machine-learned model(s) can process the speech data to generate a latent embedding output.
  • the machine-learned model(s) can process the speech data to generate an encoded speech output (e.g., an encoded and/or compressed representation of the speech data, etc.).
  • the input to the machine-learned model(s) of the present disclosure can be latent encoding data (e.g., a latent space representation of an input, etc.).
  • the machine-learned model(s) can process the latent encoding data to generate an output.
  • the machine-learned model(s) can process the latent encoding data to generate a recognition output.
  • the machine-learned model(s) can process the latent encoding data to generate a reconstruction output.
  • the machine-learned model(s) can process the latent encoding data to generate a search output.
  • the machine-learned model(s) can process the latent encoding data to generate a reclustering output.
  • the machine-learned model(s) can process the latent encoding data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be statistical data.
  • Statistical data can be, represent, or otherwise include data computed and/or calculated from some other data source.
  • the machine-learned model(s) can process the statistical data to generate an output.
  • the machine-learned model(s) can process the statistical data to generate a recognition output.
  • the machine-learned model(s) can process the statistical data to generate a prediction output.
  • the machine-learned model(s) can process the statistical data to generate a classification output.
  • the machine-learned model(s) can process the statistical data to generate a segmentation output.
  • the machine-learned model(s) can process the statistical data to generate a visualization output.
  • the machine-learned model(s) can process the statistical data to generate a diagnostic output.
  • the input to the machine-learned model(s) of the present disclosure can be sensor data.
  • the machine-learned model(s) can process the sensor data to generate an output.
  • the machine-learned model(s) can process the sensor data to generate a recognition output.
  • the machine-learned model(s) can process the sensor data to generate a prediction output.
  • the machine- learned model(s) can process the sensor data to generate a classification output.
  • the machine-learned model(s) can process the sensor data to generate a segmentation output.
  • the machine-learned model(s) can process the sensor data to generate a visualization output.
  • the machine-learned model(s) can process the sensor data to generate a diagnostic output.
  • the machine-learned model(s) can process the sensor data to generate a detection output.
  • the input includes visual data and the task is a computer vision task.
  • the input includes pixel data for one or more images and the task is an image processing task.
  • the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class.
  • the image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest.
  • the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories.
  • the set of categories can be foreground and background.
  • the set of categories can be object classes.
  • the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value.
  • the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
  • the input includes audio data representing a spoken utterance and the task is a speech recognition task.
  • the output may comprise a text output which is mapped to the spoken utterance.
  • the task comprises encrypting or decrypting input data.
  • the task comprises a microprocessor performance task, such as branch prediction or memory address translation.
  • Figure 3A illustrates one example computing system that can be used to implement the present disclosure.
  • the user computing device 302 can include the model trainer 360 and the training dataset 362.
  • the models 320 can be both trained and used locally at the user computing device 302.
  • the user computing device 302 can implement the model trainer 360 to personalize the models 320 based on user-specific data.
  • the computing device 10 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components.
  • each application can communicate with each device component using an API (e.g., a public API).
  • the API used by each application is specific to that application.
  • FIG. 3C depicts a block diagram of an example computing device 50 that performs according to example embodiments of the present disclosure.
  • the computing device 50 can be a user computing device or a server computing device.
  • the computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
  • the central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 3C, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50.
  • the central intelligence layer can communicate with a central device data layer.
  • the central device data layer can be a centralized repository of data for the computing device 50. As illustrated in Figure 3C, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

Methods and systems for generating an output from a machine-learned model are disclosed herein. The method includes receiving a multimodal input and providing a first input portion from the multimodal input to a first unimodal model. The method also includes providing a second input portion from the multimodal input to a second unimodal model and providing a first representation from the first unimodal model to the second unimodal model. The method further includes providing a second representation from the second unimodal model to the first unimodal model, and processing the second input portion using the second unimodal model based in part on the first representation. The method also includes processing the first input portion using the first unimodal model based in part on the second representation and generating an output from the machine-learned model based in part on the processing by the first and second unimodal models.

Description

MACHINE-LEARNED OUTPUT USING INTERLEAVED UNIMODAL MODELS
FIELD
[1] The present disclosure relates generally to processing multimodal input. More particularly, the present disclosure relates to using interleaved unimodal models to process multimodal input and generate unimodal or multimodal output.
BACKGROUND
[2] Large language models (“LLMs”) have shown impressive capabilities in generating novel and/or arbitrary text and can be controlled or guided with textual prompts. However, LLMs are fundamentally text-based models, which limits their usage to applications involving text.
[3] There are a number of multi-modal (e.g., text accompanied by another input, such as an image or video) use cases that can benefit from the breadth and expressive power of LLMs, such as speech-to-text translation or speech-to-speech translation. One popular approach in attempting to solve this problem is to fine-tune a pre-trained LLM with multimodal data. However, combining multiple modalities in a single model does not always succeed due to the interference of the various modalities with one another during training.
SUMMARY
[4] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.
[5] One example aspect of the present disclosure is directed to a method for generating an output from a machine-learned model. The method can include receiving, by a processor, a multimodal input from one or more input sources and providing, by the processor, a first input portion having a first modality from the multimodal input to a first unimodal model associated with the first modality. The method can also include providing, by the processor, a second input portion having a second modality from the multimodal input to a second unimodal model associated with the second modality and providing, by the processor, a first representation from the first unimodal model to the second unimodal model, the first representation representing an output of a layer of the first unimodal model. The method can further include providing, by the processor, a second representation from the second unimodal model to the first unimodal model, the second representation representing an output of a layer of the second unimodal model and processing, by the processor, the second input portion using the second unimodal model based at least in part on the first representation. The method can also include processing, by the processor, the first input portion using the first unimodal model based at least in part on the second representation and generating, by the processor, an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model.
[6] Another example aspect of the present disclosure is directed to a computing system. The computing system can include one or more processors and a non-transitory, computer-readable medium. The computer-readable medium can include a machine-learned model comprising a first unimodal model associated with a first modality and a second unimodal model associated with a second modality and instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. The operations can include receiving a multimodal input from one or more input sources, providing a first input portion having the first modality from the multimodal input to the first unimodal model, and providing a second input portion having the second modality from the multimodal input to the second unimodal model. The operations can also include providing a first representation from the first unimodal model to the second unimodal model, the first representation representing an output of a layer of the first unimodal model, providing a second representation from the second unimodal model to the first unimodal model, the second representation representing an output of a layer of the second unimodal model, and processing the second input portion using the second unimodal model based at least in part on the first representation. The operations can further include processing the first input portion using the first unimodal model based at least in part on the second representation and generating an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model.
[7] Another example aspect of the present disclosure is directed to a non-transitory, computer-readable medium that can include a machine-learned model comprising a first unimodal model associated with a first modality and a second unimodal model associated with a second modality and instructions that, when executed by one or more processors, cause the one or more processors to perform operations. The operations can include receiving a multimodal input from one or more input sources, providing a first input portion having the first modality from the multimodal input to the first unimodal model, and providing a second input portion having the second modality from the multimodal input to the second unimodal model. The operations can also include providing a first representation from the first unimodal model to the second unimodal model, the first representation representing an output of a transformer layer of the first unimodal model, providing a second representation from the second unimodal model to the first unimodal model, the second representation representing an output of a layer of the second unimodal model, and processing the second input portion using the second unimodal model based at least in part on the first representation. The operations can further include processing the first input portion using the first unimodal model based at least in part on the second representation and generating an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model.
[8] Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices.
[9] These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles.
BRIEF DESCRIPTION OF THE DRAWINGS
[10] Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which:
[11] Figure 1 depicts a block diagram of an example multimodal model according to example embodiments of the present disclosure.
[12] Figure 2 depicts a flow chart diagram of an example method to perform multimodal output according to example embodiments of the present disclosure.
[13] Figure 3A depicts a block diagram of an example computing system that performs multimodal output according to example embodiments of the present disclosure.
[14] Figure 3B depicts a block diagram of an example computing device that performs multimodal output according to example embodiments of the present disclosure.
[15] Figure 3C depicts a block diagram of an example computing device that performs multimodal output according to example embodiments of the present disclosure.
[16] Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations.
DETAILED DESCRIPTION Overview
[17] Generally, the present disclosure is directed to a multi-modal model that combines a set of unimodal models that are interleaved, or interconnected, for use in receiving multimodal input and outputting unimodal or multimodal outputs. For example, two or more unimodal models can be combined together using one or more modality adapter layers such as, for example, a cross-attention layer. By combining two unimodal models in such a fashion, the combined multi-modal model can both process multi-modal inputs and also produce multi-modal outputs.
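For illustration only, the following is a minimal sketch of that interleaved exchange using two toy stand-in models. The class name TinyUnimodalModel, the variable names, and the 16-dimensional feature size are assumptions made for the example, not elements of the disclosed model.

```python
# Hypothetical sketch: two toy "unimodal models" exchange intermediate
# representations before producing their outputs.
import torch
import torch.nn as nn

class TinyUnimodalModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.layer = nn.Linear(dim, dim)       # stands in for a transformer stack
        self.head = nn.Linear(2 * dim, dim)    # consumes own state plus peer representation

    def encode(self, x):
        # "output of a layer" that can be shared with the other unimodal model
        return torch.tanh(self.layer(x))

    def process(self, x, peer_representation):
        # condition this modality's processing on the other modality's representation
        return self.head(torch.cat([self.encode(x), peer_representation], dim=-1))

speech_model, text_model = TinyUnimodalModel(), TinyUnimodalModel()
speech_portion = torch.randn(1, 16)   # first input portion (e.g., speech features)
text_portion = torch.randn(1, 16)     # second input portion (e.g., text features)

speech_repr = speech_model.encode(speech_portion)   # first representation
text_repr = text_model.encode(text_portion)         # second representation

speech_out = speech_model.process(speech_portion, text_repr)
text_out = text_model.process(text_portion, speech_repr)
output = (speech_out, text_out)   # a unimodal or multimodal output can be assembled from either
```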
[18] More particularly, designing a model that avoids modality interference within a single model is a priority for improving multimodal representation learning. The proposed approach utilizes multiple pretrained unimodal networks, with each network being interconnected with the others (e.g., N networks being interconnected) and each network being associated with a single input modality (text, speech, images, video, time-series data, tabular data, and the like). Using multimodal adapters (which can include both downsampling and upsampling blocks and learned retokenization blocks), representations from a network for one modality can be provided to other networks for use in processing inputs while being able to keep the other networks frozen. Each pre-trained model retains its own capabilities and performance while enabling enhanced operation due to the combined model, such as combining two modalities through a bi-directional cross-attention layer, enabling the flow of information not just from one modality to the other but also in the reverse direction.
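As a hedged, concrete example of such an adapter, the sketch below downsamples the peer modality's layer output, lets the host modality cross-attend to it, and projects the result back to the host model's width. The module name MultimodalAdapter, the gating parameter, and all dimensions are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    def __init__(self, host_dim, peer_dim, bottleneck=64, heads=4):
        super().__init__()
        self.down = nn.Linear(peer_dim, bottleneck)              # downsampling block
        self.query_proj = nn.Linear(host_dim, bottleneck)
        self.cross_attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, host_dim)                 # upsampling block
        self.gate = nn.Parameter(torch.zeros(1))                  # starts as an identity pass-through

    def forward(self, host_hidden, peer_hidden):
        q = self.query_proj(host_hidden)
        kv = self.down(peer_hidden)
        attended, _ = self.cross_attn(q, kv, kv)
        # gated residual: host layer output plus cross-modal information
        return host_hidden + self.gate * self.up(attended)

# Usage: condition a text tower's layer output on an audio tower's layer output.
adapter = MultimodalAdapter(host_dim=512, peer_dim=256)
text_hidden = torch.randn(2, 10, 512)    # batch of 2, 10 text positions
audio_hidden = torch.randn(2, 40, 256)   # batch of 2, 40 audio positions
fused = adapter(text_hidden, audio_hidden)   # shape (2, 10, 512)
```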
[19] The bi-directional flow of information between the multiple unimodal models allows for the propagation of information from one modality to another and back while heavily relying on uni-modal pretrained representation. The resulting overall model can make predictions based on any singular modality or combination of modalities as input while being able to generate an output in any of the modalities with minimal information loss from other modalities.
[20] The use of a multimodal model enables computing systems to have an “all-in-one” solution that can take multimodal input and generate unimodal or multimodal output, without the need to access individual models separately to obtain only unimodal output (e.g., providing a text caption to a text processing model to receive a summary of the caption). This saves time and computing resources (e.g., bandwidth to access different models, time to access different storage locations of separate models, processing in parallel instead of in series, and the like). Additionally, the combination of unimodal models in the multimodal model can learn from each other unimodal model, resulting in a more robust unimodal or multimodal output that can be generated in parallel, instead of having to spend time and processing capability on using a series of unimodal inputs separately to obtain an output that does not have the cross-attention capabilities of the multimodal model.
[21] As another example benefit, the proposed technique for combining multiple unimodal models is flexible, enabling the extension of the technique to any number of unimodal models for any different combination of input modalities. This enables the same multiple pretrained models to be reused in different combinations to service different combinations of input modalities. By enabling the same pretrained models to be recombined and used in different combinations, the proposed technique prevents the need to train a new model for each possible combination. This reduces the amount of training cycles required, which represents a savings of computational resources such as reduced usage of processor cycles, memory usage, network bandwidth, etc.
[22] With reference now to the Figures, example embodiments of the present disclosure will be discussed in further detail.
Example Model Arrangements
[23] Figure 1 depicts a block diagram of an example multimodal model 100 according to example embodiments of the present disclosure. While the example multimodal model 100 shown in Figure 1 illustrates a two-modality model that includes an audio language model (“AudioLM”) and a text processing model (“LLM”), it is to be understood that more than two modalities can be used in the multimodal model 100 and that any other models associated with other modalities, such as image input, video input, statistical input, time-series input, tabular input, and the like can be included in the multimodal model 100 in addition to or in replacement of the AudioLM and/or the LLM.
[24] As an example, the multimodal model 100 can include at least two unimodal models 105 and 110, represented in Figure 1 by the AudioLM and the LLM. Each of the at least two unimodal models 105 and 110 receives modality-specific inputs 115 and 120. In the shown example, AudioLM 105 receives semantic speech tokens 115, which can be generated by a tokenizer capable of transforming received speech data into recognizable tokens (e.g., audio data, audio embeddings, or similar tokens) for the AudioLM 105, and the LLM 110 receives text tokens 120, such as words, phrases, sentences, text embeddings, or similar tokens. Other possible tokens for other kinds of modalities can include images, image embeddings, portions of images, videos, portions of videos, embeddings for frames of videos, statistical embeddings, statistics, and the like.
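As a small illustrative sketch (not taken from the disclosure), each modality's token stream can be embedded by its own model before any cross-modal exchange; the vocabulary sizes and sequence lengths below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

speech_vocab, text_vocab, dim = 1024, 32000, 512
speech_embed = nn.Embedding(speech_vocab, dim)   # semantic speech tokens -> vectors
text_embed = nn.Embedding(text_vocab, dim)       # text tokens -> vectors

semantic_speech_tokens = torch.randint(0, speech_vocab, (1, 40))  # e.g., from an audio tokenizer
text_tokens = torch.randint(0, text_vocab, (1, 12))               # e.g., from a text tokenizer

speech_inputs = speech_embed(semantic_speech_tokens)   # shape (1, 40, 512)
text_inputs = text_embed(text_tokens)                  # shape (1, 12, 512)
```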
[25] The at least two unimodal models 105 and 110 can be pre-trained models that have been trained specifically for the respective modality associated with each of the at least two unimodal models 105 and 110. For example, an LLM can be trained on a large training set of unlabeled or labeled text data.
[26] The at least two unimodal models 105 and 110 can include one or more transformer layers 125 and 130. Transformer layers are given as an example; other types of layers (e.g., feed-forward layers) can be used in addition or alternatively to transformer layers. Transformer layers 125 and 130 can iteratively perform various functions on the modality-specific inputs 115 and 120, such as encoding input tokens to contextualize the tokens of the modality-specific inputs 115 and 120, representing which parts of the input are relevant to one another. Output encodings can then be passed to other transformer layers 125 and 130 to further (iteratively) encode the outputs. The transformer layers 125 and 130 can also be decoding layers, which take encoding outputs from encoding layers and use the incorporated contextual information to generate an output sequence.
[27] Each transformer layer 125 and 130 can include self-attention mechanisms to enhance certain portions of input data while diminishing other portions of the input data.
Furthermore, the self-attention mechanisms can allow the individual transformer layer to access information from any input sequence element, including far-away tokens.
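By way of illustration only, a single self-attention operation of the kind described above could be sketched as follows, where every position in a sequence can enhance or diminish information from every other position, including far-away tokens. The dimensions and library usage are assumptions for the example.

import torch
import torch.nn as nn

d_model, n_heads = 512, 8
self_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, 50, d_model)  # 50 token representations for one sequence

# Queries, keys, and values all come from the same sequence, so each position
# can weight information drawn from any other position, however distant.
attended, attention_weights = self_attention(x, x, x)
print(attended.shape)           # torch.Size([1, 50, 512])
print(attention_weights.shape)  # torch.Size([1, 50, 50]); one distribution per position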
[28] In some implementations, each of the models 105 and 110 can include both an encoder portion and a decoder portion. In these implementations, the cross-model connections can be present in both the encoder portion and the decoder portion, only the encoder portion, or only the decoder portion. In other implementations, one or both of the models 105 and 110 can be decoder-only models.
[29] In addition to providing outputs to other transformer layers within each respective unimodal network, the transformer layers 125 and 130, and especially encoding layers, can provide outputs to multimodal adapters 135 and 140. These multimodal adapters 135 and 140 can include downsampling blocks, upsampling blocks, and/or learned retokenization blocks. Multimodal adapters 135 and 140 receive the representations from other unimodal models, process the data so that it can be used by the unimodal model with which the multimodal adapter is associated, and then input the data (e.g., using cross-attention layers) into the transformer layers 125 and 130 of that unimodal model. Cross-attention layers enable transformer layers 125 and 130 to include outside data, such as encodings from other unimodal networks, in the encoding or decoding of representations of inputs.
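By way of illustration only, a multimodal adapter of the kind described above could be sketched as follows. The use of a single linear projection as a stand-in for the downsampling, upsampling, and learned retokenization blocks, as well as all dimensions, are assumptions for the example and not the specific blocks of the disclosure.

import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    def __init__(self, d_other: int, d_self: int, n_heads: int = 8):
        super().__init__()
        # Stand-in for the downsampling/upsampling/retokenization blocks:
        # map the other model's representation into this model's hidden size.
        self.retokenize = nn.Linear(d_other, d_self)
        # Cross-attention: queries come from this model's stream, keys and
        # values come from the other model's representation.
        self.cross_attention = nn.MultiheadAttention(d_self, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_self)

    def forward(self, self_states: torch.Tensor, other_states: torch.Tensor) -> torch.Tensor:
        other = self.retokenize(other_states)
        attended, _ = self.cross_attention(self_states, other, other)
        # Residual connection keeps the unimodal representation dominant.
        return self.norm(self_states + attended)

adapter_for_llm = MultimodalAdapter(d_other=768, d_self=512)
text_states = torch.randn(1, 12, 512)    # from a transformer layer 130 of the LLM
audio_states = torch.randn(1, 50, 768)   # from a transformer layer 125 of the AudioLM
fused_text_states = adapter_for_llm(text_states, audio_states)  # shape (1, 12, 512)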
[30] In some implementations, the multimodal adapters 135 and 140 can be bi-directional. In addition to receiving data from another unimodal network, the multimodal adapters 135 and 140 can send data from the current unimodal network to the transformer layers 125 and 130 of other unimodal networks. This allows for propagation of data to and from each unimodal network while still relying primarily on the unimodal models 105 and 110. The use of the interleaved architecture (or “zipper” architecture) enables multiple levels of information or encodings to be shared between the models 105 and 110.
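By way of illustration only, the bi-directional, interleaved (“zipper”) exchange could be sketched as follows. The layer count, dimensions, every-layer exchange schedule, and the compressed adapter definition are assumptions for the example.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Compressed form of the multimodal adapter sketched above.
    def __init__(self, d_other: int, d_self: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_other, d_self)
        self.attn = nn.MultiheadAttention(d_self, n_heads, batch_first=True)

    def forward(self, self_states, other_states):
        other = self.proj(other_states)
        return self_states + self.attn(self_states, other, other)[0]

n_layers, d_audio, d_text = 4, 768, 512
audio_layers = nn.ModuleList(nn.TransformerEncoderLayer(d_audio, nhead=8, batch_first=True) for _ in range(n_layers))
text_layers = nn.ModuleList(nn.TransformerEncoderLayer(d_text, nhead=8, batch_first=True) for _ in range(n_layers))
to_audio = nn.ModuleList(Adapter(d_text, d_audio) for _ in range(n_layers))  # text -> audio direction
to_text = nn.ModuleList(Adapter(d_audio, d_text) for _ in range(n_layers))   # audio -> text direction

audio_states = torch.randn(1, 50, d_audio)  # speech token representations
text_states = torch.randn(1, 12, d_text)    # text token representations

for i in range(n_layers):
    audio_states = audio_layers[i](audio_states)
    text_states = text_layers[i](text_states)
    # Exchange representations in both directions after each layer.
    audio_states, text_states = to_audio[i](audio_states, text_states), to_text[i](text_states, audio_states)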
[31] Although Figure 1 illustrates the cross-model connections occurring between each layer of the models 105 and 110, the cross-model connections can be more sparse. For example, cross-model connections can occur between every third layer, or other periodicity, or at random intervals.
[32] Although Figure 1 illustrates the cross-model connections occurring between layers of the same rank or depth, the cross-model connections can occur between layers of different rank or depth. For example, a connection between a layer of depth two in model 105 can connect to a layer of depth three in model 110, or vice versa.
[33] Although Figure 1 illustrates only a single cross-model connection occurring at each layer, it is possible that multiple cross-model connections can occur at a single layer. For example, one layer in model 105 can connect (e.g., via adapter layers) to multiple layers in model 110, or vice versa.
[34] Although Figure 1 illustrates the cross-model connections as being symmetrical, it is possible that the cross-model connections can be asymmetrical. For example, there may be more cross-model connections extending from model 105 to model 110 than there are cross-model connections extending from model 110 to model 105, or vice versa.
[35] Although Figure 1 illustrates the models 105 and 110 as having the same number of layers, it is possible that the models 105 and 110 have different numbers of layers. For example, model 105 may be significantly deeper than model 110, or vice versa.
[36] In some implementations, when the unimodal models 105 and 110 send and receive data via the multimodal adapters 135 and 140, the unimodal models 105 and 110 can be frozen, that is, configured so that the weights internal to each of the unimodal models 105 and 110 remain unchanged during training even though data from the other unimodal models is included in the calculations made by each unimodal model. Thus, in some implementations, during training, the weights of the cross-attention layers in the multimodal adapters 135 and 140 may be changed, but the weights in the transformer layers 125 and 130 remain unchanged.
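By way of illustration only, freezing the unimodal weights while leaving the adapter cross-attention weights trainable could be sketched as follows. The stand-in modules, optimizer, and learning rate are assumptions for the example.

import torch
import torch.nn as nn

# Stand-ins for pretrained unimodal transformer layers 125/130 and the
# cross-attention layers of multimodal adapters 135/140.
audio_model = nn.TransformerEncoderLayer(768, nhead=8, batch_first=True)
text_model = nn.TransformerEncoderLayer(512, nhead=8, batch_first=True)
audio_adapter_attn = nn.MultiheadAttention(768, 8, batch_first=True)
text_adapter_attn = nn.MultiheadAttention(512, 8, batch_first=True)

# Freeze the unimodal models: their internal weights stay unchanged during training.
for frozen in (audio_model, text_model):
    for p in frozen.parameters():
        p.requires_grad = False

# Only the adapter cross-attention weights are given to the optimizer,
# so only they are updated during training.
trainable = list(audio_adapter_attn.parameters()) + list(text_adapter_attn.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-4)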
[37] The multimodal model 100 can therefore take in inputs 115 and 120, make predictions on any individual input or any combination of inputs, and then generate at least one output 145 or 150, or generate a plurality of outputs, in any and all modalities of the unimodal networks included in the multimodal model 100, such as generating audio data for speech, a transcript of input audio data, a text output based on input text, an image, and the like.
[38] One example multi-modal output is an audio-video output. One example audio-video output is a video completion output. Another example multi-modal output is interleaved multi-modal output, such as interleaved text-audio outputs. Another example multi-modal output is a text and image output. For example, the text and image output can include an image that is a graphical depiction of a map. As another example, the text and image output can include an image with a textual caption.
Example Methods
[39] Figure 2 depicts a flow chart diagram of an example method to perform multimodal output according to example embodiments of the present disclosure. Although Figure 2 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 200 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure.
[40] At 202, a computing system receives a multimodal input from one or more input sources. The multimodal input can include two or more input portions, each being associated with different modalities (e.g., a first modality and a second modality), such as input text, input speech data, input images, and the like. The input sources can be databases, user inputs, a tokenizer that parses input data into learnable tokens, and other data sources.
[41] At 204, the computing system provides a first input portion having a first modality from the multimodal input to a first unimodal model associated with the first modality and provides a second input portion having a second modality from the multimodal input to a second unimodal model associated with the second modality. The first input portion and the second input portion are two portions of the multimodal input. The multimodal input may include other portions in addition to the first input portion and the second input portion. The first input portion is provided to the first unimodal model that is associated with the modality of the first input portion. For example, the first input portion can include tokenized speech data and can therefore be provided to a unimodal model that is configured to process speech data and output, for example, a transcript of the data or novel speech data. Similarly, the second input portion is provided to the second unimodal model that is associated with the modality of the second input portion.
[42] At 206, the computing system provides a first representation from the first unimodal model to the second unimodal model and a second representation from the second unimodal model to the first unimodal model, each representation representing an output of a layer of its unimodal model, such as a transformer layer. As described above with regard to Figure 1, each unimodal model includes one or more transformer layers that perform processing on inputs to that transformer layer. The outputs of this processing are then provided to other transformer layers in the unimodal model. The output can also be provided to a different unimodal model with a different modality, such as the second unimodal model. In some embodiments, a module such as a multimodal adapter can be used to receive the output from the transformer layer. The multimodal adapter can include components such as downsampling blocks, upsampling blocks, learned retokenization blocks, and other components that can perform further processing on the output from the transformer layer before providing the transformer layer output to the second unimodal model.
[43] In some embodiments, the multimodal adapter provides the first representation to the second unimodal model using a cross-attention layer of the second unimodal model and can provide the second representation to the first unimodal model using a cross-attention layer of the first unimodal model.
[44] In some embodiments, the multimodal adapter is “bi-directional,” or has the ability to both receive data from the transformer layer of the first unimodal model for use by the second unimodal model and also to provide information from a cross-attention layer or transformer layer of the second unimodal model to the first unimodal model for use in processing by the first unimodal model.
[45] At 208, the computing system processes the first and the second input portions using the second unimodal model and the first unimodal model, respectively, based at least in part on the representations. For example, based on input from a transformer layer from each of the first unimodal model and the second unimodal model (e.g., the first representation and the second representation), the other respective unimodal model can process its own input with the representation provided to it used as a condition or other cross-attention factor in the process. In some embodiments, the second unimodal model is frozen during this processing (e.g., the weights of the second unimodal model are not updated based on any feedback from the processing), but the cross-attention layer is not frozen, so that the second unimodal model can still have a layer that “learns” from other input modalities. The reverse can also be true in that the first unimodal model can be frozen during this processing.
[46] At 210, the computing system generates an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model. In some embodiments, the output includes an output in the modality of the model used for generating the output, such as generating output in the second modality if only the second unimodal model is used for output. For example, based on text input and speech input, and with an input from a transformer layer of an LLM, a speech generation model can generate speech data based on a transcript generated in part through language processing performed by the LLM.
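By way of illustration only, steps 202 through 210 could be sketched as a single function. The model and adapter objects are assumed to expose the interfaces sketched with respect to Figure 1, and the function and argument names are illustrative rather than part of the claimed method.

def multimodal_forward(first_model, second_model, first_to_second_adapter,
                       second_to_first_adapter, first_input, second_input):
    # 202/204: each input portion is routed to the unimodal model for its modality.
    first_states = first_model(first_input)
    second_states = second_model(second_input)

    # 206/208: layer representations are exchanged, and each model conditions the
    # processing of its own input on the other model's representation.
    fused_first = second_to_first_adapter(first_states, second_states)
    fused_second = first_to_second_adapter(second_states, first_states)

    # 210: an output is generated from either or both unimodal models.
    return fused_first, fused_second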
[47] In other embodiments, the output can include a first output portion and a second output portion. Because of the sharing of representations between models, multimodal outputs can be generated. The first output portion can be in a first modality of one of the unimodal models (e.g., speech) and be output by that unimodal model, and the second output portion can be in a second modality of another of the unimodal models (e.g., text) and be output by that other unimodal model.
Example Devices and Systems
[48] Figure 3A depicts a block diagram of an example computing system 300 that performs multimodal output according to example embodiments of the present disclosure. The system 300 includes a user computing device 302, a server computing system 330, and a training computing system 350 that are communicatively coupled over a network 380.
[49] The user computing device 302 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
[50] The user computing device 302 includes one or more processors 312 and a memory 314. The one or more processors 312 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 314 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 314 can store data 316 and instructions 318 which are executed by the processor 312 to cause the user computing device 302 to perform operations.
[51] In some implementations, the user computing device 302 can store or include one or more multimodal models 320. For example, the multimodal models 320 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models. Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks, or other forms of neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models (e.g., transformer models). Example multimodal models 320 are discussed with reference to Figure 1.
[52] In some implementations, the one or more multimodal models 320 can be received from the server computing system 330 over network 380, stored in the user computing device memory 314, and then used or otherwise implemented by the one or more processors 312. In some implementations, the user computing device 302 can implement multiple parallel instances of a single multimodal model 320 (e.g., to perform parallel multimodal output across multiple instances of multimodal processing).
[53] More particularly, the one or more multimodal models 320 include two or more unimodal models trained on different modalities that take in input tokens and output data in the modality associated with the particular unimodal model (e.g., outputting text, audio data, images, video data, classifications, and the like).
[54] Additionally or alternatively, one or more multimodal models 340 can be included in or otherwise stored and implemented by the server computing system 330 that communicates with the user computing device 302 according to a client-server relationship. For example, the multimodal models 340 can be implemented by the server computing system 330 as a portion of a web service (e.g., a multimodal service). Thus, one or more models 320 can be stored and implemented at the user computing device 302 and/or one or more models 340 can be stored and implemented at the server computing system 330.
[55] The user computing device 302 can also include one or more user input components 322 that receive user input. For example, the user input component 322 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.
[56] The server computing system 330 includes one or more processors 332 and a memory 334. The one or more processors 332 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 334 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 334 can store data 336 and instructions 338 which are executed by the processor 332 to cause the server computing system 330 to perform operations.
[57] In some implementations, the server computing system 330 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 330 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
[58] As described above, the server computing system 330 can store or otherwise include one or more multimodal models 340. For example, the models 340 can be or can otherwise include various machine-learned models. Example machine-learned models include neural networks or other multi-layer non-linear models. Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models (e.g., transformer models). Example models 340 are discussed with reference to Figure 1.
[59] The user computing device 302 and/or the server computing system 330 can train the models 320 and/or 340 via interaction with the training computing system 350 that is communicatively coupled over the network 380. The training computing system 350 can be separate from the server computing system 330 or can be a portion of the server computing system 330.
[60] The training computing system 350 includes one or more processors 352 and a memory 354. The one or more processors 352 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 354 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 354 can store data 356 and instructions 358 which are executed by the processor 352 to cause the training computing system 350 to perform operations. In some implementations, the training computing system 350 includes or is otherwise implemented by one or more server computing devices.
[61] The training computing system 350 can include a model trainer 360 that trains the machine-learned models 320 and/or 340 stored at the user computing device 302 and/or the server computing system 330 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function). Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations.
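By way of illustration only, such a training loop could be sketched as follows, backpropagating a loss and applying gradient-descent updates over a number of iterations. The stand-in model, synthetic batch, choice of cross-entropy loss, and hyperparameters are assumptions for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(512, 1000)                               # stand-in for a model 320/340
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # gradient descent technique

for step in range(100):                                    # number of training iterations
    features = torch.randn(8, 512)                         # a batch drawn from training data 362
    targets = torch.randint(0, 1000, (8,))
    loss = F.cross_entropy(model(features), targets)       # example loss function
    optimizer.zero_grad()
    loss.backward()                                        # backwards propagation of errors
    optimizer.step()                                       # update parameters from the gradient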
[62] In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. The model trainer 360 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
[63] In particular, the model trainer 360 can train the multimodal models 320 and/or 340 based on a set of training data 362. The training data 362 can include, for example, training data specific to each unimodal model, such as training a large language model using unlabeled text data, training an image generation model based on input tokens and resulting images, and the like.
[64] In some implementations, if the user has provided consent, the training examples can be provided by the user computing device 302. Thus, in such implementations, the model 320 provided to the user computing device 302 can be trained by the training computing system 350 on user-specific data received from the user computing device 302. In some instances, this process can be referred to as personalizing the model.
[65] The model trainer 360 includes computer logic utilized to provide desired functionality. The model trainer 360 can be implemented in hardware, firmware, and/or software controlling a general purpose processor. For example, in some implementations, the model trainer 360 includes program files stored on a storage device, loaded into a memory and executed by one or more processors. In other implementations, the model trainer 360 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.
[66] The network 380 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof, and can include any number of wired or wireless links. In general, communication over the network 380 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
[67] The machine-learned models described in this specification may be used in a variety of tasks, applications, and/or use cases.
[68] In some implementations, the input to the machine-learned model(s) of the present disclosure can be image data. The machine-learned model(s) can process the image data to generate an output. As an example, the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an image segmentation output. As another example, the machine-learned model(s) can process the image data to generate an image classification output. As another example, the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an upscaled image data output. As another example, the machine-learned model(s) can process the image data to generate a prediction output.
[69] In some implementations, the input to the machine-learned model(s) of the present disclosure can be text or natural language data. The machine-learned model(s) can process the text or natural language data to generate an output. As an example, the machine-learned model(s) can process the natural language data to generate a language encoding output. As another example, the machine-learned model(s) can process the text or natural language data to generate a latent text embedding output. As another example, the machine-learned model(s) can process the text or natural language data to generate a translation output. As another example, the machine-learned model(s) can process the text or natural language data to generate a classification output. As another example, the machine-learned model(s) can process the text or natural language data to generate a textual segmentation output. As another example, the machine-learned model(s) can process the text or natural language data to generate a semantic intent output. As another example, the machine-learned model(s) can process the text or natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.). As another example, the machine-learned model(s) can process the text or natural language data to generate a prediction output.
[70] In some implementations, the input to the machine-learned model(s) of the present disclosure can be speech data. The machine-learned model(s) can process the speech data to generate an output. As an example, the machine-learned model(s) can process the speech data to generate a speech recognition output. As another example, the machine-learned model(s) can process the speech data to generate a speech translation output. As another example, the machine-learned model(s) can process the speech data to generate a latent embedding output. As another example, the machine-learned model(s) can process the speech data to generate an encoded speech output (e.g., an encoded and/or compressed representation of the speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate an upscaled speech output (e.g., speech data that is higher quality than the input speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate a prediction output.
[71] In some implementations, the input to the machine-learned model(s) of the present disclosure can be latent encoding data (e.g., a latent space representation of an input, etc.). The machine-learned model(s) can process the latent encoding data to generate an output. As an example, the machine-learned model(s) can process the latent encoding data to generate a recognition output. As another example, the machine-learned model(s) can process the latent encoding data to generate a reconstruction output. As another example, the machine-learned model(s) can process the latent encoding data to generate a search output. As another example, the machine-learned model(s) can process the latent encoding data to generate a reclustering output. As another example, the machine-learned model(s) can process the latent encoding data to generate a prediction output.
[72] In some implementations, the input to the machine-learned model(s) of the present disclosure can be statistical data. Statistical data can be, represent, or otherwise include data computed and/or calculated from some other data source. The machine-learned model(s) can process the statistical data to generate an output. As an example, the machine-learned model(s) can process the statistical data to generate a recognition output. As another example, the machine-learned model(s) can process the statistical data to generate a prediction output. As another example, the machine-learned model(s) can process the statistical data to generate a classification output. As another example, the machine-learned model(s) can process the statistical data to generate a segmentation output. As another example, the machine-learned model(s) can process the statistical data to generate a visualization output. As another example, the machine-learned model(s) can process the statistical data to generate a diagnostic output.
[73] In some implementations, the input to the machine-learned model(s) of the present disclosure can be sensor data. The machine-learned model(s) can process the sensor data to generate an output. As an example, the machine-learned model(s) can process the sensor data to generate a recognition output. As another example, the machine-learned model(s) can process the sensor data to generate a prediction output. As another example, the machine-learned model(s) can process the sensor data to generate a classification output. As another example, the machine-learned model(s) can process the sensor data to generate a segmentation output. As another example, the machine-learned model(s) can process the sensor data to generate a visualization output. As another example, the machine-learned model(s) can process the sensor data to generate a diagnostic output. As another example, the machine-learned model(s) can process the sensor data to generate a detection output.
[74] In some cases, the machine-learned model(s) can be configured to perform a task that includes encoding input data for reliable and/or efficient transmission or storage (and/or corresponding decoding). For example, the task may be an audio compression task. The input may include audio data and the output may comprise compressed audio data. In another example, the input includes visual data (e.g. one or more images or videos), the output comprises compressed visual data, and the task is a visual data compression task. In another example, the task may comprise generating an embedding for input data (e.g. input audio or visual data).
[75] In some cases, the input includes visual data and the task is a computer vision task. In some cases, the input includes pixel data for one or more images and the task is an image processing task. For example, the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class. The image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest. As another example, the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories. For example, the set of categories can be foreground and background. As another example, the set of categories can be object classes. As another example, the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value. As another example, the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
[76] In some cases, the input includes audio data representing a spoken utterance and the task is a speech recognition task. The output may comprise a text output which is mapped to the spoken utterance. In some cases, the task comprises encrypting or decrypting input data. In some cases, the task comprises a microprocessor performance task, such as branch prediction or memory address translation.
[77] Figure 3A illustrates one example computing system that can be used to implement the present disclosure. Other computing systems can be used as well. For example, in some implementations, the user computing device 302 can include the model trainer 360 and the training dataset 362. In such implementations, the models 320 can be both trained and used locally at the user computing device 302. In some of such implementations, the user computing device 302 can implement the model trainer 360 to personalize the models 320 based on user-specific data.
[78] Figure 3B depicts a block diagram of an example computing device 10 that performs according to example embodiments of the present disclosure. The computing device 10 can be a user computing device or a server computing device.
[79] The computing device 10 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
[80] As illustrated in Figure 3B, each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application.
[81] Figure 3C depicts a block diagram of an example computing device 50 that performs according to example embodiments of the present disclosure. The computing device 50 can be a user computing device or a server computing device.
[82] The computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
[83] The central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 3C, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50.
[84] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing device 50. As illustrated in Figure 3C, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).
Additional Disclosure
[85] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.
[86] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.

Claims

WHAT IS CLAIMED IS:
1. A computer-implemented method for generating an output from a machine-learned model, the method comprising: receiving, by a processor, a multimodal input from one or more input sources; providing, by the processor, a first input portion having a first modality from the multimodal input to a first unimodal model associated with the first modality; providing, by the processor, a second input portion having a second modality from the multimodal input to a second unimodal model associated with the second modality; providing, by the processor, a first representation from the first unimodal model to the second unimodal model, the first representation representing an output of a layer of the first unimodal model; providing, by the processor, a second representation from the second unimodal model to the first unimodal model, the second representation representing an output of a layer of the second unimodal model; processing, by the processor, the second input portion using the second unimodal model based at least in part on the first representation; processing, by the processor, the first input portion using the first unimodal model based at least in part on the second representation; and generating, by the processor, an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model.
2. The method of claim 1, wherein providing the first representation from the first unimodal model to the second unimodal model is performed using a multimodal adapter.
3. The method of claim 2, wherein the multimodal adapter includes at least one component selected from the group of components consisting of a downsampling block, an upsampling block, and a learned retokenization block.
4. The method of claim 3, wherein the multimodal adapter processes the first representation using the at least one component before providing the first representation to the second unimodal model.
5. The method of claim 2, wherein the multimodal adapter provides the first representation to the second unimodal model using a cross-attention layer.
6. The method of claim 5, wherein the cross-attention layer is updated based on the generated output.
7. The method of claim 2, wherein the multimodal adapter is bidirectional and provides the second representation to the first unimodal model from the second unimodal model.
8. The method of claim 1, wherein the second unimodal model is frozen when processing the second input portion.
9. The method of claim 1, wherein the output is an output in the second modality.
10. The method of claim 1, wherein the output includes a first output portion and a second output portion, the first output portion being an output in the first modality and the second output portion being an output in the second modality.
11. The method of claim 10, wherein the first unimodal model outputs the first output portion and the second unimodal model outputs the second output portion.
12. The method of claim 1, wherein the layer of the first unimodal model is a transformer layer.
13. The method of claim 1, the method further comprising: providing, by the processor, the first representation to a third unimodal model; processing, by the processor, the first representation with the third unimodal model; and generating, by the processor, the output based in part on the processing by the third unimodal model.
14. A computing system, the computing system comprising: one or more processors; and a non-transitory, computer-readable medium comprising: a machine-learned model comprising a first unimodal model associated with a first modality and a second unimodal model associated with a second modality; and instructions that, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising: receiving a multimodal input from one or more input sources; providing a first input portion having the first modality from the multimodal input to the first unimodal model; providing a second input portion having the second modality from the multimodal input to the second unimodal model; providing a first representation from the first unimodal model to the second unimodal model, the first representation representing an output of a layer of the first unimodal model; providing a second representation from the second unimodal model to the first unimodal model, the second representation representing an output of a layer of the second unimodal model; processing the second input portion using the second unimodal model based at least in part on the first representation; processing the first input portion using the first unimodal model based at least in part on the second representation; and generating an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model.
15. The computing system of claim 14, wherein providing the first representation from the first unimodal model to the second unimodal model is performed using a multimodal adapter.
16. The computing system of claim 15, wherein the multimodal adapter provides the first representation to the second unimodal model using a cross-attention layer.
17. The computing system of claim 16, wherein the cross-attention layer is updated based on the generated output.
18. The computing system of claim 15, wherein the multimodal adapter is bidirectional and provides the second representation to the first unimodal model from the second unimodal model.
19. A non-transitory, computer-readable medium comprising: a machine-learned model comprising a first unimodal model associated with a first modality and a second unimodal model associated with a second modality; and instructions that, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising: receiving a multimodal input from one or more input sources; providing a first input portion having the first modality from the multimodal input to the first unimodal model; providing a second input portion having the second modality from the multimodal input to the second unimodal model; providing a first representation from the first unimodal model to the second unimodal model, the first representation representing an output of a transformer layer of the first unimodal model; providing a second representation from the second unimodal model to the first unimodal model, the second representation representing an output of a layer of the second unimodal model; processing the second input portion using the second unimodal model based at least in part on the first representation; processing the first input portion using the first unimodal model based at least in part on the second representation; and generating an output from the machine-learned model based at least in part on the processing by at least one of the first unimodal model and the second unimodal model.
20. The non-transitory, computer-readable medium of claim 19, wherein providing the first representation from the first unimodal model to the second unimodal model is performed using a multimodal adapter, wherein the multimodal adapter provides the first representation to the second unimodal model using a cross-attention layer and provides the second representation to the first unimodal model using a cross-attention layer.