
WO2025048837A1 - Multi-scale attribute-disentangled audio tokenization for controllable generation

Info

Publication number
WO2025048837A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
model
machine
learned
intermediate representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2023/031886
Other languages
French (fr)
Inventor
Aren Jansen
Ron Weiss
Jacob David WALKER
Hakan Erdogan
John Randall Hershey
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Priority to PCT/US2023/031886
Publication of WO2025048837A1
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L19/18 - Vocoders using multiple modes
    • G10L19/20 - Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • MULTI-SCALE ATTRIBUTE-DISENTANGLED AUDIO TOKENIZATION FOR CONTROLLABLE GENERATION
  • FIELD
  • [0001] The present disclosure relates generally to audio tokenization. More particularly, the present disclosure relates to multi-scale, attribute-based disentanglement for audio tokenization.
  • Foundational models, such as Large Language Models (LLMs), are currently revolutionizing what can be achieved with assistive AI technology in many fields, ranging from regular conversation assistants to multi-modal editing of content such as audio, images, or video.
  • A foundational model can perform a wide variety of tasks within the context of the training data used to train the model.
  • An LLM, for example, can perform a wide variety of language tasks ranging from answering queries to generating new content based on prompts.
  • One example aspect of the present disclosure is directed to a computer-implemented method. The method includes obtaining, by a computing system comprising one or more processor devices, an intermediate representation of audio data comprising audio of a particular type, wherein audio of the particular type comprises a plurality of audio content attributes.
  • the method includes processing, by the computing system, the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes, wherein the two or more token streams comprise a first token stream for a first audio content attribute of the two or more audio content attributes, wherein the first token stream is tokenized at a first time scale, and a second token stream for a second audio content attribute of the two or more audio content attributes, wherein the second token stream is tokenized at a second time scale different than the first time scale.
  • Another example aspect of the present disclosure is directed to a computing system.
  • the computing system includes one or more processors and one or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by the one or more processors cause the computing system to perform operations.
  • the operations include obtaining an intermediate representation of audio data comprising audio of a particular type, wherein audio of the particular type comprises a plurality of audio content attributes.
  • the operations include processing the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes, wherein the two or more token streams comprise a first token stream for a first audio content attribute of the two or more audio content attributes, wherein the first token stream is tokenized at a first time scale, and a second token stream for a second audio content attribute of the two or more audio content attributes, wherein the second token stream is tokenized at a second time scale different than the first time scale.
  • Another example aspect of the present disclosure is directed to one or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by one or more processors cause the one or more processors to perform operations.
  • the operations include obtaining an intermediate representation of audio data comprising audio of a particular type, wherein audio of the particular type comprises a plurality of audio content attributes.
  • the operations include processing the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes, wherein the two or more token streams comprise a first token stream for a first audio content attribute of the two or more audio content attributes, wherein the first token stream is tokenized at a first time scale, and a second token stream for a second audio content attribute of the two or more audio content attributes, wherein the second token stream is tokenized at a second time scale different than the first time scale.
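  • For orientation, the method recited above can be sketched in a few lines of Python. The `TokenStream` container and the `encoder` / `disentangler` callables are illustrative placeholders introduced here, not elements of the disclosure:

```python
from dataclasses import dataclass
from typing import Callable, Dict

import numpy as np


@dataclass
class TokenStream:
    """Tokens for one audio content attribute, emitted at its own time scale."""
    attribute: str      # e.g., "speech" or "background_instrumental"
    time_scale: str     # e.g., "per_utterance" or "per_0.5s_frame"
    tokens: np.ndarray  # integer token ids


def disentangle_audio(audio: np.ndarray,
                      encoder: Callable[[np.ndarray], np.ndarray],
                      disentangler: Callable[[np.ndarray], Dict[str, TokenStream]],
                      ) -> Dict[str, TokenStream]:
    """Obtain an intermediate representation of the audio data, then process it
    into per-attribute token streams tokenized at attribute-specific time scales."""
    intermediate = encoder(audio)      # intermediate representation of the audio data
    return disentangler(intermediate)  # e.g., {"speech": ..., "instrumental": ...}
```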
  • Figure 1A depicts a block diagram of an example computing system that performs per-attribute token disentanglement according to example embodiments of the present disclosure.
  • Figure 1B depicts a block diagram of an example computing device that performs training of a machine-learned disentanglement model according to example embodiments of the present disclosure.
  • Figure 1C depicts a block diagram of an example computing device that performs machine-learned disentanglement of an intermediate representation according to example embodiments of the present disclosure.
  • Figure 2A depicts a data flow diagram for unsupervised training of a machine-learned tokenization model according to some implementations of the present disclosure.
  • Figure 2B is a data flow diagram for performing quantization with the quantization portion of the machine-learned disentanglement model of Figure 2A according to some implementations of the present disclosure.
  • Figure 3 is a block diagram for a computing system that disentangles and tokenizes representations for generative tasks and/or compression tasks according to some implementations of the present disclosure.
  • Figure 4 is a data flow diagram for disentanglement of an intermediate representation of speech audio at prescribed time scales according to some implementations of the present disclosure.
  • Figure 5 depicts a flow chart diagram of an example method to perform disentanglement of tokens for generative tasks according to some implementations of the present disclosure.
  • Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations.
  • DETAILED DESCRIPTION
  • Overview
  • [0019] the present disclosure is directed to multi-scale, attribute-based disentanglement for audio tokenization. More specifically, it can be advantageous to disentangle audio data (or an intermediate representation of audio data) on a per-property basis to generate separate token streams for each audio content property of the audio data.
  • audio data of a song may be disentangled into multiple token streams (e.g., a speaker audio property, a music audio property, an emotion audio property, etc.).
  • conventional tokenization processes are rendered inefficient due to the temporally variant nature of audio content attributes (e.g., speech, music, one-off background noise, consistent background noise, etc.).
  • different audio content attributes are most optimally tokenized at different time scales.
  • a speech audio content attribute for speech audio caused by singing may be optimally tokenized on a per-utterance basis (e.g., generating tokens when an utterance occurs) while an instrumental audio content attribute for instrumental background audio may be most optimally tokenized on a per-frame basis or on a time basis (e.g., every 0.5 seconds, etc.).
  • an intermediate representation can be obtained for audio data that includes audio of a particular type (e.g., a song, a speech, a recording of a room, etc.).
  • the audio of the particular type can include multiple audio content attributes.
  • audio of a song may include a speech audio content attribute, a background vocals audio content attribute, a background instrumental audio content attribute, etc.
  • the intermediate representation of the audio data can be processed with a machine-learned tokenization model to obtain token streams for particular audio content attributes.
  • the machine-learned tokenization model may provide a token stream representing audio of the speech audio content attribute, a second token stream representing audio of the background instrumental audio content attribute, and a third “catch-all” token stream to represent any perceptible information not included in the other two token streams.
  • Each of the token streams can be tokenized at a particular time scale.
  • the first token stream, representing certain audio characteristics of the speech audio content attribute such as speaker identity, emotion, etc., may be tokenized at a per-utterance scale (e.g., every time the main singer produces a spoken utterance)
  • the second token stream representing audio of the background instrumental audio content attribute may be tokenized at a time-basis rate (e.g., every 0.5 seconds).
  • an intermediate representation of audio data can be more optimally disentangled and tokenized on a per-attribute basis.
  • Implementations of the present disclosure provide a number of technical effects and benefits. As one example technical effect and benefit, some conventional intermediate representations of audio data have recently been found to “over-represent” corresponding audio data. In other words, many conventional techniques for generating intermediate representations of audio data produce representations that require more computing resources than necessary (e.g., memory, storage, power, processing cycles, etc.). However, by disentangling and tokenizing an intermediate representation on a per-attribute basis, implementations of the present disclosure can substantially reduce the quantity of non-perceptible information included in an intermediate representation.
  • implementations of the present disclosure can create intermediate representations that imperceptibly compress audio data to a degree that is substantially better than conventional compression processes.
  • although tokenization of the instrumental audio may require a particular time scale (e.g., every 0.5 seconds, etc.), tokenizing the speech audio of the speech audio attribute at the same time scale will result in the collection of a large quantity of tokens that represent non-perceptible information (e.g., periods during which the singer is not singing).
  • implementations of the present disclosure can tokenize the speech audio at a per-utterance time scale that generates substantially fewer tokens than the time scale for the instrumental audio, therefore substantially reducing the size of the intermediate representation without loss of perceptible information.
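  • The savings can be made concrete with a small back-of-the-envelope calculation; the durations and utterance count below are invented purely for illustration:

```python
# A 180-second song: background instrumentals are tokenized every 0.5 s, while
# the lead vocal produces 60 utterances and is tokenized per utterance.
song_seconds = 180
instrumental_tokens = song_seconds / 0.5      # fixed-rate stream: 360 tokens
utterance_tokens = 60                         # per-utterance stream for the vocal

single_scale_total = 2 * instrumental_tokens  # both attributes at the fixed rate
multi_scale_total = instrumental_tokens + utterance_tokens

print(single_scale_total, multi_scale_total)  # 720.0 vs 420.0 tokens
```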
  • implementations of the present disclosure provide for fine-tuned control of audio content attributes. For example, assume that audio data including audio of vocal music with content attributes such as speech, music, background vocals, etc. is tokenized. The tokens can be disentangled on a per-attribute basis corresponding to audio content attributes associated with the audio.
  • FIG. 1A depicts a block diagram of an example computing system 100 that performs per-attribute token disentanglement according to example embodiments of the present disclosure.
  • the system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 that are communicatively coupled over a network 180.
  • the user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
  • the user computing device 102 includes one or more processors 112 and a memory 114.
  • the one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 114 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations.
  • the user computing device 102 can store or include one or more models 120.
  • the models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models.
  • Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks.
  • Some example machine-learned models can leverage an attention mechanism such as self-attention.
  • some example machine-learned models can include multi-headed self-attention models (e.g., transformer models).
  • Example models 120 are discussed with reference to Figures 2 and 3.
  • the one or more models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112.
  • the user computing device 102 can implement multiple parallel instances of a single model 120 (e.g., to perform parallel per-attribute token disentanglement across multiple instances of the models 120).
  • the machine-learned model(s) 120 can include a machine-learned disentanglement model.
  • the machine-learned disentanglement model can be a model that processes an intermediate representation, or tokens generated from an intermediate representation, and produces a disentanglement output that disentangles the intermediate representation into a collection of subspaces that each hold tokens corresponding to a particular attribute (i.e., per-attribute disentangled tokens).
  • the machine-learned disentanglement model can be a transformer model, transformer layer(s), attention-based layer(s), etc. trained to disentangle intermediate representations of an input.
  • the machine-learned disentanglement model may be a multi-headed model that includes a supervision head for each attribute that is trained on a corresponding contrastive loss term.
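  • A minimal sketch of such a multi-headed disentanglement model, assuming PyTorch; the layer sizes, head structure, and attribute names are illustrative assumptions rather than the disclosed architecture:

```python
import torch
import torch.nn as nn


class DisentanglementModel(nn.Module):
    """Shared trunk over the intermediate representation, plus one head per attribute."""

    def __init__(self, d_model: int = 256,
                 attributes=("speaker", "prosody", "phonetics")):
        super().__init__()
        trunk_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                                 batch_first=True)
        self.trunk = nn.TransformerEncoder(trunk_layer, num_layers=2)
        # One projection ("supervision head") per audio content attribute subspace.
        self.heads = nn.ModuleDict({a: nn.Linear(d_model, d_model) for a in attributes})

    def forward(self, intermediate: torch.Tensor) -> dict:
        shared = self.trunk(intermediate)  # (batch, time, d_model)
        return {name: head(shared) for name, head in self.heads.items()}
```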
  • the machine-learned model(s) 120 can include a machine-learned encoder model.
  • the machine-learned encoder model can be a model that processes input data to generate an intermediate representation of the input data.
  • the user computing device 102 can obtain audio data including audio of a spoken utterance from a user via an audio input device associated with the user computing device 102.
  • the machine-learned encoder model can process the audio data to generate an intermediate representation of the audio data, such as an encoding (e.g., a wav2vec encoding, etc.).
  • the user computing device 102 can obtain textual data including textual content from a user via a text input device associated with the user computing device 102.
  • the machine-learned encoder model can process the textual data to generate an intermediate representation of the textual data, such as an encoding (e.g., a text encoding such as a Bidirectional Encoder Representation from Transformer (BERT) representation, etc.).
  • the machine-learned encoder model can be a portion of a larger model, such as a foundational model (e.g., a large language model (LLM), a foundational audio model, foundational computer vision model, etc.).
  • the machine-learned encoder model can be the encoder portion of an LLM.
  • the machine-learned model(s) 120 can include a machine-learned generative model, such as a machine-learned decoder model.
  • the machine-learned generative model can be a model that processes an intermediate representation to generate a model output.
  • the user computing device 102 can obtain an intermediate representation of textual content describing a particular type of audio and corresponding audio content attributes for the audio.
  • the machine-learned generative model is, for example, a foundational audio decoder, the machine-learned generative model can process the intermediate representation to generate audio data that includes the audio described by the textual content.
  • the machine-learned generative model can process the intermediate representation to generate image data that depicts some visual representation of the audio described by the textual content (e.g., a spectrogram, album art for audio, a visual depiction of events described by the audio, etc.).
  • the machine-learned model(s) 120 can include any type or manner of machine-learned model or portion / layer of machine-learned model, including multi-modal models that are trained to process and/or generate multimodal information.
  • the machine-learned model(s) may include a model that processes an intermediate representation of two types of information (e.g., audio, image, text, etc.) to generate an output.
  • the machine-learned model(s) may include a model that processes an intermediate representation of one type of information to generate a multimodal output.
  • the models, and/or portions of models, described herein may be multimodal or otherwise mode-agnostic.
  • one or more models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship.
  • the models 140 can be implemented by the server computing system 130 as a portion of a web service (e.g., a generative service).
  • one or more models 120 can be stored and implemented at the user computing device 102 and/or one or more models 140 can be stored and implemented at the server computing system 130.
  • the user computing device 102 can also include one or more user input components 122 that receives user input.
  • the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus).
  • the touch-sensitive component can serve to implement a virtual keyboard.
  • the server computing system 130 includes one or more processors 132 and a memory 134.
  • the one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 134 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.
  • the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
  • the server computing system 130 can store or otherwise include one or more models 140.
  • the models 140 can be or can otherwise include various machine-learned models.
  • Example machine-learned models include neural networks or other multi-layer non-linear models.
  • Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.
  • Some example machine-learned models can leverage an attention mechanism such as self-attention.
  • some example machine-learned models can include multi-headed self-attention models (e.g., transformer models).
  • Example models 140 are discussed with reference to Figures 2-3.
  • the user computing device 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180.
  • the training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.
  • the training computing system 150 includes one or more processors 152 and a memory 154.
  • the one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 154 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the training computing system 150 to perform operations.
  • the training computing system 150 includes or is otherwise implemented by one or more server computing devices.
  • the training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors.
  • a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function).
  • Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions.
  • Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations.
  • performing backwards propagation of errors can include performing truncated backpropagation through time.
  • the model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
  • the model trainer 160 can train the models 120 and/or 140 based on a set of training data 162.
  • the training data 162 can include any type or manner of ground truth information, such as a ground truth intermediate representation, ground truth disentangled token streams, reconstructed intermediate representations, etc.
  • the model trainer 160 can train the models 120 and/or 140 in a supervised manner.
  • the models 120 and/or 140 can include explicit supervision heads (e.g., heads, such as classification or contrastive heads, that are explicitly supervised).
  • the explicit supervision heads can be combined with adversarial and/or mutual information losses.
  • pre-trained attribute encoder models can be utilized as “teacher” models.
  • a pre-trained attribute encoder can refer to a machine-learned encoder model that has been trained to generate an attribute-specific embedding based on an input.
  • an encoder model may be specifically trained to generate a tokenized representation of a prosody audio content attribute based on audio data or an intermediate representation of audio data.
  • a machine-learned disentanglement model, or a prosody audio content attribute head of a machine-learned disentanglement model can generate a disentangled token stream for the same prosody audio content attribute.
  • the machine-learned disentanglement model, or the prosody audio content attribute head of a machine-learned disentanglement model can be trained based on a loss function that evaluates a difference between the disentangled token stream and the token stream generated by the pre-trained “teacher” model. This process can be repeated on a per-attribute basis.
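  • A hedged sketch of that teacher-student setup follows; the cosine-distance objective and the callable interfaces are assumptions made for illustration, not the disclosed loss:

```python
import torch
import torch.nn.functional as F


def per_attribute_distillation_loss(student_subspaces: dict,
                                    teacher_encoders: dict,
                                    inputs: torch.Tensor) -> torch.Tensor:
    """Compare each attribute head's output against a frozen, pre-trained
    attribute encoder ("teacher") for the same attribute."""
    loss = inputs.new_zeros(())
    for attribute, teacher in teacher_encoders.items():
        with torch.no_grad():                    # teachers stay frozen
            target = teacher(inputs)             # attribute-specific embedding
        student = student_subspaces[attribute]   # disentangled stream for that attribute
        loss = loss + (1.0 - F.cosine_similarity(student, target, dim=-1)).mean()
    return loss
```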
  • the model trainer 160 can train the models 120 and/or 140 in an unsupervised manner.
  • the model trainer 160 can evaluate each attribute subspace determined through disentanglement with a proximity-based contrastive loss function that includes a loss term for each attribute. Unsupervised training will be discussed in greater detail with regard to Figure 2A.
  • [0049] Additionally, or alternatively, in some implementations, the model trainer 160 can train the models 120 and/or 140 in a self-supervised manner. For example, the model trainer 160 can fine-tune a generative model to produce audio conditioned on some side input label using a small labeled dataset, in a manner analogous to few-shot prompting with LLMs. Once fine-tuned, synthetic audio can be generated that is conditioned on all available labels, or combinations of labels if relevant.
  • the generative model can be re-trained using both the initial training data utilized to fine-tune the model and the synthetic data generated using the fine-tuned model. This can be performed across a number of iterations.
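  • Schematically, that self-supervised loop can be written as follows; `fine_tune` and `generate` stand in for whatever training and sampling routines the generative model exposes, and are not APIs from the disclosure:

```python
def self_training_loop(base_generator, labeled_data, labels,
                       fine_tune, generate, num_rounds: int = 3):
    """fine_tune(model, dataset) -> model; generate(model, label) -> labeled synthetic example."""
    model = fine_tune(base_generator, labeled_data)                  # few-shot style fine-tuning
    for _ in range(num_rounds):
        synthetic = [generate(model, label) for label in labels]     # label-conditioned audio
        model = fine_tune(base_generator, labeled_data + synthetic)  # retrain on both sets
    return model
```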
  • the training examples can be provided by the user computing device 102.
  • the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model.
  • the model trainer 160 includes computer logic utilized to provide desired functionality.
  • the model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor.
  • the model trainer 160 includes program files stored on a storage device, loaded into a memory and executed by one or more processors.
  • the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.
  • the network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links.
  • communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
  • the machine-learned models described in this specification may be used in a variety of tasks, applications, and/or use cases.
  • the input to the machine-learned model(s) of the present disclosure can be image data.
  • the machine-learned model(s) can process the image data to generate an output.
  • the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an image segmentation output.
  • the machine-learned model(s) can process the image data to generate an image classification output.
  • the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an upscaled image data output.
  • the machine-learned model(s) can process the image data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be text or natural language data.
  • the machine-learned model(s) can process the text or natural language data to generate an output.
  • the machine-learned model(s) can process the natural language data to generate a language encoding output.
  • the machine-learned model(s) can process the text or natural language data to generate a latent text embedding output.
  • the machine-learned model(s) can process the text or natural language data to generate a translation output.
  • the machine-learned model(s) can process the text or natural language data to generate a classification output.
  • the machine-learned model(s) can process the text or natural language data to generate a textual segmentation output.
  • the machine-learned model(s) can process the text or natural language data to generate a semantic intent output.
  • the machine-learned model(s) can process the text or natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.). As another example, the machine-learned model(s) can process the text or natural language data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be speech data.
  • the machine-learned model(s) can process the speech data to generate an output.
  • the machine-learned model(s) can process the speech data to generate a speech recognition output.
  • the machine-learned model(s) can process the speech data to generate a speech translation output.
  • the machine-learned model(s) can process the speech data to generate a latent embedding output.
  • the machine-learned model(s) can process the speech data to generate an encoded speech output (e.g., an encoded and/or compressed representation of the speech data, etc.).
  • the machine-learned model(s) can process the speech data to generate an upscaled speech output (e.g., speech data that is higher quality than the input speech data, etc.).
  • the machine-learned model(s) can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.).
  • the machine-learned model(s) can process the speech data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be latent encoding data (e.g., a latent space representation of an input, etc.).
  • the machine-learned model(s) can process the latent encoding data to generate an output.
  • the machine-learned model(s) can process the latent encoding data to generate a recognition output.
  • the machine-learned model(s) can process the latent encoding data to generate a reconstruction output.
  • the machine-learned model(s) can process the latent encoding data to generate a search output.
  • the machine-learned model(s) can process the latent encoding data to generate a reclustering output.
  • the machine-learned model(s) can process the latent encoding data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be statistical data.
  • Statistical data can be, represent, or otherwise include data computed and/or calculated from some other data source.
  • the machine-learned model(s) can process the statistical data to generate an output.
  • the machine-learned model(s) can process the statistical data to generate a recognition output.
  • the machine-learned model(s) can process the statistical data to generate a prediction output.
  • the machine-learned model(s) can process the statistical data to generate a classification output.
  • the machine-learned model(s) can process the statistical data to generate a segmentation output.
  • the machine-learned model(s) can process the statistical data to generate a visualization output.
  • the machine-learned model(s) can process the statistical data to generate a diagnostic output.
  • the input to the machine-learned model(s) of the present disclosure can be sensor data.
  • the machine-learned model(s) can process the sensor data to generate an output.
  • the machine-learned model(s) can process the sensor data to generate a recognition output.
  • the machine-learned model(s) can process the sensor data to generate a prediction output.
  • the machine-learned model(s) can process the sensor data to generate a classification output.
  • the machine-learned model(s) can process the sensor data to generate a segmentation output.
  • the machine-learned model(s) can process the sensor data to generate a visualization output.
  • the machine-learned model(s) can process the sensor data to generate a diagnostic output.
  • the machine-learned model(s) can process the sensor data to generate a detection output.
  • the machine-learned model(s) can be configured to perform a task that includes encoding input data for reliable and/or efficient transmission or storage (and/or corresponding decoding).
  • the task may be an audio compression task.
  • the input may include audio data and the output may comprise compressed audio data.
  • the input includes visual data (e.g., one or more images or videos), the output comprises compressed visual data, and the task is a visual data compression task.
  • the task may comprise generating an embedding for input data (e.g., input audio or visual data).
  • the input includes visual data and the task is a computer vision task.
  • the input includes pixel data for one or more images and the task is an image processing task.
  • the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class.
  • the image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest.
  • the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories.
  • the set of categories can be foreground and background.
  • the set of categories can be object classes.
  • the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value.
  • the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
  • the input includes audio data representing a spoken utterance and the task is a speech recognition task.
  • the output may comprise a text output which is mapped to the spoken utterance.
  • the task comprises encrypting or decrypting input data.
  • the task comprises a microprocessor performance task, such as branch prediction or memory address translation.
  • Figure 1A illustrates one example computing system that can be used to implement the present disclosure. Other computing systems can be used as well.
  • the user computing device 102 can include the model trainer 160 and the training dataset 162.
  • the models 120 can be both trained and used locally at the user computing device 102.
  • the user computing device 102 can implement the model trainer 160 to personalize the models 120 based on user-specific data.
  • Figure 1B depicts a block diagram of an example computing device 10 that performs training of a machine-learned disentanglement model according to example embodiments of the present disclosure.
  • the computing device 10 can be a user computing device or a server computing device.
  • the computing device 10 includes a number of applications (e.g., applications 1 through N).
  • Each application contains its own machine learning library and machine-learned model(s).
  • each application can include a machine-learned model.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components.
  • each application can communicate with each device component using an API (e.g., a public API).
  • FIG. 1C depicts a block diagram of an example computing device 50 that performs machine-learned disentanglement of an intermediate representation according to example embodiments of the present disclosure.
  • the computing device 50 can be a user computing device or a server computing device.
  • the computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
  • the central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 1C, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50.
  • [0070] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing device 50.
  • the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components.
  • the central device data layer can communicate with each device component using an API (e.g., a private API).
  • Figure 2A depicts a data flow diagram for unsupervised training of a machine-learned tokenization model according to some implementations of the present disclosure. More specifically, audio data 202 can be obtained that includes audio 204 of a particular type. An audio “type” can refer to a classification, or general descriptor, for the content of audio data.
  • a type of audio may refer to music, or may refer to a more granular classification of music (e.g., a specific genre of music, music from a particular artist, etc.).
  • a type of audio may also refer to a podcast, speech, recording, spoken utterance, audio livestream, and/or any other manner of audio (at any level of granularity).
  • data 206 indicative of the audio 204 can be obtained.
  • the data 206 can be any type or manner of data that indicates the audio 204, such as textual data, image data, video data, etc.
  • the data 206 can be textual data that describes the audio of the particular type.
  • the data 206 can be a review written of the song, the song title, a description of the song, a transcript for an interview in which the creator of the song discusses it, etc.
  • the audio 204 is an audio track from audiovisual data, such as a movie
  • the data 206 indicative of the audio 204 can be the corresponding video data.
  • the audio data 202 that includes the audio 204, and/or the data 206 indicative of the audio 204 can be processed with the representation generator 208 to obtain an intermediate representation 210.
  • the representation generator 208 can process the audio data 202 to generate an intermediate representation 210 of the audio data 202.
  • the representation generator 208 can process the data 206 indicative of the audio 204 to generate an intermediate representation 210 of the data 206. In some implementations, the representation generator 208 can process both the audio data 202 and the data 206 to generate two intermediate representations that respectively represent the audio data 202 and the data 206 (not illustrated). In some implementations, the representation generator 208 can process both the audio data 202 and the data 206 indicative of the audio 204 to generate an intermediate representation 210 that represents both the data 206 and the audio data 202.
  • [0074] In some implementations, the representation generator 208 can be one or more machine-learned model(s) (e.g., encoder model(s), transformer model(s), etc.).
  • the representation generator 208 may be an encoder model trained to generate intermediate representations of audio data.
  • the representation generator 208 may be an encoder model trained to generate intermediate representations of the data 206.
  • the representation generator 208 may be, or otherwise include, multiple encoder models trained to generate intermediate representations for multiple modalities of data (e.g., image data, audio data, textual data, etc.).
  • the representation generator 208 can include one or more portion(s), or layer(s) of machine-learned model(s).
  • the multiple encoder models can be encoder portions from pre-trained foundational models.
  • the representation generator 208 may include a pre-trained language encoder from an LLM, a pre-trained audio encoder from a foundational audio model, a pre-trained vision encoder from a foundational vision model, etc.
  • the intermediate representation 210 can be generated via machine learning techniques. In these instances, the intermediate representation 210 can be an embedding, vector representation, matrix, etc. Alternatively, in some implementations, the representation generator 208 can generate some other manner of intermediate representation 210 via techniques unrelated to machine learning. As one example, the representation generator 208 may be a spectrogram creator, and the intermediate representation 210 can be a spectrogram representation of the audio data 202.
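  • As a small sketch of the non-machine-learning case mentioned above, a magnitude spectrogram can serve as the intermediate representation 210; the STFT parameters here are arbitrary illustrative choices:

```python
import numpy as np
from scipy.signal import stft


def spectrogram_representation(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Return a (freq_bins, frames) magnitude spectrogram of a mono waveform."""
    _, _, Zxx = stft(audio, fs=sample_rate, nperseg=512, noverlap=384)
    return np.abs(Zxx)
```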
  • intermediate representation 210 can include entangled token representations of input(s) to the representation generator 208.
  • the intermediate representation 210 can include a token stream that includes tokens for multiple types of audio content attributes that are mixed together.
  • tokens can be generated by the representation generator 208 in a mixed fashion, and then disentangled with a machine-learned disentanglement model 212.
  • the intermediate representation 210 can be some other manner of intermediate representation (e.g., a vector representation, etc.) and the machine-learned disentanglement model 212 can process the intermediate representation 210 to generate disentangled token streams.
  • the intermediate representation 210 can be processed by the machine-learned disentanglement model 212 to generate disentangled token streams 214A, 214B, and 214C (generally, disentangled token streams 214).
  • the disentangled token streams 214 can be disentangled streams of representational tokens that have been disentangled into subspaces that correspond to particular audio content attributes. For example, if the audio 204 is audio from a film, token stream 214A may correspond to a dialogue audio content attribute, token stream 214B may correspond to a special effects audio content attribute, and token stream 214C may correspond to a soundtrack audio content attribute.
  • the token stream 214A may correspond to an utterance-scale audio content attribute
  • the token stream 214B may correspond to a word-level audio content attribute
  • the token stream 214C may correspond to a syllable-scale or phoneme-scale audio content attribute.
  • the intermediate representation 210 can be tokenized by the machine-learned tokenization model 211. For example, if the intermediate representation 210 does not include tokens (e.g., a spectrogram representation, etc.), the machine-learned tokenization model 211 can process the intermediate representation 210 to generate an entangled token stream 213.
  • the entangled token stream 213 can refer to a series of tokens that generally serve as a tokenized representation of the inputs to the representation generator 208. However, the tokens of the entangled token stream 213 can be entangled such that tokens associated with an audio content attribute may be mixed with tokens associated with some other audio content attribute.
  • the entangled token stream 213 can be processed with the machine-learned disentanglement model 212 to generate the disentangled token streams 214.
  • the machine-learned tokenization model 211 can be a portion of the machine-learned disentanglement model 212.
  • the machine-learned tokenization model 211, the machine-learned disentanglement model 212, and/or the representation generator 208 may be portions of the same machine-learned model.
  • the machine-learned tokenization model 211 can be a separate model from either the machine-learned disentanglement model 212 or the representation generator 208 (e.g., a pre-trained tokenizer, etc.).
  • the audio 204 is speech audio.
  • the disentangled token stream 214A can be a token stream for an utterance-level subspace.
  • the contrastive loss term 216A can evaluate target frame pairs (e.g., pairs of positive and negative examples) at an utterance scale.
  • the disentangled token stream 214B can be a token stream for a word-level subspace.
  • the contrastive loss term 216B can evaluate target frame pairs at a word scale (e.g., between 1 second of each other).
  • the disentangled token stream 214C can be a token stream for a syllable-level subspace.
  • the contrastive loss term 216C can evaluate target frame pairs at a syllable scale (e.g., between about 250ms of each other).
  • the disentangled token stream 214D can be a token stream for a phoneme-level subspace.
  • the contrastive loss term 216D can evaluate target frame pairs at a phoneme scale (e.g., between about 70ms of each other).
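  • A minimal sketch of such a proximity-based contrastive term for one subspace, assuming PyTorch; the InfoNCE-style form, the temperature, and the exact window handling are illustrative assumptions, with the windows taken from the scales above:

```python
import torch
import torch.nn.functional as F

# Approximate positive-pair windows per subspace, in seconds; float("inf")
# treats the whole clip as one utterance-scale neighborhood.
SCALE_WINDOWS_S = {"utterance": float("inf"), "word": 1.0,
                   "syllable": 0.25, "phoneme": 0.07}


def proximity_contrastive_loss(frames: torch.Tensor, times_s: torch.Tensor,
                               window_s: float, temperature: float = 0.1) -> torch.Tensor:
    """frames: (T, d) subspace embeddings; times_s: (T,) frame times in seconds."""
    z = F.normalize(frames, dim=-1)
    sim = z @ z.t() / temperature                    # (T, T) pairwise similarities
    self_mask = torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # a frame is never its own positive
    dt = (times_s[:, None] - times_s[None, :]).abs()
    positives = (dt <= window_s) & ~self_mask        # temporally close, distinct frames
    return -sim.log_softmax(dim=-1)[positives].mean()
```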
  • the machine-learned disentanglement model 212 can include a quantization portion 215.
  • the quantization portion 215 can include one or more model layer(s) that can discretize outputs of the machine-learned disentanglement model 212.
  • the quantization portion 215 can be, or otherwise include, neural network layer(s) (or similar) that apply within-network clustering of tokens output by the machine-learned disentanglement model 212 across attribute subspaces corresponding to the disentangled token streams 214.
  • the quantization portion 215 may cluster two or more tokens from the machine-learned disentanglement model 212 to form the disentangled token stream 214A based on a similarity between the tokens.
  • the quantization portion 215 can be trained based on entropy loss functions that evaluate different objectives.
  • the quantization portion 215 can be trained based on a first clustering loss function and a second clustering loss function.
  • the first clustering loss function can evaluate satisfaction of a first objective to encourage confident clustering by evaluating a first average, across the plurality of tokens, of a respective first entropy of a probability distribution associated with each token (e.g., mean per-example entropy).
  • the probability distribution can describe respective probabilities (e.g., confidences) of the token belonging to a particular subspace.
  • the second clustering loss function can evaluate satisfaction of a second objective to encourage diversity of cluster assignments by evaluating a second entropy of a second average of the probability distributions for all tokens (e.g., entropy of a batch average distribution).
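  • The two objectives above can be sketched as entropy terms over soft subspace-assignment distributions; the softmax-over-logits setup and the equal weighting of the two terms are assumptions for illustration:

```python
import torch


def clustering_losses(assignment_logits: torch.Tensor, eps: float = 1e-8):
    """assignment_logits: (num_tokens, num_clusters) unnormalized assignment scores."""
    probs = assignment_logits.softmax(dim=-1)                    # per-token distributions
    # First objective: confident clustering -> minimize mean per-token entropy.
    per_token_entropy = -(probs * (probs + eps).log()).sum(dim=-1).mean()
    # Second objective: diverse cluster usage -> maximize entropy of the batch average.
    avg_probs = probs.mean(dim=0)
    batch_entropy = -(avg_probs * (avg_probs + eps).log()).sum()
    return per_token_entropy, -batch_entropy                     # both returned as terms to minimize
```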
  • the quantization portion 215 can perform scalar binary quantization. More specifically, the quantization portion 215 can quantize an input vector to generate a set of bits. The set of bits can be processed by the quantization portion 215 to generate a prediction output that predicts a probability of each bit being “on” or “off” given an input X. Tokens can be clustered, or assigned to a particular subspace / token stream, based on the prediction outputs.
  • Figure 2B is a data flow diagram for performing quantization with the quantization portion 215 of the machine-learned disentanglement model 212 of Figure 2A according to some implementations of the present disclosure.
  • the machine-learned disentanglement model 212 can include a quantization portion 215 that can directly output disentangled token streams without the need for additional processing, such as K-means clustering.
  • the quantization portion 215 can receive an input vector 222.
  • the input vector 222 can be, include, or otherwise be derived from the intermediate representation 210.
  • the input vector 222 can be a vector representation of the audio data 202.
  • the input vector 222 can be processed with an encoder submodel 224 of the quantization portion 215 to obtain a quantized bit representation 226.
  • the quantized bit representation 226 can predict a probability of each bit being “on” given the input vector 222. Specifically, each bit can be represented as $b_i \in \{0, 1\}$, $i \in 1, \dots, N$.
  • the encoder submodel 224 can predict a quantized bit representation 226 of the input vector 222.
  • the quantized bit representation 226 can be represented as the bit vector $b = (b_1, \dots, b_N)$, with each bit derived from the corresponding predicted probability $p_i = p(b_i = 1 \mid x)$.
  • the decoder submodel 228 can process the quantized bit representation 226 to generate output 230.
  • a straight-through estimator can be utilized to estimate bits from probabilities as $\hat{b}_i = p_i + \operatorname{sg}\!\left(\mathbb{1}[p_i > 0.5] - p_i\right)$, $i \in 1, \dots, N$, where $\operatorname{sg}$ is the stop gradient operator.
  • the decoder submodel 228 can “see” the hard outputs $b_i$ (e.g., the quantized bit representation 226) from the encoder submodel 224, as is done at inference, while retaining the capability to pass gradients from the decoder to the probability outputs of the encoder.
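  • The straight-through trick itself is compact; a sketch using `detach` as the stop-gradient operator, with the 0.5 threshold carried over from the reconstruction above:

```python
import torch


def straight_through_bits(probs: torch.Tensor) -> torch.Tensor:
    """probs: per-bit probabilities p_i in [0, 1]; returns hard bits in the forward
    pass while letting gradients flow straight through to the probabilities."""
    hard_bits = (probs > 0.5).float()            # b_i = 1[p_i > 0.5]
    return probs + (hard_bits - probs).detach()  # forward value: hard bits; gradient: d/dp = 1
```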
  • the quantization portion 215 can be trained using a quantization loss function 232 that evaluates the output 230 and the input vector 222 (or the input from which the input vector 222 is derived). For example, assume that the input vector 222 is represented as variable x.
  • the input x is quantized by the encoder submodel 224 into the quantized bit representation 226, which can include N bits per frame, and each bit can be represented by a random variable (e.g., a Bernoulli random variable) B_i, i ∈ 1, ..., N.
  • the prior probability of B_i being “on” can be represented as p̄_i = P(B_i = 1), and the conditional probability of B_i being “on” given the input x can be represented as p_i = P(B_i = 1 | x).
  • the quantization loss function 232 can include a certainty loss term 234.
  • the certainty loss term 234 can be, evaluate, or otherwise represent, a mean-squared error between probability and its corresponding bit, averaged over examples and bits.
  • the certainty loss term 234 can be represented as L_certainty = (1/N) Σ_i E_x[(p_i - b_i)²].
  • the certainty loss term 234 can pull the conditional probabilities of the quantized bit representation 226 toward the edges (i.e., toward 0 or 1) and thus decrease / minimize the entropy of the conditional binary variable.
  • the quantization loss function 232 can include a utilization loss term 236.
  • the utilization loss term 236 can be, evaluate, or otherwise represent, a squared error between prior probability for each bit and 0.5, where prior probability is calculated with Exponential Moving Average (EMA) over batches using a decay, with the losses being averaged over the bits.
  • the utilization loss term 236 can be represented as L_utilization = (1/N) Σ_i (p̄_i - 0.5)², where p̄_i is updated with an exponential moving average over batches with decay γ, e.g., p̄_i ← γ · p̄_i + (1 - γ) · E_x[p_i].
  • the utilization loss term 236 can “pull” the mean probability to 0.5 and thus maximize the entropy of the unconditional/prior distribution.
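  • The certainty and utilization terms described above could be sketched as follows; the exact exponential-moving-average bookkeeping (which quantities are detached and the decay value) is an assumption made for illustration.

      import torch

      def certainty_loss(probs: torch.Tensor, bits: torch.Tensor) -> torch.Tensor:
          # Squared error between each conditional probability and its hard bit,
          # averaged over examples and bits; pulls probabilities toward 0 or 1.
          return ((probs - bits) ** 2).mean()

      def utilization_loss(probs: torch.Tensor, ema_prior: torch.Tensor, decay: float = 0.99):
          # Blend the running prior (treated as a constant) with the current batch
          # mean, then penalize each bit's deviation from a prior of 0.5.
          batch_mean = probs.mean(dim=0)
          prior = decay * ema_prior.detach() + (1.0 - decay) * batch_mean
          loss = ((prior - 0.5) ** 2).mean()
          return loss, prior.detach()  # updated prior carried to the next batch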
  • the quantization loss function 232 can include a pairwise independence loss term 238.
  • the pairwise independence loss term 238 can be, evaluate, or otherwise represent a squared error between conditional probabilities and 0.5 for bit i given bit j is zero or one, weighted by the probability of bit j being zero or one (priors).
  • the pairwise independence loss term 238 can be represented as L_independence = Σ_{i ≠ j} Σ_{v ∈ {0, 1}} P(B_j = v) · (P(B_i = 1 | B_j = v) - 0.5)².
  • To compute the conditional probabilities, a binary mask can be utilized over examples when bit j is 0 and when bit j is 1.
  • the quantization loss function 232 can include a pairwise uncorrelatedness loss term 240.
  • the pairwise uncorrelatedness loss term 240 can be, evaluate, or otherwise represent, a squared error between joint probabilities and the product of mean probabilities for pairs of bits i and j, for i ≠ j.
  • the pairwise uncorrelatedness loss term 240 can calculate probabilities with EMA over batches.
  • the joint probability evaluated by the pairwise uncorrelatedness loss term 240 can utilize a decay for EMA, and the mean probability evaluated by the pairwise uncorrelatedness loss term 240 can utilize another decay to update expectations (e.g., the p̄_i terms).
  • the pairwise uncorrelatedness loss term 240 can be represented as L_uncorrelatedness = Σ_{i ≠ j} (P(B_i = 1, B_j = 1) - p̄_i · p̄_j)².
  • the targets for the pairwise independence loss term 238 and the pairwise uncorrelatedness loss term 240 can be replaced with expected values of 0.5 and 0.25, respectively. This substitution can be made to provide more optimal targets for the pairwise independence loss term 238 and the pairwise uncorrelatedness loss term 240.
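  • A hedged sketch of the two pairwise terms follows, using batch statistics in place of EMA for brevity, masking examples by the value of bit j to estimate conditional probabilities, approximating the joint probability by the product of per-example probabilities, and using the simplified targets of 0.5 and 0.25 noted above; none of these simplifications are asserted to match the disclosed implementation.

      import torch

      def pairwise_losses(probs: torch.Tensor, bits: torch.Tensor, eps: float = 1e-8):
          """probs, bits: [batch, num_bits] with num_bits >= 2."""
          n = probs.shape[1]
          indep_terms, uncorr_terms = [], []
          for j in range(n):
              for v in (0.0, 1.0):
                  mask = (bits[:, j] == v).float()                  # examples where bit j == v
                  weight = mask.mean()                              # estimate of P(B_j = v)
                  cond = (probs * mask.unsqueeze(1)).sum(0) / (mask.sum() + eps)
                  for i in range(n):
                      if i != j:
                          # squared error between P(B_i = 1 | B_j = v) and the 0.5 target
                          indep_terms.append(weight * (cond[i] - 0.5) ** 2)
          for i in range(n):
              for j in range(n):
                  if i != j:
                      joint = (probs[:, i] * probs[:, j]).mean()    # estimate of P(B_i = 1, B_j = 1)
                      uncorr_terms.append((joint - 0.25) ** 2)      # 0.25 = ideal product of priors
          return torch.stack(indep_terms).mean(), torch.stack(uncorr_terms).mean()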
  • the quantization loss function 232 may only include some, or may include none, of the loss terms 234-240.
  • the quantization loss function 232 may include loss terms 234, 236, and 238.
  • the quantization loss function 232 may include loss terms 234, 236, and 240.
  • neighboring frames of the input vector 222, or the input from which the input vector 222 is derived can be quantized together. For example, assume that the input vector 222 is an input tensor.
  • the input tensor can be reshaped before processing with the encoder submodel 224 to combine multiple neighboring frames together.
  • the pre-output of the encoder submodel 224 has a shape of 250 x 1024 where 250 is the number of frames and 1024 is the output/encoding dimension.
  • the encoding dimensions can be mapped to 12 using a projection layer with 1024 x 12 size.
  • this is a relatively “small”, and thus inefficient, matrix size.
  • the frames before the current projection can be combined and quantized together. For example, if 10 frames are combined, the input tensor can be reshaped to 25 x 10240 dimensions, which can reduce the frames and increase the encoding dimension.
  • 10240 dimensional data can be projected to 120 bits using a matrix with 10240 x 120 dimensions.
  • 120 can be the number of bits used to represent 10 frames, so the overall bits-per-second can be the same as using 12 bits per frame.
  • the quantized representation can include a reduced number of frames (e.g., 25 frames).
  • a 25 x 10240 tensor representing the mapped space can be reshaped to 250 x 1024 dimensions, which is the tensor on which the decoder submodel 228 operates.
  • this provides the benefits of quantizing frames together to achieve superior performance while utilizing a smaller memory footprint for weights than conventional approaches.
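  • Using the example numbers above (250 frames, 1024-dimensional encodings, groups of 10 frames, 120 bits per group, i.e., the same 12 bits per frame overall), the reshaping scheme could be sketched as follows; the linear projections and the straight-through step are illustrative assumptions.

      import torch
      import torch.nn as nn

      frames, dim, group, bits_per_group = 250, 1024, 10, 120
      encoder_proj = nn.Linear(dim * group, bits_per_group)      # 10240 -> 120
      decoder_proj = nn.Linear(bits_per_group, dim * group)      # 120 -> 10240

      x = torch.randn(frames, dim)                               # encoder pre-output, 250 x 1024
      grouped = x.reshape(frames // group, dim * group)          # 25 x 10240
      probs = torch.sigmoid(encoder_proj(grouped))               # per-bit probabilities, 25 x 120
      bits = probs + ((probs > 0.5).float() - probs).detach()    # straight-through hard bits
      decoded = decoder_proj(bits)                               # 25 x 10240 mapped space
      y = decoded.reshape(frames, dim)                           # 250 x 1024 for the decoder submodel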
  • the disentangled token streams 214 can be evaluated using corresponding contrastive loss terms 216A, 216B, and 216C (generally, contrastive loss terms 216).
  • the contrastive loss terms 216 may refer to constituent loss “terms” of a singular loss function, or may refer to multiple loss functions. More generally, the contrastive loss terms 216 refer to some function or group of functions that can evaluate pairs of examples (e.g., a pair including a positive example and a negative example) based on some semantically-correlated notion of similarity to learn a representation that groups positive pairs.
  • the contrastive loss terms 216 can be collectively evaluated and utilized to train the machine-learned disentanglement model to disentangle the disentangled token streams 214 into discrete subspaces.
  • mutual information loss terms (not illustrated) between the disentangled token streams 214 can also be evaluated to train the machine-learned disentanglement model 212 to more completely disentangle the disentangled token streams 214.
  • a mutual information loss term can refer to a loss term that, when evaluated and utilized to train a model or portion of a model, trains the model by maximizing mutual information shared between disentangled token representations. By maximizing such mutual information, the machine-learned disentanglement model can be encouraged to more completely disentangle the disentangled token streams 214.
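  • As one common instantiation of a per-attribute contrastive loss term of the kind described above (not necessarily the formulation used here), an InfoNCE-style term that pulls an attribute subspace embedding toward a positive example and away from negatives could look like the following; the cosine similarity, temperature, and batch layout are assumptions.

      import torch
      import torch.nn.functional as F

      def subspace_contrastive_loss(anchor, positive, negatives, temperature: float = 0.1):
          """anchor, positive: [batch, dim]; negatives: [batch, num_neg, dim]."""
          pos_sim = F.cosine_similarity(anchor, positive, dim=-1) / temperature
          neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / temperature
          logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)  # positive at index 0
          targets = torch.zeros(anchor.shape[0], dtype=torch.long)
          return F.cross_entropy(logits, targets)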
  • the machine-learned decoder model 218 can process the disentangled token streams 214 to generate a reconstructed intermediate representation 220. More specifically, the machine-learned decoder model 218 can “invert” the disentangled token streams 214 to generate the reconstructed intermediate representation 220. For example, if the machine-learned disentanglement model 212 is a transformer stack, the machine-learned decoder model 218 may be an inverter transformer stack that inverts the disentangled token streams 214. In some implementations, the disentangled token streams 214 can be stacked and then inverted via the machine-learned decoder model 218. The reconstructed intermediate representation 220 can represent the same information represented by the intermediate representation 210.
  • the reconstructed intermediate representation 220 can be perceptually equivalent to the intermediate representation 210, although the reconstructed intermediate representation 220 may not be identical to the intermediate representation 210.
  • the intermediate representation 210 and the reconstructed intermediate representation 220 are both processed with a generative audio decoder to generate audio data. If the audio data from both representations is played with an audio playback device, the audio from both representations will be perceptibly the same. In this manner, the disentangled token streams 214 serve as a compressed representation of the intermediate representation 210.
  • the machine-learned disentanglement model 212 can be utilized to efficiently compress intermediate representations.
  • the reconstructed intermediate representation can be processed by various generative models to produce a generative output. Various use cases for the reconstructed intermediate representation 220 will be discussed in greater detail with regard to Figure 3.
  • FIG 3 is a block diagram for a computing system 300 that disentangles and tokenizes representations for generative tasks and/or compression tasks according to some implementations of the present disclosure.
  • a computing system 300 can be a computing system that provides compression services and/or generative services (e.g., a cloud computing system, a network compute node, a server computing system, etc.).
  • the computing system 300 can receive information from a requesting entity 302.
  • the requesting entity 302 can provide an input 304 to the computing system.
  • the input 304 can be any type or manner of input data.
  • the input 304 can be audio data, or data indicative of audio data (e.g., image data, textual data, etc.).
  • the computing system 300 can process the input 304 with an encoder model 306 to generate an intermediate representation 308.
  • the encoder model 306 can process the input 304 to generate an intermediate representation 308 of the audio data.
  • the requesting entity 302 can directly provide the intermediate representation 308 of the input 304 to the computing system 300.
  • the input 304 can include additional information that indicates particular sub-space(s) over which to disentangle the intermediate representation 308.
  • the input 304 may include additional information indicating a prosody subspace, an inflection subspace, a tone subspace, an utterance-scale subspace, a phoneme-scale subspace, etc.
  • the requesting entity 302 can provide a compression request 310 to the computing system 300.
  • the compression request 310 can request that the computing system generate a compressed representation of the input 304.
  • the compression request 310 can also indicate to the computing system 300 to store the compressed representation, and/or return the compressed representation.
  • the compression request 310 can indicate to the computing system 300 to transmit the compressed representation to a compression data store 312 (e.g., a data store associated with the requesting entity 302, a data store associated with the computing system 300, etc.).
  • the compression request 310 can request that the computing system 300 return the compressed representation to the requesting entity 302.
  • the requesting entity can provide a generative request 314 to the computing system 300.
  • the generative request 314 can specify particular generative task(s) to perform with the information provided by the requesting entity 302 to the computing system 300.
  • the generative request 314 can indicate a generative audio task to generate the audio based on the input 304.
  • the generative request 314 may indicate a text generation task to generate text descriptive of the audio.
  • the generative request 314 can include adjustment parameters 316.
  • the adjustment parameters 316 can be task-specific parameters to modify certain characteristics of particular disentangled token stream(s). For example, if tokens are disentangled to a prosody subspace, the adjustment parameter(s) 316 can indicate that the degree of prosody is to be adjusted, or the words to which a prosody effect is applied.
  • the computing system 300 can include a machine-learned disentanglement model 318.
  • the machine-learned disentanglement model 318 can process the intermediate representation 308, and/or the input 304, to generate disentangled tokens 320 as described with regards to Figure 2.
  • the disentangled tokens 320 can include disentangled token streams 320A, 320B, and 320C.
  • the disentangled tokens 320 can serve as a compressed representation of the intermediate representation 308 by excluding (or more efficiently representing) some or all of the non-perceptible information included in the intermediate representation 308.
  • if the computing system 300 receives the compression request 310 from the requesting entity 302, the computing system 300 can return the disentangled tokens 320 to the requesting entity 302 and/or can store the disentangled tokens 320 in the compression data store 312.
  • if the computing system 300 receives the generative request 314 from the requesting entity 302, the computing system 300 can perform a generative task using the disentangled tokens 320. To do so, the computing system 300 can include a generative task controller 322.
  • the computing system 300 can include machine-learned generative model(s) 326.
  • the machine-learned generative model(s) 326 can be model(s) trained to perform various generative task(s).
  • the machine-learned generative model(s) 326 can include a single foundational model trained to process token streams to generate multiple types of outputs (e.g., audio data, image data, textual data, etc.).
  • the foundational model can be a model that can perform audio data tasks, image data tasks, and/or textual data tasks.
  • the machine-learned generative model(s) 326 can include multiple foundational models each trained to process disentangled tokens to perform multiple tasks for a particular type of data.
  • the machine-learned generative model(s) 326 can include a LLM that can process token streams to perform multiple types of language tasks that each generate a language output.
  • the machine-learned generative model(s) 326 can include a foundational vision model that can process token streams to perform multiple types of vision tasks that each generate a vision output.
  • the machine-learned generative model(s) 326 can include a foundational audio model that can process token streams to perform multiple types of audio tasks that each generate an audio output. If the generative request 314 includes the adjustment parameters 316, the machine-learned generative model(s) 326 can process the adjusted disentangled tokens 324 to generate a generative output 328. Alternatively, if the generative request 314 does not include the adjustment parameters 316, the machine-learned generative model(s) 326 can process the disentangled tokens 320 to generate the generative output 328.
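  • To make the request-handling flow concrete, the following sketch shows one way a system like computing system 300 might dispatch compression and generative requests; every class name, field, and the apply_adjustments helper is hypothetical and introduced only for illustration.

      from dataclasses import dataclass
      from typing import Optional

      @dataclass
      class CompressionRequest:
          store: bool = True             # store tokens in the compression data store
          return_tokens: bool = False    # return tokens to the requesting entity

      @dataclass
      class GenerativeRequest:
          task: str                                      # e.g., "generate_audio", "generate_text"
          adjustment_parameters: Optional[dict] = None   # per-subspace edits (e.g., prosody degree)

      def apply_adjustments(token_streams, params):
          # Hypothetical helper: modify the targeted subspace's tokens according to
          # the requested characteristic changes (details depend on the attribute).
          return token_streams

      def handle_request(intermediate_rep, request, disentangler, generators, data_store):
          token_streams = disentangler(intermediate_rep)   # per-attribute token streams
          if isinstance(request, CompressionRequest):
              if request.store:
                  data_store.put(token_streams)
              return token_streams if request.return_tokens else None
          if isinstance(request, GenerativeRequest):
              if request.adjustment_parameters:
                  token_streams = apply_adjustments(token_streams, request.adjustment_parameters)
              return generators[request.task](token_streams)
          raise ValueError("unsupported request type")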
  • the generative output 328 can correspond to the type of task performed by the machine-learned generative model(s) 326.
  • the particular task performed by the machine-learned generative model(s) 326, and/or the specific model used to perform the particular task can be specified by task instructions 330. More specifically, the task instructions 330 can specify a particular task to perform and one or more model(s) of the machine-learned generative model(s) 326 to be used to perform the task. In some implementations, the task instructions 330 can be generated by the generative task controller 322 based on a variety of factors (e.g., previous generative requests 314, the type of input 304, the adjustment parameters 316, etc.).
  • the task instructions 330 can be included in the generative request 314 and can be provided to the machine-learned generative model(s) 326 for processing.
  • the machine-learned generative model(s) 326 can process additional input tokens 332.
  • the additional input tokens 332, also referred to as “side tokens,” can be tokens that are not derived from the input 304 or the intermediate representation 308. For example, assume that the input 304 is audio data that includes audio of a popular song. If the generative request 314 indicates a generative text task, the additional input tokens 332 may be tokens derived from the text of reviews for the popular song, lyrics from the song, social media discussions of the song or the artist associated with the song, etc.
  • FIG. 4 is a data flow diagram 400 for disentanglement of an intermediate representation of speech audio at prescribed time scales according to some implementations of the present disclosure.
  • an intermediate representation 402 for speech audio can be obtained.
  • the speech audio can be audio of spoken utterances produced by a human, or audio of simulated spoken utterances from a simulated human.
  • speech audio can include a variety of audio content attributes, such as speakers, channels, emotions, phonetics, and prosodics. However, such attributes are optimally encoded at different time scales.
  • a machine-learned disentanglement model 404 can generate a disentangled token stream 406.
  • the disentangled token stream 406 can include multiple subspaces 408A-408E (generally, subspaces 408). Additionally, the subspaces can be structured in accordance with varying time scales. To follow the depicted example, tokens for the speaker subspace 408A, channel subspace 408B, and emotion subspace 408C can be disentangled on an utterance-level time scale. Tokens for the phonetic subspace 408D and the prosodic subspace 408E can be disentangled on a frame-level time scale. In this manner, sufficient information can be captured for phonetic and prosodic audio content attributes while substantially reducing the capture of non-perceptible, extraneous information for speaker, channel, and emotion attributes.
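  • A simplified sketch of emitting token streams at the two time scales in the depicted example, with utterance-level subspaces (speaker, channel, emotion) pooled to a single token per utterance and frame-level subspaces (phonetic, prosodic) kept per frame; the dimension slices and the use of average pooling are assumptions.

      import torch

      def multi_scale_tokenize(encodings: torch.Tensor, subspaces: dict):
          """encodings: [frames, dim]; subspaces maps attribute -> (dimension slice, scale)."""
          streams = {}
          for name, (dims, scale) in subspaces.items():
              sub = encodings[:, dims]
              if scale == "utterance":
                  streams[name] = sub.mean(dim=0, keepdim=True)  # one token per utterance
              else:
                  streams[name] = sub                            # one token per frame
          return streams

      enc = torch.randn(250, 64)  # 250 frames of 64-dimensional encodings
      subspaces = {
          "speaker": (slice(0, 8), "utterance"),
          "channel": (slice(8, 16), "utterance"),
          "emotion": (slice(16, 24), "utterance"),
          "phonetic": (slice(24, 44), "frame"),
          "prosodic": (slice(44, 64), "frame"),
      }
      streams = multi_scale_tokenize(enc, subspaces)  # streams["speaker"].shape == (1, 8)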
  • each contrastive loss can be viewed as applying to the entire encoding space, but with a binary-valued mask applied that only admits the restricted subspace to contribute to the distance calculation of each loss function.
  • arbitrary real-valued masking functions can be utilized to enable each dimension to contribute to one or more loss functions. This can alternatively be viewed as a structured dropout mechanism to encourage registering individual encoding dimensions with target properties of interest.
  • a more continuous association of each dimension can be adopted that leaves hard assignments to inference time.
  • a map M can be predefined such that M: [0, 1] → [0, 1]^D maps from lag to the collection of mask functions over the D disentangled encoding dimensions.
  • M(t) = [m_1(t), ..., m_D(t)] can be defined as the quantized fixed-width Gaussian kernel, where each mask weight m_d(t) is given by a fixed-width Gaussian centered at the encoding dimension associated with lag t.
  • the resulting encoding dimensions can encode properties at a given time scale and contiguous subsets of dimensions will have similar behavior to the hard boundary case described previously.
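  • One plausible realization of the map M(t), under the assumption that each encoding dimension is assigned a normalized position in [0, 1] and the mask is a fixed-width Gaussian centered at lag t; the width parameter and normalization are assumptions, since the exact kernel is not reproduced here.

      import torch

      def gaussian_mask(t: float, num_dims: int, sigma: float = 0.1) -> torch.Tensor:
          """M(t): [0, 1] -> [0, 1]^D, a fixed-width Gaussian over encoding dimensions."""
          positions = torch.arange(num_dims, dtype=torch.float32) / max(num_dims - 1, 1)
          return torch.exp(-0.5 * ((positions - t) / sigma) ** 2)

      # A small lag weights the earliest dimensions; the masked squared distance shows
      # how only the admitted dimensions contribute to a contrastive loss.
      mask = gaussian_mask(t=0.1, num_dims=64)
      a, b = torch.randn(64), torch.randn(64)
      masked_distance = (mask * (a - b) ** 2).sum()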
  • while speech is a well-studied and highly constrained audio domain, application of the previously described techniques to the general audio domain can present a more difficult task. In particular, determining the list of attributes over which to disentangle can be non-trivial.
  • implementations of the present disclosure propose two broad categories with different modeling imperatives: (1) sound textures of indefinite length that present little opportunity for constrained duration modeling (e.g., air conditioning, engine idling, etc.), and (2) isolated events generated via a process with modellable duration constraints (e.g., a doorbell ring, a dog bark, etc.).
  • a time period around 1 second can serve as an approximate duration threshold that discriminates between the two.
  • disentangling by time scale can also be applied to differentiating broad classes of sound events. For example, modifications in the clip-level subspace may tend to affect background sound textures, while modifications in the local-time subspace may tend to affect isolated events.
  • FIG. 5 depicts a flow chart diagram of an example method to perform disentanglement of tokens for generative tasks according to some implementations of the present disclosure.
  • a computing system can obtain an intermediate representation.
  • the intermediate representation can represent audio data that includes audio of a particular type.
  • the audio of the particular type can include a plurality of audio content attributes.
  • the intermediate representation can represent data indicative of audio of the particular type.
  • the intermediate representation can represent a transcript of audio data that includes audio of a conversation.
  • the intermediate representation can represent an image that depicts the transcript described previously.
  • the computing system can process the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes.
  • the two or more token streams can include a first token stream for a first audio content attribute of the two or more audio content attributes.
  • the first token stream can be tokenized at a first time scale.
  • the two or more token streams can additionally include a second token stream for a second audio content attribute of the two or more audio content attributes.
  • the second token stream can be tokenized at a second time scale different than the first time scale.
  • the computing system can process the two or more token streams with a machine-learned decoder model to obtain a reconstructed intermediate representation of the audio data.
  • the reconstructed intermediate representation of the audio data can be perceptually equivalent to the intermediate representation of the audio data.
  • the intermediate representation of the audio data can include a first quantity of information, and the two or more token streams can include a second quantity of information less than the first quantity of information (e.g., 500 bits vs 400 bits, etc.).
  • the machine-learned disentanglement model can include a transformer stack that includes one or more transformer layers.
  • the machine-learned decoder model can include an inverter transformer stack that includes one or more inversion transformer layers corresponding to the one or more transformer layers.
  • the computing system can adjust one or more parameters of the machine-learned disentanglement model based on a reconstructive loss function that evaluates a difference between the reconstructed intermediate representation and the intermediate representation.
  • the reconstructive loss function includes a mutual information loss term that maximizes a quantity of the audio data represented by both the first token stream and the second token stream.
  • the reconstructive loss function respectively includes two or more content- specific contrastive loss terms for the two or more audio content attributes. Each of the two or more content-specific loss terms minimizes a difference between a particular token stream and a positive example and maximizes a difference between the particular token stream and a negative example.
  • the computing system can process the reconstructed intermediate representation of the audio data with a machine-learned generative audio model to obtain reconstructed audio data.
  • the computing system can modify the first token stream based on a request to modify a characteristic of the first audio content attribute.
  • the first content attribute can include a speech attribute for speech audio caused by spoken utterances.
  • the characteristic of the first audio content attribute can include a vocal tone characteristic, an inflection characteristic, an emotion characteristic, a cadence characteristic, a pitch characteristic, etc.
  • the first content attribute can be an instrumental attribute for instrumental audio caused by instruments.
  • the characteristic of the first audio content attribute can include one or more instrument-specific characteristics, a mixing characteristic, a volume characteristic, an instrument inclusion characteristic, an instrument exclusion characteristic, etc.
  • the computing system can adjust one or more parameters of one or more models based on a generative loss function that evaluates a difference between the audio data and the reconstructed audio data.
  • the one or more models can be, or otherwise include, at least one of the machine-learned generative audio model, the machine-learned tokenization model, the machine-learned decoder model, etc.
  • the computing system can adjust at least one of the first tokenization rate or the second tokenization rate based at least in part on the generative loss function.
  • the computing system can process textual content descriptive of the audio data with an encoder portion of a LLM to obtain an intermediate representation of the textual content.
  • the computing system can adjust one or more parameters of the LLM based on a language loss function that evaluates a difference between the intermediate representation of the textual content and the intermediate representation of the audio data.
  • the computing system can obtain second textual content descriptive of second audio of the particular type.
  • the computing system can process the second textual content descriptive of the second audio of the particular type with the LLM to obtain an intermediate representation of the second audio of the particular type.
  • the computing system can process the intermediate representation of the second audio of the particular type with the machine-learned disentanglement model to obtain two or more second token streams for the two or more respective audio content attributes of the plurality of audio content attributes.
  • the computing system can process the two or more second token streams with the machine-learned decoder model to obtain a reconstructed intermediate representation of the second audio of the particular type.
  • the computing system can process the reconstructed intermediate representation of the second audio of the particular type with the machine-learned generative audio model to obtain the second audio of the particular type described by the textual content.
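  • Pulling the preceding steps together, a schematic sketch of the text-conditioned generation path might look as follows; each argument is a placeholder callable standing in for the corresponding model described above, not an actual API.

      def text_to_audio(text, llm_encoder, disentangler, decoder, audio_generator):
          # 1. Encode the textual description into the shared intermediate space
          #    (the LLM encoder is assumed to have been trained with a language loss
          #    aligning its outputs with audio intermediate representations).
          intermediate = llm_encoder(text)
          # 2. Disentangle into per-attribute token streams at their respective time scales.
          token_streams = disentangler(intermediate)
          # 3. Invert the token streams into a reconstructed intermediate representation.
          reconstructed = decoder(token_streams)
          # 4. Generate audio of the described type from the reconstructed representation.
          return audio_generator(reconstructed)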
  • the first audio content attribute can include temporally-diffuse audio.
  • the second audio content attribute can include localized audio.
  • the computing system can store the token stream(s) to a data store operable to store compressed audio data.

Additional Disclosure
  • the technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems.
  • the inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An intermediate representation can be obtained of audio data comprising audio of a particular type, wherein audio of the particular type comprises a plurality of audio content attributes. The intermediate representation can be processed with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes. The two or more token streams include a first token stream for a first audio content attribute of the two or more audio content attributes, wherein the first token stream is tokenized at a first time scale, and a second token stream for a second audio content attribute of the two or more audio content attributes, wherein the second token stream is tokenized at a second time scale different than the first time scale.

Description

MULTI-SCALE ATTRIBUTE-DISENTANGLED AUDIO TOKENIZATION FOR CONTROLLABLE GENERATION FIELD [0001] The present disclosure relates generally to audio tokenization. More particularly, the present disclosure relates to multi-scale, attribute-based disentanglement for audio tokenization. BACKGROUND [0002] Foundational models, such as Large Language Models (LLMs), are models with large numbers of parameters that are trained using large datasets to perform multiple tasks. Foundational models are currently revolutionizing what can be achieved with assistive AI technology in many fields, ranging from regular conversation assistants to multi-modal editing of content such as audio, images or video. Once trained, a foundational model can perform a wide variety of tasks within the context of the training data used to train the model. For example, a LLM can perform a wide variety of language tasks ranging from answering queries to generating new content based on prompts. SUMMARY [0003] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments. [0004] One example aspect of the present disclosure is directed to a computer- implemented method. The method includes obtaining, by a computing system comprising one or more processor devices, an intermediate representation of audio data comprising audio of a particular type, wherein audio of the particular type comprises a plurality of audio content attributes. The method includes processing, by the computing system, the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes, wherein the two or more token streams comprise a first token stream for a first audio content attribute of the two or more audio content attributes, wherein the first token stream is tokenized at a first time scale, and a second token stream for a second audio content attribute of the two or more audio content attributes, wherein the second token stream is tokenized at a second time scale different than the first time scale. [0005] Another example aspect of the present disclosure is directed to a computing system. The computing system includes one or more processors and one or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by the one or more processors cause the computing system to perform operations. The operations include obtaining an intermediate representation of audio data comprising audio of a particular type, wherein audio of the particular type comprises a plurality of audio content attributes. The operations include processing the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes, wherein the two or more token streams comprise a first token stream for a first audio content attribute of the two or more audio content attributes, wherein the first token stream is tokenized at a first time scale, and a second token stream for a second audio content attribute of the two or more audio content attributes, wherein the second token stream is tokenized at a second time scale different than the first time scale. 
[0006] Another example aspect of the present disclosure is directed to one or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by one or more processors cause the one or more processors to perform operations. The operations include obtaining an intermediate representation of audio data comprising audio of a particular type, wherein audio of the particular type comprises a plurality of audio content attributes. The operations include processing the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes, wherein the two or more token streams comprise a first token stream for a first audio content attribute of the two or more audio content attributes, wherein the first token stream is tokenized at a first time scale, and a second token stream for a second audio content attribute of the two or more audio content attributes, wherein the second token stream is tokenized at a second time scale different than the first time scale. [0007] Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices. [0008] These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles. BRIEF DESCRIPTION OF THE DRAWINGS [0009] Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which: [0010] Figure 1A depicts a block diagram of an example computing system that performs per-attribute token disentanglement according to example embodiments of the present disclosure. [0011] Figure 1B depicts a block diagram of an example computing device that performs training of a machine-learned disentanglement model according to example embodiments of the present disclosure. [0012] Figure 1C depicts a block diagram of an example computing device that performs machine-learned disentanglement of an intermediate representation according to example embodiments of the present disclosure. [0013] Figure 2A depicts a data flow diagram for unsupervised training of a machine- learned tokenization model according to some implementations of the present disclosure. [0014] Figure 2B is a data flow diagram for performing quantization with the quantization portion of the machine-learned disentanglement model of Figure 2A according to some implementations of the present disclosure. [0015] Figure 3 is a block diagram for a computing system that disentangles and tokenizes representations for generative tasks and/or compression tasks according to some implementations of the present disclosure. [0016] Figure 4 is a data flow diagram for disentanglement of an intermediate representation of speech audio at prescribed time scales according to some implementations of the present disclosure. [0017] Figure 5 depicts a flow chart diagram of an example method to perform disentanglement of tokens for generative tasks according to some implementations of the present disclosure. 
[0018] Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations. DETAILED DESCRIPTION Overview [0019] Generally, the present disclosure is directed to multi-scale, attribute-based disentanglement for audio tokenization. More specifically, it can be advantageous to disentangle audio data (or from an intermediate representation of audio data) on a per- property basis to generate separate token streams for each audio content property of the audio data. For example, audio data of a song may be disentangled into multiple token streams (e.g., a speaker audio property, a music audio property, an emotion audio property, etc.). [0020] However, conventional tokenization processes are rendered inefficient due to the temporally variant nature of audio content attributes (e.g., speech, music, one-off background noise, consistent background noise, etc.). In other words, different audio content attributes are most optimally tokenized at different time scales. As an example, for audio data that includes audio from a popular song, a speech audio content attribute for speech audio caused by singing may be optimally tokenized on a per-utterance basis (e.g., generating tokens when an utterance occurs) while an instrumental audio content attribute for instrumental background audio may be most optimally tokenized on a per-frame basis or on a time basis (e.g., every 0.5 seconds, etc.). As such, many conventional audio tokenization processes tokenize audio data sub-optimally, leading to perceptual information loss and inefficient compression. [0021] Accordingly, implementations of the present disclosure propose multi-scale attribute-disentangled audio tokenization for controllable generation. More specifically, an intermediate representation can be obtained for audio data that includes audio of a particular type (e.g., a song, a speech, a recording of a room, etc.). The audio of the particular type can include multiple audio content attributes. For example, audio of a song may include a speech audio content attribute, a background vocals audio content attribute, a background instrumental audio content attribute, etc. [0022] The intermediate representation of the audio data can be processed with a machine-learned tokenization model to obtain token streams for particular audio content attributes. To follow the previous example, for a song, the machine-learned tokenization model may provide a token stream representing audio of the speech audio content attribute, a second token stream representing audio of the background instrumental audio content attribute, and a third “catch-all” token stream to represent any perceptible information not included in the other two token streams. Each of the token streams can be tokenized at a particular time scale. For example, the first token stream representing certain audio characteristics of the speech audio content attribute may be tokenized at a per-utterance scale, such as speaker identity, emotion, etc. (e.g., every time the main singer produces a spoken utterance), while the second token stream representing audio of the background instrumental audio content attribute may be tokenized at a time-basis rate (e.g., every 0.5 seconds). In this manner, an intermediate representation of audio data can be more optimally disentangled and tokenized on a per-attribute basis. [0023] Implementations of the present disclosure provide a number of technical effects and benefits. 
As one example technical effect and benefit, some conventional intermediate representations of audio data have recently been found to “over-represent” corresponding audio data. In other words, many conventional techniques for generating intermediate representations of audio data produce representations that require more computing resources than necessary (e.g., memory, storage, power, processing cycles, etc.). However, by disentangling and tokenizing an intermediate representation on a per-attribute basis, implementations of the present disclosure can substantially reduce the quantity of non- perceptible information included in an intermediate representation. In turn, implementations of the present disclosure can create intermediate representations that imperceptibly compress audio data to a degree that is substantially better than conventional compression processes. [0024] To follow the previous example, assume that a song only includes a few seconds of singing and mostly consists of instrumental audio. Although tokenization of the instrumental audio may require a particular time scale (e.g., every 0.5 seconds, etc.), tokenizing the speech audio of the speech audio attribute at the same time scale will result in the collection of a large quantity of tokens that represent non-perceptible information (e.g., periods during which the singer is not singing). However, by tokenizing the song on a per- attribute basis, implementations of the present disclosure can tokenize the speech audio at a per-utterance time scale that generates substantially fewer tokens than the time scale for the instrumental audio, therefore substantially reducing the size of the intermediate representation without loss of perceptible information. [0025] As another example technical effect and benefit, implementations of the present disclosure provide for fine-tuned control of audio content attributes. For example, assume that audio data including audio of vocal music with content attributes such as speech, music, background vocals, etc. is tokenized. The tokens can be disentangled on a per-attribute basis corresponding to audio content attributes associated with the audio. If a user wishes to modify properties of the prosody of speech audio associated with a speech content attribute of the vocal music, the tokens corresponding to the speech audio attribute can be modified to increase or decrease the characteristics of the prosody (e.g., loudness, pitch, speaking rate on a word by word basis, etc.). In such fashion, implementations of the present disclosure enable fine-tuned adjustment and optimization of generative tasks. [0026] With reference now to the Figures, example embodiments of the present disclosure will be discussed in further detail. Example Devices and Systems [0027] Figure 1A depicts a block diagram of an example computing system 100 that performs per-attribute token disentanglement according to example embodiments of the present disclosure. The system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 that are communicatively coupled over a network 180. [0028] The user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device. 
[0029] The user computing device 102 includes one or more processors 112 and a memory 114. The one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 114 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations. [0030] In some implementations, the user computing device 102 can store or include one or more models 120. For example, the models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models. Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models (e.g., transformer models). Example models 120 are discussed with reference to Figures 2 and 3. [0031] In some implementations, the one or more models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112. In some implementations, the user computing device 102 can implement multiple parallel instances of a single model 120 (e.g., to perform parallel per-attribute token disentanglement across multiple instances of the models 120). [0032] More particularly, the machine-learned model(s) 120 can include a machine- learned disentanglement model. The machine-learned disentanglement model can be a model that processes an intermediate representation, or tokens generated from an intermediate representation, and produces a disentanglement output that disentangles the intermediate representation into a collection of subspaces that each hold tokens corresponding to a particular attribute (i.e., per-attribute disentangled tokens). For example, the machine-learned disentanglement model can be a transformer model, transformer layer(s), attention-based layer(s), etc. trained to disentangle intermediate representations of an input. For a more specific example, the machine-learned disentanglement model may be a multi-headed model that includes a supervision head for each attribute that is trained on a corresponding contrastive loss term. [0033] In some implementations, the machine-learned model(s) 120 can include a machine-learned encoder model. The machine-learned encoder model can be a model that processes input data to generate an intermediate representation of the input data. For example, the user computing device 102 can obtain audio data including audio of a spoken utterance from a user via an audio input device associated with the user computing device 102. The machine-learned encoder model can process the audio data to generate an intermediate representation of the audio data, such as an encoding (e.g., a wav2vec encoding, etc.). 
For another example, the user computing device 102 can obtain textual data including textual content from a user via a text input device associated with the user computing device 102. The machine-learned encoder model can process the textual data to generate an intermediate representation of the textual data, such as an encoding (e.g., a text encoding such as a Bidirectional Encoder Representation from Transformer (BERT) representation, etc.). In some implementations, the machine-learned encoder model can be a portion of a larger model, such as a foundational model (e.g., a large language model (LLM), a foundational audio model, foundational computer vision model, etc.). For example, the machine-learned encoder model can be the encoder portion of an LLM. [0034] In some implementations, the machine-learned model(s) 120 can include a machine-learned generative model, such as machine-learned decoder model. The machine- learned generative model can be a model that processes an intermediate representation to generate a model output. For example, the user computing device 102 can obtain an intermediate representation of textual content describing a particular type of audio and corresponding audio content attributes for the audio. If the machine-learned generative model is, for example, a foundational audio decoder, the machine-learned generative model can process the intermediate representation to generate audio data that includes the audio described by the textual content. Alternatively, if the machine-learned generative model is, for example, a foundational vision decoder, the machine-learned generative model can process the intermediate representation to generate image data that depicts some visual representation of the audio described by the textual content (e.g., a spectrogram, album art for audio, a visual depiction of events described by the audio, etc.). [0035] More generally, it should be noted that the machine-learned model(s) 120 can include any type or manner of machine-learned model or portion / layer of machine-learned model, including multi-modal models that are trained to process and/or generate multimodal information. For example, the machine-learned model(s) may include a model that processes an intermediate representation of two types of information (e.g., audio, image, text, etc.) to generate an output. For another example, the machine-learned model(s) may include a model that processes an intermediate representation of one type of information to generate a multimodal output. As such, the models, and/or portions of models, described herein may be multimodal or otherwise mode-agnostic. [0036] Additionally or alternatively, one or more models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship. For example, the models 140 can be implemented by the server computing system 130 as a portion of a web service (e.g., a generative service). Thus, one or more models 120 can be stored and implemented at the user computing device 102 and/or one or more models 140 can be stored and implemented at the server computing system 130. [0037] The user computing device 102 can also include one or more user input components 122 that receives user input. 
For example, the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input. [0038] The server computing system 130 includes one or more processors 132 and a memory 134. The one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 134 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations. [0039] In some implementations, the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof. [0040] As described above, the server computing system 130 can store or otherwise include one or more models 140. For example, the models 140 can be or can otherwise include various machine-learned models. Example machine-learned models include neural networks or other multi-layer non-linear models. Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models (e.g., transformer models). Example models 140 are discussed with reference to Figures 2-3. [0041] The user computing device 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180. The training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130. [0042] The training computing system 150 includes one or more processors 152 and a memory 154. The one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 154 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the training computing system 150 to perform operations. In some implementations, the training computing system 150 includes or is otherwise implemented by one or more server computing devices. 
[0043] The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function). Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations. [0044] In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. The model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained. [0045] In particular, the model trainer 160 can train the models 120 and/or 140 based on a set of training data 162. The training data 162 can include any type or manner of ground truth information, such as a ground truth intermediate representation, ground truth disentangled token streams, reconstructed intermediate representations, etc. [0046] In some implementations, the model trainer 160 can train the models 120 and/or 140 in a supervised manner. For example, the models 120 and/or 140 can include explicit supervision heads (e.g., heads, such as classification or contrastive heads, that are explicitly supervised). The explicit supervision heads can be combined with adversarial and/or mutual information losses. [0047] For example, pre-trained attribute encoder models can be utilized as “teacher” models. A pre-trained attribute encoder can refer to a machine-learned encoder model that has been trained to generate an attribute-specific embedding based on an input. For example, an encoder model may be specifically trained to generate a tokenized representation of a prosody audio content attribute based on audio data or an intermediate representation of audio data. A machine-learned disentanglement model, or a prosody audio content attribute head of a machine-learned disentanglement model, can generate a disentangled token stream for the same prosody audio content attribute. The machine-learned disentanglement model, or the prosody audio content attribute head of a machine-learned disentanglement model, can be trained based on a loss function that evaluates a difference between the disentangled token stream and the token stream generated by the pre-trained “teacher” model. This process can be repeated on a per-attribute basis. [0048] Additionally, or alternatively, in some implementations, the model trainer 160 can train the models 120 and/or 140 in an unsupervised manner. More specifically, the model trainer 160 can evaluate each attribute subspace determined through disentanglement with a proximity-based contrastive loss function that includes a loss term for each attribute. Unsupervised training will be discussed in greater detail with regard to Figure 2. [0049] Additionally, or alternatively, in some implementations, the model trainer 160 can train the models 120 and/or 140 in a self-supervised manner. 
For example, the model trainer 160 can fine-tune a generative model to produce audio conditioned on some side input label using a small labeled dataset, in a manner analogous to few-shot prompting with LLMs. Once fine-tuned, synthetic audio can be generated that is conditioned on all available labels, or combinations of labels if relevant. The generation model can be re-trained using both the initial training data utilized to fine-tune the model and the synthetic data generated using the fine-tuned model. This can be performed across a number of iterations. [0050] In some implementations, if the user has provided consent, the training examples can be provided by the user computing device 102. Thus, in such implementations, the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model. [0051] The model trainer 160 includes computer logic utilized to provide desired functionality. The model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor. For example, in some implementations, the model trainer 160 includes program files stored on a storage device, loaded into a memory and executed by one or more processors. In other implementations, the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media. [0052] The network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL). [0053] The machine-learned models described in this specification may be used in a variety of tasks, applications, and/or use cases. [0054] In some implementations, the input to the machine-learned model(s) of the present disclosure can be image data. The machine-learned model(s) can process the image data to generate an output. As an example, the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an image segmentation output. As another example, the machine- learned model(s) can process the image data to generate an image classification output. As another example, the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an upscaled image data output. As another example, the machine-learned model(s) can process the image data to generate a prediction output. 
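Referring back to the per-attribute "teacher" supervision described in paragraph [0047], a minimal sketch of such a distillation objective is provided below. All names (e.g., per_attribute_distillation_loss, attribute_heads, teacher_encoders) are hypothetical placeholders, and the use of continuous embeddings with a mean-squared-error loss is an illustrative assumption rather than a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def per_attribute_distillation_loss(intermediate_rep, attribute_heads, teacher_encoders):
    """Sum of per-attribute distillation losses in the spirit of paragraph [0047].

    intermediate_rep: (batch, frames, dim) intermediate representation of the audio.
    attribute_heads:  dict mapping attribute name -> trainable "student" head (nn.Module).
    teacher_encoders: dict mapping attribute name -> frozen pre-trained "teacher" encoder.
    """
    total = 0.0
    for name, head in attribute_heads.items():
        student_stream = head(intermediate_rep)            # student's attribute-specific stream
        with torch.no_grad():                              # teacher provides targets only
            teacher_stream = teacher_encoders[name](intermediate_rep)
        # Penalize the difference between the disentangled stream and the teacher's stream.
        total = total + F.mse_loss(student_stream, teacher_stream)
    return total
```

In practice, the per-attribute terms could be weighted and combined with the adversarial and/or mutual information losses mentioned in paragraph [0046].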
[0055] In some implementations, the input to the machine-learned model(s) of the present disclosure can be text or natural language data. The machine-learned model(s) can process the text or natural language data to generate an output. As an example, the machine- learned model(s) can process the natural language data to generate a language encoding output. As another example, the machine-learned model(s) can process the text or natural language data to generate a latent text embedding output. As another example, the machine- learned model(s) can process the text or natural language data to generate a translation output. As another example, the machine-learned model(s) can process the text or natural language data to generate a classification output. As another example, the machine-learned model(s) can process the text or natural language data to generate a textual segmentation output. As another example, the machine-learned model(s) can process the text or natural language data to generate a semantic intent output. As another example, the machine-learned model(s) can process the text or natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.). As another example, the machine-learned model(s) can process the text or natural language data to generate a prediction output. [0056] In some implementations, the input to the machine-learned model(s) of the present disclosure can be speech data. The machine-learned model(s) can process the speech data to generate an output. As an example, the machine-learned model(s) can process the speech data to generate a speech recognition output. As another example, the machine- learned model(s) can process the speech data to generate a speech translation output. As another example, the machine-learned model(s) can process the speech data to generate a latent embedding output. As another example, the machine-learned model(s) can process the speech data to generate an encoded speech output (e.g., an encoded and/or compressed representation of the speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate an upscaled speech output (e.g., speech data that is higher quality than the input speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.). As another example, the machine- learned model(s) can process the speech data to generate a prediction output. [0057] In some implementations, the input to the machine-learned model(s) of the present disclosure can be latent encoding data (e.g., a latent space representation of an input, etc.). The machine-learned model(s) can process the latent encoding data to generate an output. As an example, the machine-learned model(s) can process the latent encoding data to generate a recognition output. As another example, the machine-learned model(s) can process the latent encoding data to generate a reconstruction output. As another example, the machine-learned model(s) can process the latent encoding data to generate a search output. As another example, the machine-learned model(s) can process the latent encoding data to generate a reclustering output. As another example, the machine-learned model(s) can process the latent encoding data to generate a prediction output. 
[0058] In some implementations, the input to the machine-learned model(s) of the present disclosure can be statistical data. Statistical data can be, represent, or otherwise include data computed and/or calculated from some other data source. The machine-learned model(s) can process the statistical data to generate an output. As an example, the machine- learned model(s) can process the statistical data to generate a recognition output. As another example, the machine-learned model(s) can process the statistical data to generate a prediction output. As another example, the machine-learned model(s) can process the statistical data to generate a classification output. As another example, the machine-learned model(s) can process the statistical data to generate a segmentation output. As another example, the machine-learned model(s) can process the statistical data to generate a visualization output. As another example, the machine-learned model(s) can process the statistical data to generate a diagnostic output. [0059] In some implementations, the input to the machine-learned model(s) of the present disclosure can be sensor data. The machine-learned model(s) can process the sensor data to generate an output. As an example, the machine-learned model(s) can process the sensor data to generate a recognition output. As another example, the machine-learned model(s) can process the sensor data to generate a prediction output. As another example, the machine-learned model(s) can process the sensor data to generate a classification output. As another example, the machine-learned model(s) can process the sensor data to generate a segmentation output. As another example, the machine-learned model(s) can process the sensor data to generate a visualization output. As another example, the machine-learned model(s) can process the sensor data to generate a diagnostic output. As another example, the machine-learned model(s) can process the sensor data to generate a detection output. [0060] In some cases, the machine-learned model(s) can be configured to perform a task that includes encoding input data for reliable and/or efficient transmission or storage (and/or corresponding decoding). For example, the task may be an audio compression task. The input may include audio data and the output may comprise compressed audio data. In another example, the input includes visual data (e.g., one or more images or videos), the output comprises compressed visual data, and the task is a visual data compression task. In another example, the task may comprise generating an embedding for input data (e.g., input audio or visual data). [0061] In some cases, the input includes visual data and the task is a computer vision task. In some cases, the input includes pixel data for one or more images and the task is an image processing task. For example, the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class. The image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest. 
As another example, the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories. For example, the set of categories can be foreground and background. As another example, the set of categories can be object classes. As another example, the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value. As another example, the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input. [0062] In some cases, the input includes audio data representing a spoken utterance and the task is a speech recognition task. The output may comprise a text output which is mapped to the spoken utterance. In some cases, the task comprises encrypting or decrypting input data. In some cases, the task comprises a microprocessor performance task, such as branch prediction or memory address translation. [0063] Figure 1A illustrates one example computing system that can be used to implement the present disclosure. Other computing systems can be used as well. For example, in some implementations, the user computing device 102 can include the model trainer 160 and the training dataset 162. In such implementations, the models 120 can be both trained and used locally at the user computing device 102. In some of such implementations, the user computing device 102 can implement the model trainer 160 to personalize the models 120 based on user-specific data. [0064] Figure 1B depicts a block diagram of an example computing device 10 that performs training of a machine-learned disentanglement model according to example embodiments of the present disclosure. The computing device 10 can be a user computing device or a server computing device. [0065] The computing device 10 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. [0066] As illustrated in Figure 1B, each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application. [0067] Figure 1C depicts a block diagram of an example computing device 50 that performs machine-learned disentanglement of an intermediate representation according to example embodiments of the present disclosure. The computing device 50 can be a user computing device or a server computing device. [0068] The computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer. 
Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications). [0069] The central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 1C, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50. [0070] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing device 50. As illustrated in Figure 1C, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API). [0071] Figure 2 depicts a data flow diagram for unsupervised training of a machine- learned tokenization model according to some implementations of the present disclosure. More specifically, audio data 202 can be obtained that includes audio 204 of a particular type. An audio “type” can refer to a classification, or general descriptor, for the content of audio data. For example, a type of audio may refer to music, or may refer to a more granular classification of music (e.g., a specific genre of music, music from a particular artist, etc.). A type of audio may also refer to a podcast, speech, recording, spoken utterance, audio livestream, and/or any other manner of audio (at any level of granularity). [0072] Additionally, or alternatively, in some implementations, data 206 indicative of the audio 204 can be obtained. The data 206 can be any type or manner of data that indicates the audio 204, such as textual data, image data, video data, etc. In some implementations, the data 206 can be textual data that describes the audio of the particular type. For example, if the audio 204 is a song, the data 206 can be a review written of the song, the song title, a description of the song, a transcript for an interview in which the creator of the song discusses it, etc. For another example, if the audio 204 is an audio track from audiovisual data, such as a movie, the data 206 indicative of the audio 204 can be the corresponding video data. [0073] The audio data 202 that includes the audio 204, and/or the data 206 indicative of the audio 204, can be processed with the representation generator 208 to obtain an intermediate representation 210. In some implementations, the representation generator 208 can process the audio data 202 to generate an intermediate representation 210 of the audio data 202. Alternatively, in some implementations, the representation generator 208 can process the data 206 indicative of the audio 204 to generate an intermediate representation 210 of the data 206. 
In some implementations, the representation generator 208 can process both the audio data 202 and the data 206 to generate two intermediate representations that respectively represent the audio data 202 and the data 206 (not illustrated). In some implementations, the representation generator 208 can process both the audio data 202 and the data 206 indicative of the audio 204 to generate an intermediate representation 210 that represents both the data 206 and the audio data 202. [0074] In some implementations, the representation generator 208 can be one or more machine-learned model(s) (e.g., encoder model(s), transformer model(s), etc.). For example, the representation generator 208 may be an encoder model trained to generate intermediate representations of audio data. For another example, the representation generator 208 may be an encoder model trained to generate intermediate representations of the data 206. For yet another example, the representation generator 208 may be, or otherwise include, multiple encoder models trained to generate intermediate representations for multiple modalities of data (e.g., image data, audio data, textual data, etc.). [0075] In some implementations, the representation generator 208 can include one or more portion(s), or layer(s) of machine-learned model(s). To follow the previous example, the multiple encoder models can be encoder portions from pre-trained foundational models. More specifically, the representation generator 208 may include a pre-trained language encoder from an LLM, a pre-trained audio encoder from a foundational audio model, a pre-trained vision encoder from a foundational vision model, etc. [0076] As described previously, the intermediate representation 210 can be generated via machine learning techniques. In these instances, the intermediate representation 210 can be an embedding, vector representation, matrix, etc. Alternatively, in some implementations, the representation generator 208 can generate some other manner of intermediate representation 210 via techniques unrelated to machine learning. As one example, the representation generator 208 may be a spectrogram creator, and the intermediate representation 210 can be a spectrogram representation of the audio data 202. [0077] Additionally, or alternatively, in some implementations, the intermediate representation 210 can include entangled token representations of input(s) to the representation generator 208. For example, the intermediate representation 210 can include a token stream that includes tokens for multiple types of audio content attributes that are mixed together. As such, in some implementations, tokens can be generated by the representation generator 208 in a mixed fashion, and then disentangled with a machine-learned disentanglement model 212. Alternatively, in some implementations, the intermediate representation 210 can be some other manner of intermediate representation (e.g., a vector representation, etc.) and the machine-learned disentanglement model 212 can process the intermediate representation 210 to generate disentangled token streams. [0078] The intermediate representation 210 can be processed by the machine-learned disentanglement model 212 to generate disentangled token streams 214A, 214B, and 214C (generally, disentangled token streams 214). The disentangled token streams 214 can be disentangled streams of representational tokens that have been disentangled into subspaces that correspond to particular audio content attributes.
For example, if the audio 204 is audio from a film, token stream 214A may correspond to a dialogue audio content attribute, token stream 214B may correspond to a special effects audio content attribute, and token stream 214C may correspond to a soundtrack audio content attribute. For another example, if the audio 204 is audio produced by someone singing or speaking, the token stream 214A may correspond to an utterance-scale audio content attribute, the token stream 214B may correspond to a word-level audio content attribute, and the token stream 214C may correspond to a syllable-scale or phoneme-scale audio content attribute. [0079] In some implementations, the intermediate representation 210 can be tokenized by the machine-learned tokenization model 211. For example, if the intermediate representation 210 does not include tokens (e.g., a spectrogram representation, etc.), the machine-learned tokenization model 211 can process the intermediate representation 210 to generate an entangled token stream 213. The entangled token stream 213 can refer to a series of tokens that generally serve as a tokenized representation of the inputs to the representation generator 208. However, the tokens of the entangled token stream 213 can be entangled such that tokens associated with an audio content attribute may be mixed with tokens associated with some other audio content attribute. The entangled token stream 213 can be processed with the machine-learned disentanglement model 212 to generate the disentangled token streams 214. [0080] As described previously, in some implementations, the machine-learned tokenization model 211 can be a portion of the machine-learned disentanglement model 212. Alternatively, in some implementations, the machine-learned tokenization model 211, the machine-learned disentanglement model 212, and/or the representation generator 208 may be portions of the same machine-learned model. Alternatively, in some implementations, the machine-learned tokenization model 211 can be a separate model from either the machine-learned disentanglement model 212 or the representation generator 208 (e.g., a pre-trained tokenizer, etc.). [0081] As a particular example, assume that the audio 204 is speech audio. The disentangled token stream 214A can be a token stream for an utterance-level subspace. The contrastive loss term 216A can evaluate target frame pairs (e.g., pairs of positive and negative examples) at an utterance scale. The disentangled token stream 214B can be a token stream for a word-level subspace. The contrastive loss term 216B can evaluate target frame pairs at a word scale (e.g., target frames within about 1 second of each other). The disentangled token stream 214C can be a token stream for a syllable-level subspace. The contrastive loss term 216C can evaluate target frame pairs at a syllable scale (e.g., target frames within about 250 ms of each other). The disentangled token stream 214D can be a token stream for a phoneme-level subspace. The contrastive loss term 216D can evaluate target frame pairs at a phoneme scale (e.g., target frames within about 70 ms of each other).
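By way of illustration, the multi-scale, subspace-restricted contrastive objective of paragraph [0081] can be sketched as follows. The subspace slices, the lag values (stated in frames, assuming roughly 100 frames per second), and the InfoNCE-style formulation are illustrative assumptions rather than a prescribed implementation, and all names (e.g., SUBSPACES, multiscale_contrastive_loss) are hypothetical.

```python
import torch
import torch.nn.functional as F

# Illustrative subspace slices and maximum positive-pair lags, in frames
# (assuming roughly 100 frames per second of audio).
SUBSPACES = {
    "utterance": (slice(0, 64),   1000),  # whole-utterance scale
    "word":      (slice(64, 128),  100),  # ~1 second
    "syllable":  (slice(128, 192),  25),  # ~250 ms
    "phoneme":   (slice(192, 256),   7),  # ~70 ms
}

def multiscale_contrastive_loss(frames, temperature=0.1):
    """frames: (T, 256) per-frame encodings for one utterance (assumes T >= 2)."""
    T = frames.shape[0]
    total = 0.0
    for name, (dims, max_lag) in SUBSPACES.items():
        z = F.normalize(frames[:, dims], dim=-1)          # restrict to this attribute subspace
        anchor = torch.randint(0, T - 1, (1,)).item()     # sample an anchor frame
        lag = torch.randint(1, max_lag + 1, (1,)).item()  # scale-specific positive lag
        positive = min(anchor + lag, T - 1)               # positive: a nearby frame at this scale
        logits = z[anchor] @ z.t() / temperature          # similarity to every other frame
        logits = logits.masked_fill(torch.arange(T) == anchor, float("-inf"))
        # InfoNCE-style term: the positive frame should be the most similar in this subspace.
        total = total + F.cross_entropy(logits.unsqueeze(0), torch.tensor([positive]))
    return total
```

In practice, positive pairs would typically be sampled in both temporal directions and averaged over many anchors per utterance.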
[0082] In some implementations, the machine-learned disentanglement model 212 can include a quantization portion 215. The quantization portion 215 can include one or more model layer(s) that can discretize outputs of the machine-learned disentanglement model 212. In some implementations, the quantization portion 215 can be, or otherwise include, neural network layer(s) (or similar) that apply within-network clustering of tokens output by the machine-learned disentanglement model 212 across attribute subspaces corresponding to the disentangled token streams 214. For example, the quantization portion 215 may cluster two or more tokens from the machine-learned disentanglement model 212 to form the disentangled token stream 214A based on a similarity between the tokens. [0083] In some implementations, the quantization portion 215 can be trained based on entropy loss functions that evaluate different objectives. For example, the quantization portion 215 can be trained based on a first clustering loss function and a second clustering loss function. The first clustering loss function can evaluate satisfaction of a first objective to encourage confident clustering by evaluating a first average, across the plurality of tokens, of a respective first entropy of a probability distribution associated with each token (e.g., mean per-example entropy). The probability distribution can describe respective probabilities (e.g., confidences) of the token belonging to a particular subspace. The second clustering loss function can evaluate satisfaction of a second objective to encourage diversity of cluster assignments by evaluating a second entropy of a second average of the probability distributions for all tokens (e.g., entropy of a batch average distribution). [0084] Alternatively, in some implementations, the quantization portion 215 can perform scalar binary quantization. More specifically, the quantization portion 215 can quantize an input vector to generate a set of bits. The set of bits can be processed by the quantization portion 215 to generate a prediction output that predicts a probability of each bit being “on” or “off” given an input X. Tokens can be clustered, or assigned to a particular subspace / token stream, based on the prediction outputs. [0085] For example, turning to Figure 2B, Figure 2B is a data flow diagram for performing quantization with the quantization portion 215 of the machine-learned disentanglement model 212 of Figure 2A according to some implementations of the present disclosure. As described previously, in some implementations, the machine-learned disentanglement model 212 can include a quantization portion 215 that can directly output disentangled token streams without the need for additional processing, such as K-means clustering. To do so, the quantization portion 215 can receive an input vector 222. In some implementations, the input vector 222 can be, include, or otherwise be derived from the intermediate representation 210. For example, the input vector 222 can be a vector representation of the audio data 202. [0086] The input vector 222 can be processed with an encoder submodel 224 of the quantization portion 215 to obtain a quantized bit representation 226. The quantized bit representation 226 can predict a probability of each bit being “on” given the input vector 222. Specifically, each bit can be represented as $b_i \in \{0, 1\}$, $i = 1, \ldots, N$. The encoder submodel 224 can predict a quantized bit representation 226 of the input vector 222. The quantized bit representation 226 can be represented as $p_{i|x} = P(b_i = 1 \mid x)$ for each bit. It should be noted that, unlike conventional techniques, the quantization bottleneck size of the encoder submodel 224 can be equal to the number of bits N.
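A minimal sketch of the scalar binary quantization described in paragraphs [0084]-[0086], including the straight-through estimator discussed in paragraph [0087] below, is provided here. The module name, layer sizes, and the choice of a single linear projection per submodel are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class BinaryQuantizer(nn.Module):
    """Sketch of the quantization portion 215: encoder submodel -> N-bit code -> decoder submodel."""

    def __init__(self, dim=1024, num_bits=12):
        super().__init__()
        self.encoder = nn.Linear(dim, num_bits)   # bottleneck size equals the number of bits N
        self.decoder = nn.Linear(num_bits, dim)

    def forward(self, x):
        p = torch.sigmoid(self.encoder(x))        # p_{i|x} = P(b_i = 1 | x) for each bit
        hard = torch.round(p)                     # hard bits, as seen at inference time
        # Straight-through estimator: the forward pass uses the hard bits,
        # while gradients flow back to the probabilities p.
        bits = p + (hard - p).detach()
        return self.decoder(bits), p, bits
```

For example, `BinaryQuantizer(dim=1024, num_bits=12)(torch.randn(250, 1024))` would return a reconstruction, the per-bit probabilities, and the straight-through bit estimates for 250 frames.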
[0087] The decoder submodel 228 can process the quantized bit representation 226 to generate output 230. To backpropagate from the decoder submodel 228 to the encoder submodel 224, a straight-through estimator can be utilized to estimate bits from probabilities as $b_i = p_{i|x} + \mathrm{sg}(\mathrm{round}(p_{i|x}) - p_{i|x})$, $i = 1, \ldots, N$, where $\mathrm{sg}(\cdot)$ is the stop gradient operator. In this manner, the decoder submodel 228 can “see” hard outputs $b_i$ (e.g., the quantized bit representation 226) from the encoder submodel 224 as is performed via inference, while retaining the capability to pass gradients from the decoder to the probability outputs of the encoder. [0088] The quantization portion 215 can be trained using a quantization loss function 232 that evaluates the output 230 and the input vector 222 (or the input from which the input vector 222 is derived). For example, assume that the input vector 222 is represented as variable x. The input x is quantized by the encoder submodel 224 into the quantized bit representation 226, which can include N bits per frame, and each bit can be represented by a random variable (e.g., a Bernoulli random variable) $b_i$, $i = 1, \ldots, N$. The prior probability of $b_i$ being “on” (e.g., having a value of 1) can be represented as $p_i = P(b_i = 1)$, and $p_{i|x}$ can represent the conditional probability or posterior (in VAE terms) $P(b_i = 1 \mid x)$ (e.g., the prediction output from the encoder submodel 224). [0089] The quantization loss function 232 can include a certainty loss term 234. The certainty loss term 234 can be, evaluate, or otherwise represent, a mean-squared error between probability and its corresponding bit, averaged over examples and bits. The certainty loss term 234 can be represented as $E_{x,i}\big[(p_{i|x} - \mathrm{round}(p_{i|x}))^2\big]$, where $\mathrm{round}(p) = \lfloor p + 0.5 \rfloor$. The certainty loss term 234 can “push” the conditional probabilities of the quantized bit representation 226 to the edges and thus decrease / minimize the entropy of the conditional binary variable. [0090] The quantization loss function 232 can include a utilization loss term 236. The utilization loss term 236 can be, evaluate, or otherwise represent, a squared error between the prior probability for each bit and 0.5, where the prior probability is calculated with an Exponential Moving Average (EMA) over batches using a decay, with the losses being averaged over the bits. The utilization loss term 236 can be represented as $E_i\big[(p_i - 0.5)^2\big]$, where $p_i = P(b_i = 1) = E_x\big[P(b_i = 1 \mid x)\big] = \frac{1}{|X|}\sum_{x} p_{i|x}$, where $|X|$ is the number of examples in a batch and the averaging over examples is performed over batches with EMA. The utilization loss term 236 can “pull” the mean probability to 0.5 and make the entropy of the unconditional/prior distribution maximized. [0091] The quantization loss function 232 can include a pairwise independence loss term 238. The pairwise independence loss term 238 can be, evaluate, or otherwise represent, a squared error between conditional probabilities and 0.5 for bit i given bit j is zero or one, weighted by the probability of bit j being zero or one (priors). The pairwise independence loss term 238 can be represented as $\sum_{j} \sum_{b' \in \{0,1\}} P(b_j = b')\, E_{i \neq j}\big[(p_{i \mid b_j = b'} - p_i)^2\big]$. Here, the conditional probabilities and priors are calculated with EMA over batches using a decay for all pairs $(i, j)$ where $i \neq j$. To compute the conditional probabilities, a binary mask can be utilized over examples when bit $j$ is 0 and 1. Using this mask, the probabilities of the $i$th bit can be averaged in the masked batch items to ensure that the average probability of the masked examples is relatively close to $p_i \approx 0.5$ for both cases (e.g., for both $b_j = 0$ and $b_j = 1$). [0092] The quantization loss function 232 can include a pairwise uncorrelatedness loss term 240. The pairwise uncorrelatedness loss term 240 can be, evaluate, or otherwise represent, a squared error between joint probabilities and the multiplication of mean probabilities for pairs of bits $i$ and $j$ for $i \neq j$. The pairwise uncorrelatedness loss term 240 can calculate probabilities with EMA over batches. The joint probability evaluated by the pairwise uncorrelatedness loss term 240 can utilize a decay for EMA, and the mean probability evaluated by the pairwise uncorrelatedness loss term 240 can utilize another decay to update expectations (e.g., the $p_i$ terms). The pairwise uncorrelatedness loss term 240 can be represented as $E_{i \neq j}\big[(E[b_i b_j] - E[b_i]E[b_j])^2\big]$. Alternatively, the pairwise uncorrelatedness loss term 240 can be represented as $E_{i \neq j}\big[(E_x(p_{i|x} p_{j|x}) - p_i p_j)^2\big]$, where $p_i$ and $p_j$ represent the same values as described previously. Here, it can be assumed that $p_{i,j|x} = p_{i|x} p_{j|x}$, as each $p_{i|x}$ and $p_{j|x}$ can be considered an instantiation of the joint variable; even though the quantized bit representation 226 does not directly predict $p_{i,j|x}$, it can be approximated with the product of the estimated conditional marginals. [0093] In some implementations, the targets $p_i$ and $p_i p_j$ can be replaced for the pairwise independence loss term 238 and the pairwise uncorrelatedness loss term 240 with expected values of 0.5 and 0.25, respectively. This substitution can be made to provide more optimal targets for the pairwise independence loss term 238 and the pairwise uncorrelatedness loss term 240. In particular, the value of $p_i$ may already be “pushed” towards 0.5 by the utilization loss term 236. [0094] It should be noted that the quantization loss function 232 may only include some, or may include none, of the loss terms 234-240. For example, in some implementations, the quantization loss function 232 may include loss terms 234, 236, and 238. Alternatively, in some implementations, the quantization loss function 232 may include loss terms 234, 236, and 240. [0095] In some implementations, neighboring frames of the input vector 222, or the input from which the input vector 222 is derived, can be quantized together. For example, assume that the input vector 222 is an input tensor. The input tensor can be reshaped before processing with the encoder submodel 224 to combine multiple neighboring frames together. [0096] As a more specific example, assume the pre-output of the encoder submodel 224 has a shape of 250 x 1024, where 250 is the number of frames and 1024 is the output/encoding dimension. When quantizing frames independently, the encoding dimensions can be mapped to 12 bits using a projection layer of size 1024 x 12. However, this is a relatively “small”, and thus inefficient, matrix size. As such, the frames before the current projection can be combined and quantized together. For example, if 10 frames are combined, the input tensor can be reshaped to 25 x 10240 dimensions, which reduces the number of frames and increases the encoding dimension. Next, the 10240-dimensional data can be projected to 120 bits using a matrix with 10240 x 120 dimensions. Here, 120 can be the number of bits used to represent 10 frames, so the overall bits-per-second can be the same as using 12 bits per frame. Once quantized, the reduced number of frames (e.g., 25 frames) can be mapped back to 10240 dimensions to obtain a 25 x 10240 tensor representing the mapped space, which can be reshaped to 250 x 1024 dimensions, which is the tensor on which the decoder submodel 228 operates. In turn, this provides the benefits of quantizing frames together to achieve superior performance while utilizing smaller memory weights than conventional approaches.
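For concreteness, simplified per-batch versions of the certainty, utilization, and pairwise uncorrelatedness terms described in paragraphs [0089], [0090], and [0092] can be sketched as follows; the EMA bookkeeping and the pairwise independence term are omitted for brevity, and the function name and batch-average estimates are illustrative assumptions.

```python
import torch

def quantization_losses(p):
    """p: (batch, N) per-bit conditional probabilities p_{i|x} from the encoder submodel.

    Returns simplified certainty, utilization, and pairwise uncorrelatedness terms,
    with per-batch averages standing in for the EMA estimates described above.
    """
    # Certainty: push each p_{i|x} toward its rounded bit (toward 0 or 1).
    certainty = ((p - torch.round(p)) ** 2).mean()

    # Utilization: pull each bit's prior probability p_i toward 0.5.
    prior = p.mean(dim=0)                               # batch estimate of p_i = E_x[p_{i|x}]
    utilization = ((prior - 0.5) ** 2).mean()

    # Pairwise uncorrelatedness: E_x[p_{i|x} p_{j|x}] should match p_i * p_j for i != j.
    joint = (p.t() @ p) / p.shape[0]                    # (N, N) estimate of E_x[p_{i|x} p_{j|x}]
    target = prior.unsqueeze(1) * prior.unsqueeze(0)    # outer product p_i * p_j
    off_diag = ~torch.eye(p.shape[1], dtype=torch.bool)
    uncorrelatedness = ((joint - target)[off_diag] ** 2).mean()

    return certainty, utilization, uncorrelatedness
```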
[0097] Returning to Figure 2A, the disentangled token streams 214 can be evaluated using corresponding contrastive loss terms 216A, 216B, and 216C (generally, contrastive loss terms 216). It should be noted that, as described herein, the contrastive loss terms 216 may refer to constituent loss “terms” of a singular loss function, or may refer to multiple loss functions. More generally, the contrastive loss terms 216 refer to some function or group of functions that can evaluate pairs of examples (e.g., a pair including a positive example and a negative example) based on some semantically-correlated notion of similarity to learn a representation that groups positive pairs. [0098] The contrastive loss terms 216 can be collectively evaluated and utilized to train the machine-learned disentanglement model to disentangle the disentangled token streams 214 into discrete subspaces. In some implementations, mutual information loss terms (not illustrated) between the disentangled token streams 214 can also be evaluated to train the machine-learned disentanglement model 212 to more completely disentangle the disentangled token streams 214. A mutual information loss term can refer to a loss term that, when evaluated and utilized to train a model or portion of a model, trains the model by maximizing mutual information shared between disentangled token representations. By maximizing such mutual information, the machine-learned disentanglement model can be encouraged to more completely disentangle the disentangled token streams 214. [0099] The machine-learned decoder model 218 can process the disentangled token streams 214 to generate a reconstructed intermediate representation 220. More specifically, the machine-learned decoder model 218 can “invert” the disentangled token streams 214 to generate the reconstructed intermediate representation 220. For example, if the machine-learned disentanglement model 212 is a transformer stack, the machine-learned decoder model 218 may be an inverter transformer stack that inverts the disentangled token streams 214. In some implementations, the disentangled token streams 214 can be stacked and then inverted via the machine-learned decoder model 218. [00100] The reconstructed intermediate representation 220 can represent the same information represented by the intermediate representation 210. In particular, the reconstructed intermediate representation 220 can be perceptually equivalent to the intermediate representation 210, although the reconstructed intermediate representation 220 may not be identical to the intermediate representation 210. For example, assume that the intermediate representation 210 and the reconstructed intermediate representation 220 are both processed with a generative audio decoder to generate audio data. If the audio data from both representations is played with an audio playback device, the audio from both representations will be perceptibly the same.
In this manner, the disentangled token streams 214 serve as a compressed representation of the intermediate representation 210. In other words, as the disentangled token streams 214 require fewer storage resources than the intermediate representation 210, and can be processed to generate a reconstructed intermediate representation 220 that is perceptually equivalent to the intermediate representation 210, the machine-learned disentanglement model 212 can be utilized to efficiently compress intermediate representations. [00101] Once generated, the reconstructed intermediate representation can be processed by various generative models to produce a generative output. Various use-cases for the reconstructed intermediate representation 220 will be discussed in greater detail with regard to Figure 3. [00102] Figure 3 is a block diagram for a computing system 300 that disentangles and tokenizes representations for generative tasks and/or compression tasks according to some implementations of the present disclosure. More specifically, a computing system 300 can be a computing system that provides compression services and/or generative services (e.g., a cloud computing system, a network compute node, a server computing system, etc.). The computing system 300 can receive information from a requesting entity 302. Specifically, in some implementations, the requesting entity 302 can provide an input 304 to the computing system. The input 304 can be any type or manner of input data. For example, the input 304 can be audio data, or data indicative of audio data (e.g., image data, textual data, etc.). [00103] In some implementations, if the requesting entity 302 provides the input 304, the computing system 300 can process the input 304 with an encoder model 306 to generate an intermediate representation 308. For example, if the input 304 is audio data, the encoder model 306 can process the input 304 to generate an intermediate representation 308 of the audio data. Alternatively, in some implementations, the requesting entity 302 can directly provide the intermediate representation 308 of the input 304 to the computing system 300. [00104] In some implementations, the input 304 can include additional information that indicates particular sub-space(s) over which to disentangle the intermediate representation 308. For example, if the input 304 is audio data including audio from a person speaking, the input 304 may include additional information indicating a prosody subspace, an inflection subspace, a tone subspace, an utterance-scale subspace, a phoneme-scale subspace, etc. [00105] In some implementations, the requesting entity 302 can provide a compression request 310 to the computing system 300. The compression request 310 can request that the computing system generate a compressed representation of the input 304. The compression request 310 can also indicate to the computing system 300 to store the compressed representation, and/or return the compressed representation. For example, the compression request 310 can indicate to the computing system 300 to transmit the compressed representation to a compression data store 312 (e.g., a data store associated with the requesting entity 302, a data store associated with the computing system 300, etc.).
For another example, the compression request 310 can request that the computing system 300 return the compressed representation to the requesting entity 302. [00106] Additionally, or alternatively, in some implementations, the requesting entity can provide a generative request 314 to the computing system 300. The generative request 314 can specify particular generative task(s) to perform with the information provided by the requesting entity 302 to the computing system 300. For example, if the input 304 is textual content descriptive of audio, the generative request 314 can indicate a generative audio task to generate the audio based on the input 304. For another example, if the input 304 is audio data that includes audio, the generative request 314 may indicate a text generation task to generate text descriptive of the audio. [00107] In some implementations, the generative request 314 can include adjustment parameters 316. The adjustment parameters 316 can be task-specific parameters to modify certain characteristics of particular disentangled token stream(s). For example, if tokens are disentangled to a prosody subspace, the adjustment parameter(s) 316 can indicate that the degree of prosody is to be adjusted, or the words to which a prosody effect is applied. For another example, if the tokens are disentangled to a tone subspace, the adjustment parameter(s) 316 can indicate that the type or manner of tone is to be adjusted, and/or the degree of tone is to be adjusted. [00108] The computing system 300 can include a machine-learned disentanglement model 318. The machine-learned disentanglement model 318 can process the intermediate representation 308, and/or the input 304, to generate disentangled tokens 320 as described with regards to Figure 2. The disentangled tokens 320 can include disentangled token streams 320A, 320B, and 320C. As described previously with regards to Figure 2, the disentangled tokens 320 can serve as a compressed representation of the intermediate representation 308 by excluding (or more efficiently representing) some or all of the non-perceptible information included in the intermediate representation 308. As such, if the computing system 300 receives the compression request 310 from the requesting entity 302, the computing system can return the disentangled tokens 320 to the requesting entity 302 and/or can store the disentangled tokens 320 in the compression data store 312. [00109] If the computing system 300 receives the generative request 314 from the requesting entity 302, the computing system 300 can perform a generative task using the disentangled tokens 320. To do so, the computing system 300 can include a generative task controller 322. If the generative request 314 includes the adjustment parameters 316, the generative task controller 322 can process the disentangled tokens 320 and the adjustment parameters 316 to generate adjusted disentangled tokens 324. The adjusted disentangled tokens 324 can include adjustments to the token streams 320A-320C as indicated by the adjustment parameters 316. [00110] The computing system 300 can include machine-learned generative model(s) 326. The machine-learned generative model(s) 326 can be model(s) trained to perform various generative task(s). In some implementations, the machine-learned generative model(s) 326 can include a single foundational model trained to process token streams to generate multiple types of outputs (e.g., audio data, image data, textual data, etc.).
For example, the foundational model can be a model that can perform audio data tasks, image data tasks, and/or textual data tasks. Alternatively, in some implementations, the machine-learned generative model(s) 326 can include multiple foundational models each trained to process disentangled tokens to perform multiple tasks for a particular type of data. For example, the machine-learned generative model(s) 326 can include an LLM that can process token streams to perform multiple types of language tasks that each generate a language output. For another example, the machine-learned generative model(s) 326 can include a foundational vision model that can process token streams to perform multiple types of vision tasks that each generate a vision output. For yet another example, the machine-learned generative model(s) 326 can include a foundational audio model that can process token streams to perform multiple types of audio tasks that each generate an audio output. [00111] If the generative request 314 includes the adjustment parameters 316, the machine-learned generative model(s) 326 can process the adjusted disentangled tokens 324 to generate a generative output 328. Alternatively, if the generative request 314 does not include the adjustment parameters 316, the machine-learned generative model(s) 326 can process the disentangled tokens 320 to generate the generative output 328. The generative output 328 can correspond to the type of task performed by the machine-learned generative model(s) 326. [00112] In some implementations, the particular task performed by the machine-learned generative model(s) 326, and/or the specific model used to perform the particular task, can be specified by task instructions 330. More specifically, the task instructions 330 can specify a particular task to perform and one or more model(s) of the machine-learned generative model(s) 326 to be used to perform the task. In some implementations, the task instructions 330 can be generated by the generative task controller 322 based on a variety of factors (e.g., previous generative requests 314, the type of input 304, the adjustment parameters 316, etc.). Alternatively, in some implementations, the task instructions 330 can be included in the generative request 314 and can be provided to the machine-learned generative model(s) 326 for processing. [00113] In some implementations, the machine-learned generative model(s) 326 can process additional input tokens 332. The additional input tokens 332, also referred to as “side tokens,” can be tokens that are not derived from the input 304 or the intermediate representation 308. For example, assume that the input 304 is audio data that includes audio of a popular song. If the generative request 314 indicates a generative text task, the additional input tokens 332 may be tokens derived from the text of reviews for the popular song, lyrics from the song, social media discussions of the song or the artist associated with the song, etc. More generally, the additional input tokens 332 can be provided to the model(s) 326 to generate a higher quality generative output 328.
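The request flow described in paragraphs [00105]-[00113] can be made concrete with a hypothetical request structure; the class and field names below are purely illustrative and are not defined by this disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GenerativeRequest:
    """Hypothetical shape of a generative request 314; all field names are illustrative."""
    task: str                                      # e.g., "text_to_audio" or "audio_to_text"
    input_payload: bytes                           # audio data, or data indicative of audio
    adjustment_parameters: dict = field(default_factory=dict)   # adjustment parameters 316
    task_instructions: Optional[str] = None        # optional explicit task instructions 330
    side_tokens: List[int] = field(default_factory=list)        # additional input tokens 332

# Example: regenerate speech with reduced prosodic emphasis on two words.
request = GenerativeRequest(
    task="speech_regeneration",
    input_payload=b"",                             # audio bytes elided in this sketch
    adjustment_parameters={"prosody": {"degree": 0.3, "words": ["really", "never"]}},
)
```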
[00114] Figure 4 is a data flow diagram 400 for disentanglement of an intermediate representation of speech audio at prescribed time scales according to some implementations of the present disclosure. In particular, an intermediate representation 402 for speech audio can be obtained. The speech audio can be audio of spoken utterances produced by a human, or audio of simulated spoken utterances from a simulated human. Generally, speech audio can include a variety of audio content attributes, such as speakers, channels, emotions, phonetics, and prosodics. However, such attributes are most optimally encoded on different time scales. More specifically, if frame-level attributes such as phonetics and prosodics were captured on an utterance-level time scale (e.g., captured as utterances occur), substantial perceptual information loss may occur due to the lack of information captured between the occurrence of utterances. Conversely, if utterance-level attributes such as speaker, channel, and emotion were captured on a frame-level time scale (e.g., captured each frame), substantial non-perceptual information capture may occur due to the quantity of information captured each frame when an utterance is not occurring. As such, it can be most optimal to capture different audio content attributes at differing, prescribed time scales. [00115] Accordingly, a machine-learned disentanglement model 404 can generate a disentangled token stream 406. The disentangled token stream 406 can include multiple subspaces 408A-408E (generally, subspaces 408). Additionally, the subspaces can be structured in accordance with varying time scales. To follow the depicted example, tokens for the speaker subspace 408A, channel subspace 408B, and emotion subspace 408C can be disentangled on an utterance-level time scale. Tokens for the phonetic subspace 408D and the prosodic subspace 408E can be disentangled on a frame-level time scale. In this manner, sufficient information can be captured for phonetic and prosodic audio content attributes while substantially reducing the capture of non-perceptible, extraneous information for speaker, channel, and emotion attributes. [00116] Although speech audio facilitates explicit disentanglement into discrete subspaces, other types of audio may benefit from a less rigid subspace assignment function. In particular, consider the subspace-restricted contrastive loss described previously with regards to Figure 2. There, each contrastive loss can be viewed as applying to the entire encoding space, but with a binary-valued mask applied that only admits the restricted subspace to contribute to the distance calculation of each loss function. Formulated this way, arbitrary real-valued masking functions can be utilized to enable each dimension to contribute to one or more loss functions. This can alternatively be viewed as a structured dropout mechanism to encourage registering individual encoding dimensions with target properties of interest. [00117] For example, rather than associating each of the subspaces 408 with smooth feature properties of a given scale, a more continuous association of each dimension can be adopted that leaves hard assignments to inference time. Specifically, for each contrastive loss target pair sampled with time lag $t \in [0, T]$, a map $M$ can be predefined such that $M: [0, T] \rightarrow [0, 1]^D$ maps the lag to the collection of mask functions over the $D$ disentangled encoding dimensions. [00118] To follow the previous example, $M(t) = [m_1, \ldots, m_D]$ can be defined as a quantized fixed-width Gaussian kernel, where each mask weight is given by $m_d = \mathcal{N}(d \mid \mu(t), \sigma^2)$ and $\mu(t)$ is a monotonically increasing function of the time lag. In some implementations, the resulting encoding dimensions can encode properties at a given time scale, and contiguous subsets of dimensions will have similar behavior to the hard boundary case described previously.
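A minimal sketch of the lag-dependent soft masking of paragraphs [00117]-[00118] follows, assuming a fixed-width Gaussian kernel over encoding dimensions whose center μ(t) increases linearly with time lag; the dimensionality, σ, and the linear form of μ(t) are illustrative assumptions.

```python
import torch

def soft_subspace_mask(lag, num_dims=256, max_lag=10.0, sigma=16.0):
    """Soft mask M(t) = [m_1, ..., m_D] over encoding dimensions for a target pair with time lag `lag` (seconds).

    mu(t) increases monotonically with the lag, so short-lag (fast-varying) target pairs
    weight the low dimensions and long-lag pairs weight the high dimensions.
    """
    d = torch.arange(num_dims, dtype=torch.float32)
    mu = (lag / max_lag) * (num_dims - 1)               # monotonically increasing kernel center
    mask = torch.exp(-0.5 * ((d - mu) / sigma) ** 2)    # fixed-width Gaussian kernel
    return mask / mask.max()                            # mask weights in [0, 1]

# Each contrastive distance can then be computed on mask-weighted encodings, e.g.:
# weighted = encoding * soft_subspace_mask(lag=0.25)
```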
[00119] Although speech is a well-studied and highly constrained audio domain, application of previously described techniques to the general audio domain can present a more difficult task. In particular, determining the list of attributes over which to disentangle can be non-trivial. Accordingly, implementations of the present disclosure propose two broad categories with different modeling imperatives: (1) sound textures of indefinite length that present little opportunity for constrained duration modeling (e.g., air conditioning, engine idling, etc.), and (2) isolated events generated via a process with modellable duration constraints (e.g., a doorbell ring, a dog bark, etc.). To differentiate the two types of audio, a time period around 1 second can serve as an approximate duration threshold that discriminates between the two. Thus, disentangling by time scale can also be applied to differentiating broad classes of sound events. For example, modifications in the clip-level subspace may tend to affect background sound textures, while modifications in the local-time subspace may tend to affect isolated events. [00120] To more completely disentangle general audio, additional signal properties can be evaluated, such as modulation frequency, pitch, spectral entropy, etc. Additionally, in some implementations, rather than applying full-band time patches as a tokenization strategy (as can be performed for speech), a 2D spectrotemporal patch tokenization strategy can be applied in general audio contexts. This has the added benefit of replacing multi-scale temporal proximity-based contrastive losses with multi-scale spectrotemporal proximity-based contrastive losses. Example Methods [00121] Figure 5 depicts a flow chart diagram of an example method to perform disentanglement of tokens for generative tasks according to some implementations of the present disclosure. Although Figure 5 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 500 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure. [00122] At 502, a computing system can obtain an intermediate representation. The intermediate representation can represent audio data that includes audio of a particular type. The audio of the particular type can include a plurality of audio content attributes. Alternatively, in some implementations, the intermediate representation can represent data indicative of audio of the particular type. For example, the intermediate representation can represent a transcript of audio data that includes audio of a conversation. For another example, the intermediate representation can represent an image that depicts the transcript described previously. [00123] At 504, the computing system can process the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes. The two or more token streams can include a first token stream for a first audio content attribute of the two or more audio content attributes. The first token stream can be tokenized at a first time scale. The two or more token streams can additionally include a second token stream for a second audio content attribute of the two or more audio content attributes.
The second token stream can be tokenized at a second time scale different than the first time scale. [00124] In some implementations, the computing system can process the two or more token streams with a machine-learned decoder model to obtain a reconstructed intermediate representation of the audio data. The reconstructed intermediate representation of the audio data can be perceptually equivalent to the intermediate representation of the audio data. The intermediate representation of the audio data can include a first quantity of information, and the two or more token streams can include a second quantity of information less than the first quantity of information (e.g., 500 bits vs 400 bits, etc.). [00125] In some implementations, the machine-learned disentanglement model can include a transformer stack that includes one or more transformer layers. The machine- learned decoder model can include an inverter transformer stack that includes one or more inversion transformer layers corresponding to the one or more transformer layers. [00126] In some implementations, the computing system can adjust one or more parameters of the machine-learned disentanglement model based on a reconstructive loss function that evaluates a difference between the reconstructed intermediate representation and the intermediate representation. In some implementations, the reconstructive loss function includes a mutual information loss term that maximizes a quantity of the audio data represented by both the first token stream and the second token stream. In some implementations, the reconstructive loss function respectively includes two or more content- specific contrastive loss terms for the two or more audio content attributes. Each of the two or more content-specific loss terms minimizes a difference between a particular token stream and a positive example and maximizes a difference between the particular token stream and a negative example. [00127] In some implementations, the computing system can process the reconstructed intermediate representation of the audio data with a machine-learned generative audio model to obtain reconstructed audio data. [00128] In some implementations, prior to processing the reconstructed intermediate representation of the audio data, the computing system can modify the first token stream based on a request to modify a characteristic of the first audio content attribute. In some implementations, the first content attribute can include a speech attribute for speech audio caused by spoken utterances. The characteristic of the first audio content attribute can include a vocal tone characteristic, an inflection characteristic, an emotion characteristic, a cadence characteristic, a pitch characteristic, etc. In some implementations, the first content attribute can be an instrumental attribute for instrumental audio caused by instruments. The characteristic of the first audio content attribute can include one or more instrument-specific characteristics, a mixing characteristic, a volume characteristic, an instrument inclusion characteristic, an instrument exclusion characteristic, etc. [00129] In some implementations, the computing system can adjust one or more parameters of one or more models based on a generative loss function that evaluates a difference between the audio data and the reconstructed audio data. 
[00127] In some implementations, the computing system can process the reconstructed intermediate representation of the audio data with a machine-learned generative audio model to obtain reconstructed audio data.

[00128] In some implementations, prior to processing the reconstructed intermediate representation of the audio data, the computing system can modify the first token stream based on a request to modify a characteristic of the first audio content attribute. In some implementations, the first audio content attribute can include a speech attribute for speech audio caused by spoken utterances. The characteristic of the first audio content attribute can include a vocal tone characteristic, an inflection characteristic, an emotion characteristic, a cadence characteristic, a pitch characteristic, etc. In some implementations, the first audio content attribute can be an instrumental attribute for instrumental audio caused by instruments. The characteristic of the first audio content attribute can include one or more instrument-specific characteristics, a mixing characteristic, a volume characteristic, an instrument inclusion characteristic, an instrument exclusion characteristic, etc.

[00129] In some implementations, the computing system can adjust one or more parameters of one or more models based on a generative loss function that evaluates a difference between the audio data and the reconstructed audio data. The one or more models can be, or otherwise include, at least one of the machine-learned generative audio model, the machine-learned disentanglement model, the machine-learned decoder model, etc. In some implementations, the computing system can adjust at least one of the first time scale or the second time scale based at least in part on the generative loss function.

[00130] In some implementations, the computing system can process textual content descriptive of the audio data with an encoder portion of a large language model (LLM) to obtain an intermediate representation of the textual content. The computing system can adjust one or more parameters of the LLM based on a language loss function that evaluates a difference between the intermediate representation of the textual content and the intermediate representation of the audio data.

[00131] In some implementations, the computing system can obtain second textual content descriptive of second audio of the particular type. The computing system can process the second textual content descriptive of the second audio of the particular type with the LLM to obtain an intermediate representation of the second audio of the particular type. The computing system can process the intermediate representation of the second audio of the particular type with the machine-learned disentanglement model to obtain two or more second token streams for the two or more respective audio content attributes of the plurality of audio content attributes. The computing system can process the two or more second token streams with the machine-learned decoder model to obtain a reconstructed intermediate representation of the second audio of the particular type. The computing system can process the reconstructed intermediate representation of the second audio of the particular type with the machine-learned generative audio model to obtain the second audio of the particular type described by the second textual content.
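By way of non-limiting illustration, the text-conditioned generation path described in the two preceding paragraphs can be summarized by the following sketch. The four model objects and their call signatures are hypothetical placeholders for the machine-learned models described herein; only the data flow is intended to track the description.

# Hypothetical interfaces; only the data flow mirrors the description above.
def generate_audio_from_text(text, llm_encoder, disentangler, decoder, audio_model):
    intermediate = llm_encoder(text)                         # text -> intermediate representation
    clip_stream, local_stream = disentangler(intermediate)   # attribute-specific token streams
    # Optional controllability hook: a stream could be edited here, e.g., to
    # modify a characteristic of one audio content attribute before decoding.
    reconstructed = decoder(clip_stream, local_stream)       # token streams -> representation
    return audio_model(reconstructed)                        # representation -> audio waveform

Editing a single token stream between the disentangler and the decoder is what allows one attribute (for example, a background texture or a vocal characteristic) to be changed without regenerating the remaining attributes.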
[00132] In some implementations, the first audio content attribute includes temporally-diffuse audio, and the second audio content attribute includes localized audio. In some implementations, the computing system can store the token stream(s) to a data store operable to store compressed audio data.

Additional Disclosure

[00133] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.

[00134] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.

Claims

WHAT IS CLAIMED IS: 1. A computer-implemented method, comprising: obtaining, by a computing system comprising one or more processor devices, an intermediate representation of audio data comprising audio of a particular type, wherein audio of the particular type comprises a plurality of audio content attributes; processing, by the computing system, the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes, wherein the two or more token streams comprise: a first token stream for a first audio content attribute of the two or more audio content attributes, wherein the first token stream is tokenized at a first time scale; and a second token stream for a second audio content attribute of the two or more audio content attributes, wherein the second token stream is tokenized at a second time scale different than the first time scale.
2. The computer-implemented method of claim 1, wherein the method further comprises: processing, by the computing system, the two or more token streams with a machine-learned decoder model to obtain a reconstructed intermediate representation of the audio data, wherein the reconstructed intermediate representation of the audio data is perceptually equivalent to the intermediate representation of the audio data, wherein the intermediate representation of the audio data comprises a first quantity of information, and wherein the two or more token streams comprise a second quantity of information less than the first quantity of information.
3. The computer-implemented method of claim 2, wherein the two or more token streams further comprise a third token stream for perceptible audio data other than that captured by the first token stream and the second token stream.
4. The computer-implemented method of any of claims 2-3, wherein the machine-learned disentanglement model comprises a transformer stack comprising one or more transformer layers, and wherein the machine-learned decoder model comprises an inverter transformer stack comprising one or more inversion transformer layers corresponding to the one or more transformer layers.
5. The computer-implemented method of any of claims 2-4, wherein the method further comprises: adjusting, by the computing system, one or more parameters of the machine-learned disentanglement model based on a reconstructive loss function that evaluates a difference between the reconstructed intermediate representation and the intermediate representation.
6. The computer-implemented method of claim 5, wherein the reconstructive loss function comprises a mutual information loss term that maximizes a quantity of the audio data represented by both the first token stream and the second token stream.
7. The computer-implemented method of any of claims 5-6, wherein the reconstructive loss function respectively comprises two or more content-specific contrastive loss terms for the two or more audio content attributes, wherein each of the two or more content-specific contrastive loss terms minimizes a difference between a particular token stream and a positive example and maximizes a difference between the particular token stream and a negative example.
8. The computer-implemented method of any of claims 2-7, wherein the method further comprises: processing, by the computing system, the reconstructed intermediate representation of the audio data with a machine-learned generative audio model to obtain reconstructed audio data.
9. The computer-implemented method of claim 8, wherein, prior to processing the reconstructed intermediate representation of the audio data, the method comprises: modifying, by the computing system, the first token stream based on a request to modify a characteristic of the first audio content attribute.
10. The computer-implemented method of claim 9, wherein the first audio content attribute comprises a speech attribute for speech audio caused by spoken utterances, and wherein the characteristic of the first audio content attribute comprises: a vocal tone characteristic; an inflection characteristic; an emotion characteristic; a cadence characteristic; or a pitch characteristic.
11. The computer-implemented method of claim 9, wherein the first audio content attribute comprises an instrumental attribute for instrumental audio caused by instruments, and wherein the characteristic of the first audio content attribute comprises: one or more instrument-specific characteristics; a mixing characteristic; a volume characteristic; an instrument inclusion characteristic; or an instrument exclusion characteristic.
12. The computer-implemented method of claim 8, wherein the method further comprises: adjusting, by the computing system, one or more parameters of one or more models based on a generative loss function that evaluates a difference between the audio data and the reconstructed audio data, wherein the one or more models comprises at least one of: the machine-learned generative audio model; the machine-learned disentanglement model; or the machine-learned decoder model.
13. The computer-implemented method of claim 12, wherein the method further comprises: adjusting, by the computing system, at least one of the first time scale or the second time scale based at least in part on the generative loss function.
14. The computer-implemented method of claim 8, wherein the method further comprises: obtaining, by the computing system, textual content descriptive of second audio of the particular type; processing, by the computing system, the textual content descriptive of the second audio of the particular type with a Large Language Model (LLM) to obtain an intermediate representation of the second audio of the particular type; processing, by the computing system, the intermediate representation of the second audio of the particular type with the machine-learned disentanglement model to obtain two or more second token streams for two or more respective audio content attributes of the plurality of audio content attributes of the second audio of the particular type; processing, by the computing system, the two or more second token streams with the machine-learned decoder model to obtain a reconstructed intermediate representation of the second audio of the particular type; and processing, by the computing system, the reconstructed intermediate representation of the second audio of the particular type with the machine-learned generative audio model to obtain the second audio of the particular type described by the textual content.
15. The computer-implemented method of claim 14, wherein the method further comprises storing, by the computing system, the reconstructed intermediate representation of the audio data to a data store operable to store compressed audio data.
16. The computer-implemented method of claim 1, wherein the first audio content attribute comprises temporally-diffuse audio, and wherein the second audio content attribute comprises localized audio.
17. The computer-implemented method of claim 1, wherein processing the intermediate representation with the machine-learned disentanglement model to obtain the two or more token streams comprises: processing, by the computing system, the intermediate representation with a quantization portion of the machine-learned disentanglement model to obtain the two or more token streams for the two or more respective audio content attributes of the plurality of audio content attributes, wherein the quantization portion is trained to perform Scalar Binary Quantization (SBQ).

18. The computer-implemented method of claim 1, wherein the audio data comprising audio of the particular type comprises data indicative of audio of the particular type.
19. A computing system, comprising: one or more processors; one or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by the one or more processors cause the computing system to perform operations, the operations comprising: obtaining an intermediate representation of audio data comprising audio of a particular type, wherein audio of the particular type comprises a plurality of audio content attributes; processing the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes, wherein the two or more token streams comprise: a first token stream for a first audio content attribute of the two or more audio content attributes, wherein the first token stream is tokenized at a first time scale; and a second token stream for a second audio content attribute of the two or more audio content attributes, wherein the second token stream is tokenized at a second time scale different than the first time scale.
20. The computing system of claim 19, wherein the operations further comprise: processing the two or more token streams with a machine-learned decoder model to obtain a reconstructed intermediate representation of the audio data, wherein the reconstructed intermediate representation of the audio data is perceptually equivalent to the intermediate representation of the audio data, wherein the intermediate representation of the audio data comprises a first quantity of information, and wherein the two or more token streams comprise a second quantity of information less than the first quantity of information.
21. One or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by one or more processors cause the one or more processors to perform operations, the operations comprising: obtaining an intermediate representation of audio data comprising audio of a particular type, wherein audio of the particular type comprises a plurality of audio content attributes; processing the intermediate representation with a machine-learned disentanglement model to obtain two or more token streams for two or more respective audio content attributes of the plurality of audio content attributes, wherein the two or more token streams comprise: a first token stream for a first audio content attribute of the two or more audio content attributes, wherein the first token stream is tokenized at a first time scale; and a second token stream for a second audio content attribute of the two or more audio content attributes, wherein the second token stream is tokenized at a second time scale different than the first time scale.
PCT/US2023/031886 2023-09-01 2023-09-01 Multi-scale attribute-disentangled audio tokenization for controllable generation Pending WO2025048837A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2023/031886 WO2025048837A1 (en) 2023-09-01 2023-09-01 Multi-scale attribute-disentangled audio tokenization for controllable generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2023/031886 WO2025048837A1 (en) 2023-09-01 2023-09-01 Multi-scale attribute-disentangled audio tokenization for controllable generation

Publications (1)

Publication Number Publication Date
WO2025048837A1

Family

ID=88207318

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/031886 Pending WO2025048837A1 (en) 2023-09-01 2023-09-01 Multi-scale attribute-disentangled audio tokenization for controllable generation

Country Status (1)

Country Link
WO (1) WO2025048837A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022167242A1 (en) * 2021-02-05 2022-08-11 Novoic Ltd. Method for obtaining de-identified data representations of speech for speech analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIN ZHANG ET AL: "SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 31 August 2023 (2023-08-31), XP091602269 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120409745A (en) * 2025-07-04 2025-08-01 北京泰尔英福科技有限公司 A model-heterogeneous federated learning method, system, device, medium, and product

Similar Documents

Publication Publication Date Title
CN117349675B (en) Multi-mode large model construction system for multiple information sources
EP4235485A1 (en) Method for converting text data into acoustic feature, electronic device, and storage medium
CN113454717B (en) Speech recognition device and method
Pokorny et al. Detection of negative emotions in speech signals using bags-of-audio-words
Luo et al. Emotional voice conversion using dual supervised adversarial networks with continuous wavelet transform f0 features
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN117121099B (en) Adaptive Visual Speech Recognition
CN114220438A (en) A lightweight speaker recognition method and system based on bottleneck and channel segmentation
Yang et al. Integrated visual transformer and flash attention for lip-to-speech generation GAN
CN119785777A (en) Voice interaction analysis method and system of intelligent voice robot
WO2025048837A1 (en) Multi-scale attribute-disentangled audio tokenization for controllable generation
WO2024220078A1 (en) Machine-learned selection of textual inputs for generative audio models
US12361964B2 (en) Conditioned separation of arbitrary sounds based on machine learning models
KR20240137625A (en) Self-supervised learning for audio processing
Narayanan et al. Hierarchical sequence to sequence voice conversion with limited data
Baas et al. Voice conversion for stuttered speech, instruments, unseen languages and textually described voices
US20240395233A1 (en) Machine-Learned Models for Generation of Musical Accompaniments Based on Input Vocals
US20240386280A1 (en) Knowledge Distillation Training via Encoded Information Exchange to Generate Models Structured for More Efficient Compute
Tits et al. The theory behind controllable expressive speech synthesis: A cross-disciplinary approach
Hong et al. When hearing the voice, who will come to your mind
US12347416B2 (en) Systems and methods to automate trust delivery
Wei et al. Mapping ultrasound-based articulatory images and vowel sounds with a deep neural network framework
CN117316141A (en) Training method of rhythm annotation model, audio generation method, device and equipment
Camastra et al. Machine Learning for Audio, Image and Video Analysis.
Leonov et al. Russian language speech generation from facial video recordings using variational autoencoder

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23782335

Country of ref document: EP

Kind code of ref document: A1