WO2024086012A1 - End-to-end general audio synthesis with generative networks - Google Patents
- Publication number
- WO2024086012A1 (PCT/US2023/034098)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- conditioning
- generator
- audio signal
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- a neural network-based system for audio synthesis (in a continuous feature domain, e.g. in a time domain such as waveform domain) including: a generator configured to generate synthesized audio, the generator comprising: an encoder configured to: transform an input audio signal (e.g. in a time domain such as waveform domain) with a first (sample) rate into a sequence of hidden features with a second (sample) rate, lower than the first rate, and process the hidden features to aggregate temporal information to attain a receptive field.
- the system further comprises a decoder configured to: convert the hidden features associated with the receptive field back to the first (sample) rate by upsampling to form a processed signal (e.g. in time domain, such as waveform domain) and output a synthesized audio signal based on the processed signal as the generated synthesized audio.
- the relationship between the first (sample) rate of the input audio signal and the second (sample) rate output by the encoder is tunable and may be set as desired as long as the second rate is lower than the first rate.
- the first sample rate is at least 2 times higher, at least 4 times higher, at least 8 times higher, at least 10 times higher, at least 100 times higher or at least 300 times higher than the second rate.
- the transformation from the first rate to the second may be performed in separate subsequent modules, with each module performing at least one of processing and downsampling.
- the neural network based system has been conditionally trained for two or more types (classes) of audio content and receives conditioning information indicating the type of audio content to be generated.
- the two or more types (classes) of audio content may comprise two or more of: speech, environmental sounds, animal sounds, instrument sounds.
- the neural network system can then be said to be trained for general audio synthesis since it leverages the common structures in audio content across multiple types (classes) of audio content to synthesize the type (class) of audio content indicated by the conditioning information.
- the system may be configured for general audio synthesis wherein the generator has been trained with at least two types of audio content.
- General audio synthesis differs from source-specific audio synthesis, in which the generator is trained on only a single type (class) of audio content.
- a method for performing general audio synthesis (e.g. in a time domain such as waveform domain).
- the method comprising: transforming, with an encoder, an input audio signal with a first rate into a sequence of hidden features with a second rate, lower than the first rate; processing the hidden features to aggregate temporal information to attain a receptive field; converting, with a decoder, the hidden features back to the first rate by upsampling to form a processed signal; and outputting a synthesized audio signal based on the processed signal as the generated synthesized audio.
- the systems and methods of aspects of the present invention may be used to build audio synthesizers by means of neural networks that improve upon existing related synthesizers by using recent deep generative techniques to model the full bandwidth of the signal in a continuous feature domain (e.g., in time domain, such as in the waveform domain).
- the systems and methods of aspects of the invention enable general generation (i.e., generation of sounds of any type, as long as they are fed into the model during training), which can be used for conditional or unconditional audio synthesis, and also for audio style transfer (e.g., feed an acapella voice and generate unconditional piano imitating it, or a conditional dog-barking imitating it).
- Unconditional audio synthesis involves synthesizing audio content without any conditioning information indicating or guiding the generator towards the type of audio content to be synthesized.
- the resulting synthesized audio content from unconditional synthesis can therefore vary greatly and, depending on the properties of the generator, be perceived as music, speech, mechanical sounds, noise or mixtures thereof.
- a neural network generator which has been trained to synthesize audio content of a single type or class (e.g. guitar sounds, acapella sounds or dog barking sounds) may unconditionally synthesize audio content only of this specific type. That is, when a source-specific neural network generator trained specifically for a single type of audio content performs unconditional synthesis, the resulting synthesized audio will be reminiscent of the audio type used for training (e.g. guitar sounds for a generator trained only on guitar sounds).
- Conditional audio synthesis involves synthesizing audio content at least partially based on conditioning information c indicating the type of audio content to be synthesized.
- the very same neural network generator may also synthesize audio content with dog barking sounds if it is instead provided with (i.e. conditioned with) conditioning information c indicating that dog barking sounds are to be synthesized. Accordingly, by providing a general audio generator that can be conditioned, a variety of different audio content types can be synthesized with the same generator.
- the generator is used in a diffusion process as a diffusion audio generator (DAG) acting as a full-bandwidth end-to-end source-agnostic waveform synthesizer.
- the diffusion process involves iteratively extracting samples from a denoising generator, for example the samples are extracted using Langevin dynamics (see equation 1 below).
- the initial input to the generator may be a random noise signal for synthesis or any audio signal (with noise added) for style transfer and at each subsequent iteration in the diffusion process the input data is based on the output data from the previous iteration. Additionally, some noise may be added to the input of each iteration, with less and less noise being added for subsequent iterations to allow the output to converge to a synthesized (or style transfer) audio signal with reduced, or no, noise.
- the sampling process may also be performed with or without conditioning depending on the desired result. In some implementations, less and less noise is added.
- the generator has preferably been trained to remove noise at multiple noise scales.
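The "equation 1" referenced above for the Langevin sampling is not reproduced in this excerpt. As an illustrative stand-in only, a standard annealed Langevin dynamics update of the kind typically used in score-based diffusion reads

$$
x_{i+1} = x_i + \epsilon_i\, S(x_i, c, \sigma_i) + \sqrt{2\,\epsilon_i}\; z_i, \qquad z_i \sim \mathcal{N}(0, I),
$$

where $S$ is the score predicted by the generator, $\epsilon_i$ is a step size tied to the noise scale $\sigma_i$, and both decrease over iterations; the exact form and step-size rule used in the patent may differ.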
- the DAG is built upon a lossless auto-encoder that can directly generate waveforms at 48 kHz. While 48 kHz is a commonly used sample rate for many audio applications, it is understood that the proposed DAG is capable of directly generating waveforms at an arbitrarily chosen sample rate. For example, the DAG is capable of directly generating waveforms at other commonly used sample rates such as 44.1 kHz, 96 kHz or 192 kHz. Its end-to-end design makes it simpler to train and use, avoiding intermediate information bottlenecks and possible cumulative errors from one module to the next.
- DAG is built upon a score-based diffusion generative paradigm, which has shown great performance in related fields like speech synthesis, universal speech enhancement, or source-specific audio synthesis.
- a method for generating synthesized audio comprising: receiving, at a generator, conditioning information, c, and a random noise sample, z_t.
- the conditioning information, c, comprises a standard deviation, σ_t, of the random noise sample z_t.
- the generator is trained to minimize an error function between a training random noise sample, z_t, and the predicted score, S, for a training audio signal comprising the training random noise sample z_t and at least one type of audio content.
- the generator may be built upon generative adversarial learning (e.g., such as generative adversarial networks, GANs).
- FIG. 1 is a block diagram illustrating the generator according to some implementations.
- Figure 2 is a block diagram illustrating the structure of an up or down GBlock according to some implementations.
- Figure 3 is a flowchart describing a method for generating synthesized audio content.
- Figure 4 is a flowchart describing a method for generating synthesized audio content with the generator used in a diffusion process.
- DETAILED DESCRIPTION OF CURRENTLY PREFERRED EMBODIMENTS
- Systems and methods disclosed in the present application may be implemented as software, firmware, hardware or a combination thereof.
- the division of tasks does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation.
- the computer hardware may for example be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, an AR/VR wearable, automotive infotainment system, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that computer hardware.
- processors that accept computer-readable (also called machine-readable) code containing a set of instructions that when executed by one or more of the processors carry out at least one of the methods described herein.
- Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included.
- Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit.
- the processing system further may include a memory subsystem including a hard drive, SSD, RAM and/or ROM.
- a bus subsystem may be included for communicating between the components.
- the software may reside in the memory subsystem and/or within the processor during execution thereof by the computer system.
- the one or more processors may operate as a standalone device or may be connected, e.g., networked to other processor(s). Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
- the software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media).
- Computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
- Computer storage media includes, but is not limited to, physical (non-transitory) storage media in various forms, such as EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
- Fig. 1 describes an embodiment of a neural audio generator 10 (sometimes referred to as the generator 10) that makes it possible to synthesize variations of learned sounds, these being sounds of any type: environmental sounds, harmonic sounds, etc.
- the generator 10 can learn multiple types of sound with a single neural network.
- the generator 10 can be conditioned with conditioning information, c, comprising e.g., auxiliary features that determine the source to synthesize when sharing the same structure for multiple styles (e.g., a discrete set of labels, l, can control the synthesis output of the generator 10 on what to generate).
- the generator 10 can also be used for audio style transfer by combining the appropriate conditioning information c with the deep generative framework latent domain arithmetic.
- the construction of the generator 10 has been accomplished by configuring neural encoder 1 and decoder 2 modules based on convolutional and recurrent neural blocks or self-attention blocks and configuring a deep generative task upon them to approximate a real signal distribution.
- the generator 10 as a whole may be capable of modelling the distribution of properties of real audio signals.
- the generator learns to model these distributions when it is trained to accomplish the generative task.
- the generative task is performing noise cancellation for varying noise levels.
- the proposed generator 10 structure and deep generative task lead to a powerful audio synthesizer that can work directly with sampled audio signals at any sample rate (e.g. 48 kHz), optionally expressed in the waveform domain.
- the encoder 1 is configured to receive conditioning information c and the decoder 2 is configured to receive the conditioning information c.
- the conditioning information c indicates a type (or class) of audio to be generated.
- the conditioning information c indicates a specific musical instrument (e.g. piccolo, guitar, or drums), a musical genre (e.g. rock, pop, or rap), an animal sound (e.g. cat, dog, or elephant) or types of human oral utterances (e.g. song, speech, scream, male, female).
- the type (or class) of audio content may be indicated with a label, l, comprised in the conditioning information c.
- the conditioning information can take one of many formats.
- the conditioning information c is in the form of a class label, text conditioning, visual conditioning, audio conditioning, class-to-audio information, text-to-audio information, image-to-audio information, audio-to-audio information and/or combinations of previous inputs to audio.
- the embodiment of the generator 10 depicted in fig. 1 follows the structure of UNet auto-encoding frameworks: the generator 10 features an encoder 1 and a decoder 2.
- the encoder 1 is built with downsampling convolutional feature encoders referred to as Down GBlocks or DGBlocks (DGB1, DGB2, ..., DGBk) that transform the input signal z (which may consist of noise or be an original audio signal with added noise for style transfer) with a first sample rate (resolution) into a sequence of hidden features f1 (embedding vectors) with a second, lower, rate (resolution) compared to the first rate through subsequent strided convolutions with stride factors s_k, where the subscript k indicates the layer index.
- the second DGBlock DGB2 obtains the hidden features f1 as input and performs further downsampling and/or processing to output hidden features f2. The same process is then repeated for all DGBlocks DGB1:k in the encoder 1 until finally the hidden features fk are output by the final DGBlock DGBk.
- An exemplary embodiment with five DGBlocks DGB1 – DGB5 contains the following strides (downsampling factors), from lower to higher layers in depth: [2, 2, 4, 4, 5].
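For illustration only, a minimal PyTorch-style sketch of a downsampling path with the example strides [2, 2, 4, 4, 5] is given below; the `DownBlock` module, kernel sizes and channel widths are assumptions and stand in for the full Down GBlocks described later.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Simplified stand-in for a Down GBlock: a single strided 1-D convolution
    that both processes and downsamples the signal by `stride`."""
    def __init__(self, in_ch: int, out_ch: int, stride: int):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=2 * stride + 1,
                              stride=stride, padding=stride)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(x))

strides = [2, 2, 4, 4, 5]                 # example downsampling factors from the text
channels = [1, 32, 64, 128, 256, 512]     # illustrative channel widths (assumed)

encoder = nn.Sequential(
    *[DownBlock(channels[i], channels[i + 1], strides[i]) for i in range(len(strides))]
)

x = torch.randn(1, 1, 48000)              # one second of 48 kHz waveform-domain audio
f_k = encoder(x)
# Total downsampling factor: 2 * 2 * 4 * 4 * 5 = 320, so 48000 samples -> 150 frames.
print(f_k.shape)                          # torch.Size([1, 512, 150])
```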
- the hidden features fk output by the final DGBlock DGBk are provided to and processed by a neural network module 3 which is configured to utilize an increased temporal context window compared to the final DGBlock DGBk and which outputs hidden features fR.
- the neural network module 3 with an increased temporal context window may comprise one or more recurrent neural networks, RNNs, that aggregate temporal information to attain a broader temporal receptive field (i.e. to contextualize features about the whole input sequence regardless of the position in time of a certain embedding vector).
- the hidden features fR having been processed to incorporate information from a broader temporal receptive field, are optionally used as input to the decoder 2.
- the hidden features fR based on the broader temporal field are combined with hidden features fk from the last DGBlock DGBk to form combined features fSUM that are used as input to the decoder 2.
- the input to the decoder 2 is based on at least one of the output fk from the last DGBlock DGBk and the output fR of a neural network module 3 (e.g. RNN) with increased temporal context window compared to the last DGBlock DGBk.
- An RNN can be unidirectional or bidirectional, and can take the form of gated cells like gated recurrent units (GRUs) or long short-term memories (LSTMs). An embodiment of this is a bidirectional GRU of two layers. A residual connection surrounding the RNN alleviates potential gradient flow issues of RNN saturating activations.
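A minimal sketch of such a bottleneck, assuming a two-layer bidirectional GRU with a residual connection as described; the linear projection back to the feature width is an added assumption so that the residual sum is dimensionally consistent.

```python
import torch
import torch.nn as nn

class TemporalContext(nn.Module):
    """Bidirectional GRU bottleneck with a residual connection, as a sketch of
    neural network module 3: it aggregates temporal information over the whole
    low-rate feature sequence."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        # Project the 2*hidden bidirectional output back to feat_dim so the
        # residual addition (f_k + f_r) is well defined.
        self.proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, f_k: torch.Tensor) -> torch.Tensor:
        # f_k: (batch, channels, frames); the GRU expects (batch, frames, channels)
        h, _ = self.rnn(f_k.transpose(1, 2))
        f_r = self.proj(h).transpose(1, 2)
        return f_k + f_r   # residual connection alleviates RNN gradient-flow issues

ctx = TemporalContext(feat_dim=512)
f_sum = ctx(torch.randn(1, 512, 150))   # same shape as the encoder output
```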
- the decoder 2 converts the hidden features obtained at its input (e.g. the combined features fSUM or the hidden features fR) back to the first rate by upsampling with Up GBlocks (UGBlocks UGB1:k) to form the processed signal.
- the first UGBlock UGB1 outputs hidden features f’1 and the second UGBlock UGB2 outputs hidden features f’2 and so on until the last UGBlock UGBk outputs the score function S.
- not all DGBlocks DGB1:k and UGBlocks UGB1:k need to perform downsampling or upsampling.
- each down GBlock DGB1:k includes at least one down sampler stage.
- the at least one down GBlock DGB1:k is configured to transform the input audio signal with the first rate (resolution) into the second rate sequence of hidden features.
- the at least one down GBlock DGB1:k is further configured to transform the input audio signal with the first rate into the second rate sequence of hidden features through subsequent strided convolutions.
- the generator 10 follows an architecture that works by inferring all input time-steps in parallel (despite the recurrence imposed by the recurrent layer); hence all output time-steps after decoding are predicted at once, from the first input sample x_1 to the last one x_T of a signal segment with T samples.
- the generator 10 is configured to operate on the full bandwidth of the input audio signal.
- the generator 10 is configured to operate on a 48 kHz sampled audio signal in the waveform domain or an audio signal in the waveform domain with a different sample rate (e.g. 44.1 kHz or 192 kHz).
- any specific examples of the number of DGBlocks and their respective downsampling factors in the encoder 1, and the number of UGBlocks and their respective upsampling factors in the decoder 2, are merely exemplary. There may be one, two, three or more DGBlocks and UGBlocks in the encoder 1 and decoder 2, respectively.
- the details of the Down GBlocks DGB1:k and Up GBlocks UGB1:k are shown with further reference to fig. 2.
- a difference between the up GBlocks from the decoder 2 and the down GBlocks from the encoder 1 is the exchange of downsampling and upsampling stages (see linear resample block 35).
- the linear resampler 35 acts as a linear downsampler in the skip connection and is combined with a strided convolution 23a for downsampling blocks, and the linear resampler 35 acts as a linear upsampler in the skip connection and is combined with a transposed convolution 23b for upsampling blocks. That is, down GBlocks DGB1:k used in downsampling encoder blocks utilize the StridedConv branch, and up GBlocks UGB1:k used in upsampling decoder blocks utilize the TransposedConv branch.
- the input z is general to the network.
- the signals x and c are the input to each GBlock and the conditioning signal (e.g. label projection) respectively.
- each GBlock features four convolutional blocks 23a, 23b, 26, 29, 32, four non-linear activations 22, 25, 28, 31 (e.g. LeakyReLUs) and four FiLM (Feature-wise Linear Modulation) conditioning layers 21, 24, 27, 30, with each layer’s parameters (e.g. convolutions) being independent from each other throughout the whole network.
- the conditioning FiLM layers 21, 24, 27, 30 can be controlled with conditioning information c comprising global or local parameters, like a class label (e.g. ‘siren’, ‘train’, ‘fan’, ...) or a time-varying feature (e.g. loudness curve).
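As an illustration of feature-wise linear modulation in general (not the patent's exact layer), a FiLM layer predicts a per-channel scale and shift from the conditioning embedding c; the single linear projection below is an assumption.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: per-channel scale (gamma) and shift (beta)
    predicted from the conditioning embedding c."""
    def __init__(self, cond_dim: int, channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), c: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(c).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)
```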
- each GBlock may include a dilation pattern for the convolutions 23a, 23b, 26, 29, 32 to increase the receptive field.
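Combining the above ingredients, a hedged sketch of one possible GBlock-style unit follows: a strided (down) or transposed (up) convolution in the main branch, a linearly resampled skip connection, and FiLM-conditioned dilated convolutions with LeakyReLU activations. The ordering, channel widths and dilation pattern (1, 2, 4) are illustrative assumptions; `FiLM` refers to the sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GBlock(nn.Module):
    """Illustrative GBlock: main branch with a strided (down) or transposed (up)
    convolution, a linearly resampled skip connection, and FiLM-conditioned
    dilated convolutions with LeakyReLU activations."""
    def __init__(self, in_ch, out_ch, cond_dim, stride, mode="down"):
        super().__init__()
        if mode == "down":
            self.resample = nn.Conv1d(in_ch, out_ch, 2 * stride + 1, stride, padding=stride)
        else:
            self.resample = nn.ConvTranspose1d(in_ch, out_ch, 2 * stride, stride, padding=stride // 2)
        self.skip = nn.Conv1d(in_ch, out_ch, 1)   # 1x1 conv to match channels on the skip path
        self.films = nn.ModuleList(FiLM(cond_dim, out_ch) for _ in range(3))  # FiLM: see sketch above
        self.convs = nn.ModuleList(
            nn.Conv1d(out_ch, out_ch, 3, padding=d, dilation=d) for d in (1, 2, 4)
        )
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, c):
        h = self.resample(self.act(x))                        # strided / transposed conv branch
        skip = F.interpolate(self.skip(x), size=h.shape[-1],
                             mode="linear", align_corners=False)  # linear resampler (block 35)
        for film, conv in zip(self.films, self.convs):
            h = conv(self.act(film(h, c)))                    # FiLM -> activation -> dilated conv
        return h + skip
```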
- the encoder includes: at least one down GBlock DGB1:k, wherein each down GBlock DGB1:k includes: at least one activation layer 22, 25, 28, 31; at least one conditioning layer 21, 24, 27, 30; and at least one convolutional layer 23a, 23b, 26, 29, 32, 36.
- An input audio signal is obtained and at step S1 the input audio signal is transformed, with a (trained) encoder 1, from a first rate (resolution) representation in a continuous feature domain (e.g. a time domain such as the waveform domain) to a second (lower) rate sequence of hidden features.
- the hidden features are processed (e.g. with an RNN 3) to aggregate temporal information to attain a receptive field.
- a (trained) decoder 2 converts the hidden features back to the first rate by upsampling to form a processed signal and at step S4 a synthesized audio signal (in e.g. waveform domain) is obtained from the generator based on the processed signal.
- the generator 10 has been trained to generate the synthesized audio signal directly whereby the processed signal is the synthesized audio signal.
- step S3 and S4 may be combined.
- the processed signal is a score signal indicating for each sample a score S wherein the score is indicative of how to modify the input audio signal to obtain the synthesized audio signal.
- the score signal has a (sample) rate equal to that of the input audio signal meaning that the output of the generator in score based diffusion is one score for each individual sample in the input audio signal.
- the score indicates the direction in which the input audio signal should be altered to increase the likelihood of each sample whereby modifying the input audio signal in accordance with the score results in a synthesized audio signal.
- step S4 may comprise forming the synthesized audio signal based on the input audio signal and the score signal.
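Tying steps S1-S4 together, a schematic of the score-based forward pass could look as follows; the module interfaces are assumed, not taken from the patent.

```python
import torch
import torch.nn as nn

class DiffusionAudioGenerator(nn.Module):
    """Schematic forward pass: encode to a low-rate feature sequence (S1),
    aggregate temporal context (S2), decode back to the input rate (S3),
    and return one score value per input sample (S4)."""
    def __init__(self, encoder: nn.Module, context: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.context, self.decoder = encoder, context, decoder

    def forward(self, noisy_audio: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        f_k = self.encoder(noisy_audio, cond)   # S1: waveform at the first rate -> hidden features
        f_sum = self.context(f_k)               # S2: broaden the temporal receptive field
        score = self.decoder(f_sum, cond)       # S3: upsample back to the first rate
        return score                            # S4: same length as the input, one score per sample
```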
- the generator 10 is formulated as a score predictor that predicts a score S(x_0 + σ_t·z_t, l, σ_t), where x_0 is the original audio signal, σ_t is the standard deviation of the Gaussian noise sample z_t used in the diffusion process, and l is the label that drives what type of audio is to be generated (e.g., piano, music, dog barking, gunshots, fire crackling, environmental sounds in general, and mixtures of these types).
- the generator 10 is sometimes referred to as a diffusion audio generator, DAG.
- the generator 10 is provided with training audio signals comprising audio content with additive sampled random noise z_t, and the generator 10 learns by minimizing the mean squared error between the sampled noise z_t and the predicted score S. More specifically, the generator 10 learns by minimizing the mean squared error between the sampled noise z_t scaled with the standard deviation σ_t and the predicted score S.
- the generator 10 predicts the score S, i.e. the gradient of the logarithm of the density of the training data. The score S therefore represents, for each input data point, an indication of the direction in which the likelihood of the data increases most rapidly. That is, the generator 10 learns to predict, for each data point, the trajectory to follow in order to remove the noise and recover the training data.
- the score signal therefore comprises, for each sample of the input audio signal, an indication of the direction in which the likelihood of data increases most rapidly.
- the score signal has the same sample rate as the input audio signal, and the score signal may be referred to as a score audio signal or a processed audio signal since it is the result of processing the input audio signal with the generator 10.
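A hedged sketch of one training step under the stated objective, i.e. the mean squared error between the noise scaled by σ_t and the predicted score; the noise-level distribution, the generator call signature and the optimizer handling are illustrative assumptions.

```python
import math
import torch

def training_step(generator, optimizer, x0, labels, sigma_min=1e-4, sigma_max=1.0):
    """One denoising score-matching step: perturb clean audio x0 with Gaussian
    noise of a random scale sigma_t and regress the predicted score against the
    scaled noise (mean squared error)."""
    batch = x0.shape[0]
    # Random noise scale per example (the log-uniform schedule is an assumption).
    log_sigma = torch.empty(batch, device=x0.device).uniform_(
        math.log(sigma_min), math.log(sigma_max))
    sigma_t = log_sigma.exp().view(batch, 1, 1)
    z_t = torch.randn_like(x0)                    # sampled noise z_t
    noisy = x0 + sigma_t * z_t                    # training input x_0 + sigma_t * z_t
    score = generator(noisy, labels, sigma_t)     # predicted score S (assumed call signature)
    loss = ((sigma_t * z_t - score) ** 2).mean()  # MSE between scaled noise and predicted score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```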
- both conditioning signals in the diffusion process, l and σ_t, are projected through some hidden layers and concatenated to form the conditioning features c from fig. 1. More concretely, the logarithm of σ_t is processed with random Fourier feature embeddings and a multilayer perceptron (MLP) as in previous works.
- the label l is linearly projected through an embedding layer, and the resulting embedding is concatenated to the sigma's MLP features.
- the label l may include information such as a class label, text conditioning, visual conditioning, audio conditioning, class-to-audio information, text-to-audio information, image-to-audio information, audio-to-audio information and/or combinations of previous inputs to audio.
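A minimal sketch of this conditioning path, using a class-label embedding as the example form of l: log σ_t goes through fixed random Fourier features and an MLP, the label through an embedding layer, and the two are concatenated. All dimensions and the specific Fourier-feature construction are assumptions.

```python
import torch
import torch.nn as nn

class ConditioningEncoder(nn.Module):
    """Builds the conditioning vector c: log(sigma_t) is embedded with random
    Fourier features and an MLP, the class label l with an embedding layer,
    and the two embeddings are concatenated."""
    def __init__(self, num_classes: int, rff_dim: int = 32, emb_dim: int = 128):
        super().__init__()
        # Fixed random frequencies for the Fourier features (not trained).
        self.register_buffer("freqs", torch.randn(rff_dim))
        self.sigma_mlp = nn.Sequential(
            nn.Linear(2 * rff_dim, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
        self.label_emb = nn.Embedding(num_classes, emb_dim)

    def forward(self, sigma_t: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        # sigma_t: (batch,) noise scales, label: (batch,) integer class indices
        proj = torch.log(sigma_t)[:, None] * self.freqs[None, :] * 2 * torch.pi
        rff = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return torch.cat([self.sigma_mlp(rff), self.label_emb(label)], dim=-1)

cond = ConditioningEncoder(num_classes=10)
c = cond(torch.tensor([0.5, 0.1]), torch.tensor([3, 7]))   # shape (2, 256)
```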
- the label l will be a class label for a guitar class whereby the generator 10 will be conditioned to synthesize guitar audio content.
- the label l may also be in the form of a text string such as "GUITAR" or "FRETTED MUSICAL INSTRUMENT WITH SIX STRINGS" or an image showing a guitar, whereby the generator 10 has been trained with text strings and/or images to generate guitar sounds conditioned on the label l indicating a guitar. If the label l is changed to a class label for a saxophone class, the string "SAXOPHONE" or an image of a saxophone, the generator 10 will instead synthesize saxophone audio content. In general, the label l can be set to indicate any type of audio content which has been used during training. In some implementations, the generator 10 has only been trained with training data of a single sound type.
- the generator 10 has only been trained with piccolo audio content whereby the generator 10 may only be capable of synthesizing piccolo audio content.
- the label l is not needed as the generator 10 will unconditionally synthesize audio content of the single audio type (e.g. only piccolo sounds) for which it has been trained.
- the resulting conditioning signal ⁇ is what drives all the FiLM conditioning layers throughout the generator 10.
- the generator 10 allows for both conditional and unconditional audio generation/synthesis, since the model can be trained on multi-class audio datasets to learn the signal distribution, and new audio sequences can be sampled from it with or without labels. For example, the generator 10 will perform unconditional synthesis if it is not provided with any conditioning information c indicating a type of audio content to be generated, and performs conditional synthesis if it is provided with conditioning information c indicating a type of audio content to be generated.
- in some implementations, unconditional audio generation may be initiated by conditioning information c (e.g. a label l) explicitly indicating unconditional synthesis.
- a method for using the generator 10 for score-based diffusion synthesis will now be described.
- the input audio signal is a silent signal and random noise sample(s) z_t are added to it to form a noisy input audio signal.
- the generator 10 then generates a score S at step S6 based on the input audio signal, the standard deviation σ_t associated with the noisy input audio signal or added noise sample(s) z_t, and optionally based on any class label l of the conditioning information c.
- based on the score S, the input audio signal for the next iteration is determined at step S7 using equation 1 above. The synthesized audio signal resulting from one iteration will likely still be perceived as a noisy signal.
- the signal output at one iteration is then used as input to the generator 10, which generates an output audio signal in the next iteration, and the process is repeated until the final synthesized audio signal is obtained.
- the standard deviation σ_t is adjusted at step S8.
- further noise samples z_t are also added at step S8 for each iteration.
- σ_t decreases from one iteration to the next, meaning that the energy of the synthesized audio signal will gradually prevail over the noise energy, which decreases for each iteration.
- the label l indicates "dog barking" and the source audio signal is a melody played on a piano, and the resulting style-transferred audio signal is dog barking sounds resembling the piano melody of the source audio signal.
- the source audio signal is normalized with an amplitude factor prior to adding the noise at initialization so that the information of the source audio signal is not completely lost.
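A schematic sketch of the iterative sampling loop of steps S5-S8, covering both plain synthesis (start from silence plus noise) and style transfer (start from a normalized source signal plus noise). The noise schedule, step-size rule and normalization factor are illustrative assumptions and do not reproduce the patent's equation 1.

```python
import math
import torch

@torch.no_grad()
def sample(generator, label, num_steps=50, length=48000,
           sigma_max=1.0, sigma_min=1e-4, source_audio=None):
    """Iteratively refine a signal with the score predictor (steps S5-S8).
    Start from silence plus noise (synthesis) or from a normalized source
    signal plus noise (style transfer); sigma_t shrinks every iteration."""
    sigmas = torch.logspace(math.log10(sigma_max), math.log10(sigma_min), num_steps)
    # 0.5 is an illustrative amplitude-normalization factor for style transfer.
    x = torch.zeros(1, 1, length) if source_audio is None else 0.5 * source_audio
    x = x + sigmas[0] * torch.randn_like(x)              # S5: add initial noise
    for t, sigma_t in enumerate(sigmas):
        score = generator(x, label, sigma_t)             # S6: predict the score (assumed signature)
        x = x + 0.5 * sigma_t ** 2 * score               # S7: move along the score (illustrative step)
        if t < num_steps - 1:
            x = x + sigmas[t + 1] * torch.randn_like(x)  # S8: re-noise with a smaller sigma
    return x
```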
- in some implementations, for conditional synthesis (i.e. synthesis conditioned with a label l), classifier-free guidance is applied to the conditioned synthesis.
- each evaluation of the guided score is a combination of the conditioned score S(x_t, l, σ_t) and the unconditioned score S(x_t, σ_t), the terms being weighted with a guidance weight w.
- as w increases, the conditional prediction controlled by the conditioning information c gets more and more exaggerated. This has been found to increase the quality and accuracy of the synthesized audio content while the variability of the synthesized audio content decreases. As an example, if the conditioning information c indicates "dog barking" and w is increased, the score prediction will exaggerate the generation of dog barking.
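A sketch of the classifier-free-guidance combination, assuming the common formulation in which the guided score is the unconditioned score plus w times the difference between the conditioned and unconditioned scores; the exact weighting used here may differ.

```python
import torch

def guided_score(generator, x_t, label, sigma_t, w=1.5, uncond_label=None):
    """Classifier-free guidance: blend conditioned and unconditioned score
    predictions. Larger w exaggerates the conditional prediction (higher
    fidelity to the label, lower variability)."""
    s_cond = generator(x_t, label, sigma_t)            # conditioned score S(x_t, l, sigma_t)
    s_uncond = generator(x_t, uncond_label, sigma_t)   # unconditioned score (label convention assumed)
    return s_uncond + w * (s_cond - s_uncond)
```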
- EEE 1. A neural network-based system for general audio synthesis comprising: a generator configured to generate synthesized audio, the generator comprising: an encoder configured to: transform an input audio signal with a first resolution into a low-rate sequence of hidden features; and process the hidden features to aggregate temporal information to attain a receptive field; and a decoder configured to: convert the hidden features back to the first resolution by upsampling; and output an upsampled audio signal as the generated synthesized audio.
- EEE 2 The system of EEE 1, wherein the encoder is further configured to receive conditioning information and wherein the decoder is further configured to receive the conditioning information.
- EEE 3. The system of EEE 2, wherein the conditioning information indicates a type of audio to be generated.
- EEE 4. The system of any one of EEEs 1-3, wherein the encoder comprises at least one down GBlock, wherein each down GBlock comprises: at least one activation layer; at least one conditioning layer; and at least one convolutional layer.
- EEE 5 The system of EEE 1, wherein each down GBlock further comprises at least one down sampler stage.
- EEE 6 The system of EEE 4 or 5, wherein the at least one down GBlock is configured to transform the input audio signal with the first resolution into the low-rate sequence of hidden features.
- EEE 7. The system of EEE 6, wherein the at least one down GBlock is further configured to transform the input audio signal with the first resolution into the low-rate sequence of hidden features through subsequent strided convolutions.
- EEE 8 The system of any one of EEEs 4-7, wherein the at least one conditioning layer is a FiLM conditioning layer.
- EEE 9. The system of EEE 8, wherein the FiLM conditioning layer is configured to be controlled via global and/or local parameters.
- EEE 10. The system of any one of EEEs 1-9, wherein the encoder further comprises a recurrent neural network, wherein the recurrent network is configured to process the hidden features to aggregate the temporal information to attain the receptive field.
- EEE 11 The system of any one of EEEs 1-10, wherein the receptive field is configured to contextualize features corresponding to the input audio signal regardless of a position in time of an embedding vector.
- EEE 12. The system of any one of EEEs 1-11, wherein the decoder comprises at least one up GBlock, wherein each up GBlock comprises: at least one activation layer; at least one conditioning layer; and at least one convolutional layer.
- EEE 13 The system of EEE 12, wherein each up GBlock further comprises at least one up sampler stage.
- EEE 14 The system of EEE 12 or 13, wherein the at least one up GBlock is configured to convert the hidden features back to the first resolution by upsampling with strided reversed factors with respect to the encoder.
- EEE 15. The system of any of EEEs 1-14, wherein the system is configured to operate on a full bandwidth of the input audio signal.
- EEE 17. A method for generating synthesized audio, the method comprising: receiving, at a generator, an original audio signal, x_0, conditioning information, c, and a random noise sample, z_t; wherein the conditioning information, c, comprises: information, l, corresponding to a type of audio to be generated; and a standard deviation, σ_t, of the random noise sample z_t; determining a synthesized audio based on a predicted score, S, wherein the predicted score is S(x_0 + σ_t·z_t, l, σ_t); and wherein the generator is trained to minimize a mean square error between the random noise sample, z_t, and the predicted score, S.
- EEE 18 The method of EEE 17, wherein the synthesized audio is further determined by sampling based on noise-consistent Langevin dynamics.
- EEE 19. The method of EEE 17 or 18, wherein the information, l, comprises at least one of a class label, text conditioning, visual conditioning, audio conditioning, class-to-audio information, text-to-audio information, image-to-audio information, audio-to-audio information and/or combinations of previous inputs to audio.
- EEE 20. An apparatus configured to perform the method of any one of EEEs 17-19.
- EEE 21 A non-transitory computer readable storage medium comprising a sequence of instructions which, when executed, cause one or more devices to perform the method of any one of EEEs 17-19.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202380072951.7A CN120092288A (en) | 2022-10-17 | 2023-09-29 | End-to-end general audio synthesis using generative networks |
| EP23793580.4A EP4605934A1 (en) | 2022-10-17 | 2023-09-29 | End-to-end general audio synthesis with generative networks |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| ESP202230889 | 2022-10-17 | ||
| ES202230889 | 2022-10-17 | ||
| US202263433650P | 2022-12-19 | 2022-12-19 | |
| US63/433,650 | 2022-12-19 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024086012A1 (en) | 2024-04-25 |
Family
ID=88506925
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/034098 Ceased WO2024086012A1 (en) | 2022-10-17 | 2023-09-29 | End-to-end general audio synthesis with generative networks |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4605934A1 (en) |
| CN (1) | CN120092288A (en) |
| WO (1) | WO2024086012A1 (en) |
- 2023
- 2023-09-29 WO PCT/US2023/034098 patent/WO2024086012A1/en not_active Ceased
- 2023-09-29 CN CN202380072951.7A patent/CN120092288A/en active Pending
- 2023-09-29 EP EP23793580.4A patent/EP4605934A1/en active Pending
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022079164A2 (en) * | 2020-10-15 | 2022-04-21 | Dolby International Ab | Real-time packet loss concealment using deep generative networks |
Non-Patent Citations (2)
| Title |
|---|
| JOAN SERRÀ ET AL: "Universal Speech Enhancement with Score-based Diffusion", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 June 2022 (2022-06-07), XP091241200 * |
| NEIL ZEGHIDOUR ET AL: "SoundStream: An End-to-End Neural Audio Codec", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 July 2021 (2021-07-07), XP091009160 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20250140242A1 (en) * | 2023-10-31 | 2025-05-01 | Lemon Inc. | Generating audio representations using machine learning model |
| CN119274533A (en) * | 2024-07-30 | 2025-01-07 | 清华大学深圳国际研究生院 | A highly expressive audio generation method based on natural language description text |
| CN119274533B (en) * | 2024-07-30 | 2025-11-28 | 清华大学深圳国际研究生院 | High-expressive force audio generation method based on natural language description text |
Also Published As
| Publication number | Publication date |
|---|---|
| CN120092288A (en) | 2025-06-03 |
| EP4605934A1 (en) | 2025-08-27 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7204989B2 (en) | Expressivity Control in End-to-End Speech Synthesis Systems | |
| JP7709545B2 (en) | Unsupervised Parallel Tacotron: Non-autoregressive and Controllable Text-to-Speech | |
| US20230260504A1 (en) | Variational Embedding Capacity in Expressive End-to-End Speech Synthesis | |
| CN111771213B (en) | Speech style migration | |
| Blaauw et al. | A neural parametric singing synthesizer modeling timbre and expression from natural songs | |
| US11538455B2 (en) | Speech style transfer | |
| CN108510975B (en) | System and method for real-time neural text-to-speech | |
| JP7257593B2 (en) | Training Speech Synthesis to Generate Distinguishable Speech Sounds | |
| US20240355017A1 (en) | Text-Based Real Image Editing with Diffusion Models | |
| Wu et al. | Quasi-periodic WaveNet: An autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network | |
| CN114450694B (en) | Training a neural network to generate structured embeddings | |
| EP4605934A1 (en) | End-to-end general audio synthesis with generative networks | |
| CN114267366A (en) | Speech noise reduction through discrete representation learning | |
| JP2020194558A (en) | Information processing method | |
| CN120826736A (en) | Diffusion model for audio data generation based on descriptive textual cues | |
| Bitton et al. | Neural granular sound synthesis | |
| EP4328900A1 (en) | Generative music from human audio | |
| JP7488422B2 (en) | A generative neural network model for processing audio samples in the filter bank domain | |
| Caillon | Hierarchical temporal learning for multi-instrument and orchestral audio synthesis | |
| US20250372067A1 (en) | Music generation with time varying controls | |
| US20240339104A1 (en) | Systems and methods for text-to-speech synthesis | |
| Lee | Deep Generative Model for Waveform Synthesis | |
| Siva Kumar Reddy et al. | Artificial intelligence driven gender based text-to-speech systems (TTS) using deep learning algorithms | |
| CN113177635B (en) | Information processing method, device, electronic device and storage medium | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23793580; Country of ref document: EP; Kind code of ref document: A1 |
| | DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | |
| | WWE | Wipo information: entry into national phase | Ref document number: 202380072951.7; Country of ref document: CN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023793580; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2023793580; Country of ref document: EP; Effective date: 20250519 |
| | WWP | Wipo information: published in national office | Ref document number: 202380072951.7; Country of ref document: CN |
| | WWP | Wipo information: published in national office | Ref document number: 2023793580; Country of ref document: EP |