WO2024086012A1 - End-to-end general audio synthesis with generative networks - Google Patents
- Publication number
- WO2024086012A1 (PCT/US2023/034098)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- conditioning
- generator
- audio signal
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- a neural network-based system for audio synthesis (in a continuous feature domain, e.g. in a time domain such as waveform domain) including: a generator configured to generate synthesized audio, the generator comprising: an encoder configured to: transform an input audio signal (e.g. in a time domain such as waveform domain) with a first (sample) rate into a sequence of hidden features with a second (sample) rate, lower than the first rate, and process the hidden features to aggregate temporal information to attain a receptive field.
- the system further comprises a decoder configured to: convert the hidden features associated with the receptive field back to the first (sample) rate by upsampling to form a processed signal (e.g. in time domain, such as waveform domain) and output a synthesized audio signal based on the processed signal as the generated synthesized audio.
- the relationship between the first (sample) rate of the input audio signal and the second (sample) rate output by the encoder is tunable and may be set as desired as long as the second rate is lower than the first rate.
- the first sample rate is at least 2 times higher, at least 4 times higher, at least 8 times higher, at least 10 times higher, at least 100 times higher or at least 300 times higher than the second rate.
- the transformation from the first rate to the second may be performed in separate subsequent modules, with each module performing at least one of processing and downsampling.
- the neural network based system has been conditionally trained for two or more types (classes) of audio content and receives conditioning information indicating the type of audio content to be generated.
- the two or more types (classes) of audio content may comprise two or more of: speech, environmental sounds, animal sounds, instrument sounds.
- the neural network system can then be said to be trained for general audio synthesis since it leverages the common structures in audio content across multiple types (classes) of audio content to synthesize the type (class) of audio content indicated by the conditioning information.
- the system may be configured for general audio synthesis wherein the generator has been trained with at least two types of audio content.
- General audio synthesis differs from source-specific audio synthesis, in which the generator is trained on only a single type (class) of audio content.
- a method for performing general audio synthesis (e.g. in a time domain such as waveform domain).
- the method comprising: transforming, with an encoder, an input audio signal with a first rate into a sequence of hidden features with a second rate, lower than the first rate; processing the hidden features to aggregate temporal information to attain a receptive field; converting, with a decoder, the hidden features back to the first rate by upsampling to form a processed signal; and outputting a synthesized audio signal based on the processed signal as the generated synthesized audio.
- the systems and methods of aspects of the present invention may be used to build audio synthesizers by means of neural networks that improve upon existing related synthesizers by using recent deep generative techniques to model the full bandwidth of the signal in a continuous feature domain (e.g., in time domain, such as in the waveform domain).
- the systems and methods of aspects of the invention enable general generation (i.e., generation of sounds of any type, as long as they are fed into the model during training), which can be used for conditional or unconditional audio synthesis, and also for audio style transfer (e.g., feed an acapella voice and generate unconditional piano imitating it, or a conditional dog-barking imitating it).
- Unconditional audio synthesis involves synthesizing audio content without any conditioning information indicating or guiding the generator towards the type of audio content to be synthesized.
- the resulting synthesized audio content from unconditional synthesis can therefore vary greatly and, depending on the properties of the generator, be perceived as music, speech, mechanical sounds, noise or mixtures thereof.
- a neural network generator which has been trained to synthesize audio content of a single type or class (e.g. guitar sounds, acapella sounds or dog barking sounds) may unconditionally synthesize audio content only of this specific type. That is, when a source-specific neural network generator trained specifically for a single type of audio content performs unconditional synthesis, the resulting synthesized audio will be reminiscent of the audio type used for training (e.g. guitar sounds for a generator trained only on guitar sounds).
- Conditional audio synthesis involves synthesizing audio content at least partially based on conditioning information c indicating the type of audio content to be synthesized.
- the very same neural network generator may also synthesize audio content with dog barking sounds if it is instead provided with (i.e. conditioned with) conditioning information c indicating that dog barking sounds are to be synthesized. Accordingly, by providing a general audio generator that can be conditioned, a variety of different audio content types can be synthesized with the same generator.
- the generator is used in a diffusion process as a diffusion audio generator (DAG) acting as a full-bandwidth end-to-end source-agnostic waveform synthesizer.
- the diffusion process involves iteratively extracting samples from a denoising generator, for example the samples are extracted using Langevin dynamics (see equation 1 below).
- the initial input to the generator may be a random noise signal for synthesis or any audio signal (with noise added) for style transfer and at each subsequent iteration in the diffusion process the input data is based on the output data from the previous iteration. Additionally, some noise may be added to the input of each iteration, with less and less noise being added for subsequent iterations to allow the output to converge to a synthesized (or style transfer) audio signal with reduced, or no, noise.
- the sampling process may also be performed with or without conditioning depending on the desired result. In some implementations, less and less noise is added.
- the generator has preferably been trained to remove noise at multiple noise scales.
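The "equation 1" referenced above for the Langevin sampling is not reproduced in this excerpt. As an illustrative stand-in only, a standard annealed Langevin dynamics update of the kind typically used in score-based diffusion reads

$$
x_{i+1} = x_i + \epsilon_i\, S(x_i, c, \sigma_i) + \sqrt{2\,\epsilon_i}\; z_i, \qquad z_i \sim \mathcal{N}(0, I),
$$

where $S$ is the score predicted by the generator, $\epsilon_i$ is a step size tied to the noise scale $\sigma_i$, and both decrease over iterations; the exact form and step-size rule used in the patent may differ.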
- the DAG is built upon a lossless auto-encoder that can directly generate waveforms at 48 kHz. While 48 kHz is a commonly used sample rate for many audio applications, it is understood that the proposed DAG is capable of directly generating waveforms at an arbitrarily chosen sample rate. For example, the DAG is capable of directly generating waveforms at other commonly used sample rates such as 44.1 kHz, 96 kHz or 192 kHz. Its end-to-end design makes it simpler to train and use, avoiding intermediate information bottlenecks and possible cumulative errors from one module to the next.
- DAG is built upon a score-based diffusion generative paradigm, which has shown great performance in related fields like speech synthesis, universal speech enhancement, or source-specific audio synthesis.
- a method for generating synthesized audio comprising: receiving, at a generator, conditioning information, c, and a random noise sample, z_t.
- the conditioning information, c, comprises a standard deviation, σ_t, of the random noise sample z_t.
- the generator is trained to minimize an error function between a training random noise sample, z_t, and the predicted score, S, for a training audio signal comprising the training random noise sample z_t and at least one type of audio content.
- the generator may be built upon generative adversarial learning (e.g., such as generative adversarial networks, GANs).
- FIG. 1 is a block diagram illustrating the generator according to some implementations.
- Figure 2 is a block diagram illustrating the structure of an up or down GBlock according to some implementations.
- Figure 3 is a flowchart describing a method for generating synthesized audio content.
- Figure 4 is a flowchart describing a method for generating synthesized audio content with the generator used in a diffusion process.
- DETAILED DESCRIPTION OF CURRENTLY PREFERRED EMBODIMENTS
- Systems and methods disclosed in the present application may be implemented as software, firmware, hardware or a combination thereof.
- the division of tasks does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation.
- the computer hardware may for example be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, an AR/VR wearable, automotive infotainment system, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that computer hardware.
- processors that accept computer-readable (also called machine-readable) code containing a set of instructions that when executed by one or more of the processors carry out at least one of the methods described herein.
- Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included.
- Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit.
- the processing system further may include a memory subsystem including a hard drive, SSD, RAM and/or ROM.
- a bus subsystem may be included for communicating between the components.
- the software may reside in the memory subsystem and/or within the processor during execution thereof by the computer system.
- the one or more processors may operate as a standalone device or may be connected, e.g., networked to other processor(s). Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
- the software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media).
- Computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
- Computer storage media includes, but is not limited to, physical (non-transitory) storage media in various forms, such as EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
- Fig. 1 describes an embodiment of a neural audio generator 10 (sometimes referred to as the generator 10) that makes it possible to synthesize variations of learned sounds, these being sounds of any type: environmental sounds, harmonic sounds, etc.
- the generator 10 can learn multiple types of sound with a single neural network.
- the generator 10 can be conditioned with conditioning information, c, comprising e.g., auxiliary features that determine the source to synthesize when sharing the same structure for multiple styles (e.g., a discrete set of labels, l, can control the synthesis output of the generator 10 on what to generate).
- the generator 10 can also be used for audio style transfer by combining the appropriate conditioning information c with the deep generative framework latent domain arithmetic.
- the construction of the generator 10 has been accomplished by configuring neural encoder 1 and decoder 2 modules based on convolutional and recurrent neural blocks or self-attention blocks and configuring a deep generative task upon them to approximate a real signal distribution.
- the generator 10 as a whole may be capable of modelling the distribution of properties of real audio signals.
- the generator learns to model these distributions when it is trained to accomplish the generative task.
- the generative task is performing noise cancellation for varying noise levels.
- the proposed generator 10 structure and deep generative task lead to a powerful audio synthesizer that can work directly with sampled audio signals at any sample rate (e.g. 48 kHz), optionally expressed in the waveform domain.
- the encoder 1 is configured to receive conditioning information c and the decoder 2 is configured to receive the conditioning information c.
- the conditioning information c indicates a type (or class) of audio to be generated.
- the conditioning information c indicates a specific musical instrument (e.g. piccolo, guitar, or drums), a musical genre (e.g. rock, pop, or rap), an animal sound (e.g. cat, dog, or elephant) or types of human oral utterances (e.g. song, speech, scream, male, female).
- the type (or class) of audio content may be indicated with a label, l, comprised in the conditioning information c.
- the conditioning information can take one of many formats.
- the conditioning information c is in the form of a class label, text conditioning, visual conditioning, audio conditioning, class-to-audio information, text-to-audio information, image-to-audio information, audio-to-audio information and/or combinations of previous inputs to audio.
- the embodiment of the generator 10 depicted in fig. 1 follows the structure of UNet auto-encoding frameworks: the generator 10 features an encoder 1 and a decoder 2.
- the encoder 1 is built with downsampling convolutional feature encoders referred to as Down GBlocks or DGBlocks (DGB1, DGB2, ..., DGBk) that transform the input signal z (which may consist of noise or be an original audio signal with added noise for style transfer) with a first sample rate (resolution) into a sequence of hidden features f1 (embedding vectors) with a second, lower, rate (resolution) compared to the first rate through subsequent strided convolutions with stride factors s_k, where the subscript k indicates the layer index.
- the second DGBlock DGB2 obtains the hidden features f1 as input and performs further downsampling and/or processing to output hidden features f2. The same process is then repeated for all DGBlocks DGB1:k in the encoder 1 until finally the hidden features fk are output by the final DGBlock DGBk.
- An exemplary embodiment with five DGBlocks DGB1 – DGB5 contains the following strides (downsampling factors), from lower to higher layers in depth: [2, 2, 4, 4, 5].
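For illustration only, a minimal PyTorch-style sketch of a downsampling path with the example strides [2, 2, 4, 4, 5] is given below; the `DownBlock` module, kernel sizes and channel widths are assumptions and stand in for the full Down GBlocks described later.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Simplified stand-in for a Down GBlock: a single strided 1-D convolution
    that both processes and downsamples the signal by `stride`."""
    def __init__(self, in_ch: int, out_ch: int, stride: int):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=2 * stride + 1,
                              stride=stride, padding=stride)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(x))

strides = [2, 2, 4, 4, 5]                 # example downsampling factors from the text
channels = [1, 32, 64, 128, 256, 512]     # illustrative channel widths (assumed)

encoder = nn.Sequential(
    *[DownBlock(channels[i], channels[i + 1], strides[i]) for i in range(len(strides))]
)

x = torch.randn(1, 1, 48000)              # one second of 48 kHz waveform-domain audio
f_k = encoder(x)
# Total downsampling factor: 2 * 2 * 4 * 4 * 5 = 320, so 48000 samples -> 150 frames.
print(f_k.shape)                          # torch.Size([1, 512, 150])
```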
- the hidden features fk output by the final DGBlock DGBk are provided to and processed by a neural network module 3 which is configured to utilize an increased temporal context window compared to the final DGBlock DGBk and which outputs hidden features fR.
- the neural network module 3 with an increased temporal context window may comprise one or more recurrent neural networks, RNNs, that aggregate temporal information to attain a broader temporal receptive field (i.e. to contextualize features about the whole input sequence regardless of the position in time of a certain embedding vector).
- the hidden features fR having been processed to incorporate information from a broader temporal receptive field, are optionally used as input to the decoder 2.
- the hidden features fR based on the broader temporal field are combined with hidden features fk from the last DGBlock DGBk to form combined features fSUM that are used as input to the decoder 2.
- the input to the decoder 2 is based on at least one of the output fk from the last DGBlock DGBk and the output fR of a neural network module 3 (e.g. RNN) with increased temporal context window compared to the last DGBlock DGBk.
- An RNN can be unidirectional or bidirectional, and can take the form of gated cells like gated recurrent units (GRUs) or long short-term memories (LSTMs). An embodiment of this is a bidirectional GRU of two layers. A residual connection surrounding the RNN alleviates potential gradient flow issues of RNN saturating activations.
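A minimal sketch of such a bottleneck, assuming a two-layer bidirectional GRU with a residual connection as described; the linear projection back to the feature width is an added assumption so that the residual sum is dimensionally consistent.

```python
import torch
import torch.nn as nn

class TemporalContext(nn.Module):
    """Bidirectional GRU bottleneck with a residual connection, as a sketch of
    neural network module 3: it aggregates temporal information over the whole
    low-rate feature sequence."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        # Project the 2*hidden bidirectional output back to feat_dim so the
        # residual addition (f_k + f_r) is well defined.
        self.proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, f_k: torch.Tensor) -> torch.Tensor:
        # f_k: (batch, channels, frames); the GRU expects (batch, frames, channels)
        h, _ = self.rnn(f_k.transpose(1, 2))
        f_r = self.proj(h).transpose(1, 2)
        return f_k + f_r   # residual connection alleviates RNN gradient-flow issues

ctx = TemporalContext(feat_dim=512)
f_sum = ctx(torch.randn(1, 512, 150))   # same shape as the encoder output
```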
- the decoder 2 converts the hidden features obtained at its input (e.g. the combined features fSUM or the hidden features fR) back to the first rate by upsampling with Up GBlocks (UGBlocks UGB1:k) to form the processed signal.
- the first UGBlock UGB1 outputs hidden features f’1 and the second UGBlock UGB2 outputs hidden features f’2 and so on until the last UGBlock UGBk outputs the score function S.
- not all DGBlocks DGB1:k and UGBlocks UGB1:k need to perform downsampling or upsampling.
- each down GBlock DGB1:k includes at least one down sampler stage.
- the at least one down GBlock DGB1:k is configured to transform the input audio signal with the first rate (resolution) into the second rate sequence of hidden features.
- the at least one down GBlock DGB1:k is further configured to transform the input audio signal with the first rate into the second rate sequence of hidden features through subsequent strided convolutions.
- the generator 10 follows an architecture that works by inferring all input time-steps in parallel (despite the recurrence imposed by the recurrent layer); hence all output time-steps after decoding are predicted at once, from the first input sample x_1 to the last one x_T of a signal segment with T samples.
- the generator 10 is configured to operate on the full bandwidth of the input audio signal.
- the generator 10 is configured to operate on a 48 kHz sampled audio signal in the waveform domain or an audio signal in the waveform domain with a different sample rate (e.g. 44.1 kHz or 192 kHz).
- any specific examples of the number of DGBlocks and their respective downsampling factors in the encoder 1, and the number of UGBlocks and their respective upsampling factors in the decoder 2, are merely exemplary. There may be one, two, three or more DGBlocks and UGBlocks in the encoder 1 and decoder 2, respectively.
- the details of the Down GBlocks DGB1:k and Up GBlocks UGB1:k are shown with further reference to fig. 2.
- a difference between the up GBlocks from the decoder 2 and the down GBlocks from the encoder 1 is the exchange of downsampling and upsampling stages (see linear resample block 35).
- the linear resampler 35 acts as a linear downsampler in the skip connection and is combined with a strided convolution 23a for downsampling blocks, and the linear resampler 35 acts as a linear upsampler in the skip connection and is combined with a transposed convolution 23b for upsampling blocks. That is, down GBlocks DGB1:k used in downsampling encoder blocks utilize the StridedConv branch, and up GBlocks UGB1:k used in upsampling decoder blocks utilize the TransposedConv branch.
- the input z is general to the network.
- the signals x and c are the input to each GBlock and the conditioning signal (e.g. label projection) respectively.
- each GBlock features four convolutional blocks 23a, 23b, 26, 29, 32, four non-linear activations 22, 25, 28, 31 (e.g. LeakyReLUs) and four FiLM (Feature-wise Linear Modulation) conditioning layers 21, 24, 27, 30, with each layer’s parameters (e.g. convolutions) being independent from each other throughout the whole network.
- the conditioning FiLM layers 21, 24, 27, 30 can be controlled with conditioning information c comprising global or local parameters, like a class label (e.g. ‘siren’, ‘train’, ‘fan’, ...) or a time-varying feature (e.g. loudness curve).
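As an illustration of feature-wise linear modulation in general (not the patent's exact layer), a FiLM layer predicts a per-channel scale and shift from the conditioning embedding c; the single linear projection below is an assumption.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: per-channel scale (gamma) and shift (beta)
    predicted from the conditioning embedding c."""
    def __init__(self, cond_dim: int, channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), c: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(c).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)
```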
- each GBlock may include a dilation pattern for the convolutions 23a, 23b, 26, 29, 32 to increase the receptive field.
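Combining the above ingredients, a hedged sketch of one possible GBlock-style unit follows: a strided (down) or transposed (up) convolution in the main branch, a linearly resampled skip connection, and FiLM-conditioned dilated convolutions with LeakyReLU activations. The ordering, channel widths and dilation pattern (1, 2, 4) are illustrative assumptions; `FiLM` refers to the sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GBlock(nn.Module):
    """Illustrative GBlock: main branch with a strided (down) or transposed (up)
    convolution, a linearly resampled skip connection, and FiLM-conditioned
    dilated convolutions with LeakyReLU activations."""
    def __init__(self, in_ch, out_ch, cond_dim, stride, mode="down"):
        super().__init__()
        if mode == "down":
            self.resample = nn.Conv1d(in_ch, out_ch, 2 * stride + 1, stride, padding=stride)
        else:
            self.resample = nn.ConvTranspose1d(in_ch, out_ch, 2 * stride, stride, padding=stride // 2)
        self.skip = nn.Conv1d(in_ch, out_ch, 1)   # 1x1 conv to match channels on the skip path
        self.films = nn.ModuleList(FiLM(cond_dim, out_ch) for _ in range(3))  # FiLM: see sketch above
        self.convs = nn.ModuleList(
            nn.Conv1d(out_ch, out_ch, 3, padding=d, dilation=d) for d in (1, 2, 4)
        )
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, c):
        h = self.resample(self.act(x))                        # strided / transposed conv branch
        skip = F.interpolate(self.skip(x), size=h.shape[-1],
                             mode="linear", align_corners=False)  # linear resampler (block 35)
        for film, conv in zip(self.films, self.convs):
            h = conv(self.act(film(h, c)))                    # FiLM -> activation -> dilated conv
        return h + skip
```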
- the encoder includes: at least one down GBlock DGB1:k, wherein each down GBlock DGB1:k includes: at least one activation layer 22, 25, 28, 31; at least one conditioning layer 21, 24, 27, 30; and at least one convolutional layer 23a, 23b, 26, 29, 32, 36.
- An input audio signal is obtained and at step S1 the input audio signal is transformed, with a (trained) encoder 1, from a first rate (resolution) representation in a continuous feature domain (e.g. a time domain such as the waveform domain) to a second (lower) rate sequence of hidden features.
- the hidden features are processed (e.g. with an RNN 3) to aggregate temporal information to attain a receptive field.
- a (trained) decoder 2 converts the hidden features back to the first rate by upsampling to form a processed signal and at step S4 a synthesized audio signal (in e.g. waveform domain) is obtained from the generator based on the processed signal.
- the generator 10 has been trained to generate the synthesized audio signal directly whereby the processed signal is the synthesized audio signal.
- step S3 and S4 may be combined.
- the processed signal is a score signal indicating for each sample a score S wherein the score is indicative of how to modify the input audio signal to obtain the synthesized audio signal.
- the score signal has a (sample) rate equal to that of the input audio signal meaning that the output of the generator in score based diffusion is one score for each individual sample in the input audio signal.
- the score indicates the direction in which the input audio signal should be altered to increase the likelihood of each sample whereby modifying the input audio signal in accordance with the score results in a synthesized audio signal.
- step S4 may comprise forming the synthesized audio signal based on the input audio signal and the score signal.
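Tying steps S1-S4 together, a schematic of the score-based forward pass could look as follows; the module interfaces are assumed, not taken from the patent.

```python
import torch
import torch.nn as nn

class DiffusionAudioGenerator(nn.Module):
    """Schematic forward pass: encode to a low-rate feature sequence (S1),
    aggregate temporal context (S2), decode back to the input rate (S3),
    and return one score value per input sample (S4)."""
    def __init__(self, encoder: nn.Module, context: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.context, self.decoder = encoder, context, decoder

    def forward(self, noisy_audio: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        f_k = self.encoder(noisy_audio, cond)   # S1: waveform at the first rate -> hidden features
        f_sum = self.context(f_k)               # S2: broaden the temporal receptive field
        score = self.decoder(f_sum, cond)       # S3: upsample back to the first rate
        return score                            # S4: same length as the input, one score per sample
```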
- the generator 10 is formulated as a score predictor that predicts a score S(x_0 + σ_t·z_t, l, σ_t), where x_0 is the original audio signal, σ_t is the standard deviation of the Gaussian noise sample z_t used in the diffusion process, and l is the label that drives what type of audio is to be generated (e.g., piano, music, dog barking, gunshots, fire crackling, environmental sounds in general, and mixtures of these types).
- the generator 10 is sometimes referred to as a diffusion audio generator, DAG.
- the generator 10 is provided with training audio signals comprising audio content with additive sampled random noise z_t, and the generator 10 learns by minimizing the mean squared error between the sampled noise z_t and the predicted score S. More specifically, the generator 10 learns by minimizing the mean squared error between the sampled noise z_t scaled with the standard deviation σ_t and the predicted score S.
- the generator 10 predicts the score S, i.e. the gradient of the logarithm of the density of the training data. The score S therefore represents, for each input data point, an indication of the direction in which the likelihood of the data increases most rapidly. That is, the generator 10 learns to predict, for each data point, the trajectory to follow in order to remove the noise and recover the training data.
- the score signal therefore comprises, for each sample of the input audio signal, an indication of the direction in which the likelihood of data increases most rapidly.
- the score signal has the same sample rate as the input audio signal, and the score signal may be referred to as a score audio signal or a processed audio signal since it is the result of processing the input audio signal with the generator 10.
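A hedged sketch of one training step under the stated objective, i.e. the mean squared error between the noise scaled by σ_t and the predicted score; the noise-level distribution, the generator call signature and the optimizer handling are illustrative assumptions.

```python
import math
import torch

def training_step(generator, optimizer, x0, labels, sigma_min=1e-4, sigma_max=1.0):
    """One denoising score-matching step: perturb clean audio x0 with Gaussian
    noise of a random scale sigma_t and regress the predicted score against the
    scaled noise (mean squared error)."""
    batch = x0.shape[0]
    # Random noise scale per example (the log-uniform schedule is an assumption).
    log_sigma = torch.empty(batch, device=x0.device).uniform_(
        math.log(sigma_min), math.log(sigma_max))
    sigma_t = log_sigma.exp().view(batch, 1, 1)
    z_t = torch.randn_like(x0)                    # sampled noise z_t
    noisy = x0 + sigma_t * z_t                    # training input x_0 + sigma_t * z_t
    score = generator(noisy, labels, sigma_t)     # predicted score S (assumed call signature)
    loss = ((sigma_t * z_t - score) ** 2).mean()  # MSE between scaled noise and predicted score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```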
- both conditioning signals in the diffusion process, l and σ_t, are projected through some hidden layers and concatenated to form the conditioning features c from fig. 1. More concretely, the logarithm of σ_t is processed with random Fourier feature embeddings and a multilayer perceptron (MLP) as in previous works.
- the label l is linearly projected through an embedding layer, and the resulting embedding is concatenated to the sigma's MLP features.
- the label l may include information such as a class label, text conditioning, visual conditioning, audio conditioning, class-to-audio information, text-to-audio information, image-to-audio information, audio-to-audio information and/or combinations of previous inputs to audio.
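A minimal sketch of this conditioning path, using a class-label embedding as the example form of l: log σ_t goes through fixed random Fourier features and an MLP, the label through an embedding layer, and the two are concatenated. All dimensions and the specific Fourier-feature construction are assumptions.

```python
import torch
import torch.nn as nn

class ConditioningEncoder(nn.Module):
    """Builds the conditioning vector c: log(sigma_t) is embedded with random
    Fourier features and an MLP, the class label l with an embedding layer,
    and the two embeddings are concatenated."""
    def __init__(self, num_classes: int, rff_dim: int = 32, emb_dim: int = 128):
        super().__init__()
        # Fixed random frequencies for the Fourier features (not trained).
        self.register_buffer("freqs", torch.randn(rff_dim))
        self.sigma_mlp = nn.Sequential(
            nn.Linear(2 * rff_dim, emb_dim), nn.SiLU(), nn.Linear(emb_dim, emb_dim))
        self.label_emb = nn.Embedding(num_classes, emb_dim)

    def forward(self, sigma_t: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        # sigma_t: (batch,) noise scales, label: (batch,) integer class indices
        proj = torch.log(sigma_t)[:, None] * self.freqs[None, :] * 2 * torch.pi
        rff = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return torch.cat([self.sigma_mlp(rff), self.label_emb(label)], dim=-1)

cond = ConditioningEncoder(num_classes=10)
c = cond(torch.tensor([0.5, 0.1]), torch.tensor([3, 7]))   # shape (2, 256)
```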
- the label l will be a class label for a guitar class whereby the generator 10 will be conditioned to synthesize guitar audio content.
- the label l may also be in the form of a text string such as "GUITAR" or "FRETTED MUSICAL INSTRUMENT WITH SIX STRINGS" or an image showing a guitar, whereby the generator 10 has been trained with text strings and/or images to generate guitar sounds conditioned on the label l indicating a guitar. If the label l is changed to a class label for a saxophone class, the string "SAXOPHONE" or an image of a saxophone, the generator 10 will instead synthesize saxophone audio content. In general, the label l can be set to indicate any type of audio content which has been used during training. In some implementations, the generator 10 has only been trained with training data of a single sound type.
- the generator 10 has only been trained with piccolo audio content whereby the generator 10 may only be capable of synthesizing piccolo audio content.
- the label l is not needed as the generator 10 will unconditionally synthesize audio content of the single audio type (e.g. only piccolo sounds) for which it has been trained.
- the resulting conditioning signal ⁇ is what drives all the FiLM conditioning layers throughout the generator 10.
- the generator 10 allows for both conditional and unconditional audio generation/synthesis, since the model can be trained on multi-class audio datasets to learn the signal distribution, and new audio sequences can be sampled from it with or without labels. For example, the generator 10 will perform unconditional synthesis if it is not provided with any conditioning information c indicating a type of audio content to be generated, and performs conditional synthesis if it is provided with conditioning information c indicating a type of audio content to be generated.
- in some implementations, unconditional audio generation may be initiated by conditioning information c (e.g. a label l) explicitly indicating unconditional synthesis.
- a method for using the generator 10 for score-based diffusion synthesis will now be described.
- the input audio signal is a silent signal and random noise sample(s) z_t are added to it to form a noisy input audio signal.
- the generator 10 then generates a score S at step S6 based on the input audio signal, the standard deviation σ_t associated with the noisy input audio signal or added noise sample(s) z_t, and optionally based on any class label l of the conditioning information c.
- based on the score S, the input audio signal for the next iteration is determined at step S7 using equation 1 above. The synthesized audio signal resulting from one iteration will likely still be perceived as a noisy signal.
- the signal output at one iteration is then used as input to the generator 10, which generates an output audio signal in the next iteration, and the process is repeated until the final synthesized audio signal is obtained.
- the standard deviation σ_t is adjusted at step S8.
- further noise samples z_t are also added at step S8 for each iteration.
- σ_t decreases from one iteration to the next, meaning that the energy of the synthesized audio signal will gradually prevail over the noise energy, which decreases for each iteration.
- the label l indicates "dog barking" and the source audio signal is a melody played on a piano, and the resulting style-transferred audio signal is dog barking sounds resembling the piano melody of the source audio signal.
- the source audio signal is normalized with an amplitude factor prior to adding the noise at initialization so that the information of the source audio signal is not completely lost.
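A schematic sketch of the iterative sampling loop of steps S5-S8, covering both plain synthesis (start from silence plus noise) and style transfer (start from a normalized source signal plus noise). The noise schedule, step-size rule and normalization factor are illustrative assumptions and do not reproduce the patent's equation 1.

```python
import math
import torch

@torch.no_grad()
def sample(generator, label, num_steps=50, length=48000,
           sigma_max=1.0, sigma_min=1e-4, source_audio=None):
    """Iteratively refine a signal with the score predictor (steps S5-S8).
    Start from silence plus noise (synthesis) or from a normalized source
    signal plus noise (style transfer); sigma_t shrinks every iteration."""
    sigmas = torch.logspace(math.log10(sigma_max), math.log10(sigma_min), num_steps)
    # 0.5 is an illustrative amplitude-normalization factor for style transfer.
    x = torch.zeros(1, 1, length) if source_audio is None else 0.5 * source_audio
    x = x + sigmas[0] * torch.randn_like(x)              # S5: add initial noise
    for t, sigma_t in enumerate(sigmas):
        score = generator(x, label, sigma_t)             # S6: predict the score (assumed signature)
        x = x + 0.5 * sigma_t ** 2 * score               # S7: move along the score (illustrative step)
        if t < num_steps - 1:
            x = x + sigmas[t + 1] * torch.randn_like(x)  # S8: re-noise with a smaller sigma
    return x
```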
- in some implementations, for conditional synthesis (i.e. synthesis conditioned with a label l), classifier-free guidance is applied to the conditioned synthesis.
- each evaluation of the guided score is a combination of the conditioned score S(x_t, l, σ_t) and the unconditioned score S(x_t, σ_t), the terms being weighted with a guidance weight w.
- as w increases, the conditional prediction controlled by the conditioning information c gets more and more exaggerated. This has been found to increase the quality and accuracy of the synthesized audio content while the variability of the synthesized audio content decreases. As an example, if the conditioning information c indicates "dog barking" and w is increased, the score prediction will exaggerate the generation of dog barking.
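A sketch of the classifier-free-guidance combination, assuming the common formulation in which the guided score is the unconditioned score plus w times the difference between the conditioned and unconditioned scores; the exact weighting used here may differ.

```python
import torch

def guided_score(generator, x_t, label, sigma_t, w=1.5, uncond_label=None):
    """Classifier-free guidance: blend conditioned and unconditioned score
    predictions. Larger w exaggerates the conditional prediction (higher
    fidelity to the label, lower variability)."""
    s_cond = generator(x_t, label, sigma_t)            # conditioned score S(x_t, l, sigma_t)
    s_uncond = generator(x_t, uncond_label, sigma_t)   # unconditioned score (label convention assumed)
    return s_uncond + w * (s_cond - s_uncond)
```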
- EEE 1. A neural network-based system for general audio synthesis comprising: a generator configured to generate synthesized audio, the generator comprising: an encoder configured to: transform an input audio signal with a first resolution into a low-rate sequence of hidden features; and process the hidden features to aggregate temporal information to attain a receptive field; and a decoder configured to: convert the hidden features back to the first resolution by upsampling; and output an upsampled audio signal as the generated synthesized audio.
- EEE 2 The system of EEE 1, wherein the encoder is further configured to receive conditioning information and wherein the decoder is further configured to receive the conditioning information.
- EEE 3. The system of EEE 2, wherein the conditioning information indicates a type of audio to be generated.
- EEE 4. The system of any one of EEEs 1-3, wherein the encoder comprises at least one down GBlock, wherein each down GBlock comprises: at least one activation layer; at least one conditioning layer; and at least one convolutional layer.
- EEE 5 The system of EEE 1, wherein each down GBlock further comprises at least one down sampler stage.
- EEE 6 The system of EEE 4 or 5, wherein the at least one down GBlock is configured to transform the input audio signal with the first resolution into the low-rate sequence of hidden features.
- EEE 7. The system of EEE 6, wherein the at least one down GBlock is further configured to transform the input audio signal with the first resolution into the low-rate sequence of hidden features through subsequent strided convolutions.
- EEE 8 The system of any one of EEEs 4-7, wherein the at least one conditioning layer is a FiLM conditioning layer.
- EEE 9. The system of EEE 8, wherein the FiLM conditioning layer is configured to be controlled via global and/or local parameters.
- EEE 10. The system of any one of EEEs 1-9, wherein the encoder further comprises a recurrent neural network, wherein the recurrent network is configured to process the hidden features to aggregate the temporal information to attain the receptive field.
- EEE 11 The system of any one of EEEs 1-10, wherein the receptive field is configured to contextualize features corresponding to the input audio signal regardless of a position in time of an embedding vector.
- EEE 12. The system of any one of EEEs 1-11, wherein the decoder comprises at least one up GBlock, wherein each up GBlock comprises: at least one activation layer; at least one conditioning layer; and at least one convolutional layer.
- EEE 13 The system of EEE 12, wherein each up GBlock further comprises at least one up sampler stage.
- EEE 14 The system of EEE 12 or 13, wherein the at least one up GBlock is configured to convert the hidden features back to the first resolution by upsampling with strided reversed factors with respect to the encoder.
- EEE 15. The system of any of EEEs 1-14, wherein the system is configured to operate on a full bandwidth of the input audio signal.
- EEE 17. A method for generating synthesized audio, the method comprising: receiving, at a generator, an original audio signal, x_0, conditioning information, c, and a random noise sample, z_t; wherein the conditioning information, c, comprises: information, l, corresponding to a type of audio to be generated; and a standard deviation, σ_t, of the random noise sample z_t; determining a synthesized audio based on a predicted score, S, wherein the predicted score is S(x_0 + σ_t·z_t, l, σ_t); and wherein the generator is trained to minimize a mean square error between the random noise sample, z_t, and the predicted score, S.
- EEE 18 The method of EEE 17, wherein the synthesized audio is further determined by sampling based on noise-consistent Langevin dynamics.
- EEE 19. The method of EEE 17 or 18, wherein the information, l, comprises at least one of a class label, text conditioning, visual conditioning, audio conditioning, class-to-audio information, text-to-audio information, image-to-audio information, audio-to-audio information and/or combinations of previous inputs to audio.
- EEE 20. An apparatus configured to perform the method of any one of EEEs 17-19.
- EEE 21 A non-transitory computer readable storage medium comprising a sequence of instructions which, when executed, cause one or more devices to perform the method of any one of EEEs 17-19.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202380072951.7A CN120092288A (en) | 2022-10-17 | 2023-09-29 | End-to-end general audio synthesis using generative networks |
| EP23793580.4A EP4605934A1 (en) | 2022-10-17 | 2023-09-29 | End-to-end general audio synthesis with generative networks |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| ESP202230889 | 2022-10-17 | ||
| ES202230889 | 2022-10-17 | ||
| US202263433650P | 2022-12-19 | 2022-12-19 | |
| US63/433,650 | 2022-12-19 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024086012A1 (en) | 2024-04-25 |
Family
ID=88506925
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/034098 Ceased WO2024086012A1 (en) | 2022-10-17 | 2023-09-29 | End-to-end general audio synthesis with generative networks |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4605934A1 (en) |
| CN (1) | CN120092288A (en) |
| WO (1) | WO2024086012A1 (en) |
- 2023
- 2023-09-29 WO PCT/US2023/034098 patent/WO2024086012A1/en not_active Ceased
- 2023-09-29 CN CN202380072951.7A patent/CN120092288A/en active Pending
- 2023-09-29 EP EP23793580.4A patent/EP4605934A1/en active Pending
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022079164A2 (en) * | 2020-10-15 | 2022-04-21 | Dolby International Ab | Real-time packet loss concealment using deep generative networks |
Non-Patent Citations (2)
| Title |
|---|
| JOAN SERRÀ ET AL: "Universal Speech Enhancement with Score-based Diffusion", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 June 2022 (2022-06-07), XP091241200 * |
| NEIL ZEGHIDOUR ET AL: "SoundStream: An End-to-End Neural Audio Codec", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 July 2021 (2021-07-07), XP091009160 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20250140242A1 (en) * | 2023-10-31 | 2025-05-01 | Lemon Inc. | Generating audio representations using machine learning model |
| CN119274533A (en) * | 2024-07-30 | 2025-01-07 | 清华大学深圳国际研究生院 | A highly expressive audio generation method based on natural language description text |
| CN119274533B (en) * | 2024-07-30 | 2025-11-28 | 清华大学深圳国际研究生院 | High-expressive force audio generation method based on natural language description text |
Also Published As
| Publication number | Publication date |
|---|---|
| CN120092288A (en) | 2025-06-03 |
| EP4605934A1 (en) | 2025-08-27 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7204989B2 (en) | Expressivity Control in End-to-End Speech Synthesis Systems | |
| JP7709545B2 (en) | Unsupervised Parallel Tacotron: Non-autoregressive and Controllable Text-to-Speech | |
| US20230260504A1 (en) | Variational Embedding Capacity in Expressive End-to-End Speech Synthesis | |
| CN111771213B (en) | Speech style migration | |
| Blaauw et al. | A neural parametric singing synthesizer modeling timbre and expression from natural songs | |
| US11538455B2 (en) | Speech style transfer | |
| CN108510975B (en) | System and method for real-time neural text-to-speech | |
| JP7257593B2 (en) | Training Speech Synthesis to Generate Distinguishable Speech Sounds | |
| US20240355017A1 (en) | Text-Based Real Image Editing with Diffusion Models | |
| Wu et al. | Quasi-periodic WaveNet: An autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network | |
| CN114450694B (en) | Training a neural network to generate structured embeddings | |
| EP4605934A1 (en) | End-to-end general audio synthesis with generative networks | |
| CN114267366A (en) | Speech noise reduction through discrete representation learning | |
| JP2020194558A (en) | Information processing method | |
| CN120826736A (en) | Diffusion model for audio data generation based on descriptive textual cues | |
| Bitton et al. | Neural granular sound synthesis | |
| EP4328900A1 (en) | Generative music from human audio | |
| JP7488422B2 (en) | A generative neural network model for processing audio samples in the filter bank domain | |
| Caillon | Hierarchical temporal learning for multi-instrument and orchestral audio synthesis | |
| US20250372067A1 (en) | Music generation with time varying controls | |
| US20240339104A1 (en) | Systems and methods for text-to-speech synthesis | |
| Lee | Deep Generative Model for Waveform Synthesis | |
| Siva Kumar Reddy et al. | Artificial intelligence driven gender based text-to-speech systems (TTS) using deep learning algorithms | |
| CN113177635B (en) | Information processing method, device, electronic device and storage medium | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23793580; Country of ref document: EP; Kind code of ref document: A1 |
| | DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | |
| | WWE | Wipo information: entry into national phase | Ref document number: 202380072951.7; Country of ref document: CN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023793580; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2023793580; Country of ref document: EP; Effective date: 20250519 |
| | WWP | Wipo information: published in national office | Ref document number: 202380072951.7; Country of ref document: CN |
| | WWP | Wipo information: published in national office | Ref document number: 2023793580; Country of ref document: EP |