WO2024184745A1 - Unsupervised voice restoration with unconditional diffusion model - Google Patents
Unsupervised voice restoration with unconditional diffusion model
- Publication number
- WO2024184745A1 (PCT/IB2024/051943)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- diffusion
- audio signal
- waveform
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/038—Speech enhancement, e.g. noise reduction or echo cancellation using band spreading techniques
Definitions
- the invention relates to the field of computer technologies, in particular to methods for processing and analyzing audio recordings. More particularly, the invention may be employed for improving the quality and intelligibility of speech recordings, and/or be used as a part of a text-to-speech system as a neural vocoder.
- Background Art [0002] As the field of artificial intelligence continues to evolve, generative models have emerged as a powerful tool for a variety of tasks including speech processing. In recent years, diffusion models (see [1, 2, 3]) have gained attention due to their ability to efficiently model complex high-dimensional distributions.
- Diffusion models are designed to learn the underlying data distribution’s implicit prior by matching the gradient of the log density.
- One of the presently evolving approaches to the task of speech processing, in particular voice restoration, speech differentiation, vocoding etc. is generally referred to as unconditional speech generation. The latter, however, is generally a challenging task due to the high diversity of possible linguistic content.
- Prior works on diffusion models tend to consider conditional speech generation [10, 11] or limit the scope to simple datasets with predefined phrases (e.g., spoken digits) [12, 10].
- Reference [1] describes score-based diffusion models, which are a class of neural generative models that can be informally described as gradually transforming analytically known and unknown (only samples are available) data distributions to each other.
- Reference [7] describes using a single diffusion model to solve several problems of audio recording restoration, namely frequency bandwidth extension and declipping.
- the method of [7] was experimentally tested on music, in particular to restore audio recordings of a piano, while the present invention is intended mainly for restoring speech recordings.
- the prior art solution of [7] has not been tested for such tasks as neural vocoding and source separation from mixtures of voices.
- the prior art solution of [7] uses the UNet architecture model, and not the FFC-AE architecture model, which is known to have advantages over UNet. Summary of Invention [0006] This section which discloses various aspects of the claimed invention is intended for providing a brief overview of the claimed subject matters and their embodiments.
- the above-mentioned object is achieved by a method for voice restoration in speech recordings, the method comprising: receiving audio data of a speech recording containing a voice audio signal; applying a diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation, wherein the diffusion probabilistic model is applied to the audio data of the speech recording in the form of a waveform comprising random Gaussian noise, iteratively sampling the waveform with a conditional score function, which is a sum of unconditional score function estimated by the diffusion probabilistic model and log-likelihood, so as to produce a sample with a reduced amount of noise for the next iteration, until a speech waveform without noise is obtained; and outputting the processed voice audio signal comprising the speech waveform without noise.
- the diffusion probabilistic model is selected from a group comprising a fast Fourier convolutional autoencoder (FFC-AE) model, a Diffwave model, and a CQT-UNet model.
- the method may further comprise performing frequency bandwidth extension, declipping on the voice audio signal, performing neural vocoding on the voice audio signal to convert spectral representations of the voice audio signal to audio waveforms, performing source separation on the voice audio signal.
- the diffusion probabilistic model may be adapted for degradation inversion.
- the diffusion probabilistic model may be adapted to solve bandwidth extension, declipping, neural vocoding, and/or source separation tasks by modifying a voice audio signal sampling procedure, said modification making the sampling to be conditional on observations, which are a waveform with reduced bandwidth in the case of bandwidth extension task, a clipped waveform in the case of declipping task, a mel-spectrogram in the case of neural vocoding and/or a waveform with mixed voices in the case of source separation task.
- a system for voice restoration in speech recordings comprising: a memory; a speech recording receiving module configured to receive a speech recording comprising at least a voice audio signal; a voice restoration processing module; wherein the voice restoration processing module comprises: a neural network module configured to apply a diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation, wherein the diffusion probabilistic model is applied to the audio data of the speech recording in the form of a waveform comprising random Gaussian noise, the neural network module being configured to iteratively sample the waveform with a conditional score function, which is a sum of unconditional score function estimated by the diffusion probabilistic model and log-likelihood, so as to produce a sample with a reduced amount of noise for the next iteration, until a speech waveform without noise is obtained; and the voice restoration processing module being configured to output the processed voice audio signal comprising the speech waveform without noise.
- the diffusion probabilistic model is selected from a group comprising a fast Fourier convolutional autoencoder (FFC-AE) model, a Diffwave model, and a CQT-UNet model.
- the system may further comprise a frequency bandwidth extension module configured to perform frequency bandwidth extension on the voice audio signal, a declipping module configured to perform declipping on the voice audio signal, a neural vocoder module configured to convert spectral representations of the voice audio signal to audio waveforms, a source separation module configured to perform source separation on the voice audio signal.
- the inventive technique is based on a diffusion probabilistic model capable of solving various speech inverse tasks.
- diffusion models are designed to learn the implicit prior of an underlying data distribution by matching the gradient of log density. This learned prior can be useful for solving inverse problems, where the objective is to recover the input signal x from the measurements y, which are typically linked through some differentiable operator A, s.t. $y = A(x) + n$, where n is some noise.
- the method can be employed for improving the quality and intelligibility of speech recordings, by performing frequency bandwidth extension, declipping and source separation.
- the method could also be used as part of a text-to-speech system as a neural vocoder.
- a novel diffusion probabilistic model capable of solving various speech inverse tasks is introduced by the present invention. Being once trained for speech waveform generation in an unconditional manner, the diffusion probabilistic model may be adapted to different tasks including degradation inversion, neural vocoding, and source separation.
- unsupervised voice restoration means such restoration which is based upon an unsupervised machine learning algorithm which is known not to rely on labeled data, unlike supervised learning algorithms that require upfront human intervention to label the training data appropriately.
- the following three approaches to building unconditional diffusion models are considered, all of which operate in the time domain but have different preconditioning transformations: [0021] 1. Diffwave neural network operating directly in the time domain; [0022] 2. FFC-AE neural network operating on short-time Fourier transform spectrograms; [0023] 3. CQT-UNet neural network operating on Constant-Q transform spectrograms.
- the fast Fourier convolutional autoencoder (FFC-AE) or Diffwave diffusion probabilistic model is trained by denoising score matching objective for the task of unconditional speech generation.
- iterative denoising of the waveform is performed by numerically solving the reverse stochastic equation (by way of an example, see expression (2) hereinbelow) with a conditional score function (by way of an example, see expression (5) hereinbelow) which is a sum of unconditional score function and log-likelihood (by way of an example, see expression (6) hereinbelow).
- the speech waveform corresponding to a given condition is produced.
- output from the previous iteration is given to the model to estimate the score function and produce a sample with a reduced amount of noise for the next iteration.
- the model produces a speech waveform without noise.
- the trained model can be used to solve various speech inverse tasks, among which are bandwidth extension, declipping, neural vocoding and source separation tasks by modification of sampling procedure; these modifications make sampling from the model to be conditional on observations, which are waveform with reduced bandwidth in the case of bandwidth extension task, clipped waveform in the case of declipping task, mel-spectrogram in the case of neural vocoding and waveform with mixed voices in the case of source separation task.
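- purely by way of a non-limiting illustration, the iterative conditional sampling described above may be sketched in PyTorch-style pseudocode as follows; the function names (score_model, cond_score_fn), the linear schedule endpoints and the step count are assumptions of this sketch and are not details disclosed herein:

```python
import torch

def reverse_diffusion_sample(score_model, cond_score_fn, shape, n_steps=200,
                             beta_min=0.05, beta_max=20.0, device="cpu"):
    """Sketch of conditional sampling: start from random Gaussian noise and iteratively
    denoise via an Euler-Maruyama discretization of the reverse VP-SDE (expression (2)).
    The conditional score is the unconditional score estimated by the network plus the
    task-specific log-likelihood gradient supplied by cond_score_fn (expression (5))."""
    x = torch.randn(shape, device=device)                 # waveform of pure Gaussian noise
    dt = 1.0 / n_steps
    for i in reversed(range(1, n_steps + 1)):
        t = torch.full((shape[0],), i * dt, device=device)
        beta_t = (beta_min + t * (beta_max - beta_min)).view(-1, *([1] * (x.dim() - 1)))
        score = score_model(x, t) + cond_score_fn(x, t)   # unconditional score + log-likelihood
        drift = -0.5 * beta_t * x - beta_t * score        # reverse VP-SDE drift
        noise = torch.randn_like(x) if i > 1 else 0.0     # no noise injected at the last step
        x = x - drift * dt + torch.sqrt(beta_t * dt) * noise
    return x                                              # speech waveform without noise
```

- for unconditional generation, cond_score_fn would simply return zero; task-specific variants of this term are discussed below.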
- the present technique relies upon a diffusion probabilistic model (also referred to herein as UnDiff) that is specifically designed to tackle these and other speech inverse tasks for speech processing including degradation inversion, neural vocoding and source separation.
- Unconditional Diffusion Model (UnDiff)
- a diffusion probabilistic model (UnDiff), in one or more non-limiting examples having an FFC-AE architecture and being generally a class of latent variable models, is first trained for speech waveform generation in an unconditional manner.
- unconditional means not being conditional on any prior data labels (descriptions).
- various training datasets may be used for training the diffusion probabilistic model.
- training datasets suitable for training the diffusion probabilistic model (UnDiff) according to the present invention include e.g. a publicly available VCTK dataset (see [20]) which includes 44200 speech recordings belonging to 110 speakers, a Librispeech (see [20]) clean subset, which is a large-scale corpus of read English speech consisting of approximately 1000 hours of audio data including recordings from 2456 speakers reading public domain audiobooks.
- the diffusion probabilistic model employs Variance Preserving Stochastic Differential Equation (VP-SDE), which is equivalent to Denoising Diffusion Probabilistic Models (DDPM) (see e.g. [2]).
- score-based diffusion models are a class of neural generative models that can be informally described as gradually transforming analytically known and unknown (where only samples are available) data distributions pknown and pdata to each other.
- once the score function is known, it is possible to solve reverse SDE (2) numerically and thus generate samples from pdata.
- the score function could be approximated by a neural network $s_\theta(x_t, t)$ trained with denoising score matching objective, eventually leading to an L2 loss function of the form $\mathbb{E}_{t,\,x_0,\,x_t}\big\| s_\theta(x_t, t) - \nabla_{x_t}\log p_{0t}(x_t \mid x_0)\big\|_2^2$, where the perturbation kernel $p_{0t}(x_t \mid x_0)$ is an explicit function of $x_0$ and $t$.
- a scaled version of the score is optimized, and a linear schedule is used for β(t).
- the above mathematical expressions describe one possible example of the SDE function which may be used in the context of the present invention.
- the diffusion models are generally known to employ the following principle. First, they build a “generative Markov chain” which transforms a known distribution (such as pknown as mentioned above) into a “target” distribution (such as pdata as mentioned above) using a diffusion process.
- a “reverse” transformation e.g. of the distribution of training data into another distribution may be performed as mentioned above. Both transformations (e.g. data to noise and noise to data) may be performed using the same functional form.
- distributions of data from the training datasets as mentioned above are used in the above-mentioned model training process.
- the training process is performed iteratively, and may in practice include several hundreds of iterations over the training dataset (millions of gradient descent steps).
- the inventive technique notably differs from the prior art in that the machine learning algorithm is unsupervised, and it does not rely on labeled data, unlike supervised learning algorithms that require upfront human intervention at each training step to label the training data appropriately.
- the diffusion model is trained until a predetermined training completion condition is achieved.
- the diffusion model may be trained until there is no more significant increase in model quality.
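- for illustration only, a single denoising score matching training step may be sketched as follows (a PyTorch-style sketch assuming the common noise-prediction parameterization of the scaled score and a linear β(t) schedule; all names and schedule values are assumptions, not details of the claimed training procedure):

```python
import torch

def dsm_training_step(model, x0, optimizer, beta_min=0.05, beta_max=20.0):
    """One denoising score matching step (sketch). A clean waveform batch x0 is perturbed
    according to the VP-SDE marginal, and the network predicts the injected noise,
    which is a scaled version of the score."""
    t = torch.rand(x0.shape[0], device=x0.device)                   # diffusion time in [0, 1]
    int_beta = beta_min * t + 0.5 * t ** 2 * (beta_max - beta_min)  # integral of linear beta(t)
    alpha = torch.exp(-0.5 * int_beta).view(-1, *([1] * (x0.dim() - 1)))
    sigma = torch.sqrt(1.0 - alpha ** 2)
    eps = torch.randn_like(x0)
    xt = alpha * x0 + sigma * eps                                   # noised waveform
    loss = ((model(xt, t) - eps) ** 2).mean()                       # L2 denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```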
- Inverse speech processing tasks [0046]
- Prior art approaches utilize conditional diffusion models for waveform restoration and generation (see e.g. [4, 5, 6]).
- the solution of [7], like the present invention, also utilizes an unconditional diffusion model, but for a specific task of piano music restoration, and the prior art approach includes solving declipping, bandwidth extension, and inpainting problems.
- the present invention is aimed at tackling a more challenging problem of speech restoration and additionally considers neural vocoding and speech source separation problems which are referred to in the context of the present invention as inverse problems.
- the inventive technique has been experimentally proven to be effective in solving a variety of speech processing tasks such as bandwidth extension, declipping, neural vocoding, and speech source separation.
- bandwidth extension is generally known (and also known as audio super-resolution) as a task of realistic increase of audio signal sampling frequency.
- Audio declipping is a process of reconstructing an audio signal in order to reverse clipping, i.e. cutting off a signal level that rises above a certain maximum level.
- Neural vocoding is a process of converting spectral representations of an audio signal to audio waveforms.
- Speech source separation is generally understood as extracting one or more source signals of interest from an audio recording which involves several sound sources.
- inventive technique is not limited to these tasks, which are only mentioned by way of illustration.
- the inventive technique may also be used as part of a text-to-speech system as a neural vocoder.
- the inventive technique uses methods of diffusion post-training conditioning (see e.g. [8, 1, 9]) to adapt unconditional diffusion to each of the above-mentioned tasks.
- the present invention trains an unconditional diffusion model and does not constrain the linguistic content of the datasets.
- the present invention aims at generating only syntactically consistent speech recordings (i.e. speechlike sounds), which the inventors believe to be sufficient for the purpose of unsupervised voice restoration.
- Three approaches to building unconditional diffusion models are considered herein, all of which operate in the time domain but have different preconditioning transformations: [0058] 1. Diffwave neural network (for example, of a type disclosed in [10]) that operates directly in the time domain; [0059] 2. FFC-AE neural network (for example, of a type disclosed in [13]) that operates on short-time Fourier transform spectrograms; [0060] 3. CQT-UNet neural network (for example, of a type disclosed in [7]) that operates on Constant-Q transform spectrograms.
- the present inventors note that, in various non-limiting embodiments of the invention, at least one of these three approaches may be used, and the best one is selected for use in each case. In other embodiments, however, a combination of two or three approaches may be used.
- Frequency bandwidth extension [0063]
- the inventive method may include a step of frequency bandwidth extension (see e.g. [14, 15]; the step is also referred to as audio super-resolution), which can be viewed as a realistic increase in signal sampling frequency. The observation operator is a low-pass filter. Imputation guidance in this case corresponds to the substitution of the generated estimate of the low frequencies with the observed low frequencies y at each step; more formally, this corresponds to modifying the score function during sampling so that the low-frequency content of the current estimate is kept consistent with the observation y.
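- a minimal sketch of this imputation step is given below, assuming (purely for illustration) that the low-pass observation operator is realized by zeroing FFT bins above a cutoff; the function names and the FFT-based filter are assumptions of the sketch, not details disclosed herein:

```python
import torch

def lowpass(x, cutoff_bins):
    """Toy low-pass observation operator A: zero out spectral bins above the cutoff."""
    spec = torch.fft.rfft(x, dim=-1)
    spec[..., cutoff_bins:] = 0
    return torch.fft.irfft(spec, n=x.shape[-1], dim=-1)

def impute_low_frequencies(x_est, y, cutoff_bins):
    """Imputation guidance for bandwidth extension: keep the generated high frequencies
    of the current estimate and substitute its low frequencies with the observation y."""
    return (x_est - lowpass(x_est, cutoff_bins)) + lowpass(y, cutoff_bins)
```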
- Declipping [0067] Then, declipping is performed on the data from the previous method step. Similarly to what is done in respective operations of the prior art technique described in [7], clipping is considered in the context of the present invention as an inverse problem, with the observation function defined as the clipping operation itself, and a reconstruction guidance strategy is applied.
- Neural vocoding The next method step may be characterized as “neural vocoding”. In most state-of-the-art speech synthesis systems this task is decomposed into two stages. At the first stage, low-resolution intermediate representations (e.g., linguistic features, mel-spectrograms) are predicted from text data (see e.g. [16, 17]). At the second stage, these intermediate representations are transformed to raw waveform (see e.g. [18, 19]). Neural vocoders relate to techniques used in the second stage of the speech synthesis process. [0070] Neural vocoding may be formulated as the inverse problem with the observation operator defined as mel-spectrogram computation.
- Since mel-spectrogram computation is a differentiable operation, reconstruction guidance may be easily applied in this case.
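- by way of illustration, the observation operators for declipping and neural vocoding, together with the corresponding reconstruction guidance gradient, may be sketched as follows; the clipping level, the torchaudio mel-spectrogram settings and the helper x0_hat_fn (a differentiable denoised estimate obtained from the diffusion model) are assumptions of the sketch:

```python
import torch
import torchaudio

def clip_op(x, level=0.5):
    """Hard-clipping observation operator for the declipping task (illustrative level)."""
    return torch.clamp(x, min=-level, max=level)

# Mel-spectrogram observation operator for neural vocoding (settings are assumptions).
mel_op = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

def reconstruction_guidance(x_t, y, x0_hat_fn, forward_op, xi=1.0):
    """Approximate log-likelihood gradient of expression (6):
    grad log p(y | x_t) ~ -xi * grad ||y - A(x0_hat(x_t))||^2,
    usable whenever forward_op (A) and x0_hat_fn are differentiable."""
    x_t = x_t.detach().requires_grad_(True)
    err = ((y - forward_op(x0_hat_fn(x_t))) ** 2).sum()
    return -xi * torch.autograd.grad(err, x_t)[0]
```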
- the next method step may be referred to as source separation.
- the goal of single-channel speech separation is to extract individual speech signals from a mixed audio signal, in which multiple speakers are talking simultaneously.
- the potential applications of speech source separation include teleconferencing, speech recognition, and hearing aid technology.
- Let x1 and x2 be the two voice recordings.
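- assuming, purely for illustration, that the observed mixture is the sum of the two sources (the mixing operator is an assumption of this sketch), the corresponding guidance term for the two jointly sampled waveforms might look as follows:

```python
import torch

def separation_guidance(x1_t, x2_t, y_mix, x0_hat_fn, xi=1.0):
    """Guidance term for two-speaker separation (sketch): penalize the mismatch between
    the observed mixture and the sum of the two denoised source estimates."""
    x1_t = x1_t.detach().requires_grad_(True)
    x2_t = x2_t.detach().requires_grad_(True)
    err = ((y_mix - (x0_hat_fn(x1_t) + x0_hat_fn(x2_t))) ** 2).sum()
    g1, g2 = torch.autograd.grad(err, (x1_t, x2_t))
    return -xi * g1, -xi * g2
```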
- the invention is directed to a system for voice restoration in speech recordings, which is substantially intended for implementing the method as described above for the first aspect of the present invention.
- the system may be characterized as comprising at least a memory and one or more processors configured to execute instructions stored in the memory.
- In FIG. 1, a schematic diagram of the system 100 for voice restoration is shown.
- the system generally comprises a memory 110, a speech recording receiving module 120, a voice restoration processing module 130, and an input/output (I/O) interface module 140.
- the voice restoration processing module may comprise a neural network module 150, a frequency bandwidth extension module 160, a declipping module 170, and a neural vocoder module 180.
- the input/output (I/O) interface module 140 may comprise an analog-digital converter (ADC) module 190.
- the input/output interface module 140 may comprise a digital-analog converter (DAC) module 200.
- a source separation module 210 may be further provided in the system 100 as a part of the voice restoration processing module 130 or otherwise.
- the “core” part of the system 100 for voice restoration is constituted by the voice restoration processing module 130, which, depending on a particular non-limiting implementation of the claimed invention, may be implemented in the form of one or more processors, such as general purpose CPUs and/or digital signal processors (DSPs), microprocessors, and/or integrated circuits, ASICs, microchips, FPGAs etc.
- voice restoration processing module 130 performs the core part of the processing in respect of the speech recording, which is received via the speech recording receiving module 120.
- the neural network module 150 performs the processing of the data signal that corresponds to the received speech recording by applying the fast Fourier convolutional autoencoder (FFC-AE) diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation.
- Said model may be trained in advance as described above to be better suited to perform the voice restoration processing in a given non-limiting implementation of the system.
- the frequency bandwidth extension module 160 within the voice restoration processing module 130 performs the frequency bandwidth extension operation on the speech recording that has been processed by the neural network module.
- the process of frequency bandwidth extension in this case is as described above.
- the declipping module 170 operates on the digital signal that represents the speech recording to be processed, again, as described above in connection with the declipping process.
- the neural vocoder module 180 processes the speech recording to restore the speech from a mel-spectrogram in the form of a speech waveform.
- the source separation module 210 may perform source separation so as to separate a specific speech signal waveform from the one or more mixed speech sources as described above.
- input/output (I/O) interface module 140 may output the resulting speech waveform with restored voice source signal, optionally employing the digital-analog converter (DAC) module 200 and/or other output means, such as one or more speakers, network connections etc., depending on whether the restored speech waveform is to be output in the form of a sound playback via one or more speakers, or transmitted to a memory for storage, or to another networked entity for playback etc.
- the modules of the system 100 for voice restoration may be implemented in a plurality of different ways depending on a given implementation scenario of the present invention.
- these modules which are responsible for various neural network operators, elements etc. as described above, may be implemented using one or more processors, such as general purpose computer processors (CPUs), digital signal processors (DSPs), microprocessors etc. operating under control of respective software elements, integrated circuits, field programmable gate array(s) (FPGAs) or any other similar means as well known by persons skilled in the art.
- modules of the system 100 for voice restoration may be implemented in the form of software provided in one or more programming languages or in the form of executable code as is well known by persons skilled in the art.
- Such software may be embodied as computer program or programs, a computer program product, in particular one implemented on a tangible computer readable medium of any suitable kind, computer program element(s), units or modules. It may be stored locally or distributed over one or more wired or wireless networks, using one or more remote servers etc. These details do not restrict the scope of the present invention.
- such software may be stored in the memory 110.
- the latter may also store, temporarily or otherwise, at least some portion(s) or component(s) of the speech recording to be processed, optionally after being converted into a digital form by the analog-digital converter (ADC) module 190.
- the memory 110 may be embodied in various forms, such as a RAM, a ROM, a flash memory, EPROM, EEPROM etc., and/or a removable storage medium, to permanently or temporarily store the respective software instructions, as well as signal(s) and/or data involved in the voice restoration from a speech recording in accordance with the present invention. Details concerning the particular embodiment(s) of the memory 110 are specific to various embodiments of the present invention and do not restrict the scope of the present invention.
- FIG. 2 provides a flowchart illustrating the basic steps of the method for voice restoration in accordance with the first aspect of the present invention as described above.
- audio data of a speech recording containing a voice audio signal are received. These data may be received via any suitable I/O means such as e.g. a microphone or a network connection etc.
- neural network module 150 performs the processing of the data signal that corresponds to the received speech recording containing the voice audio signal by applying the fast Fourier convolutional autoencoder (FFC-AE) diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation.
- Step S2 comprises S3, iteratively sampling the waveform with a conditional score function, and S4, producing a sample with a reduced amount of noise for the next iteration, until a speech waveform without noise is obtained.
- the processed voice audio signal comprising the speech waveform without noise is output, and
- the restored voice audio signal is output via appropriate I/O means, and the method is terminated.
- the method steps as described above are not necessarily always performed in the specific order, in which they are mentioned above.
- the present invention provides a computer readable medium having stored thereon computer executable instructions which, when executed by one or more processors, cause the one or more processors to perform the method of the first aspect. It should be noted that said computer readable medium may be transitory or non- transitory depending on its specific implementation.
- the computer readable medium may be embodied in practice as one or more of a RAM, a ROM, a flash memory, EPROM, EEPROM etc., and/or a removable storage medium, to permanently or temporarily store the respective computer executable instructions which cause a computer, or its one or more processors, to perform the respective functions of the modules as recited above and/or respective method steps as described above.
- Experimental test examples [0107] The following examples confirm the possibility of implementing the intended use of the claimed invention and achieving the technical result as aforementioned.
- the invention was experimentally tested using two datasets.
- the first dataset was publicly available VCTK dataset (see e.g. [20]) which includes 44200 speech recordings by 110 speakers.
- the second dataset was the LJ-Speech dataset [21], which is standard in the speech synthesis field. LJ-Speech is a single speaker dataset that consists of 13100 audio clips with a total length of approximately 24 hours. A train-validation split from [18] was used with sizes of 12950 train clips and 150 validation clips. Audio samples had a sampling rate of 22.05 kHz.
- Diffwave architecture provides the best performance among tested time-domain architectures (UNIVERSE [4], UNet [27] were also tested).
- the capacity of the original Diffwave architecture was enhanced by increasing the size to 22 blocks with 512 channels, adding squeeze-excitation (see e.g. [28]) and weighting on skip connections, and conditioning the generative model via random Fourier features (see e.g. [29]).
- operating on short-time Fourier transform (STFT) spectrograms, FFC-AE was found to provide superior quality compared to convolutional UNet-type architectures.
- the quality of 8000 unconditionally generated samples was compared based on WV-MOS and FDSD metrics.
- the inventive diffusion probabilistic model (also referred to herein as UnDiff) is capable of solving various speech inverse tasks. Efficiency of the model in bandwidth extension, declipping, neural vocoding, and source separation tasks was demonstrated and is confirmed by experimental data provided hereinabove.
- UnDiff provides a new tool for solving complex inverse problems in speech restoration, highlighting the potential of diffusion models to be a general framework for voice restoration.
- the invention solves the challenging problem of unconditional waveform generation by comparing different neural architectures and preconditioning domains.
- the trained unconditional diffusion can be adapted to different tasks of speech processing by means of various techniques of post-training conditioning of diffusion models as shown above.
- the inventive method for voice restoration was described above. Persons skilled in the art shall understand that the invention may be implemented by various combinations of hardware and software means, and any such particular combinations do not restrict the scope of the present invention.
- modules described above which constitute the inventive device, may be implemented in the form of separate hardware means, or two or more modules may be implemented by one hardware means, or the inventive system may be implemented by one or more computer(s), processor(s) (CPUs) such as general purpose processors or specialized processors such as digital signal processors (DSPs), or by one or more ASICs, FPGAs, logic elements etc.
- one or more modules may be implemented as software means such as e.g. a program or programs, computer program element(s) or module(s) which control one or more computer(s), CPUs etc. to implement the method steps and/or operations as described in detail above.
- These software means may be embodied in one or more computer-readable media which are well known to ones skilled in the art, may be stored in one or more memories such as a ROM, a RAM, flash memory, EEPROM, etc., or provided e.g. from remote servers via one or more wired and/or wireless network connections, the Internet, Ethernet connection, LAN(s), or other local or global computer networks, if necessary.
- Industrial applicability [0137] The invention can be used in various devices transmitting, receiving, and recording speech for the improvement of user experience of listening to speech recording and also for transforming text to speech.
- the inventive method may also be used as part of a text-to-speech system as a neural vocoder.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Complex Calculations (AREA)
Abstract
The invention relates to the field of computer technologies, in particular to methods for processing and analyzing audio recordings, and may be employed for improving the quality and intelligibility of speech recordings. A method for voice restoration in speech recordings comprises the steps of receiving audio data of a speech recording containing a voice audio signal; applying a diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation, wherein the diffusion probabilistic model is applied to the audio data of the speech recording in the form of a waveform comprising random Gaussian noise, iteratively sampling the waveform with a conditional score function, which is a sum of unconditional score function estimated by the diffusion probabilistic model and log-likelihood, so as to produce a sample with a reduced amount of noise for the next iteration, until a speech waveform without noise is obtained; and outputting the processed voice audio signal comprising the speech waveform without noise. A system and computer readable medium which implement the method are also provided. The technical result consists in improving the quality and intelligibility of speech recordings.
Description
Title of Invention: UNSUPERVISED VOICE RESTORATION WITH UNCONDITIONAL DIFFUSION MODEL Technical Field [0001] The invention relates to the field of computer technologies, in particular to methods for processing and analyzing audio recordings. More particularly, the invention may be employed for improving the quality and intelligibility of speech recordings, and/or be used as a part of a text-to-speech system as a neural vocoder. Background Art [0002] As the field of artificial intelligence continues to evolve, generative models have emerged as a powerful tool for a variety of tasks including speech processing. In recent years, diffusion models (see [1, 2, 3]) have gained attention due to their ability to efficiently model complex high-dimensional distributions. Diffusion models are designed to learn the underlying data distribution’s implicit prior by matching the gradient of the log density. [0003] One of the presently evolving approaches to the task of speech processing, in particular voice restoration, speech differentiation, vocoding etc. is generally referred to as unconditional speech generation. The latter, however, is generally a challenging task due to the high diversity of possible linguistic content. [0004] Prior works on diffusion models tend to consider conditional speech generation [10, 11] or limit the scope to simple datasets with predefined phrases (e.g., spoken digits) [12, 10]. Reference [1] describes score-based diffusion models, which are a class of neural generative models that can be informally described as gradually transforming analytically known and unknown (only samples are available) data distributions to each other.
[0005] Reference [7] describes using a single diffusion model to solve several problems of audio recording restoration, namely frequency bandwidth extension and declipping. However, the method of [7] was experimentally tested on music, in particular to restore audio recordings of a piano, while the present invention is intended mainly for restoring speech recordings. Besides, the prior art solution of [7] has not been tested for such tasks as neural vocoding and source separation from mixtures of voices. The prior art solution of [7] uses the UNet architecture model, and not the FFC-AE architecture model, which is known to have advantages over UNet. Summary of Invention [0006] This section which discloses various aspects of the claimed invention is intended for providing a brief overview of the claimed subject matters and their embodiments. Detailed characteristics of technical means and methods that implement the combinations of features of the claimed inventions are provided hereinbelow. Neither this summary of invention nor the detailed description provided below together with accompanying drawings should be regarded as defining the scope of the claimed invention. The scope of legal protection of the claimed invention is only defined by the appended set of claims. [0007] Technical problem to be solved by the present invention consists in enabling the application of the inventive technique for different speech processing tasks without any additional training. [0008] Object of the present invention is to provide an improved method and system for voice restoration using an unconditional diffusion model. [0009] Technical result achieved by using the claimed invention consists in improving the quality and intelligibility of speech recordings. [0010] In the first aspect, the above-mentioned object is achieved by a method for voice restoration in speech recordings, the method comprising: receiving audio data of a speech
recording containing a voice audio signal; applying a diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation, wherein the diffusion probabilistic model is applied to the audio data of the speech recording in the form of a waveform comprising random Gaussian noise, iteratively sampling the waveform with a conditional score function, which is a sum of unconditional score function estimated by the diffusion probabilistic model and log-likelihood, so as to produce a sample with a reduced amount of noise for the next iteration, until a speech waveform without noise is obtained; and outputting the processed voice audio signal comprising the speech waveform without noise. [0011] The diffusion probabilistic model is selected from a group comprising a fast Fourier convolutional autoencoder (FFC-AE) model, a Diffwave model, and a CQT-UNet model. The method may further comprise performing frequency bandwidth extension, declipping on the voice audio signal, performing neural vocoding on the voice audio signal to convert spectral representations of the voice audio signal to audio waveforms, performing source separation on the voice audio signal. The diffusion probabilistic model may be adapted for degradation inversion. The diffusion probabilistic model may be adapted to solve bandwidth extension, declipping, neural vocoding, and/or source separation tasks by modifying a voice audio signal sampling procedure, said modification making the sampling to be conditional on observations, which are a waveform with reduced bandwidth in the case of bandwidth extension task, a clipped waveform in the case of declipping task, a mel-spectrogram in the case of neural vocoding and/or a waveform with mixed voices in the case of source separation task. [0012] In the second aspect, the above-mentioned object is achieved by a system for voice restoration in speech recordings, the system comprising: a memory; a speech recording receiving module configured to receive a speech recording comprising at least a voice
audio signal; a voice restoration processing module; wherein the voice restoration processing module comprises: a neural network module configured to apply a diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation, wherein the diffusion probabilistic model is applied to the audio data of the speech recording in the form of a waveform comprising random Gaussian noise, the neural network module being configured to iteratively sample the waveform with a conditional score function, which is a sum of unconditional score function estimated by the diffusion probabilistic model and log-likelihood, so as to produce a sample with a reduced amount of noise for the next iteration, until a speech waveform without noise is obtained; and the voice restoration processing module being configured to output the processed voice audio signal comprising the speech waveform without noise. [0013] The diffusion probabilistic model is selected from a group comprising a fast Fourier convolutional autoencoder (FFC-AE) model, a Diffwave model, and a CQT-UNet model. The system may further comprise a frequency bandwidth extension module configured to perform frequency bandwidth extension on the voice audio signal, a declipping module configured to perform declipping on the voice audio signal, a neural vocoder module configured to convert spectral representations of the voice audio signal to audio waveforms, a source separation module configured to perform source separation on the voice audio signal. [0014] The system may further comprise an input/output (I/O) interface module configured to input the speech recording comprising at least a voice audio signal and/or output the processed voice audio signal, wherein the input/output interface module may comprise an analog-digital converter (ADC) module configured to perform analog-digital conversion on the input speech recording. The input/output interface module may comprise a digital-
analog converter (DAC) module for performing digital-analog conversion on the output processed voice audio signal. [0015] In the third aspect, the above-mentioned object is achieved by a computer readable medium having stored thereon computer executable instructions which, when executed by one or more processors, cause the one or more processors to perform the method of the first aspect. Brief Description of Drawings Drawings are provided in this document to facilitate the understanding of the essence of the present invention. Drawings are schematic and not drawn to scale. The drawings are for illustrative purposes only and are not intended to define the scope of the present invention. Figure 1 illustrates a schematic diagram of the system for voice restoration in accordance with one or more embodiments of the present invention; Figure 2 illustrates a flowchart of the method for voice restoration in accordance with one or more embodiments of the present invention. Description of Embodiments [0016] Exemplary embodiments of the present disclosure are described in detail below. The exemplary embodiments are illustrated in the appended drawings, where same or similar reference numerals may designate the same or similar elements, or elements which have the same or similar functions. Exemplary embodiments described with reference to the appended drawings are illustrative and are only used for explaining the present disclosure and should not be regarded as any restrictions thereto. [0017] The inventive technique is based on a diffusion probabilistic model capable of solving various speech inverse tasks. In general, diffusion models are designed to learn the implicit prior of an underlying data distribution by matching the gradient of log density. This learned prior can be useful for solving inverse problems, where the objective is to
recover the input signal x from the measurements y, which are typically linked through some differentiable operator A, s.t. $y = A(x) + n$
, where n is some noise. [0018] The method can be employed for improving the quality and intelligibility of speech recordings, by performing frequency bandwidth extension, declipping and source separation. The method could also be used as part of a text-to-speech system as a neural vocoder. A novel diffusion probabilistic model capable of solving various speech inverse tasks is introduced by the present invention. Being once trained for speech waveform generation in an unconditional manner, the diffusion probabilistic model may be adapted to different tasks including degradation inversion, neural vocoding, and source separation. Hence, in the context of the present invention, the model employed to achieve the object of the invention as described above will be generally referred to herein as Unconditional Diffusion Model (UnDiff). [0019] Unlike the above-mentioned prior art, the present invention concentrates on training the unconditional diffusion model, while not constraining the linguistic content of training datasets. It should be noted, however, that the inventive concept only generates syntactically consistent speech recordings (such as speechlike sounds) which the present inventors believe to be enough for the purpose of unsupervised voice restoration according to the invention. Here, unsupervised voice restoration means such restoration which is based upon an unsupervised machine learning algorithm which is known not to rely on labeled data, unlike supervised learning algorithms that require upfront human intervention to label the training data appropriately. [0020] In the context of the invention, the following three approaches to building unconditional diffusion models are considered, all of which operate in the time domain but have different preconditioning transformations: [0021] 1. Diffwave neural network operating directly in the time domain;
[0022] 2. FFC-AE neural network operating on short-time Fourier transform spectrograms; [0023] 3. CQT-UNet neural network operating on Constant-Q transform spectrograms. [0024] It should be noted that these approaches may be used in the context of the present invention in any suitable combination and/or individually, and the embodiments of the present invention are generally not restricted to the use of any one of these approaches and/or combinations thereof, which may be particularly advantageous in certain cases as will be shown hereinbelow. [0025] Specifically, in one or more non-limiting embodiments, the fast Fourier convolutional autoencoder (FFC-AE) or Diffwave diffusion probabilistic model is trained by denoising score matching objective for the task of unconditional speech generation. [0026] Taken in general, the model training is performed in accordance with the following principle. Starting from random Gaussian noise as waveform, iterative denoising of the waveform is performed by numerically solving the reverse stochastic equation (by way of an example, see expression (2) hereinbelow) with a conditional score function (by way of an example, see expression (5) hereinbelow) which is a sum of unconditional score function and log-likelihood (by way of an example, see expression (6) hereinbelow). In this way, the model is conditioned on the corrupted speech signal by log-likelihood added to the unconditional score function which is estimated by the FFC-AE model (or e.g. Diffwave). [0027] During the iterative denoising (sampling), the speech waveform corresponding to a given condition (corrupted signal or mel-spectrogram) is produced. At each iteration of the sampling process, output from the previous iteration is given to the model to estimate the score function and produce a sample with a reduced amount of noise for the next iteration. Finally, the model produces a speech waveform without noise.
[0028] The trained model can be used to solve various speech inverse tasks, among which are bandwidth extension, declipping, neural vocoding and source separation tasks by modification of sampling procedure; these modifications make sampling from the model to be conditional on observations, which are waveform with reduced bandwidth in the case of bandwidth extension task, clipped waveform in the case of declipping task, mel-spectrogram in the case of neural vocoding and waveform with mixed voices in the case of source separation task. The present technique relies upon a diffusion probabilistic model (also referred to herein as UnDiff) that is specifically designed to tackle these and other speech inverse tasks for speech processing including degradation inversion, neural vocoding and source separation. [0029] The key advantage of the inventive technique consists in its ability to be trained in an unconditional manner for speech waveform generation and be then adapted for the inverse problem without any additional supervised training for specific problems. [0030] Unconditional Diffusion Model (UnDiff) [0031] According to the invention, a diffusion probabilistic model (UnDiff), in one or more non-limiting examples having an FFC-AE architecture and being generally a class of latent variable models, is first trained for speech waveform generation in an unconditional manner. In this context, unconditional means not being conditional on any prior data labels (descriptions). [0032] For training the diffusion probabilistic model, various training datasets may be used. Just a few non-limiting examples of training datasets suitable for training the diffusion probabilistic model (UnDiff) according to the present invention include e.g. a publicly available VCTK dataset (see [20]) which includes 44200 speech recordings belonging to 110 speakers, a Librispeech (see [20]) clean subset, which is a large-scale corpus of read English speech consisting of approximately 1000 hours of audio data including
recordings from 2456 speakers reading public domain audiobooks. It will be apparent for a person skilled in the art that, instead of the above-mentioned datasets, many other kinds of training datasets may be used in the context of the present invention, so the scope of protection of the invention is in no case restricted to the above-mentioned details. [0033] The diffusion probabilistic model employs Variance Preserving Stochastic Differential Equation (VP-SDE), which is equivalent to Denoising Diffusion Probabilistic Models (DDPM) (see e.g. [2]). [0034] Generally speaking, score-based diffusion models (like the one described in reference [1]) are a class of neural generative models that can be informally described as gradually transforming analytically known and unknown (where only samples are available) data distributions pknown and pdata to each other. More formally, as one example, one can consider a forward (1) and reverse (2) Ito stochastic equations (VP-SDE) for a data noising process in the following form:

$$dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw, \qquad (1)$$

$$dx = \Big[ -\tfrac{1}{2}\beta(t)\, x - \beta(t)\, \nabla_x \log p_t(x) \Big] dt + \sqrt{\beta(t)}\, d\bar{w}, \qquad (2)$$

[0037] where t ∈ [0,T] is time variable, β(t) is a noise schedule of the process, selected such that the distribution of x(T) is (approximately) the standard normal distribution.
[0038] While other forms of stochastic differential equations exist in the prior art, in one or more non-limiting embodiments the present invention specifically employs VP-SDE for the diffusion probabilistic model to be further used in the inventive method for voice restoration in speech recordings. VP-SDE is a form of a stochastic differential equation which is used in the invention to define the diffusion process.
[0039] Once the score function is known, it is possible to solve reverse SDE (2) numerically and thus generate samples from pdata. The score function could be approximated by a neural network $s_\theta(x_t, t)$ trained with denoising score matching objective eventually leading to L2 loss function:

$$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, x_t} \Big\|\, s_\theta(x_t, t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0) \,\Big\|_2^2,$$

where the perturbation kernel $p_{0t}(x_t \mid x_0)$ is an explicit function of $x_0$ and $t$. A scaled version of score is optimized, and a linear schedule is used for β(t). The above mathematical expressions describe one possible example of the SDE function which may be used in the context of the present invention. [0042] The diffusion models are generally known to employ the following principle. First, they build a “generative Markov chain” which transforms a known distribution (such as pknown as mentioned above) into a “target” distribution (such as pdata as mentioned above) using a diffusion process. Then, a “reverse” transformation e.g. of the distribution of training data into another distribution may be performed as mentioned above. Both transformations (e.g. data to noise and noise to data) may be performed using the same functional form. [0043] In practice, distributions of data from the training datasets as mentioned above are used in the above-mentioned model training process. The training process is performed iteratively, and may in practice include several hundreds of iterations over the training dataset (millions of gradient descent steps). The inventive technique notably differs from the prior art in that the machine learning algorithm is unsupervised, and it does not rely
on labeled data, unlike supervised learning algorithms that require upfront human intervention at each training step to label the training data appropriately. [0044] In the inventive technique, the diffusion model is trained until a predetermined training completion condition is achieved. By way of a non-limiting example, the diffusion model may be trained until there is no more significant increase in model quality. [0045] Inverse speech processing tasks [0046] Prior art approaches utilize conditional diffusion models for waveform restoration and generation (see e.g. [4, 5, 6]). The solution of [7], like the present invention, also utilizes an unconditional diffusion model, but for a specific task of piano music restoration, and the prior art approach includes solving declipping, bandwidth extension, and inpainting problems. Unlike the prior art technique of [7], the present invention is aimed at tackling a more challenging problem of speech restoration and additionally considers neural vocoding and speech source separation problems which are referred to in the context of the present invention as inverse problems. [0047] The inventive technique has been experimentally proven to be effective in solving a variety of speech processing tasks such as bandwidth extension, declipping, neural vocoding, and speech source separation. Among those, bandwidth extension is generally known (and also known as audio super-resolution) as a task of realistic increase of audio signal sampling frequency. Audio declipping is a process of reconstructing an audio signal in order to reverse clipping, i.e. cutting off a signal level that rises above a certain maximum level. Neural vocoding is a process of converting spectral representations of an audio signal to audio waveforms. Speech source separation is generally understood as extracting one or more source signals of interest from an audio recording which involves several sound sources. It should be noted that the inventive technique is not limited to
[0048] Inter alia, the inventive technique uses methods of diffusion post-training conditioning (see e.g. [8, 1, 9]) to adapt the unconditional diffusion to each of the above-mentioned tasks.

[0049] The inverse problems are generally known to approach the task of retrieving an object x given its partial observation y and the forward model y = A(x) relating them, where A denotes the observation operator.

[0050] To utilize the reverse SDE (2) for sampling from the conditional distribution p(x | y), one needs to find the score function of the conditional distribution, ∇_{x_t} log p_t(x_t | y).

[0051] One way to estimate said score function of the conditional distribution is to apply imputation guidance (data consistency) (see e.g. [1, 7, 9]). The idea underlying this method is to explicitly modify the score so that some parts of a denoised estimate x̂_0(x_t) are imputed with observations y. Possible ways of using imputation for different speech inverse tasks will be discussed in more detail below.

[0052] Another way to formalize the search for x is the usage of Bayes’ rule:

∇_{x_t} log p_t(x_t | y) = ∇_{x_t} log p_t(y | x_t) + ∇_{x_t} log p_t(x_t),    (5)

[0053] where the likelihood term p_t(y | x_t) is generally intractable.

[0054] However, it is known from reference [8] that one can make the approximation p_t(y | x_t) ≈ p(y | x̂_0(x_t)), where x̂_0(x_t) denotes the denoised estimate of the clean signal given x_t; the latter likelihood may be computed using a forward model.

[0055] Given observation operator A and assuming Gaussian likelihood, the final approximation becomes:

∇_{x_t} log p_t(y | x_t) ≈ -ξ(t)·∇_{x_t} || y - A(x̂_0(x_t)) ||²,    (6)

[0056] where ξ(t) is a given weighting coefficient, which is set to be inversely proportional to the gradient norm. Like [7], this method may be referred to in the context of the present invention as “reconstruction guidance”. In the context of the present invention, at least some of the above-mentioned mathematical equations may be applied to resolve the inverse problems as outlined above. Taken in general, the method of “reconstruction guidance” may be characterized as follows: as in the process of sampling with the unconditional model, the reverse stochastic equation (2) is numerically solved using a finite difference method (sampling), but, in the process of sampling, the score function is substituted with equation (5), where the first term is calculated using equation (6), and the second term is taken from the unconditional model.
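By way of illustration only, the following sketch shows how the conditional score of equation (5) may be assembled at each reverse step using the reconstruction guidance approximation of equation (6); the function names, the placeholder observation operator and the toy model are hypothetical and serve only to illustrate the computation.

```python
import torch

def denoised_estimate(x_t, eps_hat, alpha, sigma):
    # Denoised estimate x0_hat implied by the noise prediction (Tweedie-style)
    return (x_t - sigma * eps_hat) / alpha

def guided_score(x_t, t, y, A, score_model, alpha, sigma):
    """Conditional score of equation (5): unconditional score plus the approximate
    log-likelihood gradient of equation (6).

    A is a differentiable observation operator (low-pass filter, clipping,
    mel-spectrogram computation, mixing); y is the observation. All names are
    illustrative; the Gaussian-likelihood approximation is assumed.
    """
    x_t = x_t.detach().requires_grad_(True)
    eps_hat = score_model(x_t, t)
    uncond_score = -eps_hat / sigma                      # score implied by the noise estimate
    x0_hat = denoised_estimate(x_t, eps_hat, alpha, sigma)
    residual = torch.sum((y - A(x0_hat)) ** 2)           # ||y - A(x0_hat)||^2
    grad = torch.autograd.grad(residual, x_t)[0]
    xi = 1.0 / (grad.norm() + 1e-8)                      # weight inversely proportional to the gradient norm
    return uncond_score - xi * grad                      # conditional score used in the reverse SDE update

if __name__ == "__main__":
    dummy_model = lambda x, t: torch.zeros_like(x)       # stand-in for the trained unconditional network
    s = guided_score(torch.randn(1, 16000), torch.tensor(0.5), torch.randn(1, 16000),
                     lambda x: x, dummy_model, alpha=torch.tensor(0.8), sigma=torch.tensor(0.6))
    print(s.shape)
```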
[0057] Having discussed above the basics of the inventive technique in computational terms, we now turn to the detailed description of the unconditional diffusion model in the context of the present invention. In general, unconditional speech generation is a challenging task due to the high diversity of possible linguistic content. Here, prior art related to diffusion models tends to consider conditional speech generation (see e.g. [10, 11]) or restrict the scope to simple datasets with predefined phrases (e.g., spoken digits) (see e.g. [12, 10]). Conversely, the present invention trains an unconditional diffusion model and does not constrain the linguistic content of the datasets. However, the present invention aims at generating only syntactically consistent speech recordings (i.e. speech-like sounds), which the inventors believe to be sufficient for the purpose of unsupervised voice restoration. Three approaches to building unconditional diffusion models are considered herein, all of which operate in the time domain but have different preconditioning transformations:
[0058] 1. Diffwave neural network (for example, of a type disclosed in [10]) that operates directly in the time domain;

[0059] 2. FFC-AE neural network (for example, of a type disclosed in [13]) that operates on short-time Fourier transform spectrograms;

[0060] 3. CQT-UNet neural network (for example, of a type disclosed in [7]) operating on Constant-Q transform spectrograms.

[0061] The present inventors note that, in various non-limiting embodiments of the invention, at least one of these three approaches may be used, and the best one is selected for use in each case. In other embodiments, however, a combination of two or three approaches may be used.

[0062] Frequency bandwidth extension

[0063] The inventive method may include a step of frequency bandwidth extension (see e.g. [14, 15]; the step is also referred to as audio super-resolution), which can be viewed as a realistic increase in signal sampling frequency. The observation operator is a low-pass filter, A(x) = LPF(x), which retains only the observed frequency band.
[0064] Imputation guidance in this case corresponds to the substitution of the generated estimate of low frequencies with observed low frequencies y at each step. More formally, this corresponds to modifying the score function during sampling, e.g. as

s̃(x_t, t) = s_θ(LPF(y_t) + (x_t - LPF(x_t)), t),

[0065] where y_t denotes the observation y diffused to the current noise level, LPF denotes the low-pass filter, and s_θ(·, ·) denotes the unconditional score function.
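By way of illustration only, a minimal sketch of one imputation (data-consistency) step for bandwidth extension is given below; the FFT-based low-pass filter and the cutoff of 2000 bins (2 kHz for a one-second signal at 16 kHz) are assumptions made solely for the example.

```python
import torch

def lowpass(x, cutoff_bins):
    # Hypothetical low-pass filter: zero out FFT bins above the cutoff
    X = torch.fft.rfft(x)
    X[..., cutoff_bins:] = 0
    return torch.fft.irfft(X, n=x.shape[-1])

def impute_low_band(x_t, y, alpha, sigma, cutoff_bins):
    """One data-consistency (imputation) step for bandwidth extension.

    The observed narrow-band signal y is diffused to the current noise level, and its
    low band replaces the low band of the current sample x_t; the high band generated
    by the unconditional model is kept.
    """
    y_t = alpha * y + sigma * torch.randn_like(y)        # observation at the current noise level
    return lowpass(y_t, cutoff_bins) + (x_t - lowpass(x_t, cutoff_bins))

if __name__ == "__main__":
    x_t = torch.randn(1, 16000)                          # current sample during reverse diffusion
    y = torch.randn(1, 16000)                            # narrow-band observation, already at the target rate
    out = impute_low_band(x_t, y, alpha=0.8, sigma=0.6, cutoff_bins=2000)
    print(out.shape)
```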
[0066] Declipping

[0067] Then, declipping is performed on the data from the previous method step. Similarly to what is done in the respective operations of the prior art technique described in [7], clipping is considered in the context of the present invention as an inverse problem with the observation function defined as

A(x) = clip(x, c) = sign(x)·min(|x|, c),

where c denotes the clipping threshold, and a reconstruction guidance strategy is applied.
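By way of illustration only, the clipping observation operator may be expressed as follows and passed to the reconstruction guidance sketch given earlier; the threshold value c is hypothetical.

```python
import torch

def clip_operator(x, c=0.25):
    # Hypothetical observation operator for declipping: hard clipping at threshold c
    return torch.clamp(x, min=-c, max=c)

# During sampling, clip_operator can be passed as the operator `A` to the
# reconstruction guidance sketch above, with y being the clipped recording:
#   score = guided_score(x_t, t, y, lambda x: clip_operator(x, c), score_model, alpha, sigma)
```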
[0068] Neural vocoding

[0069] The next method step may be characterized as “neural vocoding”. In most state-of-the-art speech synthesis systems this task is decomposed into two stages. At the first stage, low-resolution intermediate representations (e.g., linguistic features, mel-spectrograms) are predicted from text data (see e.g. [16, 17]). At the second stage, these intermediate representations are transformed to a raw waveform (see e.g. [18, 19]). Neural vocoders relate to techniques used in the second stage of the speech synthesis process.

[0070] Neural vocoding may be formulated as the inverse problem with the observation operator defined as mel-spectrogram computation, A(x) = Mel(x). Since mel-spectrogram computation is a differentiable operation, reconstruction guidance may be easily applied in this case.
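By way of illustration only, a differentiable mel-spectrogram observation operator may be sketched with torchaudio as follows; the analysis parameters (FFT size, hop length, number of mel bins) are assumptions for the example and are not prescribed above.

```python
import torch
import torchaudio

# Hypothetical mel-spectrogram settings; the actual analysis parameters are not specified above.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024,
                                           hop_length=256, n_mels=80)

def mel_operator(x):
    # Differentiable observation operator for neural vocoding: waveform -> log-mel-spectrogram
    return torch.log(mel(x) + 1e-5)

# Since mel_operator is differentiable, it can be plugged into the reconstruction
# guidance sketch above as the operator `A`, with y being the target mel-spectrogram.
if __name__ == "__main__":
    x = torch.randn(1, 16000, requires_grad=True)
    m = mel_operator(x)
    m.sum().backward()                                   # gradients flow back to the waveform
    print(m.shape, x.grad is not None)
```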
[0071] The next method step may be referred to as source separation. In general, the goal of single-channel speech separation is to extract individual speech signals from a mixed audio signal, in which multiple speakers are talking simultaneously. The potential applications of speech source separation include teleconferencing, speech recognition, and hearing aid technology. Let x1 and x2 be the two voice recordings. Consider the observation model which mixes these two recordings, i.e. y = x1 + x2. It is noted that, since x1 and x2 are independent, an unconditional density function on their joint distribution can be factorized as p(x1, x2) = p(x1)·p(x2).

[0072] Thus, for the unconditional score function of the joint distribution:

[0073] ∇ log p_t(x1_t, x2_t) = (∇_{x1_t} log p_t(x1_t), ∇_{x2_t} log p_t(x2_t)),    (8)

i.e. the joint score is composed of the scores of the individual recordings, each of which is estimated by the unconditional diffusion model.

[0074] According to (5), to sample from a joint conditional density, one also needs to estimate the gradient of the log-likelihood ∇ log p_t(y | x1_t, x2_t).

[0075] One can apply reconstruction guidance (6); however, the present inventors found a more natural way to estimate the log-likelihood gradient in this case. Specifically, since y depends only on the sum of x1 and x2, it can be shown that the same holds for the diffused samples x1_t and x2_t, i.e. p_t(y | x1_t, x2_t) = p_t(y | x1_t + x2_t).

[0076] This likelihood can be computed analytically; indeed, since x1_t + x2_t = α_t·(x1 + x2) + σ_t·(ε1 + ε2), where ε1 and ε2 are independent Gaussian noises, the sum x1_t + x2_t is, given y, Gaussian with mean α_t·y and covariance 2·σ_t²·I.

[0077] Therefore, the gradient of the log-likelihood may be calculated analytically:

[0078] ∇_{x1_t} log p_t(y | x1_t, x2_t) = -(x1_t + x2_t - α_t·y) / (2·σ_t²),    (10)

[0079] and the same relation holds for ∇_{x2_t} log p_t(y | x1_t, x2_t).

[0080] Two generalizations of the source separation case discussed above may be considered. First, one may introduce an additional parameter a ∈ [0,1], such that y = a·x1 + (1 - a)·x2. This generalization corresponds to the different loudness of the mixed voices. The parameter a could be retrieved during the sampling process by maximizing the likelihood p_t(y | x1_t, x2_t) with respect to a.

[0082] The second possible generalization of the proposed scheme consists in consideration of more than two speakers. It is apparent that equations (8) and (10) could be generalized to the case where more than two voices are mixed, taking into consideration additional terms corresponding to a third voice.
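By way of illustration only, the analytic log-likelihood gradient of equation (10) may be sketched as follows, assuming the mixing model y = x1 + x2; all names are illustrative.

```python
import torch

def separation_guidance(x1_t, x2_t, y, alpha, sigma):
    """Analytic gradient of the log-likelihood for two-speaker separation.

    Assumes the mixing model y = x1 + x2; the sum of the diffused signals is then
    Gaussian around alpha * y with variance 2 * sigma**2, so the gradient has a
    closed form and no backpropagation through an observation operator is needed.
    """
    residual = x1_t + x2_t - alpha * y
    grad = -residual / (2.0 * sigma ** 2)
    return grad, grad        # the same term guides both x1_t and x2_t

if __name__ == "__main__":
    x1_t = torch.randn(1, 16000)
    x2_t = torch.randn(1, 16000)
    y = torch.randn(1, 16000)                            # mixture of two voices
    g1, g2 = separation_guidance(x1_t, x2_t, y, alpha=0.8, sigma=0.6)
    # The conditional score of each source is its unconditional score plus g1 / g2.
    print(g1.shape, g2.shape)
```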
[0083] System embodiments

[0084] In the second aspect, the invention is directed to a system for voice restoration in speech recordings, which is substantially intended for implementing the method as described above for the first aspect of the present invention.

[0085] Taken in general, the system may be characterized as comprising at least a memory and one or more processors configured to execute instructions stored in the memory.

[0086] Referring to Fig.1, a schematic diagram of the system 100 for voice restoration is shown. The system generally comprises a memory 110, a speech recording receiving module 120, a voice restoration processing module 130, and an input/output (I/O) interface module 140. The voice restoration processing module may comprise a neural
network module 150, a frequency bandwidth extension module 160, a declipping module 170, and a neural vocoder module 180. The input/output (I/O) interface module 140 may comprise an analog-digital converter (ADC) module 190. The input/output interface module 140 may comprise a digital-analog converter (DAC) module 200. In one or more embodiments, a source separation module 210 may be further provided in the system 100 as a part of the voice restoration processing module 130 or otherwise.

[0087] The “core” part of the system 100 for voice restoration is constituted by the voice restoration processing module 130, which, depending on a particular non-limiting implementation of the claimed invention, may be implemented in the form of one or more processors, such as general purpose CPUs and/or digital signal processors (DSPs), microprocessors, and/or integrated circuits, ASICs, microchips, FPGAs etc. Taken in general, the voice restoration processing module 130 performs the core part of the processing in respect of the speech recording, which is received via the speech recording receiving module 120, e.g. via one or more input means such as microphone(s) or wireless/wired network connection(s), or memory, such as memory 110 of the system 100 or any external memory source, and, if necessary, undergoes analog-digital conversion (such as in case it is received in the form of an analog signal e.g. from microphone(s)) by means of the ADC module 190.

[0088] Then, the neural network module 150 performs the processing of the data signal that corresponds to the received speech recording by applying the fast Fourier convolutional autoencoder (FFC-AE) diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation. Said model may be trained in advance as described above to be better suited to the voice restoration processing in a given non-limiting implementation of the system.
[0089] Further, the frequency bandwidth extension module 160 within the voice restoration processing module 130 performs the frequency bandwidth extension operation on the speech recording that has been processed by the neural network module. The process of frequency bandwidth extension in this case is as described above. [0090] Then, the declipping module 170 operates on the digital signal that represents the speech recording to be processed, again, as described above in connection with the declipping process. [0091] The neural vocoder module 180 processes the speech recording to restore the speech from a mel-spectrogram in the form of a speech waveform. [0092] The source separation module 210 may perform source separation so as to separate a specific speech signal waveform from the one or more mixed speech sources as described above. [0093] After the processing by means of the above-mentioned components of the voice restoration processing module 130, input/output (I/O) interface module 140 may output the resulting speech waveform with restored voice source signal, optionally employing the digital-analog converter (DAC) module 200 and/or other output means, such as one or more speakers, network connections etc., depending on whether the restored speech waveform is to be output in the form of a sound playback via one or more speakers, or transmitted to a memory for storage, or to another networked entity for playback etc. [0094] It should be noted that the modules of the system 100 for voice restoration may be implemented in a plurality of different ways depending on a given implementation scenario of the present invention. In particular, these modules, which are responsible for various neural network operators, elements etc. as described above, may be implemented using one or more processors, such as general purpose computer processors (CPUs), digital signal processors (DSPs), microprocessors etc. operating under control of
respective software elements, integrated circuits, field programmable gate array(s) (FPGAs) or any other similar means as is well known by persons skilled in the art.

[0095] It should also be noted that the above-mentioned inverse tasks such as frequency bandwidth extension, declipping, neural vocoding and/or source separation may be performed by the above-mentioned system modules in an order which is different from the one that is specifically described above. Some of these operations may be performed simultaneously, or in any feasible order, or may be omitted, duplicated, repeated one or more times, if required by a given implementation of the invention. The scope of the invention is in no case restricted by any specific order of performance of the above-mentioned operations by respective modules of the inventive system, or by any specific manner of implementation of said modules.

[0096] One should understand that the invention is not limited to any details concerning the presence and/or specific nature of input/output means or specific hardware processing means used to implement the modules of the system 100 for voice restoration as aforementioned.

[0097] It should also be clearly understood that at least some of the modules of the system 100 for voice restoration may be implemented in the form of software provided in one or more programming languages or in the form of executable code as is well known by persons skilled in the art. Such software may be embodied as a computer program or programs, a computer program product, in particular one implemented on a tangible computer readable medium of any suitable kind, computer program element(s), units or modules. It may be stored locally or distributed over one or more wired or wireless networks, using one or more remote servers etc. These details do not restrict the scope of the present invention. In one or more non-limiting embodiments, such software may be stored in the memory 110. The latter may also store, temporarily or otherwise, at least
some portion(s) or component(s) of the speech recording to be processed, optionally after being converted into a digital form by the analog-digital converter (ADC) module 190. In practice, the memory 110 may be embodied in various forms, such as a RAM, a ROM, a flash memory, EPROM, EEPROM etc., and/or a removable storage medium, to permanently or temporarily store the respective software instructions, as well as signal(s) and/or data involved in the voice restoration from a speech recording in accordance with the present invention. Details concerning the particular embodiment(s) of the memory 110 are specific to various embodiments of the present invention and do not restrict the scope of the present invention.

[0098] Figure 2 provides a flowchart illustrating the basic steps of the method for voice restoration in accordance with the first aspect of the present invention as described above.

[0099] At S1, audio data of a speech recording containing a voice audio signal are received. These data may be received via any suitable I/O means such as e.g. a microphone or a network connection etc.

[0100] At S2, the neural network module 150 performs the processing of the data signal that corresponds to the received speech recording containing the voice audio signal by applying the fast Fourier convolutional autoencoder (FFC-AE) diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation.

[0101] Step S2 comprises S3, iteratively sampling the waveform with a conditional score function, and S4, producing a sample with a reduced amount of noise for the next iteration, until a speech waveform without noise is obtained.

[0102] At S5, the processed voice audio signal comprising the speech waveform without noise is output, and

[0103] At S6, the restored voice audio signal is output via appropriate I/O means, and the method is terminated.
[0104] It should be noted that the method steps as described above are not necessarily always performed in the specific order in which they are mentioned above. In particular, one or more of the method steps may be performed simultaneously/in parallel, in a different order, and/or omitted, duplicated etc., unless otherwise clearly mentioned herein and/or required by the essence of the inventive technique.

[0105] In yet another aspect, the present invention provides a computer readable medium having stored thereon computer executable instructions which, when executed by one or more processors, cause the one or more processors to perform the method of the first aspect. It should be noted that said computer readable medium may be transitory or non-transitory depending on its specific implementation. The computer readable medium may be embodied in practice as one or more of a RAM, a ROM, a flash memory, EPROM, EEPROM etc., and/or a removable storage medium, to permanently or temporarily store the respective computer executable instructions which cause a computer, or its one or more processors, to perform the respective functions of the modules as recited above and/or respective method steps as described above.

[0106] Experimental test examples

[0107] The following examples confirm the possibility of implementing the intended use of the claimed invention and achieving the technical result as aforementioned.

[0108] The invention was experimentally tested using two datasets. The first dataset was the publicly available VCTK dataset (see e.g. [20]) which includes 44200 speech recordings by 110 speakers. 6 speakers and 8 recordings from the utterances corresponding to each speaker were excluded from the training dataset to avoid text-level and speaker-level data leakage to the training dataset. For evaluation, 48 utterances corresponding to 6 speakers excluded from the training data were used. Importantly, the
text corresponding to evaluation utterances was not read in any recordings constituting training data.

[0109] The second dataset was the LJ-Speech dataset [21], which is standard in the speech synthesis field. LJ-Speech is a single-speaker dataset that consists of 13100 audio clips with a total length of approximately 24 hours. A train-validation split from [18] was used, with sizes of 12950 train clips and 150 validation clips. Audio samples had a sampling rate of 22.05 kHz.

[0110] For evaluation of samples generated by the unconditional model, an absolute objective speech quality measure based on direct MOS score prediction by a fine-tuned wav2vec2.0 (see e.g. [22]) model (WV-MOS, see e.g. [23]) and unconditional Frechet DeepSpeech Distance (FDSD) introduced in [24] were used. WV-MOS is known to measure the quality of each generated sample individually, while FDSD measures the distance between distributions of generated and real samples.

[0111] For quality evaluation in speech inverse tasks, the conventional metrics extended STOI (see e.g. [25]), scale-invariant signal-to-noise ratio (SI-SNR) (see e.g. [26]), log-spectral distance (LSD) and WV-MOS were used. 5-scale MOS tests for subjective quality evaluation following the procedure described in [15] were also used.

[0112] All models were trained for 230 epochs, with batch size 8, on audio segments of 2 seconds at the sampling rate of 16 kHz, with exponential averaging of model weights at a rate of 0.9999. The Adam optimizer with a learning rate of 0.0002 and betas 0.9 and 0.999 was used. For diffusion, denoising was performed over 200 steps during training, with the conditioning input (the diffusion step or noise level) depending on the model.

[0113] In this context, denoising is a process of sampling by the diffusion model, i.e. the process of numerically solving the reverse stochastic equation, and, in its capacity as such, it is a standard process for diffusion models (see e.g. Algorithm 2 in [2]).
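By way of illustration only, the weight averaging and optimizer settings mentioned above may be sketched as follows; the stand-in network is hypothetical and serves only to show the mechanics.

```python
import copy
import torch

# Hypothetical network stand-in; in the experiments above this would be Diffwave or FFC-AE.
model = torch.nn.Linear(16000, 16000)
ema_model = copy.deepcopy(model)                         # exponentially averaged copy of the weights
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))

def update_ema(model, ema_model, decay=0.9999):
    # Exponential moving average of model weights, as used in the experiments above
    with torch.no_grad():
        for p, p_ema in zip(model.parameters(), ema_model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# After every optimizer.step() during training:
#   update_ema(model, ema_model)
# and the ema_model weights are then used for sampling and evaluation.
```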
[0114] 3 approaches to unconditional diffusion-based speech generation and 3 additional baseline cases were compared. All considered approaches operated in the time domain but used different invertible preprocessing transformations and their corresponding inverse post-processing transformations for preconditioning of neural networks. Note that hyperparameters of all models were adjusted such that they had equal capacities as measured by GPU memory allocated for training of each model with equal batch size.

[0115] The first approach consisted in training of the neural architecture directly in the time domain, i.e. without any preconditioning transformation. The present inventors found that the Diffwave architecture provides the best performance among tested time-domain architectures (UNIVERSE [4] and UNet [27] were also tested). The capacity of the original Diffwave architecture was enhanced by increasing the size to 22 blocks with 512 channels. Additionally, squeeze-excitation weighting (see e.g. [28]) on skip connections and conditioning of the generative model
via random Fourier features (see e.g. [29]) were introduced. [0116] Another approach was based on time-frequency domain architecture FFC-AE (see e.g. [13]) which uses short-time Fourier transform (STFT) as preconditioning. This architecture is based on a fast Fourier convolution neural operator and operates on complex-valued STFT spectrograms. [0117] FFC-AE was found to provide superior quality compared to convolutional UNet-type architectures. [0118] Finally, the approach to unconditional audio generation as proposed e.g. in [7] was tested. In this approach, Constant-Q Transform (CQT) was used as a preconditioning transformation and convolutional UNet neural architecture with dilated residual blocks as a neural architecture as recommended by [7]. UNet of depth 5 with the following
channels = [64, 64, 128, 128, 256], with downsampling by a factor of 2 at each block, was used.

[0119] The quality of 8000 unconditionally generated samples was compared based on WV-MOS and FDSD metrics. Metrics for 4 baseline cases: ground-truth speech, Gaussian noise, samples from unconditional Diffwave with the original architecture [10], and the text-to-audio AudioLDM [11] model generated with the prompt “A person speaking English”, were also provided. The experimental data resulting from unconditional speech generation tests are presented in Table 1.

[0120] Table 1. Results of unconditional speech generation

| Model | WV-MOS (↑) | FDSD (↓) | # Params (M) |
|---|---|---|---|
| Ground Truth | 4.57 | 0.9 | - |
| FFC-AE | 4.06 | 15.3 | 55.3 |
| Diffwave (invention) | 3.84 | 7.0 | 32.3 |
| CQT-UNet | 2.29 | 12.37 | 27.8 |
| AudioLDM | 1.81 | 22.5 | 185.0 |
| Diffwave (original) | 3.12 | 7.0 | 24.0 |
| Gaussian noise | 1.27 | 153.5 | - |

[0121] Overall, all tested models demonstrated an ability to generate speech-like sounds; however, they did not produce any semantically consistent speech. This behavior was rather expected since the linguistic content of the training dataset was not constrained, and no language understanding guidance was provided (unlike, e.g., AudioLM [30]).

[0122] It should be noted that the present inventors believe that language understanding is not necessary for speech restoration in the context of the present invention since voice could be potentially retrieved based on acoustic (syntactic) information.
[0123] From experiments, the present inventors observed that the FFC-AE model used in the present invention provides better WV-MOS quality, while Diffwave delivers the lowest FDSD score. For a more comprehensive estimation of the results of implementation of the inventive method, subsequent experimental tests were performed with both Diffwave and FFC-AE models.

[0124] Experimental results for bandwidth extension, declipping, neural vocoding and source separation are provided below in Tables 2, 3, 4, 5, where best results are highlighted in bold type. All metrics were computed on randomly cropped 1-second segments.

[0125] In the bandwidth extension experimental tests, recordings with a sampling rate of 16 kHz were used as targets, and two frequency bandwidths were considered for input data: 2 kHz and 4 kHz. The original signal was artificially degraded to the desired frequency bandwidth (2 kHz or 4 kHz) using polyphase filtering. The results and comparison with other techniques are outlined in Table 2. Here, “UnDiff” stands for the models used in the present invention (both Diffwave and FFC-AE variants), while all other models represent the prior art (respective prior art references describing them are indicated in brackets).

[0126] Table 2. Results of bandwidth extension (BWE) on VCTK

| Model | Supervised | WV-MOS | LSD | MOS |
|---|---|---|---|---|
| Ground Truth | - | 4.17 | 0 | 4.09 ± 0.09 |
| BWE 2kHz → 8kHz | | | | |
| HiFi++ [15] | ✓ | 4.05 | 1.09 | 3.93 ± 0.10 |
| Voicefixer [31] | ✓ | 3.67 | 1.08 | 3.64 ± 0.10 |
| TFiLM [32] | ✓ | 2.83 | 1.01 | 2.71 ± 0.10 |
| UnDiff (Diffwave) | × | 3.48 | 0.96 | 3.59 ± 0.11 |
| UnDiff (FFC-AE) | × | 3.59 | 1.13 | 3.50 ± 0.11 |
| Input | - | 2.52 | 1.06 | 2.42 ± 0.09 |
| BWE 4kHz → 8kHz | | | | |
| HiFi++ [15] | ✓ | 4.22 | 1.07 | 4.04 ± 0.10 |
| Voicefixer [31] | ✓ | 3.95 | 0.98 | 3.92 ± 0.10 |
| TFiLM [32] | ✓ | 3.46 | 0.83 | 3.43 ± 0.10 |
| UnDiff (Diffwave) | × | 4.00 | 0.76 | 3.74 ± 0.11 |
| UnDiff (FFC-AE) | × | 3.88 | 0.96 | 3.72 ± 0.10 |
| Input | - | 3.34 | 0.85 | 3.39 ± 0.10 |

[0127] Next, the inventive FFC-AE and Diffwave models were experimentally tested for the declipping task against popular audio declipping methods known as analysis sparse audio declipper (A-SPADE, see [33]) and synthesis sparse audio declipper (S-SPADE, see [34]), as well as the general speech restoration framework Voicefixer [31], on clipped audio recordings with input SDR being equal to 3 dB (see Table 3).

[0128] Table 3. Results of declipping (input SNR = 3 dB) on VCTK

| Model | Supervised | WV-MOS | SI-SNR | MOS |
|---|---|---|---|---|
| Ground Truth | - | 3.91 | - | 3.84 ± 0.11 |
| A-SPADE [33] | × | 2.63 | 8.48 | 2.67 ± 0.11 |
| S-SPADE [34] | × | 2.69 | 8.50 | 2.55 ± 0.11 |
| Voicefixer [31] | ✓ | 2.79 | -22.58 | 2.98 ± 0.12 |
| Undiff (Diffwave) | × | 3.62 | 10.57 | 3.59 ± 0.12 |
| Undiff (FFC-AE) | × | 3.01 | 7.35 | 3.06 ± 0.12 |
| Input | - | 2.30 | 3.82 | 2.19 ± 0.09 |

[0129] To demonstrate the effectiveness of the inventive Undiff model on the task of neural vocoding, FFC-AE and Diffwave models were trained on the unconditional generation of
the LJ speech dataset. The inventive model was compared to 2 supervised baselines from the prior art and to the unsupervised Griffin-Lim vocoder. Experimental testing results are provided in Table 4.

[0130] Table 4. Results of neural vocoding (LJ speech dataset)

| Model | Supervised | WV-MOS | MOS |
|---|---|---|---|
| Ground Truth | - | 4.32 | 4.26 ± 0.07 |
| HiFi-GAN (V1) [18] | ✓ | 4.36 | 4.23 ± 0.07 |
| Diffwave [10] | ✓ | 4.19 | 4.15 ± 0.07 |
| Griffin-Lim [35] | × | 3.30 | 3.46 ± 0.08 |
| Undiff (Diffwave) | × | 3.99 | 3.79 ± 0.08 |
| Undiff (FFC-AE) | × | 4.08 | 4.12 ± 0.07 |

[0131] Finally, to assess the inventive Undiff’s performance on the source separation task, recordings belonging to different speakers from VCTK validation data were randomly mixed. The recordings were normalized and mixed without a weighting coefficient.

[0132] Table 5. Results of source separation (VCTK dataset)

| Model | Supervised | SI-SNR | STOI |
|---|---|---|---|
| Mixture (input) | - | -0.04 | 0.67 |
| Undiff (Diffwave) | × | 5.73 | 0.75 |
| Undiff (FFC-AE) | × | 3.39 | 0.76 |
| Conv-TasNet [36] | ✓ | 15.94 | 0.95 |

[0133] The results showed that, although the inventive Undiff models were never explicitly trained to solve any of the considered tasks, they perform at least comparably to supervised baselines for bandwidth extension, declipping and vocoding. The potential to solve the source separation task was also demonstrated by the inventive models. In one or more embodiments of the invention, using different mixing weights and enabling the
inventive models to produce globally coherent voices during source separation may be further considered. Overall, the experimental test results highlighted the capability of the unconditional diffusion models (UnDiff) according to the present invention to resolve voice (speech) restoration tasks by means of the method and device embodiments as described above.

[0134] The present invention provides a diffusion probabilistic model capable of solving various speech inverse tasks. Efficiency of the model in bandwidth extension, declipping, neural vocoding, and source separation tasks was demonstrated and is confirmed by experimental data provided hereinabove. The inventive diffusion probabilistic model (also referred to herein as UnDiff) provides a new tool for solving complex inverse problems in speech restoration, highlighting the potential of diffusion models to be a general framework for voice restoration. Once trained for speech waveform generation in an unconditional manner, it can be adapted to different tasks including degradation inversion, neural vocoding, and source separation as mentioned above. The invention solves the challenging problem of unconditional waveform generation by comparing different neural architectures and preconditioning domains. The trained unconditional diffusion can be adapted to different tasks of speech processing by means of various techniques of post-training conditioning of diffusion models as shown above.

[0135] The inventive method for voice restoration was described above. Persons skilled in the art shall understand that the invention may be implemented by various combinations of hardware and software means, and any such particular combinations do not restrict the scope of the present invention. The modules described above, which constitute the inventive device, may be implemented in the form of separate hardware means, or two or more modules may be implemented by one hardware means, or the inventive system may be implemented by one or more computer(s), processor(s) (CPUs) such as general
purpose processors or specialized processors such as digital signal processors (DSPs), or by one or more ASICs, FPGAs, logic elements etc. Alternatively, one or more modules may be implemented as software means such as e.g. a program or programs, computer program element(s) or module(s) which control one or more computer(s), CPUs etc. to implement the method steps and/or operations as described in detail above. These software means may be embodied in one or more computer-readable media which are well known to those skilled in the art, may be stored in one or more memories such as a ROM, a RAM, flash memory, EEPROM, etc., or provided e.g. from remote servers via one or more wired and/or wireless network connections, the Internet, Ethernet connection, LAN(s), or other local or global computer networks, if necessary.

[0136] Industrial applicability

[0137] The invention can be used in various devices transmitting, receiving, and recording speech for the improvement of user experience of listening to speech recordings and also for transforming text to speech. The inventive method may also be used as part of a text-to-speech system as a neural vocoder.

[0138] Persons skilled in the art shall understand that only some of the possible examples of techniques and material and technical means by which embodiments of the present invention may be implemented are described above and shown in the figures. Detailed description of embodiments of the invention as provided above is not intended for limiting or defining the scope of legal protection of the present invention.

[0139] Other embodiments which may be encompassed by the scope of the present invention may be conceived by persons skilled in the art after careful reading of the above specification with reference to the accompanying drawings, and all such apparent modifications, changes and/or equivalent substitutions are considered to be included in
the scope of the present invention. All prior art references cited and discussed herein are hereby incorporated in this disclosure by reference where applicable.

[0140] While the present invention has been described and illustrated with reference to its different embodiments, persons skilled in the art shall understand that various modifications in its form and specific details may be made without departing from the scope of the present invention which is only defined by the claims provided hereinbelow and their equivalents.

[0141] Citation list

[0142] [1] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations”, in International Conference on Learning Representations.

[0143] [2] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models”, Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.

[0144] [3] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models”, in Advances in Neural Information Processing Systems.

[0145] [4] J. Serrà, S. Pascual, J. Pons, R. O. Araz, and D. Scaini, “Universal speech enhancement with score-based diffusion”, arXiv preprint arXiv:2206.03065, 2022.

[0146] [5] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models”, arXiv preprint arXiv:2208.05830, 2022.

[0147] [6] R. Scheibler, Y. Ji, S.-W. Chung, J. Byun, S. Choe, and M.-S. Choi, “Diffusion-based generative speech source separation”, arXiv preprint arXiv:2210.17327, 2022.

[0148] [7] E. Moliner, J. Lehtinen, and V. Välimäki, “Solving audio inverse problems with a diffusion model”, arXiv preprint arXiv:2210.15228, 2022.
[0149] [8] H. Chung, J. Kim, M. T. McCann, M. L. Klasky, and J. C. Ye, “Diffusion posterior sampling for general noisy inverse problems”, arXiv preprint arXiv:2209.14687, 2022.

[0150] [9] J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon, “ILVR: Conditioning method for denoising diffusion probabilistic models”, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14367–14376.

[0151] [10] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis”, in International Conference on Learning Representations.

[0152] [11] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models”, arXiv preprint arXiv:2301.12503, 2023.

[0153] [12] K. Goel, A. Gu, C. Donahue, and C. Ré, “It’s raw! Audio generation with state-space models”, in International Conference on Machine Learning. PMLR, 2022, pp. 7616–7633.

[0154] [13] I. Shchekotov, P. K. Andreev, O. Ivanov, A. Alanov, and D. Vetrov, “FFC-SE: Fast Fourier Convolution for Speech Enhancement”, in Proc. Interspeech 2022, 2022, pp. 1188–1192.

[0155] [14] V. Kuleshov, S. Z. Enam, and S. Ermon, “Audio super resolution using neural networks”, arXiv preprint arXiv:1708.00853, 2017.

[0156] [15] P. Andreev, A. Alanov, O. Ivanov, and D. Vetrov, “HiFi++: a unified framework for bandwidth extension and speech enhancement”, arXiv preprint arXiv:2203.13086, 2022.

[0157] [16] N. Li, Y. Liu, Y. Wu, S. Liu, S. Zhao, and M. Liu, “Robutrans: A robust transformer-based text-to-speech model”, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8228–8235.
[0158] [17] J. Shen, Y. Jia, M. Chrzanowski, Y. Zhang, I. Elias, H. Zen, and Y. Wu, “Non-attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling”, arXiv preprint arXiv:2010.04301, 2020.

[0159] [18] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis”, Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.

[0160] [19] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis”, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.

[0161] [20] J. Yamagishi, C. Veaux, K. MacDonald et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)”, 2019.

[0162] [21] K. Ito and L. Johnson, “The LJ speech dataset”, https://keithito.com/LJ-Speech-Dataset/, 2017.

[0163] [22] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations”, in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 12449–12460.

[0164] [23] “WV-MOS: MOS score prediction by fine-tuned wav2vec2.0 model”, https://github.com/AndreevP/wvmos, accessed: 2022-01-20.

[0165] [24] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan, “High fidelity speech synthesis with adversarial networks”, in International Conference on Learning Representations.

[0166] [25] J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
[0167] [26] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?”, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.

[0168] [27] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models”, in International Conference on Machine Learning. PMLR, 2021, pp. 8162–8171.

[0169] [28] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks”, 2018.

[0170] [29] S. Rouard and G. Hadjeres, “CRASH: Raw audio score-based generative modeling for controllable high-resolution drum sound synthesis”, in Music Information Retrieval Conf. (ISMIR), 2021, pp. 579–585.

[0171] [30] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour, “AudioLM: a language modeling approach to audio generation”, arXiv preprint arXiv:2209.03143, 2022.

[0172] [31] H. Liu, Q. Kong, Q. Tian, Y. Zhao, D. Wang, C. Huang, and Y. Wang, “Voicefixer: Toward general speech restoration with neural vocoder”, arXiv preprint arXiv:2109.13731, 2021.

[0174] [32] S. Birnbaum, V. Kuleshov, Z. Enam, P. W. W. Koh, and S. Ermon, “Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations”, Advances in Neural Information Processing Systems, vol. 32, 2019.

[0175] [33] P. Zaviska and P. Rajmic, “Analysis social sparsity audio declipper”, arXiv preprint arXiv:2205.10215, 2022.

[0176] [34] P. Zaviska, P. Rajmic, O. Mokry, and Z. Průša, “A proper version of synthesis-based sparse audio declipper”, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 591–595.
[0177] [35] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform”, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.

[0178] [36] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
Claims
1. A method for voice restoration in speech recordings, the method comprising:
receiving audio data of a speech recording containing a voice audio signal;
applying a diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation, wherein the diffusion probabilistic model is applied to the audio data of the speech recording in the form of a waveform comprising random Gaussian noise,
iteratively sampling the waveform with a conditional score function, which is a sum of unconditional score function estimated by the diffusion probabilistic model and log-likelihood, so as to produce a sample with a reduced amount of noise for the next iteration, until a speech waveform without noise is obtained; and
outputting the processed voice audio signal comprising the speech waveform without noise.
2. The method of claim 1, wherein the diffusion probabilistic model is selected from a group comprising a fast Fourier convolutional autoencoder (FFC-AE) model, a Diffwave model, and a CQT-UNet model.
3. The method of claim 1, further comprising performing frequency bandwidth extension on the voice audio signal.
4. The method of claim 1, further comprising performing declipping on the voice audio signal.
5. The method of claim 1, further comprising performing neural vocoding on the voice audio signal to convert spectral representations of the voice audio signal to audio waveforms.
6. The method of claim 1, further comprising performing source separation on the voice audio signal.
7. The method of claim 1, wherein the diffusion probabilistic model is adapted for degradation inversion.
8. The method of any one of claims 3 to 6, wherein the diffusion probabilistic model is adapted to solve bandwidth extension, declipping, neural vocoding, and/or source separation tasks by modifying a voice audio signal sampling procedure, said modification making the sampling to be conditional on observations, which are a waveform with reduced bandwidth in the case of bandwidth extension task, a clipped waveform in the case of declipping task, a mel-spectrogram in the case of neural vocoding and/or a waveform with mixed voices in the case of source separation task.
9. A system for voice restoration in speech recordings, the system comprising:
a memory;
a speech recording receiving module configured to receive a speech recording comprising at least a voice audio signal;
a voice restoration processing module;
wherein the voice restoration processing module comprises:
a neural network module configured to apply a diffusion probabilistic model trained by denoising score matching objective for unconditional speech generation,
wherein the diffusion probabilistic model is applied to the audio data of the speech recording in the form of a waveform comprising random Gaussian noise,
the neural network module being configured to iteratively sample the waveform with a conditional score function, which is a sum of unconditional score function estimated by the diffusion probabilistic model and log-likelihood, so as to produce a sample with a reduced amount of noise for the next iteration, until a speech waveform without noise is obtained; and
the voice restoration processing module being configured to output the processed voice audio signal comprising the speech waveform without noise.
10. The system of claim 9, wherein the diffusion probabilistic model is selected from a group comprising a fast Fourier convolutional autoencoder (FFC-AE) model, a Diffwave model, and a UNet model.
11. The system of claim 9, further comprising a frequency bandwidth extension module configured to perform frequency bandwidth extension on the voice audio signal.
12. The system of claim 9, further comprising a declipping module configured to perform declipping on the voice audio signal.
13. The system of claim 9, further comprising a neural vocoder module configured to convert spectral representations of the voice audio signal to audio waveforms.
14. The system of claim 9, further comprising a source separation module configured to perform source separation on the voice audio signal.
15. The system of claim 9, further comprising an input/output (I/O) interface module configured to input the speech recording comprising at least a voice audio signal and/or output the processed voice audio signal.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| RU2023104979 | 2023-03-03 | ||
| RU2023104979 | 2023-03-03 | ||
| RU2023117574 | 2023-07-04 | ||
| RU2023117574A RU2823017C1 (en) | 2023-07-04 | | Uncontrolled voice restoration using unconditioned diffusion model without teacher |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024184745A1 true WO2024184745A1 (en) | 2024-09-12 |
Family
ID=92674202
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2024/051943 Pending WO2024184745A1 (en) | 2023-03-03 | 2024-02-29 | Unsupervised voice restoration with unconditional diffusion model |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024184745A1 (en) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119380734A (en) * | 2024-11-14 | 2025-01-28 | 腾讯音乐娱乐科技(深圳)有限公司 | Singing voice conversion method, device, electronic device, program product and storage medium |
| US20250054473A1 (en) * | 2023-08-09 | 2025-02-13 | Futureverse Ip Limited | Artificial intelligence music generation model and method for configuring the same |
| CN119864006A (en) * | 2024-12-23 | 2025-04-22 | 科大讯飞股份有限公司 | Speech synthesis generation method, electronic device, and storage medium |
| CN120340507A (en) * | 2025-06-19 | 2025-07-18 | 北京生数科技有限公司 | Audio generation method, device, storage medium, electronic device and program product |
| CN120496565A (en) * | 2025-06-23 | 2025-08-15 | 成都埃文数智信息技术有限公司 | Voice enhancement method based on distribution enhancement diffusion model |
| US12456250B1 (en) | 2024-11-14 | 2025-10-28 | Futureverse Ip Limited | System and method for reconstructing 3D scene data from 2D image data |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220392471A1 (en) * | 2021-06-02 | 2022-12-08 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems, methods, and apparatuses for restoring degraded speech via a modified diffusion model |
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220392471A1 (en) * | 2021-06-02 | 2022-12-08 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems, methods, and apparatuses for restoring degraded speech via a modified diffusion model |
Non-Patent Citations (4)
| Title |
|---|
| CHEN NANXIN, ZHANG YU, ZEN HEIGA, WEISS RON J, NOROUZI MOHAMMAD, CHAN WILLIAM: "WAVEGRAD: ESTIMATING GRADIENTS FOR WAVEFORM GENERATION", ARXIV:2009.00713V2, 1 October 2020 (2020-10-01), XP093208289, Retrieved from the Internet <URL:https://arxiv.org/pdf/2009.00713v2> * |
| KIM HEESEUNG, KIM SUNGWON, YOON SUNGROH: "Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance", ARXIV:2111.11755V4, 1 June 2022 (2022-06-01), XP093208297 * |
| KONG ZHIFENG, PING WEI, HUANG JIAJI, ZHAO KEXIN, CATANZARO BRYAN, , : "DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS", ARXIV:2009.09761V3, 30 March 2021 (2021-03-30), XP093208299, Retrieved from the Internet <URL:https://arxiv.org/pdf/2009.09761> * |
| SERRÀ JOAN, PASCUAL SANTIAGO, PONS JORDI, ARAZ R. OGUZ, SCAINI DAVIDE: "Universal Speech Enhancement with Score-based Diffusion", 16 September 2022 (2022-09-16), XP093008353, Retrieved from the Internet <URL:https://arxiv.org/pdf/2206.03065.pdf> [retrieved on 20221214], DOI: 10.48550/arxiv.2206.03065 * |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20250054473A1 (en) * | 2023-08-09 | 2025-02-13 | Futureverse Ip Limited | Artificial intelligence music generation model and method for configuring the same |
| US12354576B2 (en) * | 2023-08-09 | 2025-07-08 | Futureverse Ip Limited | Artificial intelligence music generation model and method for configuring the same |
| CN119380734A (en) * | 2024-11-14 | 2025-01-28 | 腾讯音乐娱乐科技(深圳)有限公司 | Singing voice conversion method, device, electronic device, program product and storage medium |
| US12456250B1 (en) | 2024-11-14 | 2025-10-28 | Futureverse Ip Limited | System and method for reconstructing 3D scene data from 2D image data |
| CN119864006A (en) * | 2024-12-23 | 2025-04-22 | 科大讯飞股份有限公司 | Speech synthesis generation method, electronic device, and storage medium |
| CN120340507A (en) * | 2025-06-19 | 2025-07-18 | 北京生数科技有限公司 | Audio generation method, device, storage medium, electronic device and program product |
| CN120496565A (en) * | 2025-06-23 | 2025-08-15 | 成都埃文数智信息技术有限公司 | Voice enhancement method based on distribution enhancement diffusion model |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2024184745A1 (en) | Unsupervised voice restoration with unconditional diffusion model | |
| US20230282202A1 (en) | Audio generator and methods for generating an audio signal and training an audio generator | |
| Qian et al. | Speech Enhancement Using Bayesian Wavenet. | |
| JP7103390B2 (en) | Acoustic signal generation method, acoustic signal generator and program | |
| Iashchenko et al. | UnDiff: Unsupervised voice restoration with unconditional diffusion model | |
| Peracha et al. | Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network | |
| Korostik et al. | Modifying flow matching for generative speech enhancement | |
| Wu et al. | Denoising Recurrent Neural Network for Deep Bidirectional LSTM Based Voice Conversion. | |
| KR20230101829A (en) | Apparatus for providing a processed audio signal, method for providing a processed audio signal, apparatus for providing neural network parameters, and method for providing neural network parameters | |
| Ueno et al. | Refining synthesized speech using speaker information and phone masking for data augmentation of speech recognition | |
| RU2823017C1 (en) | Uncontrolled voice restoration using unconditioned diffusion model without teacher | |
| JP2021033466A (en) | Encoding device, decoding device, parameter learning device, and program | |
| Yang et al. | SDNet: Noise-Robust Bandwidth Extension under Flexible Sampling Rates | |
| Lay et al. | Diffusion Buffer for Online Generative Speech Enhancement | |
| RU2823015C1 (en) | Audio data generator and methods of generating audio signal and training audio data generator | |
| Villani et al. | A Two-Stage Neural Network for Speech Signal Reconstruction from Mel Spectrograms | |
| RU2823016C1 (en) | Audio data generator and methods of generating audio signal and training audio data generator | |
| Yang et al. | Spectral network based on lattice convolution and adversarial training for noise-robust speech super-resolution | |
| Tachibana et al. | Diffusion Generative Vocoder for Fullband Speech Synthesis Based on Weak Third-order SDE Solver. | |
| Gao | Extremely Lightweight Vocoders for On-device Speech Synthesis | |
| Shin et al. | TF-Restormer: Complex Spectral Prediction for Speech Restoration | |
| Salem et al. | Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network | |
| Cohen et al. | Discovering Directions of Uncertainty in Speech Inpainting | |
| Tsemko et al. | EDGE-READY SPEECH SEPARATION WITH SUDO-TASNET | |
| Sach et al. | A Maximum Entropy Information Bottleneck (MEIB) Regularization for Generative Speech Enhancement with HiFi-GAN |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24766584; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |