
WO2025213833A1 - Neural network-based voice packet loss concealment method and device - Google Patents

Neural network-based voice packet loss concealment method and device

Info

Publication number
WO2025213833A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
sample
neural network
layer
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/139754
Other languages
French (fr)
Chinese (zh)
Inventor
夏咸军
张子晗
肖益剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Publication of WO2025213833A1 publication Critical patent/WO2025213833A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005: Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G10L19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Definitions

  • the embodiments of the present disclosure relate to the field of computer technology, and more particularly to a method and apparatus for voice packet loss compensation based on a neural network.
  • audio streaming has become a crucial component of network communications.
  • audio packets can be lost during transmission due to various factors, such as network congestion, bandwidth limitations, and hardware failures. This can severely impact the quality of voice communication and degrade the user experience. Therefore, recovering from audio packet loss is an urgent problem that needs to be addressed.
  • the embodiments of the present disclosure describe a method and apparatus for voice packet loss compensation based on a neural network.
  • the neural network trained by this method can more accurately compensate for voice packet loss.
  • a method for training a neural network for speech packet loss compensation wherein the neural network to be trained includes an encoder layer, an intermediate layer and a decoder layer, and the intermediate layer is connected between the encoder layer and the decoder layer.
  • the method includes: obtaining a training sample set, wherein each training sample includes sample packet loss audio, its corresponding sample frame loss position information and sample non-packet loss audio; generating input features based on the sample packet loss audio and its corresponding sample frame loss position information; inputting the input features into the neural network to be trained; inputting the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, and the fundamental frequency prediction network outputs a predicted fundamental frequency; and adjusting the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the true fundamental frequency calculated based on the sample non-packet loss audio.
  • the neural network to be trained is a U-Net neural network, wherein the intermediate layer is a bottleneck layer in the U-Net structure.
  • voice packet loss compensation can be achieved using the U-Net neural network.
  • the pitch prediction network includes a bidirectional long short-term memory network.
  • the bidirectional long short-term memory network can fully consider the contextual information in the audio data, thereby improving the processing capability of the audio data.
  • the features output by the intermediate layer include features of the frame corresponding to the frame loss position corresponding to the sample frame loss position information, and the predicted fundamental frequency output by the fundamental frequency prediction network includes the fundamental frequency of the frame corresponding to the frame loss position.
  • generating input features based on the packet-loss audio sample and its corresponding sample frame-loss location information includes: performing sub-band decomposition on the packet-loss audio sample to obtain multiple sub-bands; and generating input features based on the results of converting the multiple sub-bands into the time-frequency domain and the sample frame-loss location information. This allows the packet-loss audio sample to be decomposed into multiple sub-bands for processing, significantly reducing computational complexity.
  • the encoder layer includes multiple encoders, each of which includes a gated convolution layer and a time-frequency dilated convolution layer.
  • the time-frequency dilated convolution layer is used to extract features through dilated convolution in the time and frequency dimensions. This effectively increases the receptive field of the convolution layer.
  • the decoder layer includes multiple decoders, each decoder includes a first branch and a second branch in parallel, the first branch is used to predict the real part of the audio, and the second branch is used to predict the imaginary part of the audio;
  • the neural network to be trained outputs sample predicted audio based on the real and imaginary parts of the predicted audio output by the decoder layer; and the method further includes: inputting the sample predicted audio into at least one pre-trained discriminator, each discriminator outputting a discrimination result for the sample predicted audio; and calculating a loss based on the at least one discrimination result, the sample predicted audio, and the sample non-packet loss audio, and adjusting the network parameters of the neural network to be trained based on the loss. Therefore, a generative adversarial network (GAN) can be used to train the neural network to be trained.
  • the at least one discriminator includes a first discriminator for determining the probability that the sample predicted audio is real audio and a second discriminator for determining the audio quality of the sample predicted audio.
  • the accuracy of the generator can be improved by using multiple discriminators.
  • the method further includes: inputting the sample predicted audio output by the neural network to be trained and its corresponding sample non-packet loss audio into a pre-trained speech recognition model; obtaining the coding layer features of the sample predicted audio and of its corresponding sample non-packet loss audio in the speech recognition model; and adjusting the network parameters of the neural network to be trained based on the difference loss between the two obtained coding layer features.
  • the network parameters of the neural network to be trained can be adjusted using the pre-trained speech recognition model, thereby improving the accuracy of the neural network to be trained.
  • a method for voice packet loss compensation based on a neural network, comprising: obtaining a neural network for voice packet loss compensation trained according to any one of the methods of the first aspect; receiving audio to be processed and frame loss position information corresponding to the audio to be processed; and inputting the input features generated based on the audio to be processed and the frame loss position information corresponding to the audio to be processed into the neural network to obtain the audio corresponding to the audio to be processed after packet loss compensation.
  • a device for training a neural network for speech packet loss compensation includes an encoder layer, an intermediate layer and a decoder layer, and the intermediate layer is connected between the encoder layer and the decoder layer.
  • the device includes: an acquisition unit configured to acquire a training sample set, wherein each training sample includes sample packet loss audio, its corresponding sample frame loss position information and sample non-packet loss audio; a generation unit configured to generate input features based on the sample packet loss audio and its corresponding sample frame loss position information; a first input unit configured to input the input features into the neural network to be trained; a second input unit configured to input the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, and the fundamental frequency prediction network outputs a predicted fundamental frequency; and an adjustment unit configured to adjust the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the true fundamental frequency calculated based on the sample non-packet loss audio.
  • a speech packet loss compensation device based on a neural network includes: a model acquisition unit, configured to obtain a neural network for speech packet loss compensation trained according to any one of the methods of the first aspect; a receiving unit, configured to receive audio to be processed and frame loss position information corresponding to the audio to be processed; and a feature input unit, configured to input the input features generated based on the audio to be processed and the frame loss position information corresponding to the audio to be processed into the neural network to obtain the audio after packet loss compensation corresponding to the audio to be processed.
  • a computer program product comprising a computer program, wherein when the computer program is executed by a processor, the computer program implements any of the above methods in the first aspect.
  • a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed in a computer, the computer is caused to execute any one of the methods in the first aspect.
  • an electronic device comprising a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, any one of the above methods in the first aspect is implemented.
  • FIG1 is a schematic diagram showing an application scenario in which an embodiment of the present disclosure can be applied.
  • FIG2 is a schematic flow chart showing a method for training a neural network for voice packet loss compensation according to one embodiment
  • FIG3 is a schematic diagram showing an example of an encoder structure
  • FIG4 is a schematic diagram showing an example of a fundamental frequency prediction network structure
  • FIG5 is a schematic diagram showing an example of a neural network structure to be trained
  • FIG6 is a schematic diagram showing an example of a frequency domain multi-resolution discriminator
  • FIG7 is a schematic diagram showing an example of a time-domain multi-cycle discriminator
  • FIG8 shows a schematic diagram of an example of MetricGAN discriminator training
  • FIG9 shows a method for compensating for speech packet loss based on a neural network according to one embodiment
  • FIG10 shows a schematic block diagram of an apparatus for training a neural network for voice packet loss compensation according to one embodiment
  • FIG11 shows a schematic block diagram of a speech packet loss compensation apparatus based on a neural network according to an embodiment
  • FIG12 shows a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present application.
  • a prompt message is sent to the user to clearly inform the user that the operation requested will require the acquisition and use of the user's personal information. This allows the user to independently choose whether to provide personal information to the electronic device, application, server, storage medium, or other software or hardware that performs the operations of the disclosed technical solution based on the prompt message.
  • in response to receiving the user's active request, the prompt information may be sent to the user in the form of a pop-up window, in which the prompt information may be presented in text form.
  • the pop-up window may also contain a selection control for the user to select "agree" or "disagree" to provide personal information to the electronic device.
  • Traditional voice packet loss concealment (PLC) techniques compensate for lost packets based on redundant coding or signal-processing interpolation. For example, with the forward error correction technology used in many codecs, when poor network conditions are detected, the sender can transmit redundant information about past frames to recover short packet losses. However, this method introduces additional network overhead and additional delay, and cannot handle longer packet losses.
  • embodiments of the present disclosure provide a neural network-based voice packet loss compensation method.
  • a neural network can be trained using the method provided in embodiments of the present disclosure for training a neural network for voice packet loss compensation. This allows the trained neural network to more accurately compensate for voice packet loss without introducing additional network overhead or latency.
  • a method and device for speech packet loss compensation based on a neural network are provided.
  • the neural network to be trained may include an encoder layer, an intermediate layer and a decoder layer, and the intermediate layer is connected between the encoder layer and the decoder layer.
  • the training samples used include sample packet loss audio, its corresponding sample frame loss position information, and sample non-packet loss audio.
  • input features can be generated based on the sample packet loss audio and its corresponding sample frame loss position information, and the input features can be input into the neural network to be trained.
  • the features output by the intermediate layer can be input into a pre-trained fundamental frequency prediction network, and the fundamental frequency prediction network outputs a predicted fundamental frequency. Then, based on the predicted fundamental frequency and the actual fundamental frequency calculated based on the sample non-packet loss audio, the network parameters of the encoder layer and the intermediate layer are adjusted. In this way, the outputs of the encoder layer and the intermediate layer can be made more accurate, and the trained neural network can then perform speech packet loss compensation more accurately.
  • FIG. 1 illustrates a schematic diagram of an application scenario in which embodiments of the present disclosure can be applied.
  • electronic device 10 can first train a neural network 101 for voice packet loss compensation.
  • Electronic device 20 can then retrieve neural network 101 from electronic device 10 and use the trained neural network 101 to perform voice packet loss compensation.
  • the process of training the neural network 101 by the electronic device 10 may include the following steps 1-5: Step 1: Obtain a training sample set.
  • each training sample in the training sample set may include the sample audio with packet loss, its corresponding sample frame loss position information, and the sample audio without packet loss.
  • Step 2: Generate input features based on the sample audio with packet loss and its corresponding sample frame loss position information.
  • Step 3: Input the input features into the neural network to be trained.
  • the neural network to be trained may include an encoder layer, an intermediate layer, and a decoder layer, with the intermediate layer connected between the encoder layer and the decoder layer.
  • Step 4: Input the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, which outputs a predicted fundamental frequency.
  • Step 5: Update the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the actual fundamental frequency calculated based on the sample audio without packet loss.
  • the network parameters of the entire neural network to be trained may also be updated based on the prediction results output by the decoder layer and the sample audio without packet loss. This results in a trained neural network 101.
  • the electronic device 20 can obtain the trained neural network 101 from the electronic device 10, and input the audio to be processed and the frame loss position information corresponding to the audio to be processed into the neural network 101, thereby obtaining the audio after packet loss compensation corresponding to the audio to be processed.
  • FIG2 illustrates a flow chart of a method for training a neural network for voice packet loss compensation according to one embodiment.
  • the method can be performed by any device, apparatus, platform, or device cluster with computing and processing capabilities.
  • the method for training a neural network for voice packet loss compensation may include the following steps 201 to 205. Specifically, in step 201, a training sample set is obtained.
  • each training sample in the training sample set may include sample packet loss audio, sample frame loss position information corresponding to the sample packet loss audio, and sample non-packet loss audio corresponding to the sample packet loss audio.
  • the sample frame loss position information may be used to indicate the position of a frame in the sample packet loss audio where data loss occurred.
  • the sample packet loss audio may be audio at various sampling rates, for example, at a 48 kHz sampling rate.
  • audio data is usually processed and transmitted in frames, and each frame can contain audio of a certain length.
  • a flag bit is usually set in each frame of audio data.
  • this flag bit can be a binary bit used to indicate whether the frame has arrived intact. If the flag bit is set to indicate a packet loss state, the receiving end can detect the loss of this frame of data. Therefore, the position of the frame in the audio where data loss occurred can be determined based on the flag bit.
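  • As an illustration of how such flag bits can be turned into the frame loss position information later used as a network input, the following is a minimal Python sketch; the convention that a lost frame is marked with 1 is an assumption, not specified here.

```python
# Hypothetical sketch: deriving frame loss position information m(t) from
# per-frame arrival flags (assumed convention: 1 = frame arrived, 0 = lost).
import numpy as np

def frame_loss_mask(arrival_flags) -> np.ndarray:
    """Return m(t): 1.0 where a frame was lost, 0.0 where it arrived."""
    flags = np.asarray(arrival_flags, dtype=np.float32)
    return 1.0 - flags  # invert the flags so lost frames are marked with 1

print(frame_loss_mask([1, 1, 0, 0, 1]))  # [0. 0. 1. 1. 0.]
```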
  • Step 202: Generate input features based on the sample packet loss audio and its corresponding sample frame loss position information.
  • the input features can be generated based on the sample packet loss audio and its corresponding sample frame loss position information.
  • the sample packet loss audio can first be converted to the frequency domain and then concatenated with the sample frame loss position information to generate the input features.
  • step 202 may include the following steps 1 and 2. Specifically, in step 1, sub-band decomposition is performed on the sample packet loss audio to obtain multiple sub-bands. In step 2, input features are generated based on the conversion results of the multiple sub-bands into the time-frequency domain and the sample frame loss location information.
  • multiple methods can be used to perform sub-band decomposition on the sample packet loss audio.
  • a stable and efficient pseudo-quadrature mirror filter bank (PQMF) can be used to perform the sub-band decomposition, dividing the sample packet loss audio into multiple sub-bands.
  • the sub-band decomposition can use a bank of K FIR (Finite Impulse Response) filters.
  • the sub-band decomposition process can include FIR analysis, downsampling, and short-time Fourier transform (STFT).
  • y represents the original audio input to the neural network to be trained, and yk (k ∈ [1, K], the subband number) represents the audio of the k-th subband after subband analysis and downsampling. The sampling rate of yk is 1/K of the original sampling rate. Yk can represent the frequency-domain representation of yk after the short-time Fourier transform (STFT). Stacking Yk along the channel dimension yields the frequency-domain subband features input to the neural network to be trained.
  • the frequency-domain subband features can be concatenated with the frame loss position information either directly or after compression to obtain the input features.
  • for example, the features can be compressed by amplitude spectrum compression with a scaling factor of 0.5 and then concatenated with the frame loss position information.
  • the sub-band decoding process corresponding to the above sub-band decomposition process may include iSTFT (Inverse Short-Time Fourier Transform), upsampling and FIR synthesis.
  • the output of the neural network is the predicted spectrum of each subband. For each subband, an inverse Fourier transform yields the time-domain subband audio, which is restored to full-band audio after upsampling and a synthesis filter.
  • K can be 4, that is, the audio signal is divided into 4 sub-bands for processing.
  • This implementation allows the sample packet loss audio to be decomposed into multiple subbands for processing, significantly reducing computational complexity. Therefore, this implementation is particularly suitable for audio at high sampling rates, such as audio with a sampling rate greater than or equal to 48 kHz.
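  • To make the subband pipeline concrete, the following is a minimal Python sketch (not the exact filter design described here) of PQMF-style analysis followed by per-subband STFT and 0.5-power amplitude compression, assuming K = 4 subbands and scipy's default conventions:

```python
# A minimal sketch of PQMF-style subband analysis plus per-subband STFT;
# the prototype filter design and all sizes are illustrative assumptions.
import numpy as np
from scipy.signal import firwin, lfilter, stft

def pqmf_analysis(y: np.ndarray, K: int = 4, taps: int = 63) -> np.ndarray:
    """Split y into K cosine-modulated subbands, each downsampled by K."""
    proto = firwin(taps, cutoff=1.0 / (2 * K))           # lowpass prototype p(n)
    n = np.arange(taps)
    subbands = []
    for k in range(K):
        # Cosine modulation shifts the prototype to the k-th band.
        h_k = proto * np.cos(np.pi / K * (k + 0.5) * (n - (taps - 1) / 2)
                             + (-1) ** k * np.pi / 4)
        subbands.append(lfilter(h_k, 1.0, y)[::K])       # filter, then downsample
    return np.stack(subbands)                            # shape (K, N // K)

y = np.random.randn(48000)                               # 1 s of 48 kHz audio
y_k = pqmf_analysis(y)                                   # each subband at 12 kHz
# Per-subband STFT; stacking the compressed spectra along the channel axis
# gives (a simplified version of) the network input features.
_, _, Y_k = stft(y_k, nperseg=480, noverlap=240, axis=-1)
features = np.abs(Y_k) ** 0.5                            # compression factor 0.5
print(features.shape)                                    # (K, freq_bins, frames)
```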
  • Step 203: Input the input features into the neural network to be trained.
  • the obtained input features can be input to the neural network to be trained and processed by the neural network to be trained.
  • the neural network to be trained may include an encoder layer, an intermediate layer, and a decoder layer.
  • the intermediate layer is connected to the encoder layer and the decoder layer.
  • the encoder layer and the decoder layer may or may not be connected.
  • the neural network to be trained can be a neural network with a U-Net structure.
  • the U-Net structure can primarily include an encoder, a bottleneck layer, and a decoder.
  • the bottleneck layer in the U-Net structure can be an intermediate layer.
  • a skip connection can be used between the encoder and the decoder.
  • the encoder layer may include several encoders. As needed, each encoder may include various layers, for example, a convolution layer, a pooling layer, and the like. For example, each encoder may include a gated convolution layer and a time-frequency dilated convolution layer, and the time-frequency dilated convolution layer may be used to extract features through dilated convolutions in the time and frequency dimensions. FIG3 shows a schematic diagram of an example of an encoder structure.
  • the encoder may include a gated convolution layer and a time-frequency dilated convolution layer in sequence
  • the time-frequency dilated convolution layer may include dilated convolutions FConv (Frequency Dilated Convolution) and TConv (Time Dilated Convolution) in the frequency and time dimensions.
  • FConv and TConv can be regarded as multi-scale modeling along the frequency axis and the time axis, fully perceiving historical information.
  • the time-frequency dilated convolution layer can also include a BN (Batch Normalization) layer, a PReLU (Parametric Rectified Linear Unit) activation function, PWConv (pointwise convolution), etc.
  • PWConv can be used to align the input and output dimensions.
  • b dilated convolution layers with dilation rates from 1 to 2^(b-1) are stacked to form a time-frequency dilated convolution layer, which can effectively enlarge the receptive field of the convolution layer.
  • the encoder structure shown in Figure 3 is only schematic and not a limitation of the encoder structure. In practice, the layers included in the encoder can be set according to actual needs.
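  • The following is a minimal PyTorch sketch of one such encoder block, assuming the order described above (a gated convolution followed by time-frequency dilated convolutions with BN, PReLU, and a pointwise convolution); channel counts and kernel shapes are illustrative assumptions:

```python
# Illustrative encoder block: gated convolution + time-frequency dilated
# convolutions (FConv/TConv) with dilation rates 1..2^(b-1).
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, 2 * c_out, kernel_size=(3, 3), padding=1)

    def forward(self, x):                       # x: (B, C, T, F)
        a, g = self.conv(x).chunk(2, dim=1)     # split into features and gate
        return a * torch.sigmoid(g)             # gating selects useful features

class TFDilatedBlock(nn.Module):
    """Dilated convs along frequency (FConv) and time (TConv)."""
    def __init__(self, channels, b=3):
        super().__init__()
        layers = []
        for i in range(b):
            d = 2 ** i                          # dilation rates 1, 2, ..., 2^(b-1)
            layers += [
                nn.Conv2d(channels, channels, (1, 3), padding=(0, d), dilation=(1, d)),  # FConv
                nn.BatchNorm2d(channels), nn.PReLU(),
                nn.Conv2d(channels, channels, (3, 1), padding=(d, 0), dilation=(d, 1)),  # TConv
                nn.BatchNorm2d(channels), nn.PReLU(),
            ]
        layers.append(nn.Conv2d(channels, channels, 1))  # PWConv aligns dimensions
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.net(x)                  # residual path keeps training stable

encoder = nn.Sequential(GatedConv2d(9, 32), TFDilatedBlock(32))
print(encoder(torch.randn(2, 9, 100, 241)).shape)  # torch.Size([2, 32, 100, 241])
```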
  • the features output by the intermediate layer are input into a pre-trained fundamental frequency prediction network, which then outputs a predicted fundamental frequency.
  • the intermediate layer can be used for packet loss compensation and can be implemented using various neural networks.
  • the intermediate layer can be implemented using a bidirectional GRU (Gate Recurrent Unit), which can extract the correlation between frequency and time dimensions to compensate for packet loss.
  • the features output by the intermediate layer can include features of the frame corresponding to the frame loss position indicated by the sample frame loss position information, and the pre-trained fundamental frequency prediction network can be used to predict the fundamental frequency based on the input features. Therefore, the features output by the intermediate layer are input into the fundamental frequency prediction network, which can output a predicted fundamental frequency.
  • this predicted fundamental frequency can include the fundamental frequency of the frame corresponding to the frame loss position.
  • the fundamental frequency prediction network can be any of various neural networks.
  • the fundamental frequency prediction network may include a bidirectional long short-term memory (Bi-LSTM) network.
  • FIG4 illustrates a schematic diagram of an example fundamental frequency prediction network structure.
  • the fundamental frequency prediction network may include, in sequence, a reshape layer, a Maxpool layer, a reshape layer, a Bi-LSTM layer, a Linear layer, a reshape layer, a Linear_C layer, a Linear_F layer, and so on.
  • the output of the first reshape layer forms a skip connection with the output of the Linear layer.
  • the reshape layer can be used to adjust the data dimension
  • the Maxpool layer can be used for downsampling
  • the Linear layer can be used for linear mapping.
  • the Linear_C layer can represent linear mapping in the channel dimension
  • the Linear_F layer can represent linear mapping in the frequency dimension.
  • the fundamental frequency prediction network can be trained in various ways. For example, it can be trained in a supervised manner. For example, in a supervised manner, a sample set can be first obtained, and each sample in the sample set can include a sample audio feature and a fundamental frequency label corresponding to the sample audio feature.
  • the sample audio feature can be obtained through the network structure of the neural network to be trained.
  • the sample audio features in the sample can be input into the fundamental frequency prediction network to be trained, and the fundamental frequency prediction network to be trained outputs a predicted fundamental frequency. Then, based on a predefined loss function, the difference loss between the predicted fundamental frequency and the fundamental frequency label is calculated, and the network parameters of the fundamental frequency prediction network are adjusted with the goal of minimizing the difference loss.
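  • A minimal PyTorch sketch of such a fundamental frequency prediction network is given below, assuming the layer sequence of FIG4 (reshape, Maxpool, Bi-LSTM, Linear with a skip connection, then Linear_C over channels and Linear_F over frequency); all sizes are illustrative assumptions:

```python
# Illustrative F0 prediction head operating on bottleneck features.
import torch
import torch.nn as nn

class F0Predictor(nn.Module):
    def __init__(self, channels=64, freq=60, hidden=128):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=(1, 2))          # downsample along F
        self.lstm = nn.LSTM(channels * freq // 2, hidden,
                            batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, channels * freq // 2)
        self.linear_c = nn.Linear(channels, 1)                # Linear_C: channel dim
        self.linear_f = nn.Linear(freq // 2, 1)               # Linear_F: frequency dim

    def forward(self, x):                    # x: (B, C, T, F) from the bottleneck
        x = self.pool(x)                     # (B, C, T, F/2)
        B, C, T, F = x.shape
        h = x.permute(0, 2, 1, 3).reshape(B, T, C * F)   # reshape for the LSTM
        h = self.linear(self.lstm(h)[0])     # Bi-LSTM over time, then linear map
        h = h.reshape(B, T, C, F) + x.permute(0, 2, 1, 3)  # skip connection
        h = self.linear_c(h.transpose(2, 3)).squeeze(-1)   # collapse channel dim
        return self.linear_f(h).squeeze(-1)  # (B, T): one predicted F0 per frame

f0 = F0Predictor()(torch.randn(2, 64, 100, 60))
print(f0.shape)                              # torch.Size([2, 100])
```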
  • Step 205: Adjust the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the actual fundamental frequency calculated from the sample non-packet loss audio.
  • the actual fundamental frequency of each frame can be calculated from the sample non-packet loss audio, and then the network parameters of the encoder layer and the intermediate layer can be adjusted based on the predicted fundamental frequency and the actual fundamental frequency.
  • the following formula can be used as the loss function:
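  • A plausible form of this loss, assuming a mean squared error between the true fundamental frequency fi of each frame and the predicted fundamental frequency f̂i, is:

```latex
L_{f0} = \frac{1}{T} \sum_{i=1}^{T} \left( f_i - \hat{f}_i \right)^{2}
```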
  • T can represent the number of frames
  • fi can represent the true fundamental frequency of the i-th frame
  • Lf0 can represent the difference loss of the fundamental frequency prediction.
  • the network parameters of the encoder layer and the intermediate layer can be adjusted with the goal of minimizing Lf0 .
  • the decoder layer may also include multiple decoders, each of which may include a first branch and a second branch in parallel.
  • the first branch may be used to predict the real part of the audio
  • the second branch may be used to predict the imaginary part of the audio.
  • the neural network to be trained may output sample predicted audio based on the real and imaginary parts of the predicted audio output by the decoder layer.
  • Figure 5 shows a schematic diagram of an example of a neural network structure to be trained.
  • the input may include the sample packet loss audio and the sample frame loss position information m(t), t ∈ [1, T]; B can represent the batch size, N the audio length, and T the number of frames. The output can include the sample predicted audio after packet loss compensation.
  • the sample packet loss audio, after passing through the pseudo-quadrature mirror filter bank (PQMF), the short-time Fourier transform (STFT), and compression, is stacked with the sample frame loss position information m(t) and input into the neural network to be trained.
  • the number of encoders included in the encoder layer is the same as the number of decoders included in the decoder layer, and each encoder can be connected to its corresponding decoder by a skip connection.
  • the encoder layer can include 4 encoders, each of which can include a gated convolution layer and a time-frequency dilated convolution layer.
  • the middle layer can be implemented by 3 bidirectional GRUs (Gated Recurrent Units), which can extract the correlation between the frequency and time dimensions and compensate for packet loss.
  • the decoder layer can also include 4 decoders, each of which can include a transposed gated convolution layer and a time-frequency dilated convolution layer.
  • Each decoder can include a first branch and a second branch in parallel. The first branch can be used to predict the real part of the audio, and the second branch can be used to predict the imaginary part of the audio.
  • the real and imaginary parts of the predicted audio output by the decoder layer can be concatenated and then passed, in sequence, through decompression, the inverse short-time Fourier transform (ISTFT), and the pseudo-quadrature mirror filter bank (PQMF) to obtain the sample predicted audio after packet loss compensation.
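  • A minimal PyTorch sketch of one decoder block with the two parallel branches is shown below; the transposed gated convolution is simplified to a plain transposed convolution, and all sizes are illustrative assumptions:

```python
# Illustrative decoder block: two parallel branches predict the real and
# imaginary parts of the subband spectrum.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel_size=(1, 3), padding=(0, 1)),
                nn.PReLU(),
            )
        self.real = branch()   # first branch: real part of the audio spectrum
        self.imag = branch()   # second branch: imaginary part of the spectrum

    def forward(self, x, skip):
        x = torch.cat([x, skip], dim=1)   # U-Net skip connection from the encoder
        return self.real(x), self.imag(x)

dec = DecoderBlock(c_in=64, c_out=4)      # e.g. one output channel per subband
re, im = dec(torch.randn(2, 32, 100, 241), torch.randn(2, 32, 100, 241))
spec = torch.complex(re, im)              # complex spectrum, ready for the ISTFT
print(spec.shape)                         # torch.Size([2, 4, 100, 241])
```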
  • the above-mentioned method for training a neural network for speech packet loss compensation may further include the following steps S1 and S2. Specifically, in step S1, the sample predicted audio is input into at least one pre-trained discriminator, and each discriminator outputs a discrimination result for the sample predicted audio. In step S2, based on the at least one discrimination result, the sample predicted audio, and the sample non-packet loss audio, a loss is calculated, and network parameters of the neural network to be trained are adjusted based on the loss.
  • a generative adversarial network can be used to train the neural network.
  • the neural network to be trained can serve as a generator, and its output of sample predicted audio can be input into a discriminator, which then outputs a judgment result for the sample predicted audio.
  • the network parameters of the neural network to be trained are then adjusted based on the judgment result output by the discriminator.
  • one or more discriminators can be used.
  • the at least one discriminator may include a first discriminator for determining the probability that the sample predicted audio is real audio, and may also include a second discriminator for determining the audio quality of the sample predicted audio.
  • the first discriminator can include a frequency domain multi-resolution discriminator, a time domain multi-period discriminator, and so on.
  • the real audio without packet loss and the audio generated by the generator can be input into the first discriminator, and the first discriminator determines whether the input audio is the audio generated by the generator.
  • the discriminator loss is calculated, and the network parameters of the first discriminator are updated through backpropagation.
  • the loss of the first discriminator can be calculated using the following formula:
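  • A plausible form of this loss, assuming the least-squares GAN formulation with the symbols defined below, is:

```latex
L_{D_1} = \mathbb{E}_{s}\left[ \left( D(s) - 1 \right)^{2} \right] + \mathbb{E}_{x}\left[ D\left( G(x) \right)^{2} \right]
```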
  • s can represent the real non-packet-loss audio
  • x can represent the packet-loss audio corresponding to s
  • G can represent the generator
  • D can represent the discriminator
  • G(x) can represent the result after the generator processes x
  • 𝔼 can represent the expectation operator, and its subscript indicates the variable over which the expectation is taken.
  • the first discriminator may include a frequency domain multi-resolution discriminator.
  • Figure 6 shows a schematic diagram of an example of a frequency domain multi-resolution discriminator.
  • the frequency domain multi-resolution discriminator can use short-time Fourier transform (STFT) with different window lengths and window shifts to transform the time domain waveform.
  • the window lengths can be [30, 60, 120, 240, 480, 960], respectively, so that the input audio can be discriminated at different frequency resolutions.
  • the first discriminator can also include a time-domain multi-period discriminator (MPD).
  • Figure 7 shows a schematic diagram of an example of a time-domain multi-period discriminator.
  • the time-domain multi-period discriminator can fold the input one-dimensional sample point sequence into a two-dimensional plane with a certain period, and then apply two-dimensional convolution for processing.
  • in the sub-discriminator for each specific period, the input is first padded to ensure that the number of sample points is an integer multiple of the period, so that it can be folded into a two-dimensional plane.
  • the numbers of output channels of the successive convolution layers can be, for example, [32, 128, 512, 1024], respectively.
  • in the time-domain multi-period discriminator, each convolution is followed by a leaky ReLU activation. Finally, post-processing is performed by a convolutional layer with, for example, 1024 input channels and 1 output channel, and the flattened result is used as the final output of the time-domain multi-period discriminator. It can be understood that the time-domain multi-period discriminator shown in Figure 7 is a well-known existing discriminator; the contents of its various parts are shown in the figure and are not repeated here.
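  • The period folding itself is simple; a minimal sketch (with an assumed reflect padding, which the text above does not specify) is:

```python
# Fold a 1-D waveform into a (length/p, p) plane so 2-D convolutions can
# model structure that repeats with period p.
import torch
import torch.nn.functional as F

def fold_by_period(x: torch.Tensor, p: int) -> torch.Tensor:
    """x: (B, 1, N) waveform -> (B, 1, N'/p, p) two-dimensional plane."""
    B, C, N = x.shape
    if N % p != 0:
        x = F.pad(x, (0, p - N % p), mode="reflect")  # pad to a multiple of p
    return x.view(B, C, -1, p)

x = torch.randn(2, 1, 48000)
print(fold_by_period(x, 3).shape)  # torch.Size([2, 1, 16000, 3])
```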
  • the second discriminator can be a MetricGAN discriminator.
  • the packet loss-compensated audio can be fed into the MetricGAN discriminator to estimate the PESQ (Perceptual Evaluation of Speech Quality), and the loss can be calculated by comparing it with the actual PESQ obtained through the Python library.
  • the MetricGAN discriminator is then updated based on the loss.
  • the loss of the MetricGAN discriminator can be calculated using the following loss function:
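  • A plausible form of this loss, assuming the discriminator regresses the (normalized) PESQ score Q' of the compensated audio, is:

```latex
L_{D_2} = \mathbb{E}_{(s,x)}\left[ \left( D\left( G(x) \right) - Q'\left( G(x), s \right) \right)^{2} \right]
```

MetricGAN-style discriminators often also include an analogous term on the clean audio, e.g. (D(s) − Q'(s, s))².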
  • s can represent the real non-packet-loss audio
  • x can represent the packet-loss audio corresponding to s
  • G can represent the generator
  • D can represent the discriminator
  • G(x) can represent the result after the generator processes x.
  • Q' can represent the PESQ score computed by the pypesq library function, and 𝔼 can represent the expectation operator.
  • Figure 8 shows a schematic diagram of an example of MetricGAN discriminator training.
  • the MetricGAN discriminator can output a predicted PESQ score.
  • the MetricGAN discriminator can include four downsampling two-dimensional convolution layers followed by three fully connected layers.
  • the predicted PESQ score is first compared with the true PESQ score output by the pypesq library function to calculate a mean squared error (MSE) loss, which is used to update the network parameters of the MetricGAN discriminator. Afterwards, an MSE loss between the predicted score and the maximum PESQ value is calculated and used to update the network parameters of the generator.
  • the above-mentioned method for training a neural network for voice packet loss compensation may further include the following steps 1 to 3. Specifically: Step 1, the sample prediction audio output by the neural network to be trained and its corresponding sample non-packet loss audio are respectively input into a pre-trained speech recognition model.
  • the speech recognition model can be any existing speech recognition model; for example, it can be a Whisper model.
  • Step 2: Obtain the coding layer features of the sample predicted audio and of its corresponding sample non-packet loss audio in the speech recognition model.
  • Step 3: Adjust the network parameters of the neural network to be trained based on the difference loss between the two obtained coding layer features. For example, the network parameters of the neural network to be trained can be adjusted with the goal of minimizing the difference loss.
  • the difference loss of the two encoding layer features can be calculated using the following formula:
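  • A plausible form of this loss, assuming an L1 distance between the two coding layer features (the norm is not specified here), is:

```latex
L_{w} = \left\| W(s) - W(\hat{s}) \right\|_{1}
```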
  • W(s) can represent the coding layer features of the Whisper model for the sample non-packet loss audio s, and W(ŝ) the corresponding coding layer features for the sample predicted audio ŝ.
  • the network parameters of the neural network to be trained can also be adjusted based on the sample predicted audio and its corresponding sample non-packet loss audio.
  • the difference loss between the sample predicted audio and its corresponding sample non-packet loss audio can be calculated, and the network parameters of the neural network to be trained can be adjusted with the goal of minimizing the difference loss.
  • the difference loss can be calculated using the following formula:
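  • Plausible forms of these losses, reconstructed from the definitions below (the exact weighting between the terms is an assumption), are:

```latex
L_{mae} = \frac{1}{N} \sum_{i=1}^{N} \left| s_i - \hat{s}_i \right|
```

```latex
L_{plcpa} = \sum_{t,f} \left| |S(t,f)|^{p} - |\hat{S}(t,f)|^{p} \right|^{2}
          + \sum_{t,f} \left| |S(t,f)|^{p} e^{j\varphi_{S}(t,f)} - |\hat{S}(t,f)|^{p} e^{j\varphi_{\hat{S}}(t,f)} \right|^{2}
```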
  • Lmae can represent the mae (Mean Absolute Error) loss in the time domain
  • N can represent the audio length
  • si can represent the non-packet loss audio at the i-th sampling point, and ŝi can represent the sample predicted audio at the i-th sampling point.
  • L plcpa can represent the mean square error of the amplitude spectrum compression in the frequency domain
  • S(t,f) can represent the spectrum corresponding to the sample non-packet loss audio, and Ŝ(t,f) can represent the spectrum corresponding to the sample predicted audio.
  • t and f can represent the subscripts of the time dimension and frequency dimension respectively
  • p can represent the compression coefficient
  • j can represent the imaginary unit related to the Fourier transform, and φ the phase; the first term of Lplcpa can be expressed as the mean square error of the compressed amplitude spectrum, and the second term as the mean square error of the phase-aware spectrum.
  • the difference loss can be calculated using the following formula:
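  • A plausible form of this loss, assuming a MetricGAN-style generator objective that pushes the discriminator's predicted score toward the maximum PESQ value, is:

```latex
L_{G} = \mathbb{E}_{x}\left[ \left( D\left( G(x) \right) - PESQ_{max} \right)^{2} \right]
```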
  • s represents the true audio without packet loss
  • x represents the packet loss audio corresponding to s
  • G represents the generator
  • D represents the discriminator
  • G(x) represents the result of processing x by the generator
  • PESQ max represents the maximum value of the PESQ indicator.
  • the above describes the training process of a neural network for voice packet loss compensation.
  • the neural network obtained in this way can compensate for packet loss of input audio.
  • the neural network-based voice packet loss compensation method may include steps 901 to 903. Specifically, step 901: Obtain a pre-trained neural network for voice packet loss compensation. In this embodiment, a neural network for voice packet loss compensation trained using the method described in Figure 2 can be obtained. This neural network can perform packet loss compensation based on packet loss audio and frame loss location information corresponding to the packet loss audio. Step 902: Receive audio to be processed and frame loss location information corresponding to the audio to be processed.
  • Step 903: Input the input features generated based on the audio to be processed and the frame loss position information corresponding to the audio to be processed into the neural network to obtain the packet loss compensated audio corresponding to the audio to be processed.
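  • As a usage illustration, the following hypothetical sketch shows steps 901 to 903 end to end; the checkpoint path and the make_input helper are invented names for illustration only, not from this text:

```python
# Hypothetical inference-time usage of a trained packet loss compensation
# network (names and I/O conventions are assumptions).
import torch

model = torch.load("plc_net.pt", weights_only=False)  # step 901: obtain trained network
model.eval()

audio = torch.randn(1, 48000)            # step 902: received audio (lost frames zeroed)
loss_mask = torch.zeros(1, 100)          # m(t): 1 marks a lost frame
loss_mask[0, 40:45] = 1.0                # e.g. frames 40-44 were lost

with torch.no_grad():                    # step 903: build input features and infer
    features = model.make_input(audio, loss_mask)  # assumed helper: subband STFT + mask
    concealed = model(features)                    # packet-loss-compensated audio
print(concealed.shape)                   # (1, 48000)
```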
  • a device for training a neural network for voice packet loss compensation is provided. This device for training a neural network for voice packet loss compensation can be deployed on any device, equipment, platform, device cluster, or the like with computing and processing capabilities.
  • Figure 10 shows a schematic block diagram of an apparatus for training a neural network for voice packet loss compensation according to one embodiment.
  • the apparatus shown in Figure 10 is used to perform the method shown in Figure 2.
  • the neural network to be trained includes an encoder layer, an intermediate layer, and a decoder layer, with the intermediate layer connected between the encoder layer and the decoder layer.
  • the device 1000 for training a neural network for speech packet loss compensation includes: an acquisition unit 1001, configured to acquire a training sample set, each training sample including sample packet loss audio, its corresponding sample frame loss position information and sample non-packet loss audio; a generation unit 1002, configured to generate input features based on the sample packet loss audio and its corresponding sample frame loss position information; a first input unit 1003, configured to input the above-mentioned input features into the above-mentioned neural network to be trained; a second input unit 1004, configured to input the features output by the above-mentioned intermediate layer into a pre-trained fundamental frequency prediction network, the above-mentioned fundamental frequency prediction network outputting a predicted fundamental frequency; and an adjustment unit 1005, configured to adjust the network parameters of the above-mentioned encoder layer and the above-mentioned intermediate layer based on the above-mentioned predicted fundamental frequency and the actual fundamental frequency calculated based on the sample non-packet loss audio.
  • the neural network to be trained is a neural network with a U-Net structure
  • the intermediate layer is a bottleneck layer in the U-Net structure.
  • the fundamental frequency prediction network includes a bidirectional long short-term memory network.
  • the features output by the intermediate layer include features of the frame corresponding to the frame loss position corresponding to the sample frame loss position information, and the predicted fundamental frequency output by the fundamental frequency prediction network includes the fundamental frequency of the frame corresponding to the frame loss position.
  • the generation unit 1002 is further configured to perform sub-band decomposition on the above-mentioned sample packet loss audio to obtain multiple sub-bands; and generate input features based on the conversion results of the above-mentioned multiple sub-bands into the time-frequency domain and the above-mentioned sample frame loss position information.
  • the encoder layer includes multiple encoders, each encoder includes a gated convolution layer and a time-frequency dilated convolution layer, and the time-frequency dilated convolution layer is used to extract features through dilated convolution in the time dimension and the frequency dimension.
  • the decoder layer includes multiple decoders, each decoder includes a first branch and a second branch in parallel, the first branch is used to predict the real part of the audio, and the second branch is used to predict the imaginary part of the audio;
  • the neural network to be trained outputs sample predicted audio based on the real and imaginary parts of the predicted audio output by the decoder layer;
  • the device 1000 also includes: a third input unit (not shown in the figure), configured to input the sample predicted audio into at least one pre-trained discriminator, each discriminator outputting a discrimination result for the sample predicted audio; and a calculation unit (not shown in the figure), configured to calculate the loss based on the at least one discrimination result, the sample predicted audio and the sample non-packet loss audio, and adjust the network parameters of the neural network to be trained based on the loss.
  • the at least one discriminator includes a first discriminator for determining the probability that the sample predicted audio is real audio and a second discriminator for determining the audio quality of the sample predicted audio.
  • the above-mentioned device 1000 also includes: a fourth input unit (not shown in the figure), configured to input the sample prediction audio output by the above-mentioned neural network to be trained and its corresponding sample non-packet-lost audio into a pre-trained speech recognition model respectively; a coding layer feature acquisition unit (not shown in the figure), configured to obtain the coding layer features of the above-mentioned sample prediction audio and its corresponding sample non-packet-lost audio in the above-mentioned speech recognition model; a network parameter adjustment unit (not shown in the figure), configured to adjust the network parameters of the above-mentioned neural network to be trained based on the difference loss of the two obtained coding layer features.
  • a neural network-based voice packet loss compensation device is provided.
  • the neural network-based voice packet loss compensation device can be deployed on any device, equipment, platform, device cluster, etc. with computing and processing capabilities.
  • FIG11 shows a schematic block diagram of a speech packet loss compensation device based on a neural network according to an embodiment.
  • the device shown in FIG11 is used to execute the method shown in FIG9 .
  • the speech packet loss compensation device 1100 based on a neural network includes: a model acquisition unit 1101, configured to acquire a neural network for speech packet loss compensation trained according to the method described in FIG2; a receiving unit 1102, configured to receive audio to be processed and frame loss position information corresponding to the audio to be processed; and a feature input unit 1103, configured to input the input features generated based on the audio to be processed and the frame loss position information corresponding to the audio to be processed into the neural network to obtain the audio after packet loss compensation corresponding to the audio to be processed.
  • the above-mentioned device embodiments correspond to the method embodiments.
  • the device embodiments are obtained based on the corresponding method embodiments and have the same technical effects as the corresponding method embodiments.
  • a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed in a computer, the computer is caused to execute the method described in FIG. 2 or FIG. 9.
  • an electronic device including a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method described in FIG. 2 or FIG. 9 is implemented.
  • FIG12 shows a schematic diagram of the structure of an electronic device 1200 suitable for implementing the embodiments of the present application.
  • the electronic device shown in FIG12 is only an example and should not limit the functions and scope of use of the embodiments of the present application.
  • the electronic device 1200 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 1201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage device 1208 into a random access memory (RAM) 1203.
  • Various programs and data required for the operation of the electronic device 1200 are also stored in the RAM 1203.
  • the processing device 1201, the ROM 1202, and the RAM 1203 are connected to each other via a bus 1204.
  • An input/output (I/O) interface 1205 is also connected to the bus 1204.
  • the following devices can be connected to the I/O interface 1205: an input device 1206 including, for example, a touch screen, a touchpad, a keyboard, a mouse, etc.; an output device 1207 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 1208 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 1209.
  • the communication device 1209 can allow the electronic device 1200 to communicate with other devices wirelessly or by wire to exchange data.
  • although Figure 12 shows an electronic device 1200 with various components, it should be understood that not all of the illustrated components are required to be implemented or present; more or fewer components may alternatively be implemented. Each box shown in Figure 12 may represent one component, or may represent multiple components, as needed.
  • an embodiment of the present application includes a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program includes program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from the network via the communication device 1209, or installed from the storage device 1208, or installed from the ROM 1202.
  • when the computer program is executed by the processing device 1201, the above-mentioned functions defined in the methods of the embodiments of the present application are performed.
  • An embodiment of the present disclosure further provides a computer-readable storage medium having a computer program stored thereon.
  • when the computer program is executed in a computer, the computer is caused to execute the method provided in the present disclosure.
  • the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wire, optical cable, RF (Radio Frequency), or any suitable combination thereof.
  • the computer-readable medium may be included in the electronic device, or may exist independently without being installed in the electronic device.
  • the computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device: obtains a training sample set, wherein each training sample includes sample packet loss audio, its corresponding sample frame loss position information, and sample non-packet loss audio; generates input features based on the sample packet loss audio and its corresponding sample frame loss position information; inputs the input features into the neural network to be trained, wherein the neural network to be trained includes an encoder layer, an intermediate layer, and a decoder layer, the intermediate layer being connected between the encoder layer and the decoder layer; inputs the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, which outputs a predicted fundamental frequency; and adjusts the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the true fundamental frequency calculated from the sample non-packet loss audio.
  • Computer program code for performing the operations of the embodiments of the present disclosure may be written in one or more programming languages, or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may be executed entirely on the user's computer, partially on the user's computer, as a stand-alone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or electronic device.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A neural network-based voice packet loss concealment method and device. The method comprises: acquiring a pre-trained neural network used for voice packet loss concealment; receiving an audio to be processed and frame loss position information corresponding to said audio; and inputting an input feature generated on the basis of said audio and the frame loss position information corresponding to said audio into the neural network to obtain a packet-loss-concealed audio corresponding to said audio.

Description

Voice packet loss compensation method and device based on neural network

This application claims priority to Chinese invention patent application No. 202410411048.5, entitled "Voice packet loss compensation method and device based on neural network" and filed on April 7, 2024, the entire contents of which are incorporated herein by reference.

Technical Field

The embodiments of the present disclosure relate to the field of computer technology, and more particularly to a method and apparatus for voice packet loss compensation based on a neural network.

Background Art

With the development of Internet and communication technologies, audio streaming has become a crucial component of network communication. In practice, audio data packets may be lost during transmission due to various factors, such as network congestion, bandwidth limitations, and hardware failures. This can severely impact the quality of voice communication and degrade the user experience. Therefore, repairing and recovering from audio packet loss is an urgent problem to be solved.

Summary of the Invention

The embodiments of the present disclosure describe a neural network-based method and apparatus for voice packet loss compensation. A neural network trained by this method can compensate for voice packet loss more accurately.

According to a first aspect, a method for training a neural network for voice packet loss compensation is provided, wherein the neural network to be trained includes an encoder layer, an intermediate layer, and a decoder layer, the intermediate layer being connected between the encoder layer and the decoder layer. The method includes: obtaining a training sample set, wherein each training sample includes a sample packet loss audio, its corresponding sample frame-loss position information, and a sample non-packet loss audio; generating input features based on the sample packet loss audio and its corresponding sample frame-loss position information; inputting the input features into the neural network to be trained; inputting the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, which outputs a predicted fundamental frequency; and adjusting the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the true fundamental frequency calculated from the sample non-packet loss audio.

In one embodiment, the neural network to be trained is a neural network with a U-Net structure, wherein the intermediate layer is the bottleneck layer of the U-Net structure. Thus, voice packet loss compensation can be achieved through a neural network with a U-Net structure.

In one embodiment, the fundamental frequency prediction network includes a bidirectional long short-term memory network. When processing audio data, a bidirectional long short-term memory network can fully consider the preceding and following information in the audio data, improving the processing capability for audio data.

In one embodiment, the features output by the intermediate layer include features of the frame corresponding to the frame-loss position indicated by the sample frame-loss position information, and the predicted fundamental frequency output by the fundamental frequency prediction network includes the fundamental frequency of the frame corresponding to the frame-loss position.

In one embodiment, generating input features based on the sample packet loss audio and its corresponding sample frame-loss position information includes: performing sub-band decomposition on the sample packet loss audio to obtain multiple sub-bands; and generating input features based on the results of converting the multiple sub-bands into the time-frequency domain and the sample frame-loss position information. This allows the sample packet loss audio to be decomposed into multiple sub-bands for processing, significantly reducing the computational complexity.

In one embodiment, the encoder layer includes multiple encoders, each of which includes a gated convolution layer and a time-frequency dilated convolution layer; the time-frequency dilated convolution layer is used to extract features through dilated convolutions in the time and frequency dimensions. This can effectively increase the receptive field of the convolution layers.

In one embodiment, the decoder layer includes multiple decoders, each of which includes a first branch and a second branch in parallel; the first branch is used to predict the real part of the audio, and the second branch is used to predict the imaginary part of the audio. The neural network to be trained outputs a sample predicted audio based on the real and imaginary parts of the predicted audio output by the decoder layer. The method further includes: inputting the sample predicted audio into at least one pre-trained discriminator, each discriminator outputting a discrimination result for the sample predicted audio; calculating a loss based on the at least one discrimination result, the sample predicted audio, and the sample non-packet loss audio; and adjusting the network parameters of the neural network to be trained based on the loss. In this way, a generative adversarial network (GAN) structure can be used to train the neural network to be trained.

In one embodiment, the at least one discriminator includes a first discriminator for determining the probability that the sample predicted audio is real audio and a second discriminator for determining the audio quality of the sample predicted audio. Thus, the accuracy of the generator can be improved by using multiple discriminators.

In one embodiment, the method further includes: inputting the sample predicted audio output by the neural network to be trained and its corresponding sample non-packet loss audio into a pre-trained speech recognition model, respectively; obtaining the coding layer features of the sample predicted audio and of its corresponding sample non-packet loss audio in the speech recognition model; and adjusting the network parameters of the neural network to be trained based on the difference loss between the two obtained coding layer features. In this way, the pre-trained speech recognition model can be used to adjust the network parameters of the neural network to be trained, improving its accuracy.

According to a second aspect, a neural network-based voice packet loss compensation method is provided, including: obtaining a neural network for voice packet loss compensation trained according to any one of the methods of the first aspect; receiving audio to be processed and frame-loss position information corresponding to the audio to be processed; and inputting input features, generated based on the audio to be processed and its corresponding frame-loss position information, into the neural network to obtain packet-loss-compensated audio corresponding to the audio to be processed.

According to a third aspect, an apparatus for training a neural network for voice packet loss compensation is provided, wherein the neural network to be trained includes an encoder layer, an intermediate layer, and a decoder layer, the intermediate layer being connected between the encoder layer and the decoder layer. The apparatus includes: an acquisition unit configured to acquire a training sample set, wherein each training sample includes a sample packet loss audio, its corresponding sample frame-loss position information, and a sample non-packet loss audio; a generation unit configured to generate input features based on the sample packet loss audio and its corresponding sample frame-loss position information; a first input unit configured to input the input features into the neural network to be trained; a second input unit configured to input the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, which outputs a predicted fundamental frequency; and an adjustment unit configured to adjust the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the true fundamental frequency calculated from the sample non-packet loss audio.

According to a fourth aspect, a neural network-based voice packet loss compensation apparatus is provided, including: a model acquisition unit configured to obtain a neural network for voice packet loss compensation trained according to any one of the methods of the first aspect; a receiving unit configured to receive audio to be processed and frame-loss position information corresponding to the audio to be processed; and a feature input unit configured to input input features, generated based on the audio to be processed and its corresponding frame-loss position information, into the neural network to obtain packet-loss-compensated audio corresponding to the audio to be processed.

According to a fifth aspect, a computer program product is provided, including a computer program which, when executed by a processor, implements any one of the methods of the first aspect.

According to a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute any one of the methods of the first aspect.

According to a seventh aspect, an electronic device is provided, including a memory and a processor; the memory stores executable code, and when the processor executes the executable code, any one of the methods of the first aspect is implemented.

Brief Description of the Drawings

FIG. 1 is a schematic diagram showing an application scenario in which an embodiment of the present disclosure can be applied;

FIG. 2 is a schematic flow chart showing a method for training a neural network for voice packet loss compensation according to one embodiment;

FIG. 3 is a schematic diagram showing an example of an encoder structure;

FIG. 4 is a schematic diagram showing an example of a fundamental frequency prediction network structure;

FIG. 5 is a schematic diagram showing an example of the structure of the neural network to be trained;

FIG. 6 is a schematic diagram showing an example of a frequency-domain multi-resolution discriminator;

FIG. 7 is a schematic diagram showing an example of a time-domain multi-period discriminator;

FIG. 8 is a schematic diagram showing an example of MetricGAN discriminator training;

FIG. 9 is a schematic diagram showing a neural network-based voice packet loss compensation method according to one embodiment;

FIG. 10 is a schematic block diagram of an apparatus for training a neural network for voice packet loss compensation according to one embodiment;

FIG. 11 is a schematic block diagram of a neural network-based voice packet loss compensation apparatus according to one embodiment;

FIG. 12 is a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present application.

Detailed Description

It should be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.

For example, in response to receiving an active request from a user, prompt information is sent to the user to clearly inform the user that the operation requested will require the acquisition and use of the user's personal information. In this way, the user can autonomously choose, based on the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application, server, or storage medium, that performs the operations of the technical solutions of the present disclosure.

As an optional but non-limiting implementation, in response to receiving an active request from a user, the prompt information may be sent to the user in the form of a pop-up window, in which the prompt information may be presented as text. In addition, the pop-up window may also carry a selection control for the user to choose "agree" or "disagree" to providing personal information to the electronic device.

It should be understood that the above processes of notifying the user and obtaining the user's authorization are merely illustrative and do not limit the implementation of the present disclosure; other methods that comply with relevant laws and regulations may also be applied to the implementation of the present disclosure.

The technical solutions provided by the present disclosure are further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to explain the relevant invention, not to limit it. It should also be noted that, for ease of description, only the portions relevant to the invention are shown in the accompanying drawings. The embodiments of the present disclosure and the features therein may be combined with each other as long as they do not conflict.

As mentioned above, audio data packets may be lost during transmission, which can severely impact the quality of voice communication and degrade the user experience. The main goal of packet loss concealment (PLC) technology is to recover or mask lost data packets as much as possible by various means, so as to maintain the continuity and clarity of voice communication. Traditional packet loss compensation techniques can rely on redundant coding or signal-processing interpolation. For example, with the forward error correction techniques adopted in many codecs, when poor network conditions are detected, the sender can transmit redundant information about past frames so that short packet losses can be recovered. However, this approach introduces additional network overhead and additional delay, and cannot handle longer packet losses.

To this end, embodiments of the present disclosure provide a neural network-based voice packet loss compensation method. A neural network can first be trained using the method for training a neural network for voice packet loss compensation provided by the embodiments of the present disclosure, so that the trained neural network can compensate for voice packet loss more accurately without introducing additional network overhead or additional delay.

The neural network-based voice packet loss compensation method and apparatus provided by the embodiments of the present disclosure work as follows. First, a neural network needs to be trained. The neural network to be trained may include an encoder layer, an intermediate layer, and a decoder layer, the intermediate layer being connected between the encoder layer and the decoder layer. The training samples used include a sample packet loss audio, its corresponding sample frame-loss position information, and a sample non-packet loss audio. During training, input features can be generated based on the sample packet loss audio and its corresponding sample frame-loss position information and input into the neural network to be trained. The features output by the intermediate layer can then be input into a pre-trained fundamental frequency prediction network, which outputs a predicted fundamental frequency. Then, based on the predicted fundamental frequency and the true fundamental frequency calculated from the sample non-packet loss audio, the network parameters of the encoder layer and the intermediate layer are adjusted. In this way, the outputs of the encoder layer and the intermediate layer can be made more accurate, so that the trained neural network can perform voice packet loss compensation more accurately.

FIG. 1 is a schematic diagram of an application scenario in which embodiments of the present disclosure can be applied. As shown in FIG. 1, in this application scenario, the electronic device 10 can first train a neural network 101 for voice packet loss compensation. The electronic device 20 can then obtain the neural network 101 from the electronic device 10 and use the trained neural network 101 to perform voice packet loss compensation.

Specifically, the process of training the neural network 101 by the electronic device 10 may include the following steps 1 to 5. Step 1: Obtain a training sample set. Here, each training sample in the training sample set may include a sample packet loss audio, its corresponding sample frame-loss position information, and a sample non-packet loss audio. Step 2: Generate input features based on the sample packet loss audio and its corresponding sample frame-loss position information. Step 3: Input the input features into the neural network to be trained. In this application scenario, the neural network to be trained may include an encoder layer, an intermediate layer, and a decoder layer, with the intermediate layer connected between the encoder layer and the decoder layer. Step 4: Input the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, which outputs a predicted fundamental frequency. Step 5: Update the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the true fundamental frequency calculated from the sample non-packet loss audio. In addition, the network parameters of the entire neural network to be trained may also be updated based on the prediction results output by the decoder layer and the sample non-packet loss audio. This yields the trained neural network 101.

Afterwards, the electronic device 20 can obtain the trained neural network 101 from the electronic device 10, and input the audio to be processed and its corresponding frame-loss position information into the neural network 101, thereby obtaining the packet-loss-compensated audio corresponding to the audio to be processed.

Continuing with FIG. 2, FIG. 2 is a schematic flow chart of a method for training a neural network for voice packet loss compensation according to one embodiment. The method can be performed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in FIG. 2, the method for training a neural network for voice packet loss compensation may include the following steps 201 to 205. Specifically:

Step 201: Obtain a training sample set. In this embodiment, each training sample in the training sample set may include a sample packet loss audio, sample frame-loss position information corresponding to the sample packet loss audio, and a sample non-packet loss audio corresponding to the sample packet loss audio. The sample frame-loss position information may be used to indicate the positions of the frames in the sample packet loss audio where data loss occurred. In this example, the sample packet loss audio may be audio at various sampling rates, for example, audio at a 48 kHz sampling rate.

In practice, audio data is usually processed and transmitted in units of frames, and each frame can contain audio of a certain duration. During network transmission, due to network congestion, delay, or other reasons, some frames may fail to reach their destination, resulting in packet loss. To detect and handle such packet losses, a flag bit is usually set in each frame of audio data. For example, this flag bit can be a binary bit indicating whether the frame arrived intact. If the flag bit is set to the state indicating packet loss, the receiving end can detect the loss of that frame of data. The positions of the frames in the audio where data loss occurred can thus be determined from the flag bits.
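
For illustration only, the following sketch shows how per-frame arrival flags might be turned into the frame-loss position information m(t) used below; the frame length and the flag convention are assumptions, not part of the disclosure.

```python
# Hypothetical helper: build m(t) from per-frame arrival flags.
# The flag convention (True = frame arrived intact) is an assumption.
import numpy as np

def frame_loss_mask(arrival_flags):
    """Return m(t): 1.0 where a frame was lost, 0.0 where it arrived."""
    return np.array([0.0 if ok else 1.0 for ok in arrival_flags],
                    dtype=np.float32)

flags = [True, True, False, False, True]  # frames 2 and 3 were lost
print(frame_loss_mask(flags))             # [0. 0. 1. 1. 0.]
```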

Step 202: Generate input features based on the sample packet loss audio and its corresponding sample frame-loss position information. In this embodiment, the input features can be generated from the sample packet loss audio and its corresponding sample frame-loss position information. For example, the sample packet loss audio can first be converted to the frequency domain and then concatenated with the sample frame-loss position information to obtain the input features.

In some implementations, the above step 202 may include the following steps 1 and 2. Specifically: Step 1: perform sub-band decomposition on the sample packet loss audio to obtain multiple sub-bands. Step 2: generate input features based on the results of converting the multiple sub-bands into the time-frequency domain and the sample frame-loss position information.

In this implementation, the sub-band decomposition of the sample packet loss audio can be performed in multiple ways. For example, a stable and efficient pseudo-quadrature mirror filter bank (PQMF) can be used to decompose the sample packet loss audio into multiple sub-bands. In the PQMF, the sub-band decomposition can include K FIR (Finite Impulse Response) filter banks, and the sub-band decomposition process can include FIR analysis, downsampling, and the short-time Fourier transform (STFT).

For example, suppose y denotes the original audio to be input into the neural network to be trained, $y_k$ denotes the audio after sub-band analysis and downsampling, and $k \in [1, K]$ is the sub-band index, so that $y_k$ denotes the k-th sub-band. The sampling rate of $y_k$ is 1/K of that of y. $Y_k$ denotes the time-frequency representation of $y_k$ after the short-time Fourier transform (STFT). Stacking the $Y_k$ along the channel dimension yields the frequency-domain sub-band feature that is input to the neural network to be trained. This feature can then be concatenated with the frame-loss position information directly to obtain the input features, or it can first be compressed and then concatenated with the frame-loss position information. For example, the input features can be obtained by applying magnitude-spectrum compression with a scaling factor of 0.5 and then concatenating with the frame-loss position information.

It can be understood that the sub-band decoding process corresponding to the above sub-band decomposition process can include the iSTFT (Inverse Short-Time Fourier Transform), upsampling, and FIR synthesis. Corresponding to the sub-band analysis process, the neural network outputs a predicted spectrum $\hat{Y}_k$ for each sub-band; applying the inverse Fourier transform to each $\hat{Y}_k$ yields the time-domain sub-band audio $\hat{y}_k$, which is restored to the audio $\hat{y}$ after upsampling and the synthesis filters. In this implementation, K can be 4, i.e., the audio signal is divided into 4 sub-bands for processing.
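
As a rough illustration of this analysis path, the sketch below strings together sub-band splitting, the STFT, magnitude compression with a factor of 0.5, and stacking with m(t). The `pqmf_analysis` helper is a hypothetical stand-in for the K FIR analysis filters plus downsampling (a real PQMF prototype filter is not shown), and the STFT parameters are assumptions.

```python
# A minimal sketch of the sub-band feature preparation, assuming K = 4
# sub-bands and a compression factor of 0.5; not the patented filter bank.
import torch

K, N_FFT, HOP = 4, 320, 160  # assumed values

def pqmf_analysis(y):
    """Placeholder for FIR analysis + K-fold downsampling: (B, N) -> (B, K, N//K)."""
    B, N = y.shape
    return y.reshape(B, K, N // K)  # stand-in only; N must be divisible by K

def make_input_features(y, m):
    """y: (B, N) packet-loss audio; m: (B, T) frame-loss mask, one entry per STFT frame."""
    subbands = pqmf_analysis(y)                                  # (B, K, N/K)
    spec = torch.stft(subbands.flatten(0, 1), N_FFT, HOP,
                      window=torch.hann_window(N_FFT),
                      return_complex=True)                       # (B*K, F, T)
    spec = spec.unflatten(0, (y.shape[0], K))                    # (B, K, F, T)
    mag = spec.abs().clamp_min(1e-8) ** 0.5                      # compression factor 0.5
    phase = torch.angle(spec)
    feats = torch.cat([mag * torch.cos(phase),
                       mag * torch.sin(phase)], dim=1)           # (B, 2K, F, T)
    m_map = m[:, None, None, :].expand(-1, 1, feats.shape[2], -1)
    return torch.cat([feats, m_map], dim=1)                      # stack along channels
```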

Through this implementation, the sample packet loss audio can be decomposed into multiple sub-bands for processing, which greatly reduces the computational complexity. This implementation is therefore especially suitable for audio at high sampling rates, for example, audio with a sampling rate greater than or equal to 48 kHz.

Step 203: Input the input features into the neural network to be trained. In this embodiment, the obtained input features can be input into the neural network to be trained and processed by it. Here, the neural network to be trained can include an encoder layer, an intermediate layer, and a decoder layer. The intermediate layer is connected to the encoder layer and the decoder layer; the encoder layer and the decoder layer may or may not be connected to each other.

In some implementations, the neural network to be trained can be a neural network with a U-Net structure. The U-Net structure can mainly include encoders, a bottleneck layer, and decoders, and the bottleneck layer of the U-Net structure can serve as the intermediate layer. In the U-Net structure, skip connections can be used between the encoders and the decoders.

In some implementations, the encoder layer can include several encoders, and each encoder can include various layers as needed, for example, convolution layers, pooling layers, and so on. As an example, each encoder can include a gated convolution layer and a time-frequency dilated convolution layer, where the time-frequency dilated convolution layer can be used to extract features through dilated convolutions in the time and frequency dimensions. As shown in FIG. 3, FIG. 3 is a schematic diagram showing an example of an encoder structure. In the example shown in FIG. 3, the encoder can include, in sequence, a gated convolution layer and a time-frequency dilated convolution layer, and the time-frequency dilated convolution layer can include dilated convolutions FConv (Frequency Dilated Convolution) and TConv (Time Dilated Convolution) in the frequency and time dimensions. FConv and TConv can be regarded as multi-scale modeling along the frequency axis and the time axis, fully perceiving historical information. In addition, the time-frequency dilated convolution layer can also include a BN (Batch Normalization) layer, a PReLU (Parametric Rectified Linear Unit) activation function, PWConv (pointwise convolution), and so on, where PWConv can be used to align the input and output dimensions. Here, concatenating b dilated convolution layers with dilation rates from 1 to 2^(b-1) forms one time-frequency dilated convolution layer, which can effectively increase the receptive field of the convolution layer. It can be understood that the encoder structure shown in FIG. 3 is merely illustrative and does not limit the encoder structure; in practice, the layers included in the encoder can be set according to actual needs.
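
The following is a minimal PyTorch sketch of one such time-frequency dilated convolution stack, assuming feature maps of shape (B, C, T, F); the kernel sizes and the residual connection are assumptions, since the text fixes only the layer types and the dilation schedule.

```python
# Sketch of a time-frequency dilated convolution layer: b parallel branches
# with dilations 1, 2, ..., 2**(b-1) along time (TConv) and frequency (FConv),
# fused by a pointwise convolution (PWConv) to align dimensions.
import torch
import torch.nn as nn

class TFDilatedBlock(nn.Module):
    def __init__(self, channels, b=4):
        super().__init__()
        self.branches = nn.ModuleList()
        for i in range(b):
            d = 2 ** i
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, (3, 1),
                          dilation=(d, 1), padding=(d, 0)),  # TConv: time axis
                nn.BatchNorm2d(channels),
                nn.PReLU(),
                nn.Conv2d(channels, channels, (1, 3),
                          dilation=(1, d), padding=(0, d)),  # FConv: frequency axis
                nn.BatchNorm2d(channels),
                nn.PReLU(),
            ))
        self.pw = nn.Conv2d(b * channels, channels, 1)       # PWConv

    def forward(self, x):
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.pw(y) + x  # residual connection (an assumption)
```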

Step 204: Input the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, which outputs a predicted fundamental frequency. In this embodiment, the intermediate layer can be used to perform packet loss compensation and can be implemented by various neural networks. For example, the intermediate layer can be implemented by bidirectional GRUs (Gate Recurrent Units), which can extract correlations along the frequency and time dimensions to perform packet loss compensation.

Accordingly, the features output by the intermediate layer can include features of the frame corresponding to the frame-loss position indicated by the sample frame-loss position information, and the pre-trained fundamental frequency prediction network can be used to predict the fundamental frequency from the input features. Therefore, by inputting the features output by the intermediate layer into the fundamental frequency prediction network, the network can output a predicted fundamental frequency, which can include the fundamental frequency of the frame corresponding to the frame-loss position. In this example, the fundamental frequency prediction network can be any of various neural networks.

In some implementations, the fundamental frequency prediction network can include a bidirectional long short-term memory network (Bi-LSTM). As shown in FIG. 4, FIG. 4 is a schematic diagram showing an example of a fundamental frequency prediction network structure. In the example shown in FIG. 4, the fundamental frequency prediction network can include, in sequence, a reshape layer, a Maxpool (max pooling) layer, a reshape layer, a Bi-LSTM layer, a Linear layer, a reshape layer, a Linear_C layer, a Linear_F layer, and so on. Here, the output of the first reshape layer forms a skip connection with the output of the Linear layer. The reshape layers can be used to adjust data dimensions, the Maxpool layer can be used for downsampling, and the Linear layer can be used for linear mapping; Linear_C can denote a linear mapping along the channel dimension, and Linear_F a linear mapping along the frequency dimension. It can be understood that the fundamental frequency prediction network structure in FIG. 4 is merely illustrative and does not limit the structure; in practice, the layers included in the fundamental frequency prediction network can be set according to actual needs.
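
A rough sketch of this head is given below, following the layer ordering of FIG. 4 and assuming the bottleneck features arrive as (B, C, T, F); all layer widths are illustrative, since the figure fixes only the ordering.

```python
# Sketch of the fundamental-frequency prediction head (layer order per FIG. 4).
import torch
import torch.nn as nn

class F0Predictor(nn.Module):
    def __init__(self, c_in, f_in, hidden=128):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=(1, 2))          # downsample along frequency
        self.lstm = nn.LSTM(c_in * (f_in // 2), hidden,
                            batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, c_in * (f_in // 2))
        self.linear_c = nn.Linear(c_in, 1)                    # Linear_C: channel dim
        self.linear_f = nn.Linear(f_in // 2, 1)               # Linear_F: frequency dim

    def forward(self, x):
        B, C, T, F = x.shape
        h = self.pool(x)                                      # (B, C, T, F/2)
        skip = h.permute(0, 2, 1, 3).reshape(B, T, -1)        # reshape to (B, T, C*F/2)
        out, _ = self.lstm(skip)
        out = self.linear(out) + skip                         # skip connection
        out = out.reshape(B, T, C, F // 2)
        out = self.linear_c(out.transpose(2, 3)).squeeze(-1)  # (B, T, F/2)
        return self.linear_f(out).squeeze(-1)                 # (B, T): one f0 per frame
```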

As an example, the fundamental frequency prediction network can be trained in various ways, for example, in a supervised manner. In the supervised manner, a sample set can first be obtained, and each sample in the sample set can include sample audio features and a fundamental frequency label corresponding to the sample audio features. Here, the sample audio features can be obtained through the network structure of the neural network to be trained. The sample audio features can then be input into the fundamental frequency prediction network to be trained, which outputs a predicted fundamental frequency. Then, the difference loss between the predicted fundamental frequency and the fundamental frequency label is calculated based on a predefined loss function, and the network parameters of the fundamental frequency prediction network are adjusted with the goal of minimizing this difference loss.

Step 205: Adjust the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the true fundamental frequency calculated from the sample non-packet loss audio. In this embodiment, the true fundamental frequency of each frame can be calculated using the sample non-packet loss audio, and the network parameters of the encoder layer and the intermediate layer can then be adjusted based on the predicted and true fundamental frequencies. For example, the following formula can be used as the loss function:

$$L_{f0} = \frac{1}{T}\sum_{i=1}^{T}\left(\hat{f}_i - f_i\right)^2$$

where T can denote the number of frames, $\hat{f}_i$ the fundamental frequency of the i-th frame predicted by the fundamental frequency prediction network, $f_i$ the true fundamental frequency of the i-th frame, and $L_{f0}$ the difference loss of the fundamental frequency prediction. The network parameters of the encoder layer and the intermediate layer can then be adjusted with the goal of minimizing $L_{f0}$.
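
In code, $L_{f0}$ might be computed as follows (a sketch; the MSE form mirrors the formula above).

```python
import torch

def f0_loss(f0_pred, f0_true):
    """f0_pred, f0_true: (B, T) per-frame fundamental frequencies."""
    return torch.mean((f0_pred - f0_true) ** 2)
```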

In some implementations, corresponding to the encoder layer, the decoder layer can also include several decoders, and each decoder can include a first branch and a second branch in parallel. The first branch can be used to predict the real part of the audio, and the second branch can be used to predict the imaginary part of the audio. The neural network to be trained can output a sample predicted audio based on the real and imaginary parts of the predicted audio output by the decoder layer.

As shown in FIG. 5, FIG. 5 is a schematic diagram showing an example of the structure of the neural network to be trained. In the example shown in FIG. 5, the input can include the sample packet loss audio and the sample frame-loss position information m(t), t ∈ T, where B can denote the batch size, N the audio length, and T the number of frames. The output can include the sample predicted audio after packet loss compensation.

In the example shown in FIG. 5, the sample packet loss audio passes through the pseudo-quadrature mirror filter bank (PQMF), the short-time Fourier transform (STFT), and compression in sequence, is stacked with the sample frame-loss position information m(t), and is then input into the neural network to be trained. The number of encoders in the encoder layer is the same as the number of decoders in the decoder layer, and each encoder can be skip-connected to its corresponding decoder. In this example, the encoder layer can include 4 encoders, each of which can include a gated convolution layer and a time-frequency dilated convolution layer. The intermediate layer can be implemented by 3 bidirectional GRUs (Gate Recurrent Units), which can extract correlations along the frequency and time dimensions to perform packet loss compensation. Correspondingly, the decoder layer also includes 4 decoders, each of which can include a transposed gated convolution layer and a time-frequency dilated convolution layer. Each decoder can include a first branch and a second branch in parallel; the first branch can be used to predict the real part of the audio, and the second branch can be used to predict the imaginary part. The real and imaginary parts of the predicted audio output by the decoder layer then pass through concatenation (concat), decompression (uncompress), the inverse short-time Fourier transform (ISTFT), and the pseudo-quadrature mirror filter bank (PQMF) in sequence, yielding the sample predicted audio after packet loss compensation.

It can be understood that the parts of the neural network structure shown in FIG. 5 are merely illustrative and do not limit the structure of the neural network to be trained. In practice, the structure of the neural network to be trained can be designed according to actual needs.
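
For completeness, the sketch below shows how the two decoder branches might be recombined and resynthesized (concatenation of the real and imaginary parts, decompression, iSTFT, PQMF synthesis), mirroring FIG. 5. `pqmf_synthesis` is a hypothetical stand-in, like `pqmf_analysis` in the earlier sketch.

```python
# A sketch of the output path: complex spectrum -> undo 0.5 compression ->
# iSTFT per sub-band -> synthesis filter bank. Constants match the analysis sketch.
import torch

N_FFT, HOP = 320, 160

def pqmf_synthesis(subbands):
    """Placeholder for upsampling + FIR synthesis: (B, K, N//K) -> (B, N)."""
    B, K, n = subbands.shape
    return subbands.reshape(B, K * n)  # stand-in only

def resynthesize(real, imag):
    """real/imag: (B, K, F, T) compressed real/imaginary predictions."""
    spec = torch.complex(real, imag)
    mag = spec.abs().clamp_min(1e-8)
    spec = torch.polar(mag ** 2.0, torch.angle(spec))  # invert the 0.5 compression
    B, K, F, T = spec.shape
    wav = torch.istft(spec.flatten(0, 1), N_FFT, HOP,
                      window=torch.hann_window(N_FFT))
    return pqmf_synthesis(wav.unflatten(0, (B, K)))
```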

Based on the sample predicted audio output by the neural network to be trained, the above method for training a neural network for voice packet loss compensation may further include the following steps S1 and S2. Specifically: Step S1: input the sample predicted audio into at least one pre-trained discriminator, each discriminator outputting a discrimination result for the sample predicted audio. Step S2: calculate a loss based on the at least one discrimination result, the sample predicted audio, and the sample non-packet loss audio, and adjust the network parameters of the neural network to be trained based on the loss.

In this implementation, a generative adversarial network (GAN) structure can be used to train the neural network to be trained. The neural network to be trained can serve as the generator; the sample predicted audio it outputs can be input into a discriminator, which outputs a discrimination result for the sample predicted audio. The network parameters of the neural network to be trained are then adjusted based on the discrimination result output by the discriminator. In this example, one or more discriminators can be used.

In some implementations, the at least one discriminator can include a first discriminator for determining the probability that the sample predicted audio is real audio, and can further include a second discriminator for determining the audio quality of the sample predicted audio.

For example, the first discriminator can include a frequency-domain multi-resolution discriminator, a time-domain multi-period discriminator, and so on. During training, the real non-packet loss audio and the audio generated by the generator can be input into the first discriminator, which determines whether the input audio was generated by the generator. The discriminator loss is calculated and backpropagated to update the network parameters of the first discriminator. In this example, the loss of the first discriminator can be calculated by the following formula:

$$L_D = \mathbb{E}_s\left[\left(D(s) - 1\right)^2\right] + \mathbb{E}_x\left[D(G(x))^2\right]$$

where s can denote the real non-packet loss audio; x the packet loss audio corresponding to s; G the generator; D the discriminator; G(x) the result of processing x with the generator; and $\mathbb{E}$ the expectation, whose subscript indicates the variable over which the expectation is taken.
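
In code, these least-squares adversarial terms might look as follows (a sketch; D stands for any of the first discriminators).

```python
import torch

def discriminator_loss(d_real, d_fake):
    """d_real = D(s); d_fake = D(G(x).detach()). Pushes real scores to 1, fake to 0."""
    return torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)

def generator_adv_loss(d_fake):
    """Adversarial term for the generator: pushes D(G(x)) toward 1."""
    return torch.mean((d_fake - 1.0) ** 2)
```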

As mentioned above, the first discriminator can include a frequency-domain multi-resolution discriminator. As shown in FIG. 6, FIG. 6 is a schematic diagram showing an example of a frequency-domain multi-resolution discriminator. In the example shown in FIG. 6, after the audio waveform is input into the frequency-domain multi-resolution discriminator, the discriminator can transform the time-domain waveform using short-time Fourier transforms (STFT) with different window lengths and window shifts, and use two-dimensional convolutions to downsample the features at different spectral resolutions. For example, the window lengths can be [30, 60, 120, 240, 480, 960], so that the input audio is discriminated at different frequency resolutions. It can be understood that the frequency-domain multi-resolution discriminator shown in FIG. 6 is a well-known existing discriminator; the contents of its parts are as shown in the figure and are not repeated here.

The first discriminator can also include a time-domain multi-period discriminator (MPD). As shown in FIG. 7, FIG. 7 is a schematic diagram showing an example of a time-domain multi-period discriminator. In the example shown in FIG. 7, the time-domain multi-period discriminator can fold the input one-dimensional sequence of sample points into a two-dimensional plane at a given period and then process it with two-dimensional convolutions. Specifically, the sub-discriminator for each specific period first pads the input so that the number of sample points is an integer multiple of the period, to facilitate folding into a two-dimensional plane. The input then passes through multiple convolution layers, whose output channel numbers can be, for example, [32, 128, 512, 1024], each followed by a leaky_relu activation. Finally, a convolution layer with, for example, 1024 input channels and 1 output channel performs post-processing, and the result is flattened as the final output of the time-domain multi-period discriminator. It can be understood that the time-domain multi-period discriminator shown in FIG. 7 is a well-known existing discriminator; the contents of its parts are as shown in the figure and are not repeated here.
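
The period folding that this discriminator performs before its 2D convolutions can be sketched as follows; the reflection padding is an assumption consistent with common multi-period discriminator implementations.

```python
# Fold a 1D waveform into a 2D plane so that each column holds samples
# one period apart, ready for 2D convolution.
import torch
import torch.nn.functional as F

def fold_by_period(wav, period):
    """wav: (B, N) -> (B, 1, N_padded // period, period)."""
    B, N = wav.shape
    pad = (period - N % period) % period          # pad to a multiple of the period
    if pad:
        wav = F.pad(wav.unsqueeze(1), (0, pad), mode="reflect").squeeze(1)
    return wav.reshape(B, 1, -1, period)

x = torch.randn(2, 16000)
print(fold_by_period(x, 7).shape)  # torch.Size([2, 1, 2286, 7])
```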

As an example, the second discriminator can be a MetricGAN discriminator. The packet-loss-compensated audio can be input into the MetricGAN discriminator to estimate the PESQ (Perceptual Evaluation of Speech Quality), a loss can be calculated against the true PESQ obtained through a Python library, and the MetricGAN discriminator can be updated according to the loss. For example, the loss of the MetricGAN discriminator can be calculated by the following loss function:

$$L_D = \mathbb{E}_{(x,s)}\left[\left(D(G(x)) - Q'(G(x), s)\right)^2\right]$$

where s can denote the real non-packet loss audio; x the packet loss audio corresponding to s; G the generator; D the discriminator; G(x) the result of processing x with the generator; Q' the pypesq library function; and $\mathbb{E}$ the expectation.

As shown in FIG. 8, FIG. 8 is a schematic diagram showing an example of MetricGAN discriminator training. In the example shown in FIG. 8, for a given input, the MetricGAN discriminator can output a predicted PESQ score. As an example, the MetricGAN discriminator can include four downsampling two-dimensional convolution layers followed by three fully connected layers, and its output is a predicted PESQ score. During the training phase, the MSE (Mean Squared Error) loss between the predicted PESQ score and the true PESQ score output by the pypesq library function is first calculated to update the network parameters of the MetricGAN discriminator. Then, the MSE loss against the maximum PESQ value is calculated to update the network parameters of the generator.
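
A sketch of the two MetricGAN updates is given below; `metric_d`, which maps (enhanced, clean) audio pairs to a predicted PESQ score, and `true_pesq`, which wraps the pypesq library, are hypothetical callables, and the PESQ ceiling of 4.5 is an assumption.

```python
import torch

def metricgan_d_loss(metric_d, enhanced, clean, true_pesq):
    """MSE between the predicted and the true PESQ score (discriminator update)."""
    pred = metric_d(enhanced.detach(), clean)
    target = true_pesq(enhanced.detach(), clean)  # e.g., computed via pypesq
    return torch.mean((pred - target) ** 2)

def metricgan_g_loss(metric_d, enhanced, clean, pesq_max=4.5):
    """MSE against the maximum PESQ (generator update)."""
    return torch.mean((metric_d(enhanced, clean) - pesq_max) ** 2)
```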

In some implementations, the above method for training a neural network for voice packet loss compensation may further include the following steps 1 to 3. Specifically: Step 1: input the sample predicted audio output by the neural network to be trained and its corresponding sample non-packet loss audio into a pre-trained speech recognition model, respectively. In this example, the speech recognition model can be any existing speech recognition model, for example, a whisper model. Step 2: obtain the coding layer features of the sample predicted audio and of its corresponding sample non-packet loss audio in the speech recognition model. Step 3: adjust the network parameters of the neural network to be trained based on the difference loss between the two obtained coding layer features. For example, the network parameters of the neural network to be trained can be adjusted with the goal of minimizing the difference loss.

In this implementation, taking a whisper model as the speech recognition model as an example, the difference loss between the two coding layer features can be calculated by the following formula:

$$L_{w} = \mathbb{E}\left[\left(W(\hat{s}) - W(s)\right)^2\right]$$

where $W(\hat{s})$ can denote the coding layer features of the whisper model for the sample predicted audio, and $W(s)$ the coding layer features of the whisper model for the sample non-packet loss audio.
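
The sketch below illustrates such a feature loss with the openai-whisper package; the model size, the 16 kHz/30 s input handling, and the L1 distance are assumptions, as the disclosure only requires encoder features from a pre-trained recognizer.

```python
# Perceptual loss over a frozen whisper encoder (illustrative only).
import torch
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
model.requires_grad_(False)  # the recognizer stays frozen

def whisper_feature_loss(pred_audio, clean_audio):
    """pred_audio/clean_audio: 1D 16 kHz mono tensors."""
    def encode(audio, grad):
        mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(audio))
        with torch.set_grad_enabled(grad):
            return model.encoder(mel[None])
    return torch.mean(torch.abs(encode(pred_audio, True) - encode(clean_audio, False)))
```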

It can be understood that the network parameters of the neural network to be trained can also be adjusted based on the sample predicted audio and its corresponding sample non-packet loss audio. For example, the difference loss between the sample predicted audio and its corresponding sample non-packet loss audio can be calculated, and the network parameters of the neural network to be trained can be adjusted with the goal of minimizing the difference loss. For example, the difference loss can be calculated by the following formula:

$$L_{mae} = \frac{1}{N}\sum_{i=1}^{N}\left| s_i - \hat{s}_i \right|$$

where $L_{mae}$ can denote the MAE (Mean Absolute Error) loss in the time domain, N the audio length, $s_i$ the sample non-packet loss audio at the i-th sampling point, and $\hat{s}_i$ the sample predicted audio at the i-th sampling point.

A compressed spectral loss can also be used:

$$L_{plcpa} = L_a + L_p, \quad L_a = \sum_{t,f}\left(\left|S(t,f)\right|^p - \left|\hat{S}(t,f)\right|^p\right)^2, \quad L_p = \sum_{t,f}\left|\,\left|S(t,f)\right|^p e^{j\varphi_{S}(t,f)} - \left|\hat{S}(t,f)\right|^p e^{j\varphi_{\hat{S}}(t,f)}\,\right|^2$$

Here, $L_{plcpa}$ can denote the mean squared error of the compressed magnitude spectrum in the frequency domain, $S(t,f)$ the spectrum of the sample non-packet loss audio, $\hat{S}(t,f)$ the spectrum of the sample predicted audio, t and f the indices of the time and frequency dimensions respectively, p the compression coefficient, and j and $\varphi$ the phase terms related to the Fourier transform. $L_a$ can denote the mean squared error of the magnitude spectrum, and $L_p$ the mean squared error of the phase.
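
In code, the two terms might be combined as follows (a sketch; the compression coefficient p = 0.3 is a common choice and an assumption here).

```python
import torch

def mae_loss(s_hat, s):
    """Time-domain mean absolute error between predicted and reference audio."""
    return torch.mean(torch.abs(s_hat - s))

def plcpa_loss(S_hat, S, p=0.3):
    """S_hat, S: complex STFTs of the predicted and the reference audio."""
    mag_hat, mag = S_hat.abs().clamp_min(1e-8), S.abs().clamp_min(1e-8)
    l_amp = torch.mean((mag_hat ** p - mag ** p) ** 2)          # magnitude term
    comp_hat = (mag_hat ** p) * (S_hat / mag_hat)               # compressed complex
    comp = (mag ** p) * (S / mag)
    l_phase = torch.mean(torch.abs(comp_hat - comp) ** 2)       # phase-aware term
    return l_amp + l_phase
```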

For another example, the difference loss can also be calculated by the following formula:

$$L_G = \mathbb{E}_{(x,s)}\left[\left(D(G(x)) - PESQ_{max}\right)^2\right]$$

where s can denote the real non-packet loss audio; x the packet loss audio corresponding to s; G the generator; D the discriminator; G(x) the result of processing x with the generator; and $PESQ_{max}$ the maximum value of the PESQ metric.

The above describes the training process of the neural network for voice packet loss compensation. The neural network obtained in this way can perform packet loss compensation on input audio.

Next, please continue to refer to FIG. 9, which shows a neural network-based voice packet loss compensation method according to one embodiment. The method can be performed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in FIG. 9, the neural network-based voice packet loss compensation method can include the following steps 901 to 903. Specifically: Step 901: obtain a pre-trained neural network for voice packet loss compensation. In this embodiment, the neural network for voice packet loss compensation trained by the method described with reference to FIG. 2 can be obtained; this neural network can perform packet loss compensation based on packet loss audio and the frame-loss position information corresponding to the packet loss audio. Step 902: receive audio to be processed and frame-loss position information corresponding to the audio to be processed. Step 903: input the input features generated based on the audio to be processed and its corresponding frame-loss position information into the neural network to obtain the packet-loss-compensated audio corresponding to the audio to be processed. According to an embodiment of another aspect, an apparatus for training a neural network for voice packet loss compensation is provided. The apparatus can be deployed on any apparatus, device, platform, device cluster, and so on, with computing and processing capabilities.
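
Tying the earlier sketches together, inference over steps 901 to 903 might look like this; `plc_net`, `make_input_features`, and `resynthesize` are the hypothetical helpers introduced above, not the patented implementation.

```python
import torch

@torch.no_grad()
def conceal(plc_net, wav, m):
    """wav: (B, N) received audio with lost frames zeroed; m: (B, T) frame-loss mask."""
    feats = make_input_features(wav, m)   # step 903: build the input features
    real, imag = plc_net(feats)           # dual-branch real/imaginary prediction
    return resynthesize(real, imag)       # packet-loss-compensated audio
```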

FIG. 10 is a schematic block diagram of an apparatus for training a neural network for voice packet loss compensation according to one embodiment. The apparatus shown in FIG. 10 is used to perform the method shown in FIG. 2. The neural network to be trained includes an encoder layer, an intermediate layer, and a decoder layer, the intermediate layer being connected between the encoder layer and the decoder layer. As shown in FIG. 10, the apparatus 1000 for training a neural network for voice packet loss compensation includes: an acquisition unit 1001 configured to acquire a training sample set, each training sample including a sample packet loss audio, its corresponding sample frame-loss position information, and a sample non-packet loss audio; a generation unit 1002 configured to generate input features based on the sample packet loss audio and its corresponding sample frame-loss position information; a first input unit 1003 configured to input the input features into the neural network to be trained; a second input unit 1004 configured to input the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, which outputs a predicted fundamental frequency; and an adjustment unit 1005 configured to adjust the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the true fundamental frequency calculated from the sample non-packet loss audio.

In some optional implementations of this embodiment, the neural network to be trained is a neural network with a U-Net structure, and the intermediate layer is the bottleneck layer of the U-Net structure.

In some optional implementations of this embodiment, the fundamental frequency prediction network includes a bidirectional long short-term memory (BiLSTM) network; a minimal definition is sketched below.
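A minimal BiLSTM-based fundamental frequency predictor consistent with this description might look as follows; the feature width, hidden size, and layer count are illustrative choices, not values taken from the disclosure.

```python
import torch
import torch.nn as nn

class F0Predictor(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)    # one F0 value per frame

    def forward(self, feats):                   # feats: (batch, frames, feat_dim)
        out, _ = self.blstm(feats)
        return self.proj(out)                   # (batch, frames, 1)

pred_f0 = F0Predictor()(torch.randn(2, 101, 64))  # per-frame F0 estimates
```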

In some optional implementations of this embodiment, the features output by the intermediate layer include the features of the frames at the frame-loss positions indicated by the sample frame-loss position information, and the predicted fundamental frequency output by the fundamental frequency prediction network includes the fundamental frequency of those frames.

In some optional implementations of this embodiment, the generation unit 1002 is further configured to perform sub-band decomposition on the sample packet-loss audio to obtain multiple sub-bands, and to generate the input features based on the results of converting the multiple sub-bands into the time-frequency domain and on the sample frame-loss position information. A sketch of one such feature pipeline follows.
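One plausible reading of this pipeline, sketched under stated assumptions: split the waveform into uniform sub-bands by polyphase decimation (a crude stand-in for a proper analysis filter bank), transform each sub-band to the time-frequency domain with an STFT, and append the per-frame loss flags as an extra channel. The band count and STFT sizes are illustrative.

```python
import torch

def subband_features(audio: torch.Tensor, loss_mask: torch.Tensor,
                     bands: int = 4, n_fft: int = 128, hop: int = 64):
    # Polyphase split: sub-band k keeps samples k, k+bands, k+2*bands, ...
    sub = audio[: len(audio) // bands * bands].view(-1, bands).t()
    spec = torch.stft(sub.contiguous(), n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    frames = spec.shape[-1]
    mask = loss_mask[:frames].expand(1, spec.shape[1], frames)  # loss channel
    feats = torch.cat([spec.real, spec.imag], dim=0)  # (2*bands, bins, frames)
    return torch.cat([feats, mask], dim=0)            # append frame-loss flags

feats = subband_features(torch.randn(16000), torch.zeros(200))
print(feats.shape)  # torch.Size([9, 65, 63]) with the defaults above
```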

In some optional implementations of this embodiment, the encoder layer includes multiple encoders, each of which includes a gated convolution layer and a time-frequency dilated convolution layer, the latter extracting features through dilated convolution along both the time dimension and the frequency dimension. One such block is sketched below.
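A hedged sketch of one encoder block matching this description: a 2-D convolution whose output is split into a value path and a sigmoid gate (the gated convolution), followed by a convolution dilated in both the time and frequency dimensions. Channel counts and dilation rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, ch_in, ch_out, dilation=(2, 4)):  # (time, freq) rates
        super().__init__()
        self.conv = nn.Conv2d(ch_in, 2 * ch_out, kernel_size=3, padding=1)
        dt, df = dilation
        self.tf_dilated = nn.Conv2d(ch_out, ch_out, kernel_size=3,
                                    dilation=(dt, df), padding=(dt, df))

    def forward(self, x):                       # x: (batch, ch, time, freq)
        a, b = self.conv(x).chunk(2, dim=1)     # gated conv: value * sigmoid(gate)
        x = a * torch.sigmoid(b)
        return self.tf_dilated(x)               # dilated in time and frequency

block = EncoderBlock(9, 32)
out = block(torch.randn(2, 9, 63, 65))          # keeps the (time, freq) shape
```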

In some optional implementations of this embodiment, the decoder layer includes multiple decoders, each of which includes a first branch and a second branch in parallel, the first branch predicting the real part of the audio and the second branch predicting the imaginary part; the neural network to be trained outputs sample predicted audio based on the real and imaginary parts of the predicted audio output by the decoder layer. In addition, the apparatus 1000 further includes: a third input unit (not shown), configured to input the sample predicted audio into at least one pre-trained discriminator, each discriminator outputting a discrimination result for the sample predicted audio; and a calculation unit (not shown), configured to calculate a loss based on at least one discrimination result, the sample predicted audio, and the sample packet-loss-free audio, and to adjust the network parameters of the neural network to be trained based on the loss. A sketch of the dual-branch head and the adversarial loss follows.
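The sketch below shows a dual-branch head producing real and imaginary spectra that are inverted to a waveform, plus a toy discriminator and a hinge-style generator loss combined with an L1 reconstruction term. All sizes, the discriminator design, and the loss weighting are illustrative assumptions rather than the claimed configuration.

```python
import torch
import torch.nn as nn

class DecoderHead(nn.Module):
    def __init__(self, ch, freq_bins=161):
        super().__init__()
        self.real_branch = nn.Conv1d(ch, freq_bins, 1)  # first (real) branch
        self.imag_branch = nn.Conv1d(ch, freq_bins, 1)  # second (imag) branch

    def forward(self, h):                       # h: (batch, ch, frames)
        spec = torch.complex(self.real_branch(h), self.imag_branch(h))
        return torch.istft(spec, n_fft=320, hop_length=160,
                           window=torch.hann_window(320))

head = DecoderHead(64)
pred_audio = head(torch.randn(4, 64, 101))      # sample predicted audio

disc = nn.Sequential(nn.Conv1d(1, 16, 15, stride=4), nn.LeakyReLU(),
                     nn.Conv1d(16, 1, 15, stride=4))   # stand-in discriminator
score = disc(pred_audio.unsqueeze(1))
gen_adv_loss = torch.relu(1.0 - score).mean()   # push scores toward "real"
clean_audio = torch.randn_like(pred_audio)      # stand-in packet-loss-free audio
recon_loss = nn.functional.l1_loss(pred_audio, clean_audio)
total = recon_loss + 0.1 * gen_adv_loss         # weighting is an assumption
```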

In some optional implementations of this embodiment, the at least one discriminator includes a first discriminator for estimating the probability that the sample predicted audio is real audio, and a second discriminator for assessing the audio quality of the sample predicted audio.
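One plausible reading of the second, quality-oriented discriminator, in the spirit of MetricGAN-style training (an assumption, since the disclosure does not name a method): instead of a real/fake probability it regresses a normalized quality score, and the generator is pushed toward the maximum score.

```python
import torch
import torch.nn as nn

quality_disc = nn.Sequential(nn.Conv1d(1, 16, 15, stride=8), nn.LeakyReLU(),
                             nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                             nn.Linear(16, 1), nn.Sigmoid())  # score in (0, 1)

pred_audio = torch.randn(4, 16000)
q = quality_disc(pred_audio.unsqueeze(1))       # predicted quality per clip
quality_loss = ((q - 1.0) ** 2).mean()          # drive toward the best score
```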

In some optional implementations of this embodiment, the apparatus 1000 further includes: a fourth input unit (not shown), configured to input the sample predicted audio output by the neural network to be trained and its corresponding sample packet-loss-free audio into a pre-trained speech recognition model, respectively; an encoding layer feature acquisition unit (not shown), configured to obtain the encoding layer features of the sample predicted audio and of its corresponding sample packet-loss-free audio in the speech recognition model; and a network parameter adjustment unit (not shown), configured to adjust the network parameters of the neural network to be trained based on a difference loss between the two obtained encoding layer features. A sketch of this feature-consistency loss follows.
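A hedged sketch of the ASR feature-consistency loss: both waveforms pass through a frozen encoder, and the distance between their features is penalized. The stand-in encoder below is purely illustrative; in practice it would be the encoding layers of an actual pre-trained speech recognition model.

```python
import torch
import torch.nn as nn

asr_encoder = nn.Sequential(nn.Conv1d(1, 80, 25, stride=10), nn.ReLU(),
                            nn.Conv1d(80, 256, 3, padding=1))  # stand-in
for p in asr_encoder.parameters():
    p.requires_grad_(False)                     # the ASR model stays frozen

pred_audio = torch.randn(4, 16000, requires_grad=True)  # from the PLC network
clean_audio = torch.randn(4, 16000)                     # packet-loss-free audio

f_pred = asr_encoder(pred_audio.unsqueeze(1))
f_clean = asr_encoder(clean_audio.unsqueeze(1))
feat_loss = nn.functional.l1_loss(f_pred, f_clean)
feat_loss.backward()                            # gradients reach the PLC network
```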

According to an embodiment of another aspect, a neural network-based voice packet loss compensation apparatus is provided. The apparatus may be deployed on any apparatus, device, platform, or device cluster with computing and processing capabilities.

Figure 11 shows a schematic block diagram of a neural network-based voice packet loss compensation apparatus according to one embodiment. The apparatus shown in Figure 11 is used to perform the method shown in Figure 9. As shown in Figure 11, the neural network-based voice packet loss compensation apparatus 1100 includes: a model acquisition unit 1101, configured to obtain the neural network for voice packet loss compensation trained by the method described in connection with Figure 2; a receiving unit 1102, configured to receive audio to be processed and the frame-loss position information corresponding to the audio to be processed; and a feature input unit 1103, configured to input the input features generated from the audio to be processed and its corresponding frame-loss position information into the neural network, obtaining the packet-loss-compensated audio corresponding to the audio to be processed.

The above apparatus embodiments correspond to the method embodiments; for details, refer to the descriptions in the method embodiments, which are not repeated here. Because the apparatus embodiments are derived from the corresponding method embodiments, they achieve the same technical effects as those method embodiments.

According to an embodiment of another aspect, a computer-readable storage medium is further provided, on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the method described in Figure 2 or Figure 9.

According to an embodiment of yet another aspect, an electronic device is further provided, including a memory and a processor, where the memory stores executable code, and the processor, when executing the executable code, implements the method described in Figure 2 or Figure 9.

The foregoing describes specific embodiments of the present disclosure; other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the specific order shown, or a strictly sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible and may be advantageous.

Referring now to Figure 12, which shows a schematic structural diagram of an electronic device 1200 suitable for implementing the embodiments of the present application. The electronic device shown in Figure 12 is merely an example and should not impose any limitation on the functionality or scope of use of the embodiments of the present application.

As shown in Figure 12, the electronic device 1200 may include a processing apparatus (for example, a central processing unit or a graphics processing unit) 1201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage apparatus 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for the operation of the electronic device 1200. The processing apparatus 1201, the ROM 1202, and the RAM 1203 are connected to one another via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.

Typically, the following may be connected to the I/O interface 1205: input apparatuses 1206 including, for example, a touch screen, a touchpad, a keyboard, and a mouse; output apparatuses 1207 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage apparatuses 1208 including, for example, a magnetic tape and a hard disk; and a communication apparatus 1209. The communication apparatus 1209 allows the electronic device 1200 to communicate wirelessly or by wire with other devices to exchange data. Although Figure 12 shows an electronic device 1200 with various apparatuses, it should be understood that not all of the illustrated apparatuses are required to be implemented or provided; more or fewer apparatuses may alternatively be implemented or provided. Each block shown in Figure 12 may represent one apparatus or, as needed, multiple apparatuses.

In particular, according to the embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present application includes a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication apparatus 1209, installed from the storage apparatus 1208, or installed from the ROM 1202. When the computer program is executed by the processing apparatus 1201, the above-described functions defined in the methods of the embodiments of the present application are performed.

An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the methods provided in the present disclosure.

It should be noted that the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by, or in combination with, an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to a wire, an optical cable, RF (radio frequency), or any suitable combination of the above.

The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device to: obtain a training sample set, where each training sample includes sample packet-loss audio together with its corresponding sample frame-loss position information and sample packet-loss-free audio; generate input features based on the sample packet-loss audio and its corresponding sample frame-loss position information; input the input features into the neural network to be trained, where the neural network to be trained includes an encoder layer, an intermediate layer, and a decoder layer, with the intermediate layer connected between the encoder layer and the decoder layer; input the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, which outputs a predicted fundamental frequency; and adjust the network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and the true fundamental frequency computed from the sample packet-loss-free audio.

Computer program code for performing the operations of the embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or electronic device. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The embodiments in the present disclosure are described in a progressive manner; for identical or similar parts between embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, the storage medium and computing device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the descriptions in the method embodiments.

Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the embodiments of the present disclosure may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.

The specific implementations described above further explain the purposes, technical solutions, and beneficial effects of the embodiments of the present disclosure. It should be understood that the above descriptions are merely specific implementations of the embodiments of the present disclosure and are not intended to limit the scope of protection of the present disclosure; any modification, equivalent replacement, improvement, or the like made on the basis of the technical solutions of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (14)

1. A neural network-based voice packet loss compensation method, comprising: obtaining a pre-trained neural network for voice packet loss compensation; receiving audio to be processed and frame-loss position information corresponding to the audio to be processed; and inputting input features generated based on the audio to be processed and the frame-loss position information corresponding to the audio to be processed into the neural network to obtain packet-loss-compensated audio corresponding to the audio to be processed.

2. A method for training a neural network for voice packet loss compensation, wherein the neural network to be trained comprises an encoder layer, an intermediate layer, and a decoder layer, the intermediate layer being connected between the encoder layer and the decoder layer, the method comprising: obtaining a training sample set, wherein each training sample comprises sample packet-loss audio together with its corresponding sample frame-loss position information and sample packet-loss-free audio; generating input features based on the sample packet-loss audio and its corresponding sample frame-loss position information; inputting the input features into the neural network to be trained; inputting the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, the fundamental frequency prediction network outputting a predicted fundamental frequency; and adjusting network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and a true fundamental frequency computed from the sample packet-loss-free audio.

3. The method according to claim 2, wherein the neural network to be trained is a neural network with a U-Net structure, and the intermediate layer is the bottleneck layer in the U-Net structure.

4. The method according to claim 2, wherein the fundamental frequency prediction network comprises a bidirectional long short-term memory network.

5. The method according to claim 2, wherein the features output by the intermediate layer include the features of the frames at the frame-loss positions indicated by the sample frame-loss position information, and the predicted fundamental frequency output by the fundamental frequency prediction network includes the fundamental frequency of those frames.

6. The method according to claim 2, wherein generating input features based on the sample packet-loss audio and its corresponding sample frame-loss position information comprises: performing sub-band decomposition on the sample packet-loss audio to obtain multiple sub-bands; and generating the input features based on results of converting the multiple sub-bands into the time-frequency domain and the sample frame-loss position information.

7. The method according to claim 2, wherein the encoder layer comprises multiple encoders, each encoder comprising a gated convolution layer and a time-frequency dilated convolution layer, the time-frequency dilated convolution layer being used to extract features through dilated convolution in the time dimension and the frequency dimension.

8. The method according to claim 2, wherein the decoder layer comprises multiple decoders, each decoder comprising a first branch and a second branch in parallel, the first branch being used to predict the real part of the audio and the second branch being used to predict the imaginary part of the audio; the neural network to be trained outputs sample predicted audio based on the real and imaginary parts of the predicted audio output by the decoder layer; and the method further comprises: inputting the sample predicted audio into at least one pre-trained discriminator, each discriminator outputting a discrimination result for the sample predicted audio; and calculating a loss based on at least one discrimination result, the sample predicted audio, and the sample packet-loss-free audio, and adjusting network parameters of the neural network to be trained based on the loss.

9. The method according to claim 8, wherein the at least one discriminator includes a first discriminator for estimating the probability that the sample predicted audio is real audio and a second discriminator for assessing the audio quality of the sample predicted audio.

10. The method according to any one of claims 2 to 9, further comprising: inputting the sample predicted audio output by the neural network to be trained and its corresponding sample packet-loss-free audio into a pre-trained speech recognition model, respectively; obtaining encoding layer features of the sample predicted audio and of its corresponding sample packet-loss-free audio in the speech recognition model; and adjusting network parameters of the neural network to be trained based on a difference loss between the two obtained encoding layer features.

11. An apparatus for training a neural network for voice packet loss compensation, wherein the neural network to be trained comprises an encoder layer, an intermediate layer, and a decoder layer, the intermediate layer being connected between the encoder layer and the decoder layer, the apparatus comprising: an acquisition unit configured to obtain a training sample set, wherein each training sample comprises sample packet-loss audio together with its corresponding sample frame-loss position information and sample packet-loss-free audio; a generation unit configured to generate input features based on the sample packet-loss audio and its corresponding sample frame-loss position information; a first input unit configured to input the input features into the neural network to be trained; a second input unit configured to input the features output by the intermediate layer into a pre-trained fundamental frequency prediction network, the fundamental frequency prediction network outputting a predicted fundamental frequency; and an adjustment unit configured to adjust network parameters of the encoder layer and the intermediate layer based on the predicted fundamental frequency and a true fundamental frequency computed from the sample packet-loss-free audio.

12. A neural network-based voice packet loss compensation apparatus, comprising: a model acquisition unit configured to obtain a pre-trained neural network for voice packet loss compensation; a receiving unit configured to receive audio to be processed and frame-loss position information corresponding to the audio to be processed; and a feature input unit configured to input input features generated based on the audio to be processed and the frame-loss position information corresponding to the audio to be processed into the neural network to obtain packet-loss-compensated audio corresponding to the audio to be processed.

13. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the method according to any one of claims 1 to 10.

14. An electronic device, comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method according to any one of claims 1 to 10.
Kind code of ref document: A1