WO2022121169A1 - Emotional speech synthesis method, apparatus, and device, and storage medium - Google Patents
- Publication number
- WO2022121169A1 (PCT/CN2021/083559)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- data
- generate
- emotional
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
- G06F17/142—Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Definitions
- the present application relates to the technical field of speech synthesis, and in particular, to a method, apparatus, device and storage medium for synthesizing emotional speech.
- current speech synthesis methods are mainly hidden-Markov-model-based speech synthesis methods or neural-network-based speech synthesis methods.
- the inventor realized that although these two speech synthesis methods can obtain good synthesized speech, the resulting synthesized speech is flat and lacks emotion, making it impossible to obtain emotional speech.
- the present application provides a method, apparatus, device and storage medium for synthesizing emotional speech, which are used to solve the problem of dullness and lack of emotion in synthesized speech and to increase the diversity of the synthesized speech.
- a first aspect of the present application provides a method for synthesizing emotional speech, including: acquiring speech data to be recognized and corresponding text data; inputting the speech data to be recognized into a pre-trained emotion recognition network to generate mel spectrum features and a position encoding, and processing the mel spectrum features and the position encoding in the emotion recognition network to generate an emotion embedding feature; inputting the emotion embedding feature and the text data into a pre-trained speech synthesis network to generate target mel spectrum data; and using a neural vocoder to perform speech conversion on the target mel spectrum data to generate target emotional speech.
- a second aspect of the present application provides an apparatus for synthesizing emotional speech, comprising: an acquisition module for acquiring to-be-recognized speech data and corresponding text data; an embedded feature generation module for inputting the to-be-recognized speech data into a pre-trained emotion recognition network to generate the mel spectrum feature and the position code, and for processing the mel spectrum feature and the position code in the emotion recognition network to generate the emotion embedded feature; a mel spectrum data generation module for inputting the emotion embedded feature and the text data into a pre-trained speech synthesis network to generate target mel spectrum data; and a speech conversion module for using a neural vocoder to perform speech conversion on the target mel spectrum data to generate target emotional speech.
- a third aspect of the present application provides a device for synthesizing emotional speech, including a memory and at least one processor, where instructions are stored in the memory; the at least one processor invokes the instructions in the memory, so that the device for synthesizing emotional speech performs the following method for synthesizing emotional speech:
- a fourth aspect of the present application provides a computer-readable storage medium in which instructions are stored; when the instructions run on a computer, the computer executes the following method for synthesizing emotional speech:
- the speech data to be recognized and the corresponding text data are obtained;
- the speech data to be recognized is input into a pre-trained emotion recognition network to generate mel spectrum features and a position encoding, and the mel spectrum features and the position encoding are processed together in the emotion recognition network to generate emotion embedded features;
- input the emotion embedded features and the text data into a pre-trained speech synthesis network to generate target mel spectrum data;
- Use a neural vocoder to convert the target Mel spectrum data to generate target emotional speech.
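The four steps above can be sketched end to end as follows; this is a minimal illustrative pipeline in Python, where `emotion_net`, `synthesis_net`, and `vocoder` are toy stand-ins for the trained networks, not the patented implementation.

```python
import numpy as np

def synthesize_emotional_speech(speech, text,
                                emotion_net, synthesis_net, vocoder):
    # Step 1: the speech data to be recognized and its text are the inputs.
    # Step 2: emotion recognition network -> emotion embedding feature.
    emotion_embedding = emotion_net(speech)
    # Step 3: speech synthesis network -> target mel spectrum data.
    target_mel = synthesis_net(emotion_embedding, text)
    # Step 4: neural vocoder converts the mel spectrum to a waveform.
    return vocoder(target_mel)

# Toy stand-in networks so the sketch runs end to end (not the real models).
emotion_net = lambda s: np.mean(s) * np.ones(4)          # 4-dim "embedding"
synthesis_net = lambda e, t: np.outer(e, np.ones(len(t)))  # (4, len(text)) "mel"
vocoder = lambda mel: mel.flatten()                       # fake waveform

wave = synthesize_emotional_speech(np.ones(16), "hello",
                                   emotion_net, synthesis_net, vocoder)
```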
- the emotion embedded feature is generated, and then the emotion embedded feature and the text data are spliced to generate the target emotional speech, which solves the problem of dull, emotionless synthesized speech and increases the diversity of the synthesized speech.
- FIG. 1 is a schematic diagram of an embodiment of a method for synthesizing emotional speech in an embodiment of the present application
- FIG. 2 is a schematic diagram of another embodiment of a method for synthesizing emotional speech in an embodiment of the present application
- FIG. 3 is a schematic diagram of an embodiment of an apparatus for synthesizing emotional speech in an embodiment of the present application
- FIG. 4 is a schematic diagram of another embodiment of an apparatus for synthesizing emotional speech in an embodiment of the present application.
- FIG. 5 is a schematic diagram of an embodiment of a device for synthesizing emotional speech in an embodiment of the present application.
- Embodiments of the present application provide a method, apparatus, device, and storage medium for synthesizing emotional speech, which are used to solve the problem that synthesized speech is dull and lacks emotion, and to increase the diversity of the synthesized speech.
- an embodiment of the method for synthesizing emotional speech in the embodiment of the present application includes:
- the server obtains the speech data to be recognized and the text data corresponding to the speech data to be recognized. It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned speech data to be recognized and text data, they can also be stored in a node of a blockchain.
- the to-be-recognized speech data is speech data carrying emotion, which may be to-be-recognized speech data with a happy emotion, a surprised emotion, and/or an angry emotion.
- when the server obtains the speech data to be recognized with emotion, it also obtains the corresponding text data. For example, if the speech data to be recognized with emotion is "Really! Disappointment!", then when obtaining the speech data "Really! Disappointment!", the server also obtains the text data "Really! Disappointment!".
- the execution subject of the present application may be a device for synthesizing emotional speech, and may also be a terminal or a server, which is not specifically limited here.
- the embodiments of the present application take the server as an execution subject as an example for description.
- the server inputs the speech data to be recognized into the pre-trained emotion recognition network, which first generates the mel spectrum features and the position code, and then processes the mel spectrum features and the position code in the emotion recognition network to generate the emotion embedded feature.
- the server inputs the speech data "Really! Disappointment!" into the pre-trained emotion recognition network for calculation, generating the mel spectrum features [B1, T1, D1] and the position code P, where the position code P is generated based on the mel spectrum features; the position code P is actually a hidden layer output.
- the server combines the mel spectrum features [B1, T1, D1] and the position code P for calculation to generate the emotion embedding features [B2, T2, D2].
- the server inputs the emotion embedded features and text data into the pre-trained speech synthesis network for calculation, and generates the target mel spectrum data.
- the server inputs the emotion embedding features [B2, T2, D2] and the text data "Really! Disappointment!" into the pre-trained speech synthesis network for calculation.
- the speech synthesis network includes an encoder, in which feature extraction is performed on the text data "Really! Disappointment!" to generate an extraction result; the extraction result is spliced with the emotion embedding features [B2, T2, D2] to generate the target mel spectrum data [B2, T2, D2+D].
- the server uses a neural vocoder to convert the target mel-spectral data into target emotional speech.
- the neural vocoder is WaveGlow
- the target Mel spectrum data is the input of the neural vocoder
- the input frame length is 1024
- the frame shift is 256.
- the target mel spectrum data is input into the affine coupling layer of the neural vocoder for scaling and transformation to generate emotional speech features, and then invertible convolution is performed on the emotional speech features to generate the target emotional speech "Really! (surprised emotion) Soda for you! (happy emotion)".
- the emotion embedded feature is generated, and then the emotion embedded feature and the text data are spliced to generate the target emotional speech, which solves the problem of dull, emotionless synthesized speech and increases the diversity of the synthesized speech.
- another embodiment of the method for synthesizing emotional speech in the embodiment of the present application includes:
- the server obtains emotional speech training data, emotional label data and text training data from the big data platform or database.
- emotional speech training data can be divided into emotional speech training data including noise and emotional speech training data not including noise.
- the emotional speech training data can be speech training data with emotion such as "Too much", "Really", or "Too good", each with corresponding emotion label data and text training data: the "Too much" emotional speech training data corresponds to the emotion label data "anger" and the text training data "Too much"; the "Really" emotional speech training data corresponds to the emotion label data "surprise" and the text training data "Really"; and the "Too good" emotional speech training data corresponds to the emotion label data "happy" and the text training data "Too good".
- the server performs training based on the emotional speech training data and emotion label data, combined with the layer regularization mechanism, to generate a pre-trained emotion recognition network, and then performs model training based on the emotional speech training data and text training data to generate a pre-trained speech synthesis network.
- the pre-trained emotion recognition network is used to extract emotional features, so the emotional speech training data and emotion label data are used to train the emotion recognition network; the pre-trained speech synthesis network is used to synthesize emotional speech, so the emotional speech training data and text training data are used to train the speech synthesis network.
- when training the emotion recognition network, the emotional speech training data can be either the emotional speech training data including noise or the emotional speech training data not including noise; however, when training the speech synthesis network, high-quality training data must be used, that is, emotional speech training data that does not include noise.
- the server trains the emotion recognition network in combination with the layer regularization mechanism, mainly by adding a layer regularization mechanism after each sub-layer; the layer regularization mechanism computes the mean and variance of the output of layer i in the channel dimension, then subtracts the mean from the output of layer i and divides by the variance, so that the output of layer i has a mean of 0 and a variance of 1.
- the layer regularization mechanism can make the distribution of training data consistent and make the training process stable.
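The layer regularization step described above can be sketched as follows; a minimal NumPy version, assuming the standard practice of dividing by the square root of the variance (plus a small epsilon) so that the normalized output has mean 0 and variance 1.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer regularization as described: per-sample mean and variance over
    the channel (last) dimension; the output is normalized to mean 0 and
    variance 1. Standard implementations divide by the standard deviation."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Example: a batch of 2 sequences, 7 frames, 16 channels.
out = layer_norm(np.random.randn(2, 7, 16))
```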
- the server obtains the speech data to be recognized and the text data corresponding to the speech data to be recognized. It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned speech data to be recognized and text data, they can also be stored in a node of a blockchain.
- the to-be-recognized speech data is speech data carrying emotion, which may be to-be-recognized speech data with a happy emotion, a surprised emotion, and/or an angry emotion.
- when the server obtains the speech data to be recognized with emotion, it also obtains the corresponding text data. For example, if the speech data to be recognized with emotion is "Really! Disappointment!", then when obtaining the speech data "Really! Disappointment!", the server also obtains the text data "Really! Disappointment!".
- the server inputs the speech data to be recognized into the pre-trained emotion recognition network, which first generates the mel spectrum features and the position code, and then processes the mel spectrum features and the position code in the emotion recognition network to generate the emotion embedded feature.
- the server inputs the speech data "Really! Disappointment!" into the pre-trained emotion recognition network for calculation, generating the mel spectrum features [B1, T1, D1] and the position code P, where the position code P is generated based on the mel spectrum features; the position code P is actually a hidden layer output.
- the server combines the mel spectrum features [B1, T1, D1] and the position code P for calculation to generate the emotion embedding features [B2, T2, D2].
- the server inputs the speech data to be recognized into a pre-trained emotion recognition network to generate mel spectrum features; the server generates a position code according to the mel spectrum features and a preset position conversion formula; the server inputs the mel spectrum features and the position encoding into the encoder of the emotion recognition network for encoding to generate the emotion embedding features.
- the server inputs the speech data "Really! Disappointment!" into the pre-trained emotion recognition network; that is, "Really! Disappointment!" is first input into the trained emotion recognition network for feature extraction.
- the server inputs the mel spectrum features [B1, T1, D1] and the position encoding P into the encoder of the emotion recognition network for encoding to generate the emotion embedding features. The encoder of the emotion recognition network includes five identical modules and one long short-term memory (LSTM) layer, where each module includes two sub-layers, namely a multi-head self-attention layer and a forward propagation layer. In the encoder, [B1, T1, D1] and P are encoded to generate the emotion embedding features [B2, T2, D2].
- the server inputs the speech data to be recognized into the pre-trained emotion recognition network, and the generated Mel spectrum features include:
- the server performs windowing processing on the speech data to be recognized to generate windowed speech data; then, the server performs short-time Fourier transform on the windowed speech data to generate Fourier transformed speech data; finally, the server uses a Mel filter bank to process the Fourier transformed speech data to generate Mel spectrum features.
- the server uses a window function to perform windowing on the speech data to be recognized, "Really! Disappointment!", to generate windowed speech data; then the server performs the Fourier transform on the windowed speech data, determining the frequency and phase of the windowed speech data, thereby generating the Fourier-transformed speech data; finally, the server uses a mel filter bank to process the Fourier-transformed speech data into mel spectrum features.
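The windowing, short-time Fourier transform, and mel filter bank steps can be sketched as follows; a simplified NumPy version with assumed parameters (Hann window, 22050 Hz sample rate, 1024-point FFT, hop 256, 80 mel bands), not the exact front end used in the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(wave, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # 1) Windowing: split into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(wave) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([wave[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    # 2) Short-time Fourier transform: magnitude spectrum of each frame.
    spec = np.abs(np.fft.rfft(frames, axis=1))        # (n_frames, n_fft//2+1)
    # 3) Mel filter bank: triangular filters evenly spaced on the mel scale.
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2),
                                    n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return spec @ fbank.T                             # (n_frames, n_mels)

# One second of a 440 Hz tone as a stand-in for real speech.
mel = mel_spectrogram(np.sin(2 * np.pi * 440 * np.arange(22050) / 22050))
```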
- the server generating the position code according to the mel spectrum features and the preset position conversion formula includes:
- the server reads the length of the mel spectrum features and the positions of the mel spectrum features; the server generates a position input value based on the length and the positions of the mel spectrum features; the server inputs the position input value into the preset position conversion formula to generate the position code.
- the preset position conversion formula is:
- PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
- PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
- where pos is the position of the mel spectral feature; 2i represents the even dimensions; 2i+1 represents the odd dimensions; and d_model represents the preset dimension vector corresponding to the position of the mel spectral feature, such as 256.
- for example, the server reads the length of the mel spectral features as 5 and the starting position of the mel spectral features as 0; the server then determines the position input value as [0, 1, 2, 3, 4] based on the length 5 and the starting position 0, and substitutes the position input value [0, 1, 2, 3, 4] into the above formula to generate the position code P.
- P is only a pronoun, and does not represent specific position encoding data.
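A minimal sketch of this position encoding computation, assuming the standard sinusoidal formula with d_model = 256 and the position input value [0, 1, 2, 3, 4]:

```python
import numpy as np

def positional_encoding(length, d_model=256):
    # pos: position of the feature; 2i / 2i+1: even / odd dimensions.
    pos = np.arange(length)[:, None]                 # e.g. [0, 1, 2, 3, 4]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dimensions
    pe[:, 1::2] = np.cos(angle)                      # odd dimensions
    return pe

# Position code P for a mel spectrum feature of length 5.
P = positional_encoding(5)
```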
- the server inputting the mel spectrum features and the position encoding into the encoder of the emotion recognition network for encoding to generate the emotion embedding features includes:
- the server inputs the mel spectrum features and the position encoding into the multi-head self-attention layer of the emotion recognition network, combined with residual connections, to generate the initial emotion feature vector; the server inputs the initial emotion feature vector into the forward propagation layer of the emotion recognition network for convolution to generate the emotion embedding features.
- the server first inputs the mel spectrum features [B1, T1, D1] into the multi-head self-attention layer for calculation, combined with the residual connections, to generate the initial emotion feature vector.
- the formulas for the multi-head self-attention layer are as follows:
- Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
- head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
- MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
- where Q, K, V are the inputs, that is, the mel spectrum features; d_k is the preset dimension vector, such as 256; head_i is the i-th head, and each calculation in the multi-head self-attention layer is one head; W_i^Q, W_i^K, W_i^V are the weights, which are generated during the training process; and Concat splices the heads together along the last dimension.
- for example, the dimension vectors of these four heads are each [Bt, Tt, Dt], and the server splices them together to generate the initial emotion feature vector [Bt, Tt, 4Dt], where W^O is a parameter learned in advance.
- the initial emotion feature vector and the mel spectrum feature input are convolved in the corresponding forward propagation layer to generate the first module's emotion feature vector. Since the encoder includes five identical modules, five such calculations are performed, and the output of the last module is input into a long short-term memory (LSTM) layer, thereby generating the emotion embedded features [B2, T2, D2].
- the residual connection adds the input of each multi-head self-attention layer to its output, which then serves as the input of the next forward propagation layer.
- for the first multi-head self-attention layer, the mel spectrum features are added to the initial emotion feature vector to generate the input of the forward propagation layer, thereby improving the relevance of the generated emotion embedding features.
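The multi-head self-attention calculation with a residual connection can be sketched as follows; a toy NumPy version with four heads and random stand-in weights in place of the learned W_i^Q, W_i^K, W_i^V, and W^O.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)) @ V

def multi_head_self_attention(x, n_heads=4):
    # x: (T, D) mel-spectrum-derived features; self-attention, so Q = K = V = x.
    T, D = x.shape
    d_h = D // n_heads
    heads = []
    for _ in range(n_heads):
        # W_i^Q, W_i^K, W_i^V would be learned; random stand-ins here.
        Wq, Wk, Wv = (rng.standard_normal((D, d_h)) / np.sqrt(D)
                      for _ in range(3))
        heads.append(attention(x @ Wq, x @ Wk, x @ Wv))
    # Concat: splice the heads together along the last dimension.
    concat = np.concatenate(heads, axis=-1)
    Wo = rng.standard_normal((concat.shape[-1], D)) / np.sqrt(D)
    # Residual connection: add the layer's input to its output.
    return x + concat @ Wo

y = multi_head_self_attention(rng.standard_normal((6, 16)))
```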
- the server inputs the emotion embedded features and text data into the pre-trained speech synthesis network for calculation, and generates the target mel spectrum data.
- the server inputs the emotion embedding features [B2, T2, D2] and the text data "Really! Disappointment!" into the pre-trained speech synthesis network for calculation.
- the speech synthesis network includes an encoder, in which feature extraction is performed on the text data "Really! Disappointment!" to generate an extraction result; the extraction result is spliced with the emotion embedding features [B2, T2, D2] to generate the target mel spectrum data [B2, T2, D2+D].
- the server converts the text data into text embedding features in the pre-trained speech synthesis network; the server splices the text embedding features and the emotion embedding features in time order to generate the target mel spectrum data.
- specifically, the server first converts the text data into text embedding features of the same form as the emotion embedding features, and then splices the text embedding features and the emotion embedding features in time order to generate the target mel spectrum data.
- for example, the emotion embedding feature is [B2, T2, D2] and the text embedding feature is [B2, T2, D3]; the server splices [B2, T2, D2] with [B2, T2, D3] to generate the target mel spectrum data [B2, T2, D2+D3].
- if the emotion embedding feature dimension is [B2, D2], the server expands the emotion embedding feature to [B2, 1, D2] and then splices the emotion embedding feature with the text embedding feature.
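The splicing step can be sketched as follows; a NumPy illustration with hypothetical dimensions B=2, T=6, D2=8, D3=12, including the [B2, D2] to [B2, 1, D2] expansion case.

```python
import numpy as np

# Hypothetical dimensions: emotion embedding [B, T, D2], text embedding [B, T, D3].
B, T, D2, D3 = 2, 6, 8, 12
emotion = np.random.randn(B, T, D2)
text = np.random.randn(B, T, D3)

# Splice along the last (feature) dimension, frame-aligned in time order.
target = np.concatenate([text, emotion], axis=-1)        # [B, T, D2 + D3]

# If the emotion embedding lacks a time axis ([B, D2]), expand it to
# [B, 1, D2] and repeat it over the T frames before splicing.
emotion_2d = np.random.randn(B, D2)
expanded = np.repeat(emotion_2d[:, None, :], T, axis=1)  # [B, T, D2]
target2 = np.concatenate([text, expanded], axis=-1)      # [B, T, D2 + D3]
```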
- the server uses a neural vocoder to convert the target mel-spectral data into target emotional speech.
- the neural vocoder is WaveGlow
- the target Mel spectrum data is the input of the neural vocoder
- the input frame length is 1024
- the frame shift is 256.
- the target mel spectrum data is input into the affine coupling layer of the neural vocoder for scaling and transformation to generate emotional speech features, and then invertible convolution is performed on the emotional speech features to generate the target emotional speech "Really! (surprised emotion) Soda for you! (happy emotion)".
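The affine coupling transformation of a WaveGlow-style vocoder can be sketched as follows; a toy NumPy version in which `wn` stands in for the learned mel-conditioned transform, showing that the scaling-and-shift step is exactly invertible.

```python
import numpy as np

rng = np.random.default_rng(1)

def affine_coupling(x, mel, wn):
    """WaveGlow-style affine coupling sketch: half the channels pass through
    unchanged; the other half is scaled and shifted by (log_s, t) predicted
    from the untouched half and the mel spectrum conditioning. `wn` is a
    stand-in for the learned WaveNet-like transform."""
    xa, xb = np.split(x, 2, axis=0)
    log_s, t = wn(xa, mel)
    return np.concatenate([xa, xb * np.exp(log_s) + t], axis=0), log_s

def affine_coupling_inverse(y, mel, wn):
    # Invertible because xa is unchanged, so (log_s, t) can be recomputed.
    ya, yb = np.split(y, 2, axis=0)
    log_s, t = wn(ya, mel)
    return np.concatenate([ya, (yb - t) * np.exp(-log_s)], axis=0)

# Toy conditioning network (the real one is learned during training).
wn = lambda xa, mel: (0.1 * xa + mel.mean(), 0.05 * xa)

x = rng.standard_normal(8)
mel = rng.standard_normal(80)
y, _ = affine_coupling(x, mel, wn)
x_back = affine_coupling_inverse(y, mel, wn)
```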
- the emotion embedded feature is generated, and then the emotion embedded feature and the text data are spliced to generate the target emotional speech, which solves the problem of dull, emotionless synthesized speech and increases the diversity of the synthesized speech.
- An embodiment of the device for synthesizing emotional speech in the embodiment of the present application includes:
- a to-be-recognized data acquisition module 301 configured to acquire to-be-recognized voice data and corresponding text data;
- the embedded feature generation module 302 is used to input the speech data to be recognized into the pre-trained emotion recognition network to generate a mel spectrum feature and a position code, and to process the mel spectrum feature and the position code in the emotion recognition network to generate emotion embedded features;
- Mel spectrum data generation module 303 for inputting the emotion embedded feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data
- the speech conversion module 304 is configured to perform speech conversion on the target mel spectrum data by using a neural vocoder to generate a target emotional speech.
- the emotion embedded feature is generated, and then the emotion embedded feature and the text data are spliced to generate the target emotional speech, which solves the problem of dull, emotionless synthesized speech and increases the diversity of the synthesized speech.
- another embodiment of the apparatus for synthesizing emotional speech in the embodiment of the present application includes:
- a to-be-recognized data acquisition module 301 configured to acquire to-be-recognized voice data and corresponding text data;
- the embedded feature generation module 302 is used to input the speech data to be recognized into the pre-trained emotion recognition network to generate a mel spectrum feature and a position code, and to process the mel spectrum feature and the position code in the emotion recognition network to generate emotion embedded features;
- Mel spectrum data generation module 303 for inputting the emotion embedded feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data
- the speech conversion module 304 is configured to perform speech conversion on the target mel spectrum data by using a neural vocoder to generate a target emotional speech.
- the embedded feature generation module 302 includes:
- Mel spectrum feature generation unit 3021 for inputting the speech data to be recognized into the pre-trained emotion recognition network to generate Mel spectrum features
- a position code generation unit 3022 configured to generate a position code according to the mel spectrum feature and a preset position conversion formula
- the encoding unit 3023 is configured to input the mel spectrum feature and the position code into the encoder of the emotion recognition network for encoding, and generate an emotion embedded feature.
- the mel spectrum feature generating unit 3021 can also be specifically used for:
- windowing is performed on the to-be-recognized speech data to generate windowed speech data;
- short-time Fourier transform is performed on the windowed speech data to generate Fourier-transformed speech data;
- the Fourier-transformed speech data is processed by using a mel filter bank to generate the mel spectrum features.
- the location code generation unit 3022 can also be specifically used for:
- the encoding unit 3023 can also be specifically used for:
- the initial emotion feature vector is input into the forward propagation layer of the emotion recognition network for convolution to generate emotion embedded features.
- the Mel spectrum data generation module 303 can also be specifically used for:
- the text embedding feature and the emotion embedding feature are spliced to generate target mel spectrum data.
- the device for synthesizing emotional speech further includes:
- a training data acquisition module 305 configured to acquire emotional speech training data, emotional label data and text training data
- the training module 306 is configured to perform model training using the emotional speech training data and the emotion label data in combination with a layer regularization mechanism to generate a pre-trained emotion recognition network, and to perform model training using the emotional speech training data and the text training data to generate a pre-trained speech synthesis network.
- the emotion embedded feature is generated, and then the emotion embedded feature and the text data are spliced to generate the target emotional speech, which solves the problem of dull, emotionless synthesized speech and increases the diversity of the synthesized speech.
- the device 500 for synthesizing emotional speech may vary greatly due to different configurations or performance, and may include one or more processors (central processing units, CPUs) 510, a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) that store application programs 533 or data 532.
- the memory 520 and the storage medium 530 may be short-term storage or persistent storage.
- the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the device 500 for synthesizing emotional speech.
- the processor 510 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the emotional speech synthesis device 500.
- the emotional speech synthesis device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, and FreeBSD.
- the present application also provides a device for synthesizing emotional speech.
- the computer device includes a memory and a processor, and computer-readable instructions are stored in the memory; when the processor executes the computer-readable instructions, the steps of the emotional speech synthesis method in the foregoing embodiments are performed.
- the present application also provides a computer-readable storage medium.
- the computer-readable storage medium may be a non-volatile computer-readable storage medium.
- the computer-readable storage medium may also be a volatile computer-readable storage medium.
- the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to execute the steps of the method for synthesizing emotional speech.
- the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
- a blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
- the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
- the technical solutions of the present application, in essence, or the parts contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Abstract
Description
This application claims priority to the Chinese patent application No. 202011432589.4, entitled "Emotional Speech Synthesis Method, Apparatus, Device, and Storage Medium", filed with the China Patent Office on December 10, 2020, the entire contents of which are incorporated herein by reference.
The present application relates to the technical field of speech synthesis, and in particular to an emotional speech synthesis method, apparatus, device, and storage medium.
With the development of technology, artificial intelligence services such as intelligent customer service centers, chatbots, and smart speakers have entered our daily lives and play an increasingly important role. Such artificial intelligence services usually involve speech synthesis technology, so speech synthesis technology has also become more widely used.
In the prior art, speech synthesis methods are mainly hidden-Markov-model-based or neural-network-based. The inventor realized that although both approaches can produce decent synthesized speech, the generated speech is flat and lacks emotion, so emotionally rich speech cannot be obtained.
SUMMARY OF THE INVENTION
The present application provides an emotional speech synthesis method, apparatus, device, and storage medium, which are used to solve the problem that synthesized speech is flat and lacks emotion, and to increase the diversity of synthesized speech.
A first aspect of the present application provides an emotional speech synthesis method, including: acquiring speech data to be recognized and corresponding text data; inputting the speech data to be recognized into a pre-trained emotion recognition network to generate a Mel spectrum feature and a position encoding, and processing the Mel spectrum feature and the position encoding in the emotion recognition network to generate an emotion embedding feature; inputting the emotion embedding feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data; and performing speech conversion on the target Mel spectrum data using a neural vocoder to generate target emotional speech.
A second aspect of the present application provides an emotional speech synthesis apparatus, including: an acquisition module configured to acquire speech data to be recognized and corresponding text data; an embedding feature generation module configured to input the speech data to be recognized into a pre-trained emotion recognition network to generate a Mel spectrum feature and a position encoding, and to process the Mel spectrum feature and the position encoding in the emotion recognition network to generate an emotion embedding feature; a Mel spectrum data generation module configured to input the emotion embedding feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data; and a speech conversion module configured to perform speech conversion on the target Mel spectrum data using a neural vocoder to generate target emotional speech.
A third aspect of the present application provides an emotional speech synthesis device, including a memory and at least one processor, where instructions are stored in the memory; the at least one processor invokes the instructions in the memory, so that the emotional speech synthesis device performs the following emotional speech synthesis method:
acquiring speech data to be recognized and corresponding text data; inputting the speech data to be recognized into a pre-trained emotion recognition network to generate a Mel spectrum feature and a position encoding, and processing the Mel spectrum feature and the position encoding in the emotion recognition network to generate an emotion embedding feature; inputting the emotion embedding feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data; and performing speech conversion on the target Mel spectrum data using a neural vocoder to generate target emotional speech.
A fourth aspect of the present application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the following emotional speech synthesis method:
acquiring speech data to be recognized and corresponding text data; inputting the speech data to be recognized into a pre-trained emotion recognition network to generate a Mel spectrum feature and a position encoding, and processing the Mel spectrum feature and the position encoding in the emotion recognition network to generate an emotion embedding feature; inputting the emotion embedding feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data; and performing speech conversion on the target Mel spectrum data using a neural vocoder to generate target emotional speech.
In the technical solution provided by the present application, speech data to be recognized and corresponding text data are acquired; the speech data to be recognized is input into a pre-trained emotion recognition network to generate a Mel spectrum feature and a position encoding, and the Mel spectrum feature and the position encoding are processed in the emotion recognition network to generate an emotion embedding feature; the emotion embedding feature and the text data are input into a pre-trained speech synthesis network to generate target Mel spectrum data; and a neural vocoder performs speech conversion on the target Mel spectrum data to generate target emotional speech. In the embodiments of the present application, the pre-trained emotion recognition network combines the Mel spectrum feature and the position encoding to generate the emotion embedding feature, and the emotion embedding feature is then spliced with the text data to generate the target emotional speech, which solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of synthesized speech.
FIG. 1 is a schematic diagram of an embodiment of the emotional speech synthesis method in an embodiment of the present application;
FIG. 2 is a schematic diagram of another embodiment of the emotional speech synthesis method in an embodiment of the present application;
FIG. 3 is a schematic diagram of an embodiment of the emotional speech synthesis apparatus in an embodiment of the present application;
FIG. 4 is a schematic diagram of another embodiment of the emotional speech synthesis apparatus in an embodiment of the present application;
FIG. 5 is a schematic diagram of an embodiment of the emotional speech synthesis device in an embodiment of the present application.
The embodiments of the present application provide an emotional speech synthesis method, apparatus, device, and storage medium, which are used to solve the problem that synthesized speech is flat and lacks emotion, and to increase the diversity of synthesized speech.
The terms "first", "second", "third", "fourth", and the like (if any) in the description, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged under appropriate circumstances so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.
For ease of understanding, the specific flow of an embodiment of the present application is described below. Referring to FIG. 1, an embodiment of the emotional speech synthesis method in an embodiment of the present application includes:
101. Acquire speech data to be recognized and corresponding text data.
The server acquires the speech data to be recognized and the text data corresponding to the speech data to be recognized. It should be emphasized that, to further ensure the privacy and security of the speech data to be recognized and the text data, the speech data to be recognized and the text data may also be stored in a node of a blockchain.
The speech data to be recognized is speech data carrying emotion, and may be speech data with a happy emotion, speech data with a surprised emotion, and/or speech data with an angry emotion. When acquiring the emotional speech data to be recognized, the server also acquires the corresponding text data. For example, if the emotional speech data to be recognized is "Really! Congratulations!", the server also acquires the text data "Really! Congratulations!" when acquiring that speech data.
It can be understood that the execution subject of the present application may be the emotional speech synthesis apparatus, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present application take the server as the execution subject as an example for description.
102. Input the speech data to be recognized into a pre-trained emotion recognition network to generate a Mel spectrum feature and a position encoding, and process the Mel spectrum feature and the position encoding in the emotion recognition network to generate an emotion embedding feature.
The server inputs the speech data to be recognized into the pre-trained emotion recognition network, first generates the Mel spectrum feature and the position encoding, and then processes the Mel spectrum feature and the position encoding in the emotion recognition network to generate the emotion embedding feature.
The server inputs the speech data "Really! Congratulations!" into the pre-trained emotion recognition network for calculation, and generates a Mel spectrum feature [B1, T1, D1] and a position encoding P, where the position encoding P is generated based on the Mel spectrum feature and is in fact a hidden-layer output. The server then combines the Mel spectrum feature [B1, T1, D1] and the position encoding P for calculation to generate the emotion embedding feature [B2, T2, D2].
103. Input the emotion embedding feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data.
The server inputs the emotion embedding feature and the text data into the pre-trained speech synthesis network for calculation to generate the target Mel spectrum data.
The server inputs the emotion embedding feature [B2, T2, D2] and the text data "Really! Congratulations!" into the pre-trained speech synthesis network for calculation. In this embodiment, the speech synthesis network includes an encoder, in which feature extraction is performed on the text data "Really! Congratulations!" to generate an extraction result, and the extraction result is spliced with the emotion embedding feature [B2, T2, D2] to generate the target Mel spectrum data [B2, T2, D2+D].
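The splicing (concatenation) step described above can be illustrated with a minimal numpy sketch; the concrete shapes below (B2 = 1, T2 = 7, D2 = 256, D = 128) are hypothetical values chosen only for illustration:

```python
import numpy as np

# hypothetical shapes: batch B2 = 1, time steps T2 = 7,
# text feature width D2 = 256, emotion embedding width D = 128
text_features = np.zeros((1, 7, 256))      # extraction result [B2, T2, D2]
emotion_embedding = np.zeros((1, 7, 128))  # emotion embedding feature, width D

# splice along the last (feature) dimension: [B2, T2, D2] -> [B2, T2, D2 + D]
target = np.concatenate([text_features, emotion_embedding], axis=-1)
```

The last-axis concatenation is what produces the widened feature dimension D2+D that the decoder then consumes.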
104. Use a neural vocoder to perform speech conversion on the target Mel spectrum data to generate target emotional speech.
The server uses a neural vocoder to convert the target Mel spectrum data into the target emotional speech.
It should be noted that, in this embodiment, the neural vocoder is WaveGlow, and the target Mel spectrum data is the input of the neural vocoder, with a frame length of 1024 and a frame shift of 256. First, the target Mel spectrum data is input into the affine coupling layers of the neural vocoder for scaling and transformation to generate emotional speech features, and then invertible convolution is performed on the emotional speech features to generate the target emotional speech "Really! (surprised emotion) Congratulations! (happy emotion)".
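Flow-based vocoders such as WaveGlow build on invertible affine coupling layers of the kind mentioned above. The following numpy sketch illustrates only the coupling idea under simplifying assumptions; it is not WaveGlow itself, and `toy_net` is a hypothetical stand-in for the mel-conditioned network that predicts the scale and shift:

```python
import numpy as np

def affine_coupling_forward(x, scale_shift_net):
    # split channels: xa conditions the scaling/translation applied to xb
    xa, xb = np.split(x, 2, axis=-1)
    log_s, t = scale_shift_net(xa)
    return np.concatenate([xa, xb * np.exp(log_s) + t], axis=-1)

def affine_coupling_inverse(y, scale_shift_net):
    # exact inverse: xa passed through unchanged lets us recompute log_s and t
    ya, yb = np.split(y, 2, axis=-1)
    log_s, t = scale_shift_net(ya)
    return np.concatenate([ya, (yb - t) * np.exp(-log_s)], axis=-1)

def toy_net(xa):
    # hypothetical stand-in for the conditioning network
    return np.tanh(xa), 0.5 * xa

x = np.random.default_rng(0).normal(size=(3, 8))
y = affine_coupling_forward(x, toy_net)
x_back = affine_coupling_inverse(y, toy_net)
```

Because half of the channels pass through unchanged, the transform is exactly invertible, which is what allows such a vocoder to be trained by maximum likelihood and run in the synthesis direction at inference.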
In the embodiments of the present application, the pre-trained emotion recognition network combines the Mel spectrum feature and the position encoding to generate the emotion embedding feature, and the emotion embedding feature is then spliced with the text data to generate the target emotional speech, which solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of synthesized speech.
Referring to FIG. 2, another embodiment of the emotional speech synthesis method in an embodiment of the present application includes:
201. Acquire emotional speech training data, emotion label data, and text training data.
The server acquires the emotional speech training data, the emotion label data, and the text training data from a big data platform or a database.
It should be noted that the emotional speech training data can be divided into emotional speech training data that includes noise and emotional speech training data that does not include noise.
The emotional speech training data may be emotional speech such as "That's too much", "Really", or "Great", and the corresponding emotion label data and text training data are acquired: the emotional speech training data "That's too much" corresponds to the emotion label data "angry" and the text training data "That's too much"; the emotional speech training data "Really" corresponds to the emotion label data "surprised" and the text training data "Really"; and the emotional speech training data "Great" corresponds to the emotion label data "happy" and the text training data "Great".
202. Perform model training using the emotional speech training data and the emotion label data in combination with a layer regularization mechanism to generate a pre-trained emotion recognition network, and perform model training using the emotional speech training data and the text training data to generate a pre-trained speech synthesis network.
The server performs training based on the emotional speech training data and the emotion label data in combination with the layer regularization mechanism to generate the pre-trained emotion recognition network, and then performs model training based on the emotional speech training data and the text training data to generate the pre-trained speech synthesis network.
It should be noted that the pre-trained emotion recognition network is used to extract emotion features, so the emotion recognition network is trained with the emotional speech training data and the emotion label data, while the pre-trained speech synthesis network is used to synthesize emotional speech, so the speech synthesis network is trained with the emotional speech training data and the text training data. Although both training processes involve the emotional speech training data, when training the emotion recognition network the emotional speech training data may either include or exclude noise, whereas when training the speech synthesis network, high-quality training data, that is, emotional speech training data without noise, needs to be used.
In the process of training the emotion recognition network, the server trains the emotion recognition network in combination with the layer regularization mechanism, mainly by adding a layer regularization mechanism after each sub-layer. The layer regularization mechanism calculates the mean and variance of the output of layer i over the channel dimension, subtracts the mean from the output of layer i, and divides by the standard deviation, so that the output of layer i has a mean of 0 and a variance of 1. The layer regularization mechanism makes the distribution of the training data consistent and stabilizes the training process.
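The layer regularization (layer normalization) computation described above, subtracting the channel-wise mean and dividing by the standard deviation so that each output position has mean 0 and variance 1, can be sketched as follows (a minimal numpy illustration; the shape (4, 256) and the epsilon value are assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position of the layer output over the channel dimension
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# a fake layer-i output with mean ~3 and variance ~4 per channel
h = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(4, 256))
y = layer_norm(h)
```

After normalization every row of `y` has mean approximately 0 and variance approximately 1, which is what keeps the distribution of activations consistent across training steps.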
203. Acquire speech data to be recognized and corresponding text data.
The server acquires the speech data to be recognized and the text data corresponding to the speech data to be recognized. It should be emphasized that, to further ensure the privacy and security of the speech data to be recognized and the text data, the speech data to be recognized and the text data may also be stored in a node of a blockchain.
The speech data to be recognized is speech data carrying emotion, and may be speech data with a happy emotion, speech data with a surprised emotion, and/or speech data with an angry emotion. When acquiring the emotional speech data to be recognized, the server also acquires the corresponding text data. For example, if the emotional speech data to be recognized is "Really! Congratulations!", the server also acquires the text data "Really! Congratulations!" when acquiring that speech data.
204. Input the speech data to be recognized into the pre-trained emotion recognition network to generate a Mel spectrum feature and a position encoding, and process the Mel spectrum feature and the position encoding in the emotion recognition network to generate an emotion embedding feature.
The server inputs the speech data to be recognized into the pre-trained emotion recognition network, first generates the Mel spectrum feature and the position encoding, and then processes the Mel spectrum feature and the position encoding in the emotion recognition network to generate the emotion embedding feature.
The server inputs the speech data "Really! Congratulations!" into the pre-trained emotion recognition network for calculation, and generates a Mel spectrum feature [B1, T1, D1] and a position encoding P, where the position encoding P is generated based on the Mel spectrum feature and is in fact a hidden-layer output. The server then combines the Mel spectrum feature [B1, T1, D1] and the position encoding P for calculation to generate the emotion embedding feature [B2, T2, D2].
Specifically, the server inputs the speech data to be recognized into the pre-trained emotion recognition network to generate the Mel spectrum feature; the server generates the position encoding according to the Mel spectrum feature and a preset position conversion formula; and the server inputs the Mel spectrum feature and the position encoding into the encoder of the emotion recognition network for encoding to generate the emotion embedding feature.
The server inputs the speech data "Really! Congratulations!" into the pre-trained emotion recognition network: first, "Really! Congratulations!" is input into the trained emotion recognition network for feature extraction to generate the Mel spectrum feature [B1, T1, D1]; then the server calculates the position encoding of the Mel spectrum feature [B1, T1, D1] according to the preset position conversion formula to generate the position encoding P; finally, the server inputs the Mel spectrum feature [B1, T1, D1] and the position encoding P into the encoder of the emotion recognition network for encoding to generate the emotion embedding feature. The encoder of the emotion recognition network includes five identical modules and one long short-term memory (LSTM) layer, each module including two sub-layers, namely a multi-head self-attention layer and a feedforward layer; [B1, T1, D1] and P are encoded in the encoder to generate the emotion embedding feature [B2, T2, D2].
The server inputting the speech data to be recognized into the pre-trained emotion recognition network to generate the Mel spectrum feature includes:
First, the server performs windowing on the speech data to be recognized to generate windowed speech data; then, the server performs a short-time Fourier transform on the windowed speech data to generate Fourier-transformed speech data; finally, the server processes the Fourier-transformed speech data with a Mel filter bank to generate the Mel spectrum feature.
The server uses a window function to perform windowing on the speech data "Really! Congratulations!" to generate windowed speech data; then the server performs a Fourier transform on the windowed speech data to determine the frequency and phase of the windowed speech data, thereby generating the Fourier-transformed speech data; finally, the server uses a Mel filter bank to process the Fourier-transformed speech data into the Mel spectrum feature.
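The windowing, short-time Fourier transform, and Mel filter bank steps above can be sketched with numpy alone. This is a simplified illustration, not the network's actual front end: the sample rate 22050 Hz and 80 Mel bands are assumptions, while the frame length 1024 and frame shift 256 follow the values given for the vocoder input:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # n_mels triangular filters spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wave, sr=22050, n_fft=1024, hop=256, n_mels=80):
    window = np.hanning(n_fft)                      # windowing
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # short-time Fourier transform
    fb = mel_filterbank(sr, n_fft, n_mels)            # Mel filter bank
    return np.log(fb @ power.T + 1e-6)

# e.g. one second of a 440 Hz tone standing in for the speech data
wave = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
mel = mel_spectrogram(wave)
```

The result has one row per Mel band and one column per analysis frame, which is the [time, frequency] layout that the Mel spectrum feature [B1, T1, D1] flattens into.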
The server generating the position encoding according to the Mel spectrum feature and the preset position conversion formula includes:
The server reads the length of the Mel spectrum feature and the position of the Mel spectrum feature; the server generates a position input value based on the length of the Mel spectrum feature and the position of the Mel spectrum feature; and the server inputs the position input value into the preset position conversion formula to generate the position encoding.
In this embodiment, the preset position conversion formula is the sinusoidal position encoding:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position of the Mel spectrum feature, 2i denotes an even dimension, 2i+1 denotes an odd dimension, and d_model denotes the preset dimension corresponding to the position of the Mel spectrum feature, for example 256.
For example, the server reads the length of the Mel spectrum feature as 5 and the position of the Mel spectrum feature as 0; based on the length 5 and the position 0, the server determines the position input value as [0, 1, 2, 3, 4], and then calculates the position input value [0, 1, 2, 3, 4] with the above formula to generate the position encoding P. It should be noted that in this embodiment P is only a placeholder and does not denote specific position encoding data.
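Under the sinusoidal interpretation of the position conversion formula, the example above (length 5, positions [0, 1, 2, 3, 4], d_model = 256) can be computed as follows (a minimal numpy sketch):

```python
import numpy as np

def positional_encoding(length, d_model=256):
    # pos runs over 0..length-1; even dimensions use sin, odd dimensions use cos
    pos = np.arange(length)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(5)  # position input value [0, 1, 2, 3, 4]
```

Each row of `pe` is the encoding of one frame position, so the matrix can be added to the Mel spectrum feature before the encoder without any learned parameters.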
The server inputting the Mel spectrum features and the position encoding into the encoder of the emotion recognition network for encoding to generate the emotion embedding features includes:
The server inputs the Mel spectrum features and the position encoding into the multi-head self-attention layer of the emotion recognition network and, combined with residual connections, generates an initial emotion feature vector; the server then inputs the initial emotion feature vector into the feed-forward layer of the emotion recognition network for convolution to generate the emotion embedding features.
The server first inputs the Mel spectrum features [B1, T1, D1] into the multi-head self-attention layer for calculation and, combined with residual connections, generates the initial emotion feature vector. The multi-head self-attention layer is defined by the following formulas:

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V);

c_t = MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O;

where Q, K, and V are the inputs, namely the Mel spectrum features; d_k is the preset dimension vector, for example 256; and head_i is the i-th head. Each computation in the multi-head self-attention layer constitutes one head, so performing four Attention(QW_i^Q, KW_i^K, VW_i^V) computations generates four heads. W_i^Q, W_i^K, and W_i^V are weights generated during the training process. Concat splices multiple heads together along the last dimension; for example, if the dimension vectors of the four heads are all [B_t, T_t, D_t], the server splices them together to generate the initial emotion feature vector [B_t, T_t, 4D_t]. W^O is a parameter learned in advance. After generating the initial emotion feature vector, the server inputs the initial emotion feature vector together with the Mel spectrum features into the corresponding feed-forward layer for convolution to generate the emotion feature vector of the first module. Since the encoder includes five identical modules, the above computation is performed five times, and the output of the last module is fed into a long short-term memory (LSTM) layer to generate the emotion embedding features [B2, T2, D2].
It should be noted that the residual connection adds the input of each multi-head self-attention layer back into its output, which then serves as the input of the following feed-forward layer. In the first multi-head self-attention layer, this means adding the Mel spectrum features to the initial emotion feature vector to form the input of the feed-forward layer, thereby improving the relevance of the generated emotion embedding features.
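The per-head attention, concatenation along the last dimension, and residual addition described above can be sketched as follows. The randomly initialized matrices stand in for the trained weights W_i^Q, W_i^K, W_i^V, and W^O, and the shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, weights, d_k):
    # x: one sequence of Mel-spectrum-derived features, shape [T, D].
    heads = []
    for W_q, W_k, W_v in weights["heads"]:
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        att = softmax(Q @ K.T / np.sqrt(d_k))   # scaled dot-product attention
        heads.append(att @ V)                   # head_i = Attention(QW_q, KW_k, VW_v)
    # Concat(head_1, ..., head_h) along the last dimension, then project with W_O.
    c = np.concatenate(heads, axis=-1) @ weights["W_o"]
    return x + c                                # residual connection: input added to output

rng = np.random.RandomState(0)
T, D, h = 7, 16, 4
d_k = D // h
weights = {
    "heads": [(rng.randn(D, d_k), rng.randn(D, d_k), rng.randn(D, d_k))
              for _ in range(h)],
    "W_o": rng.randn(h * d_k, D),
}
out = multi_head_self_attention(rng.randn(T, D), weights, d_k)
```

The residual line `x + c` is the mechanism the paragraph above describes: the layer's input is added back into its output before the result is passed to the feed-forward layer.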
205. Input the emotion embedding features and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data.
The server inputs the emotion embedding features and the text data into the pre-trained speech synthesis network for calculation to generate the target Mel spectrum data.
The server inputs the emotion embedding features [B2, T2, D2] and the text data "Really! Congratulations!" into the pre-trained speech synthesis network for calculation. In this embodiment, the speech synthesis network includes an encoder, in which feature extraction is performed on the text data "Really! Congratulations!" to generate an extraction result, and the extraction result is concatenated with the emotion embedding features [B2, T2, D2] to generate the target Mel spectrum data [B2, T2, D2+D].
Specifically, in the pre-trained speech synthesis network, the server converts the text data into text embedding features; the server then concatenates the text embedding features and the emotion embedding features in chronological order to generate the target Mel spectrum data.
In the pre-trained speech synthesis network, the server first converts the text data, in chronological order, into text embedding features of the same form as the emotion embedding features; the server then concatenates the text embedding features and the emotion embedding features in chronological order to generate the target Mel spectrum data. In this embodiment, for example, the emotion embedding features are [B2, T2, D2] and the text embedding features are [B2, T2, D]; the server concatenates [B2, T2, D2] with [B2, T2, D] in chronological order to generate the target Mel spectrum data [B2, T2, D2+D]. In other embodiments, if the emotion embedding feature dimensions are [B2, D2], the server expands the emotion embedding features to [B2, 1, D2] and then concatenates them with the text embedding features.
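The concatenation along the feature dimension, including the [B2, D2] to [B2, 1, D2] expansion mentioned for other embodiments, can be sketched as follows; the concrete shapes are illustrative assumptions, not values fixed by the embodiment.

```python
import numpy as np

def splice(text_emb, emo_emb):
    # text_emb: [B, T, D_text]; emo_emb: [B, T, D_emo] or [B, D_emo].
    if emo_emb.ndim == 2:
        # Expand [B, D_emo] -> [B, 1, D_emo], then repeat along the time axis
        # so both tensors share the same [B, T, ...] leading dimensions.
        emo_emb = np.repeat(emo_emb[:, None, :], text_emb.shape[1], axis=1)
    # Concatenate along the last dimension: result is [B, T, D_text + D_emo].
    return np.concatenate([text_emb, emo_emb], axis=-1)

text = np.zeros((2, 10, 64))   # text embedding features  [B2, T2, D]
emo3 = np.ones((2, 10, 32))    # emotion embedding features [B2, T2, D2]
emo2 = np.ones((2, 32))        # per-utterance emotion embedding [B2, D2]
```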
206. Use a neural vocoder to perform speech conversion on the target Mel spectrum data to generate the target emotional speech.
The server uses a neural vocoder to convert the target Mel spectrum data into the target emotional speech.
It should be noted that, in this embodiment, the neural vocoder is WaveGlow, and the target Mel spectrum data is the input of the neural vocoder, with a frame length of 1024 and a frame shift of 256. The target Mel spectrum data is first input into the affine coupling layers of the neural vocoder for scaling and transformation to generate emotional speech features; invertible convolutions are then applied to the emotional speech features to generate the target emotional speech "Really! (surprised emotion) Congratulations! (happy emotion)".
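The affine coupling step at the heart of such flow-based vocoders, a scale-and-shift of one half of the input conditioned on the other half, which makes the transform trivially invertible, can be sketched as follows. The conditioning network here is a simple linear stand-in, not WaveGlow's actual WaveNet-style conditioning transform.

```python
import numpy as np

def affine_coupling_forward(x, cond_net):
    # Split channels in half; transform xb conditioned on xa (xa passes through).
    xa, xb = np.split(x, 2, axis=-1)
    log_s, t = cond_net(xa)                 # scale and shift predicted from xa
    return np.concatenate([xa, xb * np.exp(log_s) + t], axis=-1)

def affine_coupling_inverse(y, cond_net):
    # Because ya == xa, the same cond_net output can undo the transform exactly.
    ya, yb = np.split(y, 2, axis=-1)
    log_s, t = cond_net(ya)
    return np.concatenate([ya, (yb - t) * np.exp(-log_s)], axis=-1)

# Stand-in conditioning network: a fixed linear map producing (log_s, t).
rng = np.random.RandomState(0)
W = rng.randn(4, 8) * 0.1
cond = lambda xa: np.split(xa @ W, 2, axis=-1)

x = rng.randn(3, 8)
y = affine_coupling_forward(x, cond)
x_rec = affine_coupling_inverse(y, cond)
```

Exact invertibility is what lets such a vocoder be trained on waveforms and then run in the generative direction at synthesis time.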
In the embodiments of the present application, emotion embedding features are generated through a pre-trained emotion recognition network in combination with Mel spectrum features and a position encoding, and the emotion embedding features are then concatenated with the text data to generate the target emotional speech, which solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of synthesized speech.
The method for synthesizing emotional speech in the embodiments of the present application has been described above; the apparatus for synthesizing emotional speech in the embodiments of the present application is described below. Referring to FIG. 3, one embodiment of the apparatus for synthesizing emotional speech in the embodiments of the present application includes:
a to-be-recognized data acquisition module 301, configured to acquire to-be-recognized speech data and corresponding text data;
an embedding feature generation module 302, configured to input the to-be-recognized speech data into a pre-trained emotion recognition network to generate Mel spectrum features and a position encoding, and to process the Mel spectrum features and the position encoding in the emotion recognition network to generate emotion embedding features;
a Mel spectrum data generation module 303, configured to input the emotion embedding features and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data;
a speech conversion module 304, configured to perform speech conversion on the target Mel spectrum data using a neural vocoder to generate target emotional speech.
In the embodiments of the present application, emotion embedding features are generated through a pre-trained emotion recognition network in combination with Mel spectrum features and a position encoding, and the emotion embedding features are then concatenated with the text data to generate the target emotional speech, which solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of synthesized speech.
Referring to FIG. 4, another embodiment of the apparatus for synthesizing emotional speech in the embodiments of the present application includes:
a to-be-recognized data acquisition module 301, configured to acquire to-be-recognized speech data and corresponding text data;
an embedding feature generation module 302, configured to input the to-be-recognized speech data into a pre-trained emotion recognition network to generate Mel spectrum features and a position encoding, and to process the Mel spectrum features and the position encoding in the emotion recognition network to generate emotion embedding features;
a Mel spectrum data generation module 303, configured to input the emotion embedding features and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data;
a speech conversion module 304, configured to perform speech conversion on the target Mel spectrum data using a neural vocoder to generate target emotional speech.
Optionally, the embedding feature generation module 302 includes:
a Mel spectrum feature generation unit 3021, configured to input the to-be-recognized speech data into the pre-trained emotion recognition network to generate Mel spectrum features;
a position encoding generation unit 3022, configured to generate a position encoding according to the Mel spectrum features and a preset position conversion formula;
an encoding unit 3023, configured to input the Mel spectrum features and the position encoding into the encoder of the emotion recognition network for encoding to generate emotion embedding features.
Optionally, the Mel spectrum feature generation unit 3021 may be further specifically configured to:
perform windowing on the to-be-recognized speech data to generate windowed speech data;
perform a short-time Fourier transform on the windowed speech data to generate Fourier-transformed speech data;
process the Fourier-transformed speech data with a Mel filter bank to generate Mel spectrum features.
Optionally, the position encoding generation unit 3022 may be further specifically configured to:
read the length of the Mel spectrum feature and the position of the Mel spectrum feature;
generate a position input value based on the length of the Mel spectrum feature and the position of the Mel spectrum feature;
input the position input vector into a preset position conversion formula to generate a position encoding.
Optionally, the encoding unit 3023 may be further specifically configured to:
input the Mel spectrum features and the position encoding into the multi-head self-attention layer of the emotion recognition network and, combined with residual connections, generate an initial emotion feature vector;
input the initial emotion feature vector into the feed-forward layer of the emotion recognition network for convolution to generate emotion embedding features.
Optionally, the Mel spectrum data generation module 303 may be further specifically configured to:
convert the text data into text embedding features in the pre-trained speech synthesis network;
concatenate the text embedding features and the emotion embedding features in chronological order to generate target Mel spectrum data.
Optionally, the apparatus for synthesizing emotional speech further includes:
a training data acquisition module 305, configured to acquire emotional speech training data, emotion label data, and text training data;
a training module 306, configured to perform model training using the emotional speech training data and the emotion label data in combination with a layer normalization mechanism to generate the pre-trained emotion recognition network, and to perform model training using the emotional speech training data and the text training data to generate the pre-trained speech synthesis network.
In the embodiments of the present application, emotion embedding features are generated through a pre-trained emotion recognition network in combination with Mel spectrum features and a position encoding, and the emotion embedding features are then concatenated with the text data to generate the target emotional speech, which solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of synthesized speech.
FIG. 3 and FIG. 4 above describe the apparatus for synthesizing emotional speech in the embodiments of the present application in detail from the perspective of modular functional entities; the device for synthesizing emotional speech in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a device for synthesizing emotional speech provided by an embodiment of the present application. The device 500 for synthesizing emotional speech may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may provide transient or persistent storage. A program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations in the device 500 for synthesizing emotional speech. Further, the processor 510 may be configured to communicate with the storage medium 530 and execute, on the device 500, the series of instruction operations in the storage medium 530.
The device 500 for synthesizing emotional speech may further include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 5 does not constitute a limitation on the device for synthesizing emotional speech, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The present application further provides a device for synthesizing emotional speech. The computer device includes a memory and a processor; the memory stores computer-readable instructions that, when executed by the processor, cause the processor to execute the steps of the method for synthesizing emotional speech in the above embodiments.
The present application further provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the steps of the method for synthesizing emotional speech.
Those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, apparatuses, and units described above, which are not repeated here.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, each data block containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
If implemented in the form of a software functional unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of the technical features thereof may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (22)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011432589.4A CN112562700B (en) | 2020-12-10 | 2020-12-10 | Emotional speech synthesis method, device, equipment and storage medium |
| CN202011432589.4 | 2020-12-10 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022121169A1 true WO2022121169A1 (en) | 2022-06-16 |
Family
ID=75060069
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/083559 Ceased WO2022121169A1 (en) | 2020-12-10 | 2021-03-29 | Emotional speech synthesis method, apparatus, and device, and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN112562700B (en) |
| WO (1) | WO2022121169A1 (en) |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112562700A (en) * | 2020-12-10 | 2021-03-26 | 平安科技(深圳)有限公司 | Emotional voice synthesis method, device, equipment and storage medium |
| CN116189715A (en) * | 2022-12-13 | 2023-05-30 | 中国科学院声学研究所 | A method and device for detecting lung disease using cough sound |
| CN116486781A (en) * | 2023-05-06 | 2023-07-25 | 平安科技(深圳)有限公司 | Speech synthesis method combined with emotional strength, electronic device and readable storage medium |
| CN116665639A (en) * | 2023-06-16 | 2023-08-29 | 平安科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, electronic device and storage medium |
| CN117079637A (en) * | 2023-06-19 | 2023-11-17 | 内蒙古工业大学 | Mongolian emotion voice synthesis method based on condition generation countermeasure network |
| CN119559930A (en) * | 2024-11-26 | 2025-03-04 | 平安科技(深圳)有限公司 | A method, device, equipment and medium for singing synthesis based on controllable noise |
| CN120260539A (en) * | 2025-06-03 | 2025-07-04 | 国网浙江省电力有限公司营销服务中心 | A method and system for generating conversational speech based on emotion perception adapter and large model reasoning |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113112987B (en) * | 2021-04-14 | 2024-05-03 | 北京地平线信息技术有限公司 | Speech synthesis method, training method and device of speech synthesis model |
| CN113436621B (en) * | 2021-06-01 | 2022-03-15 | 深圳市北科瑞声科技股份有限公司 | GPU (graphics processing Unit) -based voice recognition method and device, electronic equipment and storage medium |
| CN113436608B (en) * | 2021-06-25 | 2023-11-28 | 平安科技(深圳)有限公司 | Double-flow voice conversion method, device, equipment and storage medium |
| CN114842881B (en) * | 2022-06-07 | 2025-06-17 | 四川启睿克科技有限公司 | Unsupervised emotional speech synthesis device and method |
| CN115273906B (en) * | 2022-07-29 | 2025-08-19 | 平安科技(深圳)有限公司 | Speech emotion conversion method, speech emotion conversion device, apparatus, and storage medium |
| CN118571266A (en) * | 2024-06-28 | 2024-08-30 | 南京龙垣信息科技有限公司 | Emotion voice synthesis method and system for identity encryption |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190355347A1 (en) * | 2018-05-18 | 2019-11-21 | Baidu Usa Llc | Spectrogram to waveform synthesis using convolutional networks |
| CN110675881A (en) * | 2019-09-05 | 2020-01-10 | 北京捷通华声科技股份有限公司 | Voice verification method and device |
| CN111667812A (en) * | 2020-05-29 | 2020-09-15 | 北京声智科技有限公司 | Voice synthesis method, device, equipment and storage medium |
| CN111883106A (en) * | 2020-07-27 | 2020-11-03 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method and device |
| CN112562700A (en) * | 2020-12-10 | 2021-03-26 | 平安科技(深圳)有限公司 | Emotional voice synthesis method, device, equipment and storage medium |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109754779A (en) * | 2019-01-14 | 2019-05-14 | 出门问问信息科技有限公司 | Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing |
| CN109754778B (en) * | 2019-01-17 | 2023-05-30 | 平安科技(深圳)有限公司 | Text speech synthesis method and device and computer equipment |
| CN110379409B (en) * | 2019-06-14 | 2024-04-16 | 平安科技(深圳)有限公司 | Speech synthesis method, system, terminal device and readable storage medium |
- 2020-12-10: CN CN202011432589.4A patent/CN112562700B/en active Active
- 2021-03-29: WO PCT/CN2021/083559 patent/WO2022121169A1/en not_active Ceased
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190355347A1 (en) * | 2018-05-18 | 2019-11-21 | Baidu Usa Llc | Spectrogram to waveform synthesis using convolutional networks |
| CN110675881A (en) * | 2019-09-05 | 2020-01-10 | 北京捷通华声科技股份有限公司 | Voice verification method and device |
| CN111667812A (en) * | 2020-05-29 | 2020-09-15 | 北京声智科技有限公司 | Voice synthesis method, device, equipment and storage medium |
| CN111883106A (en) * | 2020-07-27 | 2020-11-03 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method and device |
| CN112562700A (en) * | 2020-12-10 | 2021-03-26 | 平安科技(深圳)有限公司 | Emotional voice synthesis method, device, equipment and storage medium |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112562700A (en) * | 2020-12-10 | 2021-03-26 | 平安科技(深圳)有限公司 | Emotional voice synthesis method, device, equipment and storage medium |
| CN112562700B (en) * | 2020-12-10 | 2025-04-29 | 平安科技(深圳)有限公司 | Emotional speech synthesis method, device, equipment and storage medium |
| CN116189715A (en) * | 2022-12-13 | 2023-05-30 | 中国科学院声学研究所 | A method and device for detecting lung disease using cough sound |
| CN116486781A (en) * | 2023-05-06 | 2023-07-25 | 平安科技(深圳)有限公司 | Speech synthesis method combined with emotional strength, electronic device and readable storage medium |
| CN116665639A (en) * | 2023-06-16 | 2023-08-29 | 平安科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, electronic device and storage medium |
| CN117079637A (en) * | 2023-06-19 | 2023-11-17 | 内蒙古工业大学 | Mongolian emotion voice synthesis method based on condition generation countermeasure network |
| CN119559930A (en) * | 2024-11-26 | 2025-03-04 | 平安科技(深圳)有限公司 | A method, device, equipment and medium for singing synthesis based on controllable noise |
| CN119559930B (en) * | 2024-11-26 | 2025-11-21 | 平安科技(深圳)有限公司 | Singing voice synthesizing method, device, equipment and medium based on noise control |
| CN120260539A (en) * | 2025-06-03 | 2025-07-04 | 国网浙江省电力有限公司营销服务中心 | A method and system for generating conversational speech based on emotion perception adapter and large model reasoning |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112562700A (en) | 2021-03-26 |
| CN112562700B (en) | 2025-04-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2022121169A1 (en) | Emotional speech synthesis method, apparatus, and device, and storage medium | |
| US11847727B2 (en) | Generating facial position data based on audio data | |
| CN110264991B (en) | Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium | |
| CN112562691B (en) | Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium | |
| US20230351998A1 (en) | Text and audio-based real-time face reenactment | |
| CN111627418A (en) | Training method, synthesizing method, system, device and medium for speech synthesis model | |
| CN116250036A (en) | System and method for synthesizing photo-level realistic video of speech | |
| CN111212245B (en) | Method and device for synthesizing video | |
| WO2022007438A1 (en) | Emotional voice data conversion method, apparatus, computer device, and storage medium | |
| CN112634920A (en) | Method and device for training voice conversion model based on domain separation | |
| US12400631B2 (en) | Method, electronic device, and computer program product for generating cross-modality encoder | |
| CN113327578B (en) | Acoustic model training method and device, terminal equipment and storage medium | |
| JP2023169230A (en) | Computer program, server device, terminal device, learned model, program generation method, and method | |
| CN114898380A (en) | Method, device and equipment for generating handwritten text image and storage medium | |
| Melechovsky et al. | DART: Disentanglement of accent and speaker representation in multispeaker text-to-speech | |
| CN114495977B (en) | Speech translation and model training methods, devices, electronic devices and storage media | |
| CN113270090B (en) | Combined model training method and equipment based on ASR model and TTS model | |
| CN114842860A (en) | Voice conversion method, device and equipment based on quantization coding and storage medium | |
| CN114187892A (en) | Style migration synthesis method and device and electronic equipment | |
| Pham et al. | Style transfer for 2d talking head generation | |
| US20250104692A1 (en) | Text-to-audio conversion with byte-encoding vectors | |
| Gu et al. | A voice anonymization method based on content and non-content disentanglement for emotion preservation | |
| CN113889129B (en) | Speech conversion method, device, equipment and storage medium | |
| Patel et al. | Adagan: Adaptive gan for many-to-many non-parallel voice conversion | |
| Ko et al. | Adversarial training of denoising diffusion model using dual discriminators for high-fidelity multi-speaker tts |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21901885 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 21901885 Country of ref document: EP Kind code of ref document: A1 |