WO2022121169A1 - Emotional speech synthesis method, apparatus, and device, and storage medium - Google Patents

Emotional speech synthesis method, apparatus, and device, and storage medium

Info

Publication number
WO2022121169A1
WO2022121169A1 · PCT/CN2021/083559 · CN2021083559W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
data
generate
emotional
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/083559
Other languages
French (fr)
Chinese (zh)
Inventor
梁爽
陈闽川
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of WO2022121169A1 publication Critical patent/WO2022121169A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • G06F17/142Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the present application relates to the technical field of speech synthesis, and in particular, to a method, apparatus, device and storage medium for synthesizing emotional speech.
  • In the prior art, speech synthesis methods are mainly hidden-Markov-based or neural-network-based speech synthesis methods.
  • The inventor realized that although these two speech synthesis methods can obtain good synthesized speech, the resulting synthesized speech is flat and lacks emotion, so speech rich in emotion cannot be obtained.
  • The present application provides a method, apparatus, device and storage medium for synthesizing emotional speech, which are used to solve the problem that synthesized speech is flat and lacks emotion, and to increase the diversity of the synthesized speech.
  • A first aspect of the present application provides a method for synthesizing emotional speech, including: acquiring speech data to be recognized and corresponding text data; inputting the speech data to be recognized into a pre-trained emotion recognition network to generate a Mel spectrum feature and a position code, and processing the Mel spectrum feature and the position code in the emotion recognition network to generate an emotion embedding feature; inputting the emotion embedding feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data; and using a neural vocoder to perform speech conversion on the target Mel spectrum data to generate target emotional speech.
  • A second aspect of the present application provides an apparatus for synthesizing emotional speech, comprising: an acquisition module for acquiring to-be-recognized speech data and corresponding text data; an embedded feature generation module for inputting the to-be-recognized speech data into a pre-trained emotion recognition network to generate a Mel spectrum feature and a position code, and for processing the Mel spectrum feature and the position code in the emotion recognition network to generate an emotion embedding feature; a Mel spectrum data generation module for inputting the emotion embedding feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data; and a speech conversion module for performing speech conversion on the target Mel spectrum data with a neural vocoder to generate target emotional speech.
  • A third aspect of the present application provides a device for synthesizing emotional speech, including a memory and at least one processor, where instructions are stored in the memory; the at least one processor invokes the instructions in the memory so that the device for synthesizing emotional speech performs the following method for synthesizing emotional speech:
  • A fourth aspect of the present application provides a computer-readable storage medium in which instructions are stored; when run on a computer, the instructions cause the computer to execute the following method for synthesizing emotional speech:
  • the speech data to be recognized and the corresponding text data are obtained;
  • the speech data to be recognized is input into a pre-trained emotion recognition network to generate a Mel spectrum feature and a position code, which are processed in the emotion recognition network to generate an emotion embedding feature;
  • the emotion embedding feature and the text data are input into a pre-trained speech synthesis network to generate target Mel spectrum data;
  • Use a neural vocoder to convert the target Mel spectrum data to generate target emotional speech.
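  • To make the four steps above concrete, the following is a minimal Python sketch of the pipeline; the network and vocoder objects and every method name (extract_mel, position_encode, encode, infer) are hypothetical placeholders for illustration, not APIs defined in this application.

```python
# Hypothetical end-to-end sketch of the described pipeline; all object and
# method names are illustrative placeholders.
def synthesize_emotional_speech(speech_to_recognize, text,
                                emotion_net, synthesis_net, vocoder):
    # Step 2: emotion recognition network -> Mel spectrum feature, position
    # code, and the emotion embedding feature derived from both.
    mel_feature = emotion_net.extract_mel(speech_to_recognize)
    position_code = emotion_net.position_encode(mel_feature)
    emotion_embedding = emotion_net.encode(mel_feature, position_code)

    # Step 3: the speech synthesis network splices the text representation
    # with the emotion embedding to produce target Mel spectrum data.
    target_mel = synthesis_net(text, emotion_embedding)

    # Step 4: a neural vocoder converts the target Mel spectrum data into
    # the target emotional speech waveform.
    return vocoder.infer(target_mel)
```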
  • In the embodiments of the present application, the pre-trained emotion recognition network combines the Mel spectrum feature and the position code to generate the emotion embedding feature, and the emotion embedding feature is then spliced with the text data to generate the target emotional speech. This solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of the synthesized speech.
  • FIG. 1 is a schematic diagram of an embodiment of a method for synthesizing emotional speech in an embodiment of the present application
  • FIG. 2 is a schematic diagram of another embodiment of a method for synthesizing emotional speech in an embodiment of the present application
  • FIG. 3 is a schematic diagram of an embodiment of an apparatus for synthesizing emotional speech in an embodiment of the present application
  • FIG. 4 is a schematic diagram of another embodiment of an apparatus for synthesizing emotional speech in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an embodiment of a device for synthesizing emotional speech in an embodiment of the present application.
  • Embodiments of the present application provide a method, apparatus, device, and storage medium for synthesizing emotional speech, which are used to solve the problem that synthesized speech is dull and lacks emotion, and to increase the diversity of the synthesized speech.
  • an embodiment of the method for synthesizing emotional speech in the embodiment of the present application includes:
  • The server obtains the speech data to be recognized and the text data corresponding to the speech data to be recognized. It should be emphasized that, to further ensure the privacy and security of the above-mentioned to-be-recognized speech data and text data, they can also be stored in a node of a blockchain.
  • the to-be-recognized speech data is the to-be-recognized speech data with emotion, which may be the to-be-recognized speech data with happy emotion, the to-be-recognized speech data with surprised emotion, and/or the to-be-recognized speech data with anger emotion.
  • When the server acquires the to-be-recognized speech data with emotion, it also acquires the corresponding text data. For example, if the to-be-recognized speech data with emotion is "Really! Congratulations!", then when acquiring the to-be-recognized speech data "Really! Congratulations!", the server also obtains the text data "Really! Congratulations!".
  • the execution subject of the present application may be a device for synthesizing emotional speech, and may also be a terminal or a server, which is not specifically limited here.
  • the embodiments of the present application take the server as an execution subject as an example for description.
  • The server inputs the speech data to be recognized into the pre-trained emotion recognition network, first generating the Mel spectrum feature and the position code, and then processes the Mel spectrum feature and the position code in the emotion recognition network to generate the emotion embedding feature.
  • The server inputs the speech data "Really! Congratulations!" into the pre-trained emotion recognition network for calculation and generates the Mel spectrum feature [B1, T1, D1] and the position code P, where the position code P is generated from the Mel spectrum feature and is in fact a hidden-layer output.
  • The server then combines the Mel spectrum feature [B1, T1, D1] and the position code P for calculation to generate the emotion embedding feature [B2, T2, D2].
  • the server inputs the emotion embedded features and text data into the pre-trained speech synthesis network for calculation, and generates the target mel spectrum data.
  • The server inputs the emotion embedding feature [B2, T2, D2] and the text data "Really! Congratulations!" into the pre-trained speech synthesis network for calculation.
  • In this embodiment, the speech synthesis network includes an encoder in which features are extracted from the text data "Really! Congratulations!" to generate an extraction result, and the extraction result is spliced with the emotion embedding feature [B2, T2, D2] to generate the target Mel spectrum data [B2, T2, D2+D].
  • the server uses a neural vocoder to convert the target mel-spectral data into target emotional speech.
  • It should be noted that, in this embodiment, the neural vocoder is WaveGlow, and the target Mel spectrum data is the input of the neural vocoder, with a frame length of 1024 and a frame shift of 256.
  • The target Mel spectrum data is first input into the affine coupling layers of the neural vocoder for scaling and transformation to generate emotional speech features, and invertible convolutions are then applied to the emotional speech features to generate the target emotional speech "Really! (surprised emotion) Congratulations! (happy emotion)".
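  • A hedged sketch of this vocoding step is given below, assuming a pre-trained WaveGlow model exposing the .infer(mel) interface of the public NVIDIA implementation; the sigma value and tensor shapes are assumptions, not values stated in this application.

```python
# Sketch only: assumes a pre-trained WaveGlow model with an .infer() method
# as in the public NVIDIA implementation; sigma and shapes are assumptions.
import torch

def mel_to_emotional_speech(waveglow, target_mel, sigma=0.666):
    # target_mel: [batch, n_mels, frames], computed with a 1024-sample frame
    # length and a 256-sample frame shift as stated in the text.
    waveglow.eval()
    with torch.no_grad():
        # The affine coupling layers scale and shift the signal conditioned on
        # the mel input, and invertible 1x1 convolutions mix channels; .infer()
        # runs this normalizing flow in the generative direction.
        audio = waveglow.infer(target_mel, sigma=sigma)
    return audio  # waveform tensor of the target emotional speech
```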
  • In the embodiments of the present application, the pre-trained emotion recognition network combines the Mel spectrum feature and the position code to generate the emotion embedding feature, and the emotion embedding feature is then spliced with the text data to generate the target emotional speech. This solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of the synthesized speech.
  • another embodiment of the method for synthesizing emotional speech in the embodiment of the present application includes:
  • the server obtains emotional speech training data, emotional label data and text training data from the big data platform or database.
  • emotional speech training data can be divided into emotional speech training data including noise and emotional speech training data not including noise.
  • The emotional speech training data can be speech training data carrying emotion, such as "too much", "really" or "too good", together with the corresponding emotion label data and text training data. The emotional speech training data "too much" corresponds to the emotion label data "anger" and the text training data "too much"; the emotional speech training data "really" corresponds to the emotion label data "surprise" and the text training data "really"; and the emotional speech training data "too good" corresponds to the emotion label data "happy" and the text training data "too good".
  • the server performs training based on the emotional voice training data and emotional label data, combined with the regularization mechanism, to generate a pre-trained emotion recognition network, and then performs model training based on the emotional voice training data and text training data to generate a pre-trained speech synthesis network.
  • The pre-trained emotion recognition network is used to extract emotional features, so the emotion recognition network is trained with the emotional speech training data and the emotion label data; the pre-trained speech synthesis network is used to synthesize emotional speech, so the speech synthesis network is trained with the emotional speech training data and the text training data.
  • Although both training processes use emotional speech training data, when training the emotion recognition network the emotional speech training data may either include or exclude noise, whereas training the speech synthesis network requires high-quality training data, that is, emotional speech training data that does not include noise.
  • In the process of training the emotion recognition network, the server applies a layer regularization mechanism, mainly by adding a layer regularization step after each sub-layer: the mean and variance of the output of layer i are calculated over the channel dimension, the mean is subtracted from the output of layer i, and the result is divided by the variance, so that the output of layer i has a mean of 0 and a variance of 1.
  • the layer regularization mechanism can make the distribution of training data consistent and make the training process stable.
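  • A minimal sketch of the described layer regularization (layer normalization) step is given below; note that the standard formulation divides by the standard deviation rather than the raw variance, and that reading is assumed here.

```python
# Minimal layer-normalization sketch of the mechanism described above.
import torch

def layer_regularize(x, eps=1e-5):
    # x: output of layer i with shape [..., channels].
    mean = x.mean(dim=-1, keepdim=True)                 # mean over channels
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # variance over channels
    # Subtract the mean and divide by the standard deviation so the output
    # has zero mean and unit variance along the channel dimension.
    return (x - mean) / torch.sqrt(var + eps)
```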
  • The server obtains the speech data to be recognized and the text data corresponding to the speech data to be recognized. It should be emphasized that, to further ensure the privacy and security of the above-mentioned to-be-recognized speech data and text data, they can also be stored in a node of a blockchain.
  • the to-be-recognized speech data is the to-be-recognized speech data with emotion, which may be the to-be-recognized speech data with happy emotion, the to-be-recognized speech data with surprised emotion, and/or the to-be-recognized speech data with anger emotion.
  • When the server acquires the to-be-recognized speech data with emotion, it also acquires the corresponding text data. For example, if the to-be-recognized speech data with emotion is "Really! Congratulations!", then when acquiring the to-be-recognized speech data "Really! Congratulations!", the server also obtains the text data "Really! Congratulations!".
  • The server inputs the speech data to be recognized into the pre-trained emotion recognition network, first generating the Mel spectrum feature and the position code, and then processes the Mel spectrum feature and the position code in the emotion recognition network to generate the emotion embedding feature.
  • The server inputs the speech data "Really! Congratulations!" into the pre-trained emotion recognition network for calculation and generates the Mel spectrum feature [B1, T1, D1] and the position code P, where the position code P is generated from the Mel spectrum feature and is in fact a hidden-layer output.
  • The server then combines the Mel spectrum feature [B1, T1, D1] and the position code P for calculation to generate the emotion embedding feature [B2, T2, D2].
  • Specifically, the server inputs the speech data to be recognized into the pre-trained emotion recognition network to generate the Mel spectrum feature; the server generates the position code from the Mel spectrum feature and a preset position conversion formula; and the server inputs the Mel spectrum feature and the position code into the encoder of the emotion recognition network for encoding to generate the emotion embedding feature.
  • the server inputs the speech data of "really! disappointment! into the pre-trained emotion recognition network, and firstly inputs "really! disappointment! into the trained emotion recognition network for feature extraction.
  • The server inputs the Mel spectrum feature [B1, T1, D1] and the position code P into the encoder of the emotion recognition network for encoding to generate the emotion embedding feature. The encoder of the emotion recognition network includes five identical modules and one long short-term memory (LSTM) layer, where each module includes two sub-layers, namely a multi-head self-attention layer and a forward propagation layer. [B1, T1, D1] and P are encoded in the encoder to generate the emotion embedding feature [B2, T2, D2].
  • the server inputs the speech data to be recognized into the pre-trained emotion recognition network, and the generated Mel spectrum features include:
  • the server performs windowing processing on the speech data to be recognized to generate windowed speech data; then, the server performs short-time Fourier transform on the windowed speech data to generate Fourier transformed speech data; finally, the server uses a Mel filter bank to process the Fourier transformed speech data to generate Mel spectrum features.
  • Specifically, the server uses a window function to window the speech data to be recognized, "Really! Congratulations!", and generates windowed speech data; the server then performs a Fourier transform on the windowed speech data, determining the frequency and phase of the windowed speech data to generate Fourier-transformed speech data; finally, the server processes the Fourier-transformed speech data with a Mel filter bank to generate the Mel spectrum feature.
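  • The windowing, short-time Fourier transform, and Mel filter bank chain can be sketched with librosa as below; the sample rate, number of Mel bands, and log compression are assumptions, and the 1024/256 frame length and shift are borrowed from the vocoder description elsewhere in the text.

```python
# Hedged sketch of Mel-feature extraction: windowing, STFT, Mel filter bank.
import librosa
import numpy as np

def extract_mel_feature(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    y, _ = librosa.load(wav_path, sr=sr)
    # A Hann window is applied to each 1024-sample frame before the STFT.
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window="hann")
    power_spec = np.abs(stft) ** 2
    # The Mel filter bank maps the linear-frequency bins to n_mels bands.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = mel_fb @ power_spec
    return np.log(mel_spec + 1e-6)  # log compression (an assumption)
```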
  • Generating the position code by the server according to the Mel spectrum feature and the preset position conversion formula includes:
  • The server reads the length of the Mel spectrum feature and the position of the Mel spectrum feature; the server generates a position input value based on the length of the Mel spectrum feature and the position of the Mel spectrum feature; and the server inputs the position input value into the preset position conversion formula to generate the position code.
  • the preset position conversion formula is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where:
  • pos is the position of the mel spectral feature
  • 2i represents the even dimension
  • 2i+1 represents the odd dimension
  • d_model represents the preset dimension of the vector corresponding to the position of the mel spectral feature, such as 256.
  • For example, if the server reads the length of the mel spectral feature as 5 and its position as 0, the server determines the position input value to be [0, 1, 2, 3, 4] based on the length "5" and the position "0" of the mel spectral feature, and then calculates the position input value [0, 1, 2, 3, 4] with the above formula to generate the position code P.
  • P here is only a placeholder and does not represent specific position encoding data.
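  • Under the standard Transformer reading of the formula above, the position code can be computed as sketched below; d_model = 256 follows the example dimension given in the text.

```python
# Sinusoidal position-code sketch matching the formula above.
import numpy as np

def position_code(length, d_model=256):
    # Position input values 0 .. length-1, e.g. [0, 1, 2, 3, 4] for length 5.
    pos = np.arange(length)[:, None]               # shape [length, 1]
    i = np.arange(d_model // 2)[None, :]           # shape [1, d_model/2]
    angle = pos / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions 2i
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions 2i+1
    return pe                                      # the position code "P"
```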
  • Inputting the mel spectrum feature and the position encoding into the encoder of the emotion recognition network for encoding to generate the emotion embedding feature includes:
  • The server inputs the mel spectrum feature and the position encoding into the multi-head self-attention layer of the emotion recognition network and, combined with residual connections, generates an initial emotion feature vector; the server then inputs the initial emotion feature vector into the forward propagation layer of the emotion recognition network for convolution to generate the emotion embedding feature.
  • Specifically, the server first inputs the Mel spectrum feature [B1, T1, D1] into the multi-head self-attention layer for calculation and combines it with the residual connection to generate the initial emotion feature vector.
  • The multi-head self-attention layer is designed with the following formulas: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where:
  • Q, K, V are the input, that is, the mel spectrum feature
  • d k is the preset dimension vector, such as 256
  • head i is the ith head
  • each calculation in the multi-head self-attention layer is a head
  • W_i^Q, W_i^K, W_i^V are the weights, which are generated during the training process
  • Concat means that the heads are spliced together along the last dimension.
  • For example, with four heads whose dimension vectors are each [Bt, Tt, Dt], the server splices them together to generate the initial emotion feature vector [Bt, Tt, 4Dt], and W^O is a parameter learned in advance.
  • The initial emotion feature vector and the Mel spectrum feature input are convolved in the corresponding forward propagation layer to generate the emotion feature vector of the first module. Since the encoder includes five identical modules, this calculation is performed five times, and the output of the last module is input into a long short-term memory layer, thereby generating the emotion embedding feature [B2, T2, D2].
  • The residual connection adds the input of each multi-head self-attention layer to its output as the input of the next forward propagation layer; for the first multi-head self-attention layer, the Mel spectrum feature is added to the initial emotion feature vector to generate the input of the forward propagation layer, thereby improving the relevance of the generated emotion embedding feature.
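  • A hedged PyTorch sketch of the encoder described above (five identical modules of multi-head self-attention plus a convolutional forward propagation layer with residual connections and layer regularization, followed by one LSTM layer) is given below; the head count, kernel size, and hidden size are assumptions, and nn.MultiheadAttention handles the per-head projections W_i^Q, W_i^K, W_i^V and the output projection W^O internally.

```python
# Hedged sketch of the emotion-recognition encoder; hyperparameters are assumed.
import torch
import torch.nn as nn

class EncoderModule(nn.Module):
    def __init__(self, d_model=256, n_heads=4, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: [B, T, D]
        attn_out, _ = self.attn(x, x, x)       # Q = K = V = x (self-attention)
        x = self.norm1(x + attn_out)           # residual connection + layer norm
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)  # forward propagation layer
        return self.norm2(x + conv_out)

class EmotionEncoder(nn.Module):
    def __init__(self, d_model=256, n_modules=5):
        super().__init__()
        self.blocks = nn.ModuleList(EncoderModule(d_model) for _ in range(n_modules))
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, mel_feature, pos_code):
        x = mel_feature + pos_code             # combine Mel spectrum feature and position code
        for block in self.blocks:
            x = block(x)                       # five identical modules
        out, _ = self.lstm(x)                  # emotion embedding feature [B, T, D]
        return out
```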
  • the server inputs the emotion embedded features and text data into the pre-trained speech synthesis network for calculation, and generates the target mel spectrum data.
  • The server inputs the emotion embedding feature [B2, T2, D2] and the text data "Really! Congratulations!" into the pre-trained speech synthesis network for calculation.
  • In this embodiment, the speech synthesis network includes an encoder in which features are extracted from the text data "Really! Congratulations!" to generate an extraction result, and the extraction result is spliced with the emotion embedding feature [B2, T2, D2] to generate the target Mel spectrum data [B2, T2, D2+D].
  • the server converts text data into text embedding features in the pre-trained speech synthesis network; the server splices the text embedding features and emotion embedding features in order of time to generate target Mel spectrum data.
  • Specifically, the server first converts the text data into a text embedding feature in the same form as the emotion embedding feature, and then splices the text embedding feature and the emotion embedding feature in time order to generate the target Mel spectrum data.
  • For example, if the emotion embedding feature is [B2, T2, D2] and the text embedding feature is [B2, T2, D], the server splices [B2, T2, D2] with [B2, T2, D] to generate the target Mel spectrum data [B2, T2, D2+D].
  • If the dimension of the emotion embedding feature is [B2, D2], the server expands the emotion embedding feature to [B2, 1, D2] and then splices the emotion embedding feature with the text embedding feature.
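  • The splicing described above can be sketched as below; tensor shapes follow the examples in the text, and broadcasting a [B2, D2]-shaped emotion embedding across time after expanding it to [B2, 1, D2] is an assumption about how the expansion is used.

```python
# Sketch of splicing the text embedding with the emotion embedding.
import torch

def splice_embeddings(text_emb, emotion_emb):
    # text_emb:    [B, T, D_text]
    # emotion_emb: [B, T, D_emo] or [B, D_emo]
    if emotion_emb.dim() == 2:
        # Expand [B, D_emo] -> [B, 1, D_emo] and repeat it across time.
        emotion_emb = emotion_emb.unsqueeze(1).expand(-1, text_emb.size(1), -1)
    # Concatenate along the feature dimension: [B, T, D_text + D_emo],
    # corresponding to the target Mel spectrum data shape [B2, T2, D2+D].
    return torch.cat([text_emb, emotion_emb], dim=-1)
```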
  • the server uses a neural vocoder to convert the target mel-spectral data into target emotional speech.
  • It should be noted that, in this embodiment, the neural vocoder is WaveGlow, and the target Mel spectrum data is the input of the neural vocoder, with a frame length of 1024 and a frame shift of 256.
  • The target Mel spectrum data is first input into the affine coupling layers of the neural vocoder for scaling and transformation to generate emotional speech features, and invertible convolutions are then applied to the emotional speech features to generate the target emotional speech "Really! (surprised emotion) Congratulations! (happy emotion)".
  • In the embodiments of the present application, the pre-trained emotion recognition network combines the Mel spectrum feature and the position code to generate the emotion embedding feature, and the emotion embedding feature is then spliced with the text data to generate the target emotional speech. This solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of the synthesized speech.
  • An embodiment of the device for synthesizing emotional speech in the embodiment of the present application includes:
  • a to-be-recognized data acquisition module 301 configured to acquire to-be-recognized voice data and corresponding text data;
  • the embedded feature generation module 302 is used to input the speech data to be recognized into the pre-trained emotion recognition network to generate a mel spectrum feature and a position code, and to process the mel spectrum feature and the position code in the emotion recognition network to generate an emotion embedded feature;
  • Mel spectrum data generation module 303 for inputting the emotion embedded feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data
  • the speech conversion module 304 is configured to perform speech conversion on the target mel spectrum data by using a neural vocoder to generate a target emotional speech.
  • In the embodiments of the present application, the pre-trained emotion recognition network combines the Mel spectrum feature and the position code to generate the emotion embedding feature, and the emotion embedding feature is then spliced with the text data to generate the target emotional speech. This solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of the synthesized speech.
  • another embodiment of the apparatus for synthesizing emotional speech in the embodiment of the present application includes:
  • a to-be-recognized data acquisition module 301 configured to acquire to-be-recognized voice data and corresponding text data;
  • the embedded feature generation module 302 is used to input the speech data to be recognized into the pre-trained emotion recognition network to generate a mel spectrum feature and a position code, and to process the mel spectrum feature and the position code in the emotion recognition network to generate an emotion embedded feature;
  • Mel spectrum data generation module 303 for inputting the emotion embedded feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data
  • the speech conversion module 304 is configured to perform speech conversion on the target mel spectrum data by using a neural vocoder to generate a target emotional speech.
  • the embedded feature generation module 302 includes:
  • Mel spectrum feature generation unit 3021 for inputting the speech data to be recognized into the pre-trained emotion recognition network to generate Mel spectrum features
  • a position code generation unit 3022 configured to generate a position code according to the mel spectrum feature and a preset position conversion formula
  • the encoding unit 3023 is configured to input the mel spectrum feature and the position code into the encoder of the emotion recognition network for encoding, and generate an emotion embedded feature.
  • the mel spectrum feature generating unit 3021 can also be specifically used for:
  • performing windowing on the to-be-recognized speech data to generate windowed speech data;
  • performing a short-time Fourier transform on the windowed speech data to generate Fourier-transformed speech data;
  • processing the Fourier-transformed speech data with a Mel filter bank to generate the Mel spectrum feature.
  • the location code generation unit 3022 can also be specifically used for:
  • the encoding unit 3023 can also be specifically used for:
  • the initial emotion feature vector is input into the forward propagation layer of the emotion recognition network for convolution to generate emotion embedded features.
  • the Mel spectrum data generation module 303 can also be specifically used for:
  • the text embedding feature and the emotion embedding feature are spliced to generate target mel spectrum data.
  • the device for synthesizing emotional speech further includes:
  • a training data acquisition module 305 configured to acquire emotional speech training data, emotional label data and text training data
  • the training module 306 is configured to use the emotional voice training data and the emotional label data for model training in combination with a layer regularization mechanism to generate a pre-trained emotion recognition network, and to use the emotional voice training data and the text training data for model training to generate a pre-trained speech synthesis network.
  • In the embodiments of the present application, the pre-trained emotion recognition network combines the Mel spectrum feature and the position code to generate the emotion embedding feature, and the emotion embedding feature is then spliced with the text data to generate the target emotional speech. This solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of the synthesized speech.
  • the device 500 for synthesizing emotional speech may vary greatly due to different configurations or performances, and may include one or more processors (central processing units, CPU) 510 (eg, one or more processors) and memory 520, one or more storage media 530 (eg, one or more mass storage devices) that store application programs 533 or data 532.
  • the memory 520 and the storage medium 530 may be short-term storage or persistent storage.
  • the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the device 500 for synthesizing emotional speech.
  • the processor 510 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the emotional speech synthesis device 500.
  • the emotional speech synthesis device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input and output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and more.
  • the present application also provides a device for synthesizing emotional speech.
  • The computer device includes a memory and a processor, and computer-readable instructions are stored in the memory; when executing the computer-readable instructions, the processor performs the steps of the method for synthesizing emotional speech in the foregoing embodiments.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium may also be a volatile computer-readable storage medium.
  • the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to execute the steps of the method for synthesizing emotional speech.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database, a chain of data blocks associated with one another by cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, and the computer software products are stored in a storage medium , including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Discrete Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

An emotional speech synthesis method, apparatus, and device, and a storage medium, for use in solving the problems that a synthesized speech is flat and lacks emotions, and increasing the diversity of the synthesized speech. The emotional speech synthesis method comprises: obtaining speech data to be recognized and corresponding text data (101); inputting said speech data into a pretrained emotion recognition network to generate a Mel spectrum feature and a position code, and performing processing in the emotion recognition network in combination with the Mel spectrum feature and the position code to generate an emotion embedding feature (102); inputting the emotion embedding feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data (103); and performing speech conversion on the target Mel spectrum data by using a neural vocoder to generate a target emotional speech (104). In addition, the method also relates to a blockchain technology, and said speech data and the text data can be stored in a blockchain.

Description

情感语音的合成方法、装置、设备及存储介质Emotional speech synthesis method, device, device and storage medium

本申请要求于2020年12月10日提交中国专利局、申请号为202011432589.4、发明名称为“情感语音的合成方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在申请中。This application claims the priority of the Chinese patent application with the application number 202011432589.4 and the invention titled "Method, Apparatus, Equipment and Storage Medium for Emotional Speech Synthesis" filed with the China Patent Office on December 10, 2020, the entire contents of which are by reference incorporated in the application.

技术领域technical field

本申请涉及语音合成技术领域,尤其涉及一种情感语音的合成方法、装置、设备及存储介质。The present application relates to the technical field of speech synthesis, and in particular, to a method, apparatus, device and storage medium for synthesizing emotional speech.

背景技术Background technique

随着科技的发展,智能客服中心、聊天机器人、智能音箱等人工智能服务走进我们的日常生活,且发挥着越来越重要的作用。这种人工智能服务器通常涉及到语音合成技术,因此语音合成技术也得到了更为广泛的应用。With the development of technology, artificial intelligence services such as smart customer service centers, chat robots, and smart speakers have entered our daily lives and are playing an increasingly important role. This kind of artificial intelligence server usually involves speech synthesis technology, so speech synthesis technology has also been more widely used.

在现有技术中,语音合成方法主要为基于隐马尔可夫的语音合成方式或者基于神经网络的语音合成方式,发明人意识到这两种语音合成方式虽然可以获得不错的合成语音,但是生成的合成语音平淡、缺乏情感,从而无法获得饱含情感的语音。In the prior art, speech synthesis methods are mainly based on hidden Markov speech synthesis methods or neural network-based speech synthesis methods. The inventor realized that although these two speech synthesis methods can obtain good synthesized speech, the resulting Synthesized speech is flat and lacking emotion, making it impossible to get emotional speech.

发明内容SUMMARY OF THE INVENTION

本申请提供了一种情感语音的合成方法、装置、设备及存储介质,用于解决合成语音平淡、缺乏情感的问题,增加合成语音的多样性。The present application provides a method, device, device and storage medium for synthesizing emotional speech, which are used to solve the problem of dullness and lack of emotion in the synthesized speech, and increase the diversity of the synthesized speech.

本申请第一方面提供了一种情感语音的合成方法,包括:获取待识别语音数据和对应的文本数据;将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征;将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。A first aspect of the present application provides a method for synthesizing emotional speech, including: acquiring speech data to be recognized and corresponding text data; inputting the speech data to be recognized into a pre-trained emotion recognition network to generate Mel spectrum features and position encoding, and process in the emotion recognition network in combination with the Mel spectrum feature and the position encoding to generate an emotion embedding feature; input the emotion embedding feature and the text data into the pre-trained speech synthesis In the network, target mel-spectrum data is generated; a neural vocoder is used to perform speech conversion on the target mel-spectrum data to generate target emotional speech.

本申请第二方面提供了一种情感语音的合成装置,包括:获取模块,用于获取待识别语音数据和对应的文本数据;嵌入特征生成模块,用于将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征;梅尔谱数据生成模块,用于将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;语音转换模块,用于采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。A second aspect of the present application provides a device for synthesizing emotional speech, comprising: an acquisition module for acquiring to-be-recognized speech data and corresponding text data; an embedded feature generation module for inputting the to-be-recognized speech data into pre-training In a good emotion recognition network, the mel spectrum feature and the position code are generated, and the mel spectrum feature and the position code are processed in the emotion recognition network to generate the emotion embedded feature; the mel spectrum data generation module , for inputting the emotion embedded feature and the text data into the pre-trained speech synthesis network to generate target mel-spectrum data; the speech conversion module is used for using a neural vocoder to analyze the target mel-spectrum data Perform speech conversion to generate target emotional speech.

本申请第三方面提供了一种情感语音的合成设备,包括:存储器和至少一个处理器,所述存储器中存储有指令;所述至少一个处理器调用所述存储器中的所述指令,以使得所述情感语音的合成设备执行如下所述的情感语音的合成方法:A third aspect of the present application provides a device for synthesizing emotional speech, including: a memory and at least one processor, where instructions are stored in the memory; the at least one processor invokes the instructions in the memory, so that The device for synthesizing emotional speech performs the following method for synthesizing emotional speech:

获取待识别语音数据和对应的文本数据;将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征;将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。Obtain the speech data to be recognized and the corresponding text data; input the speech data to be recognized into a pre-trained emotion recognition network, generate mel spectrum features and position coding, and combine the mel spectrum features and the position coding Perform processing in the emotion recognition network to generate emotion embedded features; input the emotion embedded features and the text data into a pre-trained speech synthesis network to generate target mel spectrum data; use a neural vocoder to The target mel spectrum data is used for speech conversion to generate the target emotional speech.

本申请的第四方面提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行如下所述的情感语音的合成方法:A fourth aspect of the present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, when the computer-readable storage medium runs on a computer, the computer executes the following method for synthesizing emotional speech:

获取待识别语音数据和对应的文本数据;将所述待识别语音数据输入预先训练好的情 感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征;将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。Obtain the speech data to be recognized and the corresponding text data; input the speech data to be recognized into a pre-trained emotion recognition network, generate mel spectrum features and position coding, and combine the mel spectrum features and the position coding Perform processing in the emotion recognition network to generate emotion embedded features; input the emotion embedded features and the text data into a pre-trained speech synthesis network to generate target mel spectrum data; use a neural vocoder to The target mel spectrum data is used for speech conversion to generate the target emotional speech.

本申请提供的技术方案中,获取待识别语音数据和对应的文本数据;将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征;将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。本申请实施例中,通过预先训练好的情感识别网络,结合梅尔谱特征和位置编码生成情感嵌入特征,然后将情感嵌入特征和文本数据进行拼接,生成目标情感语音,解决了合成语音平淡、缺乏情感的问题,增加了合成语音的多样性。In the technical solution provided by this application, the speech data to be recognized and the corresponding text data are obtained; the speech data to be recognized is input into a pre-trained emotion recognition network to generate Mel spectrum features and location codes, and combined with the Mel spectrum features and the position encoding are processed in the emotion recognition network to generate emotion embedded features; input the emotion embedded features and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data ; Use a neural vocoder to convert the target Mel spectrum data to generate target emotional speech. In the embodiment of the present application, through the pre-trained emotion recognition network, combined with the Mel spectrum feature and the position encoding, the emotion embedded feature is generated, and then the emotion embedded feature and the text data are spliced to generate the target emotional speech, which solves the problem of dull synthetic speech, The problem of lack of emotion increases the variety of synthesized speech.

附图说明Description of drawings

图1为本申请实施例中情感语音的合成方法的一个实施例示意图;1 is a schematic diagram of an embodiment of a method for synthesizing emotional speech in an embodiment of the present application;

图2为本申请实施例中情感语音的合成方法的另一个实施例示意图;2 is a schematic diagram of another embodiment of a method for synthesizing emotional speech in an embodiment of the present application;

图3为本申请实施例中情感语音的合成装置的一个实施例示意图;3 is a schematic diagram of an embodiment of an apparatus for synthesizing emotional speech in an embodiment of the present application;

图4为本申请实施例中情感语音的合成装置的另一个实施例示意图;4 is a schematic diagram of another embodiment of an apparatus for synthesizing emotional speech in an embodiment of the present application;

图5为本申请实施例中情感语音的合成设备的一个实施例示意图。FIG. 5 is a schematic diagram of an embodiment of a device for synthesizing emotional speech in an embodiment of the present application.

具体实施方式Detailed ways

本申请实施例提供了一种情感语音的合成方法、装置、设备及存储介质,用于解决合成语音平淡、缺乏情感的问题,增加合成语音的多样性。Embodiments of the present application provide a method, apparatus, device, and storage medium for synthesizing emotional speech, which are used to solve the problem that the synthesized speech is dull and lack emotion, and increase the diversity of the synthesized speech.

本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”或“具有”及其任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。The terms "first", "second", "third", "fourth", etc. (if any) in the description and claims of the present application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that data so used can be interchanged under appropriate circumstances so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover non-exclusive inclusion, for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to those expressly listed steps or units, but may include other steps or units not expressly listed or inherent to these processes, methods, products or devices.

为便于理解,下面对本申请实施例的具体流程进行描述,请参阅图1,本申请实施例中情感语音的合成方法的一个实施例包括:For ease of understanding, the specific flow of the embodiment of the present application is described below, referring to FIG. 1 , an embodiment of the method for synthesizing emotional speech in the embodiment of the present application includes:

101、获取待识别语音数据和对应的文本数据;101. Acquire speech data to be recognized and corresponding text data;

服务器获取待识别语音数据和与待识别文本数据对应的文本数据。需要强调的是,为进一步保证上述待识别语音数据和文本数据的私密和安全性,上述待识别语音数据和文本数据还可以存储于一区块链的节点中。The server obtains the speech data to be recognized and the text data corresponding to the text data to be recognized. It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned voice data and text data to be recognized, the above-mentioned voice data and text data to be recognized can also be stored in a node of a blockchain.

待识别语音数据为带有情感的待识别语音数据,可以为带有高兴情感的待识别语音数据、带有惊讶情感的待识别语音数据和/或带有愤怒情感的待识别语音数据。服务器在获取带有情感的待识别语音数据时,还获取对应的文本数据,例如带有情感的待识别语音数据为“真的吗!恭喜你!”,服务器在获取“真的吗!恭喜你!”的待识别语音数据时,还获取“真的吗!恭喜你!”的文本数据。The to-be-recognized speech data is the to-be-recognized speech data with emotion, which may be the to-be-recognized speech data with happy emotion, the to-be-recognized speech data with surprised emotion, and/or the to-be-recognized speech data with anger emotion. When the server obtains the speech data to be recognized with emotion, it also obtains the corresponding text data. For example, the speech data to be recognized with emotion is "really! Congratulations!", the server is obtaining "really! Congratulations!" !", the text data of "Really! Congratulations!" is also obtained.

可以理解的是,本申请的执行主体可以为情感语音的合成装置,还可以是终端或者服务器,具体此处不做限定。本申请实施例以服务器为执行主体为例进行说明。It can be understood that the execution subject of the present application may be a device for synthesizing emotional speech, and may also be a terminal or a server, which is not specifically limited here. The embodiments of the present application take the server as an execution subject as an example for description.

102、将待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编 码,并结合梅尔谱特征和位置编码在情感识别网络中进行处理,生成情感嵌入特征;102, input the speech data to be recognized in the pre-trained emotion recognition network, generate mel spectrum feature and position coding, and process in emotion recognition network in conjunction with mel spectrum feature and position coding, generate emotion embedded feature;

服务器将待识别语音数据输入预先训练好的情感识别网络中,首先生成梅尔谱特征和位置编码,然后在情感识别网络中对该梅尔谱特征和该位置编码进行处理,从而生成情感嵌入特征。The server inputs the speech data to be recognized into the pre-trained emotion recognition network, firstly generates the mel spectrum feature and the position code, and then processes the mel spectrum feature and the position code in the emotion recognition network to generate the emotion embedded feature .

服务器将“真的吗!恭喜你!”的待识别语音数据输入预先训练好的情感识别网络中进行计算,生成梅尔谱特征[B 1,T 1,D 1]以及位置编码P,其中,位置编码P基于梅尔谱特征生成,该位置编码P实际上是一个隐藏层输出,服务器再结合梅尔谱特征[B 1,T 1,D 1]以及位置编码P进行计算,生成情感嵌入特征[B 2,T 2,D 2]。 The server inputs the speech data of "really! Congratulations!" into the pre-trained emotion recognition network for calculation, and generates Mel spectrum features [B 1 , T 1 , D 1 ] and position code P, where, The location code P is generated based on the mel spectrum feature. The location code P is actually a hidden layer output. The server combines the mel spectrum features [B 1 , T 1 , D 1 ] and the location code P for calculation to generate emotional embedding features. [B 2 , T 2 , D 2 ].

103、将情感嵌入特征和文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;103. Input the emotion embedded feature and text data into the pre-trained speech synthesis network to generate target mel spectrum data;

服务器将情感嵌入特征和文本数据输入预先训练好的语音合成网络中进行计算,生成目标梅尔谱数据。The server inputs the emotion embedded features and text data into the pre-trained speech synthesis network for calculation, and generates the target mel spectrum data.

服务器将[B 2,T 2,D 2]的情感嵌入特征和“真的吗!恭喜你!”的文本数据输入预先训练好的语音合成网络中进行计算,在本实施例中,语音合成网络中包括编码器,在该编码器中,将“真的吗!恭喜你!”的文本数据进行特征提权,生成提取结果,并将该提取结果与情感嵌入特征[B 2,T 2,D 2]进行拼接,生成目标梅尔谱数据[B 2,T 2,D 2+D]。 The server inputs the emotional embedding features of [B 2 , T 2 , D 2 ] and the text data of "Really! Congratulations!" into the pre-trained speech synthesis network for calculation. In this embodiment, the speech synthesis network It includes an encoder, in which the text data of "Really! Congratulations!" is feature-lifted, an extraction result is generated, and the extraction result is combined with the emotion embedding feature [B 2 ,T 2 ,D 2 ] for splicing to generate target Mel spectrum data [B 2 , T 2 , D 2 +D].

104、采用神经声码器对目标梅尔谱数据进行语音转换,生成目标情感语音。104. Use a neural vocoder to perform speech conversion on the target Mel spectrum data to generate a target emotional speech.

服务器采用神经声码器将目标梅尔谱数据转换为目标情感语音。The server uses a neural vocoder to convert the target mel-spectral data into target emotional speech.

需要说明的是,在本实施例中,神经声码器为Wave Glow,目标梅尔谱数据为神经声码器的输入,该输入的帧长为1024,帧移位256,首先将该目标梅尔谱数据输入神经声码器的仿射耦合层中进行缩放和转换,生成情感语音特征,然后对该情感语音特征进行可逆卷积,生成目标情感语音“真的吗!(惊讶情感)恭喜你!(高兴情感)”。It should be noted that, in this embodiment, the neural vocoder is Wave Glow, the target Mel spectrum data is the input of the neural vocoder, the input frame length is 1024, and the frame shift is 256. The Er spectrum data is input into the affine coupling layer of the neural vocoder for scaling and transformation to generate emotional speech features, and then reversible convolution is performed on the emotional speech features to generate the target emotional speech "Really! (surprised emotion) Congratulations to you. ! (happy emotion)".

本申请实施例中,通过预先训练好的情感识别网络,结合梅尔谱特征和位置编码生成情感嵌入特征,然后将情感嵌入特征和文本数据进行拼接,生成目标情感语音,解决了合成语音平淡、缺乏情感的问题,增加了合成语音的多样性。In the embodiment of the present application, through the pre-trained emotion recognition network, combined with the Mel spectrum feature and the position encoding, the emotion embedded feature is generated, and then the emotion embedded feature and the text data are spliced to generate the target emotional speech, which solves the problem of dull synthetic speech, The problem of lack of emotion increases the variety of synthesized speech.

请参阅图2,本申请实施例中情感语音的合成方法的另一个实施例包括:Referring to FIG. 2, another embodiment of the method for synthesizing emotional speech in the embodiment of the present application includes:

201、获取情感语音训练数据、情感标签数据和文本训练数据;201. Obtain emotional speech training data, emotional label data, and text training data;

服务器从大数据平台或者数据库中获取情感语音训练数据、情感标签数据和文本训练数据。The server obtains emotional speech training data, emotional label data and text training data from the big data platform or database.

需要说明的是,情感语音训练数据可以分为包括噪声的情感语音训练数据和不包括噪声的情感语音训练数据。It should be noted that the emotional speech training data can be divided into emotional speech training data including noise and emotional speech training data not including noise.

情感语音训练数据可以为“太过分了”、“真的吗”或者“太好了”之类带有情感的语音训练数据,并获取情感标签数据和文本训练数据,其中“太过分了”的情感语音训练数据对应“愤怒”的情感标签数据,对应“太过分了”的文本训练数据;“真的吗”的情感语音训练数据对应“惊讶”的情感标签数据对应“真的吗”的文本训练数据;“太好了”的情感语音训练数据对应“高兴”的情感标签数据,对应“太好了”的文本训练数据。The emotional voice training data can be voice training data with emotion such as "too much", "really" or "too good", and obtain emotional label data and text training data, among which "too much" The emotional voice training data corresponds to the emotional label data of "anger" and the text training data of "too much"; the emotional voice training data of "really" corresponds to the emotional label data of "surprise" and corresponds to the text of "really" Training data; "too good" emotional speech training data corresponds to "happy" emotional label data, corresponding to "too good" text training data.

202、采用情感语音训练数据和情感标签数据,结合层正则化机制进行模型训练,生成预先训练好的情感识别网络,并采用情感语音训练数据和文本训练数据进行模型训练,生成预先训练好的语音合成网络;202. Use emotional speech training data and emotional label data, and combine layer regularization mechanism to perform model training to generate a pre-trained emotion recognition network, and use emotional speech training data and text training data for model training to generate pre-trained speech synthetic network;

服务器根据情感语音训练数据和情感标签数据,结合正则化机制进行训练,生成预先训练好的情感识别网络,然后根据情感语音训练数据和文本训练数据进行模型训练,生成 预先训练好的语音合成网络。The server performs training based on the emotional voice training data and emotional label data, combined with the regularization mechanism, to generate a pre-trained emotion recognition network, and then performs model training based on the emotional voice training data and text training data to generate a pre-trained speech synthesis network.

It should be noted that the pre-trained emotion recognition network is used to extract emotional features, so it is trained with the emotional speech training data and the emotion label data; the pre-trained speech synthesis network is used to synthesize emotional speech, so it is trained with the emotional speech training data and the text training data. Although both training processes use emotional speech training data, when training the emotion recognition network the emotional speech training data may either include noise or exclude noise, whereas when training the speech synthesis network, high-quality training data must be used, that is, emotional speech training data that does not include noise.

In the process of training the emotion recognition network, the server applies a layer regularization mechanism, mainly by adding a layer regularization step after each sub-layer. The layer regularization mechanism computes the mean and variance of the output of layer i over the channel dimension, then subtracts the mean from the output of layer i and divides by the standard deviation, so that the output of layer i has a mean of 0 and a variance of 1. The layer regularization mechanism keeps the distribution of the training data consistent and stabilizes the training process.
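The following is a minimal sketch, not taken from the application itself, of the layer regularization step described above, assuming the per-layer output is a tensor of shape (batch, time, channels) and normalization is applied over the channel dimension; the function name and epsilon value are illustrative assumptions.

```python
import numpy as np

def layer_regularize(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize the output of a sub-layer over the channel (last) dimension.

    x: array of shape (batch, time, channels), e.g. the output of layer i.
    Returns an array with zero mean and unit variance along the channel axis.
    """
    mean = x.mean(axis=-1, keepdims=True)   # per-position mean over channels
    var = x.var(axis=-1, keepdims=True)     # per-position variance over channels
    return (x - mean) / np.sqrt(var + eps)  # subtract mean, divide by std

# Toy usage: normalize a random sub-layer output of shape (2, 5, 256)
out = layer_regularize(np.random.randn(2, 5, 256))
print(out.mean(axis=-1)[0, 0], out.var(axis=-1)[0, 0])  # ~0 and ~1
```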

203、获取待识别语音数据和对应的文本数据;203. Obtain speech data to be recognized and corresponding text data;

The server obtains the speech data to be recognized and the text data corresponding to the speech data to be recognized. It should be emphasized that, in order to further ensure the privacy and security of the above speech data to be recognized and the corresponding text data, the speech data to be recognized and the text data may also be stored in a node of a blockchain.

The speech data to be recognized is speech data carrying emotion, which may be speech data to be recognized with a happy emotion, with a surprised emotion, and/or with an angry emotion. When the server obtains the emotional speech data to be recognized, it also obtains the corresponding text data. For example, if the emotional speech data to be recognized is "Really! Congratulations!", then when the server obtains the speech data "Really! Congratulations!", it also obtains the text data "Really! Congratulations!".

204、将待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合梅尔谱特征和位置编码在情感识别网络中进行处理,生成情感嵌入特征;204. Input the speech data to be recognized into a pre-trained emotion recognition network, generate mel spectrum features and positional codes, and process them in the emotion recognition network in combination with the mel spectrum features and positional codes to generate emotion embedded features;

服务器将待识别语音数据输入预先训练好的情感识别网络中,首先生成梅尔谱特征和位置编码,然后在情感识别网络中对该梅尔谱特征和该位置编码进行处理,从而生成情感嵌入特征。The server inputs the speech data to be recognized into the pre-trained emotion recognition network, firstly generates the mel spectrum feature and the position code, and then processes the mel spectrum feature and the position code in the emotion recognition network to generate the emotion embedded feature .

The server inputs the speech data "Really! Congratulations!" to be recognized into the pre-trained emotion recognition network for computation, generating the mel spectrum feature [B1, T1, D1] and the positional code P, where the positional code P is generated based on the mel spectrum feature and is in fact a hidden-layer output. The server then combines the mel spectrum feature [B1, T1, D1] and the positional code P for computation to generate the emotion embedding feature [B2, T2, D2].

Specifically, the server inputs the speech data to be recognized into the pre-trained emotion recognition network to generate mel spectrum features; the server generates a positional code according to the mel spectrum features and a preset position conversion formula; and the server inputs the mel spectrum features and the positional code into the encoder of the emotion recognition network for encoding to generate the emotion embedding features.

The server inputs the speech data "Really! Congratulations!" to be recognized into the pre-trained emotion recognition network. First, "Really! Congratulations!" is fed into the trained emotion recognition network for feature extraction, generating the mel spectrum feature [B1, T1, D1]. The server then computes the positional code for the mel spectrum feature [B1, T1, D1] according to the preset position conversion formula, generating the positional code P. Finally, the server inputs the mel spectrum feature [B1, T1, D1] and the positional code P into the encoder of the emotion recognition network for encoding, generating the emotion embedding feature. The encoder of the emotion recognition network includes five identical modules and one long short-term memory (LSTM) layer, where each module includes two sub-layers, namely a multi-head self-attention layer and a forward propagation layer. In the encoder, [B1, T1, D1] and P are encoded to generate the emotion embedding feature [B2, T2, D2].
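As an illustration only, the following PyTorch sketch shows one way an encoder of this shape could be laid out: five identical blocks, each with a multi-head self-attention sub-layer and a feed-forward sub-layer (with layer regularization and residual connections), followed by a single LSTM layer. The model dimension of 256, the 4 heads, the hidden size, and the use of linear layers in place of the convolutional forward propagation layer are all assumptions made for brevity and are not specified by the application.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One module: multi-head self-attention + feed-forward, each with residual + layer norm."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention over the mel/position sequence
        x = self.norm1(x + attn_out)       # residual connection + layer regularization
        x = self.norm2(x + self.ff(x))     # feed-forward sub-layer with residual
        return x

class EmotionEncoder(nn.Module):
    """Five identical blocks followed by one LSTM layer, as described in the text."""
    def __init__(self, d_model: int = 256, n_blocks: int = 5):
        super().__init__()
        self.blocks = nn.ModuleList(EncoderBlock(d_model) for _ in range(n_blocks))
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, mel_features, positional_code):
        x = mel_features + positional_code  # combine mel features [B, T, D] with positions
        for block in self.blocks:
            x = block(x)
        out, _ = self.lstm(x)               # emotion embedding features [B, T, D]
        return out

# Toy usage with assumed shapes
enc = EmotionEncoder()
mel = torch.randn(2, 50, 256)
pos = torch.randn(2, 50, 256)
print(enc(mel, pos).shape)  # torch.Size([2, 50, 256])
```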

The step in which the server inputs the speech data to be recognized into the pre-trained emotion recognition network to generate mel spectrum features includes:

First, the server performs windowing on the speech data to be recognized to generate windowed speech data; then, the server performs a short-time Fourier transform on the windowed speech data to generate Fourier-transformed speech data; finally, the server processes the Fourier-transformed speech data with a mel filter bank to generate the mel spectrum features.

The server applies a window function to the speech data "Really! Congratulations!" to be recognized to generate windowed speech data; the server then performs a Fourier transform on the windowed speech data to determine the frequency and phase of the windowed speech data, thereby generating the Fourier-transformed speech data; finally, the server uses a mel filter bank to convert the Fourier-transformed speech data into mel spectrum features.
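A minimal sketch of this windowing / short-time Fourier transform / mel filter bank pipeline using librosa is shown below; the file name, sampling rate, and 80 mel bands are illustrative assumptions, while the frame length of 1024 and frame shift of 256 follow the values given later for the vocoder input.

```python
import librosa
import numpy as np

# Load the utterance to be recognized (file name and sampling rate are assumptions)
y, sr = librosa.load("really_congratulations.wav", sr=22050)

# Windowing + short-time Fourier transform + mel filter bank in one call:
# a Hann window of length 1024 is applied frame by frame with a shift of 256 samples,
# the magnitude spectrum is computed, and an 80-band mel filter bank is applied.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024, win_length=1024, hop_length=256,
    window="hann", n_mels=80,
)

# Log-compress to obtain the mel spectrum features fed to the emotion recognition network
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (80, number_of_frames)
```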

The step in which the server generates the positional code according to the mel spectrum features and the preset position conversion formula includes:

The server reads the length of the mel spectrum feature and the position of the mel spectrum feature; the server generates a position input value based on the length of the mel spectrum feature and the position of the mel spectrum feature; and the server inputs the position input value into the preset position conversion formula to generate the positional code.

在本实施例中,预置的位置转换公式为In this embodiment, the preset position conversion formula is

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position of the mel spectrum feature, 2i denotes an even dimension, 2i+1 denotes an odd dimension, and d_model denotes the preset dimension corresponding to the position of the mel spectrum feature, for example 256.

For example, the server reads that the length of the mel spectrum feature is 5 and the position of the mel spectrum feature starts at 0. Based on the length 5 and the starting position 0, the server determines the position input value to be [0, 1, 2, 3, 4], and then computes the positional code P from the position input value [0, 1, 2, 3, 4] using the above formula. It should be noted that, in this embodiment, P is only a placeholder symbol and does not denote specific positional-code data.
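The following numpy sketch illustrates, under the standard sinusoidal reading of the formula above, how the position input values [0, 1, 2, 3, 4] could be turned into a positional code with d_model = 256; the function name is an assumption made for illustration.

```python
import numpy as np

def positional_code(positions: np.ndarray, d_model: int = 256) -> np.ndarray:
    """Sinusoidal positional code: sin on even dimensions, cos on odd dimensions."""
    pe = np.zeros((len(positions), d_model))
    i = np.arange(0, d_model, 2)                           # even dimension indices 2i
    angles = positions[:, None] / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angles)                           # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                           # PE(pos, 2i+1)
    return pe

# Position input value [0, 1, 2, 3, 4] from the example above
P = positional_code(np.arange(5))
print(P.shape)  # (5, 256)
```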

The step in which the server inputs the mel spectrum features and the positional code into the encoder of the emotion recognition network for encoding to generate the emotion embedding features includes:

The server inputs the mel spectrum features and the positional code into the multi-head self-attention layer of the emotion recognition network and, in combination with residual connections, generates an initial emotion feature vector; the server then inputs the initial emotion feature vector into the forward propagation layer of the emotion recognition network for convolution to generate the emotion embedding features.

The server first inputs the mel spectrum feature [B1, T1, D1] into the multi-head self-attention layer for computation and, combined with residual connections, generates the initial emotion feature vector, where the multi-head self-attention layer is designed according to the following formulas:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

c_t = MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where Q, K and V are the inputs, namely the mel spectrum features, d_k is the preset dimension, for example 256, and head_i is the i-th head. Each computation in the multi-head self-attention layer produces one head, so performing four computations of Attention(QW_i^Q, KW_i^K, VW_i^V) produces four heads. W_i^Q, W_i^K and W_i^V are weights generated during the training process, and Concat means concatenating the heads along the last dimension: for example, if the dimension vectors of the four heads are all [Bt, Tt, Dt], the server concatenates them to generate the initial emotion feature vector [Bt, Tt, 4Dt]. W^O is a parameter learned in advance. After the initial emotion feature vector is generated, it is input, together with the mel spectrum features, into the corresponding forward propagation layer for convolution to generate the emotion feature vector of the first module. Since the encoder includes five identical modules, this computation is performed five times, and the output of the last module is fed into one long short-term memory (LSTM) layer, thereby generating the emotion embedding feature [B2, T2, D2].
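For concreteness, here is a small numpy sketch of the scaled dot-product attention and head concatenation written out above, with Q = K = V set to the mel-feature sequence; the dimensions (sequence length 5, model size 256, 4 heads) and the random projection weights are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V            # softmax(QK^T / sqrt(d_k)) V

d_model, n_heads, T = 256, 4, 5
d_k = d_model // n_heads
x = np.random.randn(T, d_model)                           # mel features: Q = K = V

rng = np.random.default_rng(0)
heads = []
for i in range(n_heads):
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    heads.append(attention(x @ W_q, x @ W_k, x @ W_v))    # head_i

W_o = rng.standard_normal((n_heads * d_k, d_model))
c_t = np.concatenate(heads, axis=-1) @ W_o                # Concat(head_1..head_h) W^O
print(c_t.shape)  # (5, 256)
```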

It should be noted that the residual connection adds the input of each multi-head self-attention layer back to its output, which then serves as the input of the following forward propagation layer; in the first multi-head self-attention layer, the mel spectrum features are added to the initial emotion feature vector to form the input of the forward propagation layer. This improves the coherence of the generated emotion embedding features.
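A short PyTorch sketch of this residual connection is shown below; the attention module configuration and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
x = torch.randn(2, 5, 256)    # sub-layer input, e.g. mel features [B, T, D]

attn_out, _ = attn(x, x, x)   # multi-head self-attention output
ff_input = attn_out + x       # residual connection: add the input back to the output
print(ff_input.shape)         # this becomes the input of the forward propagation layer
```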

205、将情感嵌入特征和文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;205. Input the emotion embedded feature and text data into the pre-trained speech synthesis network to generate target mel spectrum data;

服务器将情感嵌入特征和文本数据输入预先训练好的语音合成网络中进行计算,生成目标梅尔谱数据。The server inputs the emotion embedded features and text data into the pre-trained speech synthesis network for calculation, and generates the target mel spectrum data.

The server inputs the emotion embedding feature [B2, T2, D2] and the text data "Really! Congratulations!" into the pre-trained speech synthesis network for computation. In this embodiment, the speech synthesis network includes an encoder in which feature extraction is performed on the text data "Really! Congratulations!" to generate an extraction result, and the extraction result is concatenated with the emotion embedding feature [B2, T2, D2] to generate the target mel spectrum data [B2, T2, D2+D].

具体的,服务器在预先训练好的语音合成网络中,将文本数据转换为文本嵌入特征;服务器按照时刻顺序,将文本嵌入特征和情感嵌入特征进行拼接,生成目标梅尔谱数据。Specifically, the server converts text data into text embedding features in the pre-trained speech synthesis network; the server splices the text embedding features and emotion embedding features in order of time to generate target Mel spectrum data.

In the pre-trained speech synthesis network, the server first converts the text data, in time order, into text embedding features of the same form as the emotion embedding features, and then concatenates the text embedding features and the emotion embedding features in time order to generate the target mel spectrum data. In this embodiment, for example, the emotion embedding feature is [B2, T2, D2] and the text embedding feature is [B2, T2, D3]; the server concatenates [B2, T2, D2] with [B2, T2, D3] in time order to generate the target mel spectrum data [B2, T2, D2+D3]. In other embodiments, if the dimension of the emotion embedding feature is [B2, D2], the server first expands the emotion embedding feature to [B2, 1, D2] and then concatenates it with the text embedding feature.
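A small PyTorch sketch of this concatenation step is shown below, assuming an emotion embedding of shape [B2, T2, D2] = [1, 5, 256] and a text embedding of shape [B2, T2, D3] = [1, 5, 512]; the shapes are illustrative assumptions.

```python
import torch

emotion_emb = torch.randn(1, 5, 256)   # emotion embedding feature [B2, T2, D2]
text_emb = torch.randn(1, 5, 512)      # text embedding feature   [B2, T2, D3]

# Concatenate along the last (feature) dimension, keeping the time order aligned
target = torch.cat([text_emb, emotion_emb], dim=-1)
print(target.shape)                    # torch.Size([1, 5, 768]) -> [B2, T2, D2+D3]

# If the emotion embedding has no time axis, e.g. shape [B2, D2],
# expand it to [B2, 1, D2] and broadcast it over the time axis first
global_emotion = torch.randn(1, 256)
expanded = global_emotion.unsqueeze(1).expand(-1, text_emb.size(1), -1)
target2 = torch.cat([text_emb, expanded], dim=-1)
print(target2.shape)                   # torch.Size([1, 5, 768])
```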

206、采用神经声码器对目标梅尔谱数据进行语音转换,生成目标情感语音。206. Use a neural vocoder to perform speech conversion on the target Mel spectrum data to generate a target emotional speech.

服务器采用神经声码器将目标梅尔谱数据转换为目标情感语音。The server uses a neural vocoder to convert the target mel-spectral data into target emotional speech.

It should be noted that, in this embodiment, the neural vocoder is WaveGlow, and the target mel spectrum data is the input of the neural vocoder, with a frame length of 1024 and a frame shift of 256. The target mel spectrum data is first fed into the affine coupling layers of the neural vocoder for scaling and transformation to generate emotional speech features, and invertible convolution is then applied to these emotional speech features to generate the target emotional speech "Really! (surprised emotion) Congratulations! (happy emotion)".
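The following PyTorch sketch is a heavily simplified, illustrative toy of the two operations named above, an affine coupling layer that scales and shifts half of the channels conditioned on the mel spectrum, and an invertible 1x1 convolution, and is not the actual WaveGlow implementation; all layer sizes and module names are assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Split channels in half; use one half plus the mel condition to scale/shift the other."""
    def __init__(self, channels: int, mel_channels: int = 80):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(
            nn.Conv1d(half + mel_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, channels, kernel_size=3, padding=1),  # outputs log_s and t
        )

    def forward(self, x, mel):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([xa, mel], dim=1)).chunk(2, dim=1)
        return torch.cat([xa, xb * torch.exp(log_s) + t], dim=1)  # scale and shift

class Invertible1x1Conv(nn.Module):
    """Invertible 1x1 convolution that mixes channels."""
    def __init__(self, channels: int):
        super().__init__()
        w = torch.linalg.qr(torch.randn(channels, channels))[0]   # orthogonal init
        self.conv = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        self.conv.weight.data = w.unsqueeze(-1)

    def forward(self, x):
        return self.conv(x)

# Toy forward pass: 8 audio channels, 80 mel channels, 100 frames
x = torch.randn(1, 8, 100)
mel = torch.randn(1, 80, 100)
x = AffineCoupling(8)(x, mel)
x = Invertible1x1Conv(8)(x)
print(x.shape)  # torch.Size([1, 8, 100])
```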

In the embodiments of the present application, a pre-trained emotion recognition network combines mel spectrum features and a positional code to generate emotion embedding features, which are then concatenated with the text data to generate the target emotional speech. This solves the problem that synthesized speech is flat and lacks emotion, and increases the diversity of synthesized speech.

上面对本申请实施例中情感语音的合成方法进行了描述,下面对本申请实施例中情感语音的合成装置进行描述,请参阅图3,本申请实施例中情感语音的合成装置一个实施例包括:The method for synthesizing emotional speech in the embodiment of the present application has been described above, and the device for synthesizing emotional speech in the embodiment of the present application is described below. Please refer to FIG. 3 . An embodiment of the device for synthesizing emotional speech in the embodiment of the present application includes:

待识别数据获取模块301,用于获取待识别语音数据和对应的文本数据;A to-be-recognized data acquisition module 301, configured to acquire to-be-recognized voice data and corresponding text data;

The embedding feature generation module 302 is configured to input the speech data to be recognized into a pre-trained emotion recognition network to generate mel spectrum features and a positional code, and to process the mel spectrum features and the positional code in the emotion recognition network to generate emotion embedding features;

梅尔谱数据生成模块303,用于将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;Mel spectrum data generation module 303, for inputting the emotion embedded feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data;

语音转换模块304,用于采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。The speech conversion module 304 is configured to perform speech conversion on the target mel spectrum data by using a neural vocoder to generate a target emotional speech.

本申请实施例中,通过预先训练好的情感识别网络,结合梅尔谱特征和位置编码生成情感嵌入特征,然后将情感嵌入特征和文本数据进行拼接,生成目标情感语音,解决了合 成语音平淡、缺乏情感的问题,增加了合成语音的多样性。In the embodiment of the present application, through the pre-trained emotion recognition network, combined with the Mel spectrum feature and the position encoding, the emotion embedded feature is generated, and then the emotion embedded feature and the text data are spliced to generate the target emotional speech, which solves the problem of dull synthetic speech, The problem of lack of emotion increases the variety of synthesized speech.

请参阅图4,本申请实施例中情感语音的合成装置的另一个实施例包括:Referring to FIG. 4 , another embodiment of the apparatus for synthesizing emotional speech in the embodiment of the present application includes:

待识别数据获取模块301,用于获取待识别语音数据和对应的文本数据;A to-be-recognized data acquisition module 301, configured to acquire to-be-recognized voice data and corresponding text data;

The embedding feature generation module 302 is configured to input the speech data to be recognized into a pre-trained emotion recognition network to generate mel spectrum features and a positional code, and to process the mel spectrum features and the positional code in the emotion recognition network to generate emotion embedding features;

梅尔谱数据生成模块303,用于将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;Mel spectrum data generation module 303, for inputting the emotion embedded feature and the text data into a pre-trained speech synthesis network to generate target Mel spectrum data;

语音转换模块304,用于采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。The speech conversion module 304 is configured to perform speech conversion on the target mel spectrum data by using a neural vocoder to generate a target emotional speech.

可选的,嵌入特征生成模块302包括:Optionally, the embedded feature generation module 302 includes:

梅尔谱特征生成单元3021,用于将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征;Mel spectrum feature generation unit 3021, for inputting the speech data to be recognized into the pre-trained emotion recognition network to generate Mel spectrum features;

位置编码生成单元3022,用于根据所述梅尔谱特征和预置的位置转换公式,生成位置编码;a position code generation unit 3022, configured to generate a position code according to the mel spectrum feature and a preset position conversion formula;

编码单元3023,用于将所述梅尔谱特征和所述位置编码输入所述情感识别网络的编码器中进行编码,生成情感嵌入特征。The encoding unit 3023 is configured to input the mel spectrum feature and the position code into the encoder of the emotion recognition network for encoding, and generate an emotion embedded feature.

可选的,梅尔谱特征生成单元3021还可以具体用于:Optionally, the mel spectrum feature generating unit 3021 can also be specifically used for:

对所述待识别语音数据进行加窗处理,生成加窗后的语音数据;Windowing is performed on the to-be-recognized speech data to generate windowed speech data;

对所述加窗后的语音数据进行短时傅里叶变换,生成傅里叶变换后的语音数据;Carry out short-time Fourier transform to the voice data after the windowing, and generate the voice data after the Fourier transform;

采用梅尔滤波器组对所述傅里叶变换后的语音数据进行处理,生成梅尔谱特征。The Fourier-transformed speech data is processed by using a Mel filter bank to generate Mel spectrum features.

可选的,位置编码生成单元3022还可以具体用于:Optionally, the location code generation unit 3022 can also be specifically used for:

读取梅尔谱特征的长度,并读取梅尔谱特征的位置;Read the length of the mel spectrum feature, and read the position of the mel spectrum feature;

基于所述梅尔谱特征的长度和所述梅尔谱特征的位置,生成位置输入值;generating a position input value based on the length of the mel spectral feature and the position of the mel spectral feature;

将所述位置输入向量输入预置的位置转换公式,生成位置编码。Input the position input vector into a preset position conversion formula to generate a position code.

可选的,编码单元3023还可以具体用于:Optionally, the encoding unit 3023 can also be specifically used for:

将所述梅尔谱特征和所述的位置编码输入所述情感识别网络的多头自注意力层中,结合残差连接,生成初始情感特征向量;Inputting the Mel spectrum feature and the position encoding into the multi-head self-attention layer of the emotion recognition network, and combining the residual connections to generate an initial emotion feature vector;

将所述初始情感特征向量输入所述情感识别网络的前向传播层中进行卷积,生成情感嵌入特征。The initial emotion feature vector is input into the forward propagation layer of the emotion recognition network for convolution to generate emotion embedded features.

可选的,梅尔谱数据生成模块303还可以具体用于:Optionally, the Mel spectrum data generation module 303 can also be specifically used for:

在预先训练好的语音合成网络中,将所述文本数据转换为文本嵌入特征;In a pre-trained speech synthesis network, convert the text data into text embedding features;

按照时刻顺序,将所述文本嵌入特征和所述情感嵌入特征进行拼接,生成目标梅尔谱数据。According to the time sequence, the text embedding feature and the emotion embedding feature are spliced to generate target mel spectrum data.

可选的,情感语音的合成装置还包括:Optionally, the device for synthesizing emotional speech further includes:

训练数据获取模块305,用于获取情感语音训练数据、情感标签数据和文本训练数据;A training data acquisition module 305, configured to acquire emotional speech training data, emotional label data and text training data;

The training module 306 is configured to use the emotional speech training data and the emotion label data, combined with a layer regularization mechanism, to perform model training and generate a pre-trained emotion recognition network, and to use the emotional speech training data and the text training data to perform model training and generate a pre-trained speech synthesis network.

本申请实施例中,通过预先训练好的情感识别网络,结合梅尔谱特征和位置编码生成情感嵌入特征,然后将情感嵌入特征和文本数据进行拼接,生成目标情感语音,解决了合成语音平淡、缺乏情感的问题,增加了合成语音的多样性。In the embodiment of the present application, through the pre-trained emotion recognition network, combined with the Mel spectrum feature and the position encoding, the emotion embedded feature is generated, and then the emotion embedded feature and the text data are spliced to generate the target emotional speech, which solves the problem of dull synthetic speech, The problem of lack of emotion increases the variety of synthesized speech.

上面图3和图4从模块化功能实体的角度对本申请实施例中的情感语音的合成装置进行详细描述,下面从硬件处理的角度对本申请实施例中情感语音的合成设备进行详细描述。3 and 4 above describe the device for synthesizing emotional speech in the embodiment of the present application in detail from the perspective of modular functional entities, and the following describes the device for synthesizing emotional speech in the embodiment of the present application in detail from the perspective of hardware processing.

图5是本申请实施例提供的一种情感语音的合成设备的结构示意图,该情感语音的合成设备500可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(central processing units,CPU)510(例如,一个或一个以上处理器)和存储器520,一个或一个以上存储应用程序533或数据532的存储介质530(例如一个或一个以上海量存储设备)。其中,存储器520和存储介质530可以是短暂存储或持久存储。存储在存储介质530的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对情感语音的合成设备500中的一系列指令操作。更进一步地,处理器510可以设置为与存储介质530通信,在情感语音的合成设备500上执行存储介质530中的一系列指令操作。5 is a schematic structural diagram of a device for synthesizing emotional speech provided by an embodiment of the present application. The device 500 for synthesizing emotional speech may vary greatly due to different configurations or performances, and may include one or more processors (central processing units, CPU) 510 (eg, one or more processors) and memory 520, one or more storage media 530 (eg, one or more mass storage devices) that store application programs 533 or data 532. Among them, the memory 520 and the storage medium 530 may be short-term storage or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the device 500 for synthesizing emotional speech. Furthermore, the processor 510 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the emotional speech synthesis device 500.

The emotional speech synthesis device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art can understand that the structure of the emotional speech synthesis device shown in FIG. 5 does not constitute a limitation on the emotional speech synthesis device, which may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

本申请还提供一种情感语音的合成设备,所述计算机设备包括存储器和处理器,存储器中存储有计算机可读指令,计算机可读指令被处理器执行时,使得处理器执行上述各实施例中的所述情感语音的合成方法的步骤。The present application also provides a device for synthesizing emotional speech. The computer device includes a memory and a processor, and computer-readable instructions are stored in the memory. When the computer-readable instructions are executed by the processor, the processor executes the steps in the foregoing embodiments. The steps of the emotion speech synthesis method.

本申请还提供一种计算机可读存储介质,该计算机可读存储介质可以为非易失性计算机可读存储介质,该计算机可读存储介质也可以为易失性计算机可读存储介质,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得计算机执行所述情感语音的合成方法的步骤。The present application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium may also be a volatile computer-readable storage medium. The computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to execute the steps of the method for synthesizing emotional speech.

所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.

本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm. Blockchain, essentially a decentralized database, is a series of data blocks associated with cryptographic methods. Each data block contains a batch of network transaction information to verify its Validity of information (anti-counterfeiting) and generation of the next block. The blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, and the computer software products are stored in a storage medium , including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk and other media that can store program codes .

以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。As mentioned above, the above embodiments are only used to illustrate the technical solutions of the present application, but not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand: The technical solutions described in the embodiments are modified, or some technical features thereof are equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions in the embodiments of the present application.

Claims (22)

一种情感语音的合成方法,其中,所述情感语音的合成方法包括:A method for synthesizing emotional speech, wherein the method for synthesizing emotional speech comprises: 获取待识别语音数据和对应的文本数据;Obtain the speech data to be recognized and the corresponding text data; 将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征;Input the speech data to be recognized into the pre-trained emotion recognition network, generate mel spectrum features and position codes, and process them in the emotion recognition network in combination with the mel spectrum features and the position codes, and generate emotion embedded features; 将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;Inputting the emotion embedded feature and the text data into a pre-trained speech synthesis network to generate target mel spectrum data; 采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。A neural vocoder is used to perform speech conversion on the target Mel spectrum data to generate target emotional speech. 根据权利要求1所述的情感语音的合成方法,其中,所述将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征包括:The method for synthesizing emotional speech according to claim 1, wherein said inputting said to-be-recognized speech data into a pre-trained emotion recognition network, generating mel spectrum features and position codes, and combining said mel spectrum The feature and the location encoding are processed in the emotion recognition network, and generating emotion embedding features includes: 将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征;Inputting the speech data to be recognized into a pre-trained emotion recognition network to generate Mel spectrum features; 根据所述梅尔谱特征和预置的位置转换公式,生成位置编码;generating a position code according to the mel spectrum feature and a preset position conversion formula; 将所述梅尔谱特征和所述位置编码输入所述情感识别网络的编码器中进行编码,生成情感嵌入特征。The mel spectrum feature and the position code are input into the encoder of the emotion recognition network for encoding to generate emotion embedded features. 根据权利要求2所述的情感语音的合成方法,其中,所述将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征包括:The method for synthesizing emotional speech according to claim 2, wherein the inputting the speech data to be recognized into a pre-trained emotion recognition network, and generating a Mel spectrum feature comprises: 对所述待识别语音数据进行加窗处理,生成加窗后的语音数据;Windowing is performed on the to-be-recognized speech data to generate windowed speech data; 对所述加窗后的语音数据进行短时傅里叶变换,生成傅里叶变换后的语音数据;Carry out short-time Fourier transform to the voice data after the windowing, and generate the voice data after the Fourier transform; 采用梅尔滤波器组对所述傅里叶变换后的语音数据进行处理,生成梅尔谱特征。The Fourier-transformed speech data is processed by using a Mel filter bank to generate Mel spectrum features. 根据权利要求2所述的情感语音的合成方法,其中,所述根据所述梅尔谱特征和预置的位置转换公式,生成位置编码包括:The method for synthesizing emotional speech according to claim 2, wherein the generating a position code according to the Mel spectrum feature and a preset position conversion formula comprises: 读取梅尔谱特征的长度,并读取梅尔谱特征的位置;Read the length of the mel spectrum feature, and read the position of the mel spectrum feature; 基于所述梅尔谱特征的长度和所述梅尔谱特征的位置,生成位置输入值;generating a position input value based on the length of the mel spectral feature and the position of the mel spectral feature; 将所述位置输入向量输入预置的位置转换公式,生成位置编码。Input the position input vector into a preset position conversion formula to generate a position code. 
根据权利要求2所述的情感语音的合成方法,其中,所述将所述梅尔谱特征和所述位置编码输入所述情感识别网络的编码器中进行编码,生成情感嵌入特征包括:The method for synthesizing emotional speech according to claim 2, wherein the inputting the Mel spectrum feature and the position code into an encoder of the emotion recognition network for encoding, and generating the emotion embedded feature comprises: 将所述梅尔谱特征和所述的位置编码输入所述情感识别网络的多头自注意力层中,结合残差连接,生成初始情感特征向量;Inputting the Mel spectrum feature and the position encoding into the multi-head self-attention layer of the emotion recognition network, and combining the residual connections to generate an initial emotion feature vector; 将所述初始情感特征向量输入所述情感识别网络的前向传播层中进行卷积,生成情感嵌入特征。The initial emotion feature vector is input into the forward propagation layer of the emotion recognition network for convolution to generate emotion embedded features. 根据权利要求1所述的情感语音的合成方法,其中,所述将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据包括:The method for synthesizing emotional speech according to claim 1, wherein the inputting the emotional embedding feature and the text data into a pre-trained speech synthesis network, and generating target mel spectrum data comprises: 在预先训练好的语音合成网络中,将所述文本数据转换为文本嵌入特征;In a pre-trained speech synthesis network, convert the text data into text embedding features; 按照时刻顺序,将所述文本嵌入特征和所述情感嵌入特征进行拼接,生成目标梅尔谱数据。According to the time sequence, the text embedding feature and the emotion embedding feature are spliced to generate target mel spectrum data. 根据权利要求1-6中任意一项所述的情感语音的合成方法,其中,在所述获取待识别语音数据和对应的文本数据之前,所述情感语音的合成方法包括:The method for synthesizing emotional speech according to any one of claims 1-6, wherein, before acquiring the speech data to be recognized and the corresponding text data, the method for synthesizing emotional speech comprises: 获取情感语音训练数据、情感标签数据和文本训练数据;Obtain emotional speech training data, emotional label data and text training data; 采用所述情感语音训练数据和所述情感标签数据,结合层正则化机制进行模型训练,生成预先训练好的情感识别网络,并采用所述情感语音训练数据和所述文本训练数据进行模型训练,生成预先训练好的语音合成网络。Using the emotional voice training data and the emotional label data, combined with the layer regularization mechanism, model training is performed to generate a pre-trained emotion recognition network, and the emotional voice training data and the text training data are used for model training, Generate a pretrained speech synthesis network. 一种情感语音的合成装置,其中,所述情感语音的合成装置包括:A device for synthesizing emotional speech, wherein the device for synthesizing emotional speech comprises: 获取模块,用于获取待识别语音数据和对应的文本数据;an acquisition module for acquiring the speech data to be recognized and the corresponding text data; 嵌入特征生成模块,用于将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征;The embedded feature generation module is used to input the speech data to be recognized into the pre-trained emotion recognition network, generate mel spectrum features and position codes, and combine the mel spectrum features and the position codes in the emotion recognition network. Process in the recognition network to generate emotional embedded features; 梅尔谱数据生成模块,用于将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;Mel spectrum data generation module, for inputting the emotion embedded feature and the text data into the pre-trained speech synthesis network to generate target Mel spectrum data; 语音转换模块,用于采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。The speech conversion module is used to perform speech conversion on the target mel spectrum data by using a neural vocoder to generate target emotional speech. 
一种情感语音的合成设备,其中,所述情感语音的合成设备包括:存储器和至少一个处理器,所述存储器中存储有指令;A device for synthesizing emotional speech, wherein the device for synthesizing emotional speech comprises: a memory and at least one processor, wherein instructions are stored in the memory; 所述至少一个处理器调用所述存储器中的所述指令,以使得所述情感语音的合成设备执行如下所述的情感语音的合成方法:The at least one processor invokes the instructions in the memory, so that the device for synthesizing emotional speech executes the following method for synthesizing emotional speech: 获取待识别语音数据和对应的文本数据;Obtain the speech data to be recognized and the corresponding text data; 将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征;Input the speech data to be recognized into the pre-trained emotion recognition network, generate mel spectrum features and position codes, and process them in the emotion recognition network in combination with the mel spectrum features and the position codes, and generate emotion embedded features; 将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;Inputting the emotion embedded feature and the text data into a pre-trained speech synthesis network to generate target mel spectrum data; 采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。A neural vocoder is used to perform speech conversion on the target Mel spectrum data to generate target emotional speech. 根据权利要求9所述的情感语音的合成设备,其中,所述情感语音的合成设备被所述处理器执行所述将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征的步骤时,包括:The device for synthesizing emotional speech according to claim 9, wherein the device for synthesizing emotional speech is executed by the processor by inputting the to-be-recognized speech data into a pre-trained emotion recognition network to generate Mel spectral features and positional encoding, and combined with the mel spectral features and the positional encoding for processing in the emotion recognition network, the steps of generating emotional embedded features include: 将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征;Inputting the speech data to be recognized into a pre-trained emotion recognition network to generate Mel spectrum features; 根据所述梅尔谱特征和预置的位置转换公式,生成位置编码;generating a position code according to the mel spectrum feature and a preset position conversion formula; 将所述梅尔谱特征和所述位置编码输入所述情感识别网络的编码器中进行编码,生成情感嵌入特征。The mel spectrum feature and the position code are input into the encoder of the emotion recognition network for encoding to generate emotion embedded features. 根据权利要求10所述的情感语音的合成设备,其中,所述情感语音的合成设备被所述处理器执行所述将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征的步骤时,包括:The device for synthesizing emotional speech according to claim 10, wherein the device for synthesizing emotional speech is executed by the processor by inputting the to-be-recognized speech data into a pre-trained emotion recognition network to generate Mel The steps for spectral features include: 对所述待识别语音数据进行加窗处理,生成加窗后的语音数据;Windowing is performed on the to-be-recognized speech data to generate windowed speech data; 对所述加窗后的语音数据进行短时傅里叶变换,生成傅里叶变换后的语音数据;Carry out short-time Fourier transform to the voice data after the windowing, and generate the voice data after the Fourier transform; 采用梅尔滤波器组对所述傅里叶变换后的语音数据进行处理,生成梅尔谱特征。The Fourier-transformed speech data is processed by using a Mel filter bank to generate Mel spectrum features. 
根据权利要求10所述的情感语音的合成设备,其中,所述情感语音的合成设备被所述处理器执行所述根据所述梅尔谱特征和预置的位置转换公式,生成位置编码的步骤时,包括:The device for synthesizing emotional speech according to claim 10, wherein the device for synthesizing emotional speech is executed by the processor of the step of generating a position code according to the Mel spectrum feature and a preset position conversion formula , including: 读取梅尔谱特征的长度,并读取梅尔谱特征的位置;Read the length of the mel spectrum feature, and read the position of the mel spectrum feature; 基于所述梅尔谱特征的长度和所述梅尔谱特征的位置,生成位置输入值;generating a position input value based on the length of the mel spectral feature and the position of the mel spectral feature; 将所述位置输入向量输入预置的位置转换公式,生成位置编码。Input the position input vector into a preset position conversion formula to generate a position code. 根据权利要求10所述的情感语音的合成设备,其中,所述情感语音的合成设备被所述处理器执行所述将所述梅尔谱特征和所述位置编码输入所述情感识别网络的编码器中进行编码,生成情感嵌入特征的步骤时,包括:The device for synthesizing emotional speech according to claim 10, wherein the device for synthesizing emotional speech is executed by the processor to input the encoding of the mel spectrum feature and the positional code into the emotion recognition network The steps of generating emotion embedded features include: 将所述梅尔谱特征和所述的位置编码输入所述情感识别网络的多头自注意力层中,结合残差连接,生成初始情感特征向量;Inputting the Mel spectrum feature and the position encoding into the multi-head self-attention layer of the emotion recognition network, and combining the residual connections to generate an initial emotion feature vector; 将所述初始情感特征向量输入所述情感识别网络的前向传播层中进行卷积,生成情感嵌入特征。The initial emotion feature vector is input into the forward propagation layer of the emotion recognition network for convolution to generate emotion embedded features. 根据权利要求9所述的情感语音的合成设备,其中,所述情感语音的合成设备被所述处理器执行所述将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据的步骤时,包括:The device for synthesizing emotional speech according to claim 9, wherein the device for synthesizing emotional speech is executed by the processor to input the emotion embedding feature and the text data into a pre-trained speech synthesis network , the steps to generate target mel spectrum data include: 在预先训练好的语音合成网络中,将所述文本数据转换为文本嵌入特征;In a pre-trained speech synthesis network, convert the text data into text embedding features; 按照时刻顺序,将所述文本嵌入特征和所述情感嵌入特征进行拼接,生成目标梅尔谱数据。According to the time sequence, the text embedding feature and the emotion embedding feature are spliced to generate target mel spectrum data. 根据权利要求9-14中任意一项所述的情感语音的合成设备,其中,在所述情感语音的合成设备被所述处理器执行所述获取待识别语音数据和对应的文本数据的步骤之前,包括:The device for synthesizing emotional speech according to any one of claims 9 to 14, wherein before the device for synthesizing emotional speech is executed by the processor the step of acquiring the speech data to be recognized and the corresponding text data ,include: 获取情感语音训练数据、情感标签数据和文本训练数据;Obtain emotional speech training data, emotional label data and text training data; 采用所述情感语音训练数据和所述情感标签数据,结合层正则化机制进行模型训练,生成预先训练好的情感识别网络,并采用所述情感语音训练数据和所述文本训练数据进行模型训练,生成预先训练好的语音合成网络。Using the emotional voice training data and the emotional label data, combined with the layer regularization mechanism, model training is performed to generate a pre-trained emotion recognition network, and the emotional voice training data and the text training data are used for model training, Generate a pretrained speech synthesis network. 
一种计算机可读存储介质,所述计算机可读存储介质上存储有指令,其中,所述指令被处理器执行时实现如下所述情感语音的合成方法:A computer-readable storage medium storing instructions on the computer-readable storage medium, wherein, when the instructions are executed by a processor, the following method for synthesizing emotional speech is implemented: 获取待识别语音数据和对应的文本数据;Obtain the speech data to be recognized and the corresponding text data; 将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征;Input the speech data to be recognized into the pre-trained emotion recognition network, generate mel spectrum features and position codes, and process them in the emotion recognition network in combination with the mel spectrum features and the position codes, and generate emotion embedded features; 将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据;Inputting the emotion embedded feature and the text data into a pre-trained speech synthesis network to generate target mel spectrum data; 采用神经声码器对所述目标梅尔谱数据进行语音转换,生成目标情感语音。A neural vocoder is used to perform speech conversion on the target Mel spectrum data to generate target emotional speech. 根据权利要求16所述的计算机可读存储介质,其中,所述情感语音的合成指令被所述处理器执行所述将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征和位置编码,并结合所述梅尔谱特征和所述位置编码在所述情感识别网络中进行处理,生成情感嵌入特征的步骤时,包括:The computer-readable storage medium according to claim 16, wherein the synthesizing instruction of the emotional speech is executed by the processor, and the to-be-recognized speech data is input into a pre-trained emotion recognition network to generate a Mel spectral features and positional encoding, and combined with the mel spectral features and the positional encoding for processing in the emotion recognition network, the steps of generating emotional embedded features include: 将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征;Inputting the speech data to be recognized into a pre-trained emotion recognition network to generate Mel spectrum features; 根据所述梅尔谱特征和预置的位置转换公式,生成位置编码;generating a position code according to the mel spectrum feature and a preset position conversion formula; 将所述梅尔谱特征和所述位置编码输入所述情感识别网络的编码器中进行编码,生成情感嵌入特征。The mel spectrum feature and the position code are input into the encoder of the emotion recognition network for encoding to generate emotion embedded features. 根据权利要求17所述的计算机可读存储介质,其中,所述情感语音的合成指令被所述处理器执行所述将所述待识别语音数据输入预先训练好的情感识别网络中,生成梅尔谱特征的步骤时,包括:The computer-readable storage medium according to claim 17, wherein the synthesizing instruction of the emotional speech is executed by the processor and the inputting the to-be-recognized speech data into a pre-trained emotion recognition network to generate a Mel The steps for spectral features include: 对所述待识别语音数据进行加窗处理,生成加窗后的语音数据;Windowing is performed on the to-be-recognized speech data to generate windowed speech data; 对所述加窗后的语音数据进行短时傅里叶变换,生成傅里叶变换后的语音数据;Carry out short-time Fourier transform to the voice data after the windowing, and generate the voice data after the Fourier transform; 采用梅尔滤波器组对所述傅里叶变换后的语音数据进行处理,生成梅尔谱特征。The Fourier-transformed speech data is processed by using a Mel filter bank to generate Mel spectrum features. 
根据权利要求17所述的计算机可读存储介质,其中,所述情感语音的合成指令被所述处理器执行所述根据所述梅尔谱特征和预置的位置转换公式,生成位置编码的步骤时,包括:The computer-readable storage medium according to claim 17, wherein the step of generating a position code according to the mel spectrum feature and a preset position conversion formula is executed by the processor for the synthetic instruction of the emotional speech , including: 读取梅尔谱特征的长度,并读取梅尔谱特征的位置;Read the length of the mel spectrum feature, and read the position of the mel spectrum feature; 基于所述梅尔谱特征的长度和所述梅尔谱特征的位置,生成位置输入值;generating a position input value based on the length of the mel spectral feature and the position of the mel spectral feature; 将所述位置输入向量输入预置的位置转换公式,生成位置编码。Input the position input vector into a preset position conversion formula to generate a position code. 根据权利要求17所述的计算机可读存储介质,其中,所述情感语音的合成指令被所述处理器执行所述将所述梅尔谱特征和所述位置编码输入所述情感识别网络的编码器中进行编码,生成情感嵌入特征的步骤时,包括:18. The computer-readable storage medium of claim 17, wherein the synthesizing instructions of the emotional speech are executed by the processor and the encoding to input the mel spectral features and the positional encoding into the emotion recognition network The steps of generating emotion embedded features include: 将所述梅尔谱特征和所述的位置编码输入所述情感识别网络的多头自注意力层中,结合残差连接,生成初始情感特征向量;Inputting the Mel spectrum feature and the position encoding into the multi-head self-attention layer of the emotion recognition network, and combining the residual connections to generate an initial emotion feature vector; 将所述初始情感特征向量输入所述情感识别网络的前向传播层中进行卷积,生成情感嵌入特征。The initial emotion feature vector is input into the forward propagation layer of the emotion recognition network for convolution to generate emotion embedded features. 根据权利要求16所述的计算机可读存储介质,其中,所述情感语音的合成指令被所述处理器执行所述将所述情感嵌入特征和所述文本数据输入预先训练好的语音合成网络中,生成目标梅尔谱数据的步骤时,包括:17. The computer-readable storage medium of claim 16, wherein the emotional speech synthesis instructions are executed by the processor and the inputting the emotion embedding feature and the textual data into a pre-trained speech synthesis network , the steps to generate target mel spectrum data include: 在预先训练好的语音合成网络中,将所述文本数据转换为文本嵌入特征;In a pre-trained speech synthesis network, convert the text data into text embedding features; 按照时刻顺序,将所述文本嵌入特征和所述情感嵌入特征进行拼接,生成目标梅尔谱数据。According to the time sequence, the text embedding feature and the emotion embedding feature are spliced to generate target mel spectrum data. 根据权利要求16-21中任意一项所述的计算机可读存储介质,其中,在所述情感语音的合成指令被所述处理器执行所述获取待识别语音数据和对应的文本数据的步骤之前,包括:The computer-readable storage medium according to any one of claims 16-21, wherein before the step of obtaining the speech data to be recognized and the corresponding text data is performed by the processor before the synthesizing instruction of the emotional speech ,include: 获取情感语音训练数据、情感标签数据和文本训练数据;Obtain emotional speech training data, emotional label data and text training data; 采用所述情感语音训练数据和所述情感标签数据,结合层正则化机制进行模型训练,生成预先训练好的情感识别网络,并采用所述情感语音训练数据和所述文本训练数据进行模型训练,生成预先训练好的语音合成网络。Using the emotional voice training data and the emotional label data, combined with the layer regularization mechanism, model training is performed to generate a pre-trained emotion recognition network, and the emotional voice training data and the text training data are used for model training, Generate a pretrained speech synthesis network.
PCT/CN2021/083559 2020-12-10 2021-03-29 Emotional speech synthesis method, apparatus, and device, and storage medium Ceased WO2022121169A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011432589.4A CN112562700B (en) 2020-12-10 2020-12-10 Emotional speech synthesis method, device, equipment and storage medium
CN202011432589.4 2020-12-10

Publications (1)

Publication Number Publication Date
WO2022121169A1 true WO2022121169A1 (en) 2022-06-16

Family

ID=75060069

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083559 Ceased WO2022121169A1 (en) 2020-12-10 2021-03-29 Emotional speech synthesis method, apparatus, and device, and storage medium

Country Status (2)

Country Link
CN (1) CN112562700B (en)
WO (1) WO2022121169A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562700A (en) * 2020-12-10 2021-03-26 平安科技(深圳)有限公司 Emotional voice synthesis method, device, equipment and storage medium
CN116189715A (en) * 2022-12-13 2023-05-30 中国科学院声学研究所 A method and device for detecting lung disease using cough sound
CN116486781A (en) * 2023-05-06 2023-07-25 平安科技(深圳)有限公司 Speech synthesis method combined with emotional strength, electronic device and readable storage medium
CN116665639A (en) * 2023-06-16 2023-08-29 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN117079637A (en) * 2023-06-19 2023-11-17 内蒙古工业大学 Mongolian emotion voice synthesis method based on condition generation countermeasure network
CN119559930A (en) * 2024-11-26 2025-03-04 平安科技(深圳)有限公司 A method, device, equipment and medium for singing synthesis based on controllable noise
CN120260539A (en) * 2025-06-03 2025-07-04 国网浙江省电力有限公司营销服务中心 A method and system for generating conversational speech based on emotion perception adapter and large model reasoning

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113112987B (en) * 2021-04-14 2024-05-03 北京地平线信息技术有限公司 Speech synthesis method, training method and device of speech synthesis model
CN113436621B (en) * 2021-06-01 2022-03-15 深圳市北科瑞声科技股份有限公司 GPU (graphics processing Unit) -based voice recognition method and device, electronic equipment and storage medium
CN113436608B (en) * 2021-06-25 2023-11-28 平安科技(深圳)有限公司 Double-flow voice conversion method, device, equipment and storage medium
CN114842881B (en) * 2022-06-07 2025-06-17 四川启睿克科技有限公司 Unsupervised emotional speech synthesis device and method
CN115273906B (en) * 2022-07-29 2025-08-19 平安科技(深圳)有限公司 Speech emotion conversion method, speech emotion conversion device, apparatus, and storage medium
CN118571266A (en) * 2024-06-28 2024-08-30 南京龙垣信息科技有限公司 Emotion voice synthesis method and system for identity encryption

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190355347A1 (en) * 2018-05-18 2019-11-21 Baidu Usa Llc Spectrogram to waveform synthesis using convolutional networks
CN110675881A (en) * 2019-09-05 2020-01-10 北京捷通华声科技股份有限公司 Voice verification method and device
CN111667812A (en) * 2020-05-29 2020-09-15 北京声智科技有限公司 Voice synthesis method, device, equipment and storage medium
CN111883106A (en) * 2020-07-27 2020-11-03 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device
CN112562700A (en) * 2020-12-10 2021-03-26 平安科技(深圳)有限公司 Emotional voice synthesis method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754779A (en) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing
CN109754778B (en) * 2019-01-17 2023-05-30 平安科技(深圳)有限公司 Text speech synthesis method and device and computer equipment
CN110379409B (en) * 2019-06-14 2024-04-16 平安科技(深圳)有限公司 Speech synthesis method, system, terminal device and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190355347A1 (en) * 2018-05-18 2019-11-21 Baidu Usa Llc Spectrogram to waveform synthesis using convolutional networks
CN110675881A (en) * 2019-09-05 2020-01-10 北京捷通华声科技股份有限公司 Voice verification method and device
CN111667812A (en) * 2020-05-29 2020-09-15 北京声智科技有限公司 Voice synthesis method, device, equipment and storage medium
CN111883106A (en) * 2020-07-27 2020-11-03 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device
CN112562700A (en) * 2020-12-10 2021-03-26 平安科技(深圳)有限公司 Emotional voice synthesis method, device, equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562700A (en) * 2020-12-10 2021-03-26 平安科技(深圳)有限公司 Emotional voice synthesis method, device, equipment and storage medium
CN112562700B (en) * 2020-12-10 2025-04-29 平安科技(深圳)有限公司 Emotional speech synthesis method, device, equipment and storage medium
CN116189715A (en) * 2022-12-13 2023-05-30 中国科学院声学研究所 Method and device for detecting lung disease from cough sounds
CN116486781A (en) * 2023-05-06 2023-07-25 平安科技(深圳)有限公司 Speech synthesis method incorporating emotion intensity, electronic device, and readable storage medium
CN116665639A (en) * 2023-06-16 2023-08-29 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN117079637A (en) * 2023-06-19 2023-11-17 内蒙古工业大学 Mongolian emotional speech synthesis method based on a conditional generative adversarial network
CN119559930A (en) * 2024-11-26 2025-03-04 平安科技(深圳)有限公司 Singing voice synthesis method, apparatus, device, and medium based on controllable noise
CN119559930B (en) * 2024-11-26 2025-11-21 平安科技(深圳)有限公司 Singing voice synthesis method, apparatus, device, and medium based on controllable noise
CN120260539A (en) * 2025-06-03 2025-07-04 国网浙江省电力有限公司营销服务中心 A method and system for generating conversational speech based on emotion perception adapter and large model reasoning

Also Published As

Publication number Publication date
CN112562700A (en) 2021-03-26
CN112562700B (en) 2025-04-29

Similar Documents

Publication Publication Date Title
WO2022121169A1 (en) Emotional speech synthesis method, apparatus, and device, and storage medium
US11847727B2 (en) Generating facial position data based on audio data
CN110264991B (en) Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium
CN112562691B (en) Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium
US20230351998A1 (en) Text and audio-based real-time face reenactment
CN111627418A (en) Training method, synthesizing method, system, device and medium for speech synthesis model
CN116250036A (en) System and method for synthesizing photorealistic video of speech
CN111212245B (en) Method and device for synthesizing video
WO2022007438A1 (en) Emotional voice data conversion method, apparatus, computer device, and storage medium
CN112634920A (en) Method and device for training voice conversion model based on domain separation
US12400631B2 (en) Method, electronic device, and computer program product for generating cross-modality encoder
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
JP2023169230A (en) Computer program, server device, terminal device, learned model, program generation method, and method
CN114898380A (en) Method, device and equipment for generating handwritten text image and storage medium
Melechovsky et al. DART: Disentanglement of accent and speaker representation in multispeaker text-to-speech
CN114495977B (en) Speech translation and model training methods, devices, electronic devices and storage media
CN113270090B (en) Combined model training method and equipment based on ASR model and TTS model
CN114842860A (en) Voice conversion method, apparatus, and device based on quantization coding, and storage medium
CN114187892A (en) Style transfer synthesis method and device, and electronic device
Pham et al. Style transfer for 2d talking head generation
US20250104692A1 (en) Text-to-audio conversion with byte-encoding vectors
Gu et al. A voice anonymization method based on content and non-content disentanglement for emotion preservation
CN113889129B (en) Speech conversion method, device, equipment and storage medium
Patel et al. AdaGAN: Adaptive GAN for many-to-many non-parallel voice conversion
Ko et al. Adversarial training of denoising diffusion model using dual discriminators for high-fidelity multi-speaker TTS

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21901885

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in the European phase

Ref document number: 21901885

Country of ref document: EP

Kind code of ref document: A1