
CN115359779A - Speech synthesis method, apparatus, computer device and storage medium - Google Patents

Speech synthesis method, apparatus, computer device and storage medium

Info

Publication number
CN115359779A
CN115359779A / CN115359779B (application CN202210882413.1A)
Authority
CN
China
Prior art keywords
current
feature
sound spectrum
preset
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210882413.1A
Other languages
Chinese (zh)
Other versions
CN115359779B (en)
Inventor
王启腾
程杨
詹维典
徐伟
张文锋
朱煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Merchants Union Consumer Finance Co Ltd
Original Assignee
Merchants Union Consumer Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Merchants Union Consumer Finance Co Ltd
Priority to CN202210882413.1A
Publication of CN115359779A
Application granted
Publication of CN115359779B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present application relates to a speech synthesis method, apparatus, computer device and storage medium. The method includes: encoding the phoneme feature information corresponding to a Chinese text to determine a coded segment whose sound spectrum is to be predicted; in each round of iterative prediction, determining the current sound spectrum feature corresponding to the coded segment for the current round, together with the attention weight used when predicting the current sound spectrum feature in that round; when the attention weight reaches a preset weight threshold, adjusting the preset decoding count corresponding to the coded segment based on the length of the current sound spectrum feature, to obtain an adjusted preset decoding count; and when the number of rounds of iterative prediction reaches the preset decoding count, stopping the iteration and determining the current sound spectrum feature as the target sound spectrum feature corresponding to the coded segment, the target sound spectrum feature being used to synthesize the speech corresponding to the Chinese text. The method can improve the accuracy of speech synthesis.

Figure 202210882413

Description

Speech synthesis method, apparatus, computer device and storage medium

Technical Field

The present application relates to the technical field of speech processing, and in particular to a speech synthesis method, apparatus, computer device and storage medium.

Background

Speech synthesis technology emerged with the development of speech processing technology. It converts text, whether generated by the computer itself or supplied from outside, into intelligible, fluent spoken Chinese, enabling computer devices to communicate by voice.

A traditional autoregressive speech synthesis model requires a large amount of training data. When training data is insufficient, decoding during sound spectrum prediction tends to end prematurely or fail to terminate at all, producing incorrect sound spectrum predictions and hence inaccurate speech synthesis.

Summary of the Invention

In view of the above technical problems, it is necessary to provide a speech synthesis method, apparatus, computer device, computer-readable storage medium and computer program product capable of improving accuracy.

In a first aspect, the present application provides a speech synthesis method. The method includes:

encoding the phoneme feature information corresponding to a Chinese text to determine a coded segment whose sound spectrum is to be predicted;

in each round of iterative prediction, determining the current sound spectrum feature corresponding to the coded segment for the current round, and the attention weight used when predicting the current sound spectrum feature in the current round;

when the attention weight reaches a preset weight threshold, adjusting the preset decoding count corresponding to the coded segment based on the length of the current sound spectrum feature, to obtain an adjusted preset decoding count;

when the number of rounds of iterative prediction reaches the preset decoding count, stopping the iteration and determining the current sound spectrum feature as the target sound spectrum feature corresponding to the coded segment, where the target sound spectrum feature is used to synthesize the speech corresponding to the Chinese text.
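The steps of the first aspect can be sketched as a decoding loop. The sketch below is illustrative only and is not the patent's model: `predict_step` is a hypothetical stand-in for one decoder pass that returns a spectrum frame and its attention weight, and the adjustment formula (current spectrum length plus a round offset) is an assumed reading of this summary.

```python
def synthesize_spectrum(encoded_segment, predict_step, preset_count,
                        weight_threshold=0.5, round_offset=5):
    """Iteratively predict sound-spectrum frames for one coded segment,
    re-basing the preset decoding count once attention aligns."""
    frames = []        # current sound spectrum feature, one frame per round
    adjusted = False
    round_no = 0
    while round_no < preset_count:
        round_no += 1
        frame, attn_weight = predict_step(encoded_segment, frames)
        frames.append(frame)
        # When the attention weight reaches the threshold, adjust the
        # decoding count based on the spectrum length predicted so far.
        if not adjusted and attn_weight >= weight_threshold:
            preset_count = len(frames) + round_offset
            adjusted = True
    return frames      # target sound spectrum feature for this segment
```

A caller would supply a real decoder step; here the loop only demonstrates how the decoding count is adjusted mid-iteration and then used as the stop condition.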

In one embodiment, encoding the phoneme feature information corresponding to the Chinese text to determine the coded segment whose sound spectrum is to be predicted includes:

determining the phoneme feature information corresponding to the Chinese text;

encoding the phoneme feature information to obtain a text encoding sequence corresponding to the Chinese text;

determining, from the text encoding sequence and according to a preset time step, the coded segment on which sound spectrum prediction is to be performed.
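As an illustration of the last sub-step, the sketch below splits a text encoding sequence into coded segments by a preset time step. The encoder itself is omitted, and treating the time step as a fixed number of encoder frames per segment is an assumption.

```python
def coded_segments(text_encoding, time_step):
    """Split the text encoding sequence into coded segments of
    `time_step` encoder frames each; the last segment may be shorter."""
    return [text_encoding[i:i + time_step]
            for i in range(0, len(text_encoding), time_step)]
```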

In one embodiment, the method further includes:

when the number of rounds of iterative prediction has not reached the preset decoding count, taking the next round as the current round and returning to the step of determining the current sound spectrum feature corresponding to the coded segment for the current round, together with the attention weight used when predicting the current sound spectrum feature, and continuing execution from there.

In one embodiment, the method further includes a step of calculating the attention weight, which includes:

in each round of iterative prediction, determining the current attention vector used when predicting the current sound spectrum feature in the current round;

calculating the attention weight corresponding to the current attention vector based on the similarity between the current attention vector and the coded segment, where the attention weight is used to direct feature extraction on the attention vector to obtain a current feature vector, and the current feature vector is used to predict the current sound spectrum feature.
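A minimal sketch of the similarity-based weight calculation described in this embodiment, assuming dot-product similarity and softmax normalization; the patent summary does not specify the scoring function, so both choices are assumptions.

```python
import math

def attention_weights(query, encoder_states):
    """Score the current attention vector (query) against each position
    of the coded segment, then normalize the scores with a softmax."""
    scores = [sum(q * k for q, k in zip(query, state))
              for state in encoder_states]
    m = max(scores)                  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_states):
    """Weighted sum of encoder states: the current feature vector used
    to predict the current sound spectrum frame."""
    dim = len(encoder_states[0])
    return [sum(w * state[d] for w, state in zip(weights, encoder_states))
            for d in range(dim)]
```

Positions of the coded segment most similar to the query receive the largest weights, which is how the weight "indicates how critical" each position is to the current prediction.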

In one embodiment, adjusting the preset decoding count corresponding to the coded segment based on the length of the current sound spectrum feature when the attention weight reaches the preset weight threshold, to obtain the adjusted preset decoding count, includes:

when the attention weight reaches the preset weight threshold, determining a decoding-round offset value corresponding to the coded segment;

adjusting the preset decoding count corresponding to the coded segment according to the decoding-round offset value and the length of the current sound spectrum feature, to obtain the adjusted preset decoding count.
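The two sub-steps can be sketched as a single helper. The exact formula is not given in this summary; since the adjusted count is stated elsewhere in the description to be larger than the previous one, the sketch takes the maximum of the old count and the current spectrum length plus the round offset, which is one conservative reading rather than the patent's actual rule.

```python
def adjust_decode_count(preset_count, spectrum_length, round_offset):
    """Adjusted count: current spectrum length plus the decoding-round
    offset, never smaller than the previous preset count (assumed rule)."""
    return max(preset_count, spectrum_length + round_offset)
```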

In one embodiment, the method further includes:

predicting, based on the current sound spectrum feature, end identification information corresponding to the coded segment;

when the end identification information does not satisfy a preset end condition, comparing the number of rounds of iterative prediction with the preset decoding count, and performing the step of stopping the iteration when the number of rounds of iterative prediction matches the preset decoding count.

In one embodiment, the method further includes:

when the end identification information satisfies the preset end condition, stopping the iteration and determining the current sound spectrum feature as the target sound spectrum feature corresponding to the coded segment.
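The two stop conditions of these embodiments, a predicted end identification with the preset decoding count as a fallback, can be sketched as a single predicate. Modeling the end identification as a probability compared against a threshold is an assumption; the summary only says the information must "satisfy a preset end condition".

```python
def should_stop(end_prob, round_no, decode_count, end_threshold=0.5):
    """Stop when the predicted end identification satisfies the end
    condition, or when the round count reaches the preset decoding count."""
    if end_prob >= end_threshold:      # end condition satisfied
        return True
    return round_no >= decode_count    # fallback: decoding-count limit
```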

In a second aspect, the present application further provides a speech synthesis apparatus. The apparatus includes:

an encoding module, configured to determine a coded segment whose sound spectrum is to be predicted by encoding the phoneme feature information corresponding to a Chinese text;

a decoding module, configured to: in each round of iterative prediction, determine the current sound spectrum feature corresponding to the coded segment for the current round, and the attention weight used when predicting the current sound spectrum feature in the current round; when the attention weight reaches a preset weight threshold, adjust the preset decoding count corresponding to the coded segment based on the length of the current sound spectrum feature, to obtain an adjusted preset decoding count; and when the number of rounds of iterative prediction reaches the preset decoding count, stop the iteration and determine the current sound spectrum feature as the target sound spectrum feature corresponding to the coded segment, where the target sound spectrum feature is used to synthesize the speech corresponding to the Chinese text.

In one embodiment, the encoding module is further configured to determine the phoneme feature information corresponding to the Chinese text; encode the phoneme feature information to obtain a text encoding sequence corresponding to the Chinese text; and determine, from the text encoding sequence and according to a preset time step, the coded segment on which sound spectrum prediction is to be performed.

In one embodiment, the decoding module is further configured to, when the number of rounds of iterative prediction has not reached the preset decoding count, take the next round as the current round and return to the step of determining the current sound spectrum feature corresponding to the coded segment for the current round and the attention weight used when predicting the current sound spectrum feature, continuing execution from there.

In one embodiment, the decoding module is further configured to determine, in each round of iterative prediction, the current attention vector used when predicting the current sound spectrum feature in the current round; and calculate the attention weight corresponding to the current attention vector based on the similarity between the current attention vector and the coded segment, where the attention weight is used to direct feature extraction on the attention vector to obtain a current feature vector, and the current feature vector is used to predict the current sound spectrum feature.

In one embodiment, the decoding module is further configured to determine, when the attention weight reaches the preset weight threshold, a decoding-round offset value corresponding to the coded segment; and adjust the preset decoding count corresponding to the coded segment according to the decoding-round offset value and the length of the current sound spectrum feature, to obtain the adjusted preset decoding count.

In one embodiment, the decoding module is further configured to predict, based on the current sound spectrum feature, end identification information corresponding to the coded segment; and when the end identification information does not satisfy the preset end condition, compare the number of rounds of iterative prediction with the preset decoding count and perform the step of stopping the iteration when the number of rounds of iterative prediction matches the preset decoding count.

In one embodiment, the decoding module is further configured to stop the iteration when the end identification information satisfies the preset end condition, and determine the current sound spectrum feature as the target sound spectrum feature corresponding to the coded segment.

In a third aspect, the present application further provides a computer device. The computer device includes a memory and a processor; the memory stores a computer program, and the processor, when executing the computer program, implements the steps in each embodiment of the method described in this application.

In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps in each embodiment of the method described in this application are implemented.

In a fifth aspect, the present application further provides a computer program product. The computer program product includes a computer program which, when executed by a processor, implements the steps in each embodiment of the method described in this application.

In the above speech synthesis method, apparatus, computer device, storage medium and computer program product, the coded segment whose sound spectrum is to be predicted is determined by encoding the phoneme feature information corresponding to a Chinese text. In each round of iterative prediction, the current sound spectrum feature corresponding to the coded segment for the current round is determined, along with the attention weight used when predicting it. When the attention weight reaches a preset weight threshold, the preset decoding count corresponding to the coded segment is adjusted based on the length of the current sound spectrum feature, yielding an adjusted preset decoding count. When the number of rounds of iterative prediction reaches the preset decoding count, the iteration is stopped and the current sound spectrum feature is determined as the target sound spectrum feature corresponding to the coded segment; the target sound spectrum feature is used to synthesize the speech corresponding to the Chinese text. By adjusting the preset decoding count in each round according to the attention weight and the length of the current sound spectrum feature, the end of the iterative prediction is controlled once the number of rounds reaches the preset decoding count, so that an accurate target sound spectrum feature can be obtained and the accuracy of speech synthesis improved.

Brief Description of the Drawings

Fig. 1 is a diagram of the application environment of a speech synthesis method in one embodiment;

Fig. 2 is a schematic flowchart of a speech synthesis method in one embodiment;

Fig. 3 is a schematic diagram of a sound spectrum prediction model in one embodiment;

Fig. 4 is a structural block diagram of a speech synthesis apparatus in one embodiment;

Fig. 5 is a diagram of the internal structure of a computer device in one embodiment;

Fig. 6 is a diagram of the internal structure of a computer device in another embodiment.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the present application and are not intended to limit it.

The speech synthesis method provided in the embodiments of the present application can be applied in the application environment shown in Fig. 1, in which the terminal 102 communicates with the server 104 through a network. A data storage system can store the data that the server 104 needs to process; it can be integrated on the server 104, or placed on the cloud or on another network server. The server 104 can determine the coded segment whose sound spectrum is to be predicted by encoding the phoneme feature information corresponding to a Chinese text. In each round of iterative prediction, the server 104 can determine the current sound spectrum feature corresponding to the coded segment for the current round and the attention weight used when predicting it. When the attention weight reaches a preset weight threshold, the server 104 can adjust the preset decoding count corresponding to the coded segment based on the length of the current sound spectrum feature, obtaining an adjusted preset decoding count. When the number of rounds of iterative prediction reaches the preset decoding count, the server 104 can stop the iteration and determine the current sound spectrum feature as the target sound spectrum feature corresponding to the coded segment; the target sound spectrum feature is used to synthesize the speech corresponding to the Chinese text. It can be understood that the terminal 102 can present the speech synthesized based on the target sound spectrum feature.

The terminal 102 can be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, Internet-of-Things device or portable wearable device; the Internet-of-Things device can be a smart speaker, smart TV, smart air conditioner, smart in-vehicle device, and the like. A portable wearable device can be a smart watch, smart band, head-mounted device, and the like. The server 104 can be implemented as an independent server or as a server cluster composed of multiple servers.

In one embodiment, as shown in Fig. 2, a speech synthesis method is provided. This embodiment is illustrated by applying the method to a server; it can be understood that the method can also be applied to a terminal, or to a system including a terminal and a server and realized through their interaction. In this embodiment, the method includes the following steps:

Step 202: determine the coded segment whose sound spectrum is to be predicted by encoding the phoneme feature information corresponding to the Chinese text; in each round of iterative prediction, determine the current sound spectrum feature corresponding to the coded segment for the current round, and the attention weight used when predicting the current sound spectrum feature in the current round.

Here, the phoneme feature information indicates the phonetic features contained in the Chinese text. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech, i.e., the smallest pronunciation unit in speech. Chinese text involves multiple phoneme features, such as initials, finals, punctuation, tones and the neutral tone; together these determine the phonetic structure of the Chinese text, that is, how the speech corresponding to the text is to be pronounced. The coded segment is part of the encoding information corresponding to the Chinese text: encoding the phoneme feature information yields the encoding information, and the coded segment is essentially one piece of that information. The current sound spectrum feature includes the acoustic features corresponding to the coded segment, such as the frequency and amplitude information of the sound wave. The attention weight is intermediate data produced while predicting the current sound spectrum feature of the coded segment. The attention mechanism is a problem-solving approach modeled on human attention; its core goal is to assign different weights to some nodes of a neural network so as to select, from a large amount of information, the information most critical to the current task. The attention weight thus indicates how critical a piece of information is, allowing key features to be extracted. Speech synthesis (text to speech) converts input Chinese text into speech.

Specifically, the server can obtain the phoneme feature information corresponding to the Chinese text and, by encoding it, determine the coded segment whose sound spectrum is to be predicted. It can be understood that intermediate data is used in each round of iterative prediction to predict the current sound spectrum feature, and the attention weight is intermediate data produced in that process. In each round, the server can determine the current sound spectrum feature corresponding to the coded segment for the current round, as well as the attention weight used when predicting it.

In one embodiment, the current sound spectrum feature can be at least one of a Mel spectrum and a Mel cepstrum. The Mel spectrum (Mel bank features) is an acoustic feature of a speech signal that carries rich phonetic information and is commonly used in speech recognition and speech synthesis.

Step 204: when the attention weight reaches the preset weight threshold, adjust the preset decoding count corresponding to the coded segment based on the length of the current sound spectrum feature, to obtain the adjusted preset decoding count.

Here, the preset weight threshold is a threshold set for the attention weight.

Specifically, the degree of alignment between the current sound spectrum feature and the coded segment is positively correlated with the attention weight: the larger the attention weight, the better the alignment. When the attention weight reaches the preset weight threshold, it can be determined that the current sound spectrum feature is aligned with the coded segment to a certain extent. At this point, the server can adjust the preset decoding count corresponding to the coded segment to obtain the adjusted preset decoding count, where the adjusted preset decoding count is larger than the previous one.

In one embodiment, when the attention weight reaches the preset weight threshold and the length of the current sound spectrum feature matches that of the coded segment, the preset decoding count corresponding to the coded segment is adjusted based on the length of the current sound spectrum feature, to obtain the adjusted preset decoding count.

In one embodiment, a length match between the current sound spectrum feature and the coded segment can mean that the time-domain length corresponding to the current sound spectrum feature is greater than the time-domain length corresponding to the coded segment. It can be understood that this is not limited to the time-domain length; the frequency-domain length can also be used to determine whether the lengths match.
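The time-domain reading of the length match can be sketched as a simple comparison of durations. The frame hop of the spectrum and the duration per encoder time step are hypothetical values supplied by the caller; the patent does not fix them.

```python
def lengths_match(spectrum_frames, spectrum_hop_s, segment_steps, segment_step_s):
    """Time-domain reading of the length match: the duration covered by
    the predicted spectrum must exceed the duration of the coded segment."""
    return spectrum_frames * spectrum_hop_s > segment_steps * segment_step_s
```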

Step 206: when the number of rounds of iterative prediction reaches the preset decoding count, stop the iteration and determine the current sound spectrum feature as the target sound spectrum feature corresponding to the coded segment.

Here, the target sound spectrum feature is used to synthesize the speech corresponding to the Chinese text.

Specifically, although the current sound spectrum feature can be considered aligned with the coded segment to a certain extent, outputting it directly would not be accurate enough. The server can therefore continue predicting the current sound spectrum feature corresponding to the coded segment and, when the number of rounds of iterative prediction reaches the preset decoding count, stop the iteration and determine the current sound spectrum feature as the target sound spectrum feature corresponding to the coded segment.

In the above speech synthesis method, the phoneme feature information corresponding to the Chinese text is encoded to determine the encoded segment whose sound spectrum is to be predicted. In each round of iterative prediction, the current sound spectrum feature corresponding to the encoded segment predicted in the current round and the attention weight used when predicting it are determined. When the attention weight reaches the preset weight threshold, the preset decoding count corresponding to the encoded segment is adjusted based on the length of the current sound spectrum feature, yielding an adjusted preset decoding count. When the number of rounds of iterative prediction reaches the preset decoding count, the iteration stops and the current sound spectrum feature is determined as the target sound spectrum feature corresponding to the encoded segment; the target sound spectrum feature is used to synthesize the speech corresponding to the Chinese text. By adjusting the preset decoding count in each round according to the attention weight and the length of the current sound spectrum feature, the end of iterative prediction is controlled once the number of rounds reaches the preset decoding count, so that an accurate target sound spectrum feature can be obtained and the accuracy of speech synthesis improved.
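The stop-control flow described above can be sketched as a simple loop. The frames-per-round count, weight threshold, round offset, and placeholder frame values below are illustrative assumptions, not values taken from the patent:

```python
FRAMES_PER_STEP = 3     # frames of mel spectrum predicted per round (hypothetical)
WEIGHT_THRESHOLD = 0.7  # preset attention-weight threshold (hypothetical)
ROUND_OFFSET = 2        # decoding-round offset applied once alignment is reached

def decode_segment(segment_len, initial_decode_limit, attention_weights):
    """Iteratively 'predict' frames; once the attention weight crosses the
    threshold and the feature is long enough, shrink the decode limit to the
    current round plus a small offset (e.g. a few rounds of silence)."""
    frames = []
    decode_limit = initial_decode_limit
    rounds = 0
    adjusted = False
    while rounds < decode_limit:
        rounds += 1
        frames.extend([0.0] * FRAMES_PER_STEP)  # placeholder spectral frames
        w = attention_weights[min(rounds - 1, len(attention_weights) - 1)]
        if not adjusted and w >= WEIGHT_THRESHOLD and len(frames) >= segment_len:
            decode_limit = rounds + ROUND_OFFSET  # adjusted preset decoding count
            adjusted = True
    return frames, rounds

frames, rounds = decode_segment(segment_len=9, initial_decode_limit=100,
                                attention_weights=[0.2, 0.4, 0.8, 0.9, 0.9])
print(rounds, len(frames))  # alignment at round 3 → iteration stops after round 5
```

Without the adjustment, the loop would run to the fixed limit of 100 rounds; the alignment check caps it a few rounds after the feature covers the segment.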

In one embodiment, encoding the phoneme feature information corresponding to the Chinese text to determine the encoded segment whose sound spectrum is to be predicted includes: determining the phoneme feature information corresponding to the Chinese text; encoding the phoneme feature information to obtain a text encoding sequence corresponding to the Chinese text; and determining, according to a preset time step, the encoded segment on which sound spectrum prediction is to be performed from the text encoding sequence.

The text encoding sequence is intermediate data in the required format produced in the course of predicting the sound spectrum features. It can be understood that the server cannot predict sound spectrum features directly from the phoneme feature information; they can only be predicted after converting the format of the phoneme feature information multiple times to extract features. An encoded segment is a partial subsequence of the text encoding sequence.

Specifically, the server may determine the phoneme feature information in the Chinese text. Through an encoder, the server may extract the phoneme feature information and compress it into a fixed-length context vector, obtaining the text encoding sequence corresponding to the Chinese text. According to the preset time step, the server may determine from the text encoding sequence the encoded segment on which sound spectrum prediction is to be performed. It can be understood that the text encoding sequence includes multiple encoded segments, and the server may perform sound spectrum prediction on them one after another in time order.
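Segment selection by a preset time step can be sketched as slicing the encoding sequence into consecutive windows. The time-step value and the toy integer encoding are assumptions for illustration:

```python
TIME_STEP = 4  # encoder frames consumed per segment (hypothetical)

def split_into_segments(text_encoding, time_step=TIME_STEP):
    """Slice the text encoding sequence into consecutive encoded segments,
    processed one after another in time order."""
    return [text_encoding[i:i + time_step]
            for i in range(0, len(text_encoding), time_step)]

encoding = list(range(10))  # stand-in for a 10-frame text encoding sequence
segments = split_into_segments(encoding)
print(segments)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```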

In one embodiment, the server can recognize 237 kinds of phonemes, covering the normal tones 1 to 5 as well as common punctuation marks, where tone 5 denotes the neutral (light) tone.

In one embodiment, the server may obtain the phoneme feature information corresponding to the Chinese text by looking it up in a standard phoneme dictionary.

In one embodiment, the encoder includes a word encoding layer, three convolutional layers, and a bidirectional long short-term memory (LSTM) network layer. Word encoding transforms text, through data transformations or mappings, into a numerical matrix that the computer can recognize and process; during modeling, the text information is represented by this numerical matrix and participates in training and computation. The server may convert the phoneme feature information into a text vector through the word encoding layer, and extract encoding features from the text vector through the three convolutional layers and the bidirectional LSTM layer, obtaining the text encoding sequence corresponding to the Chinese text.
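The word-encoding step can be illustrated as a lookup that maps tone-annotated phoneme tokens to integer ids forming a numeric matrix. The vocabulary below is a made-up fragment, not the patent's 237-phoneme inventory:

```python
# Hypothetical fragment of a phoneme vocabulary (tone digit appended to syllable)
PHONEME_VOCAB = {"ni3": 0, "hao3": 1, "ma5": 2, ",": 3, "。": 4}

def encode_phonemes(phonemes, vocab=PHONEME_VOCAB, unk=-1):
    """Map each phoneme token to its integer id; unknown tokens get `unk`."""
    return [vocab.get(p, unk) for p in phonemes]

ids = encode_phonemes(["ni3", "hao3", "ma5", "。"])
print(ids)  # [0, 1, 2, 4]
```

In a real encoder these ids would index an embedding table before the convolutional and bidirectional LSTM layers.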

In this embodiment, the phoneme feature information corresponding to the Chinese text is determined and encoded to obtain the text encoding sequence corresponding to the Chinese text, and the encoded segment on which sound spectrum prediction is to be performed is then determined from the text encoding sequence according to the preset time step. Sound spectrum prediction can subsequently be performed on the encoded segment, thereby realizing speech synthesis for the Chinese text.

In one embodiment, the method further includes: when the number of rounds of iterative prediction has not reached the preset decoding count, taking the next round as the current round and returning to the step of determining the current sound spectrum feature corresponding to the encoded segment predicted in the current round and the attention weight used when predicting it, to continue execution.

Specifically, when the attention weight reaches the preset weight threshold, some additional prediction is still needed to improve the accuracy of the sound spectrum prediction. It can be understood that this additional portion may be silence features. When the number of rounds of iterative prediction has not reached the preset decoding count, the server may take the next round as the current round and return to the step of determining the current sound spectrum feature corresponding to the encoded segment predicted in the current round and the attention weight used when predicting it, to continue execution.

In one embodiment, when the attention weight reaches the preset weight threshold and the length of the current sound spectrum feature matches that of the encoded segment, the server may directly predict an additional portion of silence features for the current sound spectrum feature to obtain the target sound spectrum feature.

In one embodiment, the length of the current sound spectrum feature matches the number of rounds of iterative prediction. The server may determine whether the number of rounds has reached the preset decoding count by comparing the length of the current sound spectrum feature with the preset decoding count. It can be understood that the length added to the current sound spectrum feature in each round of iterative prediction is fixed; therefore, the server can obtain the number of rounds of iterative prediction from the length of the current sound spectrum feature.

In this embodiment, when the number of rounds of iterative prediction has not reached the preset decoding count, the next round is taken as the current round and the process returns to the step of determining the current sound spectrum feature corresponding to the encoded segment predicted in the current round and the attention weight used when predicting it. Through multiple rounds of iterative prediction, the accuracy of sound spectrum prediction is improved.

In one embodiment, the method further includes a step of computing the attention weight, which includes: in each round of iterative prediction, determining the current attention vector used when predicting the current sound spectrum feature in the current round; and computing the attention weight corresponding to the current attention vector based on the similarity between the current attention vector and the encoded segment. The attention weight is used to direct feature extraction from the attention vector to obtain the current feature vector, and the current feature vector is used to predict the current sound spectrum feature.

Specifically, in each round of iterative prediction, the server may determine the current attention vector used when predicting the current sound spectrum feature in the current round, compute the similarity between the current attention vector and the encoded segment through an attention recurrent neural network layer to obtain a similarity score, and feed the similarity score into a normalized exponential function (the softmax function) to obtain the attention weight corresponding to the current attention vector. A recurrent neural network (RNN) is a class of recursive neural networks that take sequence data as input, recurse along the direction in which the sequence evolves, and connect all nodes (recurrent units) in a chain.
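The similarity-then-softmax step can be sketched as follows. Dot-product scoring is an assumed stand-in here; the patent's scorer combines content and location information, and the vectors are toy values:

```python
import math

def softmax(scores):
    """Normalized exponential: turns similarity scores into weights summing to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(attn_vector, encoded_segment):
    """One weight per encoder frame; higher similarity yields a higher weight."""
    scores = [sum(a * e for a, e in zip(attn_vector, frame))
              for frame in encoded_segment]
    return softmax(scores)

weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(round(sum(weights), 6))  # the weights always sum to 1.0
```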

In one embodiment, the server may compute the similarity score between the current attention vector and the encoded segment by combining two dimensions: contextual information and position. It can be understood that the text encoding sequence contains complete contextual information as well as the corresponding sequence position information. The text-content information and state-position information contained in the attention vector should be consistent with the contextual information and sequence position information in the encoded segment; the higher the consistency between the two, the higher the similarity score.

In one embodiment, the server may perform sound spectrum prediction on the encoded segment through a decoder, which may include a preprocessing layer, an attention recurrent neural network layer, and a decoding recurrent neural network layer. The server may determine the previous sound spectrum feature predicted in the previous round, as well as the previous attention vector and previous feature vector used in the previous round of prediction. The server may feed the previous sound spectrum feature into the preprocessing layer, concatenate the output of the preprocessing layer with the previous attention vector, and feed the result into the attention recurrent neural network layer to obtain the current attention vector needed for the current round of prediction. The server may then feed the current attention vector and the previous feature vector into the decoding recurrent neural network layer to obtain the current feature vector needed for the current round of prediction.

In one embodiment, the server may feed the current feature vector into a frame prediction network to obtain the current sound spectrum feature, which may include at least one frame of mel spectrum. It can be understood that, according to a preset number of spectrum frames, the server may predict that many frames of mel spectrum in each round of iterative prediction; for example, the preset number of spectrum frames may be 3, 4, or 5.

In one embodiment, the server may determine the current encoded segment on which sound spectrum prediction is to be performed. When performing the first round of sound spectrum prediction on the current encoded segment, the server may determine the previous attention vector corresponding to the previous target sound spectrum feature predicted in the previous round, and feed the current encoded segment and the previous attention vector into the attention recurrent neural network layer to obtain the current attention vector needed for the current round of prediction.

In this embodiment, in each round of iterative prediction, the current attention vector used when predicting the current sound spectrum feature in the current round is determined, and the attention weight corresponding to the current attention vector is computed based on the similarity between the current attention vector and the encoded segment. The preset decoding count can subsequently be adjusted based on the attention weight, so that sound spectrum prediction is performed more accurately and the accuracy of speech synthesis is improved.

In one embodiment, adjusting the preset decoding count corresponding to the encoded segment based on the length of the current sound spectrum feature when the attention weight reaches the preset weight threshold includes: when the attention weight reaches the preset weight threshold, determining a decoding-round offset value corresponding to the encoded segment; and adjusting the preset decoding count corresponding to the encoded segment according to the decoding-round offset value and the length of the current sound spectrum feature, to obtain the adjusted preset decoding count.

The decoding-round offset value indicates the offset applied to the preset decoding count.

Specifically, when the attention weight reaches the preset weight threshold, the server may determine the decoding-round offset value corresponding to the encoded segment and add it to the preset decoding count to obtain the adjusted preset decoding count.

In one embodiment, the pre-adjustment preset decoding count corresponding to all encoded segments may be a preset fixed value. The server may determine the length of the current sound spectrum feature and compute the sum of the decoding-round offset value and that length to obtain the adjusted preset decoding count.
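The adjustment in this embodiment reduces to a single sum; the concrete numbers below are illustrative assumptions:

```python
def adjust_decode_count(current_feature_len, round_offset):
    """Adjusted preset decoding count = current feature length + round offset."""
    return current_feature_len + round_offset

# e.g. 12 frames predicted so far, offset of 3 extra rounds → limit of 15
adjusted = adjust_decode_count(current_feature_len=12, round_offset=3)
print(adjusted)  # 15
```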

In this embodiment, when the attention weight reaches the preset weight threshold, the decoding-round offset value corresponding to the encoded segment is determined, and the preset decoding count corresponding to the encoded segment is adjusted based on the decoding-round offset value, yielding the adjusted preset decoding count. Stopping the iteration can then be controlled based on the preset decoding count, so that sound spectrum prediction is performed more accurately.

In one embodiment, the method further includes: predicting end-indicator information corresponding to the encoded segment based on the current sound spectrum feature; and, when the end-indicator information cannot satisfy a preset end condition, comparing the number of rounds of iterative prediction with the preset decoding count and executing the step of stopping the iteration when the number of rounds reaches the preset decoding count.

The end-indicator information indicates the probability that sound spectrum prediction has ended.

Specifically, the server may determine the current feature vector corresponding to the current sound spectrum feature and feed it into a linear projection layer to predict the end-indicator information corresponding to the encoded segment. When the end probability indicated by the end-indicator information does not reach a preset end threshold, the server may determine the length of the current sound spectrum feature. It can be understood that this length indicates which round of iterative prediction the current sound spectrum feature is in. The server may therefore determine the number of rounds predicted so far based on the length of the current sound spectrum feature, compare that number with the adjusted preset decoding count, and execute the step of stopping the iteration when the number of rounds reaches the preset decoding count.

In one embodiment, the length value of the current sound spectrum feature may coincide with the round number of iterative prediction. The server may directly compare the length of the current sound spectrum feature with the adjusted preset decoding count, and when they match, execute the step of stopping the iteration. It can be understood that a match may be the case where the length value of the current sound spectrum feature equals the preset decoding count value.
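The two stopping criteria can be sketched together: an end probability from a linear projection (modeled here, as an assumption, by a dot product passed through a sigmoid) and the decode-count limit. Threshold and vectors are toy values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def should_stop(feature_vec, stop_weights, rounds, decode_limit, end_threshold=0.5):
    """Stop when the predicted end probability crosses the threshold,
    or when the round count reaches the (adjusted) preset decoding count."""
    end_prob = sigmoid(sum(f * w for f, w in zip(feature_vec, stop_weights)))
    return end_prob >= end_threshold or rounds >= decode_limit

print(should_stop([2.0, 1.0], [1.0, 1.0], rounds=4, decode_limit=10))    # end prob high → stop
print(should_stop([-2.0, -1.0], [1.0, 1.0], rounds=4, decode_limit=10))  # neither criterion → continue
print(should_stop([-2.0, -1.0], [1.0, 1.0], rounds=10, decode_limit=10)) # count reached → stop
```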

In one embodiment, when the end-indicator information cannot satisfy the preset end condition, the server may return to the step of adjusting the preset decoding count corresponding to the encoded segment based on the length of the current sound spectrum feature when the attention weight reaches the preset weight threshold, to obtain the adjusted preset decoding count and continue the iteration.

In one embodiment, the server may feed the current feature vector into the frame prediction network and the linear projection layer respectively, to obtain the current sound spectrum feature and the end-indicator information corresponding to the encoded segment.

In one embodiment, when the end-indicator information cannot satisfy the preset end condition, the server may determine whether the attention weight reaches the preset weight threshold and whether the length of the current sound spectrum feature matches that of the encoded segment.

In this embodiment, when the end-indicator information cannot satisfy the preset end condition, the number of rounds of iterative prediction is compared with the preset decoding count, and the step of stopping the iteration is executed when the number of rounds reaches the preset decoding count. Stopping the iteration is thus controlled by two criteria, the end-indicator information and the preset decoding count, improving the accuracy of sound spectrum prediction.

In one embodiment, the method further includes: when the end-indicator information satisfies the preset end condition, stopping the iteration and determining the current sound spectrum feature as the target sound spectrum feature corresponding to the encoded segment.

Specifically, when the end-indicator information satisfies the preset end condition and the length of the current sound spectrum feature matches that of the current encoded segment, the server may control the iterative prediction to stop and determine the current sound spectrum feature as the target sound spectrum feature corresponding to the encoded segment.

In one embodiment, when the end-indicator information satisfies the preset end condition and the length of the current sound spectrum feature matches that of the current encoded segment, the server may stop the iteration and determine the current sound spectrum feature as the target sound spectrum feature corresponding to the encoded segment.

In this embodiment, when the end-indicator information satisfies the preset end condition, the iteration stops and the current sound spectrum feature is determined as the target sound spectrum feature corresponding to the encoded segment, which avoids the problem of the sound spectrum being too short when the preset end condition is met.

In one embodiment, a schematic diagram of the sound spectrum prediction model is provided as shown in FIG. 3. The server may perform sound spectrum prediction on the Chinese text through the sound spectrum prediction model, which is a sequence-to-sequence generative model with a location-sensitive attention mechanism. It can be understood that, in the task of Chinese text-to-speech synthesis, the location-sensitive attention mechanism reasonably allocates the weight of each element in the sequence, so that the model attends to different parts of the sequence to different degrees and focuses more on the parts most relevant to the content being synthesized. The sound spectrum prediction model includes an encoder, a decoder, and a post-processing network. The encoder includes a word encoding layer, three convolutional layers, and a bidirectional LSTM layer. The decoder includes a preprocessing layer, an attention recurrent neural network layer, and a decoding recurrent neural network layer. The post-processing network is used to improve the result of spectrum reconstruction, which is the target sound spectrum feature output by the decoder.

The server may feed the current feature vector output by the decoding recurrent neural network layer into the frame prediction network and the linear projection layer respectively, obtaining three frames of mel spectrum and the corresponding end-indicator information. The server may feed at least one of the three frames of mel spectrum into the preprocessing layer for the next round of iteration, until the iteration ends. The server may feed the finally predicted mel spectrum into the post-processing network, predict a residual term, and superimpose the residual term onto the mel spectrum to obtain an improved mel spectrum.
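The residual post-processing step amounts to an element-wise addition; the toy `toy_postnet` below is an assumed stand-in for the patent's convolutional post-processing network:

```python
def refine_mel(mel, postnet):
    """Superimpose the postnet's predicted residual onto the decoder's mel spectrum."""
    residual = postnet(mel)
    return [m + r for m, r in zip(mel, residual)]

toy_postnet = lambda mel: [0.1 * m for m in mel]  # hypothetical residual predictor
refined = refine_mel([1.0, 2.0, -1.0], toy_postnet)
print(refined)  # each value nudged by its residual: ≈ [1.1, 2.2, -1.1]
```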

It should be understood that, although the steps in the flowcharts involved in the above embodiments are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their execution order is likewise not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

Based on the same inventive concept, an embodiment of the present application further provides a speech synthesis apparatus for implementing the speech synthesis method involved above. The solution provided by this apparatus is similar to the implementation described in the method above; therefore, for the specific limitations in one or more embodiments of the speech synthesis apparatus provided below, reference may be made to the limitations of the speech synthesis method above, which will not be repeated here.

In one embodiment, as shown in FIG. 4, a speech synthesis apparatus 400 is provided, including an encoding module 402 and a decoding module 404, wherein:

the encoding module 402 is configured to determine, by encoding the phoneme feature information corresponding to the Chinese text, the encoded segment whose sound spectrum is to be predicted; and

the decoding module 404 is configured to: in each round of iterative prediction, determine the current sound spectrum feature corresponding to the encoded segment predicted in the current round and the attention weight used when predicting it; when the attention weight reaches the preset weight threshold, adjust the preset decoding count corresponding to the encoded segment based on the length of the current sound spectrum feature to obtain the adjusted preset decoding count; and, when the number of rounds of iterative prediction reaches the preset decoding count, stop the iteration and determine the current sound spectrum feature as the target sound spectrum feature corresponding to the encoded segment, the target sound spectrum feature being used to synthesize the speech corresponding to the Chinese text.

In one embodiment, the encoding module 402 is further configured to determine the phoneme feature information corresponding to the Chinese text; encode the phoneme feature information to obtain the text encoding sequence corresponding to the Chinese text; and determine, according to the preset time step, the encoded segment on which sound spectrum prediction is to be performed from the text encoding sequence.

In one embodiment, the decoding module 404 is further configured to, when the number of rounds of iterative prediction has not reached the preset decoding count, take the next round as the current round and return to the step of determining the current sound spectrum feature corresponding to the encoded segment predicted in the current round and the attention weight used when predicting it, to continue execution.

In one embodiment, the decoding module 404 is further configured to, in each round of iterative prediction, determine the current attention vector used when predicting the current sound spectrum feature in the current round, and compute the attention weight corresponding to the current attention vector based on the similarity between the current attention vector and the encoded segment; the attention weight is used to direct feature extraction from the attention vector to obtain the current feature vector, and the current feature vector is used to predict the current sound spectrum feature.

In one embodiment, the decoding module 404 is further configured to, when the attention weight reaches the preset weight threshold, determine the decoding-round offset value corresponding to the encoded segment, and adjust the preset decoding count corresponding to the encoded segment according to the decoding-round offset value and the length of the current sound spectrum feature, to obtain the adjusted preset decoding count.

In one embodiment, the decoding module 404 is further configured to predict the end-indicator information corresponding to the encoded segment based on the current sound spectrum feature; and, when the end-indicator information cannot satisfy the preset end condition, compare the number of rounds of iterative prediction with the preset decoding count and execute the step of stopping the iteration when the number of rounds of iterative prediction matches the preset decoding count.

In one embodiment, the decoding module 404 is further configured to, when the end-indicator information satisfies the preset end condition, stop the iteration and determine the current sound spectrum feature as the target sound spectrum feature corresponding to the encoded segment. Each module in the above speech synthesis apparatus may be realized wholly or partially by software, hardware, or a combination thereof. The above modules may be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a server whose internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, memory, and I/O interface are connected through a system bus, and the communication interface is connected to the system bus through the I/O interface. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database; the internal memory provides a runtime environment for the operating system and the computer program on the non-volatile storage medium. The database stores the sound-spectrum features predicted in each iteration round. The I/O interface exchanges information between the processor and external devices, and the communication interface communicates with external terminals over a network connection. When executed by the processor, the computer program implements a speech synthesis method.

In one embodiment, a computer device is provided. The computer device may be a terminal whose internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, memory, and input/output interface are connected through a system bus; the communication interface, display unit, and input device are connected to the system bus through the input/output interface. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program; the internal memory provides a runtime environment for them. The input/output interface exchanges information between the processor and external devices. The communication interface communicates with external terminals in a wired or wireless manner; the wireless manner may be realized through Wi-Fi, a mobile cellular network, NFC (near-field communication), or other technologies. When executed by the processor, the computer program implements a speech synthesis method. The display unit forms a visible picture and may be a display screen, a projection device, or a virtual-reality imaging device; the display screen may be a liquid-crystal or electronic-ink display. The input device may be a touch layer covering the display screen, a key, trackball, or touchpad on the device housing, or an external keyboard, touchpad, or mouse.

Those skilled in the art will understand that the structures shown in FIG. 5 and FIG. 6 are merely block diagrams of the partial structures related to the present solution and do not limit the computer devices to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or arrange components differently.

In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor implements the steps of the foregoing method embodiments when executing the computer program.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the steps of the foregoing method embodiments.

In one embodiment, a computer program product is provided, including a computer program that implements the steps of the foregoing method embodiments when executed by a processor.

It should be noted that the user information (including but not limited to user device information and user personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in this application are authorized by the user or fully authorized by all parties, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium; when executed, the program may include the processes of the above method embodiments. Any reference to memory, database, or other media used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive RAM (ReRAM), magnetoresistive RAM (MRAM), ferroelectric RAM (FRAM), phase-change memory (PCM), graphene memory, and the like. Volatile memory may include random-access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as static RAM (SRAM) or dynamic RAM (DRAM). The databases involved in the embodiments may include at least one of a relational database and a non-relational database; non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments may be, without limitation, general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, or quantum-computing-based data processing logic devices.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these features that involves no contradiction should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but should not be construed as limiting the patent scope of the present application. It should be noted that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present application, all of which fall within its protection scope. Therefore, the protection scope of the present application shall be determined by the appended claims.

Claims (10)

1. A speech synthesis method, characterized in that the method comprises:
determining an encoded segment of a sound spectrum to be predicted by encoding phoneme feature information corresponding to a Chinese text;
in each round of iterative prediction, determining a current sound-spectrum feature corresponding to the encoded segment predicted in the current round, and an attention weight used when predicting the current sound-spectrum feature in the current round;
when the attention weight reaches a preset weight threshold, adjusting a preset decoding count corresponding to the encoded segment based on a length of the current sound-spectrum feature to obtain an adjusted preset decoding count;
when the number of rounds of iterative prediction reaches the preset decoding count, stopping the iteration and determining the current sound-spectrum feature as a target sound-spectrum feature corresponding to the encoded segment, the target sound-spectrum feature being used to synthesize speech corresponding to the Chinese text.

2. The method according to claim 1, characterized in that determining the encoded segment of the sound spectrum to be predicted by encoding the phoneme feature information corresponding to the Chinese text comprises:
determining the phoneme feature information corresponding to the Chinese text;
encoding the phoneme feature information to obtain a text encoding sequence corresponding to the Chinese text;
determining, from the text encoding sequence according to a preset time step, the encoded segment on which sound-spectrum prediction is to be performed.

3. The method according to claim 1, characterized in that the method further comprises:
when the number of rounds of iterative prediction has not reached the preset decoding count, taking the next round as the current round and returning to the step of determining the current sound-spectrum feature corresponding to the encoded segment predicted in the current round and the attention weight used when predicting the current sound-spectrum feature.

4. The method according to claim 1, characterized in that the method further comprises a step of calculating the attention weight, the step comprising:
in each round of iterative prediction, determining a current attention vector used when predicting the current sound-spectrum feature in the current round;
calculating, based on a similarity between the current attention vector and the encoded segment, the attention weight corresponding to the current attention vector, the attention weight being used to indicate feature extraction on the attention vector to obtain a current feature vector, and the current feature vector being used to predict the current sound-spectrum feature.

5. The method according to claim 1, characterized in that, when the attention weight reaches the preset weight threshold, adjusting the preset decoding count corresponding to the encoded segment based on the length of the current sound-spectrum feature to obtain the adjusted preset decoding count comprises:
when the attention weight reaches the preset weight threshold, determining a decoding-round offset value corresponding to the encoded segment;
adjusting the preset decoding count corresponding to the encoded segment according to the decoding-round offset value and the length of the current sound-spectrum feature to obtain the adjusted preset decoding count.

6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
predicting end-flag information corresponding to the encoded segment based on the current sound-spectrum feature;
when the end-flag information does not satisfy a preset end condition, comparing the number of rounds of iterative prediction with the preset decoding count, and performing the step of stopping the iteration when the number of rounds of iterative prediction reaches the preset decoding count.

7. The method according to claim 6, characterized in that the method further comprises:
when the end-flag information satisfies the preset end condition, stopping the iteration and determining the current sound-spectrum feature as the target sound-spectrum feature corresponding to the encoded segment.

8. A speech synthesis apparatus, characterized in that the apparatus comprises:
an encoding module configured to determine an encoded segment of a sound spectrum to be predicted by encoding phoneme feature information corresponding to a Chinese text;
a decoding module configured to: in each round of iterative prediction, determine a current sound-spectrum feature corresponding to the encoded segment predicted in the current round and an attention weight used when predicting the current sound-spectrum feature; when the attention weight reaches a preset weight threshold, adjust a preset decoding count corresponding to the encoded segment based on a length of the current sound-spectrum feature to obtain an adjusted preset decoding count; and when the number of rounds of iterative prediction reaches the preset decoding count, stop the iteration and determine the current sound-spectrum feature as a target sound-spectrum feature corresponding to the encoded segment, the target sound-spectrum feature being used to synthesize speech corresponding to the Chinese text.

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
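The similarity-based attention weighting recited in claim 4 might be sketched as follows. Dot-product scoring and softmax normalisation are assumptions for illustration; the claim specifies only a "similarity" between the attention vector and the encoded segment.

```python
import math

def attention_weights(query, encoder_states):
    """Score the current attention vector (query) against each encoder
    state by dot product, normalise with a max-shifted softmax, and form
    the context (current feature) vector as the weighted sum of states."""
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(len(query))]
    return weights, context
```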
CN202210882413.1A 2022-07-26 2022-07-26 Speech synthesis method, device, computer equipment and storage medium Active CN115359779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210882413.1A CN115359779B (en) 2022-07-26 2022-07-26 Speech synthesis method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115359779A true CN115359779A (en) 2022-11-18
CN115359779B CN115359779B (en) 2025-02-11

Family

ID=84031649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210882413.1A Active CN115359779B (en) 2022-07-26 2022-07-26 Speech synthesis method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115359779B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115810341A (en) * 2022-11-23 2023-03-17 腾讯科技(深圳)有限公司 Audio synthesis method, apparatus, device and medium
CN116343749A (en) * 2023-04-06 2023-06-27 平安科技(深圳)有限公司 Speech synthesis method, device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336880A1 (en) * 2017-05-19 2018-11-22 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech
CN110808027A (en) * 2019-11-05 2020-02-18 腾讯科技(深圳)有限公司 Voice synthesis method and device and news broadcasting method and system
US20200342849A1 (en) * 2019-04-29 2020-10-29 Tencent America LLC Duration informed attention network for text-to-speech analysis
US20210035551A1 (en) * 2019-08-03 2021-02-04 Google Llc Controlling Expressivity In End-to-End Speech Synthesis Systems
CN112489620A (en) * 2020-11-20 2021-03-12 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113327575A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
US20220230623A1 (en) * 2021-01-21 2022-07-21 Qualcomm Incorporated Synthesized speech generation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
邱泽宇;屈丹;张连海;: "基于WaveNet的端到端语音合成方法", 计算机应用, no. 05, 21 January 2019 (2019-01-21) *
陈小东;宋文爱;刘晓峰;: "基于LPCNet的语音合成方法研究", 计算机与数字工程, no. 05, 20 May 2020 (2020-05-20) *

Also Published As

Publication number Publication date
CN115359779B (en) 2025-02-11

Similar Documents

Publication Publication Date Title
US11948066B2 (en) Processing sequences using convolutional neural networks
CN114038447B (en) Speech synthesis model training method, speech synthesis method, device and medium
CN111161702B (en) Personalized speech synthesis method and device, electronic equipment and storage medium
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN113450765B (en) Speech synthesis method, device, equipment and storage medium
JP2022531414A (en) End-to-end automatic speech recognition of digit strings
CN114242093A (en) Voice tone conversion method, device, computer equipment and storage medium
US11790274B2 (en) Training neural networks to generate structured embeddings
CN110288972A (en) Speech synthesis model training method, speech synthesis method and device
CN114127849A (en) Speech emotion recognition method and device
CN112634865B (en) Speech synthesis method, apparatus, computer device and storage medium
CN117373431A (en) Audio synthesis method, training method, device, equipment and storage medium
CN117316140A (en) Speech synthesis method, apparatus, device, storage medium, and program product
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
US20220108680A1 (en) Text-to-speech using duration prediction
CN116564274A (en) Speech synthesis method, speech synthesis device, electronic device and storage medium
CN111930900A (en) Standard pronunciation generation method and related device
CN115359779A (en) Speech synthesis method, apparatus, computer device and storage medium
CN113936641B (en) A Customizable Chinese-English Mixed Speech Recognition End-to-End System
CN116825084A (en) Cross-language speech synthesis method and device, electronic equipment and storage medium
CN114863945A (en) Text-based voice changing method, device, electronic device and storage medium
CN115547294A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN114495977A (en) Speech translation and model training method, device, electronic device and storage medium
KR20250169167A (en) Audio generation using non-self-regressive decoding
CN118609542A (en) Text-to-speech method, device, computer equipment, readable storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Zhaolian Consumer Finance Co.,Ltd.

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant before: MERCHANTS UNION CONSUMER FINANCE Co.,Ltd.

Country or region before: China

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant