CN111816168A

CN111816168A - A model training method, voice playback method, device and storage medium

Info

Publication number: CN111816168A
Application number: CN202010714263.4A
Authority: CN
Inventors: 杨治银
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-07-21
Filing date: 2020-07-21
Publication date: 2020-10-23
Anticipated expiration: 2040-07-21
Also published as: CN111816168B

Abstract

The present application discloses a model training method, a voice playback method, an apparatus, and a storage medium, applied to the field of artificial intelligence cloud services. The method includes: obtaining a to-be-trained voice data set that includes N pieces of to-be-trained voice data; obtaining a voice model training instruction when N meets a voice training quantity threshold; in response to the voice model training instruction, obtaining a predicted voice data set through a to-be-trained voice model based on the to-be-trained voice data set; training the to-be-trained voice model based on the predicted voice data set and the to-be-trained voice data set to obtain a target voice model; and sending the target voice model, so that a terminal device stores the target voice model. The present application improves the flexibility of voice model generation and meets the user's voice customization requirements. It also gives the user more voice models to choose from during voice playback, which improves the flexibility of voice playback and the user's voice playback experience.

Description

A model training method, voice playback method, device and storage medium

Technical Field

The present application relates to the field of artificial intelligence cloud services, and in particular to a model training method, a voice playback method, an apparatus, and a storage medium.

Background

With the rapid development of Internet information technology and the continuous improvement in quality of life, intelligent terminal devices are widely used in daily life, and user demand for them keeps growing. To give users a personalized experience on different terminal devices, voice playback based on artificial intelligence has emerged.

At present, voice playback usually works by integrating the voice model data of a speaker (for example, a public celebrity) into the terminal device. The user selects the desired speaker, and the terminal device synthesizes the selected speaker's voice with the text to be played to obtain target voice data, which the terminal device then plays.

However, because the voice models are integrated into the terminal device, the user can only select from the voice models integrated into that device. This reduces the choice and flexibility of voice models, and therefore the flexibility of voice playback.

Summary of the Invention

Embodiments of the present application provide a model training method, a voice playback method, an apparatus, and a storage medium, which improve the flexibility of voice model generation and give the user more voice models to choose from during voice playback, thereby improving the flexibility of voice playback.

In view of this, one aspect of the present application provides a model training method, including:

obtaining a to-be-trained voice data set, where the to-be-trained voice data set includes N pieces of to-be-trained voice data, and N is an integer greater than or equal to 1;

obtaining a voice model training instruction when N meets a voice training quantity threshold;

in response to the voice model training instruction, obtaining a predicted voice data set through a to-be-trained voice model based on the to-be-trained voice data set, where the predicted voice data set includes N pieces of predicted voice data, and the predicted voice data corresponds to the to-be-trained voice data;

training the to-be-trained voice model based on the predicted voice data set and the to-be-trained voice data set to obtain a target voice model; and

sending the target voice model, so that a terminal device stores the target voice model.
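As an illustration only, the five steps of the training method can be sketched as a single server-side function. The names `train_when_ready`, `predict`, `fit`, and `send` are hypothetical and do not appear in the application, and "N meets the threshold" is assumed here to mean N is at least the threshold, which the text does not state explicitly.

```python
from typing import Callable, Optional, Sequence


def train_when_ready(
    dataset: Sequence[bytes],
    threshold: int,
    predict: Callable[[bytes], bytes],
    fit: Callable[[Sequence[bytes], Sequence[bytes]], object],
    send: Callable[[object], None],
) -> Optional[object]:
    """Run the training flow once the data set is large enough.

    All callables are hypothetical stand-ins for the to-be-trained voice
    model, the training routine, and the channel to the terminal device.
    """
    n = len(dataset)  # N pieces of to-be-trained voice data
    if n < threshold:  # assumption: "meets the threshold" means N >= threshold
        return None  # no training instruction is obtained yet

    # One predicted item per to-be-trained item, keeping the correspondence.
    predicted = [predict(sample) for sample in dataset]

    # Train on (predicted, ground truth) to obtain the target voice model.
    target_model = fit(predicted, dataset)

    # Send the target voice model so the terminal device can store it.
    send(target_model)
    return target_model
```

In this sketch the function returns `None` while too few utterances have been collected, so a caller could invoke it after every new upload without tracking the count separately.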

Another aspect of the present application provides a voice playback method, including:

detecting an operation for voice playback and sending a voice playback request, so that a server obtains service content text data based on to-be-parsed voice data and obtains timbreless voice data according to the service content text data, where the voice playback request carries the to-be-parsed voice data, the service content text data is obtained according to the to-be-parsed voice data, and the timbreless voice data is obtained by performing speech synthesis on the service content text data;

receiving the timbreless voice data;

generating target voice data according to the timbreless voice data and a target voice model, where the target voice model is obtained by training a to-be-trained voice model based on a predicted voice data set and a to-be-trained voice data set, and the predicted voice data set is obtained by the server through the to-be-trained voice model, based on the to-be-trained voice data set, in response to a voice model training instruction; and

playing the target voice data.
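The division of work in the playback method (the server produces timbreless voice data, the terminal applies the stored target voice model) can be sketched as follows. This is a minimal sketch under stated assumptions: `handle_playback_request` and its three callables are hypothetical names, and audio is represented as plain `bytes` purely for illustration.

```python
from typing import Callable


def handle_playback_request(
    voice_query: bytes,
    server_synthesize: Callable[[bytes], bytes],
    target_voice_model: Callable[[bytes], bytes],
    play: Callable[[bytes], None],
) -> bytes:
    """Terminal-side flow; every callable is a hypothetical stand-in."""
    # The request carries the to-be-parsed voice data; the server parses it
    # into service content text and synthesizes timbreless voice data.
    timbreless = server_synthesize(voice_query)

    # The locally stored target voice model adds the customized timbre.
    target_voice_data = target_voice_model(timbreless)

    # Play the resulting target voice data.
    play(target_voice_data)
    return target_voice_data
```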

Another aspect of the present application provides a model training apparatus, including:

an acquisition module, configured to obtain a to-be-trained voice data set, where the to-be-trained voice data set includes N pieces of to-be-trained voice data, and N is an integer greater than or equal to 1;

the acquisition module being further configured to obtain a voice model training instruction when N meets a voice training quantity threshold;

the acquisition module being further configured to, in response to the voice model training instruction, obtain a predicted voice data set through a to-be-trained voice model based on the to-be-trained voice data set, where the predicted voice data set includes N pieces of predicted voice data, and the predicted voice data corresponds to the to-be-trained voice data;

a training module, configured to train the to-be-trained voice model based on the predicted voice data set and the to-be-trained voice data set to obtain a target voice model; and

a sending module, configured to send the target voice model, so that a terminal device stores the target voice model.

In a possible design, in one implementation of another aspect of the embodiments of the present application,

the acquisition module is specifically configured to: obtain a voice service request;

in response to the voice service request, send a voice data collection instruction to the terminal device, so that the terminal device obtains to-be-detected voice data in response to the voice data collection instruction, where the to-be-detected voice data is obtained based on a voice training text, and the voice training text is obtained according to the voice data collection instruction;

determine to-be-trained voice data when the to-be-detected voice data meets a voice training qualification condition; and

generate the to-be-trained voice data set based on the to-be-trained voice data.

In a possible design, in another implementation of another aspect of the embodiments of the present application,

the training module is specifically configured to: update model parameters of the to-be-trained voice model according to a target loss function, based on the predicted voice data set and the to-be-trained voice data set; and

if the target loss function converges, generate the target voice model according to the model parameters.
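The update-until-convergence behavior of the training module can be illustrated with a toy example. The application does not specify the model architecture, the target loss function, or the convergence test, so everything below is an assumption chosen for brevity: a one-parameter linear model stands in for the to-be-trained voice model, mean squared error stands in for the target loss function, and convergence is taken to mean that the loss changes by less than `tol` between steps.

```python
from typing import Sequence, Tuple


def train_until_converged(
    inputs: Sequence[float],
    targets: Sequence[float],
    lr: float = 0.1,
    tol: float = 1e-9,
    max_steps: int = 10_000,
) -> Tuple[float, float]:
    """Gradient descent on a toy model y = w * x with an MSE loss.

    `targets` plays the role of the to-be-trained voice data (ground truth);
    the per-step predictions play the role of the predicted voice data.
    """
    w = 0.0  # the single model parameter of the toy model
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_steps):
        preds = [w * x for x in inputs]
        loss = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(inputs)
        if abs(prev_loss - loss) < tol:  # assumed convergence criterion
            break
        grad = sum(2 * (p - t) * x for p, t, x in zip(preds, targets, inputs)) / len(inputs)
        w -= lr * grad  # update the model parameter from the loss gradient
        prev_loss = loss
    return w, loss
```

For example, `train_until_converged([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` drives `w` toward 2, at which point the parameters would be frozen into the target model.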

the acquisition module is further configured to obtain a voice playback request, where the voice playback request carries to-be-parsed voice data;

the acquisition module is further configured to obtain service content text data according to the to-be-parsed voice data, where the service content text data corresponds to the to-be-parsed voice data;

the acquisition module is further configured to obtain timbreless voice data according to the service content text data, where the timbreless voice data is obtained by performing speech synthesis on the service content text data; and

the sending module is further configured to send the timbreless voice data to the terminal device, so that the terminal device generates target voice data according to the timbreless voice data and the target voice model, and plays the target voice data.

Another aspect of the present application provides a voice playback apparatus, including:

a sending module, configured to detect an operation for voice playback and send a voice playback request, so that a server obtains service content text data based on to-be-parsed voice data and obtains timbreless voice data according to the service content text data, where the voice playback request carries the to-be-parsed voice data, the service content text data is obtained according to the to-be-parsed voice data, and the timbreless voice data is obtained by performing speech synthesis on the service content text data;

an acquisition module, configured to obtain the timbreless voice data;

a generation module, configured to generate target voice data according to the timbreless voice data and a target voice model, where the target voice model is obtained by training a to-be-trained voice model based on a predicted voice data set and a to-be-trained voice data set, and the predicted voice data set is obtained by the server through the to-be-trained voice model, based on the to-be-trained voice data set, in response to a voice model training instruction; and

a playback module, configured to play the target voice data.

the sending module is further configured to detect an operation for a voice service and send a voice service request;

the acquisition module is further configured to obtain a voice data collection instruction when the server responds to the voice service request;

the acquisition module is further configured to obtain to-be-detected voice data in response to the voice data collection instruction when the to-be-detected voice data meets a noise qualification condition, where the to-be-detected voice data is obtained based on a voice training text, and the voice training text is obtained according to the voice data collection instruction; and

the sending module is further configured to send the to-be-detected voice data, so that the server generates a to-be-trained voice data set based on the to-be-detected voice data, where the to-be-trained voice data set includes N pieces of to-be-trained voice data, and N is an integer greater than or equal to 1.
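The passage above does not define the noise qualification condition. Purely as a hypothetical illustration, a terminal could measure the root-mean-square (RMS) level of an ambient recording and accept the utterance only when that level stays under a limit. The function names, the normalized sample range of -1.0 to 1.0, and the `max_noise_rms` value are all assumptions, not details from the application.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of normalized audio samples (-1.0 to 1.0)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_noise_qualified(ambient: Sequence[float], max_noise_rms: float = 0.02) -> bool:
    """Hypothetical noise check: ambient level must not exceed the limit."""
    return rms(ambient) <= max_noise_rms
```

Only recordings that pass such a check would be sent to the server as to-be-detected voice data; the server would then apply its own voice training qualification condition.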

In a possible design, in another implementation of another aspect of the embodiments of the present application, the voice playback apparatus further includes a storage module;

the sending module is further configured to send a voice model training instruction when N meets a voice training quantity threshold, so that the server obtains the target voice model;

the acquisition module is further configured to obtain the target voice model; and

the storage module is configured to store the target voice model.

the acquisition module is further configured to obtain voice data collection status information, where the voice data collection status information indicates that to-be-detected voice data is being obtained;

the acquisition module is further configured to obtain voice data collection completion status information, where the voice data collection completion status information indicates that the voice data has been obtained and that training of the target voice model can begin;

the acquisition module is further configured to obtain target voice model training status information, where the target voice model training status information indicates that the to-be-trained voice model is being trained; and

the acquisition module is further configured to obtain target voice model completion status information, where the target voice model completion status information indicates that training of the target voice model has been completed.
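The four kinds of status information listed above can be modeled as an enumeration. This is a sketch only: the member names are invented for illustration, and the strictly linear order of the states is an assumption based on the order in which the text lists them.

```python
from enum import Enum


class VoiceModelStatus(Enum):
    """Hypothetical names for the four status messages described above."""

    COLLECTING = "to-be-detected voice data is being obtained"
    COLLECTION_COMPLETE = "voice data obtained; training can begin"
    TRAINING = "to-be-trained voice model is being trained"
    TRAINING_COMPLETE = "target voice model has completed training"


# Enum iteration follows definition order, which gives the assumed sequence.
_ORDER = list(VoiceModelStatus)


def next_status(current: VoiceModelStatus) -> VoiceModelStatus:
    """Return the status that follows `current` in the assumed linear order."""
    idx = _ORDER.index(current)
    if idx + 1 >= len(_ORDER):
        raise ValueError("training is already complete")
    return _ORDER[idx + 1]
```

A terminal-side client could use such a type to decide which page or prompt to show for each message received from the server.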

Another aspect of the present application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods described in the above aspects.

Another aspect of the present application provides a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various optional implementations of the above aspects.

As can be seen from the above technical solutions, the embodiments of the present application have the following advantages:

The embodiments of the present application provide a model training method. The server first obtains a to-be-trained voice data set including N pieces of to-be-trained voice data. When N meets the voice training quantity threshold, the server obtains a voice model training instruction and, in response to it, obtains a predicted voice data set through the to-be-trained voice model based on the to-be-trained voice data set. The server then trains the to-be-trained voice model based on the predicted voice data set and the to-be-trained voice data set to obtain a target voice model, and sends the target voice model to the terminal device so that the terminal device stores it. Because the target voice model is obtained by training the to-be-trained voice model on the predicted voice data set and the to-be-trained voice data set, the terminal device can obtain the target voice model the user wants from the to-be-trained voice data the user provides. This improves the flexibility of voice model generation and meets the user's voice customization requirements. It also gives the user more voice models to choose from during voice playback, which improves the flexibility of voice playback and the user's voice playback experience.

Description of the Drawings

FIG. 1 is a schematic diagram of an architecture for model training in an embodiment of the present application;

FIG. 2 is a schematic diagram of another architecture for model training in an embodiment of the present application;

FIG. 3 is a schematic flowchart of a model training method in an embodiment of the present application;

FIG. 4 is a schematic diagram of an embodiment of a model training method in an embodiment of the present application;

FIG. 5 is another schematic flowchart of a model training method in an embodiment of the present application;

FIG. 6 is a schematic flowchart of a voice playback method in an embodiment of the present application;

FIG. 7 is a schematic diagram of an embodiment of a voice playback method in an embodiment of the present application;

FIG. 8 is a schematic diagram of an embodiment of an operation for voice playback in an embodiment of the present application;

FIG. 9 is a schematic diagram of another embodiment of a voice playback method in an embodiment of the present application;

FIG. 10 is another schematic flowchart of a voice playback method in an embodiment of the present application;

FIG. 11 is a schematic diagram of another embodiment of a voice playback method in an embodiment of the present application;

FIG. 12 is a schematic diagram of an embodiment of a model training apparatus in an embodiment of the present application;

FIG. 13 is a schematic diagram of an embodiment of a voice playback apparatus in an embodiment of the present application;

FIG. 14 is a schematic diagram of an embodiment of a server in an embodiment of the present application;

FIG. 15 is a schematic diagram of an embodiment of a terminal device in an embodiment of the present application.

Detailed Description

Embodiments of the present application provide a model training method, a voice playback method, an apparatus, and a storage medium, which improve the flexibility of voice model generation and meet the user's voice customization requirements. They also give the user more voice models to choose from during voice playback, which improves the flexibility of voice playback and the user's voice playback experience.

The terms "first", "second", "third", "fourth", and the like (if any) in the description, the claims, and the above drawings of the present application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application described herein can, for example, be implemented in an order other than those illustrated or described herein. In addition, the terms "including" and "corresponding to", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

With the rapid development of Internet information technology and the continuous improvement in quality of life, intelligent terminal devices are widely used in daily life, and user demand for them keeps growing. To give users a personalized experience on different terminal devices, voice playback based on artificial intelligence has emerged. The method provided by the present application can be applied to various scenarios in the field of artificial intelligence cloud services. For example, a driver requests voice navigation from a terminal device while driving, a terminal device for infants and young children reads a bedtime story aloud, or a terminal device plays news or a novel by voice while the user's hands are occupied with other everyday tasks. It should be understood that in practical applications, artificial intelligence cloud services can be applied to many other scenarios, so the application scenarios are not listed exhaustively here.

To improve the flexibility of voice playback in the foregoing scenarios, the present application proposes a model training method and a voice playback method, which are applied to the model training system shown in FIG. 1. Referring to FIG. 1, FIG. 1 is a schematic diagram of an architecture of the model training system in an embodiment of the present application. As shown in the figure, the model training system includes a server and terminal devices, and a client is deployed on the terminal device. The server involved in the present application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms. The terminal device may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or the like, but is not limited thereto. The terminal device and the server may be connected directly or indirectly through wired or wireless communication. Although FIG. 1 shows only six terminal devices, one network, and one server, it should be understood that the example in FIG. 1 is only intended to aid understanding of this solution, and the actual numbers of terminal devices, networks, and servers should be determined flexibly according to the actual situation.

Specifically, before the user plays target voice data, the target voice model must first be obtained. The terminal device detects an operation for a voice service and sends a voice service request to the server. In response to the voice service request, the server sends a voice data collection instruction to the terminal device. In response to the voice data collection instruction, the terminal device obtains to-be-detected voice data and, when the to-be-detected voice data meets the noise qualification condition, sends it to the server. Based on the to-be-trained voice data, the server generates a to-be-trained voice data set including N pieces of to-be-trained voice data. When N meets the voice training quantity threshold, the server obtains the voice model training instruction sent by the terminal device. In response to the voice model training instruction, the server obtains a predicted voice data set through the to-be-trained voice model based on the to-be-trained voice data set, and then trains the to-be-trained voice model based on the predicted voice data set and the to-be-trained voice data set to obtain the target voice model. Finally, the server sends the target voice model to the terminal device so that the terminal device stores it. When target voice data needs to be played, the terminal device detects an operation for voice playback and sends a voice playback request to the server, so that the server obtains timbreless voice data and sends it to the terminal device. The terminal device can then generate the target voice data according to the timbreless voice data and the target voice model, and play the target voice data. Because the target voice model is obtained by training the to-be-trained voice model on the predicted voice data set and the to-be-trained voice data set, the terminal device can obtain the corresponding target voice model from the to-be-trained voice data provided by the user. This improves the flexibility of voice model generation and gives the user more voice models to choose from during voice playback, thereby improving the flexibility of voice playback.

Further, referring to FIG. 2, FIG. 2 is a schematic diagram of another architecture of the model training system in an embodiment of the present application. As shown in the figure, the terminal device integrates a Tencent Voice Service (TVS) software development kit (SDK) and a text-to-speech (TTS) SDK. The TVS-SDK integrates a model management module and a message module, and the TTS-SDK integrates a model management module and voice synthesis. Specifically, the TVS-SDK and the TTS-SDK can interact with the TVS service in the server. The model management module integrated in the TVS-SDK interacts with the TVS-web service in the server and is used to transmit data such as voice data and model data. The message module integrated in the TVS-SDK interacts with the message push service in the server to push various types of messages.

In addition, the server includes a TVS service, a TVS-web service, a TTS customization service, an offline training service, a message push service, and a database. The TVS service provides voice services such as automatic speech recognition (ASR), speech synthesis, semantic parsing, and skill services; through semantic queries, the terminal device obtains TVS-specific services such as querying news and novels, playing music, and navigation. The TVS-web service is a gateway service that exposes non-voice interaction interfaces externally; the voice customization interface is provided through this service. The TTS customization service receives training requests initiated by the terminal device, collects the terminal device's voice data, creates and triggers offline training tasks, writes the task record to the database, and returns a task-submitted-successfully status. It also polls the task status continuously; if the task is complete, it sends a message to the terminal device through the message push service to notify the terminal device that training of the target model has finished. The offline training service uses AI deep learning methods to extract features from the to-be-trained voice data and train on them, finally obtaining the target voice model; it then updates the database, saves the target voice model data, and changes the task status to successful. The message push service provides a channel for pushing messages to the terminal device; the embodiments of the present application use this channel so that the terminal device learns the training result of the target voice model. After the terminal device receives the message pushed by the message push service in the server, it jumps from the message to the model page. Selecting the target voice model there triggers the TTS-SDK to store the trained target voice model locally. When the terminal device triggers a TVS service, such as novel reading, the TTS-SDK combines the target voice model with the timbreless voice data synthesized by TTS in the server to generate target voice data, and plays the target voice data to complete the voice broadcast.
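The task lifecycle described for the TTS customization service (record the task, poll its status, push a message on completion) can be sketched with an in-memory stand-in for the database. This is a minimal sketch under stated assumptions: the class and method names, the status strings `"submitted"` and `"success"`, and the `push` callback signature are all hypothetical and are not the actual TVS interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TrainingTask:
    task_id: str
    status: str = "submitted"  # hypothetical status value
    notified: bool = False     # whether the terminal has been messaged


class TtsCustomizationService:
    """Toy model of the task bookkeeping; a dict replaces the database."""

    def __init__(self, push: Callable[[str, str], None]) -> None:
        self.db: Dict[str, TrainingTask] = {}
        self.push = push  # stand-in for the message push service

    def submit(self, task_id: str) -> str:
        # Create the offline training task and record it.
        self.db[task_id] = TrainingTask(task_id)
        return "submitted"  # task-submitted-successfully status

    def mark_success(self, task_id: str) -> None:
        # Called on behalf of the offline training service when it finishes.
        self.db[task_id].status = "success"

    def poll(self) -> int:
        # Check every task; notify the terminal once per completed task.
        sent = 0
        for task in self.db.values():
            if task.status == "success" and not task.notified:
                self.push(task.task_id, "target voice model training complete")
                task.notified = True
                sent += 1
        return sent
```

Tracking `notified` separately from `status` keeps repeated polling from pushing the same completion message more than once, which is one plausible way to reconcile continuous polling with a single notification.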

Since the embodiments of the present application are applied in the field of cloud technology, some basic concepts of cloud technology are introduced before the voice playback method provided by the embodiments of the present application is described in detail. Cloud technology refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or a local area network to realize the computation, storage, processing, and sharing of data. Cloud technology is also a general term for the network technology, information technology, integration technology, management platform technology, application technology, and so on that are applied on the basis of the cloud computing business model; these can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, picture websites, and portal websites, require large amounts of computing and storage resources. With the rapid development and application of the Internet industry, every item may in the future have its own identification mark that needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong system support, which can only be achieved through cloud computing.

随着云技术研究和进步，云技术在多种方向展开研究，而人工智能(ArtificialIntelligence，AI)云服务，一般也被称作是人工智能即服务(AI as a Service，AIaaS)。这是目前主流的一种人工智能平台的服务方式，具体来说AIaaS平台会把几类常见的AI服务进行拆分，并在云端提供独立或者打包的服务。这种服务模式类似于开了一个AI主题商城，所有的开发者都可以通过应用程序编程接口(Application Programming Interface，API)的方式来接入使用平台提供的一种或者是多种人工智能服务，部分资深的开发者还可以使用平台提供的AI框架和AI基础设施来部署和运维自已专属的云人工智能服务。With the research and progress of cloud technology, cloud technology has been researched in various directions, and artificial intelligence (Artificial Intelligence, AI) cloud services are generally also called artificial intelligence as a service (AI as a Service, AIaaS). This is the current mainstream service method of artificial intelligence platforms. Specifically, the AIaaS platform will split several types of common AI services and provide independent or packaged services in the cloud. This service model is similar to opening an AI-themed mall. All developers can access one or more artificial intelligence services provided by the platform through Application Programming Interface (API). Some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy and operate their own cloud AI services.

其次，人工智能云服务是需要人工智能应用领域进行，因此对人工智能的一些基础概念进行介绍。人工智能是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能，感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说，人工智能是计算机科学的一个综合技术，它企图了解智能的实质，并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法，使机器具有感知、推理与决策的功能。人工智能技术是一门综合学科，涉及领域广泛，既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。机器学习(Machine Learning，ML)是一门多领域交叉学科，涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为，以获取新的知识或技能，重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心，是使计算机具有智能的根本途径，其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、式教学习等技术。Secondly, artificial intelligence cloud services are required in the field of artificial intelligence applications, so some basic concepts of artificial intelligence are introduced. Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology. The basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. 
Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning. Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in how computers simulate or realize human learning behaviors to acquire new knowledge or skills, and to reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications are in all fields of artificial intelligence. Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.

随着人工智能技术研究和进步，人工智能技术在多种方向展开研究，语音技术(SpeechTechnology)的关键技术有自动语音识别技术(ASR)和语音合成技术(TTS)以及声纹识别技术。让计算机能听、能看、能说、能感觉，是未来人机交互的发展方向，其中语音成为未来最被看好的人机交互方式之一。而自然语言处理(Nature Language processing，NLP)是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，所以它与语言学的研究有着密切的联系。自然语言处理技术通常包括文本处理、语义理解、机器翻译、机器人问答、知识图谱等技术。With the research and progress of artificial intelligence technology, artificial intelligence technology has been researched in various directions. The key technologies of speech technology (SpeechTechnology) include automatic speech recognition technology (ASR) and speech synthesis technology (TTS) and voiceprint recognition technology. Making computers able to hear, see, speak, and feel is the development direction of human-computer interaction in the future, and voice will become one of the most promising human-computer interaction methods in the future. Natural language processing (NLP) is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that can realize effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, the language that people use on a daily basis, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.

基于此，下面将介绍语音播放的方法，请参阅图3，图3为本申请实施例中模型训练的方法一个流程示意图，如图所示，具体地：Based on this, a method for voice playback will be introduced below, please refer to FIG. 3 , which is a schematic flowchart of the method for model training in the embodiment of the application, as shown in the figure, specifically:

在步骤S1中，终端设备向服务器发送待训练语音数据，该待训练语音数据为用户通过语音输入获取的语音数据。In step S1, the terminal device sends the voice data to be trained to the server, where the voice data to be trained is the voice data obtained by the user through voice input.

在步骤S2中，当终端设备向服务器发送的待训练语音数据数量满足语音训练数量阈值时，终端设备还可以向服务器发送语音模型训练指令。In step S2, when the quantity of the voice data to be trained sent by the terminal device to the server meets the voice training quantity threshold, the terminal device may also send a voice model training instruction to the server.

In step S3, the server obtains, through the preceding steps, a voice data set to be trained that includes N pieces of voice data to be trained. The server then responds to the voice model training instruction by taking the voice data set to be trained as the input of the voice model to be trained, and obtains through the voice model to be trained a predicted voice data set that includes N pieces of predicted voice data.

在步骤S4中，服务器基于预测语音数据集合以及待训练语音数据集合，对待训练语音模型进行训练，得到目标语音模型。In step S4, the server trains the speech model to be trained based on the predicted speech data set and the to-be-trained speech data set to obtain the target speech model.

在步骤S5中，服务器向终端设备发送目标语音模型。In step S5, the server sends the target speech model to the terminal device.

在步骤S6中，终端设备存储目标语音模型。In step S6, the terminal device stores the target speech model.

In step S7, when the user needs voice playback, the terminal device detects the operation for voice playback, receives the voice data to be parsed through voice input, generates a voice playback request carrying the voice data to be parsed, and sends the voice playback request to the server.

在步骤S8中，服务器基于语音播放请求中携带的待解析语音数据获取服务内容文本数据，并且根据服务内容文本数据获取无音色语音数据。In step S8, the server obtains service content text data based on the to-be-parsed voice data carried in the voice playback request, and obtains timbreless voice data according to the service content text data.

In step S9, the server sends the timbreless voice data to the terminal device.

In step S10, after acquiring the timbreless voice data, the terminal device generates target voice data according to the acquired timbreless voice data and the stored target voice model.

在步骤S11中，播放目标语音数据。In step S11, the target voice data is played.

The solutions provided by the embodiments of the present application relate to artificial intelligence cloud service technology within cloud technology. In combination with the above introduction, the voice playback method in the present application is introduced below. Please refer to FIG. 4, which is a schematic diagram of an embodiment of the model training method in an embodiment of the present application. As shown in the figure, an embodiment of the model training method in an embodiment of the present application includes:

101、获取待训练语音数据集合，其中，待训练语音数据集合包括N个待训练语音数据，N为大于，或者等于1的正整数；101. Obtain a voice data set to be trained, wherein the voice data set to be trained includes N voice data to be trained, and N is a positive integer greater than or equal to 1;

本实施例中，服务器获取包括N个待训练语音数据的待训练语音数据集合，即可以接收终端设备实时发送的待训练语音数据集合，也可以从终端设备的数据库中获取待训练语音数据集合，或者从服务器的数据库中获取待训练语音数据集合，具体此处不对获取待训练语音数据集合的方式进行限定。In this embodiment, the server acquires the to-be-trained voice data set including N to-be-trained voice data, that is, it can receive the to-be-trained voice data set sent by the terminal device in real time, or it can obtain the to-be-trained voice data set from the database of the terminal device, Or acquire the voice data set to be trained from the database of the server, and specifically, the manner of acquiring the voice data set to be trained is not limited here.

It should be understood that after the server obtains the voice data set to be trained through step 101, if the user needs to train other voice models, the server can go on to obtain the voice data sets required by those other voice models; that is, it does not need to wait until training of the target voice model is complete before starting to obtain them.

102、当N满足语音训练数量阈值，获取语音模型训练指令；102. When N meets the voice training quantity threshold, obtain a voice model training instruction;

In this embodiment, when the number of pieces of voice data to be trained in the voice data set to be trained reaches the voice training quantity threshold, the terminal device can obtain voice data collection completion status information. From this information the terminal device learns that acquisition of the voice data is complete and that training of the target voice model can begin, so it sends a voice model training instruction to the server, and the server thereby obtains the voice model training instruction sent by the terminal device.

Specifically, in this embodiment the voice training quantity threshold is 20; that is, once the voice data set to be trained includes 20 pieces of voice data to be trained, the terminal device can send the voice model training instruction to the server, and the server thereby obtains the voice model training instruction. It should be understood that in practical applications the voice training quantity threshold may also be another positive integer such as 15, 25, or 30, and the specific value should be determined flexibly according to the actual situation and requirements of the voice model to be trained.
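The threshold check in step 102 can be sketched as below. The function name and the reading of "meets the threshold" as "N is at least the threshold" are assumptions made for illustration; the application only states that the instruction is sent once the set includes 20 pieces of voice data to be trained.

```python
# Value used in this embodiment; the text also mentions 15, 25 and 30 as alternatives.
VOICE_TRAINING_THRESHOLD = 20


def should_send_training_instruction(collected: list,
                                     threshold: int = VOICE_TRAINING_THRESHOLD) -> bool:
    """Return True once the number N of collected samples reaches the threshold.

    Hypothetical helper: "reaches" is interpreted here as N >= threshold.
    """
    return len(collected) >= threshold
```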

103. In response to the voice model training instruction, obtain a predicted voice data set through the voice model to be trained based on the voice data set to be trained, wherein the predicted voice data set includes N pieces of predicted voice data, and the predicted voice data corresponds to the voice data to be trained;

In this embodiment, after receiving the voice model training instruction obtained in step 102, the server learns that the terminal device needs the voice model to be trained to be trained. In response to the voice model training instruction, the server therefore takes the voice data set to be trained obtained in step 101 as the input of the voice model to be trained, and the voice model to be trained outputs a predicted voice data set including N pieces of predicted voice data, where the predicted voice data corresponds to the voice data to be trained.

Exemplarily, suppose the voice data set to be trained includes voice data to be trained A through voice data to be trained T, 20 pieces in total. After this set is taken as the input of the voice model to be trained, the predicted voice data set output by the voice model to be trained includes predicted voice data A through predicted voice data T, 20 pieces in total. Voice data to be trained A corresponds to predicted voice data A; that is, if voice data to be trained A alone is taken as the input of the voice model to be trained, the model outputs predicted voice data A. Likewise, voice data to be trained T corresponds to predicted voice data T. The other correspondences are similar and are not repeated here.

104、基于预测语音数据集合以及待训练语音数据集合，对待训练语音模型进行训练，得到目标语音模型；104. Based on the predicted speech data set and the to-be-trained speech data set, train the to-be-trained speech model to obtain a target speech model;

In this embodiment, the server, through the offline training module, uses a deep learning algorithm to perform feature extraction on the predicted voice data set obtained in step 103 and the voice data set to be trained obtained in step 101, and trains the voice model to be trained. When the training condition is satisfied, the target voice model is obtained, and after training is complete the obtained target voice model is stored in the database in the server.

示例性地，若用户将“爸爸”的声音作为待训练语音数据，那么训练得到的目标语音模型可以为“爸爸”对应声音的语音模型，若用户将“妈妈”的声音作为待训练语音数据，那么训练得到的目标语音模型可以为“妈妈”对应声音的语音模型，具体目标语音模型需要通过待训练语音数据的实际输入情况灵活确定。Exemplarily, if the user takes the voice of "Dad" as the voice data to be trained, then the target voice model obtained by training can be the voice model of the corresponding voice of "Dad". If the user takes the voice of "Mom" as the voice data to be trained, Then the target speech model obtained by training can be the speech model corresponding to the voice of "mother", and the specific target speech model needs to be flexibly determined by the actual input of the speech data to be trained.

105、发送目标语音模型，以使得终端设备存储目标语音模型。105. Send the target voice model, so that the terminal device stores the target voice model.

本实施例中，服务器通过步骤104得到目标语音模型后，还可以向终端设备发送目标语音模型，终端设备接收到目标语音模型后，可以存储目标语音模型，即终端设备通过所集成的TTS-SDK将目标语音模型加载到终端设备的数据结构中，当终端设备需要根据目标语音播放时，可以根据存储的目标语音模型生成目标语音数据，从而完成语音播放。In this embodiment, after the server obtains the target voice model through step 104, it can also send the target voice model to the terminal device. After the terminal device receives the target voice model, it can store the target voice model, that is, the terminal device can store the target voice model through the integrated TTS-SDK. The target voice model is loaded into the data structure of the terminal device, and when the terminal device needs to play the target voice, the target voice data can be generated according to the stored target voice model, thereby completing the voice play.

The embodiments of the present application provide a model training method. In the above manner, because the target voice model is obtained by training the voice model to be trained based on the predicted voice data set and the voice data set to be trained, the terminal device can obtain a target voice model corresponding to the voice data to be trained provided by the user, which improves the flexibility of voice model generation and satisfies the user's voice customization requirements. Secondly, the user has more voice models to choose from when performing voice playback, which improves the flexibility of voice playback and thus the user's voice playback experience.

可选地，在上述图4对应的实施例的基础上，本申请实施例提供的模型训练的方法一个可选实施例中，获取待训练语音数据集合，具体包括如下步骤：Optionally, on the basis of the above-mentioned embodiment corresponding to FIG. 4 , in an optional embodiment of the model training method provided by the embodiment of the present application, acquiring a set of speech data to be trained specifically includes the following steps:

获取语音服务请求；Get voice service request;

本实施例中，提供了一种获取待训练语音数据集合的方法，当用户需要进行语音模型训练时，可以对终端设备进行针对于语音服务的操作，该操作包括但不限于单击语音训练接口，双击语音训练接口，滑动语音训练接口，应理解，在实际应用中，该操作还可以为通过语音进行控制以及手势进行控制。例如，对于语音控制而言，该操作可以为用户通过语音输入装置触发语音服务(例如，用户说出“定制我的声音”)，对于手势控制而言，该操作可以为通过用户相应手势触发手势指令，具体此处不对该操作进行限定。In this embodiment, a method for acquiring a voice data set to be trained is provided. When a user needs to perform voice model training, he can perform operations on the terminal device for voice services, including but not limited to clicking on the voice training interface. , double-click the voice training interface, and slide the voice training interface. It should be understood that, in practical applications, this operation can also be controlled by voice and gesture. For example, for voice control, the operation may be the user triggering the voice service through the voice input device (eg, the user says "customize my voice"), and for gesture control, the operation may be triggering the gesture through the user's corresponding gesture instruction, the operation is not limited here.

Accordingly, the server obtains the voice service request, learns that the terminal device needs voice model training, and triggers the TVS service in the server. In response to the voice service request, the server generates a voice data collection instruction and sends it to the terminal device, from which the terminal device learns that the server needs voice data for training. The terminal device then needs to determine whether the voice data to be detected satisfies the noise qualification condition, and only if the condition is satisfied does it obtain the corresponding voice training text according to the voice data collection instruction. The user can then perform voice input for the voice training text; the terminal device obtains the voice data to be detected from the user's voice input and sends it to the server. After obtaining the voice data to be detected, the server transmits it to the TTS customization module through the TVS-web interface and determines whether it satisfies the voice training qualification condition. If so, the voice data to be detected is determined to be voice data to be trained. If not, the terminal device continues to obtain voice data to be detected based on the user's voice input and sends it to the server, and the server judges it by similar steps. After N pieces of voice data to be trained have been obtained in this way, the voice data set to be trained can be generated.

Specifically, the voice training qualification condition is that the similarity between the voice data to be detected and the voice training text is greater than a preset threshold. In this embodiment the similarity should be greater than 95%; in practical applications the condition may instead require a similarity greater than 98%, or greater than 90%, and so on, and the specific condition should be determined flexibly according to the actual situation and requirements. For ease of understanding, take as an example the condition that the similarity between the voice data to be detected and the voice training text is greater than 95%. If the voice training text is "跟着我一起跳" ("jump with me") and the text data obtained by semantic recognition of the user's voice input is also "跟着我一起跳", the similarity between the voice data to be detected and the voice training text is 100%, so the voice training qualification condition is satisfied. If instead the voice training text is the tongue twister "八百标兵奔北坡，北坡炮兵并排跑" and the text data obtained by semantic recognition of the user's voice input is "八百标兵奔北坡，北坡标兵不用跑", the similarity between the voice data to be detected and the voice training text is less than 95%, so the voice training qualification condition is not satisfied.
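The qualification check above can be illustrated with a short sketch. The application does not say how similarity is computed, so the use of Python's `difflib.SequenceMatcher` ratio on the recognized text is an assumption chosen only for illustration, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def text_similarity(recognized: str, reference: str) -> float:
    """Character-level similarity in [0, 1].

    Assumption: SequenceMatcher.ratio() stands in for the unspecified
    similarity measure between recognized text and training text.
    """
    return SequenceMatcher(None, recognized, reference).ratio()


def meets_training_condition(recognized: str, reference: str,
                             threshold: float = 0.95) -> bool:
    # The embodiment requires the similarity to be strictly greater than the threshold.
    return text_similarity(recognized, reference) > threshold
```

Under this measure, identical texts score 1.0 and qualify, while the two tongue-twister strings from the example differ in several characters and fall below the 95% threshold.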

For ease of understanding, please refer to FIG. 5, which is another schematic flowchart of the model training method in an embodiment of the present application. As shown in the figure, in step A1 the terminal device sends a voice service request to the server. In step A2, after receiving the voice service request, the server responds to it by sending a voice data collection instruction to the terminal device. In step A3, after receiving the voice data collection instruction, the terminal device responds to it by first determining whether the voice data to be detected satisfies the noise qualification condition and, if so, obtaining the voice training text according to the voice data collection instruction; the user then performs voice input based on the voice training text, so that the terminal device obtains the voice data to be detected. In step A4, the terminal device sends the obtained voice data to be detected to the server. In step A5, when the server determines that the voice data to be detected satisfies the voice training qualification condition, it determines that the voice data to be detected is voice data to be trained. In step A6, steps A1 to A5 are repeated, and the voice data set to be trained is generated based on the voice data to be trained determined in this way.
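The repeated collection in steps A1 to A6 can be sketched as a simple loop. This is only an illustration: `collect_training_set`, the `is_qualified` callback, and the use of strings in place of audio are hypothetical, and the network exchange between terminal device and server is omitted.

```python
from typing import Callable, Iterable, List


def collect_training_set(candidates: Iterable[str],
                         is_qualified: Callable[[str], bool],
                         n_required: int) -> List[str]:
    """Keep qualified samples until N pieces of voice data to be trained are collected.

    `candidates` stands in for successive voice inputs from the user, and
    `is_qualified` stands in for the server-side qualification check.
    """
    training_set: List[str] = []
    for sample in candidates:
        if is_qualified(sample):
            training_set.append(sample)   # becomes voice data to be trained
        if len(training_set) >= n_required:
            break                         # the set is complete
    return training_set
```

Samples that fail the check are simply skipped, which mirrors the terminal device asking the user for another voice input.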

应理解，前述示例仅用于理解本方案，在实际应用中，语音训练文本需要通过语音模型的实际需求与设计确定，而待检测语音数据则基于用户语音输入的情况确定，因此是否满足语音训练合格条件需要通过语音训练文本以及待检测语音数据的实际情况灵活确定。It should be understood that the foregoing examples are only used to understand this solution. In practical applications, the voice training text needs to be determined by the actual needs and design of the voice model, and the voice data to be detected is determined based on the user's voice input, so whether it satisfies the voice training. Eligibility conditions need to be flexibly determined through the speech training text and the actual situation of the speech data to be detected.

本申请实施例中，提供了一种获取待训练语音数据集合的方法，采用上述方式，通过语音数据收集指令获取待检测语音数据，并且将满足语音训练合格条件的待检测语音数据确定为待训练语音数据，由此可以生成包括至少一个待训练语音数据的待训练语音数据集合，从而提供待训练语音模型的输入数据，从而提升本方案的可行性。In the embodiment of the present application, a method for obtaining a voice data set to be trained is provided. In the above manner, the voice data to be detected is obtained through a voice data collection instruction, and the voice data to be detected that meets the eligibility conditions for voice training is determined as the voice data to be trained. voice data, thereby generating a set of voice data to be trained including at least one voice data to be trained, thereby providing input data for the voice model to be trained, thereby improving the feasibility of this solution.

Optionally, on the basis of the embodiment corresponding to FIG. 4 above, in another optional embodiment of the model training method provided by the embodiments of the present application, training the voice model to be trained based on the predicted voice data set and the voice data set to be trained to obtain the target voice model specifically includes the following steps:

基于预测语音数据集合以及待训练语音数据集合，根据目标损失函数更新待训练语音模型的模型参数；Based on the predicted speech data set and the to-be-trained speech data set, the model parameters of the to-be-trained speech model are updated according to the target loss function;

This embodiment provides a method for training the target voice model. Because the server needs to train the model parameters of the voice model to be trained with the voice data set to be trained as the target, after obtaining one piece of predicted voice data from the predicted voice data set, the server can obtain from the voice data set to be trained the piece of voice data to be trained that corresponds to it. The server can then generate the value of the target loss function from the predicted voice data and its corresponding voice data to be trained, and judge from that value whether the target loss function has reached the convergence condition. If the convergence condition has not been reached, the value of the target loss function is used to update the model parameters of the voice model to be trained. Thus, each time the voice model to be trained generates a piece of predicted voice data, the server can perform the foregoing operation once, until the target loss function reaches the convergence condition, at which point the target voice model is generated from the model parameters obtained after the last update.

The target loss function may be the sum of the differences between the predicted voice data and the corresponding voice data to be trained, the sum of the absolute values of those differences, or the square of the sum of those differences; other forms of target loss function may also be used. The specific target loss function can be chosen according to the actual situation and is not limited here. The convergence condition of the target loss function may be that the value of the target loss function is less than or equal to a first preset threshold; as an example, the first preset threshold may be 0.005, 0.01, 0.02, or another value approaching 0. The convergence condition may also be that the difference between two adjacent values of the target loss function is less than or equal to a second preset threshold, whose value may be the same as or different from that of the first preset threshold; as an example, the second preset threshold may be 0.005, 0.01, 0.02, or another value approaching 0. The server may also adopt other convergence conditions, which are not limited here.
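The loss forms and convergence conditions listed above can be written out as a small sketch. Representing each piece of voice data as a single number is an assumption made only to keep the example short, and all function names are hypothetical.

```python
from typing import Sequence


def sum_of_differences(pred: Sequence[float], target: Sequence[float]) -> float:
    # Sum of (predicted - target) over corresponding pairs.
    return sum(p - t for p, t in zip(pred, target))


def sum_of_absolute_differences(pred: Sequence[float], target: Sequence[float]) -> float:
    # Sum of |predicted - target| over corresponding pairs.
    return sum(abs(p - t) for p, t in zip(pred, target))


def squared_sum_of_differences(pred: Sequence[float], target: Sequence[float]) -> float:
    # Square of the sum of differences, as listed in the text.
    return sum_of_differences(pred, target) ** 2


def converged_by_value(loss: float, first_threshold: float = 0.01) -> bool:
    # First condition: the loss value itself is at or below the first preset threshold.
    return loss <= first_threshold


def converged_by_change(previous_loss: float, current_loss: float,
                        second_threshold: float = 0.01) -> bool:
    # Second condition: two adjacent loss values differ by at most the second preset threshold.
    return abs(current_loss - previous_loss) <= second_threshold
```

Note that the plain sum of differences lets positive and negative errors cancel, which is one reason an implementation might prefer the absolute-value form.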

In this embodiment of the present application, a method for training the target voice model is provided. Based on the predicted voice data set and the to-be-trained voice data set, the model parameters of the to-be-trained voice model are updated according to the target loss function, and if the target loss function reaches convergence, the target voice model is generated from the model parameters. This provides a specific implementation by which the server updates the parameters of the to-be-trained voice model and thereby generates the target voice model, which improves the implementability of the solution.
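The predict, compare, check, update cycle described in this embodiment can be illustrated with a deliberately tiny model. Everything specific in the sketch is an assumption made for illustration: the model has a single parameter `weight`, each sample is a hypothetical `(source, target)` pair of numbers, the loss is the absolute difference, and the update is a plain gradient-descent step on the squared error. The application does not specify the model structure or the update rule.

```python
from typing import List, Tuple


def train_toy_model(
    samples: List[Tuple[float, float]],
    learning_rate: float = 0.1,      # assumed value, not from the application
    first_threshold: float = 0.01,   # one of the example thresholds in the text
    max_steps: int = 1000,
) -> Tuple[float, int]:
    """Update one parameter per predicted sample until the loss converges."""
    weight = 0.0
    for step in range(max_steps):
        source, target = samples[step % len(samples)]
        predicted = weight * source        # one piece of predicted data
        loss = abs(predicted - target)     # compare with the matching training data
        if loss <= first_threshold:        # convergence condition reached
            return weight, step
        # Gradient step on 0.5 * (predicted - target) ** 2 with respect to weight;
        # reducing the squared error also reduces the absolute difference.
        weight -= learning_rate * (predicted - target) * source
    return weight, max_steps
```

With samples drawn from `target = 2 * source`, for example `train_toy_model([(1.0, 2.0), (2.0, 4.0)])`, the returned weight ends up within the threshold of 2.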

Optionally, on the basis of the embodiment corresponding to FIG. 4 above, in another optional embodiment of the model training method provided by the embodiments of the present application, the model training method further includes the following steps:

Obtaining a voice playback request, wherein the voice playback request carries to-be-parsed voice data;

Obtaining service content text data according to the to-be-parsed voice data, wherein the service content text data corresponds to the to-be-parsed voice data;

Obtaining timbre-free voice data according to the service content text data, wherein the timbre-free voice data is obtained by performing speech synthesis on the service content text data;

Sending the timbre-free voice data to the terminal device, so that the terminal device generates target voice data according to the timbre-free voice data and the target voice model, and plays the target voice data.

In this embodiment, when the user wants voice playback, the terminal device may detect the operation for voice playback and receive the to-be-parsed voice data through voice input, generate a voice playback request carrying the to-be-parsed voice data, and send the request to the server. The server thus obtains the voice playback request and performs speech recognition on the to-be-parsed voice data it carries; the speech recognition may specifically include conversion to text and semantic understanding. Through speech recognition the server determines the user's intent and needs, and from these determines the service content text data. The server then performs speech synthesis on the service content text data, converting the text data into voice data, to obtain the timbre-free voice data. The server sends the terminal a TTS instruction that carries the timbre-free voice data. After obtaining the TTS instruction, the terminal can generate the target voice data from the timbre-free voice data and the stored target voice model, and play the target voice data, completing voice playback.

As an example, suppose the to-be-parsed voice data carried in the voice playback request is "I want to listen to The Ugly Duckling". Refer to FIG. 6, which is a schematic flowchart of the voice playback method in an embodiment of the present application. As shown in the figure, in step F1 the user initiates a voice playback request on the terminal device as needed, and the to-be-parsed voice data carried in the request is "I want to listen to The Ugly Duckling". In step F2 the terminal device sends the voice playback request to the server. In step F3, after speech recognition such as conversion to text and semantic understanding, the TVS service in the server can determine that the user's intent is to listen to a story and that the desired content is the story of "The Ugly Duckling". A request is therefore sent to the skill service in the TVS service, specifically to the children's story content service, and the skill service returns the service content text data of the story "The Ugly Duckling" from the children's story content. In step F4 the server performs text-to-speech conversion on the service content text data of "The Ugly Duckling" to obtain the timbre-free voice data, that is, voice data whose content is the fairy tale "The Ugly Duckling" and which has no pitch or timbre. In step F5 the server sends the timbre-free voice data to the terminal device. In step F6 the terminal device generates the target voice data from the timbre-free voice data and the target voice model, and in step F7 it plays the target voice data, that is, it plays the story of "The Ugly Duckling" in the target voice, for example "Once upon a time, there was an ugly duckling...". It should be understood that the foregoing example is only intended to aid understanding of this solution; in practical applications, the specific timbre-free voice data needs to be determined flexibly according to the to-be-parsed voice data.
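The server-side part of this flow (steps F3 to F5) can be sketched as a short pipeline. The sketch is hypothetical throughout: `recognize`, `fetch_service_content`, `synthesize_timbre_free`, `TtsInstruction`, and `STORY_CATALOGUE` are illustrative names, the "audio" is represented by a transcript string, and the "timbre-free voice data" by encoded text. Real speech recognition and synthesis are out of scope here.

```python
from dataclasses import dataclass

# Hypothetical catalogue standing in for the children's story skill service.
STORY_CATALOGUE = {
    "the ugly duckling": "Once upon a time, there was an ugly duckling...",
}


@dataclass
class TtsInstruction:
    """Message sent to the terminal; carries the timbre-free voice data."""
    timbre_free_voice: bytes


def recognize(voice_to_parse: str) -> str:
    """Stand-in for speech recognition: extract the requested title (step F3)."""
    prefix = "i want to listen to "
    text = voice_to_parse.strip().lower()
    if not text.startswith(prefix):
        raise ValueError("unrecognized request")
    return text[len(prefix):]


def fetch_service_content(title: str) -> str:
    """Stand-in for the skill service returning service content text data (step F3)."""
    return STORY_CATALOGUE[title]


def synthesize_timbre_free(text: str) -> bytes:
    """Stand-in for speech synthesis; real output would be audio (step F4)."""
    return text.encode("utf-8")


def handle_playback_request(voice_to_parse: str) -> TtsInstruction:
    """Run recognition, content lookup, and synthesis, then build the reply (step F5)."""
    title = recognize(voice_to_parse)
    content = fetch_service_content(title)
    return TtsInstruction(timbre_free_voice=synthesize_timbre_free(content))
```

The terminal-side steps F6 and F7 are left out on purpose, since they depend on the target voice model stored on the device.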

In this embodiment of the present application, a method for playing voice data based on the target voice model is provided. The server obtains the service content text data according to the to-be-parsed voice data carried in the voice playback request, and sends to the terminal device the timbre-free voice data obtained from the service content text data. In this way, the terminal device can generate the target voice data from the timbre-free voice data and the target voice model and play it, which improves the feasibility of the solution.

The voice playback method of the present application is introduced below. Refer to FIG. 7, which is a schematic diagram of an embodiment of the voice playback method in an embodiment of the present application. As shown in the figure, an embodiment of the voice playback method includes:

201. Detect an operation for voice playback and send a voice playback request, so that the server obtains service content text data based on to-be-parsed voice data and obtains timbre-free voice data according to the service content text data, wherein the voice playback request carries the to-be-parsed voice data, the service content text data is obtained according to the to-be-parsed voice data, and the timbre-free voice data is obtained by performing speech synthesis on the service content text data;

In this embodiment, when the user wants voice playback, the terminal device may detect the operation for voice playback, obtain the to-be-parsed voice data from voice input through the voice input interface, and generate a voice playback request carrying the to-be-parsed voice data. The terminal device may then send the voice playback request to the server through the TVS-SDK, so that the TVS voice service in the server performs speech recognition on the to-be-parsed voice data carried in the request to obtain the service content text data, and then performs speech synthesis on the service content text data to obtain the timbre-free voice data.

Specifically, the operation for voice playback includes, but is not limited to, single-clicking, double-clicking, or sliding the voice playback interface. For ease of understanding, take a mobile phone as the terminal device and a single click on the voice playback interface as the operation. Refer to FIG. 8, which is a schematic diagram of an embodiment of the operation for voice playback in an embodiment of the present application. As shown in the figure, B1 indicates the voice playback interface and B2 indicates the voice input interface. Part (A) of FIG. 8 shows the interface on which the user clicks the voice playback interface B1, after which the display jumps to the interface shown in part (B) of FIG. 8. Part (B) of FIG. 8 shows the interface on which the user performs voice input through the voice input interface B2. For example, after clicking the voice input interface B2 on the interface shown in part (B) of FIG. 8, the user says "I want to listen to The Ugly Duckling" to the terminal device, and the terminal device obtains the to-be-parsed voice data "I want to listen to The Ugly Duckling". It should be understood that the foregoing example is only intended to aid understanding of this solution; in practical applications, the operation may also be performed by voice control or gesture control. For example, for voice control the operation may be the user triggering the voice service through a voice input apparatus (for example, by saying "voice play"), and for gesture control the operation may be a gesture instruction triggered by a corresponding user gesture; the operation is not limited here. In addition, in practical applications, the shape, size, position, and corresponding text of the voice playback interface and the voice input interface need to be determined flexibly according to the actual situation.

202. Obtain the timbre-free voice data;

In this embodiment, after the server obtains the timbre-free voice data in step 201, it sends the timbre-free voice data to the terminal device, and the TVS-SDK integrated in the terminal device thereby obtains it. For example, if the to-be-parsed voice data is "I want to listen to The Ugly Duckling", then, as in the foregoing embodiment, the timbre-free voice data is voice data whose content is the fairy tale "The Ugly Duckling" and which has no pitch or timbre. If the to-be-parsed voice data is "I want to listen to Ugly", then similarly the timbre-free voice data may be voice data whose content is the song "Ugly" and which has no timbre. It should be understood that the foregoing examples are only intended to aid understanding of this solution; in practical applications, the specific to-be-parsed voice data needs to be determined flexibly according to the user's needs, and the specific timbre-free voice data needs to be determined flexibly according to the to-be-parsed voice data.

203. Generate target voice data according to the timbre-free voice data and a target voice model, wherein the target voice model is obtained by training a to-be-trained voice model based on a predicted voice data set and a to-be-trained voice data set, and the predicted voice data set is obtained by the server through the to-be-trained voice model, based on the to-be-trained voice data set, in response to a voice model training instruction;

In this embodiment, after obtaining the timbre-free voice data, the TVS-SDK integrated in the terminal device passes it to the TTS-SDK, and the user may select the target voice model at this point. As described in the foregoing embodiments, the target voice model is obtained by training the to-be-trained voice model based on the predicted voice data set and the to-be-trained voice data set, and the predicted voice data set is obtained by the server through the to-be-trained voice model, based on the to-be-trained voice data set, in response to the voice model training instruction; details are not repeated here.

As an example, if the user uses "Dad's" voice as the to-be-trained voice data, the target voice model may be the voice model corresponding to "Dad's" voice; if the user uses "Mom's" voice as the to-be-trained voice data, the target voice model may be the voice model corresponding to "Mom's" voice. If the user selects the voice model corresponding to "Mom's" voice as the target voice model, the TTS-SDK feeds the timbre-free voice data into the target voice model, and the target voice model outputs the target voice data, that is, voice data with "Mom's" timbre and pitch. It should be understood that the foregoing examples are only intended to aid understanding of this solution; in practical applications, the specific target voice data needs to be determined flexibly according to the timbre-free voice data and the target voice model selected by the user.
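Selecting a stored model and feeding it the timbre-free voice data can be pictured with a small stand-in. The sketch below is purely illustrative: `LocalModelLibrary` and `make_toy_model` are hypothetical names, and the "model" merely tags the data with a speaker label, whereas a real target voice model would re-render the audio in that speaker's timbre and pitch.

```python
from typing import Callable, Dict

# A voice model is treated here as a function from timbre-free data to target data.
VoiceModel = Callable[[bytes], bytes]


def make_toy_model(speaker: str) -> VoiceModel:
    """Toy stand-in for a trained target voice model."""
    tag = f"[{speaker}] ".encode("utf-8")
    return lambda timbre_free: tag + timbre_free


class LocalModelLibrary:
    """Hypothetical on-device store of trained target voice models."""

    def __init__(self) -> None:
        self._models: Dict[str, VoiceModel] = {}

    def store(self, name: str, model: VoiceModel) -> None:
        self._models[name] = model

    def generate(self, name: str, timbre_free: bytes) -> bytes:
        """Apply the model the user selected to the timbre-free voice data."""
        return self._models[name](timbre_free)
```

For instance, after storing models under "Mom" and "Dad", calling `generate("Mom", data)` produces output from the "Mom" model only, which mirrors the user's selection step.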

204. Play the target voice data.

In this embodiment, after obtaining the target voice data, the terminal device can play it. For ease of understanding, take the case where the target voice model is the voice model corresponding to "Mom's" voice. Refer to FIG. 9, which is a schematic diagram of another embodiment of the voice playback method in an embodiment of the present application. As shown in the figure, C1 indicates the voice input interface. After clicking the voice input interface C1, the user says "I want to listen to The Ugly Duckling" to the terminal device, and the terminal device obtains the to-be-parsed voice data "I want to listen to The Ugly Duckling". Through the steps of the foregoing embodiments, target voice data whose text content is the fairy tale "The Ugly Duckling" is obtained, so the terminal device plays the target voice data "Once upon a time, there was an ugly duckling..." in "Mom's" timbre and pitch. It should be understood that the foregoing example is only intended to aid understanding of this solution; in practical applications, the specific target voice data needs to be determined flexibly according to the timbre-free voice data and the target voice model selected by the user.

In this embodiment of the present application, a voice playback method is provided. In this way, the terminal device can generate the target voice data from the timbre-free voice data and the target voice model. Because the target voice model is obtained from to-be-trained voice data provided by the user, voice model generation becomes more flexible, which in turn makes generation of the target voice data more flexible. In addition, because the generated target voice data is what is played, the more flexible generation of target voice data also makes voice playback more flexible.

Optionally, on the basis of the embodiment corresponding to FIG. 7 above, in an optional embodiment of the voice playback method provided by the embodiments of the present application, the voice playback method further includes the following steps:

Detecting an operation for a voice service, and sending a voice service request;

Obtaining a voice data collection instruction when the server responds to the voice service request;

In response to the voice data collection instruction, and when the to-be-detected voice data meets the noise qualification condition, obtaining the to-be-detected voice data, wherein the to-be-detected voice data is obtained based on a voice training text, and the voice training text is obtained according to the voice data collection instruction;

Sending the to-be-detected voice data, so that the server generates a to-be-trained voice data set based on the to-be-detected voice data, wherein the to-be-trained voice data set includes N pieces of to-be-trained voice data, and N is an integer greater than or equal to 1.

In this embodiment, a method for sending to-be-detected voice data is provided. When the user wants to train a voice model, the user can perform an operation for the voice service on the terminal device. The terminal device detects the operation and sends a voice service request to the server. On obtaining the voice service request, the server learns of the terminal device's need for a voice model, which triggers the TVS service in the server; in response to the voice service request, the server generates a voice data collection instruction and sends it to the terminal device. From the voice data collection instruction, the terminal device learns that the server needs voice data for training. The terminal device first determines whether the to-be-detected voice data meets the noise qualification condition; if it does, the terminal device obtains the corresponding voice training text according to the voice data collection instruction, the user performs voice input for the voice training text, and the terminal device obtains the to-be-detected voice data from the user's voice input.

Further, the terminal device sends the to-be-detected voice data to the server. After obtaining it, the server passes the to-be-detected voice data to the TTS customization module through the TVS-web interface and determines whether it meets the voice training qualification condition. If it does, the to-be-detected voice data is determined to be to-be-trained voice data; if it does not, the terminal device continues to send the server to-be-detected voice data that meets the noise qualification condition, and the server evaluates that data by the same steps. Once N pieces of to-be-trained voice data have been obtained in this way, the to-be-trained voice data set can be generated.

Specifically, the noise qualification condition is that the ambient noise is below a preset decibel threshold. In this embodiment, the noise qualification condition is that the ambient noise is below 45 decibels. It should be understood that, in practical applications, if more accurate to-be-detected voice data is required, the condition may instead be that the ambient noise is below 40 decibels, and if the to-be-detected voice data needs to be obtained more quickly, the condition may be that the ambient noise is below 50 decibels. The specific noise qualification condition is determined flexibly according to the actual situation and requirements.
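The noise qualification condition reduces to a strict comparison against a configurable threshold. In the minimal sketch below, the constant and function names are assumptions, and the ambient noise level is taken as an already-measured decibel value; how the terminal device measures it is not described in the text.

```python
# Thresholds taken from the text; the constant names are illustrative.
DEFAULT_THRESHOLD_DB = 45.0   # threshold used in this embodiment
STRICT_THRESHOLD_DB = 40.0    # when more accurate voice data is required
LENIENT_THRESHOLD_DB = 50.0   # when voice data must be obtained more quickly


def noise_is_acceptable(ambient_noise_db: float,
                        threshold_db: float = DEFAULT_THRESHOLD_DB) -> bool:
    """Return True when the ambient noise is strictly below the threshold."""
    return ambient_noise_db < threshold_db
```

A reading of 42 dB, for example, passes the default 45 dB condition but fails the stricter 40 dB one.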

For ease of understanding, refer to FIG. 10, which is another schematic flowchart of the voice playback method in an embodiment of the present application. As shown in the figure, in step D1 the user performs an operation for the voice service, which generates a voice service request. In step D2 the terminal device detects the operation for the voice service and sends the voice service request to the server. In step D3 the server, in response to the voice service request, sends a voice data collection instruction to the terminal device. In step D4 the terminal device, in response to the voice data collection instruction, determines whether the to-be-detected voice data meets the noise qualification condition, and if so performs step D5. In step D5 the terminal device obtains the voice training text according to the voice data collection instruction, and the user inputs the to-be-detected voice data by voice based on the voice training text. In step D6 the terminal device sends the to-be-detected voice data to the server. In step D7, when the server determines that the to-be-detected voice data meets the voice training qualification condition, it determines that the to-be-detected voice data is to-be-trained voice data and generates the to-be-trained voice data set based on the to-be-trained voice data so determined. In step D8, when the number of pieces of to-be-trained voice data in the set reaches the voice training quantity threshold, the terminal device can obtain the voice data collection completion status information, from which it learns that collection of the voice data is complete and that training of the target voice model can begin; the terminal device therefore sends a voice model training instruction to the server. In step D9, on receiving the voice model training instruction, the server learns that the terminal device needs the to-be-trained voice model to be trained, and in response performs model training to obtain the target voice model. In step D10 the server sends the target voice model to the terminal device, and in step D11 the terminal device stores it. In step D12, when the user wants customized voice playback, the terminal device can send a voice playback request to the server. In step D13 the server obtains the service content text data according to the to-be-parsed voice data, and in step D14 it obtains the timbre-free voice data according to the service content text data. In step D15 the server sends the timbre-free voice data to the terminal device, in step D16 the terminal device generates the target voice data from the timbre-free voice data and the target voice model, and in step D17 the terminal device plays the target voice data. The specific manner of each step has been described in the foregoing embodiments and is not repeated here. It should be understood that the foregoing example is only intended to aid understanding of this solution; in practical applications, whether specific to-be-detected voice data is sent to the server needs to be determined flexibly according to the user's voice input.
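Steps D6 to D8 amount to accumulating qualified samples until the quantity threshold is reached. A minimal sketch follows, with several assumptions: `TrainingSetCollector` is a hypothetical name, "reaches the threshold" is read as greater than or equal to, and the voice training qualification condition is supplied by the caller as a predicate because the text here does not define it.

```python
from typing import Callable, List


class TrainingSetCollector:
    """Server-side accumulation of to-be-trained voice data (illustrative only)."""

    def __init__(self, quantity_threshold: int,
                 is_qualified: Callable[[bytes], bool]) -> None:
        self._quantity_threshold = quantity_threshold
        self._is_qualified = is_qualified
        self._training_set: List[bytes] = []

    def submit(self, detected_voice: bytes) -> bool:
        """Accept the sample as to-be-trained voice data if it qualifies (step D7)."""
        if not self._is_qualified(detected_voice):
            return False
        self._training_set.append(detected_voice)
        return True

    @property
    def collection_complete(self) -> bool:
        """True once N has reached the voice training quantity threshold (step D8)."""
        return len(self._training_set) >= self._quantity_threshold
```

Rejected samples do not count toward N, so the terminal device keeps sending data until `collection_complete` becomes true.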

In this embodiment of the present application, a method for sending to-be-detected voice data is provided. In this way, when the noise qualification condition is met, the to-be-detected voice data is obtained as required by the user and sent to the server. This improves the accuracy of the to-be-detected voice data and makes it easier for the to-be-trained voice model to extract features from it, which improves the accuracy and efficiency of generating the target voice model and, in turn, the accuracy and efficiency of voice playback.

Optionally, on the basis of the embodiment corresponding to FIG. 7 above, in another optional embodiment of the voice playback method provided by the embodiments of the present application, the voice playback method further includes the following steps:

When N meets the voice training quantity threshold, sending a voice model training instruction, so that the server obtains the target voice model;

Obtaining the target voice model;

Storing the target voice model.

In this embodiment, a method for storing the target voice model is provided. After the terminal device sends the to-be-detected voice data to the server through the foregoing steps, when the server determines that the number of pieces of to-be-trained voice data in the to-be-trained voice data set reaches the voice training quantity threshold, the terminal device can obtain the voice data collection completion status information, from which it learns that collection of the voice data is complete and that model training can begin. The terminal device therefore sends a voice model training instruction to the server. In response to the voice model training instruction, the server obtains the predicted voice data set through the to-be-trained voice model based on the to-be-trained voice data set, trains the to-be-trained voice model based on the predicted voice data set and the to-be-trained voice data set to obtain the target voice model, and sends the target voice model to the terminal device, which stores it. When the terminal device needs to play voice in the target voice, it can generate the target voice data using the stored target voice model and thereby complete voice playback.

In this embodiment of the present application, a method for storing the target voice model is provided. In this way, training of the target voice model can be started as required by the user, and the resulting target voice model is stored; when voice playback is needed, the target voice data is generated through the target voice model to complete voice playback, which improves the feasibility of the solution.

Obtaining voice data collection status information, wherein the voice data collection status information indicates that to-be-detected voice data is being obtained;

Obtaining voice data collection completion status information, wherein the voice data collection completion status information indicates that collection of the voice data is complete and that training of the target voice model can begin;

Obtaining target voice model training status information, wherein the target voice model training status information indicates that the to-be-trained voice model is being trained;

Obtaining target voice model completion status information, wherein the target voice model completion status information indicates that training of the target voice model is complete.

In this embodiment, the terminal device can also acquire different status information and learn the state of model training from it. Specifically, the terminal device can acquire voice data collection status information, voice data collection completion status information, target voice model training status information, and target voice model completion status information. From the voice data collection status information, the terminal device learns that voice data to be detected is being acquired. From the voice data collection completion status information, it learns that acquisition of the voice data is complete, that is, N meets the voice training quantity threshold and training of the target voice model can begin; the terminal device therefore sends a voice model training instruction to the server based on the voice data collection completion status information. Further, from the target voice model training status information, the terminal device learns that the server is training the to-be-trained voice model, and from the target voice model completion status information it learns that the server has finished training, that is, the server has obtained the target voice model and saved it in the server's data. The server then sends the target voice model to the terminal device, and the terminal device stores the trained target voice model. When the user requests the TTV service through an operation for voice playback, the terminal device can use the stored target voice model together with the timbreless voice data to synthesize the target voice data.
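The four kinds of status information and the terminal-side reaction to each can be sketched as follows. This is a minimal illustration only: the enum member names and the action strings are hypothetical and do not appear in the application.

```python
from enum import Enum, auto


class TrainingStatus(Enum):
    """Hypothetical names for the four kinds of status information."""
    COLLECTING = auto()           # voice data to be detected is being acquired
    COLLECTION_COMPLETE = auto()  # N meets the voice training quantity threshold
    TRAINING = auto()             # the server is training the to-be-trained model
    MODEL_COMPLETE = auto()       # the target voice model has completed training


def next_action(status: TrainingStatus) -> str:
    """Return the terminal-side action associated with a status (assumed labels)."""
    if status is TrainingStatus.COLLECTING:
        return "keep_collecting"
    if status is TrainingStatus.COLLECTION_COMPLETE:
        return "send_training_instruction"
    if status is TrainingStatus.TRAINING:
        return "wait"
    return "fetch_and_store_model"
```

Under these assumptions, only the collection-complete status triggers the voice model training instruction, and only the model-complete status triggers retrieval and storage of the target voice model.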

For ease of understanding, refer to FIG. 11, which is a schematic diagram of another embodiment of the voice playback method in the embodiments of the present application. As shown in the figure, in step E1 the terminal device continuously asks the server to check the model training status information. If voice data to be detected is being acquired, the terminal device can obtain the voice data collection status information in step E2. If the server has generated a to-be-trained voice data set including N pieces of to-be-trained voice data and N meets the voice training quantity threshold, the terminal device can obtain the voice data collection completion status information in step E2. The terminal device then sends a voice model training instruction to the server, and the server starts model training. In step E3, the TTS customization service requests model status information from the database. If model training has completed and the target voice model has been obtained, step E4 is performed; if model training has not completed, step E2 is performed, in which the target voice model training status information is sent to the terminal device. In step E4, the offline training service sends the target voice model to the database, at which point the TTS customization service can learn that the model has completed training. In step E5, the TTS customization service therefore sends the target voice model completion status information to the message push service, and in step E2 the message push service in the server sends the target voice model completion status information to the terminal device. After the TVS-SDK of the terminal device receives the target voice model completion status information, it can also trigger a call to the TTS-SDK interface to check the model status. Once the target voice model completion status information confirms that the target voice model has been generated, the model management module obtains the target voice model and saves it to the local model library. When the user chooses to use the target voice model, the TTS-SDK loads the target voice model into a data structure so that it can be used with the timbreless voice data to generate the target voice data.
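The server-side part of this flow (steps E3 to E5) can be sketched with in-memory stand-ins. The class names, method names, and status strings below are assumptions made for illustration; the application does not define these interfaces.

```python
class ModelDatabase:
    """In-memory stand-in for the database that holds trained models."""

    def __init__(self):
        self._models = {}

    def save_model(self, user_id, model):
        # Step E4: the offline training service stores the target voice model.
        self._models[user_id] = model

    def model_status(self, user_id):
        # Step E3: the TTS customization service asks for the model status.
        return "complete" if user_id in self._models else "training"


class MessagePushService:
    """Records status messages instead of pushing them to a real terminal."""

    def __init__(self):
        self.sent = []

    def push(self, user_id, status):
        self.sent.append((user_id, status))


def check_and_notify(db, push_service, user_id):
    """Query the model status and forward the matching status information."""
    status = db.model_status(user_id)
    if status == "complete":
        push_service.push(user_id, "target_model_complete")  # step E5
    else:
        push_service.push(user_id, "target_model_training")
    return status
```

Calling `check_and_notify` before and after `save_model` yields first the training status and then the completion status, which mirrors the two branches described for step E3.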

The embodiments of the present application provide another voice playback method. In this way, the terminal device can determine the progress of model training from the different status information, so it can provide the data required for the target voice model in time and, after training completes, obtain and store the target voice model in time, which improves the efficiency of subsequently generating target voice data.

The model training apparatus in the present application is described in detail below. Refer to FIG. 12, which is a schematic diagram of an embodiment of the model training apparatus in the embodiments of the present application. As shown in the figure, the model training apparatus 30 includes:

an acquisition module 301, configured to acquire a to-be-trained voice data set, where the to-be-trained voice data set includes N pieces of to-be-trained voice data, and N is a positive integer greater than or equal to 1;

the acquisition module 301 is further configured to acquire a voice model training instruction when N meets the voice training quantity threshold;

the acquisition module 301 is further configured to, in response to the voice model training instruction, obtain a predicted voice data set through the to-be-trained voice model based on the to-be-trained voice data set, where the predicted voice data set includes N pieces of predicted voice data, and the predicted voice data corresponds to the to-be-trained voice data;

a training module 302, configured to train the to-be-trained voice model based on the predicted voice data set and the to-be-trained voice data set to obtain the target voice model;

a sending module 303, configured to send the target voice model, so that the terminal device stores the target voice model.
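The interplay of the threshold check, the prediction step, and the training step can be sketched with a deliberately tiny model. Everything concrete here is an assumption for illustration: the threshold value, the single-weight model, the mean squared error loss, and the gradient descent update are not specified by the application, and sending the model is omitted.

```python
def train_target_model(samples, threshold=3, lr=0.1, steps=50):
    """Toy training loop: one weight w, prediction = w * x, target = x.

    Returns the trained weight, or None when N does not meet the
    (assumed) voice training quantity threshold.
    """
    n = len(samples)
    if n < threshold:
        # N does not meet the threshold, so no training instruction is acquired.
        return None
    w = 0.0  # parameter of the to-be-trained model
    for _ in range(steps):
        # Predicted set: one prediction per to-be-trained sample.
        predicted = [w * x for x in samples]
        # Gradient of the mean squared error between predictions and samples.
        grad = sum(2 * (p - x) * x for p, x in zip(predicted, samples)) / n
        w -= lr * grad
    return w
```

With the samples `[1.0, 2.0, 3.0]` the weight converges toward 1, meaning the toy model reproduces its training data; with fewer samples than the threshold the function returns `None` and no training takes place.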

Optionally, on the basis of the embodiment corresponding to FIG. 12 above, in another embodiment of the model training apparatus 30 provided by the embodiments of the present application,

the acquisition module 301 is specifically configured to acquire a voice service request;

the training module 302 is specifically configured to update the model parameters of the to-be-trained voice model according to a target loss function, based on the predicted voice data set and the to-be-trained voice data set;

the acquisition module 301 is further configured to acquire a voice playback request, where the voice playback request carries to-be-parsed voice data;

the acquisition module 301 is further configured to obtain service content text data according to the to-be-parsed voice data, where the service content text data corresponds to the to-be-parsed voice data;

the acquisition module 301 is further configured to obtain timbreless voice data according to the service content text data, where the timbreless voice data is obtained by performing speech synthesis on the service content text data;

the sending module 303 is further configured to send the timbreless voice data to the terminal device, so that the terminal device generates target voice data according to the timbreless voice data and the target voice model, and plays the target voice data.
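The server-side handling of a voice playback request can be sketched as a three-stage pipeline. The three callables are hypothetical stand-ins supplied by the caller; splitting the second step into recognition followed by a content lookup is an assumption, since the application only states that service content text data is obtained according to the to-be-parsed voice data.

```python
def handle_play_request(voice_data, recognize, lookup_content, synthesize):
    """Turn to-be-parsed voice data into timbreless voice data.

    recognize:      voice data -> query text (assumed speech recognition step)
    lookup_content: query text -> service content text data
    synthesize:     service content text -> timbreless voice data
    """
    query_text = recognize(voice_data)
    content_text = lookup_content(query_text)
    return synthesize(content_text)
```

A caller would pass in its own recognition, content, and synthesis functions; the returned timbreless voice data is what the sending module 303 would forward to the terminal device.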

The voice playback apparatus in the present application is described in detail below. Refer to FIG. 13, which is a schematic diagram of an embodiment of the voice playback apparatus in the embodiments of the present application. As shown in the figure, the voice playback apparatus 40 includes:

a sending module 401, configured to detect an operation for voice playback and send a voice playback request, so that the server obtains service content text data based on to-be-parsed voice data and obtains timbreless voice data according to the service content text data, where the voice playback request carries the to-be-parsed voice data, the service content text data is obtained according to the to-be-parsed voice data, and the timbreless voice data is obtained by performing speech synthesis on the service content text data;

an acquisition module 402, configured to acquire the timbreless voice data;

a generation module 403, configured to generate target voice data according to the timbreless voice data and the target voice model, where the target voice model is obtained by training the to-be-trained voice model based on the predicted voice data set and the to-be-trained voice data set, and the predicted voice data set is obtained by the server through the to-be-trained voice model, based on the to-be-trained voice data set, in response to the voice model training instruction;

a playback module 404, configured to play the target voice data.
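The terminal-side steps (load a stored target voice model, apply it to the timbreless voice data, play the result) can be sketched as follows. Modeling the target voice model as a plain callable and the local model library as a dictionary are simplifying assumptions; the class and function names are hypothetical.

```python
class LocalModelLibrary:
    """Dictionary-backed stand-in for the terminal's local model library."""

    def __init__(self):
        self._models = {}

    def store(self, name, model):
        self._models[name] = model

    def load(self, name):
        return self._models[name]


def play_with_target_model(timbreless_audio, library, model_name, play):
    """Generate target voice data with the chosen model, then play it."""
    model = library.load(model_name)
    target_audio = model(timbreless_audio)  # generation module 403
    play(target_audio)                      # playback module 404
    return target_audio
```

Keeping the model lookup separate from generation reflects the description that the user may choose among stored models at playback time.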

Optionally, on the basis of the embodiment corresponding to FIG. 13 above, in another embodiment of the voice playback apparatus 40 provided by the embodiments of the present application,

the sending module 401 is further configured to detect an operation for the voice service and send a voice service request;

the acquisition module 402 is further configured to acquire a voice data collection instruction when the server responds to the voice service request;

the acquisition module 402 is further configured to acquire, in response to the voice data collection instruction and when the voice data to be detected meets the noise qualification condition, the voice data to be detected, where the voice data to be detected is obtained based on a voice training text, and the voice training text is obtained according to the voice data collection instruction;

the sending module 401 is further configured to send the voice data to be detected, so that the server generates a to-be-trained voice data set based on the voice data to be detected, where the to-be-trained voice data set includes N pieces of to-be-trained voice data, and N is a positive integer greater than or equal to 1.

Optionally, on the basis of the embodiment corresponding to FIG. 13 above, in another embodiment of the voice playback apparatus 40 provided by the embodiments of the present application, the voice playback apparatus 40 further includes a storage module 405,

the sending module 401 is further configured to send a voice model training instruction when N meets the voice training quantity threshold, so that the server obtains the target voice model;

the acquisition module 402 is further configured to acquire the target voice model;

the storage module 405 is configured to store the target voice model.

The acquisition module 402 is further configured to acquire voice data collection status information, where the voice data collection status information indicates that voice data to be detected is being acquired;

the acquisition module 402 is further configured to acquire voice data collection completion status information, where the voice data collection completion status information indicates that acquisition of the voice data is complete and that training of the target voice model can begin;

the acquisition module 402 is further configured to acquire target voice model training status information, where the target voice model training status information indicates that the to-be-trained voice model is being trained;

the acquisition module 402 is further configured to acquire target voice model completion status information, where the target voice model completion status information indicates that the target voice model has completed training.

An embodiment of the present application further provides a server. Refer to FIG. 14, which is a schematic structural diagram of a server in an embodiment of the present application. As shown in the figure, the server 500 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 522 (for example, one or more processors), a memory 532, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 542 or data 544. The memory 532 and the storage medium 530 may be transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the server. Further, the central processing unit 522 may be configured to communicate with the storage medium 530 and to execute, on the server 500, the series of instruction operations in the storage medium 530.

The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.

The steps performed by the server in the above embodiments may be based on the server structure shown in FIG. 14.

An embodiment of the present application further provides a terminal device on which a client is deployed, as shown in FIG. 15. For ease of description, only the parts related to the embodiments of the present application are shown; for specific technical details that are not disclosed, refer to the method part of the embodiments of the present application. The terminal device may include, but is not limited to, any terminal device such as a smartphone, tablet computer, notebook computer, palmtop computer, personal computer, smart watch, or television. The following takes a mobile phone as an example of the terminal device:

FIG. 15 is a block diagram of part of the structure of a mobile phone related to the terminal device provided by the embodiments of the present application. Referring to FIG. 15, the mobile phone includes components such as a radio frequency (RF) circuit 610, a memory 620, an input unit 630, a display unit 640, a sensor 630, a voice circuit 660, a wireless fidelity (WiFi) module 670, a processor 680, and a power supply 690. Those skilled in the art will understand that the mobile phone structure shown in FIG. 15 does not limit the mobile phone, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

The components of the mobile phone are described in detail below with reference to FIG. 15:

The RF circuit 610 can be used to receive and send signals while information is being sent or received or during a call. In particular, after receiving downlink information from a base station, it passes the information to the processor 680 for processing; it also sends uplink data to the base station. Typically, the RF circuit 610 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 610 can communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.

The memory 620 can be used to store software programs and modules, and the processor 680 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 620. The memory 620 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the mobile phone (such as voice data or a phone book). In addition, the memory 620 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

The input unit 630 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 630 may include a touch panel 631 and other input devices 632. The touch panel 631, also called a touch screen, can collect touch operations by the user on or near it (such as operations performed on or near the touch panel 631 with a finger, stylus, or any other suitable object or accessory) and drive the corresponding connection apparatus according to a preset program. Optionally, the touch panel 631 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the position of the user's touch, detects the signal produced by the touch operation, and passes the signal to the touch controller. The touch controller receives the touch information from the touch detection apparatus, converts it into contact coordinates, and sends them to the processor 680; it can also receive and execute commands sent by the processor 680. The touch panel 631 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 631, the input unit 630 may also include other input devices 632. Specifically, the other input devices 632 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, a joystick, and the like.

The display unit 640 can be used to display information input by the user, information provided to the user, and the various menus of the mobile phone. The display unit 640 may include a display panel 641, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 631 may cover the display panel 641. When the touch panel 631 detects a touch operation on or near it, it passes the operation to the processor 680 to determine the type of touch event, and the processor 680 then provides the corresponding visual output on the display panel 641 according to the type of touch event. Although in FIG. 15 the touch panel 631 and the display panel 641 are two independent components that implement the input and output functions of the mobile phone, in some embodiments the touch panel 631 and the display panel 641 may be integrated to implement the input and output functions of the mobile phone.

The mobile phone may also include at least one sensor 630, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel 641 according to the ambient light level, and the proximity sensor can turn off the display panel 641 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (generally along three axes) and can detect the magnitude and direction of gravity when stationary. It can be used in applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and in vibration-recognition functions (such as a pedometer or tap detection). Other sensors that the mobile phone may also be equipped with, such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, are not described here.

The voice circuit 660, the speaker 661, and the microphone 662 can provide a voice interface between the user and the mobile phone. The voice circuit 660 can transmit the electrical signal converted from received voice data to the speaker 661, which converts it into a sound signal for output. Conversely, the microphone 662 converts collected sound signals into electrical signals, which the voice circuit 660 receives and converts into voice data; the voice data is then output to the processor 680 for processing and sent via the RF circuit 610 to, for example, another mobile phone, or output to the memory 620 for further processing.

WiFi is a short-range wireless transmission technology. Through the WiFi module 670, the mobile phone can help the user send and receive email, browse web pages, and access streaming media, providing the user with wireless broadband Internet access. Although FIG. 15 shows the WiFi module 670, it is understood that the module is not an essential component of the mobile phone.

The processor 680 is the control center of the mobile phone. It connects the various parts of the entire mobile phone through various interfaces and lines, and it performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 620 and calling the data stored in the memory 620, thereby monitoring the mobile phone as a whole. Optionally, the processor 680 may include one or more processing units. Preferably, the processor 680 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It is understood that the modem processor may also not be integrated into the processor 680.

The mobile phone also includes a power supply 690 (such as a battery) that supplies power to the components. Preferably, the power supply may be logically connected to the processor 680 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system.

Although not shown, the mobile phone may also include a camera, a Bluetooth module, and the like, which are not described here.

In the embodiments of the present application, the processor 680 included in the terminal device can perform the functions in the foregoing embodiments, which are not described again here.

The embodiments of the present application further provide a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the steps performed by the terminal device in the methods described in the foregoing embodiments, or causes the computer to perform the steps performed by the server in the methods described in the foregoing embodiments.

The embodiments of the present application further provide a computer program product including a program that, when run on a computer, causes the computer to perform the steps performed by the terminal device in the methods described in the foregoing embodiments, or causes the computer to perform the steps performed by the server in the methods described in the foregoing embodiments.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A method of model training, comprising:

acquiring a voice data set to be trained, wherein the voice data set to be trained comprises N pieces of voice data to be trained, and N is a positive integer greater than or equal to 1;

when N meets a voice training count threshold, acquiring a voice model training instruction;

in response to the voice model training instruction, obtaining, based on the voice data set to be trained, a predicted voice data set through a voice model to be trained, wherein the predicted voice data set comprises N pieces of predicted voice data, and the predicted voice data corresponds to the voice data to be trained;

training the voice model to be trained based on the predicted voice data set and the voice data set to be trained to obtain a target voice model;

and sending the target voice model so that a terminal device stores the target voice model.
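
The gating step of claim 1 (a training instruction is acquired only once N meets the threshold) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the threshold value, the names `VoiceDataSet` and `maybe_issue_training_instruction`, the instruction format, and the reading of "meets" as "greater than or equal to" are all assumptions that the claims do not specify.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical value: the claims only require that N "meets" a threshold.
VOICE_TRAINING_COUNT_THRESHOLD = 3


@dataclass
class VoiceDataSet:
    """Voice data set to be trained; each item stands in for one recording."""
    samples: List[bytes] = field(default_factory=list)

    @property
    def n(self) -> int:
        # N in the claims: the number of pieces of voice data to be trained.
        return len(self.samples)


def maybe_issue_training_instruction(
    data_set: VoiceDataSet,
    threshold: int = VOICE_TRAINING_COUNT_THRESHOLD,
) -> Optional[dict]:
    """Return a training instruction once N reaches the threshold, else None.

    "Meets" is interpreted here as N >= threshold (an assumption).
    """
    if data_set.n >= threshold:
        return {"type": "train_voice_model", "sample_count": data_set.n}
    return None
```

With the hypothetical threshold of 3, an empty data set yields no instruction, and the instruction is produced as soon as a third recording is added.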

2. The method of claim 1, wherein the acquiring the voice data set to be trained comprises:

acquiring a voice service request;

in response to the voice service request, sending a voice data collection instruction to the terminal device, so that the terminal device responds to the voice data collection instruction to obtain voice data to be detected, wherein the voice data to be detected is obtained based on a voice training text, and the voice training text is obtained according to the voice data collection instruction;

when the voice data to be detected meets a voice training qualification condition, determining the voice data to be trained;

and generating the voice data set to be trained based on the voice data to be trained.

3. The method of claim 1, wherein the training the voice model to be trained based on the predicted voice data set and the voice data set to be trained to obtain the target voice model comprises:

updating model parameters of the voice model to be trained according to a target loss function based on the predicted voice data set and the voice data set to be trained;

and if the target loss function reaches convergence, generating the target voice model according to the model parameters.
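
The update-until-convergence loop of claim 3 can be sketched with a deliberately small stand-in model. Everything specific here is an assumption made for illustration: the one-dimensional linear model replaces the voice model, mean squared error replaces the unspecified target loss function, plain gradient descent replaces the unspecified update rule, and "convergence" is taken to mean that the loss changes by less than `tolerance` between steps.

```python
from typing import List, Tuple


def mse_loss(predicted: List[float], target: List[float]) -> float:
    """Mean squared error between predicted data and data to be trained."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def train_until_convergence(
    inputs: List[float],
    targets: List[float],
    learning_rate: float = 0.05,
    tolerance: float = 1e-9,
    max_steps: int = 10_000,
) -> Tuple[float, float, bool]:
    """Fit y = w * x + b by gradient descent; return (w, b, converged)."""
    w, b = 0.0, 0.0
    previous_loss = float("inf")
    n = len(inputs)
    for _ in range(max_steps):
        # Forward pass: the "predicted data set" for the current parameters.
        predicted = [w * x + b for x in inputs]
        loss = mse_loss(predicted, targets)
        # Assumed convergence test: the loss has stopped changing.
        if abs(previous_loss - loss) < tolerance:
            return w, b, True
        previous_loss = loss
        # Gradients of the MSE loss with respect to w and b.
        grad_w = sum(2 * (p - t) * x for p, t, x in zip(predicted, targets, inputs)) / n
        grad_b = sum(2 * (p - t) for p, t in zip(predicted, targets)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, False
```

The returned parameters play the role of the model parameters from which the target model is generated once the loss has converged; the `converged` flag distinguishes that case from hitting the step limit.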

4. The method according to any one of claims 1 to 3, further comprising:

acquiring a voice playing request, wherein the voice playing request carries voice data to be analyzed;

acquiring service content text data according to the voice data to be analyzed, wherein the service content text data corresponds to the voice data to be analyzed;

obtaining unvoiced voice data according to the service content text data, wherein the unvoiced voice data is obtained by performing voice synthesis on the service content text data;

and sending the unvoiced voice data to a terminal device, so that the terminal device generates target voice data according to the unvoiced voice data and the target voice model, and plays the target voice data.

5. A method for voice playback, comprising:

detecting an operation for voice playback, and sending a voice playing request, so that a server obtains service content text data based on voice data to be analyzed and obtains unvoiced voice data according to the service content text data, wherein the voice playing request carries the voice data to be analyzed, the service content text data is obtained according to the voice data to be analyzed, and the unvoiced voice data is obtained by performing voice synthesis on the service content text data;

acquiring the unvoiced voice data;

generating target voice data according to the unvoiced voice data and a target voice model, wherein the target voice model is obtained by training a voice model to be trained based on a predicted voice data set and a voice data set to be trained, and the predicted voice data set is obtained by the server, in response to a voice model training instruction and based on the voice data set to be trained, through the voice model to be trained;

and playing the target voice data.

6. The method of claim 5, further comprising:

detecting an operation for a voice service, and sending a voice service request;

when the server responds to the voice service request, acquiring a voice data collection instruction;

responding to the voice data collection instruction, and acquiring the voice data to be detected when the voice data to be detected meets a noise qualified condition, wherein the voice data to be detected is acquired based on a voice training text which is acquired according to the voice data collection instruction;

and sending the voice data to be detected so that the server generates a voice data set to be trained based on the voice data to be detected, wherein the voice data set to be trained comprises N pieces of voice data to be trained, and N is a positive integer greater than or equal to 1.

7. The method of claim 6, further comprising:

when N meets a voice training count threshold, sending a voice model training instruction so that the server obtains a target voice model;

acquiring the target voice model;

and storing the target voice model.

8. The method according to any one of claims 5 to 7, further comprising:

acquiring voice data collection state information, wherein the voice data collection state information is used for indicating that voice data to be detected is being acquired;

acquiring voice data collection completion state information, wherein the voice data collection completion state information is used for indicating that the voice data has been fully acquired and training of the target voice model can be started;

acquiring target voice model training state information, wherein the target voice model training state information is used for indicating that the voice model to be trained is being trained;

and acquiring target voice model completion state information, wherein the target voice model completion state information is used for indicating that training of the target voice model has been completed.
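
The four pieces of state information in claim 8 can be modeled as a simple enumeration. This is only a sketch: the names `VoiceModelState` and `next_state` are hypothetical, and the strict linear ordering of the states is an assumption, since the claim lists the states without requiring a particular transition order.

```python
from enum import Enum


class VoiceModelState(Enum):
    """State information a terminal device might acquire and display."""
    COLLECTING = "voice data to be detected is being acquired"
    COLLECTION_COMPLETE = "voice data acquired; training can be started"
    TRAINING = "voice model to be trained is being trained"
    TRAINING_COMPLETE = "target voice model has been trained"


# Enum members iterate in definition order, which is used as the assumed order.
_ORDER = list(VoiceModelState)


def next_state(state: VoiceModelState) -> VoiceModelState:
    """Advance to the following state; the final state is terminal."""
    index = _ORDER.index(state)
    return _ORDER[min(index + 1, len(_ORDER) - 1)]
```

A terminal device could use such a value to decide which progress message to show while voice data is being collected or the model is being trained.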

9. A server, comprising: a memory, a transceiver, a processor, and a bus system;

wherein the memory is used for storing programs;

the processor is used for executing the program in the memory, including performing the method of any one of claims 1 to 4 according to instructions in the program code;

the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.

10. A terminal device, comprising: a memory, a transceiver, a processor, and a bus system;

wherein the memory is used for storing programs;

the processor is used for executing the program in the memory, including performing the method of any one of claims 5 to 8 according to instructions in the program code;