CN111210809A

CN111210809A - Voice training data adaptation method and device, voice data conversion method and electronic equipment

Info

Publication number: CN111210809A
Application number: CN201811400134.7A
Authority: CN
Inventors: 张平
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-11-22
Filing date: 2018-11-22
Publication date: 2020-05-29
Anticipated expiration: 2038-11-22
Also published as: CN111210809B

Abstract

Embodiments of the invention provide a voice training data adaptation method and device, a voice data conversion method, and electronic equipment. The voice training data adaptation method comprises: acquiring original voice data for data conversion, wherein the original voice data has audio data information in all directions; and converting the original voice data through a channel conversion algorithm to obtain training data suitable for different channels. Because existing original voice data is converted through the channel conversion algorithm to obtain training data adapted to different channels, there is no need to collect a large amount of voice data for training each time a new voice recognition product appears; training data adapted to the new product can be obtained merely by updating and maintaining the channel conversion algorithm, which improves the modeling efficiency of the new voice matching model and saves labor cost.

Description

Voice training data adaptation method and device, voice data conversion method and electronic equipment

Technical Field

The invention relates to the technical field of smart home, in particular to a voice training data adaptation method and device, a voice data conversion method and electronic equipment.

Background

A smart speaker is an upgraded speaker product: a tool that lets household consumers obtain songs, weather forecasts, news and the like from the cloud through voice input, and that can also control other smart home devices, for example opening curtains, setting the refrigerator temperature, or preheating a water heater by voice.

Smart speaker products differ from one another in microphone configuration and speech signal processing technology. A service provider (which provides services such as songs, weather, and news) therefore needs to set up voice databases matched to the different smart speaker models and use the voice data in those databases as training data to train matching models suited to each model. After a user inputs voice through a smart speaker of a given model, matching operations on aspects such as voiceprint and speech are performed through the corresponding matching model, thereby achieving voiceprint recognition or voice recognition.

In the process of implementing the invention, the inventor found that the prior art has at least the following problem: as technology is upgraded and developed, new voice recognition products are continuously released to the market. After a new product is released, the existing voice data in the existing voice database does not match it, so the service provider must collect a large amount of voice data for the new product to obtain voice training data suited to that model for modeling, and this collection is very inefficient.

Disclosure of Invention

The embodiment of the invention provides a voice training data adaptation method and device, a voice data conversion method and electronic equipment, and aims to overcome the defect that the training data acquisition efficiency is low in the prior art.

To achieve the above object, an embodiment of the present invention provides a method for adapting speech training data, including:

acquiring original voice data for data conversion, wherein the original voice data has audio data information in all directions;

and converting the original voice data through a channel conversion algorithm to obtain training data suitable for different channels.

The embodiment of the invention also provides a voice data conversion method, which comprises the following steps:

converting original voice data through a channel conversion algorithm matched with a playing device to obtain training data suitable for the playing device, wherein the original voice data have audio data information in all directions;

performing model training according to the training data to obtain a data conversion model;

and converting the data to be output of the playing equipment according to the data conversion model so as to obtain the playing data suitable for the playing equipment.

The embodiment of the present invention further provides a device for adapting voice training data, including:

an original voice data acquisition module, used for acquiring original voice data for data conversion, wherein the original voice data has audio data information in all directions;

and the data conversion module is used for converting the original voice data through a channel conversion algorithm so as to obtain training data suitable for different channels.

An embodiment of the present invention further provides an electronic device, including:

a memory for storing a program;

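a processor for executing the program stored in the memory for:

acquiring original voice data for data conversion, the original voice data having audio data information in all directions;

and converting the acquired original voice data through a channel conversion algorithm to obtain training data suitable for different channels.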

According to the voice training data adaptation method and device, the voice data conversion method, and the electronic equipment provided by the embodiments of the invention, existing original voice data is converted through a channel conversion algorithm to obtain training data adapted to different channels. This avoids collecting a large amount of voice data for training each time a new voice recognition product appears; training data adapted to the product can be obtained merely by updating and maintaining the channel conversion algorithm, which improves the modeling efficiency of the new voice matching model and saves labor cost.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a system block diagram of a service system according to an embodiment of the present invention;

FIG. 2 is a flowchart of an embodiment of a method for adapting speech training data provided by the present invention;

FIG. 3 is a flowchart of another embodiment of a method for adapting speech training data according to the present invention;

FIG. 4 is a schematic structural diagram of an embodiment of an apparatus for adapting speech training data according to the present invention;

FIG. 5 is a schematic structural diagram of another embodiment of an apparatus for adapting speech training data according to the present invention;

FIG. 6 is a flow chart of an embodiment of a method for converting voice data provided by the present invention;

FIG. 7 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

In the prior art, different voice recognition products (e.g., smart speaker products) differ in both microphone configuration and voice signal processing technology. The service provider needs to provide voice databases matched to the different smart speaker models and use the voice data in those databases as training data to train matching models suited to the various voice recognition products. After a user inputs voice through a voice recognition product of a given model, matching operations on aspects such as voiceprint and speech can be performed through the corresponding matching model, thereby achieving voiceprint recognition or voice recognition. When a new voice recognition product is released, the stock voice data in the existing voice database does not match it, so the service provider must collect a large amount of voice data for the new product to obtain training data suited to that model for modeling, and this collection is very inefficient.

Therefore, the present application provides a scheme for adapting speech training data, whose main principle is as follows: original voice data that is existing or obtained in advance (namely voice data with audio data information in all directions, such as voice data with complete channel information, rich high-frequency information, and noise removed) is converted through a channel conversion algorithm to obtain training data suitable for different channels (such as two-microphone, four-microphone, and six-microphone channels). This avoids collecting a large amount of voice data for training each time a new voice recognition product appears; only the channel conversion algorithm needs to be updated and maintained to obtain training data suited to the product, which improves the modeling efficiency of the matching model for the new product and saves labor cost.

The method provided by the embodiment of the invention can be applied to any service system with voice data processing capability. Fig. 1 is a system block diagram of a service system provided in an embodiment of the present invention; the structure shown in fig. 1 is only one example of a service system to which the technical solution of the present invention can be applied. As shown in fig. 1, the service system includes a training data adaptation device. The device includes an original voice data acquisition module and a data conversion module, and may be configured to perform the processing flows shown in fig. 2 and fig. 3 below.

In the service system, original voice data for data conversion is first acquired, the original voice data having audio data information in all directions; the acquired original voice data is then converted through a channel conversion algorithm to obtain training data suitable for different channels. Specifically, existing original voice data (i.e., high-quality voice data with complete channel information, rich high-frequency information, and noise removed) can be obtained directly; existing stock data can also be re-recorded in high fidelity to obtain original voice data; in addition, for content not covered by the existing data, the voice of a recording person can be recorded through a high-fidelity recording device as a supplement. After conversion through the channel conversion algorithm, training data suitable for different channels (e.g., two-microphone data, four-microphone data, six-microphone data, etc.) is obtained and used to train the corresponding matching models (e.g., a two-microphone model, a four-microphone model, a six-microphone model, etc.).

The above embodiments are illustrations of technical principles and exemplary application frameworks of the embodiments of the present invention, and specific technical solutions of the embodiments of the present invention are further described in detail below through a plurality of embodiments.

Example one

Fig. 2 is a flowchart of an embodiment of a method for adapting speech training data provided by the present invention, where an execution subject of the method may be the service system, various server devices with speech data processing capability, or devices or chips integrated on the server devices. As shown in fig. 2, the method for adapting speech training data includes the following steps:

S201, original voice data for data conversion is acquired.

In the embodiment of the present invention, the original voice data has audio data information in all directions. Existing original voice data can be obtained from a first database; original voice data obtained by re-recording existing stock data through high-fidelity recording equipment can be obtained from a second database; and original voice data obtained by recording the voice of a recording person through high-fidelity recording equipment can be obtained from a third database.

S202, original voice data are converted through a channel conversion algorithm to obtain training data suitable for different channels.

In the embodiment of the present invention, step S201 (the acquisition of the original voice data) is independent of the data conversion process. The original voice data serves as the input to the channel conversion algorithm, so the acquisition step is a preparatory data preparation step. Step S202 (the data conversion) may be performed whenever corresponding training data is required.

According to the voice training data adaptation method provided by the embodiment of the invention, existing original voice data is converted through the channel conversion algorithm to obtain training data adapted to different channels. This avoids collecting a large amount of voice data for training each time a new voice recognition product appears; training data adapted to the product can be obtained merely by updating and maintaining the channel conversion algorithm, which improves the modeling efficiency of the new voice matching model and saves labor cost.
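The two steps of this embodiment can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the toy waveforms, and the attenuation-only converters are all introduced here for illustration, and the source does not specify the concrete form of the channel conversion algorithm.

```python
from typing import Callable, Dict, List

Waveform = List[float]
# Assumption: a channel conversion algorithm maps one original waveform
# to the waveform as it would be captured on a target channel.
ChannelConverter = Callable[[Waveform], Waveform]


def acquire_original_voice_data() -> List[Waveform]:
    """S201 (stand-in): return original voice data; the values are made up."""
    return [[0.0, 0.5, -0.5, 0.25], [0.1, -0.1, 0.2, -0.2]]


def adapt_training_data(
    originals: List[Waveform], converters: Dict[str, ChannelConverter]
) -> Dict[str, List[Waveform]]:
    """S202: run every original utterance through every channel converter."""
    return {
        channel: [convert(wave) for wave in originals]
        for channel, convert in converters.items()
    }


# Hypothetical converters: plain attenuation stands in for a real algorithm.
converters: Dict[str, ChannelConverter] = {
    "two_mic": lambda wave: [0.8 * s for s in wave],
    "four_mic": lambda wave: [0.6 * s for s in wave],
}

originals = acquire_original_voice_data()                   # prepared ahead of time
training_data = adapt_training_data(originals, converters)  # run on demand
```

Keeping acquisition and conversion as separate functions mirrors the independence of S201 and S202: the originals are prepared once, and the conversion can be rerun whenever a new set of converters is available.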

Example two

Fig. 3 is a flowchart of another embodiment of a method for adapting speech training data according to the present invention. As shown in fig. 3, on the basis of the embodiment shown in fig. 2, the method for adapting speech training data provided in this embodiment may further include the following steps:

S301, existing original voice data is obtained from a first database.

S302, original voice data obtained by recording existing stock data through high-fidelity recording equipment is obtained from a second database.

S303, original voice data obtained by recording the voice of a recording person through high-fidelity recording equipment is obtained from a third database.

In the embodiment of the present invention, steps S301 to S303 have no required order: they may be performed simultaneously or sequentially in any order, and any one or two of the three steps may also be performed on their own.
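As a rough sketch of this order-independent acquisition, the following Python merges whichever sources are selected. The loader functions, the record fields, and the utterance identifiers are hypothetical placeholders; the source does not describe how the three databases are accessed.

```python
from typing import Callable, Dict, Iterable, List

Recording = Dict[str, str]
Loader = Callable[[], List[Recording]]


def load_first_database() -> List[Recording]:
    """S301 (stand-in): existing original voice data."""
    return [{"id": "utt-001", "origin": "existing"}]


def load_second_database() -> List[Recording]:
    """S302 (stand-in): stock data re-recorded with high-fidelity equipment."""
    return [{"id": "utt-002", "origin": "re-recorded stock"}]


def load_third_database() -> List[Recording]:
    """S303 (stand-in): a recording person captured in high fidelity."""
    return [{"id": "utt-003", "origin": "new recording"}]


def gather_original_voice_data(loaders: Iterable[Loader]) -> List[Recording]:
    """Merge the selected sources; the result does not depend on call order."""
    merged: Dict[str, Recording] = {}
    for load in loaders:
        for rec in load():
            merged.setdefault(rec["id"], rec)
    return [merged[key] for key in sorted(merged)]


all_sources = gather_original_voice_data(
    [load_first_database, load_second_database, load_third_database]
)
# Any one or two of the three steps may be used, in any order.
subset = gather_original_voice_data([load_third_database, load_first_database])
```

Sorting by identifier after merging is one simple way to make the output independent of the order in which the steps ran.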

In addition, the method for adapting speech training data provided in the embodiment of the present invention may further include a step of obtaining the channel conversion algorithm, as shown in steps S304 to S305 described below.

S304, recording data for a fixed text under different channels is acquired.

In the embodiment of the present invention, a fixed text passage may be set first. When obtaining the channel conversion algorithm, the fixed text is recorded under different channels, for example in two-microphone, four-microphone, and six-microphone channel environments and in the original (high-fidelity) channel environment, to obtain different recording data.

Further, for the same channel environment, data acquisition at different distances can be performed to obtain recording data for the fixed text at different distances.
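One way to picture the resulting set of recording sessions is as a grid of channels and distances that all share the same fixed text. The sketch below is only an illustration: the `RecordingSpec` type, the channel labels, and the distance values are assumptions, since the source names no specific distances.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class RecordingSpec:
    channel: str       # e.g. "hifi" or "two_mic" (labels are illustrative)
    distance_m: float  # speaker-to-microphone distance in metres
    text: str          # the fixed text read in every session


def build_recording_plan(
    channels: Sequence[str], distances_m: Sequence[float], text: str
) -> List[RecordingSpec]:
    """Enumerate one recording session per (channel, distance) pair."""
    return [RecordingSpec(c, d, text) for c, d in product(channels, distances_m)]


# Hypothetical channels and distances.
plan = build_recording_plan(
    channels=["hifi", "two_mic", "four_mic", "six_mic"],
    distances_m=[1.0, 3.0, 5.0],
    text="<fixed text passage>",
)
```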

S305, a channel conversion algorithm is acquired according to the difference parameter distribution functions of the different recording data.

In the embodiment of the invention, for the recording data under different channels, a channel conversion algorithm can be obtained according to the Gaussian distribution function of the recording data; for the recording data at different distances, a channel conversion algorithm can be obtained according to the energy distribution function of the recording data. Together these yield the channel conversion algorithm used for data conversion.
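The source does not say which features or fitting procedure are used, so the following Python is only one possible reading, under stated assumptions: the "difference parameter" is taken to be the per-frame log-energy difference between a target-channel recording and the high-fidelity recording of the same fixed text, a Gaussian (mean and standard deviation) is fitted to those differences, and the distance effect is reduced to a single gain derived from an energy ratio. All function names, the frame size, and the sample values are hypothetical.

```python
import math
import random
from statistics import mean, pstdev
from typing import List, Optional, Tuple

FRAME = 4  # samples per frame; an arbitrary toy value


def frame_log_energy(samples: List[float]) -> List[float]:
    """Log of the mean squared amplitude of each full frame."""
    feats = []
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[start:start + FRAME]
        feats.append(math.log(sum(s * s for s in frame) / FRAME + 1e-12))
    return feats


def fit_channel_gaussian(hifi: List[float], target: List[float]) -> Tuple[float, float]:
    """Fit mean and std of the per-frame difference (target minus hi-fi)."""
    diffs = [t - h for h, t in zip(frame_log_energy(hifi), frame_log_energy(target))]
    return mean(diffs), pstdev(diffs)


def distance_gain(near: List[float], far: List[float]) -> float:
    """Amplitude gain implied by the energy ratio of far to near recordings."""
    e_near = sum(s * s for s in near)
    e_far = sum(s * s for s in far)
    return math.sqrt(e_far / e_near)


def convert(hifi: List[float], mu: float, sigma: float, gain: float = 1.0,
            rng: Optional[random.Random] = None) -> List[float]:
    """Apply a sampled per-frame log-energy offset, then the distance gain."""
    rng = rng or random.Random(0)
    out: List[float] = []
    for start in range(0, len(hifi), FRAME):
        offset = rng.gauss(mu, sigma)          # difference in log energy
        scale = gain * math.exp(offset / 2.0)  # energy offset -> amplitude scale
        out.extend(scale * s for s in hifi[start:start + FRAME])
    return out


# Toy recordings of the same fixed text (values are made up).
hifi = [1.0, -1.0] * 8
two_mic = [0.5 * s for s in hifi]         # same text on a two-microphone channel
two_mic_far = [0.5 * s for s in two_mic]  # same channel, greater distance

mu, sigma = fit_channel_gaussian(hifi, two_mic)
gain = distance_gain(two_mic, two_mic_far)
simulated = convert(hifi, mu, sigma, gain)
```

In this toy case the target channel only halves the amplitude, so the fitted standard deviation is zero and `convert` reduces to a fixed scaling; with real recordings the sampled offsets would vary from frame to frame.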

S306, the original voice data is converted through a channel conversion algorithm to obtain training data suitable for different channels.

In the embodiment of the present invention, steps S301 to S303 (the acquisition of the original voice data) are independent of steps S304 to S305 (the acquisition of the channel conversion algorithm). The original voice data serves as the input to the channel conversion algorithm, and its acquisition can be regarded as a preparatory data preparation process; the acquisition of the channel conversion algorithm needs to be performed each time a new smart speaker is released, so as to update and maintain the existing channel conversion algorithm.
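This separation can be illustrated with a small registry that holds one converter per product model while the original voice data stays untouched. The `ConverterRegistry` class, the model names, and the scaling converters are hypothetical; the sketch only shows the maintenance pattern, not how a real converter is derived.

```python
from typing import Callable, Dict, List

Waveform = List[float]
ChannelConverter = Callable[[Waveform], Waveform]


class ConverterRegistry:
    """Keeps one channel conversion algorithm per product model."""

    def __init__(self) -> None:
        self._converters: Dict[str, ChannelConverter] = {}

    def register(self, model: str, converter: ChannelConverter) -> None:
        """Add or replace the converter for a model (result of S304 to S305)."""
        self._converters[model] = converter

    def convert_all(self, model: str, originals: List[Waveform]) -> List[Waveform]:
        """Reuse the same original voice data for any registered model (S306)."""
        convert = self._converters[model]
        return [convert(wave) for wave in originals]


# Original voice data is prepared once (S301 to S303); values are made up.
originals = [[0.2, -0.4], [0.6, 0.0]]

registry = ConverterRegistry()
registry.register("speaker_v1", lambda w: [0.5 * s for s in w])
v1_data = registry.convert_all("speaker_v1", originals)

# A new product appears: only the registry changes, not the originals.
registry.register("speaker_v2", lambda w: [0.25 * s for s in w])
v2_data = registry.convert_all("speaker_v2", originals)
```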

Example three

Fig. 4 is a schematic structural diagram of an embodiment of a speech training data adaptation apparatus provided by the present invention, which can be used to execute the method steps shown in fig. 2. As shown in fig. 4, the speech training data adapting apparatus may include: a raw voice data acquisition module 41 and a data conversion module 42.

The original voice data obtaining module 41 may be configured to obtain original voice data for data conversion; the data conversion module 42 may be configured to perform conversion processing on the raw voice data acquired by the raw voice data acquisition module 41 through a channel conversion algorithm to obtain training data suitable for different channels.

In the embodiment of the present invention, the original voice data has audio data information in all directions. After the original voice data obtaining module 41 obtains the original voice data, the data conversion module 42 may convert it through a channel conversion algorithm to obtain training data suitable for different channels. The acquisition of the original voice data by the original voice data obtaining module 41 is independent of the data conversion performed by the data conversion module 42. The original voice data serves as the input to the channel conversion algorithm, so the acquisition step is a preparatory data preparation step. The data conversion can be performed whenever the corresponding training data is needed.

The voice training data adaptation device provided by the embodiment of the invention converts existing original voice data through the channel conversion algorithm to obtain training data adapted to different channels. This avoids collecting a large amount of voice data for training each time a new voice recognition product appears; training data adapted to the product can be obtained merely by updating and maintaining the channel conversion algorithm, which improves the modeling efficiency of the new voice matching model and saves labor cost.

Example four

Fig. 5 is a schematic structural diagram of another embodiment of the speech training data adaptation apparatus provided by the present invention, which can be used to execute the method steps shown in fig. 3. As shown in fig. 5, on the basis of the embodiment shown in fig. 4, the speech training data adaptation apparatus provided in the embodiment of the present invention may further include: an algorithm acquisition module 51. The algorithm obtaining module 51 may be configured to obtain recording data for a fixed text under different channels, and obtain a channel conversion algorithm according to a difference parameter distribution function of different recording data.

In the embodiment of the present invention, a section of fixed text may be set first, and when acquiring the channel conversion algorithm, the algorithm acquisition module 51 may record the section of fixed text in different channels, for example, in the environment of two-microphone, four-microphone, six-microphone, and the like and in a high-fidelity channel environment, to acquire different recording data.

Further, the algorithm obtaining module 51 may be further configured to obtain the recording data for the fixed text at different distances for the same channel environment.

In the embodiment of the present invention, for the recording data under different channels, the algorithm obtaining module 51 may obtain a channel conversion algorithm according to the Gaussian distribution function of the recording data; for the recording data at different distances, it may obtain a channel conversion algorithm according to the energy distribution function of the recording data. Together these yield the channel conversion algorithm used for data conversion.

In the embodiment of the present invention, the process by which the original voice data obtaining module 41 obtains the original voice data is independent of the process by which the algorithm obtaining module 51 obtains the channel conversion algorithm. The original voice data serves as the input to the channel conversion algorithm, and its acquisition can be regarded as a preparatory data preparation process; the acquisition of the channel conversion algorithm needs to be performed each time a new smart speaker is released, so as to update and maintain the existing channel conversion algorithm.

Still further, the raw speech data acquisition module 41 may include: a first obtaining unit 411, where the first obtaining unit 411 may be configured to obtain existing original voice data in a first database.

The raw speech data acquisition module 41 may further include: a second obtaining unit 412, where the second obtaining unit 412 may be configured to obtain, in a second database, original voice data obtained by recording existing inventory data with a high-fidelity recording apparatus.

The raw speech data acquisition module 41 may further include: a third obtaining unit 413, where the third obtaining unit 413 may be configured to obtain, in a third database, original voice data obtained by recording a human voice by a high-fidelity recording apparatus.

In this embodiment of the present invention, the first acquiring unit 411, the second acquiring unit 412, and the third acquiring unit 413 have no required order of operation: they may operate simultaneously or sequentially in any order, and any one or two of the three units may also be used on their own.
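The module structure of this embodiment could be mirrored in code roughly as follows. This is a structural sketch only: the class names, the callable-based units, and the sample values are assumptions, and the converters passed to the conversion module stand in for whatever the algorithm obtaining module 51 would produce.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Waveform = List[float]
AcquisitionUnit = Callable[[], List[Waveform]]
ChannelConverter = Callable[[Waveform], Waveform]


@dataclass
class OriginalVoiceDataAcquisitionModule:
    """Module 41: holds any subset of the acquiring units 411 to 413."""
    units: List[AcquisitionUnit] = field(default_factory=list)

    def acquire(self) -> List[Waveform]:
        data: List[Waveform] = []
        for unit in self.units:
            data.extend(unit())
        return data


@dataclass
class DataConversionModule:
    """Module 42: applies the converters supplied by module 51."""
    converters: Dict[str, ChannelConverter]

    def convert(self, originals: List[Waveform]) -> Dict[str, List[Waveform]]:
        return {
            channel: [f(wave) for wave in originals]
            for channel, f in self.converters.items()
        }


@dataclass
class AdaptationApparatus:
    acquisition: OriginalVoiceDataAcquisitionModule
    conversion: DataConversionModule

    def run(self) -> Dict[str, List[Waveform]]:
        return self.conversion.convert(self.acquisition.acquire())


# Hypothetical wiring: only two of the three units are configured here.
apparatus = AdaptationApparatus(
    acquisition=OriginalVoiceDataAcquisitionModule(
        units=[lambda: [[0.5, -0.5]], lambda: [[0.25, 0.75]]]
    ),
    conversion=DataConversionModule(
        converters={"two_mic": lambda w: [0.5 * s for s in w]}
    ),
)
result = apparatus.run()
```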

Example five

Fig. 6 is a flowchart of a voice data conversion method according to an embodiment of the present invention. The execution subject of the method can be various server devices with voice data processing capability, and can also be devices or chips integrated on the server devices. As shown in fig. 6, the voice data conversion method includes the steps of:

S601, converting the original voice data through a channel conversion algorithm matched with the playing device to obtain training data suitable for the playing device.

In the embodiment of the present invention, the original voice data refers to voice data having audio data information in various directions.

Regarding the acquisition of the original voice data: existing original voice data can be obtained from a first database; original voice data obtained by re-recording existing stock data through high-fidelity recording equipment can be obtained from a second database; and original voice data obtained by recording the voice of a recording person through high-fidelity recording equipment can be obtained from a third database.

When TTS (Text To Speech) playback is performed, the voice playing device needs to play speech according to the configured voice database, and playing devices of different models need voice databases configured for different channels. According to the voice data conversion method provided by the embodiment of the invention, when a new playing device is released, a server providing support for the playing device can obtain the channel conversion algorithm matched with the playing device according to the channel type of the playing device, and thereby obtain the training data suitable for the playing device.

Specifically, the channel conversion algorithm matched with the playing device may be obtained as follows: recording data for a fixed text under different channels is acquired, including recording data of the fixed text for the playing device; then the channel conversion algorithm is obtained according to the difference parameter distribution functions of the different recording data.

For the recording data under different channels, the channel conversion algorithm can be obtained according to the Gaussian distribution function of the recording data.

S602, performing model training according to the training data to obtain a data conversion model.

S603, according to the data conversion model, converting the data to be output of the playing device to obtain the playing data suitable for the playing device.

In the embodiment of the invention, the server performs model training after acquiring the training data suitable for the playing equipment, thereby obtaining the data conversion model.

When the playing device is to play voice, it can send the data to be output to the server; the server inputs the data to be output into the data conversion model, and the model outputs playing data suitable for the playing device. Once the playing device receives the playing data from the server, it can play it.
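The server-side flow of S601 to S603 can be sketched as below. The source does not describe the structure of the data conversion model, so a single least-squares gain is used here purely as a stand-in; the function names, the device converter, and all sample values are hypothetical.

```python
from typing import Callable, List, Tuple

Waveform = List[float]
ChannelConverter = Callable[[Waveform], Waveform]


def make_training_pairs(
    originals: List[Waveform], converter: ChannelConverter
) -> List[Tuple[Waveform, Waveform]]:
    """S601: pair each original utterance with its device-adapted version."""
    return [(wave, converter(wave)) for wave in originals]


def train_gain_model(pairs: List[Tuple[Waveform, Waveform]]) -> float:
    """S602 (toy): least-squares scalar gain from original to adapted samples."""
    num = sum(x * y for src, dst in pairs for x, y in zip(src, dst))
    den = sum(x * x for src, _ in pairs for x in src)
    return num / den


def convert_for_playback(gain: float, to_output: Waveform) -> Waveform:
    """S603: apply the trained model to data the device is about to play."""
    return [gain * s for s in to_output]


# Made-up original voice data and a hypothetical device-matched converter.
originals = [[0.2, -0.4, 0.6], [0.8, 0.0, -0.2]]
device_converter: ChannelConverter = lambda w: [0.7 * s for s in w]

pairs = make_training_pairs(originals, device_converter)
gain = train_gain_model(pairs)
playback = convert_for_playback(gain, [1.0, -1.0, 0.5])
```

In a deployment along the lines described above, the first two calls would run once on the server when the new playing device is introduced, and `convert_for_playback` would run on each request from the device.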

According to the voice data conversion method provided by the embodiment of the invention, existing original voice data is converted through the channel conversion algorithm matched with the playing equipment to obtain training data adapted to the playing equipment. This avoids collecting a large amount of voice data each time a new voice playing product appears; training data adapted to the product can be obtained merely by updating and maintaining the channel conversion algorithm. A data conversion model is then trained on that data and used to convert the data to be played by the new product, which improves voice playing quality and saves labor cost in data acquisition.

Example six

The internal functions and structure of the speech training data adaptation apparatus, which can be implemented as an electronic device, are described above. Fig. 7 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention. As shown in fig. 7, the electronic device includes a memory 71 and a processor 72.

The memory 71 stores programs. In addition to the above-described programs, the memory 71 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.

The memory 71 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

A processor 72, coupled to the memory 71, that executes programs stored by the memory 71 to:

acquiring original voice data for data conversion, the original voice data having audio data information in various directions;

and converting the acquired original voice data through a channel conversion algorithm to acquire training data suitable for different channels.

Further, as shown in fig. 7, the electronic device may further include: a communication component 73, a power component 74, an audio component 75, a display 76, and the like. Only some of the components are schematically shown in fig. 7, which does not mean that the electronic device includes only the components shown in fig. 7.

The communication component 73 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 73 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 73 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The power component 74 provides power to the various components of the electronic device. The power component 74 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device.

The audio component 75 is configured to output and/or input audio signals. For example, the audio component 75 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 71 or transmitted via the communication component 73. In some embodiments, the audio component 75 also includes a speaker for outputting audio signals.

The display 76 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for adapting speech training data, comprising:

2. The method of adapting speech training data according to claim 1, further comprising, before said converting said raw speech data by a channel conversion algorithm:

acquiring recording data aiming at the fixed text under different channels;

and acquiring the channel conversion algorithm according to different difference parameter distribution functions of the recording data.

3. The method of adapting speech training data according to claim 2, further comprising:

and acquiring the recording data aiming at the fixed text under different distances.

4. The method of claim 2, wherein the distribution function of the difference parameters of the recorded data under different channels is a Gaussian distribution function.

5. The method of claim 3, wherein the difference parameter distribution function of the recorded data at different distances is an energy distribution function.

6. The method of any of claims 1 to 5, wherein the obtaining raw speech data for data conversion comprises:

existing raw speech data is retrieved from a first database.

7. The method of any of claims 1 to 5, wherein the obtaining raw speech data for data conversion comprises:

and acquiring original voice data obtained by recording the existing stock data through high-fidelity recording equipment in a second database.

8. The method of any of claims 1 to 5, wherein the obtaining raw speech data for data conversion comprises:

and acquiring original voice data obtained by recording the voice of a recording person through high-fidelity recording equipment in a third database.

9. A method for converting voice data, comprising:

10. The voice data conversion method according to claim 9, wherein before the conversion processing of the original voice data by the channel conversion algorithm matched with the playback device, the method comprises:

acquiring recording data for a fixed text under different channels, wherein the recording data comprises the recording data for the fixed text of the playing equipment;

11. An apparatus for adapting speech training data, comprising:

12. An electronic device, comprising:

a memory for storing a program;

a processor for executing the program stored in the memory for: