WO2014176489A2 - A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis - Google Patents
- Publication number
- WO2014176489A2 (application PCT/US2014/035436)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- unit
- sample
- samples library
- speech unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
Definitions
- the present invention relates generally to the generation of a speech samples library utilized for text-to-speech (TTS) synthesis and, more specifically, to supervised creation of speech samples libraries for TTS synthesis customized based on a user's expressive speech.
- TTS text-to-speech
- Several systems known in the art provide speech samples libraries for text-to-speech (TTS) synthesis. These speech samples libraries are typically populated with speech units featuring diverse utterances based on a variety of musical parameters.
- the musical parameters can include, for example, different pronunciations of a given word that may result from characteristics of the speaker such as gender, accent, dialect, etc.
- the quality of these collections of speech samples is typically measured by how natural or human-like the synthesized speech sounds. Thus, such measures of quality are typically evaluated respective of phonetic completeness, phonemic completeness, and an optimal variety of musical attributes.
- Such synthesis may be produced respective of TTS techniques.
- a customized speech samples library is required to be created respective of a user's voice.
- the creation of such customized speech sample libraries depends in part on unsupervised and unstructured speech.
- the difficulty arising from such speech sample libraries is that a desired threshold of quality suitable for generating an expressive TTS voice cannot be enforced in real time.
- Certain exemplary embodiments include a system and method for supervised creation of a new speech samples library for text-to-speech (TTS) synthesis.
- the method comprises tracking at least one speech unit in an existing speech samples library to determine whether the existing speech samples library achieves a desired quality; receiving at least one speech sample; analyzing the at least one received speech sample to identify at least one speech unit required to obtain the desired quality of the speech samples library; and storing the at least one required speech unit in the speech samples library.
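As a non-limiting illustration, the following Python sketch models this supervised loop; all names (SpeechLibrary, build_library, the unit labels, the completeness-based quality measure) are assumptions introduced for the example, not identifiers or metrics from the disclosure.

```python
# Hypothetical sketch of the supervised-creation loop: track missing units,
# analyze incoming samples, store only what the library still needs.
from dataclasses import dataclass, field

@dataclass
class SpeechLibrary:
    required_units: set                       # units needed for the desired quality
    stored: set = field(default_factory=set)

    def missing_units(self):
        return self.required_units - self.stored

    def quality(self):
        # one possible quality measure: fraction of required units present
        return len(self.stored & self.required_units) / len(self.required_units)

def build_library(library, samples, desired_quality=1.0):
    """Add speech units from samples until the library reaches the desired quality."""
    for sample_units in samples:              # each sample: a list of speech units
        if library.quality() >= desired_quality:
            break
        for unit in sample_units:
            if unit in library.missing_units():
                library.stored.add(unit)
    return library

lib = build_library(SpeechLibrary(required_units={"p", "b", "d", "t"}),
                    samples=[["p", "ae", "d"], ["b", "ae", "t"]])
print(sorted(lib.stored))                     # -> ['b', 'd', 'p', 't']
```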
- Figure 1 is a schematic block diagram of a system for creating personalized speech samples libraries utilized to describe various embodiments.
- Figure 2 is a flowchart illustrating the supervised creation of a personalized speech samples library according to an embodiment.
- Figure 3 is a flowchart illustrating determination of whether a desired quality has been achieved according to an embodiment.
- Figure 4 is a flowchart illustrating determination of priority according to an embodiment.
- Fig. 1 is an exemplary and non-limiting schematic diagram of a system 100 for creating personalized speech samples libraries utilized to describe the various embodiments.
- a server 110 is optionally connected to one or more user nodes 120-1 through 120-n (for the sake of simplicity and without limitation, user nodes 120-1 through 120-n may be referred to individually as a user node 120 or collectively as user nodes 120) via an interface 130.
- a user node 120 may be, but is not limited to, a computer node, a personal computer (PC), a notebook computer, a cellular phone, a smartphone, a tablet device, a wearable device, and so on.
- the server 110 typically contains several components such as a processor or processing unit 140 and a memory 150.
- the interface 130 may be a network interface providing wired and/or wireless connectivity for a local area network (LAN), the Internet, and the like.
- the network interface 130 may be a serial bus, for example, universal serial bus (USB) for connecting peripheral devices.
- USB universal serial bus
- the memory 150 further contains instructions 160 executed by the processor 140.
- the system 100 optionally includes a speech recognition (SR) system 170 that may be an integral part of the memory 150, or a separate entity coupled to the server 110.
- the server 110 is configured to identify, using the SR system 170, speech samples pronounced by a user.
- the server 110 may be configured to perform digital signal processing (DSP) when the pronunciation is inconsistent. This inconsistency may be expressed, for example, in sound volume, pronunciation speed, or tone of speech.
- DSP digital signal processing
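For instance, one simple DSP correction for inconsistent volume is RMS normalization. The sketch below is a minimal assumed example (numpy only, with a synthetic stand-in recording), not the disclosed implementation.

```python
# Hedged sketch: level an inconsistently loud/quiet sample to a target RMS.
import numpy as np

def normalize_rms(audio, target_rms=0.1):
    """Scale a signal so its RMS level matches target_rms."""
    rms = np.sqrt(np.mean(audio ** 2))
    return audio if rms == 0 else audio * (target_rms / rms)

rng = np.random.default_rng(0)
sample = 0.02 * rng.standard_normal(16000)            # assumed quiet 1 s recording
leveled = normalize_rms(sample)
print(round(float(np.sqrt(np.mean(leveled ** 2))), 3))  # -> 0.1
```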
- the server 110 is further configured to receive speech samples containing speech units through, for example, the interface 130.
- a speech sample may be a word, a phrase, a sentence, etc.
- a speech unit is a distinct unit of sound in a specified language or dialect used to distinguish one word from another. For example, the sounds p, b, d, and t distinguish the English words pad, pat, bad, and bat.
- Each speech unit can be classified as a phoneme, a bi-phone, or a tri-phone.
- a phoneme is the basic unit of a language's phonology; phonemes may be combined with one another to form meaningful units such as bi-phones or tri-phones.
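A brief illustrative sketch of how bi-phones and tri-phones can be derived from a phoneme sequence; the phoneme transcription of the word "pad" is an assumption made for the example.

```python
# Bi-phones and tri-phones as overlapping n-unit slices of a phoneme sequence.
def ngrams(phonemes, n):
    """Return overlapping n-unit slices of a phoneme sequence."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]

phonemes = ["p", "ae", "d"]          # assumed transcription of "pad"
print(ngrams(phonemes, 2))           # bi-phones:  [('p','ae'), ('ae','d')]
print(ngrams(phonemes, 3))           # tri-phones: [('p','ae','d')]
```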
- the system 100 includes one or more speech samples libraries 180-1 through 180-m (for the sake of simplicity and without limitation, speech samples libraries 180-1 through 180-m may be referred to individually as a speech samples library 180 or collectively as speech samples libraries 180) that may be an integral part of the memory 150, or a separate entity coupled to the server 110.
- Each speech samples library 180 contains personalized speech samples of speech units.
- each speech samples library 180 may maintain information to be used for TTS synthesis.
- the server 110, in an embodiment, is configured to analyze each speech samples library 180.
- the server 110 typically identifies one or more speech units stored in the speech samples library 180 as well as the speech units that are missing and, thus, must be added. This identification is usually performed respective of a threshold of speech units required to reach a desired quality.
- the server 110 is then configured to analyze the speech sample and the speech units contained therein.
- the analysis may include, for example, identification of neighbors of each speech unit in the speech sample, determination of a location of each speech unit in the speech sample, analysis of musical parameters of each speech unit, etc. The analysis process is discussed further herein below with respect to Fig. 2.
- the server 110 may also determine a quality of the speech samples.
- the speech units are stored in the speech samples library 180 under the supervision of the server 110 in real-time respective of a priority determined for each speech unit.
- the server 110 takes into consideration the analysis results to determine the priority of each speech unit.
- the supervised creation of the speech samples library 180 assists the server 110 in determining whether the speech samples library 180 has reached the desired quality.
- the process of analyzing a speech samples library and the speech samples contained therein to perform supervised creation of a speech samples library 180 of a desired quality is discussed further herein below with respect to Fig. 2.
- Fig. 2 shows an exemplary and non-limiting flowchart 200 describing the supervised creation of a personalized speech samples library according to an embodiment.
- a personalized speech samples library is a speech samples library that achieves a desired quality.
- the desired quality may be, e.g., a level of quality that is predefined, a level of quality decided by a user in real time, and the like.
- the personalized speech samples library is constructed based on speech samples contained in an existing speech samples library.
- a personalized speech samples library may be constructed without utilizing an existing speech samples library.
- S210 upon receiving a request to create a personalized speech samples library, one or more speech units stored in an existing or preconfigured speech samples library (e.g., speech samples library 180) are tracked for analysis by a server (e.g., the server 110).
- S220 it is checked whether the speech samples library has reached a desired quality and, if so, execution terminates; otherwise, execution continues with S230. Determination of whether a speech samples library has achieved a desired quality is discussed further herein below with respect to Fig. 3.
- one or more speech samples are received.
- the one or more speech samples are received from the speech samples library.
- the speech samples library may lack one or more speech units that are necessary to achieving a desired quality.
- a speech sample may be, but is not limited to, a word, a sequence of words, a sentence, and the like.
- the identities of speech units that are missing, relative to the speech units existing in the speech samples library, are determined. This determination may further include determining a threshold of speech units that are required for the desired quality.
- a speech unit may be, but is not limited to, a phoneme, a bi-phone, a tri-phone, and the like.
- the speech samples may be received through a speech recognition (SR) system (e.g., SR system 170).
- SR speech recognition
- a request is sent to a user to pronounce one or more speech samples.
- the request may be sent through an appropriate interface (e.g., interface 130).
- the received speech samples may be sent back in real-time to verify that the identification of such speech samples was correct.
- one or more speech units of the received speech samples are analyzed.
- the speech unit's neighbors within each speech sample are identified.
- a neighbor is a speech unit that precedes or follows the analyzed speech unit when the speech units of a speech sample are arranged in a sequential order (e.g., according to location within the speech sample).
- neighbors only include speech units that immediately precede or follow the analyzed speech unit.
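A minimal sketch of this neighbor-and-location identification; the record field names (location, prev, next) are assumptions made for illustration.

```python
# For each speech unit in a sample, record its position and its immediate
# preceding/following neighbors (None at the sample boundaries).
def analyze_sample(units):
    """Record each unit's location and its immediate neighbors."""
    return [{"unit": u,
             "location": i,
             "prev": units[i - 1] if i > 0 else None,
             "next": units[i + 1] if i < len(units) - 1 else None}
            for i, u in enumerate(units)]

for record in analyze_sample(["b", "ae", "t"]):   # assumed units of "bat"
    print(record)
```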
- a parameters analysis is performed for each speech unit. Such an analysis may include, but is not limited to, identification of musical parameters, such as, pitch characteristics, duration, volume, and so on.
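As a hedged example of measuring such musical parameters, the sketch below estimates duration, RMS volume, and a rough autocorrelation-based pitch; the synthetic 220 Hz tone and the 50-400 Hz search band are assumptions standing in for a recorded speech unit.

```python
# Simple musical-parameter measurements for one unit of audio (numpy only).
import numpy as np

sr = 16000
t = np.arange(int(0.1 * sr)) / sr
audio = 0.5 * np.sin(2 * np.pi * 220.0 * t)       # assumed stand-in signal

duration = len(audio) / sr                         # seconds
volume = float(np.sqrt(np.mean(audio ** 2)))       # RMS energy

# Rough pitch: lag of the strongest autocorrelation peak in the 50-400 Hz band.
ac = np.correlate(audio, audio, mode="full")[len(audio) - 1:]
lag = int(np.argmax(ac[sr // 400: sr // 50])) + sr // 400
pitch = sr / lag

print(f"duration={duration:.3f}s volume={volume:.3f} pitch~{pitch:.0f} Hz")
```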
- a priority of the speech samples and the respective speech units is determined.
- a priority may be, but is not limited to, one of several categories (e.g., low, medium, or high), a numerical value associated with levels of priority (e.g., zero through ten, wherein zero represents the lowest priority and ten represents the highest priority), and so on. Determination of priority is discussed further herein below with respect to Fig. 4.
- the analysis and its respective speech units are stored in the speech samples library.
- the speech units are typically stored respective of the priority of those speech units or of speech samples containing the speech units.
- speech units with higher priority are stored earlier in a sequential order than speech units with lower priority.
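One plausible way to realize such priority-ordered storage is a max-heap, so that higher-priority units are always handled first; the unit labels and the 0-10 priorities below are assumptions.

```python
# Store/process speech units in descending priority order via a max-heap
# (heapq is a min-heap, so priorities are negated).
import heapq

units = [("th", 9), ("zh", 4), ("ae", 7)]          # assumed (unit, priority) pairs
heap = [(-priority, unit) for unit, priority in units]
heapq.heapify(heap)

while heap:
    neg_priority, unit = heapq.heappop(heap)
    print(unit, -neg_priority)                     # th 9, ae 7, zh 4
```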
- the quality of the speech samples library is determined respective of the analysis made for each speech unit and the respective determined priority.
- a value representing the quality may be displayed to the user through the interface.
- it is checked whether there are additional speech units that are required to be added and, if so, execution continues with S230; otherwise, execution terminates.
- Fig. 3 illustrates an exemplary and non-limiting flowchart S220 for determination of whether a desired quality has been achieved by a speech samples library according to an embodiment.
- speech samples and requirements for desired quality are retrieved.
- a speech sample may be, but is not limited to, a word, a sequence of words, a sentence, and the like.
- the speech samples may be retrieved from an existing speech samples library.
- the speech samples may be retrieved from an input received by a user, an electronic sound database, and the like.
- the requirements for desired quality may be, e.g., a minimum number of speech units, a set of required speech units, a set of speech samples containing all required speech units, and the like.
- a speech unit may be, but is not limited to, a phoneme, a bi-phone, a tri- phone, etc.
- the retrieved speech samples are analyzed to determine existing speech units within each speech sample.
- the existing speech units are analyzed to determine the suitability of each speech unit.
- Suitability may be determined based on musical parameters of the speech unit including, but not limited to, volume, clarity, pitch, tone, duration, etc.
- Suitability of a speech unit may be, e.g., predefined, or may be determined by the user in real-time. In embodiments where suitability is determined by a user in real-time, the speech units may be displayed and played on a user node, thereby enabling the user to decide whether each speech unit is suitable according to his or her preferences.
- suitable speech units are compiled into a list (or a count) of speech units.
- the results of the suitability determination may be returned as the list or number of speech units.
- unsuitable speech units are excluded from the results of the suitability determination.
- as an example, if a speech sample of the word "incredulous" is analyzed (a word that, for purposes of the example, is treated as including four phonemes, each considered to be a speech unit), and three of the speech units are determined to be suitable, the results of the suitability analysis would include only those three suitable speech units.
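A minimal sketch of such a suitability filter; the thresholds and the per-unit measurements are assumptions chosen for illustration, not values from the disclosure.

```python
# Keep only units whose musical parameters clear minimal thresholds;
# unsuitable units are excluded from the results.
def suitable(units, min_volume=0.1, min_duration=0.02):
    return [u for u in units
            if u["volume"] >= min_volume and u["duration"] >= min_duration]

units = [
    {"unit": "ih", "volume": 0.30, "duration": 0.050},
    {"unit": "n",  "volume": 0.05, "duration": 0.040},  # too quiet: excluded
    {"unit": "k",  "volume": 0.20, "duration": 0.030},
    {"unit": "r",  "volume": 0.25, "duration": 0.060},
]
print([u["unit"] for u in suitable(units)])   # -> ['ih', 'k', 'r']
```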
- Fig. 4 is an exemplary and non-limiting flowchart S250 illustrating determination of priority according to an embodiment.
- a priority in the context of the disclosed embodiments may be, but is not limited to, a classification within one of several categories (e.g., low, medium, high, and the like), a numerical value associated with levels of priority (e.g., zero through ten, wherein zero represents the lowest priority and ten represents the highest priority), and so on.
- the priority represents the degree of importance that a speech unit has within a given speech samples library (e.g., speech samples library 180). Speech units demonstrating particularly unique features respective of other speech units may be determined as higher priority, since such unique speech units typically contribute more to the quality of a speech samples library than less unique speech units.
- the speech sample is retrieved.
- the speech sample is retrieved from, e.g., a speech samples library 180.
- the speech sample is analyzed to identify existing speech units within the speech sample.
- the priority of each speech unit is determined.
- the priority may be determined, for example, respective of the analysis and the desired quality of the speech samples library.
- the priority may be determined respective of a quality level of the received speech sample. Specifically, speech units associated with speech samples having poor quality are considered low priority, while speech units associated with speech samples having high quality are considered high priority. As a non-limiting example, speech units of clear sounding words will get higher priority than speech units of words whose respective sounds are muddled or otherwise distorted.
- the priority of speech units may be determined respective of a variety of musical parameters of such speech units.
- existence of a variety of musical parameters of such speech units may also lead to determination of a high priority speech unit.
- the quality of the speech samples library may depend on, among other things, existence of a wide range of pitch characteristics of the speech samples.
- the priority may be determined respective of a significance of the speech units.
- significant speech units may be considered high priority.
- the significance of a speech unit may be determined respective of a frequency of occurrence of the speech unit in a natural language speech corpus.
- the significance may also reflect the existence of one or more alternative speech units found in the speech samples library. Absence of such alternative speech units may lead to a determination of high significance and, thus, would result in a high priority speech unit.
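An illustrative sketch combining both notions of significance just described (corpus frequency, boosted when no alternative sample exists in the library); the corpus contents, library contents, and the 2x boost factor are assumptions.

```python
# Significance = relative frequency in an assumed natural-language corpus,
# doubled when the library holds no alternative sample of the unit.
from collections import Counter

corpus_units = ["ae", "t", "ae", "k", "s", "ae", "t"]   # assumed corpus
library_units = {"t"}                                    # already stored

freq = Counter(corpus_units)
for unit in ["ae", "k", "t"]:
    significance = freq[unit] / len(corpus_units)
    if unit not in library_units:                        # no alternative stored
        significance *= 2.0                              # assumed boost
    print(unit, round(significance, 2))                  # ae 0.86, k 0.29, t 0.29
```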
- results of the priority determination are stored in a speech samples library. In an embodiment, these results are associated with the respective speech samples in the speech samples library.
- a user node 120 may be configured to execute these processes.
- the preceding embodiments for determining priority do not limit the methods available for determining the priority. Specifically, any of the above-described methods may be combined with each other and/or with other methods for determining the priority without departing from the scope of the disclosed embodiments. As an example, the priority may be determined based on both the significance of speech units and the variety of musical parameters. In such combinations, the priority may be determined to be high if, e.g., any of the methods used for determining priority yields a high priority result, or if an average of numerical values for priority yields a high priority result.
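A small sketch of the two combination strategies named above (maximum or average of the individual priority scores); the method names and score values are assumptions.

```python
# Combine per-method priority scores by taking their maximum or their average.
def combined_priority(scores, mode="max"):
    scores = list(scores)
    return max(scores) if mode == "max" else sum(scores) / len(scores)

method_scores = {"quality": 8, "variety": 5, "significance": 9}   # 0-10 scale
print(combined_priority(method_scores.values(), "max"))   # -> 9
print(combined_priority(method_scores.values(), "avg"))   # -> 7.33...
```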
- the various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof.
- the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices.
- the application program may be uploaded to, and executed by, a machine comprising any suitable architecture.
- the machine is implemented on a computer platform having hardware such as one or more central processing units ("CPUs"), a memory, and input/output interfaces.
- CPUs central processing units
- the computer platform may also include an operating system and microinstruction code.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201361816176P | 2013-04-26 | 2013-04-26 | |
| US61/816,176 | 2013-04-26 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2014176489A2 true WO2014176489A2 (en) | 2014-10-30 |
| WO2014176489A3 WO2014176489A3 (en) | 2014-12-18 |
Family
ID=51792516
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2014/035436 Ceased WO2014176489A2 (en) | 2013-04-26 | 2014-04-25 | A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2014176489A2 (en) |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5850629A (en) * | 1996-09-09 | 1998-12-15 | Matsushita Electric Industrial Co., Ltd. | User interface controller for text-to-speech synthesizer |
| US6505158B1 (en) * | 2000-07-05 | 2003-01-07 | At&T Corp. | Synthesis-based pre-selection of suitable units for concatenative speech |
| EP2140448A1 (en) * | 2007-03-21 | 2010-01-06 | Vivotext Ltd. | Speech samples library for text-to-speech and methods and apparatus for generating and using same |
- 2014-04-25: WO PCT/US2014/035436 patent/WO2014176489A2/en, not active (Ceased)
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10388272B1 (en) | 2018-12-04 | 2019-08-20 | Sorenson Ip Holdings, Llc | Training speech recognition systems using word sequences |
| US10573312B1 (en) | 2018-12-04 | 2020-02-25 | Sorenson Ip Holdings, Llc | Transcription generation from multiple speech recognition systems |
| US10672383B1 (en) | 2018-12-04 | 2020-06-02 | Sorenson Ip Holdings, Llc | Training speech recognition systems using word sequences |
| US10971153B2 (en) | 2018-12-04 | 2021-04-06 | Sorenson Ip Holdings, Llc | Transcription generation from multiple speech recognition systems |
| US11017778B1 (en) | 2018-12-04 | 2021-05-25 | Sorenson Ip Holdings, Llc | Switching between speech recognition systems |
| US11145312B2 (en) | 2018-12-04 | 2021-10-12 | Sorenson Ip Holdings, Llc | Switching between speech recognition systems |
| US11170761B2 (en) | 2018-12-04 | 2021-11-09 | Sorenson Ip Holdings, Llc | Training of speech recognition systems |
| US11594221B2 (en) | 2018-12-04 | 2023-02-28 | Sorenson Ip Holdings, Llc | Transcription generation from multiple speech recognition systems |
| US11935540B2 (en) | 2018-12-04 | 2024-03-19 | Sorenson Ip Holdings, Llc | Switching between speech recognition systems |
| US12380877B2 (en) | 2018-12-04 | 2025-08-05 | Sorenson Ip Holdings, Llc | Training of speech recognition systems |
| US11488604B2 (en) | 2020-08-19 | 2022-11-01 | Sorenson Ip Holdings, Llc | Transcription of audio |
| US12499874B2 (en) | 2023-12-13 | 2025-12-16 | Sorenson Ip Holdings, Llc | Training speech recognition systems using word sequences |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2014176489A3 (en) | 2014-12-18 |
Similar Documents
| Publication | Title |
|---|---|
| KR102582291B1 | Emotion information-based voice synthesis method and device |
| CN108573693B | Text-to-speech system and method, and storage medium therefor |
| CN107590135B | Automatic translation methods, devices and systems |
| US20140236597A1 | System and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis |
| EP3832644A1 | Neural speech-to-meaning translation |
| JP6812843B2 | Computer program for voice recognition, voice recognition device and voice recognition method |
| US11574637B1 | Spoken language understanding models |
| WO2017067206A1 | Training method for multiple personalized acoustic models, and voice synthesis method and device |
| US11158308B1 | Configuring natural language system |
| US8447603B2 | Rating speech naturalness of speech utterances based on a plurality of human testers |
| JP2007249212A | Method, computer program and processor for text speech synthesis |
| US11615787B2 | Dialogue system and method of controlling the same |
| CN110600002B | Voice synthesis method and device and electronic equipment |
| JP2017032839A | Acoustic model learning device, voice synthesis device, acoustic model learning method, voice synthesis method, and program |
| KR20230079503A | Sample generation method and device |
| CN105609097A | Speech synthesis apparatus and control method thereof |
| JP2019179257A | Acoustic model learning device, voice synthesizer, acoustic model learning method, voice synthesis method, and program |
| CN112908308B | Audio processing method, device, equipment and medium |
| CN110852075B | Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium |
| CN114566140B | Speech synthesis model training method, speech synthesis method, equipment and product |
| WO2014176489A2 | A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis |
| WO2014183411A1 | Method, apparatus and speech synthesis system for classifying unvoiced and voiced sound |
| CN115050351A | Method and device for generating timestamp and computer equipment |
| CN116917984A | Interactive content output |
| KR102277205B1 | Apparatus for converting audio and method thereof |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14787390; Country of ref document: EP; Kind code of ref document: A2 |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 14787390; Country of ref document: EP; Kind code of ref document: A2 |
| | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02/05/2016) |