
WO2014176489A2 - A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis - Google Patents

A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis

Info

Publication number
WO2014176489A2
WO2014176489A2 (PCT/US2014/035436)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
unit
sample
samples library
speech unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2014/035436
Other languages
French (fr)
Other versions
WO2014176489A3 (en)
Inventor
Yossef BEN EZRA
Shai Nissim
Gershon Silbert
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Text Ltd
M&B IP Analysts LLC
Original Assignee
Vivo Text Ltd
M&B IP Analysts LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Text Ltd, M&B IP Analysts LLC filed Critical Vivo Text Ltd
Publication of WO2014176489A2 publication Critical patent/WO2014176489A2/en
Publication of WO2014176489A3 publication Critical patent/WO2014176489A3/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A system and method for supervised creation of a speech samples library for text-to-speech synthesis are provided. The method includes tracking at least one speech unit in an existing speech samples library to determine if the existing speech samples library achieves a desired quality; receiving at least one speech sample; analyzing the at least one received speech sample to identify at least one speech unit necessitated to obtain the desired quality of the speech samples library; and storing the at least one necessary speech unit in the speech samples library.

Description

A SYSTEM AND METHOD FOR SUPERVISED CREATION OF PERSONALIZED SPEECH SAMPLES LIBRARIES IN REAL-TIME FOR TEXT-TO-SPEECH SYNTHESIS
CROSS-REFERENCE TO RELATED APPLICATIONS
[001] This application claims the benefit of U.S. Provisional Application No. 61/816,176 filed on April 26, 2013, the contents of which are hereby incorporated by reference for all that they contain.
TECHNICAL FIELD
[002] The present invention relates generally to the generation of a speech samples library utilized for text-to-speech (TTS) synthesis and, more specifically, to supervised creation of speech samples libraries for TTS synthesis customized based on a user's expressive speech.
BACKGROUND
[003] Several systems known in the art provide speech samples libraries for text-to-speech (TTS) synthesis. These speech samples libraries are typically equipped with speech units with diverse utterances based on a variety of musical parameters. The musical parameters can include, for example, different pronunciations of a given word that may result from characteristics of the speaker such as gender, accent, dialect, etc. The quality of these collections of speech samples is typically measured by how natural or human-like the synthesized speech sounds. Thus, such measures of quality are typically evaluated respective of a phonetic completeness, a phonemic completeness, and an optimal variety of musical attributes.
[004] The existing art features techniques for synthesis of customized expressive speech.
Such synthesis may be produced respective of TTS techniques. In order to achieve such synthesis, a customized speech samples library is required to be created respective of a user's voice. The creation of such customized speech samples libraries depends in part on unsupervised and unstructured speech. The difficulty arising from such speech samples libraries is that a desired threshold of quality suitable for generating an expressive TTS voice cannot be supervised in real time.
[005] It would therefore be advantageous to overcome the limitations of the prior art by providing an effective way for handling the supervision of the creation of a speech samples library reaching a desired quality threshold suitable for generating customized expressive speech.
SUMMARY
[006] Certain exemplary embodiments include a system and method for supervised creation of a new speech samples library for text-to-speech (TTS) synthesis. The method comprises tracking at least one speech unit in an existing speech samples library to determine if the existing speech samples library achieves a desired quality; receiving at least one speech sample; analyzing the at least one received speech sample to identify at least one speech unit necessitated to obtain the desired quality of the speech samples library; and storing the at least one necessary speech unit in the speech samples library.
BRIEF DESCRIPTION OF THE DRAWINGS
[007] The subject matter disclosed herein is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention will be apparent from the following detailed description taken in conjunction with the accompanying drawings.
[008] Figure 1 is a schematic block diagram of a system for creating personalized speech samples libraries utilized to describe various embodiments.
[009] Figure 2 is a flowchart illustrating the supervised creation of a personalized speech samples library according to an embodiment.
[0010] Figure 3 is a flowchart illustrating determination of whether a desired quality has been achieved according to an embodiment.
[0011] Figure 4 is a flowchart illustrating determination of priority according to an embodiment.
DETAILED DESCRIPTION
[0012] It is important to note that the embodiments disclosed herein are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed inventions. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality. In the drawings, like numerals refer to like parts through several views.
[0013] Fig. 1 is an exemplary and non-limiting schematic diagram of a system 100 for creating personalized speech samples libraries utilized to describe the various embodiments. A server 110 is optionally connected to one or more user nodes 120-1 through 120-n (for the sake of simplicity and without limitation, user nodes 120-1 through 120-n may be referred to individually as a user node 120 or collectively as user nodes 120) via an interface 130. Such a user node 120 may be, but is not limited to, a computer node, a personal computer (PC), a notebook computer, a cellular phone, a smartphone, a tablet device, a wearable device, and so on. The server 110 typically contains several components such as a processor or processing unit 140 and a memory 150. The interface 130 may be a network interface providing wired and/or wireless connectivity for a local area network (LAN), the Internet, and the like. Alternatively or collectively, the network interface 130 may be a serial bus, for example, a universal serial bus (USB) for connecting peripheral devices.
[0014] The memory 150 further contains instructions 160 executed by the processor 140.
The system 100 optionally includes a speech recognition (SR) system 170 that may be an integral part of the memory 150, or a separate entity coupled to the server 110. According to one embodiment, the server 110 is configured to identify, using the SR system 170, speech samples pronounced by a user. According to another embodiment, the server 110 may be configured to perform digital signal processing (DSP) when the pronunciation is inconsistent. This inconsistency may be expressed, for example, in a sound volume, a speed of the pronunciation, and a tone of speech.
[0015] According to yet another embodiment, the server 110 is further configured to receive speech samples containing speech units through, for example, the interface 130. A speech sample may be a word, a phrase, a sentence, etc. A speech unit is a distinct unit of sound in a specified language or dialect used to distinguish one word from another. As an example, p, b, d, and t may distinguish between the English words pad, pat, bad, and bat. Each speech unit can be classified as a phoneme, a bi-phone, or a tri-phone. A phoneme is the basic unit of a language's phonology; one or more phonemes combine to form the larger meaningful units of bi-phones or tri-phones.
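As an informal illustration (not part of the original disclosure), the following Python sketch derives bi-phone and tri-phone units from a phoneme sequence; the ARPAbet-style symbols used for the word "pad" are an assumption chosen for the example.

```python
# Illustrative sketch only: deriving bi-phone and tri-phone speech units
# from a phoneme sequence. Symbols are hypothetical ARPAbet-style labels.

def derive_units(phonemes):
    """Return the phonemes plus all bi-phones and tri-phones they form."""
    units = list(phonemes)                                               # phonemes
    units += [tuple(phonemes[i:i+2]) for i in range(len(phonemes) - 1)]  # bi-phones
    units += [tuple(phonemes[i:i+3]) for i in range(len(phonemes) - 2)]  # tri-phones
    return units

# The English word "pad" as a phoneme sequence.
print(derive_units(["p", "ae", "d"]))
# ['p', 'ae', 'd', ('p', 'ae'), ('ae', 'd'), ('p', 'ae', 'd')]
```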
[0016] The system 100 includes one or more speech samples libraries 180-1 through 180-m (for the sake of simplicity and without limitation, speech samples libraries 180-1 through 180-m may be referred to individually as a speech samples library 180 or collectively as speech samples libraries 180) that may be an integral part of the memory 150, or a separate entity coupled to the server 110. Each speech samples library 180 contains personalized speech samples of speech units. Moreover, each speech samples library 180 may maintain information to be used for TTS synthesis.
[0017] The server 110, in an embodiment, is configured to analyze each speech samples library 180. The server 110 typically identifies one or more speech units stored in the speech samples library 180 as well as the speech units that are missing and, thus, must be added. This identification is usually performed respective of a threshold of speech units required to reach a desired quality. When a speech sample is received, the server 110 is then configured to analyze the speech sample and the speech units comprised within. The analysis may include, for example, identification of neighbors of each speech unit in the speech sample, determination of a location of each speech unit in the speech sample, analysis of musical parameters of each speech unit, etc. The analysis process is discussed further herein below with respect to Fig. 2.
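A minimal sketch of this identification step, assuming a set-based model in which the desired quality is defined by a required inventory of speech units; the function names, the sample inventory, and the coverage threshold are all hypothetical.

```python
# Hypothetical set-based model of "stored vs. missing" speech units.

def missing_units(library_units, required_units):
    """Units that must still be collected to reach the desired quality."""
    return set(required_units) - set(library_units)

def reached_quality(library_units, required_units, threshold=1.0):
    """True if the library covers at least `threshold` of the required inventory."""
    covered = len(set(required_units) & set(library_units))
    return covered / len(set(required_units)) >= threshold

required = {"p", "b", "d", "t", "ae"}
stored = {"p", "d", "ae"}
print(sorted(missing_units(stored, required)))   # ['b', 't']
print(reached_quality(stored, required, 0.8))    # False (3/5 = 0.6)
```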
[0018] Moreover, the server 110 may also determine a quality of the speech samples. The speech units are stored in the speech samples library 180 under the supervision of the server 110 in real-time respective of a priority determined for each speech unit. The server 110 takes into consideration the analysis results to determine the priority of each speech unit. The supervised creation of the speech samples library 180 assists the server 110 in determining whether the speech samples library 180 has reached the desired quality. The process of analyzing a speech samples library and the speech samples contained therein to perform supervised creation of a speech samples library 180 of a desired quality is discussed further herein below with respect to Fig. 2.
[0019] Fig. 2 shows an exemplary and non-limiting flowchart 200 describing the supervised creation of a personalized speech samples library according to an embodiment. A personalized speech samples library is a speech samples library that achieves a desired quality. The desired quality may be, e.g., a level of quality that is predefined, a level of quality decided by a user in real time, and the like. In this embodiment, the personalized speech samples library is constructed based on speech samples contained in an existing speech samples library. In various other embodiments, a personalized speech samples library may be constructed without utilizing an existing speech samples library.
[0020] In S210, upon receiving a request to create a personalized speech samples library, one or more speech units stored in an existing or preconfigured speech samples library (e.g., speech samples library 180) are tracked for analysis. In an embodiment, a server (e.g., server 110) tracks the one or more speech units. In S220, it is checked whether the speech samples library has reached a desired quality and, if so, execution terminates; otherwise, execution continues with S230. Determination of whether a speech samples library has achieved a desired quality is discussed further herein below with respect to Fig. 3.
[0021] In S230, one or more speech samples are received. In an embodiment, the one or more speech samples are received from the speech samples library. The speech samples library may lack one or more speech units that are necessary to achieve a desired quality. A speech sample may be, but is not limited to, a word, a sequence of words, a sentence, and the like. In an embodiment, the identities of speech units that are missing respective of the speech units existing in the speech samples library are determined. This determination may further include determining a threshold of speech units that are required for the desired quality. A speech unit may be, but is not limited to, a phoneme, a bi-phone, a tri-phone, and the like.
[0022] According to one embodiment, the speech samples may be received through a speech recognition (SR) system (e.g., SR system 170). According to another embodiment, a request is sent to a user to pronounce one or more speech samples. The request may be sent through an appropriate interface (e.g., interface 130). Moreover, the received speech samples may be sent back in real-time to verify that the identification of such speech samples was correct.
[0023] In S240, one or more speech units of the received speech samples are analyzed. In an embodiment, each speech unit's neighbors within the speech sample are identified. A neighbor is a speech unit that precedes or follows the analyzed speech unit when the speech units of a speech sample are arranged in a sequential order (e.g., according to location within the speech sample). In an embodiment, neighbors only include speech units that immediately precede or follow the analyzed speech unit. In another embodiment, a parameters analysis is performed for each speech unit. Such an analysis may include, but is not limited to, identification of musical parameters such as pitch characteristics, duration, volume, and so on.
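A brief sketch of the neighbor analysis in S240, under the embodiment in which neighbors are only the immediately preceding and following units; representing a sample as an ordered list of units is an assumption for illustration.

```python
# Hypothetical neighbor lookup: a sample is an ordered list of units.

def neighbors(sample_units, index):
    """Immediate neighbors of the unit at `index` within a speech sample."""
    prev_unit = sample_units[index - 1] if index > 0 else None
    next_unit = sample_units[index + 1] if index < len(sample_units) - 1 else None
    return prev_unit, next_unit

sample = ["p", "ae", "d"]          # units in sequential order
print(neighbors(sample, 1))        # ('p', 'd')
print(neighbors(sample, 0))        # (None, 'ae')
```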
[0024] In S250, a priority of the speech samples and the respective speech units is determined. A priority may be, but is not limited to, one of several categories (e.g., low, medium, or high), a numerical value associated with levels of priority (e.g., zero through ten, wherein zero represents the lowest priority and ten represents the highest priority), and so on. Determination of priority is discussed further herein below with respect to Fig. 4.
[0025] In S260, the analysis and its respective speech units are stored in the speech samples library. The speech units are typically stored respective of the priority of those speech units or of speech samples containing the speech units. In an embodiment, speech units with higher priority are stored earlier in a sequential order than speech units with lower priority.
[0026] In S270, the quality of the speech samples library is determined respective of the analysis made for each speech unit and the respective determined priority. A value representing the quality may be displayed to the user through the interface. In S280, it is checked whether there are additional speech units that are required to be added; if so, execution continues with S230; otherwise, execution terminates.
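For illustration, the overall Fig. 2 loop can be condensed into the following non-authoritative Python sketch. The callables `receive_sample`, `analyze`, and `prioritize` stand in for steps S230 through S250 described above; they, and the choice of a set as the library representation, are assumptions rather than the patent's implementation.

```python
# Condensed sketch of the S210-S280 loop under assumed representations.

def build_personalized_library(library, required_units, receive_sample,
                               analyze, prioritize, max_rounds=100):
    """Grow `library` (a set of speech units) until it covers `required_units`."""
    for _ in range(max_rounds):
        if set(required_units) <= library:        # S220: desired quality reached
            break
        sample = receive_sample()                 # S230: e.g., via the SR system
        units = analyze(sample)                   # S240: neighbors, musical parameters
        ranked = sorted(units, key=prioritize, reverse=True)      # S250: priority
        library.update(u for u in ranked if u in required_units)  # S260: store
        # S270/S280: quality is re-evaluated at the top of the next iteration
    return library
```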
[0027] Fig. 3 illustrates an exemplary and non-limiting flowchart S220 for determination of whether a desired quality has been achieved by a speech samples library according to an embodiment.
[0028] In S310, speech samples and requirements for desired quality are retrieved. A speech sample may be, but is not limited to, a word, a sequence of words, a sentence, and the like. In an embodiment, the speech samples may be retrieved from an existing speech samples library. In another embodiment, the speech samples may be retrieved from input received from a user, an electronic sound database, and the like. The requirements for desired quality may be, e.g., a minimum number of speech units, a set of required speech units, a set of speech samples containing all required speech units, and the like. A speech unit may be, but is not limited to, a phoneme, a bi-phone, a tri-phone, etc.
[0029] In S320, the retrieved speech samples are analyzed to determine existing speech units within each speech sample. In S330, the existing speech units are analyzed to determine the suitability of each speech unit. Suitability may be determined based on musical parameters of the speech unit including, but not limited to, volume, clarity, pitch, tone, duration, etc. Suitability of a speech unit may be, e.g., predefined, or may be determined by the user in real-time. In embodiments where suitability is determined by a user in real-time, the speech units may be displayed and played on a user node, thereby enabling the user to decide whether each speech unit is suitable according to his or her preferences.
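A minimal sketch of the S330 suitability test, assuming a threshold model in which a unit is suitable when its measured musical parameters fall within predefined ranges; the parameter names, units, and ranges below are hypothetical, not values from the disclosure.

```python
# Hypothetical range-based suitability test over musical parameters.

SUITABILITY_RANGES = {
    "volume_db": (-30.0, -6.0),    # neither too quiet nor clipped
    "pitch_hz": (60.0, 400.0),     # plausible speaking pitch
    "duration_s": (0.02, 0.5),     # plausible unit duration
}

def is_suitable(unit_params):
    """True if every measured parameter lies within its allowed range."""
    return all(lo <= unit_params[name] <= hi
               for name, (lo, hi) in SUITABILITY_RANGES.items())

def suitable_units(units):
    """S330 result: unsuitable units are excluded (see [0030])."""
    return [u for u, params in units if is_suitable(params)]

units = [("p", {"volume_db": -12.0, "pitch_hz": 120.0, "duration_s": 0.08}),
         ("ae", {"volume_db": -2.0, "pitch_hz": 120.0, "duration_s": 0.09})]
print(suitable_units(units))   # ['p']  ('ae' is too loud, i.e., likely clipped)
```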
[0030] In an embodiment, suitable speech units are compiled in a list or number of speech units. In that embodiment, the results of the suitability determination may be returned as the list or number of speech units. Thus, in that embodiment, unsuitable speech units are excluded from the results of the suitability determination. As a non-limiting example, if a speech sample of the word "incredulous" is analyzed (a word that includes four phonemes, each of which, for purposes of the example, is considered to be a speech unit), and three of the speech units are determined to be suitable, the results of the suitability analysis would only include those three suitable speech units.
[0031] In S340, the results of the suitability determination in S330 are compared to the requirements for desired quality. In S350, the results of the comparison are returned. It should be noted that the comparison results are utilized in determining whether the suitability required by the process discussed above has been achieved.
[0032] Fig. 4 is an exemplary and non-limiting flowchart S250 illustrating determination of priority according to an embodiment. A priority in the context of the disclosed embodiments may be, but is not limited to, a classification within one of several categories (e.g., low, medium, high, and the like), a numerical value associated with levels of priority (e.g., zero through ten, wherein zero represents the lowest priority and ten represents the highest priority), and so on.
[0033] In an embodiment, the priority represents the degree of importance that a speech unit has within a given speech samples library (e.g., speech samples library 180). Speech units demonstrating particularly unique features respective of other speech units may be determined as higher priority, since such unique speech units typically contribute more to the quality of a speech samples library than less unique speech units.
[0034] In S410, the speech sample is retrieved. In an embodiment, the speech sample is retrieved from, e.g., a speech samples library 180. In S420, the speech sample is analyzed to identify existing speech units within the speech sample.
[0035] In S430, the priority of each speech unit is determined. The priority may be determined, for example, respective of the analysis and the desired quality of the speech samples library. In an embodiment, the priority may be determined respective of a quality level of the received speech sample. Specifically, speech units associated with speech samples having poor quality are considered low priority, while speech units associated with speech samples having high quality are considered high priority. As a non-limiting example, speech units of clear sounding words will get higher priority than speech units of words whose respective sounds are muddled or otherwise distorted.
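As a hedged illustration of the quality-driven priority in S430, the sketch below maps an assumed per-sample quality score in [0.0, 1.0] onto the zero-through-ten priority scale mentioned above; the quality measure itself is an assumption (a real system might derive it from, e.g., a signal-to-noise estimate).

```python
# Hypothetical mapping from sample quality to the 0-10 priority scale.

def priority_from_sample_quality(sample_quality):
    """Map a sample quality in [0.0, 1.0] to a 0-10 priority value."""
    return round(10 * max(0.0, min(1.0, sample_quality)))

print(priority_from_sample_quality(0.95))  # 10 - a clear-sounding word
print(priority_from_sample_quality(0.20))  # 2  - a muddled or distorted word
```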
[0036] In a further embodiment, the priority of speech units may be determined respective of a variety of musical parameters of such speech units. In that embodiment, existence of a variety of musical parameters of such speech units may also lead to determination of a high priority speech unit. As a non-limiting example, the quality of the speech samples library may depend on, among other things, existence of a wide range of pitch characteristics of the speech samples.
[0037] In another embodiment, the priority may be determined respective of a significance of the speech units. In that embodiment, significant speech units may be considered high priority. The significance of a speech unit may be determined respective of a frequency of occurrence of the speech unit in a natural language speech corpus. The significance may also reflect whether one or more alternative speech units are found in the speech samples library: absence of such alternative speech units may lead to determination of a high-significance speech unit and, thus, would result in a high-priority speech unit.
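The significance heuristics of this embodiment might be sketched as follows; the toy corpus, the relative-frequency measure, and the 0.5 bonus for units lacking a stored alternative are illustrative assumptions, not values from the disclosure.

```python
# Hypothetical significance score: corpus frequency plus a bonus when
# no alternative for the unit exists in the library (see [0037]).

from collections import Counter

def significance(unit, corpus_units, library_units):
    """Higher when the unit is frequent and has no stored alternative."""
    freq = Counter(corpus_units)[unit] / max(1, len(corpus_units))
    has_alternative = unit in library_units
    return freq + (0.0 if has_alternative else 0.5)  # missing units score higher

corpus = ["p", "ae", "d", "p", "t", "ae", "p"]
print(significance("p", corpus, library_units={"p"}))  # frequent, has alternative
print(significance("t", corpus, library_units={"p"}))  # rarer, but no alternative
```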
[0038] In S440, the results of the priority determination are stored in a speech samples library. In an embodiment, these results are associated with the respective speech samples in the speech samples library.
[0039] The processes described herein with reference to Figs. 2-4 may be performed by the server 110. In another embodiment, a user node 120 may be configured to execute these processes.
[0040] It should be appreciated that the preceding embodiments for determining priority do not limit the methods available for determining the priority. Specifically, any of the above-described methods may be combined with each other and/or with other methods for determining the priority without departing from the scope of the disclosed embodiments. As an example, the priority may be determined based on both the significance of speech units and the variety of musical parameters. In such combinations, the priority may be determined to be high if, e.g., any of the methods used for determining priority yields a high priority result, or if an average of numerical values for priority yields a high priority result.
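A short sketch of the two combination rules named above: taking the maximum (high if any single method yields a high result) or averaging the numerical values. The per-method scores and the rule names are hypothetical.

```python
# Hypothetical combination of per-method 0-10 priority scores ([0040]).

def combined_priority(scores, rule="max"):
    """Combine per-method priority scores into a single value."""
    if rule == "max":
        return max(scores)            # high if any single method says high
    return sum(scores) / len(scores)  # average of the numerical values

per_method = [9, 4, 6]  # e.g., significance, sample quality, parameter variety
print(combined_priority(per_method, "max"))  # 9
print(combined_priority(per_method, "avg"))  # 6.33...
```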
[0041] The various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units ("CPUs"), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such a computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.
[0042] All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.

Claims

CLAIMS
What is claimed is:
1. A computerized method for supervised creation of a new speech samples library for text-to-speech (TTS) synthesis, comprising:
tracking at least one speech unit in an existing speech samples library to determine if the existing speech samples library achieves a desired quality;
receiving at least one speech sample;
analyzing the at least one received speech sample to identify at least one speech unit necessitated to obtain the desired quality of the speech samples library; and
storing the at least one necessary speech unit in the speech samples library.
2. The computerized method of claim 1, wherein the desired quality is determined respective of a predefined threshold.
3. The computerized method of claim 1, wherein the at least one speech unit is any of: a phoneme, a bi-phone, and a tri-phone.
4. The computerized method of claim 1, further comprising:
analyzing at least one musical parameter related to the at least one speech unit.
5. The computerized method of claim 4, wherein the at least one musical parameter is any of: pitch characteristics, duration features, and a sound volume.
6. The computerized method of claim 1, wherein the analysis of the at least one speech sample further comprises:
determining a quality level of the at least one received speech sample.
7. The computerized method of claim 6, wherein the identification of the at least one speech unit further comprises at least one of: analyzing a location of the at least one speech unit within the at least one speech sample, analyzing neighbors of the at least one speech unit in the at least one speech sample, and analyzing a significance of the at least one speech unit.
8. The computerized method of claim 7, wherein the significance of the at least one speech unit is determined based on at least one of: a frequency of occurrence of the at least one speech unit in the speech samples library, a frequency of occurrence of the at least one speech unit in a natural language speech corpus, an existence of alternative speech units, a lack of any alternative speech units.
9. The computerized method of claim 1, further comprising:
sending a request to at least one user to pronounce the at least one speech sample.
10. The computerized method of claim 9, further comprising:
performing digital signal processing (DSP) when the at least one pronounced speech sample is inconsistent.
11. The computerized method of claim 1, wherein the at least one speech sample is at least one of: a word, a phrase, and a sentence.
12. A non-transitory computer readable medium having stored thereon instructions for causing one or more processing units to execute the method according to claim 1.
13. A system for supervised creation of a speech samples library for text-to-speech (TTS) synthesis, comprising:
a processor; and
a memory, wherein the memory contains instructions that, when executed by the processor, configure the system to:
track at least one speech unit in an existing speech samples library to determine if the existing speech samples library achieves a desired quality;
receive at least one speech sample;
analyze the at least one received speech sample to identify at least one speech unit necessitated to obtain the desired quality of the speech samples library; and
store the at least one necessary speech unit in the speech samples library.
14. The system of claim 13, wherein the system further comprises:
a speech recognition (SR) system, wherein the SR system is configured to identify at least one speech sample pronounced by a user.
15. The system of claim 13, wherein the system is further configured to:
display a value representing a quality of the speech samples library via an interface.
16. The system of claim 15, further configured to:
return the at least one received speech sample to verify a correct identification of the at least one identified speech sample.
17. The system of claim 13, wherein the desired quality is determined respective of a predefined threshold.
18. The system of claim 13, wherein the at least one speech unit is any of: a phoneme, a bi-phone, and a tri-phone.
19. The system of claim 13, wherein the system is further configured to:
analyze at least one musical parameter related to the at least one speech unit.
20. The system of claim 19, wherein the at least one musical parameter is at least one of: pitch characteristics, duration features, and a sound volume.
21. The system of claim 13, wherein the identification of the at least one speech unit further comprises at least one of: analyzing a location of the at least one speech unit within the at least one speech sample, analyzing neighbors of the at least one speech unit in the at least one speech sample, and analyzing a significance of the at least one speech unit.
22. The system of claim 21, wherein the significance of the at least one speech unit is determined based on at least one of: a frequency of occurrence of the at least one speech unit in the speech samples library, a frequency of occurrence of the at least one speech unit in a natural language speech corpus, an existence of alternative speech units, a lack of any alternative speech units.
23. The system of claim 13, wherein the system is further configured to:
send a request to at least one user to pronounce the at least one speech sample.
24. The system of claim 23, wherein the system is further configured to:
perform digital signal processing (DSP) when the at least one pronounced speech sample is inconsistent.
25. The system of claim 13, wherein the at least one received speech sample is any of: a word, a phrase, and a sentence.
PCT/US2014/035436 2013-04-26 2014-04-25 A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis Ceased WO2014176489A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361816176P 2013-04-26 2013-04-26
US61/816,176 2013-04-26

Publications (2)

Publication Number Publication Date
WO2014176489A2 true WO2014176489A2 (en) 2014-10-30
WO2014176489A3 WO2014176489A3 (en) 2014-12-18

Family

ID=51792516

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/035436 Ceased WO2014176489A2 (en) 2013-04-26 2014-04-25 A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis

Country Status (1)

Country Link
WO (1) WO2014176489A2 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5850629A (en) * 1996-09-09 1998-12-15 Matsushita Electric Industrial Co., Ltd. User interface controller for text-to-speech synthesizer
US6505158B1 (en) * 2000-07-05 2003-01-07 At&T Corp. Synthesis-based pre-selection of suitable units for concatenative speech
EP2140448A1 (en) * 2007-03-21 2010-01-06 Vivotext Ltd. Speech samples library for text-to-speech and methods and apparatus for generating and using same

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10388272B1 (en) 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
US10573312B1 (en) 2018-12-04 2020-02-25 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
US10672383B1 (en) 2018-12-04 2020-06-02 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences
US10971153B2 (en) 2018-12-04 2021-04-06 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
US11017778B1 (en) 2018-12-04 2021-05-25 Sorenson Ip Holdings, Llc Switching between speech recognition systems
US11145312B2 (en) 2018-12-04 2021-10-12 Sorenson Ip Holdings, Llc Switching between speech recognition systems
US11170761B2 (en) 2018-12-04 2021-11-09 Sorenson Ip Holdings, Llc Training of speech recognition systems
US11594221B2 (en) 2018-12-04 2023-02-28 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
US11935540B2 (en) 2018-12-04 2024-03-19 Sorenson Ip Holdings, Llc Switching between speech recognition systems
US12380877B2 (en) 2018-12-04 2025-08-05 Sorenson Ip Holdings, Llc Training of speech recognition systems
US11488604B2 (en) 2020-08-19 2022-11-01 Sorenson Ip Holdings, Llc Transcription of audio
US12499874B2 (en) 2023-12-13 2025-12-16 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences

Also Published As

Publication number Publication date
WO2014176489A3 (en) 2014-12-18

Similar Documents

Publication Publication Date Title
KR102582291B1 (en) Emotion information-based voice synthesis method and device
CN108573693B (en) Text-to-speech system and method, and storage medium therefor
CN107590135B (en) Automatic translation methods, devices and systems
US20140236597A1 (en) System and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
EP3832644A1 (en) Neural speech-to-meaning translation
JP6812843B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
US11574637B1 (en) Spoken language understanding models
WO2017067206A1 (en) Training method for multiple personalized acoustic models, and voice synthesis method and device
US11158308B1 (en) Configuring natural language system
US8447603B2 (en) Rating speech naturalness of speech utterances based on a plurality of human testers
JP2007249212A (en) Method, computer program and processor for text speech synthesis
US11615787B2 (en) Dialogue system and method of controlling the same
CN110600002B (en) Voice synthesis method and device and electronic equipment
JP2017032839A (en) Acoustic model learning device, voice synthesis device, acoustic model learning method, voice synthesis method, and program
KR20230079503A (en) Sample generation method and device
CN105609097A (en) Speech synthesis apparatus and control method thereof
JP2019179257A (en) Acoustic model learning device, voice synthesizer, acoustic model learning method, voice synthesis method, and program
CN112908308B (en) Audio processing method, device, equipment and medium
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
CN114566140B (en) Speech synthesis model training method, speech synthesis method, equipment and product
WO2014176489A2 (en) A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
WO2014183411A1 (en) Method, apparatus and speech synthesis system for classifying unvoiced and voiced sound
CN115050351A (en) Method and device for generating timestamp and computer equipment
CN116917984A (en) Interactive content output
KR102277205B1 (en) Apparatus for converting audio and method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14787390

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 14787390

Country of ref document: EP

Kind code of ref document: A2

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02/05/2016)
