
WO1994015330A1 - Method and apparatus for automatic evaluation of pronunciation - Google Patents

Method and apparatus for automatic evaluation of pronunciation

Info

Publication number
WO1994015330A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
hidden markov
markov model
acoustic feature
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US1993/012399
Other languages
French (fr)
Inventor
Jared C. Bernstein
Michael H. Cohen
Hy Murveit
Mitchel Weintraub
Dimitry Rtischev
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SRI International Inc
Original Assignee
SRI International Inc
Stanford Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SRI International Inc, Stanford Research Institute filed Critical SRI International Inc
Publication of WO1994015330A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]

Definitions

  • Each state has a score, stored in logarithmic form, which is the probability evaluated for the portion of the maximum likelihood path going through that state.
  • Each phone probability is determined by summing its constituent state probabilities (Step L). Each word probability is determined by summing its constituent phone probabilities (Step M), and the speech pattern probability is determined by summing the word probabilities (Step N). The speech pattern probability is the probability score for the acoustic feature sequence.
  • the overall score may comprise the weighted combination of individual phone probabilities.
  • Each of the individual phone probabilities may be weighted before summing according to the importance of the phone to judgment of pronunciation quality. For example, a pause phone (Figure 3) may be weighted at zero, since it contains minimal information about the quality of pronunciation.
  • the weights of some phones may be adjusted for selected target user populations to take into account predisposition to certain pronunciation patterns.
  • a segment of the maximum likelihood path is selected for backtracing rather than the entire maximum likelihood path (step K1).
  • the selected section may be a particular state, phone, or word for which an evaluation score is desired to diagnose a particular pronunciation difficulty.
  • the probability score is then the probability evaluated for the selected section.
  • the probability score is determined using a "forward" procedure rather than by alignment and backtracing.
  • the forward procedure is described in Rabiner 89, which is incorporated herein by reference, and is a procedure for determining the cumulative "forward" probability that the hidden Markov model would have generated the acoustic features observed up to a given frame.
  • FIG. 9 is a flowchart illustrating the forward procedure.
  • First states allowed by the grammar network are determined (Step O).
  • a state probability is calculated which is the probability that the state would have generated the acoustic features observed during the first frame (Step P).
  • the successor states allowed by the grammar network are determined (Step Q) and paths are extended from all the first states to all the allowed successor states. The probability of each such path is calculated by multiplying the probability of the first state which begins the path by the transition probability and by the probability that the successor state would have generated the acoustic features observed at that frame.
  • the successor state probability is then calculated to be the total probability of all the paths arriving at the successor state (Step R).
  • Steps Q and R are repeated until the last frame of the observed acoustic feature sequence is reached (Step S).
  • The probabilities determined for all the allowed last states are then summed to determine the probability score (Step T).
  • the probability score is the output of the HMM processor 11 and is transformed by the scaler 14 into a pronunciation evaluation score which is output to the user via output device 4. Scores obtained according to the invention as described above correlate very well with human evaluations of pronunciation.
  • the attached appendix contains a source-code listing containing one operational embodiment of selected elements of the claimed invention.
  • the machine-readable form of the listing can be compiled using a C language compiler and executed on a central processing unit of a system equipped with the balance of the elements of the claimed invention.
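The forward procedure sketched in the steps above can be illustrated as follows. This is a minimal sketch, not the patent's own implementation: the two-state model, its probabilities, and the feature names are all invented for illustration, and a practical version would work in the log domain to avoid numerical underflow.

```python
def forward_score(init, trans, emit, features):
    """Forward procedure sketch: alpha[s] accumulates the total probability,
    summed over ALL paths (not just the best one), of generating the
    features observed so far and ending in state s."""
    # Step O/P: initial states and their first-frame probabilities.
    alpha = {s: init[s] * emit[s][features[0]] for s in init}
    # Steps Q/R, repeated per frame: extend paths to successors and sum.
    for f in features[1:]:
        alpha = {
            s2: sum(alpha[s] * trans[s].get(s2, 0.0) for s in alpha) * emit[s2][f]
            for s2 in emit
        }
    # Steps S/T: sum the probabilities of all allowed last states.
    return sum(alpha.values())

# Invented toy model: two states, two discrete acoustic features "A" and "B".
init = {"s0": 1.0, "s1": 0.0}
trans = {"s0": {"s0": 0.6, "s1": 0.4}, "s1": {"s1": 1.0}}
emit = {"s0": {"A": 0.8, "B": 0.2}, "s1": {"A": 0.3, "B": 0.7}}
print(forward_score(init, trans, emit, ["A", "B"]))
```

Unlike alignment, which keeps only the single maximum likelihood path, this sums over every path the grammar allows.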

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Pronunciation is automatically evaluated using techniques adapted from automatic speech recognition, as shown in Figure 1. Speech, represented as a sequence of acoustic features (8), is compared to a hidden Markov model (11) of speech patterns (12) corresponding to the preselected script to obtain a probability score which is a measure of the likelihood that the hidden Markov model would have generated the sequence of observed acoustic features. In a specific embodiment, the probability score is determined by processes referred to as alignment and backtracing. Alignment is the process of calculating a maximum likelihood path established through the use of the hidden Markov model. Backtracing is the process of finding the path backwards through the network of states of the hidden Markov model and obtaining a probability score.

Description

METHOD AND APPARATUS FOR AUTOMATIC EVALUATION OF PRONUNCIATION

BACKGROUND OF THE INVENTION
This invention relates to automatic pronunciation evaluation. An application is in computer-aided foreign language instruction and assessment.
Techniques for computer analysis of speech have heretofore been developed for recognition rather than
pronunciation evaluation. One approach to automated speech recognition has been to model speech as a Markov source useful in a hidden Markov model (HMM) speech recognition processor. The speech units being modeled are represented by hidden
Markov models.
In an HMM speech recognition system, a probability distribution over one or more acoustic features is associated with each state. Probabilities associated with the
transitions leading from each state specify the probability of taking each transition upon exiting the selected state. The acoustic feature distributions are typically used to model speech characteristics such as spectra. The transition probabilities implicitly model duration. The probability distributions for transitions and acoustic features are estimated using examples of speech collected from a diverse population. This model of speech is referred to as the hidden Markov model, since the states are not directly observed.
Recognition consists of determining the path through the hidden Markov model that has the highest probability of generating the observed acoustic feature sequence.
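The HMM structure just described can be made concrete with a toy sketch. All state names, transition probabilities, emission probabilities, and features below are invented for illustration (real systems, as the text notes, use continuous densities over spectral features rather than a discrete feature alphabet):

```python
# Toy discrete-feature HMM (all values invented, not from the patent).
trans = {            # trans[s][s2] = P(next state is s2 | current state s)
    "s0": {"s0": 0.6, "s1": 0.4},
    "s1": {"s1": 0.7, "s2": 0.3},
    "s2": {"s2": 1.0},
}
emit = {             # emit[s][f] = P(observing feature f while in state s)
    "s0": {"A": 0.8, "B": 0.2},
    "s1": {"A": 0.3, "B": 0.7},
    "s2": {"A": 0.5, "B": 0.5},
}

def path_probability(path, features):
    """P(this state path generates this feature sequence), one feature
    per frame: first-frame emission, then transition * emission per frame."""
    p = emit[path[0]][features[0]]
    for prev, cur, f in zip(path, path[1:], features[1:]):
        p *= trans[prev][cur] * emit[cur][f]
    return p

print(path_probability(["s0", "s0", "s1"], ["A", "A", "B"]))
```

Recognition, in these terms, is a search for the `path` argument that maximizes this quantity over all paths the model allows.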
Recently, Ryohei Nakatsu of Japan, aware of early unpublished development work of the present inventors, proposed a speech pronunciation evaluation method (Abstract of paper by Hamada and Nakatsu entitled "Evaluation of English Pronunciation Based on the Static and Dynamic Spectral
Characteristics of Words Spoken By Japanese," Journal of the Acoustical Society of America, Vol. 84, Sup. 1, Fall 1988). That speech pronunciation evaluation method depends upon the use of dynamic time warping (DTW) and metric scoring based on distance in acoustic parameter space. These techniques are generally known in the speech recognition art and have been found to be inferior for aligning speech and recognizing speech. What is needed is a technique which is relatively invariant across a wide range of voices of speakers.
SUMMARY OF THE INVENTION
According to the invention, pronunciation is
automatically evaluated using techniques adapted from
automatic speech recognition. A student is prompted to speak a preselected script into an input device in order to allow a machine according to the invention to evaluate the quality of the pronunciation. The speech of the student is divided into time frames and each frame of speech is characterized by one or more acoustic features. The student's speech, represented as a sequence of acoustic features, is compared to a hidden Markov model of speech patterns corresponding to the
preselected script to obtain a probability score which is a measure of the likelihood that the hidden Markov model would have generated the sequence of observed acoustic features. In a specific embodiment, the probability score is determined by processes herein referred to as alignment and backtracing.
Alignment is the process of calculating a maximum likelihood path, the maximum likelihood path being the path through the hidden Markov model with the maximum likelihood of generating the acoustic feature sequence extracted from the speech of the user. Backtracing is the process of finding the path
backwards through the network of states of the hidden Markov model and obtaining a probability score indicative of the probability that the observed speech acoustic features would have been generated by the hidden Markov model. (In contrast, a conventional HMM speech recognizer would use the maximum likelihood path identified during alignment and backtracing to recognize a user's speech.) The probability score is
transformed into an evaluation score by use of scaling, and it is then displayed. The present invention, based on heretofore undisclosed recognition of the applicability of certain underlying principles of HMM speech recognition to the
solution of the pronunciation evaluation problem, is believed to surpass in accuracy all known and proposed pronunciation evaluation systems.
The invention will be better understood by reference to the following detailed description in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a device operative in accordance with the invention.
FIG. 2 is a diagram illustrating a portion of a hidden Markov model finite state machine adapted for
pronunciation evaluation according to the invention.
FIG. 3 is a diagram illustrating a word-level grammar network adapted for pronunciation evaluation according to the invention.
FIG. 4 is a flowchart describing the steps of aligning an observed acoustic feature sequence to a speech pattern model according to the Viterbi search technique.

FIG. 5 is a flowchart describing the step of
calculating path probabilities according to the Viterbi search technique.
FIG. 6 is a flowchart describing the step of
activating phones during alignment.
FIG. 7 is a flowchart describing the step of
backtracing through the maximum likelihood path.
FIG. 8 is a diagram illustrating a portion of the backtrace structure used in computing the probability score for a sample speech pattern in accordance with the invention.
FIG. 9 is a flowchart describing the use of the forward procedure to determine the probability score for an observed acoustic feature sequence given a hidden Markov model.
DESCRIPTION OF SPECIFIC EMBODIMENTS
The invention allows pronunciation of preselected scripts to be automatically evaluated by calculating the probability that an appropriate hidden Markov model would generate the acoustic features derived from the actual
rendition of the script as spoken by the speaker being
evaluated. The probability is then transformed into an evaluation score. The rendition of the script may be a word, a phrase, a sentence, a paragraph or other unit of speech.
Figure 1 is a block diagram of an exemplary real-time pronunciation evaluation apparatus 10 comprising a lesson controller apparatus 2, an output device 4, a speech input device 6, speech information storage 7, a feature extractor 8, an HMM processor 11, speech pattern template storage 12 and a scaler 14. The output device 4 and speech input device 6 are positioned to permit interaction between the pronunciation evaluation apparatus 10 and a student 3.
In operation, the lesson controller 2 is used to select an evaluation speech pattern and present it to a student 3 through output device 4. The student 3 then recites the script into speech input device 6. Speech input device 6 converts the student's speech to machine-readable speech information. Speech information storage 7 stores the machine-readable speech information. Feature extractor 8 registers the speech information by dividing it into frames and
characterizing the frames. The characterizing step comprises determining one or more acoustic features for each frame. The speech is thus available to the HMM processor 11 as a sequence of observed acoustic features.
The HMM processor 11 retrieves from speech pattern template storage 12 the template corresponding to the script selected by the lesson controller 2. Each template stored in speech pattern template storage 12 comprises the previously computed state transition probabilities, acoustic feature probability densities, and grammar network for the hidden Markov model of speech patterns corresponding to the
preselected script. The templates are derived from speech generated by a diverse population of speakers.
In one embodiment, a probability score for the observed feature sequence is generated by processes herein referred to as alignment and backtracing. Alignment is the process of calculating a maximum likelihood path, the maximum likelihood path being the path through the hidden Markov model with the maximum likelihood of generating the acoustic feature sequence extracted from the speech of the user. Backtracing is the process of recalling the probabilities evaluated for the maximum likelihood path to obtain an overall probability score.
The overall probability score is converted to a pronunciation evaluation score through the action of the scaler 14. In one embodiment, the scaler applies the
transformation: y=ax+b, to obtain y, where y is the
pronunciation evaluation score, a is the scaling factor, x is the probability score, and b is the scaling offset. The pronunciation evaluation score is then presented to the student 3 through output device 4.
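The scaling transformation is simple enough to state directly. The values of a and b below are invented for illustration; in practice they would be tuned so that scores land on the desired grading range:

```python
def scale_score(x, a, b):
    """Linear transformation y = a*x + b of the probability score x into a
    pronunciation evaluation score. a is the scaling factor and b the
    scaling offset; the values used below are illustrative only."""
    return a * x + b

# e.g. an overall log-probability score of -120 mapped with invented
# constants a = -0.25, b = 4.0:
print(scale_score(-120.0, a=-0.25, b=4.0))  # → 34.0
```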
The processes of alignment and backtracing will now be explained in greater detail. In one embodiment of the invention, the hidden Markov model stored in the template for the selected speech pattern has a hierarchical structure. The speech pattern is divided into words which are in turn divided into phones which are themselves in turn divided into the individual states of the hidden Markov model. The input speech is modeled as a Markov source which traces a path through the hidden Markov model, traversing many states and remaining in each state for one or more frames.
Allowable transitions from a state are constrained by a grammar network associated with the hidden Markov model constructed from speech patterns, preferably obtained by training a speech recognizer employing the speech of native speakers of the target language reciting the same preselected script. (A standard HMM-based speech recognizer on which such training can be performed is the DECIPHER system from SRI International of Menlo Park, California.) A grammar network is a catalog of states and their allowable links forming phones, words and sentences.
Figure 2 illustrates a portion of the hidden Markov model for a speech pattern and shows the transitions between states allowed by the grammar network associated with the hidden Markov model. Phones 16 and 18 are divided into states 20, 22, 24, 26, 28 and 30. Transitions are allowed from a state to itself or to the next state. The last state 24 in phone 16 may transition to the first state 26 of a successor phone 18. The identities of the successor phones to a
selected phone are defined by the grammar network. The grammar network may allow multiple successor phones to a selected phone (not shown). Transitions to another phone from states other than the final state in a selected phone may also be allowed (not shown). At any given frame, the list of phones which may be reached through transitions permitted by the grammar network is herein referred to as the active list of phones.
The grammar network also constrains transitions between words. Figure 3 illustrates a portion of the word-level grammar network for a sample speech pattern. The network inserts an optional pause phone between words which may be traversed to allow for speakers who paused between words. Otherwise, the pause is skipped. (The skip is
implemented as a transition arc which does not correspond to an acoustic feature.) For example, in the word-level grammar network 32, a pause phone 34 is inserted between the word "she" 36 and the word "had" 38.
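One plausible way to represent such a word-level network is an adjacency list mapping each phone to its allowed successors. The phone names below are an invented rough phone sequence for "she had", not taken from the patent; the optional pause appears as two parallel arcs out of the last phone of "she":

```python
# Hypothetical grammar network for "she had" with an optional pause phone.
successors = {
    "sh":    ["iy"],
    "iy":    ["pause", "hh"],   # pause arc OR skip arc straight to "had"
    "pause": ["hh"],
    "hh":    ["ae"],
    "ae":    ["d"],
    "d":     [],
}

def activate(active_phones):
    """One grammar-network update: extend the active list of phones with
    every phone reachable by an allowed transition."""
    new = set(active_phones)
    for p in active_phones:
        new.update(successors[p])
    return sorted(new)

# Both the pause and the first phone of the next word become active.
print(activate(["iy"]))  # → ['hh', 'iy', 'pause']
```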
Figure 4 is a flowchart illustrating the steps of alignment. In one embodiment of the present invention, alignment is performed using the Viterbi search method. The Viterbi search method is a dynamic programming technique named for its originator and disclosed in Viterbi 67, which is incorporated by reference. The HMM processor determines the maximum likelihood path through the hidden Markov model by evaluating, one frame at a time, the probability that each path through the model would have generated the acoustic feature sequence extracted from the input speech pattern.
The method of evaluating paths will now be described. Path probabilities are updated at each time frame. For each active phone, the path probabilities for paths going through that phone are updated using the Viterbi search technique (Step B). The active phone list is then updated in accordance with the grammar network (Step C). After this has been done for the entire speech pattern, the maximum
likelihood path is determined (Step D).
Figure 5 is a flowchart illustrating in greater detail the calculation of path probabilities in accordance with the Viterbi search technique. The active phones are identified (Step E). The states belonging to the identified active phones are identified (Step F). For each identified state, each path is extended to that state and a new
probability for the path is calculated. The probability is calculated by multiplying the previous probability of that path by the probability of taking the transition to that state and then further multiplying by the probability of each acoustic feature observed at that frame as derived from the acoustic feature distributions (Step G). For each state, the maximum likelihood path arriving at that state is preserved and the other paths arriving at the state are discarded because none of them can be the maximum likelihood path through the hidden Markov model. In a further pruning step, the probabilities of the paths arriving at each phone are evaluated, and if the probability of every path leading to the phone is below a preselected
threshold, the phone becomes inactive (Step H).
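The per-frame Viterbi update and pruning of Steps E through H can be sketched in the log domain as follows. This is an illustrative sketch only: the function name, the dictionary-based data structures, and the pruning convention are assumptions for clarity, not the implementation in the appendix.

```python
import math

def viterbi_update(states, transitions, log_obs, prev_scores, prune_threshold):
    """One frame of the Viterbi path-probability update (sketch of Steps E-H).

    states: destination state ids active at this frame
    transitions: dict (src, dst) -> log transition probability
    log_obs: dict state -> log probability of the observed features at this frame
    prev_scores: dict state -> best log path score at the previous frame
    Returns (new_scores, backpointers); states scoring below the threshold
    are pruned and drop out of the active set.
    """
    new_scores, backptr = {}, {}
    for dst in states:
        best, best_src = -math.inf, None
        for (src, d), log_tp in transitions.items():
            if d != dst or src not in prev_scores:
                continue
            cand = prev_scores[src] + log_tp      # extend path src -> dst
            if cand > best:
                best, best_src = cand, src
        if best_src is None:
            continue
        score = best + log_obs[dst]               # fold in observation probability
        if score >= prune_threshold:              # pruning: discard unlikely states
            new_scores[dst] = score
            backptr[dst] = best_src               # only the best arrival is kept
    return new_scores, backptr
```

Because only the maximum likelihood arrival at each state is preserved, the stored backpointers are sufficient to recover the single best path later during backtracing.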
In one embodiment, all paths with probability below a preselected threshold may be marked as not requiring further evaluation. However, if this threshold is set too high, there is a chance that the maximum likelihood path itself may be pruned.
In an alternative embodiment, when updating the probability of a path in step D, the observed acoustic features are weighted (step F1). Weighting consists of raising an observed acoustic feature value to an exponent, which is a preselected weight value, before incorporating the acoustic feature value into the path probability calculation. Each acoustic feature has a preselected weight for each state. Weights are adjusted to improve the correlation between evaluation scores generated by the pronunciation evaluation apparatus 10 and ones generated by human scorers.
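Since the path probabilities are accumulated in logarithmic form, raising a feature probability to a preselected weight is equivalent to multiplying its log probability by that weight. A minimal sketch of this weighting, with the function name and list-based interface assumed for illustration:

```python
import math

def weighted_log_obs(feature_probs, weights):
    """Weighted observation score for one state at one frame (sketch).

    Raising each per-feature probability p to a preselected weight w before
    multiplying it into the path probability is, in the log domain, w * log(p).
    feature_probs and weights are parallel lists, one entry per acoustic feature.
    """
    return sum(w * math.log(p) for p, w in zip(feature_probs, weights))
```

A weight of zero removes a feature's influence entirely, while weights above one amplify a feature's contribution to the path score.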
Figure 6 is a flowchart illustrating the step of activating phones in accordance with the grammar network.
Each active phone is identified (Step I). For each active phone, possible successor phones are identified from the grammar network (Step J). The successor phones are made active (Step K). Thus, at the next probability evaluation step, the first states of these phones will be potential destinations.
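Steps I through K can be sketched as a simple set expansion over the grammar network. The dictionary representation of the grammar and the function name are illustrative assumptions:

```python
def update_active_phones(active, grammar):
    """Sketch of Steps I-K: extend the active set with grammar successors.

    active: set of currently active phone labels
    grammar: dict mapping a phone to the list of its possible successor phones
    """
    activated = set(active)
    for phone in active:                        # Step I: each active phone
        for succ in grammar.get(phone, []):     # Step J: successors from grammar
            activated.add(succ)                 # Step K: make successor active
    return activated
```

At the next probability evaluation step, the first states of the newly activated phones become potential path destinations.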
The above steps are repeated until the end of the input speech pattern is reached. The maximum likelihood path is then identified (Step D).
Figure 7 is a flowchart describing the step of backtracing in accordance with the invention. After
alignment, the HMM processor 11 backtraces through the maximum likelihood path to determine a pronunciation evaluation score. During backtracing, the HMM processor 11 creates a record herein referred to as the backtrace structure.
Figure 8 is a diagram illustrating a portion of the backtrace structure record 100 for a sample speech pattern. The speech pattern 40 is composed of its constituent words 42, which are in turn broken down into constituent phones 44, which are themselves broken down into constituent states 46.
Each state has a score, stored in logarithmic form, which is the probability evaluated for the portion of the maximum likelihood path going through that state. The probabilities are stored in logarithmic form to facilitate mathematical manipulation, since addition may substitute for multiplication. Using logarithmic form, each phone
probability is determined by summing its constituent state probabilities (Step L). Each word probability is determined by summing its constituent phone probabilities (Step M), and the speech pattern probability is determined by summing the word probabilities (Step N). The speech pattern probability is the probability score for the acoustic feature sequence.
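In the log domain, the hierarchical aggregation of Steps L through N reduces to summation at each level. A minimal sketch, with the nested-list representation of the backtrace structure assumed for illustration:

```python
def phone_score(state_scores):
    """Step L: a phone's log score is the sum of its constituent state scores."""
    return sum(state_scores)

def word_score(phones):
    """Step M: a word's log score is the sum of its constituent phone scores."""
    return sum(phone_score(p) for p in phones)

def pattern_score(words):
    """Step N: the speech pattern's log score is the sum of its word scores."""
    return sum(word_score(w) for w in words)
```

Because the scores are logarithms of probabilities, each of these sums corresponds to a product of the underlying probabilities.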
The overall score may comprise the weighted combination of individual phone probabilities. Each of the individual phone probabilities may be weighted before summing according to the importance of the phone to judgment of pronunciation quality. For example, a pause phone (Figure 3) may be weighted at zero, since it contains minimal information about the quality of pronunciation. In addition, the weights of some phones may be adjusted for selected target user populations to take into account predisposition to certain pronunciation patterns.
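The weighted combination of phone probabilities described above can be sketched as follows. The pair-list interface, the default weight of 1.0 for unlisted phones, and the function name are assumptions for illustration:

```python
def weighted_pattern_score(phone_scores, phone_weights):
    """Weighted combination of per-phone log scores (sketch).

    phone_scores: list of (phone label, log score) pairs from the backtrace.
    phone_weights: dict mapping a phone label to its weight; a pause phone,
    for example, may be weighted at zero so that it contributes nothing.
    Phones not listed default to a weight of 1.0 (an assumption).
    """
    return sum(phone_weights.get(label, 1.0) * score
               for label, score in phone_scores)
```

Adjusting the weight table for a target user population is then a matter of editing the dictionary rather than the scoring logic.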
In an alternative embodiment, a segment of the maximum likelihood path is selected for backtracing rather than the entire maximum likelihood path (step K1). The selected section may be a particular state, phone, or word for which an evaluation score is desired to diagnose a particular pronunciation difficulty. The probability score is then the probability evaluated for the selected section.
In an alternative embodiment, the probability score is determined using a "forward" procedure rather than by alignment and backtracing. The forward procedure, described in Rabiner 89, which is incorporated herein by reference, is a procedure for determining the cumulative "forward" probability (Pf) that any path through a hidden Markov model would generate a given observed acoustic feature sequence. Figure 9 is a flowchart illustrating the forward procedure. The first states allowed by the grammar network are determined (Step O). Then, for each allowed first state, a state probability is calculated, which is the probability that the state would have generated the acoustic features observed during the first frame (Step P). Then the successor states allowed by the grammar network are determined (Step Q), and paths are extended from all the first states to all the allowed successor states. The probability of these paths is calculated by multiplying the first state probability for the first state which begins the path by the transition probability for the given path to the successor state, and by the probability that the successor state would have generated the acoustic features observed during the second frame. The successor state probability is then calculated as the total probability of all the paths arriving at the successor state (Step R).
Steps Q and R are repeated until the last frame of the observed acoustic feature sequence is reached (Step S). The probabilities determined for all the allowed last states are then summed to determine the probability score (Step T).
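Steps O through T can be sketched as the standard forward recursion, here in the probability domain for readability (in practice log arithmetic would be used; the data structures and function name are illustrative assumptions):

```python
def forward_probability(init, transitions, obs_probs, n_frames):
    """Forward procedure (sketch of Steps O-T).

    init: dict state -> probability of starting in that state (Step O)
    transitions: dict (src, dst) -> transition probability
    obs_probs: list with one dict per frame, mapping a state to the
               probability of the observed features given that state
    Returns Pf, the total probability summed over all paths.
    """
    # Step P: first-frame probabilities for the allowed first states
    alpha = {s: p * obs_probs[0].get(s, 0.0) for s, p in init.items()}
    for t in range(1, n_frames):            # Steps Q-S: extend frame by frame
        new_alpha = {}
        for (src, dst), tp in transitions.items():
            if src in alpha:
                # Step R: accumulate (not maximize) over all arriving paths
                new_alpha[dst] = (new_alpha.get(dst, 0.0)
                                  + alpha[src] * tp * obs_probs[t].get(dst, 0.0))
        alpha = new_alpha
    return sum(alpha.values())              # Step T: sum over allowed last states
```

The only structural difference from the Viterbi alignment is that paths arriving at a state are summed rather than maximized, so no single best path (and hence no backtrace) is produced.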
The probability score is the output of the HMM processor 11 and is transformed by the scaler 14 into a pronunciation evaluation score which is output to the user via output device 4. Scores obtained according to the invention as described above correlate very well with human evaluations of pronunciation.
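The scaling performed by the scaler 14 is not specified in detail above; one plausible sketch, offered purely as an assumption, normalizes the log probability by utterance length and maps it linearly onto a bounded score range:

```python
def scale_to_score(log_prob, n_frames, lo=-10.0, hi=0.0, max_score=10.0):
    """Map a log probability to an evaluation score (illustrative sketch only;
    the actual scaling function and all constants here are assumptions).

    Normalizes by the number of frames, then maps the per-frame range
    [lo, hi] linearly onto [0, max_score], clamping at the ends.
    """
    per_frame = log_prob / max(n_frames, 1)
    frac = (per_frame - lo) / (hi - lo)
    return max(0.0, min(max_score, frac * max_score))
```

Whatever its exact form, the scaling function would be fitted so that the resulting scores correlate with human evaluations of pronunciation.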
The attached appendix contains a source-code listing containing one operational embodiment of selected elements of the claimed invention. The machine-readable form of the listing can be compiled using a C language compiler and executed on a central processing unit of a system equipped with the balance of the elements of the claimed invention.
The invention has now been explained with reference to specific embodiments. Other embodiments will be apparent to those of ordinary skill in the art in view of the foregoing description. For example, preselected scripts may be delivered to a user via off-line means, such as a written guidebook, a newspaper advertisement promoting the service, or other visual or auditory form. It is therefore not intended that this invention be limited, except as indicated by the appended claims.

Claims

WHAT IS CLAIMED IS:
1. A method for evaluating pronunciation of a preselected script comprising the steps of:
a) establishing a hidden Markov model of speech patterns corresponding to said preselected script and storing the hidden Markov model of the speech pattern in machine- readable form in a first memory device;
b) capturing an acoustic speech signal from a user through a speech input device and storing said speech signal as machine-readable speech information in a second memory device;
c) extracting an observed acoustic feature sequence from said machine-readable speech information by dividing the speech information into frames and characterizing each one of said frames by at least one acoustic feature; thereafter
d) aligning the observed acoustic feature sequence with the hidden Markov model to obtain a structure registering a traceable path through the hidden Markov model, said
traceable path being the path through the hidden Markov model with the maximum likelihood of generating the observed
acoustic feature sequence; thereafter
e) backtracing through the structure to obtain a probability (Po) that the hidden Markov model of speech patterns corresponding to the preselected script would have traversed the traceable path and generated the observed acoustic feature sequence; thereafter
f) scaling the probability (Po) into an evaluation score; and thereafter
g) displaying said evaluation score.
2. The method according to claim 1 wherein the step of aligning is performed using the Viterbi search
technique.
3. The method according to claim 1 wherein during the step of aligning, assigning weights to acoustic features, said weights selected to correlate said evaluation score to a human generated evaluation score.
4. The method according to claim 1 wherein during the step of backtracing, weighting phone probabilities
according to an expectation of significance of the phone to a judging of pronunciation quality.
5. A method for diagnosing pronunciation deficiencies of a user, comprising the steps of:
a) establishing a hidden Markov model of speech patterns corresponding to a preselected script and storing the hidden Markov model in machine-readable form in a first memory device;
b) capturing an acoustic speech signal from a user through a speech input device and storing said speech signal as machine-readable speech information in a second memory device; thereafter
c) extracting an observed acoustic feature sequence from said machine-readable speech information by dividing the speech information into frames and characterizing each one of said frames by at least one acoustic feature; thereafter
d) aligning the observed acoustic feature sequence with the hidden Markov model to obtain a structure registering a traceable path through the hidden Markov model, said
traceable path being the path through the hidden Markov model with the maximum likelihood of generating the observed
acoustic feature sequence; thereafter
e) selecting a segment of the structure
corresponding to a section of the preselected speech pattern preselected for evaluation; thereafter
f) backtracing through the selected segment to obtain a probability (Po) that the hidden Markov model of speech patterns corresponding to the preselected script would have traversed the traceable path and generated the observed acoustic feature sequence; thereafter
g) scaling the probability (Po) into an evaluation score; and thereafter
h) displaying said evaluation score.
6. The method according to claim 5 wherein the step of aligning is performed using the Viterbi search
technique.
7. The method according to claim 5 wherein during the step of aligning, assigning weights to acoustic features, said weights selected to correlate said evaluation score to a human generated evaluation score.
8. The method according to claim 5 wherein during the step of backtracing, weighting phone probabilities
according to an expectation of significance of the phone to a judging of pronunciation quality.
9. A method for evaluating pronunciation of a preselected script comprising the steps of:
a) establishing a hidden Markov model of preselected speech patterns and storing the hidden Markov model in
machine-readable form in a first memory device;
b) capturing an acoustic speech signal from a person through a speech input device and storing said speech signal as machine-readable speech information in a second memory device; thereafter
c) extracting an observed acoustic feature sequence from said machine-readable speech information by dividing the speech information into frames and characterizing each one of said frames by at least one acoustic feature; thereafter
d) using a forward procedure to determine a forward probability (Pf) that the observed acoustic feature sequence would have been generated by the hidden Markov model of speech patterns corresponding to the preselected script traversing any traceable path; thereafter
e) scaling the probability (Pf) into an evaluation score; and thereafter
f) displaying said evaluation score.
10. The method according to claim 9 wherein during the forward procedure, assigning weights to acoustic features, said weights selected to correlate said evaluation score to a human generated evaluation score.
11. The method according to claim 9 wherein during the forward procedure, weighting phone probabilities according to an expectation of significance of the phone to a judging of pronunciation quality.
12. A method for evaluating pronunciation of a preselected script comprising the steps of:
a) establishing a hidden Markov model of speech patterns corresponding to the preselected script and storing the hidden Markov model in machine-readable form in a first memory device;
b) capturing an acoustic speech signal from a person through a speech input device and storing said speech signal as machine-readable speech information in a second memory device; thereafter
c) comparing the observed acoustic feature sequence to the hidden Markov model of the speech patterns to obtain a probability (Po), said probability (Po) being an approximate measure of the likelihood that the hidden Markov model of the speech patterns corresponding to the preselected script would generate the observed acoustic feature sequence; thereafter d) scaling the probability (Po) into an evaluation score; and thereafter
e) displaying said evaluation score.
13. The method according to claim 12 wherein during the step of comparing, assigning weights to acoustic features, said weights selected to correlate said evaluation score to a human generated evaluation score.
14. The method according to claim 12 wherein during the step of comparing, weighting phone probabilities according to an expectation of significance of the phone to a judging of pronunciation quality.
15. An apparatus for evaluating the pronunciation quality of speech, said apparatus comprising:
means for storing representations of pronunciation of speech patterns by a population of speakers, each said representation being a hidden Markov model defining a single script, said hidden Markov model forming a network of states;
means coupled to said storing means for selecting one of said stored representations for pronunciation by a user;
means for accepting speech input pronounced by the user, said speech input intended by the user to correspond to said single script;
means coupled to said accepting means for representing said speech input as an observed acoustic feature sequence;
a hidden Markov model processing means coupled to said storing means and to said extracting means for aligning said observed acoustic feature sequence to the hidden Markov model of speech patterns corresponding to said single script stored in said storing means, said processing means operative to determine a traceable path through said hidden Markov model, said traceable path being the path through said hidden Markov model with the maximum likelihood of generating said observed acoustic feature sequence, said processing means being further operative to backtrace through the network of states of the hidden Markov model to obtain a probability (Po) that the hidden Markov model would have generated the observed acoustic feature sequence;
means coupled to said processing means for transforming the probability (Po) obtained by said hidden Markov model processor into an evaluation score by use of scaling; and
means coupled to said transforming means for
displaying said evaluation score.
16. An apparatus for evaluating the pronunciation quality of speech, said apparatus comprising:
means for storing representations of pronunciation of speech patterns by a population of speakers, each said representation being a hidden Markov model defining a single script;
means coupled to said storing means for selecting one of said stored representations for pronunciation by a user;
means for accepting speech input pronounced by the user, said speech input intended by the user to correspond to said single speech pattern defined by said selected stored representation;
means coupled to said accepting means for representing said speech input as an observed acoustic feature sequence;
a hidden Markov model processing means coupled to said storing means and to said extracting means, said
processing means for applying a forward procedure to determine a forward probability (Pf) that the hidden Markov model of speech patterns corresponding to a single preselected script would have generated the observed acoustic feature sequence by traversing any path through the hidden Markov model;
means coupled to said processing means for transforming the probability (Pf) obtained by said hidden Markov model processor into an evaluation score by use of scaling; and
means coupled to said transforming means for
displaying said evaluation score.
17. An apparatus for evaluating the pronunciation quality of speech, said apparatus comprising:
means for storing representations of pronunciation of speech patterns by a population of speakers, each said representation being a hidden Markov model defining a single script; means coupled to said storing means for selecting one of said stored representations for pronunciation by a user;
means for accepting speech input pronounced by the user, said speech input intended by the user to correspond to said single speech pattern defined by said selected stored representation;
means coupled to said accepting means for representing said speech input as an observed acoustic feature sequence;
a hidden Markov model processing means coupled to said storing means and to said extracting means, said processing means operative to compare the observed acoustic feature sequence to the hidden Markov model of the single speech pattern to estimate the probability (Po) that the hidden Markov model of speech patterns corresponding to a single preselected script would have generated the observed acoustic feature sequence;
means coupled to said processing means for transforming the probability (Po) obtained by said hidden Markov model processor into an evaluation score by use of scaling; and
means coupled to said transforming means for
displaying said evaluation score.
PCT/US1993/012399 1992-12-18 1993-12-17 Method and apparatus for automatic evaluation of pronunciation Ceased WO1994015330A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US99311192A 1992-12-18 1992-12-18
US07/993,111 1992-12-18

Publications (1)

Publication Number Publication Date
WO1994015330A1 true WO1994015330A1 (en) 1994-07-07

Family

ID=25539103

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1993/012399 Ceased WO1994015330A1 (en) 1992-12-18 1993-12-17 Method and apparatus for automatic evaluation of pronunciation

Country Status (1)

Country Link
WO (1) WO1994015330A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6775019B1 (en) 1994-09-20 2004-08-10 Fuji Xerox Co., Ltd. Printer having a plurality of logical printers therein and having a control unit which controls the logical printers so as to print a virtual printing process of one page at a time, thus actually printing data for a single page
US8447603B2 (en) 2009-12-16 2013-05-21 International Business Machines Corporation Rating speech naturalness of speech utterances based on a plurality of human testers
US9837070B2 (en) 2013-12-09 2017-12-05 Google Inc. Verification of mappings between phoneme sequences and words
CN111739518A (en) * 2020-08-10 2020-10-02 腾讯科技(深圳)有限公司 Audio identification method and device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4741036A (en) * 1985-01-31 1988-04-26 International Business Machines Corporation Determination of phone weights for markov models in a speech recognition system
US4829577A (en) * 1986-03-25 1989-05-09 International Business Machines Corporation Speech recognition method
US5268990A (en) * 1991-01-31 1993-12-07 Sri International Method for recognizing speech using linguistically-motivated hidden Markov models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
COHEN et al., "The Decipher Speech Recognition System", IEEE/ICASSP, 3-6 April 1990, pp. 77-80. *

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA