
WO1994015330A1 - Method and apparatus for automatic evaluation of pronunciation - Google Patents

Method and apparatus for automatic evaluation of pronunciation

Info

Publication number
WO1994015330A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
hidden markov
markov model
acoustic feature
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US1993/012399
Other languages
French (fr)
Inventor
Jared C. Bernstein
Michael H. Cohen
Hy Murveit
Mitchel Weintraub
Dimitry Rtischev
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SRI International Inc
Original Assignee
SRI International Inc
Stanford Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SRI International Inc, Stanford Research Institute filed Critical SRI International Inc
Publication of WO1994015330A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]

Definitions

  • Each state has a score, stored in logarithmic form, which is the probability evaluated for the portion of the maximum likelihood path going through that state.
  • Each phone probability is determined by summing its constituent state probabilities (Step L). Each word probability is determined by summing its constituent phone probabilities (Step M), and the speech pattern probability is determined by summing the word probabilities (Step N). The speech pattern probability is the probability score for the acoustic feature sequence.
  • the overall score may comprise the weighted combination of individual phone probabilities.
  • Each of the individual phone probabilities may be weighted before summing according to the importance of the phone to judgment of pronunciation quality. For example, a pause phone (Figure 3) may be weighted at zero, since it contains minimal information about the quality of pronunciation.
  • the weights of some phones may be adjusted for selected target user populations to take into account predisposition to certain pronunciation patterns.
  • a segment of the maximum likelihood path is selected for backtracing rather than the entire maximum likelihood path (step K1).
  • the selected section may be a particular state, phone, or word for which an evaluation score is desired to diagnose a particular pronunciation difficulty.
  • the probability score is then the probability evaluated for the selected section.
  • the probability score is determined using a "forward" procedure rather than by alignment and backtracing.
  • the forward procedure is described in Rabiner 89, which is incorporated herein by reference, and is a procedure for determining the cumulative "forward" probability that the hidden Markov model would have generated the acoustic features observed up to a given frame.
  • FIG. 9 is a flowchart illustrating the forward procedure.
  • First states allowed by the grammar network are determined (Step O).
  • a state probability is calculated which is the probability that the state would have generated the acoustic features observed during the first frame (Step P).
  • the successor states allowed by the grammar network are determined (Step Q) and paths are extended from all the first states to all the allowed successor states. The probability of each such path is calculated by multiplying the probability of the first state which begins the path by the transition probability and by the probability that the successor state would have generated the acoustic features observed at that frame.
  • the successor state probability is then calculated to be the total probability of all the paths arriving at the successor state (Step R).
  • Steps Q and R are repeated until the last frame of the observed acoustic feature sequence is reached (Step S).
  • The probabilities determined for all the allowed last states are then summed to determine the probability score (Step T).
  • the probability score is the output of the HMM processor 11 and is transformed by the scaler 14 into a pronunciation evaluation score which is output to the user via output device 4. Scores obtained according to the invention as described above correlate very well with human evaluations of pronunciation.
  • the attached appendix contains a source-code listing containing one operational embodiment of selected elements of the claimed invention.
  • the machine-readable form of the listing can be compiled using a C language compiler and executed on a central processing unit of a system equipped with the balance of the elements of the claimed invention.
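The forward procedure sketched in the steps above can be illustrated as follows. This is a minimal sketch, not the patent's own implementation: the two-state model, its probabilities, and the feature names are all invented for illustration, and a practical version would work in the log domain to avoid numerical underflow.

```python
def forward_score(init, trans, emit, features):
    """Forward procedure sketch: alpha[s] accumulates the total probability,
    summed over ALL paths (not just the best one), of generating the
    features observed so far and ending in state s."""
    # Step O/P: initial states and their first-frame probabilities.
    alpha = {s: init[s] * emit[s][features[0]] for s in init}
    # Steps Q/R, repeated per frame: extend paths to successors and sum.
    for f in features[1:]:
        alpha = {
            s2: sum(alpha[s] * trans[s].get(s2, 0.0) for s in alpha) * emit[s2][f]
            for s2 in emit
        }
    # Steps S/T: sum the probabilities of all allowed last states.
    return sum(alpha.values())

# Invented toy model: two states, two discrete acoustic features "A" and "B".
init = {"s0": 1.0, "s1": 0.0}
trans = {"s0": {"s0": 0.6, "s1": 0.4}, "s1": {"s1": 1.0}}
emit = {"s0": {"A": 0.8, "B": 0.2}, "s1": {"A": 0.3, "B": 0.7}}
print(forward_score(init, trans, emit, ["A", "B"]))
```

Unlike alignment, which keeps only the single maximum likelihood path, this sums over every path the grammar allows.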

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Pronunciation is automatically evaluated using techniques adapted from automatic speech recognition, as shown in Figure 1. Speech, represented as a sequence of acoustic features (8), is compared to a hidden Markov model (11) of speech patterns (12) corresponding to the preselected script to obtain a probability score which is a measure of the likelihood that the hidden Markov model would have generated the sequence of observed acoustic features. In a specific embodiment, the probability score is determined by processes referred to as alignment and backtracing. Alignment is the process of calculating a maximum likelihood path established through the use of the hidden Markov model. Backtracing is the process of finding the path backwards through the network of states of the hidden Markov model and obtaining a probability score.

Description

METHOD AND APPARATUS FOR AUTOMATIC EVALUATION OF PRONUNCIATION

BACKGROUND OF THE INVENTION
This invention relates to automatic pronunciation evaluation. An application is in computer-aided foreign language instruction and assessment.
Techniques for computer analysis of speech have heretofore been developed for recognition rather than
pronunciation evaluation. One approach to automated speech recognition has been to model speech as a Markov source useful in a hidden Markov model (HMM) speech recognition processor. The speech units being modeled are represented by hidden
Markov models.
In an HMM speech recognition system, a probability distribution over one or more acoustic features is associated with each state. Probabilities associated with the
transitions leading from each state specify the probability of taking each transition upon exiting the selected state. The acoustic feature distributions are typically used to model speech characteristics such as spectra. The transition probabilities implicitly model duration. The probability distributions for transitions and acoustic features are estimated using examples of speech collected from a diverse population. This model of speech is referred to as the hidden Markov model, since the states are not directly observed.
Recognition consists of determining the path through the hidden Markov model that has the highest probability of generating the observed acoustic feature sequence.
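The HMM structure just described can be made concrete with a toy sketch. All state names, transition probabilities, emission probabilities, and features below are invented for illustration (real systems, as the text notes, use continuous densities over spectral features rather than a discrete feature alphabet):

```python
# Toy discrete-feature HMM (all values invented, not from the patent).
trans = {            # trans[s][s2] = P(next state is s2 | current state s)
    "s0": {"s0": 0.6, "s1": 0.4},
    "s1": {"s1": 0.7, "s2": 0.3},
    "s2": {"s2": 1.0},
}
emit = {             # emit[s][f] = P(observing feature f while in state s)
    "s0": {"A": 0.8, "B": 0.2},
    "s1": {"A": 0.3, "B": 0.7},
    "s2": {"A": 0.5, "B": 0.5},
}

def path_probability(path, features):
    """P(this state path generates this feature sequence), one feature
    per frame: first-frame emission, then transition * emission per frame."""
    p = emit[path[0]][features[0]]
    for prev, cur, f in zip(path, path[1:], features[1:]):
        p *= trans[prev][cur] * emit[cur][f]
    return p

print(path_probability(["s0", "s0", "s1"], ["A", "A", "B"]))
```

Recognition, in these terms, is a search for the `path` argument that maximizes this quantity over all paths the model allows.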
Recently, Ryohei Nakatsu of Japan, aware of early unpublished development work of the present inventors, proposed a speech pronunciation evaluation method (Abstract of paper by Hamada and Nakatsu entitled "Evaluation of English Pronunciation Based on the Static and Dynamic Spectral
Characteristics of Words Spoken By Japanese," Journal of the Acoustical Society of America, Vol. 84, Sup. 1, Fall 1988). That speech pronunciation evaluation method depends upon the use of dynamic time warping (DTW) and metric scoring based on distance in acoustic parameter space. These techniques are generally known in the speech recognition art and have been found to be inferior for aligning speech and recognizing speech. What is needed is a technique which is relatively invariant across a wide range of voices of speakers.
SUMMARY OF THE INVENTION
According to the invention, pronunciation is
automatically evaluated using techniques adapted from
automatic speech recognition. A student is prompted to speak a preselected script into an input device in order to allow a machine according to the invention to evaluate the quality of the pronunciation. The speech of the student is divided into time frames and each frame of speech is characterized by one or more acoustic features. The student's speech, represented as a sequence of acoustic features, is compared to a hidden Markov model of speech patterns corresponding to the
preselected script to obtain a probability score which is a measure of the likelihood that the hidden Markov model would have generated the sequence of observed acoustic features. In a specific embodiment, the probability score is determined by processes herein referred to as alignment and backtracing.
Alignment is the process of calculating a maximum likelihood path, the maximum likelihood path being the path through the hidden Markov model with the maximum likelihood of generating the acoustic feature sequence extracted from the speech of the user. Backtracing is the process of finding the path
backwards through the network of states of the hidden Markov model and obtaining a probability score indicative of the probability that the observed speech acoustic features would have been generated by the hidden Markov model. (In contrast, a conventional HMM speech recognizer would use the maximum likelihood path identified during alignment and backtracing to recognize a user's speech.) The probability score is
transformed into an evaluation score by use of scaling, and it is then displayed. The present invention, based on heretofore undisclosed recognition of the applicability of certain underlying principles of HMM speech recognition to the
solution of the pronunciation evaluation problem, is believed to surpass in accuracy all known and proposed pronunciation evaluation systems.
The invention will be better understood by reference to the following detailed description in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a device operative in accordance with the invention.
FIG. 2 is a diagram illustrating a portion of a hidden Markov model finite state machine adapted for
pronunciation evaluation according to the invention.
FIG. 3 is a diagram illustrating a word-level grammar network adapted for pronunciation evaluation according to the invention.
FIG. 4 is a flowchart describing the steps of aligning an observed acoustic feature sequence to a speech pattern model according to the Viterbi search technique.

FIG. 5 is a flowchart describing the step of
calculating path probabilities according to the Viterbi search technique.
FIG. 6 is a flowchart describing the step of
activating phones during alignment.
FIG. 7 is a flowchart describing the step of
backtracing through the maximum likelihood path.
FIG. 8 is a diagram illustrating a portion of the backtrace structure used in computing the probability score for a sample speech pattern in accordance with the invention.
FIG. 9 is a flowchart describing the use of the forward procedure to determine the probability score for an observed acoustic feature sequence given a hidden Markov model.
DESCRIPTION OF SPECIFIC EMBODIMENTS
The invention allows pronunciation of preselected scripts to be automatically evaluated by calculating the probability that an appropriate hidden Markov model would generate the acoustic features derived from the actual
rendition of the script as spoken by the speaker being
evaluated. The probability is then transformed into an evaluation score. The rendition of the script may be a word, a phrase, a sentence, a paragraph or other unit of speech.
Figure 1 is a block diagram of an exemplary real-time pronunciation evaluation apparatus 10 comprising a lesson controller apparatus 2, an output device 4, a speech input device 6, speech information storage 7, a feature extractor 8, an HMM processor 11, speech pattern template storage 12 and a scaler 14. The output device 4 and speech input device 6 are positioned to permit interaction between the pronunciation evaluation apparatus 10 and a student 3.
In operation, the lesson controller 2 is used to select an evaluation speech pattern and present it to a student 3 through output device 4. The student 3 then recites the script into speech input device 6. Speech input device 6 converts the student's speech to machine-readable speech information. Speech information storage 7 stores the machine-readable speech information. Feature extractor 8 registers the speech information by dividing it into frames and
characterizing the frames. The characterizing step comprises determining one or more acoustic features for each frame. The speech is thus available to the HMM processor 11 as a sequence of observed acoustic features.
The HMM processor 11 retrieves from speech pattern template storage 12 the template corresponding to the script selected by the lesson controller 2. Each template stored in speech pattern template storage 12 comprises the previously computed state transition probabilities, acoustic feature probability densities, and grammar network for the hidden Markov model of speech patterns corresponding to the
preselected script. The templates are derived from speech generated by a diverse population of speakers.
In one embodiment, a probability score for the observed feature sequence is generated by processes herein referred to as alignment and backtracing. Alignment is the process of calculating a maximum likelihood path, the maximum likelihood path being the path through the hidden Markov model with the maximum likelihood of generating the acoustic feature sequence extracted from the speech of the user. Backtracing is the process of recalling the probabilities evaluated for the maximum likelihood path to obtain an overall probability score.
The overall probability score is converted to a pronunciation evaluation score through the action of the scaler 14. In one embodiment, the scaler applies the
transformation: y=ax+b, to obtain y, where y is the
pronunciation evaluation score, a is the scaling factor, x is the probability score, and b is the scaling offset. The pronunciation evaluation score is then presented to the student 3 through output device 4.
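The scaling transformation is simple enough to state directly. The values of a and b below are invented for illustration; in practice they would be tuned so that scores land on the desired grading range:

```python
def scale_score(x, a, b):
    """Linear transformation y = a*x + b of the probability score x into a
    pronunciation evaluation score. a is the scaling factor and b the
    scaling offset; the values used below are illustrative only."""
    return a * x + b

# e.g. an overall log-probability score of -120 mapped with invented
# constants a = -0.25, b = 4.0:
print(scale_score(-120.0, a=-0.25, b=4.0))  # → 34.0
```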
The processes of alignment and backtracing will now be explained in greater detail. In one embodiment of the invention, the hidden Markov model stored in the template for the selected speech pattern has a hierarchical structure. The speech pattern is divided into words which are in turn divided into phones which are themselves in turn divided into the individual states of the hidden Markov model. The input speech is modeled as a Markov source which traces a path through the hidden Markov model, traversing many states and remaining in each state for one or more frames.
Allowable transitions from a state are constrained by a grammar network associated with the hidden Markov model constructed from speech patterns, preferably obtained by training a speech recognizer employing the speech of native speakers of the target language reciting the same preselected script. (A standard HMM-based speech recognizer on which such training can be performed is the DECIPHER system from SRI International of Menlo Park, California.) A grammar network is a catalog of states and their allowable links forming phones, words and sentences.
Figure 2 illustrates a portion of the hidden Markov model for a speech pattern and shows the transitions between states allowed by the grammar network associated with the hidden Markov model. Phones 16 and 18 are divided into states 20, 22, 24, 26, 28 and 30. Transitions are allowed from a state to itself or to the next state. The last state 24 in phone 16 may transition to the first state 26 of a successor phone 18. The identities of the successor phones to a
selected phone are defined by the grammar network. The grammar network may allow multiple successor phones to a selected phone (not shown). Transitions to another phone from states other than the final state in a selected phone may also be allowed (not shown). At any given frame, the list of phones which may be reached through transitions permitted by the grammar network is herein referred to as the active list of phones.
The grammar network also constrains transitions between words. Figure 3 illustrates a portion of the word-level grammar network for a sample speech pattern. The network inserts an optional pause phone between words which may be traversed to allow for speakers who paused between words. Otherwise, the pause is skipped. (The skip is
implemented as a transition arc which does not correspond to an acoustic feature.) For example, in the word-level grammar network 32, a pause phone 34 is inserted between the word "she" 36 and the word "had" 38.
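One plausible way to represent such a word-level network is an adjacency list mapping each phone to its allowed successors. The phone names below are an invented rough phone sequence for "she had", not taken from the patent; the optional pause appears as two parallel arcs out of the last phone of "she":

```python
# Hypothetical grammar network for "she had" with an optional pause phone.
successors = {
    "sh":    ["iy"],
    "iy":    ["pause", "hh"],   # pause arc OR skip arc straight to "had"
    "pause": ["hh"],
    "hh":    ["ae"],
    "ae":    ["d"],
    "d":     [],
}

def activate(active_phones):
    """One grammar-network update: extend the active list of phones with
    every phone reachable by an allowed transition."""
    new = set(active_phones)
    for p in active_phones:
        new.update(successors[p])
    return sorted(new)

# Both the pause and the first phone of the next word become active.
print(activate(["iy"]))  # → ['hh', 'iy', 'pause']
```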
Figure 4 is a flowchart illustrating the steps of alignment. In one embodiment of the present invention, alignment is performed using the Viterbi search method. The Viterbi search method is a dynamic programming technique named for its originator and disclosed in Viterbi 67, which is incorporated by reference. The HMM processor determines the maximum likelihood path through the hidden Markov model by evaluating, one frame at a time, the probability that each path through the model would have generated the acoustic feature sequence extracted from the input speech pattern.
The method of evaluating paths will now be described. Path probabilities are updated at each time frame. For each active phone, the path probabilities for paths going through that phone are updated using the Viterbi search technique (Step B). The active phone list is then updated in accordance with the grammar network (Step C). After this has been done for the entire speech pattern, the maximum
likelihood path is determined (Step D).
Figure 5 is a flowchart illustrating in greater detail the calculation of path probabilities in accordance with the Viterbi search technique. The active phones are identified (Step E). The states belonging to the identified active phones are identified (Step F). For each identified state, each path is extended to that state and a new
probability for the path is calculated. The probability is calculated by multiplying the previous probability of that path by the probability of taking the transition to that state and then further multiplying by the probability of each acoustic feature observed at that frame as derived from the acoustic feature distributions (Step G). For each state, the maximum likelihood path arriving at that state is preserved and the other paths arriving at the state are discarded because none of them can be the maximum likelihood path through the hidden Markov model. In a further pruning step, the probabilities of the paths arriving at each phone are evaluated, and if the probability of every path leading to the phone is below a preselected
threshold, the phone becomes inactive (Step H).
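The per-frame Viterbi update and pruning of Steps E through H can be sketched in the log domain as follows. This is an illustrative sketch only: the function name, the dictionary-based data structures, and the pruning convention are assumptions for clarity, not the implementation in the appendix.

```python
import math

def viterbi_update(states, transitions, log_obs, prev_scores, prune_threshold):
    """One frame of the Viterbi path-probability update (sketch of Steps E-H).

    states: destination state ids active at this frame
    transitions: dict (src, dst) -> log transition probability
    log_obs: dict state -> log probability of the observed features at this frame
    prev_scores: dict state -> best log path score at the previous frame
    Returns (new_scores, backpointers); states scoring below the threshold
    are pruned and drop out of the active set.
    """
    new_scores, backptr = {}, {}
    for dst in states:
        best, best_src = -math.inf, None
        for (src, d), log_tp in transitions.items():
            if d != dst or src not in prev_scores:
                continue
            cand = prev_scores[src] + log_tp      # extend path src -> dst
            if cand > best:
                best, best_src = cand, src
        if best_src is None:
            continue
        score = best + log_obs[dst]               # fold in observation probability
        if score >= prune_threshold:              # pruning: discard unlikely states
            new_scores[dst] = score
            backptr[dst] = best_src               # only the best arrival is kept
    return new_scores, backptr
```

Because only the maximum likelihood arrival at each state is preserved, the stored backpointers are sufficient to recover the single best path later during backtracing.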
In one embodiment, all paths with probability below a preselected threshold may be marked as not requiring further evaluation. However, if this threshold is set too high, there is a chance that the maximum likelihood path itself may be pruned.
In an alternative embodiment, when updating the probability of a path in step D, the observed acoustic features are weighted (step F1). Weighting consists of raising an observed acoustic feature value to an exponent, which is a preselected weight value, before incorporating the acoustic feature value into the path probability calculation. Each acoustic feature has a preselected weight for each state. Weights are adjusted to improve the correlation between evaluation scores generated by the pronunciation evaluation apparatus 10 and ones generated by human scorers.
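Since the path probabilities are accumulated in logarithmic form, raising a feature probability to a preselected weight is equivalent to multiplying its log probability by that weight. A minimal sketch of this weighting, with the function name and list-based interface assumed for illustration:

```python
import math

def weighted_log_obs(feature_probs, weights):
    """Weighted observation score for one state at one frame (sketch).

    Raising each per-feature probability p to a preselected weight w before
    multiplying it into the path probability is, in the log domain, w * log(p).
    feature_probs and weights are parallel lists, one entry per acoustic feature.
    """
    return sum(w * math.log(p) for p, w in zip(feature_probs, weights))
```

A weight of zero removes a feature's influence entirely, while weights above one amplify a feature's contribution to the path score.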
Figure 6 is a flowchart illustrating the step of activating phones in accordance with the grammar network.
Each active phone is identified (Step I). For each active phone, possible successor phones are identified from the grammar network (Step J). The successor phones are made active (Step K). Thus, at the next probability evaluation step, the first states of these phones will be potential destinations.
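Steps I through K can be sketched as a simple set expansion over the grammar network. The dictionary representation of the grammar and the function name are illustrative assumptions:

```python
def update_active_phones(active, grammar):
    """Sketch of Steps I-K: extend the active set with grammar successors.

    active: set of currently active phone labels
    grammar: dict mapping a phone to the list of its possible successor phones
    """
    activated = set(active)
    for phone in active:                        # Step I: each active phone
        for succ in grammar.get(phone, []):     # Step J: successors from grammar
            activated.add(succ)                 # Step K: make successor active
    return activated
```

At the next probability evaluation step, the first states of the newly activated phones become potential path destinations.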
The above steps are repeated until the end of the input speech pattern is reached. The maximum likelihood path is then identified (Step D).
Figure 7 is a flowchart describing the step of backtracing in accordance with the invention. After
alignment, the HMM processor 11 backtraces through the maximum likelihood path to determine a pronunciation evaluation score. During backtracing, the HMM processor 11 creates a record herein referred to as the backtrace structure.
Figure 8 is a diagram illustrating a portion of the backtrace structure record 100 for a sample speech pattern. The speech pattern 40 is composed of its constituent words 42, which are in turn broken down into constituent phones 44, which are themselves broken down into constituent states 46.
Each state has a score, stored in logarithmic form, which is the probability evaluated for the portion of the maximum likelihood path going through that state. The probabilities are stored in logarithmic form to facilitate mathematical manipulation, since addition may substitute for multiplication. Using logarithmic form, each phone
probability is determined by summing its constituent state probabilities (Step L). Each word probability is determined by summing its constituent phone probabilities (Step M), and the speech pattern probability is determined by summing the word probabilities (Step N). The speech pattern probability is the probability score for the acoustic feature sequence.
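In the log domain, the hierarchical aggregation of Steps L through N reduces to summation at each level. A minimal sketch, with the nested-list representation of the backtrace structure assumed for illustration:

```python
def phone_score(state_scores):
    """Step L: a phone's log score is the sum of its constituent state scores."""
    return sum(state_scores)

def word_score(phones):
    """Step M: a word's log score is the sum of its constituent phone scores."""
    return sum(phone_score(p) for p in phones)

def pattern_score(words):
    """Step N: the speech pattern's log score is the sum of its word scores."""
    return sum(word_score(w) for w in words)
```

Because the scores are logarithms of probabilities, each of these sums corresponds to a product of the underlying probabilities.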
The overall score may comprise the weighted combination of individual phone probabilities. Each of the individual phone probabilities may be weighted before summing according to the importance of the phone to judgment of pronunciation quality. For example, a pause phone (Figure 3) may be weighted at zero, since it contains minimal information about the quality of pronunciation. In addition, the weights of some phones may be adjusted for selected target user populations to take into account predisposition to certain pronunciation patterns.
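The weighted combination of phone probabilities described above can be sketched as follows. The pair-list interface, the default weight of 1.0 for unlisted phones, and the function name are assumptions for illustration:

```python
def weighted_pattern_score(phone_scores, phone_weights):
    """Weighted combination of per-phone log scores (sketch).

    phone_scores: list of (phone label, log score) pairs from the backtrace.
    phone_weights: dict mapping a phone label to its weight; a pause phone,
    for example, may be weighted at zero so that it contributes nothing.
    Phones not listed default to a weight of 1.0 (an assumption).
    """
    return sum(phone_weights.get(label, 1.0) * score
               for label, score in phone_scores)
```

Adjusting the weight table for a target user population is then a matter of editing the dictionary rather than the scoring logic.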
In an alternative embodiment, a segment of the maximum likelihood path is selected for backtracing rather than the entire maximum likelihood path (step K1). The selected section may be a particular state, phone, or word for which an evaluation score is desired to diagnose a particular pronunciation difficulty. The probability score is then the probability evaluated for the selected section.
In an alternative embodiment, the probability score is determined using a "forward" procedure rather than by alignment and backtracing. The forward procedure, described in Rabiner 89, which is incorporated herein by reference, is a procedure for determining the cumulative "forward" probability (Pf) that any path through a hidden Markov model would generate a given observed acoustic feature sequence. Figure 9 is a flowchart illustrating the forward procedure. The first states allowed by the grammar network are determined (Step O). Then, for each allowed first state, a state probability is calculated, which is the probability that the state would have generated the acoustic features observed during the first frame (Step P). Then the successor states allowed by the grammar network are determined (Step Q), and paths are extended from all the first states to all the allowed successor states. The probability of these paths is calculated by multiplying the first state probability for the first state which begins the path by the transition probability for the given path to the successor state, and by the probability that the successor state would have generated the acoustic features observed during the second frame. The successor state probability is then calculated as the total probability of all the paths arriving at the successor state (Step R).
Steps Q and R are repeated until the last frame of the observed acoustic feature sequence is reached (Step S). The probabilities determined for all the allowed last states are then summed to determine the probability score (Step T).
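Steps O through T can be sketched as the standard forward recursion, here in the probability domain for readability (in practice log arithmetic would be used; the data structures and function name are illustrative assumptions):

```python
def forward_probability(init, transitions, obs_probs, n_frames):
    """Forward procedure (sketch of Steps O-T).

    init: dict state -> probability of starting in that state (Step O)
    transitions: dict (src, dst) -> transition probability
    obs_probs: list with one dict per frame, mapping a state to the
               probability of the observed features given that state
    Returns Pf, the total probability summed over all paths.
    """
    # Step P: first-frame probabilities for the allowed first states
    alpha = {s: p * obs_probs[0].get(s, 0.0) for s, p in init.items()}
    for t in range(1, n_frames):            # Steps Q-S: extend frame by frame
        new_alpha = {}
        for (src, dst), tp in transitions.items():
            if src in alpha:
                # Step R: accumulate (not maximize) over all arriving paths
                new_alpha[dst] = (new_alpha.get(dst, 0.0)
                                  + alpha[src] * tp * obs_probs[t].get(dst, 0.0))
        alpha = new_alpha
    return sum(alpha.values())              # Step T: sum over allowed last states
```

The only structural difference from the Viterbi alignment is that paths arriving at a state are summed rather than maximized, so no single best path (and hence no backtrace) is produced.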
The probability score is the output of the HMM processor 11 and is transformed by the scaler 14 into a pronunciation evaluation score which is output to the user via output device 4. Scores obtained according to the invention as described above correlate very well with human evaluations of pronunciation.
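The scaling performed by the scaler 14 is not specified in detail above; one plausible sketch, offered purely as an assumption, normalizes the log probability by utterance length and maps it linearly onto a bounded score range:

```python
def scale_to_score(log_prob, n_frames, lo=-10.0, hi=0.0, max_score=10.0):
    """Map a log probability to an evaluation score (illustrative sketch only;
    the actual scaling function and all constants here are assumptions).

    Normalizes by the number of frames, then maps the per-frame range
    [lo, hi] linearly onto [0, max_score], clamping at the ends.
    """
    per_frame = log_prob / max(n_frames, 1)
    frac = (per_frame - lo) / (hi - lo)
    return max(0.0, min(max_score, frac * max_score))
```

Whatever its exact form, the scaling function would be fitted so that the resulting scores correlate with human evaluations of pronunciation.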
The attached appendix contains a source-code listing containing one operational embodiment of selected elements of the claimed invention. The machine-readable form of the listing can be compiled using a C language compiler and executed on a central processing unit of a system equipped with the balance of the elements of the claimed invention.
The invention has now been explained with reference to specific embodiments. Other embodiments will be apparent to those of ordinary skill in the art in view of the foregoing description. For example, preselected scripts may be delivered to a user via off-line means, such as a written guidebook, a newspaper advertisement promoting the service, or other visual or auditory form. It is therefore not intended that this invention be limited, except as indicated by the appended claims.

Claims

WHAT IS CLAIMED IS:
1. A method for evaluating pronunciation of a preselected script comprising the steps of:
a) establishing a hidden Markov model of speech patterns corresponding to said preselected script and storing the hidden Markov model of the speech pattern in machine- readable form in a first memory device;
b) capturing an acoustic speech signal from a user through a speech input device and storing said speech signal as machine-readable speech information in a second memory device;
c) extracting an observed acoustic feature sequence from said machine-readable speech information by dividing the speech information into frames and characterizing each one of said frames by at least one acoustic feature; thereafter
d) aligning the observed acoustic feature sequence with the hidden Markov model to obtain a structure registering a traceable path through the hidden Markov model, said
traceable path being the path through the hidden Markov model with the maximum likelihood of generating the observed
acoustic feature sequence; thereafter
e) backtracing through the structure to obtain a probability (Po) that the hidden Markov model of speech patterns corresponding to the preselected script would have traversed the traceable path and generated the observed acoustic feature sequence; thereafter
f) scaling the probability (Po) into an evaluation score; and thereafter
g) displaying said evaluation score.
2. The method according to claim 1 wherein the step of aligning is performed using the Viterbi search
technique.
3. The method according to claim 1 wherein during the step of aligning, assigning weights to acoustic features, said weights selected to correlate said evaluation score to a human generated evaluation score.
4. The method according to claim 1 wherein during the step of backtracing, weighting phone probabilities
according to an expectation of significance of the phone to a judging of pronunciation quality.
5. A method for diagnosing pronunciation deficiencies of a user, comprising the steps of:
a) establishing a hidden Markov model of speech patterns corresponding to a preselected script and storing the hidden Markov model in machine-readable form in a first memory device;
b) capturing an acoustic speech signal from a user through a speech input device and storing said speech signal as machine-readable speech information in a second memory device; thereafter
c) extracting an observed acoustic feature sequence from said machine-readable speech information by dividing the speech information into frames and characterizing each one of said frames by at least one acoustic feature; thereafter
d) aligning the observed acoustic feature sequence with the hidden Markov model to obtain a structure registering a traceable path through the hidden Markov model, said
traceable path being the path through the hidden Markov model with the maximum likelihood of generating the observed
acoustic feature sequence; thereafter
e) selecting a segment of the structure
corresponding to a section of the preselected speech pattern preselected for evaluation; thereafter
f) backtracing through the selected segment to obtain a probability (Po) that the hidden Markov model of speech patterns corresponding to the preselected script would have traversed the traceable path and generated the observed acoustic feature sequence; thereafter
g) scaling the probability (Po) into an evaluation score; and thereafter
h) displaying said evaluation score.
6. The method according to claim 5 wherein the step of aligning is performed using the Viterbi search
technique.
7. The method according to claim 5 wherein during the step of aligning, assigning weights to acoustic features, said weights selected to correlate said evaluation score to a human generated evaluation score.
8. The method according to claim 5 wherein during the step of backtracing, weighting phone probabilities
according to an expectation of significance of the phone to a judging of pronunciation quality.
9. A method for evaluating pronunciation of a preselected script comprising the steps of:
a) establishing a hidden Markov model of preselected speech patterns and storing the hidden Markov model in
machine-readable form in a first memory device;
b) capturing an acoustic speech signal from a person through a speech input device and storing said speech signal as machine-readable speech information in a second memory device; thereafter
c) extracting an observed acoustic feature sequence from said machine-readable speech information by dividing the speech information into frames and characterizing each one of said frames by at least one acoustic feature; thereafter
d) using a forward procedure to determine a forward probability (Pf) that the observed acoustic feature sequence would have been generated by the hidden Markov model of speech patterns corresponding to the preselected script traversing any traceable path; thereafter
e) scaling the probability (Pf) into an evaluation score; and thereafter
f) displaying said evaluation score.
10. The method according to claim 9 wherein during the forward procedure, assigning weights to acoustic features, said weights selected to correlate said evaluation score to a human generated evaluation score.
11. The method according to claim 9 wherein during the forward procedure, weighting phone probabilities according to an expectation of significance of the phone to a judging of pronunciation quality.
12. A method for evaluating pronunciation of a preselected script comprising the steps of:
a) establishing a hidden Markov model of speech patterns corresponding to the preselected script and storing the hidden Markov model in machine-readable form in a first memory device;
b) capturing an acoustic speech signal from a person through a speech input device and storing said speech signal as machine-readable speech information in a second memory device; thereafter
c) comparing the observed acoustic feature sequence to the hidden Markov model of the speech patterns to obtain a probability (Po), said probability (Po) being an approximate measure of the likelihood that the hidden Markov model of the speech patterns corresponding to the preselected script would generate the observed acoustic feature sequence; thereafter d) scaling the probability (Po) into an evaluation score; and thereafter
e) displaying said evaluation score.
13. The method according to claim 12 wherein during the step of comparing, assigning weights to acoustic features, said weights selected to correlate said evaluation score to a human generated evaluation score.
14. The method according to claim 12 wherein during the step of comparing, weighting phone probabilities according to an expectation of significance of the phone to a judging of pronunciation quality.
15. An apparatus for evaluating the pronunciation quality of speech, said apparatus comprising:
means for storing representations of pronunciation of speech patterns by a population of speakers, each said representation being a hidden Markov model defining a single script, said hidden Markov model forming a network of states;
means coupled to said storing means for selecting one of said stored representations for pronunciation by a user;
means for accepting speech input pronounced by the user, said speech input intended by the user to correspond to said single script;
means coupled to said accepting means for representing said speech input as an observed acoustic feature sequence;
a hidden Markov model processing means coupled to said storing means and to said extracting means for aligning said observed acoustic feature sequence to the hidden Markov model of speech patterns corresponding to said single script stored in said storing means, said processing means operative to determine a traceable path through said hidden Markov model, said traceable path being the path through said hidden Markov model with the maximum likelihood of generating said observed acoustic feature sequence, said processing means being further operative to backtrace through the network of states of the hidden Markov model to obtain a probability (Po) that the hidden Markov model would have generated the observed acoustic feature sequence;
means coupled to said processing means for transforming the probability (Po) obtained by said hidden Markov model processor into an evaluation score by use of scaling; and
means coupled to said transforming means for
displaying said evaluation score.
16. An apparatus for evaluating the pronunciation quality of speech, said apparatus comprising:
means for storing representations of pronunciation of speech patterns by a population of speakers, each said representation being a hidden Markov model defining a single script;
means coupled to said storing means for selecting one of said stored representations for pronunciation by a user;
means for accepting speech input pronounced by the user, said speech input intended by the user to correspond to said single speech pattern defined by said selected stored representation;
means coupled to said accepting means for representing said speech input as an observed acoustic feature sequence;
a hidden Markov model processing means coupled to said storing means and to said extracting means, said
processing means for applying a forward procedure to determine a forward probability (Pf) that the hidden Markov model of speech patterns corresponding to a single preselected script would have generated the observed acoustic feature sequence by traversing any path through the hidden Markov model;
means coupled to said processing means for transforming the probability (Pf) obtained by said hidden Markov model processor into an evaluation score by use of scaling; and
means coupled to said transforming means for
displaying said evaluation score.
17. An apparatus for evaluating the pronunciation quality of speech, said apparatus comprising:
means for storing representations of pronunciation of speech patterns by a population of speakers, each said representation being a hidden Markov model defining a single script; means coupled to said storing means for selecting one of said stored representations for pronunciation by a user;
means for accepting speech input pronounced by the user, said speech input intended by the user to correspond to said single speech pattern defined by said selected stored representation;
means coupled to said accepting means for representing said speech input as an observed acoustic feature sequence;
a hidden Markov model processing means coupled to said storing means and to said extracting means, said processing means operative to compare the observed acoustic feature sequence to the hidden Markov model of the single speech pattern to estimate the probability (Po) that the hidden Markov model of speech patterns corresponding to a single preselected script would have generated the observed acoustic feature sequence;
means coupled to said processing means for transforming the probability (Po) obtained by said hidden Markov model processor into an evaluation score by use of scaling; and
means coupled to said transforming means for
displaying said evaluation score.
PCT/US1993/012399 1992-12-18 1993-12-17 Method and apparatus for automatic evaluation of pronunciation Ceased WO1994015330A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US99311192A 1992-12-18 1992-12-18
US07/993,111 1992-12-18

Publications (1)

Publication Number Publication Date
WO1994015330A1 true WO1994015330A1 (en) 1994-07-07

Family

ID=25539103

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1993/012399 Ceased WO1994015330A1 (en) 1992-12-18 1993-12-17 Method and apparatus for automatic evaluation of pronunciation

Country Status (1)

Country Link
WO (1) WO1994015330A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6775019B1 (en) 1994-09-20 2004-08-10 Fuji Xerox Co., Ltd. Printer having a plurality of logical printers therein and having a control unit which controls the logical printers so as to print a virtual printing process of one page at a time, thus actually printing data for a single page
US8447603B2 (en) 2009-12-16 2013-05-21 International Business Machines Corporation Rating speech naturalness of speech utterances based on a plurality of human testers
US9837070B2 (en) 2013-12-09 2017-12-05 Google Inc. Verification of mappings between phoneme sequences and words
CN111739518A (en) * 2020-08-10 2020-10-02 腾讯科技(深圳)有限公司 Audio identification method and device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4741036A (en) * 1985-01-31 1988-04-26 International Business Machines Corporation Determination of phone weights for markov models in a speech recognition system
US4829577A (en) * 1986-03-25 1989-05-09 International Business Machines Corporation Speech recognition method
US5268990A (en) * 1991-01-31 1993-12-07 Sri International Method for recognizing speech using linguistically-motivated hidden Markov models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
COHEN et al., "The Decipher Speech Recognition System", IEEE/ICASSP, 3-6 April 1990, pp. 77-80. *

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA