
HK1117264A - Voice recognition system using implicit speaker adaptation - Google Patents

Voice recognition system using implicit speaker adaptation

Info

Publication number
HK1117264A
HK1117264A (application HK08111776.9A)
Authority
HK
Hong Kong
Prior art keywords
speaker
acoustic
acoustic model
template
independent
Prior art date
Application number
HK08111776.9A
Other languages
Chinese (zh)
Inventor
N. Malayath
A. P. DeJaco
C. Chang
S. Jalil
Ning Bi
H. Garudadri
Original Assignee
Qualcomm Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qualcomm Incorporated
Publication of HK1117264A


Description

Speech recognition system using implicit speaker adaptation
This application is a divisional of the invention application entitled "Voice recognition system using implicit speaker adaptation", application number 02810586.9, with an international filing date of March 22, 2002.
Background
Technical Field
The present invention relates to the processing of speech signals. More particularly, the present invention relates to novel speech recognition methods and apparatus for achieving improved performance through unsupervised training.
Background Art
Speech recognition is one of the most important technologies for giving machines simulated intelligence, enabling them to recognize user voice commands and facilitating the human-machine interface. Systems that employ techniques for recovering linguistic information from an acoustic speech signal are referred to as voice recognition (VR) systems. Fig. 1 shows a basic VR system comprising a pre-emphasis filter 102, an Acoustic Feature Extraction (AFE) unit 104, and a pattern matching engine 110. The AFE unit 104 converts a series of digital speech samples into a set of measurement values (for example, extracted frequency components) referred to as an acoustic feature vector. The pattern matching engine 110 matches a series of acoustic feature vectors against the templates contained in the VR acoustic model 112. VR pattern matching engines generally use either Dynamic Time Warping (DTW) or Hidden Markov Model (HMM) techniques. Both DTW and HMM are well known in the art and are discussed in detail in Rabiner, L.R. and Juang, B.H., Fundamentals of Speech Recognition (Prentice Hall, 1993). When a series of acoustic features matches a template in the acoustic model 112, the recognized template is used to generate a desired output format, for example an identified sequence of linguistic words corresponding to the input speech.
As noted above, the acoustic model 112 is typically either an HMM model or a DTW model. A DTW acoustic model can be thought of as a database of templates associated with each of the words to be recognized. In general, a DTW template is a sequence of feature vectors averaged over many examples of the associated word. DTW pattern matching generally involves locating the stored template whose distance to the sequence of input feature vectors representing the input speech is smallest. Templates used in HMM-based acoustic models contain a detailed statistical description of the associated speech utterance. In general, an HMM template stores a sequence of mean vectors, variance vectors, and a set of transition probabilities. These parameters describe the statistics of a speech unit and are estimated from many examples of that unit. HMM pattern matching generally involves generating, for each template in the model, a probability based on the series of input feature vectors associated with the input speech. The template with the highest probability is selected as the best match for the input utterance.
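As a concrete illustration of the DTW matching just described, the following is a minimal sketch (not the implementation described in this patent) of computing a DTW distance between a sequence of input feature vectors and a stored template, using the standard dynamic-programming recursion with Euclidean frame distances; the function name and array shapes are illustrative.

```python
import numpy as np

def dtw_distance(input_frames: np.ndarray, template_frames: np.ndarray) -> float:
    """Classic DTW distance between two sequences of acoustic feature vectors.

    input_frames:    shape (N, D) -- N input frames, D features per frame
    template_frames: shape (M, D) -- M template frames
    Returns the accumulated minimum alignment cost (smaller = closer match).
    """
    n, m = len(input_frames), len(template_frames)
    # Local cost: Euclidean distance between every pair of frames.
    cost = np.linalg.norm(
        input_frames[:, None, :] - template_frames[None, :, :], axis=-1)

    # Accumulated cost with the usual step pattern (match, insert, delete).
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # skip a template frame
                acc[i, j - 1],      # skip an input frame
                acc[i - 1, j - 1])  # align the two frames
    return float(acc[n, m])

# Usage: pick the word whose template has the smallest DTW distance.
# scores = {word: dtw_distance(features, tpl) for word, tpl in templates.items()}
# best_word = min(scores, key=scores.get)
```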
"training" refers to the process of collecting speech samples from particular speech segments and syllables of one or more speakers to facilitate the generation of templates in the acoustic model 112. The various templates in the acoustic model are associated with special words or speech segments called pronunciation categories. There may be many templates in the acoustic model that relate to the same utterance class. "testing" refers to the process of matching templates in an acoustic model to a sequence of feature vectors extracted from the input speech. The performance of a given system depends largely on the degree of match between the end user's input speech and the content in the database, and therefore also on the match between the reference template generated by training and the speech sample used for the VR test.
The two common types of training are supervised and unsupervised. In supervised training, the utterance class associated with each set of training feature vectors is known a priori. The speaker providing the input speech is typically given a script of words or speech segments corresponding to predetermined utterance classes. The feature vectors generated from reading the script can then be incorporated into the acoustic model templates associated with the correct utterance classes.
In unsupervised training, the utterance class associated with a set of training feature vectors is not known a priori. The utterance class must be correctly identified before the set of training feature vectors can be incorporated into the correct acoustic model template. In unsupervised training, a mistake in identifying the utterance class for a set of training feature vectors can lead to modifications of the wrong acoustic model template. Such a mistake generally degrades, rather than improves, speech recognition performance. To avoid such errors, any modification of an acoustic model based on unsupervised training must generally be made very conservatively: a set of training feature vectors is incorporated into the acoustic model only if there is relatively high confidence that the utterance class has been correctly identified. This necessary caution makes building an SD acoustic model through unsupervised training a very slow process, and until the SD acoustic model has been built in this way, VR performance is likely to be unacceptable to most users.
Optimally, the end user supplies the speech acoustic feature vectors used during both training and testing, so that the acoustic model 112 closely matches that user's speech. An individualized acoustic model tailored to a single speaker is also referred to as a speaker dependent (SD) acoustic model. Generating an SD acoustic model generally requires the end user to provide a large number of supervised training samples. First, the user must provide training samples for many different utterance classes. Also, for best performance, the end user must provide multiple templates representing a variety of possible acoustic environments for each utterance class. Because most users cannot or will not provide the amount of input speech needed to generate an SD acoustic model, many existing VR systems instead use generalized acoustic models trained with the speech of many "representative" speakers. Such an acoustic model is referred to as a speaker independent (SI) acoustic model and is designed to give the best performance over a broad range of users. An SI acoustic model, however, is not optimal for any single user. A VR system using an SI acoustic model will not perform as well for a particular user as a VR system using an SD acoustic model tailored to that user. For some users, such as those with strong foreign accents, the performance of a VR system using an SI acoustic model can be so poor that they cannot effectively use VR services at all.
Optimally, an SD acoustic model would be generated for each individual user. As discussed above, building SD acoustic models using supervised training is impractical, and generating an SD acoustic model using unsupervised training takes a long time, during which VR performance based on the partial SD acoustic model may be very poor. There is therefore a need in the art for a VR system that performs well before and during the generation of an SD acoustic model through unsupervised training.
Disclosure of Invention
The methods and apparatus disclosed herein provide a novel and improved voice recognition (VR) system that uses a combination of speaker independent (SI) and speaker dependent (SD) acoustic models. At least one SI acoustic model is used in combination with at least one SD acoustic model to provide a level of speech recognition performance that is at least equal to that of a purely SI acoustic model. The disclosed hybrid SI/SD VR system continuously uses unsupervised training to update the acoustic templates in the one or more SD acoustic models. The hybrid VR system then uses the updated SD acoustic models, either alone or in combination with at least one SI acoustic model, to provide improved VR performance during VR testing.
The term "exemplary" as used herein means "serving as an example, instance, or illustration". Any embodiment discussed as "exemplary embodiment" is not necessarily to be construed as preferred or advantageous over other embodiments.
Brief description of the drawings
The nature, objects, and advantages of the disclosed method and apparatus will become apparent from the detailed discussion set forth in connection with the drawings in which like reference characters designate corresponding parts throughout the several views, and wherein:
FIG. 1 illustrates a basic speech recognition system;
FIG. 2 illustrates a speech recognition system according to an exemplary embodiment;
FIG. 3 illustrates a method for unsupervised training;
FIG. 4 illustrates an exemplary method for generating combined match scores for use in unsupervised training;
FIG. 5 is a flow diagram illustrating a method for speech recognition (testing) using both Speaker Independent (SI) and Speaker Dependent (SD) match scores;
FIG. 6 illustrates a method for generating a combined match score from Speaker Independent (SI) and Speaker Dependent (SD) match scores.
Detailed description of the invention
Fig. 2 shows an exemplary embodiment of a hybrid voice recognition (VR) system that may be implemented in a wireless remote station 202. In the exemplary embodiment, the remote station 202 communicates with a wireless communication network (not shown) over a wireless channel (not shown). For example, the remote station 202 may be a wireless telephone communicating with a wireless telephone system. Those skilled in the art will appreciate that the techniques described herein apply equally to fixed (non-portable) VR systems and to systems that do not include a wireless channel.
In the illustrated embodiment, a voice signal from the user is converted into an electrical signal by a microphone (MIC) 210 and into digital speech samples by an analog-to-digital converter (ADC) 212. The stream of digital samples is then filtered using a pre-emphasis (PE) filter 214, for example a finite impulse response (FIR) filter that attenuates low-frequency signal components.
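The patent does not specify the filter coefficients; a common choice for a speech pre-emphasis front end is the first-order FIR filter y[n] = x[n] - alpha*x[n-1] with alpha around 0.95 to 0.97. The sketch below is a minimal illustration under that assumption, not the actual design of filter 214.

```python
import numpy as np

def pre_emphasis(samples: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    Attenuates low-frequency components (boosts high frequencies) before
    feature extraction. `alpha` is an assumed, typical value.
    """
    emphasized = np.empty_like(samples, dtype=float)
    emphasized[0] = samples[0]
    emphasized[1:] = samples[1:] - alpha * samples[:-1]
    return emphasized
```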
The filtered samples are then analyzed in an Acoustic Feature Extraction (AFE) unit 216. The AFE unit 216 converts the digital speech samples into acoustic feature vectors. In an exemplary embodiment, the AFE unit 216 performs a Fourier transform on a segment of consecutive digital samples to produce a vector of signal strengths corresponding to different frequency bins. In an exemplary embodiment, the frequency bins have bandwidths that vary according to the bark scale. In the bark scale, the bandwidth of each frequency bin bears a relation to the center frequency of the bin, such that higher-frequency bins have wider frequency bands than lower-frequency bins. The bark scale is discussed in Rabiner, L.R. and Juang, B.H., Fundamentals of Speech Recognition (Prentice Hall, 1993).
In an exemplary embodiment, each acoustic feature vector is extracted from a series of speech samples collected over a fixed time interval. In an exemplary embodiment, these time intervals overlap. For example, acoustic features may be obtained from 20-millisecond intervals of speech data beginning every 10 milliseconds, so that each pair of consecutive intervals shares a 10-millisecond segment. Those skilled in the art will recognize that the time intervals could instead be non-overlapping or of non-fixed duration without departing from the scope of the embodiments described herein.
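The following sketch illustrates this style of frame-based feature extraction: 20 ms windows advanced every 10 ms, a Fourier transform per frame, and spectral energies accumulated into bins whose bandwidths grow with frequency, roughly in the spirit of the bark scale. The sample rate, band edges, windowing, and log compression are illustrative assumptions, not the specific design of AFE unit 216.

```python
import numpy as np

def extract_features(samples: np.ndarray, sample_rate: int = 8000,
                     frame_ms: int = 20, hop_ms: int = 10) -> np.ndarray:
    """Return one feature vector per 20 ms frame, advancing every 10 ms."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000

    # Illustrative band edges (Hz) whose widths increase with center frequency.
    band_edges = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                  1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4000]

    features = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
        # Sum spectral energy within each band, then compress with a log.
        bands = [spectrum[(freqs >= lo) & (freqs < hi)].sum()
                 for lo, hi in zip(band_edges[:-1], band_edges[1:])]
        features.append(np.log(np.array(bands) + 1e-10))
    return np.array(features)   # shape: (num_frames, num_bands)
```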
The acoustic feature vectors generated by the AFE unit 216 may be provided to the VR engine 220, which performs pattern matching to characterize the acoustic feature vectors according to the contents of the one or more acoustic models 230, 232, and 234.
In the exemplary embodiment shown in FIG. 2, three acoustic models are shown: a Speaker Independent Hidden Markov Model (SIHMM) acoustic model 230, a Speaker Independent Dynamic Time Warping (SIDTW) acoustic model 232, and a Speaker Dependent (SD) acoustic model 234. Those skilled in the art will appreciate that other combinations of SI acoustic models may be used in other embodiments. For example, the remote station 202 could include only the SIHMM acoustic model 230 and the SD acoustic model 234, omitting the SIDTW acoustic model 232. Alternatively, the remote station 202 could include a single SIHMM acoustic model 230, one SD acoustic model 234, and two different SIDTW acoustic models 232. In addition, those skilled in the art will appreciate that the SD acoustic model 234 may be of the HMM type, the DTW type, or a combination of the two. In the exemplary embodiment, the SD acoustic model 234 is a DTW acoustic model.
As discussed above, the VR engine 220 performs pattern matching to determine the degree of match between the acoustic feature vectors and the contents of one or more of the acoustic models 230, 232, and 234. In an exemplary embodiment, the VR engine 220 generates match scores by matching the acoustic feature vectors against the different acoustic templates in each of the acoustic models 230, 232, and 234. For example, the VR engine 220 generates HMM match scores by matching a set of acoustic feature vectors against multiple HMM templates in the SIHMM acoustic model 230. Likewise, the VR engine 220 generates DTW match scores by matching the acoustic feature vectors against multiple DTW templates in the SIDTW acoustic model 232. The VR engine 220 also generates match scores by matching the acoustic feature vectors against the templates in the SD acoustic model 234.
As discussed above, each template in an acoustic model is associated with an utterance class. In an exemplary embodiment, the VR engine 220 combines the scores of templates associated with the same utterance class to produce a combined match score to be used in unsupervised training. For example, the VR engine 220 combines the SIHMM and SIDTW scores obtained for a set of input acoustic feature vectors to produce a combined SI score. Based on that combined match score, the VR engine 220 determines whether to store the set of input acoustic feature vectors as an SD template in the SD acoustic model 234. In an exemplary embodiment, only SI match scores are used in the unsupervised training that updates the SD acoustic model 234. This avoids the additional errors that could arise from using the evolving SD acoustic model 234 for its own unsupervised training. Exemplary methods of performing unsupervised training are discussed in greater detail below.
In addition to unsupervised training, the VR engine 220 uses the various acoustic models (230, 232, and 234) during testing. In an exemplary embodiment, the VR engine 220 retrieves match scores from the acoustic models (230, 232, and 234) and generates a combined match score for each utterance class. The combined match scores are used to select the utterance class that best matches the input speech. The VR engine 220 groups consecutive utterance classes together as needed to recognize whole words or phrases. The VR engine 220 then provides information about the recognized word or phrase to a control processor 222, which uses that information to determine an appropriate response to the speech information or command. For example, in response to a recognized word or phrase, the control processor 222 may provide feedback to the user through a display or another user interface. In another example, the control processor 222 may send a message through a wireless modem 218 and an antenna 224 to a wireless network (not shown) to initiate a mobile telephone call to the destination telephone number associated with the person whose name was spoken and recognized.
The wireless modem 218 may transmit signals over any of a variety of wireless channel types, including CDMA, TDMA, or FDMA. The wireless modem 218 may also be replaced with another type of communication interface that communicates over a non-wireless channel without departing from the scope of the described embodiments. For example, the remote station 202 may transmit signaling information over any of a variety of communication channels, including land-line modems, T1/E1, ISDN, DSL, Ethernet, or even traces on a printed circuit board (PCB).
FIG. 3 is a flow chart illustrating an exemplary method of performing unsupervised training. At step 302, an analog-to-digital converter (ADC) (212 in FIG. 2) samples the analog speech data. At step 304, the resulting stream of digital samples is filtered using a pre-emphasis (PE) filter (214 in FIG. 2). At step 306, an Acoustic Feature Extraction (AFE) unit (216 in FIG. 2) extracts input acoustic feature vectors from the filtered samples. The VR engine (220 in FIG. 2) receives the input acoustic feature vectors from the AFE unit 216 and performs pattern matching of the input acoustic feature vectors against the contents of the SI acoustic models (230 and 232 in FIG. 2). At step 308, the VR engine 220 generates match scores from the results of that pattern matching: SIHMM match scores are generated by matching the input acoustic feature vectors against the SIHMM acoustic model 230, and SIDTW match scores are generated by matching them against the SIDTW acoustic model 232. Each acoustic template in the SIHMM and SIDTW acoustic models (230 and 232) is associated with a particular utterance class. At step 310, the SIHMM and SIDTW scores are combined to form combined match scores.
Fig. 4 shows the generation of a combined match score for use in unsupervised training. In an exemplary embodiment, the speaker independent combined match score S_COMB_SI for a particular utterance class is the weighted sum shown as EQN. 1:
S_COMB_SI = W1·SIHMM_T + W2·SIHMM_NT + W3·SIHMM_G + W4·SIDTW_T + W5·SIDTW_NT + W6·SIDTW_G    (EQN. 1)
where:
SIHMM_T is the SIHMM match score for the target utterance class;
SIHMM_NT is the next-best SIHMM match score, i.e., the best match score for a template in the SIHMM acoustic model associated with a non-target utterance class (an utterance class other than the target utterance class);
SIHMM_G is the SIHMM match score for the "garbage" utterance class;
SIDTW_T is the SIDTW match score for the target utterance class;
SIDTW_NT is the next-best SIDTW match score, i.e., the best match score for a template in the SIDTW acoustic model associated with a non-target utterance class; and
SIDTW_G is the SIDTW match score for the garbage utterance class.
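As a concrete illustration of EQN. 1, the following minimal sketch computes the weighted sum; the default weight values are placeholders for illustration only (the patent treats W1 through W6 as tunable and gives no numeric values beyond noting that W1 and W4 are negative).

```python
def combined_si_score(sihmm_t, sihmm_nt, sihmm_g,
                      sidtw_t, sidtw_nt, sidtw_g,
                      w1=-1.0, w2=0.5, w3=0.5, w4=-1.0, w5=0.5, w6=0.5):
    """S_COMB_SI per EQN. 1. Weight defaults are placeholders, not patent values.

    With w1 and w4 negative and the individual scores acting as distances,
    a larger (less negative) result indicates a closer match between the
    input speech and the target utterance class.
    """
    return (w1 * sihmm_t + w2 * sihmm_nt + w3 * sihmm_g +
            w4 * sidtw_t + w5 * sidtw_nt + w6 * sidtw_g)
```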
The individual match scores SIHMM_n and SIDTW_n can be viewed as representing a distance between a series of input acoustic feature vectors and a template in an acoustic model. The greater the distance between the input acoustic feature vectors and a template, the larger the match score; a close match between a template and the input acoustic feature vectors yields a very low match score. If a series of input acoustic feature vectors is compared against two templates associated with different utterance classes and the two resulting match scores are nearly equal, the VR system cannot determine which is the "correct" utterance class.
SIHMM_G and SIDTW_G are the match scores for the "garbage" utterance class. The template or templates associated with the garbage utterance class are called garbage templates and do not correspond to any specific word or phrase. For that reason, they tend to be equally poor matches for all input speech, and the garbage match scores are useful as a kind of noise-floor measurement in a VR system. In general, a series of input acoustic feature vectors should match a template associated with a target utterance class much better than it matches the garbage templates before that utterance class can be recognized with confidence.
Before the VR system can confidently identify an utterance class as "correct", the input acoustic feature vectors should match the templates associated with that utterance class to a much greater degree than they match the garbage templates or the templates associated with other utterance classes. Combined match scores generated from several acoustic models can discriminate between utterance classes more decisively than match scores based on a single acoustic model. In an exemplary embodiment, the VR system uses such combined match scores to determine whether to replace a template in the SD acoustic model (234 in FIG. 2) with a template derived from a new set of input acoustic feature vectors.
The weighting factors (W1 ... W6) are selected to give the best training performance over all acoustic environments. In an exemplary embodiment, the weighting factors (W1 ... W6) are constant across all utterance classes; that is, the Wn used to generate the combined match score for one target utterance class are the same as the Wn used to generate the combined match score for any other target utterance class. In another embodiment, the weighting factors vary with the target utterance class. Other ways of forming the combination shown in FIG. 4 will be apparent to those skilled in the art and are within the scope of the embodiments described herein. For example, more than six or fewer than six weighted inputs may be used. Another obvious variation is to generate the combined match score from a single type of acoustic model: for example, from SIHMM_T, SIHMM_NT, and SIHMM_G, or from SIDTW_T, SIDTW_NT, and SIDTW_G.
In an exemplary embodiment, W1 and W4 are negative numbers, so that a larger (i.e., less negative) value of S_COMB_SI indicates a greater degree of matching (a smaller distance) between the target utterance class and the series of input acoustic feature vectors. Those skilled in the art will appreciate that the signs of the weighting factors could easily be rearranged, without departing from the scope of the described embodiments, so that a greater degree of matching corresponds to a smaller value.
Returning to FIG. 3, at step 310 combined match scores are generated for the utterance classes associated with templates in the HMM and DTW acoustic models (230 and 232). In an exemplary embodiment, combined match scores are generated only for the utterance classes associated with the best n SIHMM match scores and the utterance classes associated with the best m SIDTW match scores. This limitation conserves computing resources, since generating the individual match scores already consumes a significant amount of computing power. For example, if n = m = 3, combined match scores are generated for the utterance classes associated with the three best SIHMM match scores and the utterance classes associated with the three best SIDTW match scores. Depending on how much the two sets of utterance classes overlap, this yields three to six different combined match scores.
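This candidate-selection step can be sketched as follows; it assumes the individual match scores behave like distances (smaller is better, as described above) and are held in per-model dictionaries keyed by utterance class. The helper name and data layout are illustrative.

```python
def select_candidate_classes(sihmm_scores: dict, sidtw_scores: dict,
                             n: int = 3, m: int = 3) -> set:
    """Return the union of the utterance classes with the n best SIHMM
    scores and the m best SIDTW scores (3 to 6 classes when n = m = 3).

    Scores are treated as distances, so "best" means smallest.
    """
    best_hmm = sorted(sihmm_scores, key=sihmm_scores.get)[:n]
    best_dtw = sorted(sidtw_scores, key=sidtw_scores.get)[:m]
    return set(best_hmm) | set(best_dtw)

# Combined match scores (EQN. 1) are then computed only for these candidates.
```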
At step 312, the remote station 202 compares the combined match score with the score stored alongside the corresponding template in the SD acoustic model. If the new series of input acoustic feature vectors matches better than the old template stored for the same utterance class, a new SD template is generated from the new series of input acoustic feature vectors. In an embodiment in which the SD acoustic model is a DTW acoustic model, the series of input acoustic feature vectors itself constitutes the new SD template. The old template is then replaced with the new one, and the combined match score associated with the new template is stored in the SD acoustic model for later comparisons.
In an alternative embodiment, unsupervised training is used to update one or more templates in a speaker dependent hidden Markov model (SDHMM) acoustic model. The SDHMM acoustic model can be used within the SD acoustic model 234 either in place of, or in addition to, the SDDTW model.
In an exemplary embodiment, the comparison at step 312 also includes comparing the combined match score of the prospective new SD template against a constant training threshold. Even if no template has yet been stored in the SD acoustic model for a particular utterance class, a new template is not stored in the SD acoustic model unless its combined match score is better (indicating a greater degree of matching) than the training threshold.
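A minimal sketch of the template update decision described above (steps 312 and 314, including the constant training threshold) might look as follows, using the convention from EQN. 1 that a larger combined score means a better match; the data structures, helper name, and threshold value are illustrative assumptions.

```python
TRAINING_THRESHOLD = 0.0   # assumed placeholder; the patent only says it is constant

def maybe_update_sd_template(sd_model: dict, utterance_class: str,
                             new_template, new_combined_score: float) -> bool:
    """Store `new_template` for `utterance_class` if its combined SI score
    (EQN. 1, larger = better) beats both the training threshold and the score
    stored with the existing template.

    sd_model maps utterance_class -> (template, combined_score).
    Returns True if the SD acoustic model was updated.
    """
    if new_combined_score <= TRAINING_THRESHOLD:
        return False                      # not confident enough to train on
    stored = sd_model.get(utterance_class)
    if stored is None or new_combined_score > stored[1]:
        sd_model[utterance_class] = (new_template, new_combined_score)
        return True
    return False
```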
In an alternative embodiment, the SD acoustic model is initially populated with templates from the SI acoustic model before any of its templates are replaced. Such initialization provides an alternative way of ensuring that, at the outset, VR performance using the SD acoustic model is at least as good as VR performance using the SI acoustic model alone. As more and more templates in the SD acoustic model are updated, VR performance using the SD acoustic model can surpass VR performance using only the SI acoustic model.
In an alternative embodiment, the VR system also allows the user to perform supervised training. The user must put the VR system into a supervised training mode before such supervised training can take place. During supervised training, the VR system has a priori knowledge of the correct utterance class. If the combined match score for the input speech is better than the combined match score of the SD template previously stored for that utterance class, the input speech is used to form a replacement SD template. In an alternative embodiment, the VR system allows the user to force replacement of an existing SD template during supervised training.
The SD acoustic model may be designed with multiple (two or more) templates per utterance class. In an alternative embodiment, two templates are stored in the SD acoustic model for each utterance class. The comparison at step 312 therefore compares the match score obtained with the new template against the match scores of both templates stored in the SD acoustic model for the same utterance class. If the new template has a better match score than either of the older templates, then at step 314 the SD acoustic model template with the worse match score is replaced by the new template. If the new template's match score is not better than either old template, step 314 is skipped. In addition, at step 312 the match score obtained with the new template is compared against a match-score threshold, so that the original contents of the SD acoustic model are not overwritten by a new template until the new template has a better match score than the threshold stored in the SD acoustic model. Obvious variations, such as storing the SD acoustic model templates sorted by combined match score and comparing a new match score against the lowest stored score, are contemplated and considered within the scope of the embodiments disclosed herein. Obvious variations on the number of templates stored per utterance class are also contemplated; for example, the SD acoustic model may contain more than two templates per utterance class, or different numbers of templates for different utterance classes.
FIG. 5 is a flow chart illustrating an exemplary method of performing VR testing using a combination of SI and SD acoustic models. Steps 302, 304, 306, and 308 are the same as described for FIG. 3. The exemplary method diverges from the method of FIG. 3 at step 510. At step 510, the VR engine 220 generates SD match scores based on comparing the input feature vectors against the templates in the SD acoustic model. In an exemplary embodiment, SD match scores are generated only for the utterance classes associated with the best n SIHMM match scores and the best m SIDTW match scores. In an exemplary embodiment, n = m = 3. Depending on the degree of overlap between the two sets of utterance classes, this results in SD match scores for three to six utterance classes. As discussed above, the SD acoustic model may contain multiple templates for a single utterance class. At step 512, the VR engine 220 generates hybrid combined match scores for use in VR testing. In an exemplary embodiment, these hybrid combined match scores are based on both the individual SI and the individual SD match scores. At step 514, the word or utterance with the best combined match score is selected and compared against a testing threshold; an utterance is deemed recognized only if its combined match score exceeds that threshold. In an exemplary embodiment, the weights [W1 ... W6] used to generate the combined scores for training (as shown in FIG. 4) are the same as the weights [W1 ... W6] used to generate the combined scores for testing (as shown in FIG. 6), but the training threshold differs from the testing threshold.
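As a sketch of the testing decision at steps 512 and 514, the following picks the candidate utterance class with the best hybrid combined score and accepts it only if it beats the testing threshold; the function name and threshold value are illustrative, not taken from the patent.

```python
TEST_THRESHOLD = 0.0   # assumed placeholder; distinct from the training threshold

def recognize(candidate_classes, hybrid_scores: dict):
    """Steps 512-514: pick the candidate utterance class with the best hybrid
    combined score (larger = better match) and accept it only if it exceeds
    the testing threshold. Returns None if nothing is confident enough.
    """
    if not candidate_classes:
        return None
    best_class = max(candidate_classes, key=lambda c: hybrid_scores[c])
    return best_class if hybrid_scores[best_class] > TEST_THRESHOLD else None
```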
FIG. 6 shows the generation of hybrid combined match scores at step 512. The combiner in the exemplary embodiment shown operates exactly like the one shown in FIG. 4, except that the weighting factor W4 is applied to DTW_T instead of SIDTW_T, and the weighting factor W5 is applied to DTW_NT instead of SIDTW_NT. DTW_T (the dynamic time warping match score for the target utterance class) is selected as the best of the SIDTW and SDDTW scores associated with the target utterance class. Similarly, DTW_NT (the dynamic time warping match score for the remaining, non-target utterance classes) is selected as the best of the SIDTW and SDDTW scores associated with non-target utterance classes.
The hybrid SI/SD match score S_COMB_H for a particular utterance class is a weighted sum according to EQN. 2, in which SIHMM_T, SIHMM_NT, SIHMM_G, and SIDTW_G are the same as in EQN. 1:
S_COMB_H = W1·SIHMM_T + W2·SIHMM_NT + W3·SIHMM_G + W4·DTW_T + W5·DTW_NT + W6·SIDTW_G    (EQN. 2)
Specifically, in EQN. 2:
SIHMM_T is the SIHMM match score for the target utterance class;
SIHMM_NT is the next-best SIHMM match score, i.e., the best match score for a template in the SIHMM acoustic model associated with a non-target utterance class (an utterance class other than the target utterance class);
SIHMM_G is the SIHMM match score for the garbage utterance class;
DTW_T is the best DTW match score among the SI and SD templates corresponding to the target utterance class;
DTW_NT is the best DTW match score among the SI and SD templates corresponding to non-target utterance classes; and
SIDTW_G is the SIDTW match score for the "garbage" utterance class.
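A minimal sketch of the EQN. 2 combination follows, including the selection of the best of the SI and SD DTW scores. Since the individual match scores behave like distances (smaller means a closer match), the "best" score for a class is taken as the smaller one; the default weights are placeholders, as in the EQN. 1 sketch.

```python
def hybrid_score(sihmm_t, sihmm_nt, sihmm_g,
                 sidtw_t, sddtw_t, sidtw_nt, sddtw_nt, sidtw_g,
                 w1=-1.0, w2=0.5, w3=0.5, w4=-1.0, w5=0.5, w6=0.5):
    """S_COMB_H per EQN. 2. Weight defaults are placeholders, not patent values.

    Individual match scores behave like distances (smaller = closer match),
    so the "best" of the SI and SD DTW scores for a class is the smaller one.
    """
    dtw_t = min(sidtw_t, sddtw_t)      # best DTW score for the target class
    dtw_nt = min(sidtw_nt, sddtw_nt)   # best DTW score among non-target classes
    return (w1 * sihmm_t + w2 * sihmm_nt + w3 * sihmm_g +
            w4 * dtw_t + w5 * dtw_nt + w6 * sidtw_g)
```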
The hybrid SI/SD score S_COMB_H is thus a combination of individual SI and individual SD match scores; the resulting combined match score does not depend exclusively on either the SI or the SD acoustic model. If an SIDTW_T score is better than any SDDTW_T score, the hybrid SI/SD score is computed from the better SIDTW_T score. Likewise, if an SDDTW_T score is better than any SIDTW_T score, the hybrid score is computed from the better SDDTW_T score. As a result, even if the templates in the SD acoustic model produce poor match scores, the VR system can still recognize the input speech from the SI contributions to the SI/SD hybrid score. Such poor SD match scores can have a variety of causes, including differences between the acoustic environments during training and testing, or poor-quality input speech used for training.
In alternative embodiments, the SI scores are weighted more lightly than the SD scores, or may even be omitted entirely. For example, DTW_T may be selected as the best SDDTW score associated with the target utterance class, ignoring the SIDTW scores for the target utterance class, while DTW_NT is still selected from the best of the SIDTW and SDDTW scores associated with non-target utterance classes, so that both sets of scores are used.
Although the exemplary embodiments are described using only SDDTW acoustic models for speaker dependent modeling, the hybrid approach described herein is equally applicable to VR systems that use SDHMM acoustic models, or to systems that use both SDDTW and SDHMM acoustic models. For example, by modifying the approach shown in FIG. 6, the weighting factor W1 could be applied to the best of the SIHMM_T and SDHMM_T scores, and the weighting factor W2 could be applied to the best of the SIHMM_NT and SDHMM_NT scores.
Thus, disclosed herein are VR methods and apparatus that use a combination of SI and SD acoustic models to improve VR performance during unsupervised training and testing. Those skilled in the art will understand that: information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above discussion may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof. Also, although the embodiments described above primarily consider Dynamic Time Warping (DTW) or Hidden Markov Model (HMM) acoustic models, the techniques discussed are equally applicable to other types of acoustic models, such as neural network acoustic models.
Those of ordinary skill would further appreciate that the various illustrative logical units, modules, circuits, and algorithm steps discussed in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To succinctly illustrate this interchangeability of hardware and software, various illustrative components, logic units, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical units, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (17)

1. A method of speech recognition, the method comprising the steps of:
pattern matching a first input speech segment with at least a first template to produce at least one input pattern matching score and to determine a recognized utterance class;
comparing the at least one input pattern matching score to a corresponding score associated with at least a second template from the speaker-specific acoustic model associated with the identified utterance class; and
determining whether to update the at least second template according to the result of the comparison.
2. The method of claim 1, wherein the step for pattern matching further comprises:
performing hidden Markov model pattern matching on the first input speech segment with at least one hidden Markov model template to generate at least one hidden Markov model matching evaluation;
performing dynamic time warping pattern matching on the first input speech segment and at least one dynamic time warping template to generate at least one dynamic time warping matching evaluation; and
generating at least one weighted sum of the at least one hidden Markov model matching evaluation and the at least one dynamic time warping matching evaluation to produce the at least one input pattern matching score.
3. The method of claim 1, further comprising:
generating at least one speaker independent match rating by pattern matching a second input speech segment with the at least first template;
generating at least one speaker-specific matching score by pattern matching the second input speech segment with the at least second template; and
combining the at least one speaker independent matching score with the speaker specific matching score to generate at least one combined matching score.
4. The method of claim 3, further comprising: identifying a pronunciation category associated with a best combination match score of the at least one combination match score.
5. A method of unsupervised speech recognition training and testing, the method comprising the steps of:
pattern matching, in a speech recognition engine (220), input speech from a speaker with content in a speaker-independent acoustic model (230, 232) to produce a speaker-independent pattern matching score;
comparing, with the speech recognition engine (220), the speaker-independent pattern matching assessment to an assessment associated with a template of a speaker-specific acoustic model (234) that is appropriate for the speaker; and
generating a new template for the speaker-specific acoustic model (234) based on the speaker-independent pattern matching score if the speaker-independent pattern matching score is higher than the score associated with the template of the speaker-specific acoustic model (234).
6. The method of claim 5, wherein the speaker independent acoustic models (230, 232) comprise at least one hidden Markov model acoustic model.
7. The method of claim 5, wherein the speaker independent acoustic models (230, 232) include at least one dynamic time warping acoustic model.
8. The method of claim 5, wherein the speaker independent acoustic models (230, 232) include at least one hidden Markov model acoustic model and at least one dynamic time warping acoustic model.
9. The method of claim 5, wherein said speaker independent acoustic model (230, 232) comprises at least one garbage template, and wherein said comparing step comprises comparing said input speech to said at least one garbage template.
10. The method of claim 5, wherein said speaker-specific acoustic models (234) comprise at least one dynamic time warping acoustic model.
11. The method of claim 5, further comprising:
configuring the speech recognition engine (220) to compare a second input speech segment to content in the speaker-independent acoustic model and the speaker-specific acoustic model to generate at least one speaker-specific and speaker-independent combined match score; and
identifying the pronunciation category having the best speaker-specific and speaker-independent combined match score, wherein a pronunciation category is a particular vocabulary word or speech segment.
12. The method of claim 11, wherein the speaker independent acoustic models comprise at least one hidden Markov model acoustic model.
13. The method of claim 11, wherein said speaker independent acoustic models (230, 232) include at least one dynamic time warping acoustic model.
14. The method of claim 11, wherein the speaker independent acoustic models (230, 232) include at least one hidden Markov model acoustic model and at least one dynamic time warping acoustic model.
15. The method as recited in claim 11, wherein said speaker-specific acoustic model (234) comprises at least one dynamic time warping acoustic model.
16. A method of speech recognition, the method comprising the steps of:
performing pattern matching on an input speech segment with at least one speaker independent acoustic template to generate at least one speaker independent matching evaluation;
performing pattern matching on the input speech segment and a speaker-specific acoustic template to generate at least one speaker-specific matching evaluation;
combining the at least one speaker independent match rating with the at least one speaker specific match rating to generate at least one combined match rating, wherein each combined match rating corresponds to a pronunciation category and is dependent on a speaker independent pattern match rating of the pronunciation category and a speaker specific pattern match rating of the pronunciation category, wherein the pronunciation category is a specific vocabulary or a speech fragment.
17. The method of claim 16, wherein said step for pattern matching and said step for combining are performed by a speech recognition engine (220).
HK08111776.9A 2001-03-28 2004-12-02 Voice recognition system using implicit speaker adaptation HK1117264A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US09/821,606 2001-03-28

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
HK04109557.2A Addition HK1066625A (en) 2001-03-28 2002-03-22 Voice recognition system using implicit speaker adaptation

Related Child Applications (1)

Application Number Title Priority Date Filing Date
HK04109557.2A Division HK1066625A (en) 2001-03-28 2002-03-22 Voice recognition system using implicit speaker adaptation

Publications (1)

Publication Number Publication Date
HK1117264A true HK1117264A (en) 2009-01-09

