WO2002029785A1 - Method, apparatus, and system for speaker verification based on orthogonal Gaussian mixture model (GMM) - Google Patents
- Publication number
- WO2002029785A1 WO2002029785A1 PCT/CN2000/000303 CN0000303W WO0229785A1 WO 2002029785 A1 WO2002029785 A1 WO 2002029785A1 CN 0000303 W CN0000303 W CN 0000303W WO 0229785 A1 WO0229785 A1 WO 0229785A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speaker
- model
- test
- feature vectors
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
Definitions
- the present invention relates to the field of speaker recognition. More specifically, the present invention relates to a method, apparatus, and system for speaker verification based upon orthogonal Gaussian mixture model (GMM).
- GMM orthogonal Gaussian mixture model
- the speech signal can convey various types of information at different levels. In particular, not only does the speech signal convey a message as a sequence of words, it also conveys speaker-specific information, for example, information about the identity of the speaker who produced it.
- the field of speech recognition is concerned with extracting the underlying message conveyed in the speech signal and the field of speaker recognition deals with extracting and verifying the identity of the speaker who generates the speech signal.
- Speaker recognition can be divided into two areas: speaker identification and speaker verification. In speaker identification, the task is to determine the identity of a speaker based upon a speech sample provided by that speaker. In speaker verification, the task is to verify whether a speaker is whom he or she claims to be, based upon a speech sample provided by that speaker.
- speaker verification involves a two-way classification or a binary test to determine whether the speaker's claim is correct or not.
- either speaker identification or speaker verification can be constrained to specific phrases or text, which is referred to as text-dependent, or unconstrained to any specific text, which is referred to as text-independent.
- Operating a speaker recognition system typically involves two stages. In the first stage, a user enrolls in the system by providing the system with one or more samples of his speech. These training samples are used by the system to build a model for that user. In the second stage, the user provides a test sample to be used by the system to test the similarity between the test sample and the model(s) of the user(s) in order to perform its corresponding function (e.g., speaker identification or speaker verification).
- the Gaussian mixture speaker model has been successfully and widely used for text-independent speaker verification.
- This modeling technique basically uses a Gaussian mixture density (a weighted sum of several multivariate Gaussian functions) to represent or model the distribution of training feature vectors.
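The weighted sum of multivariate Gaussian functions described above can be sketched in a few lines. The following is an illustrative example (function and variable names are our own, not from the patent), using diagonal covariances, which the patent notes are most common in practice:

```python
import numpy as np

def gmm_density(x, weights, means, variances):
    """Gaussian mixture density: a weighted sum of multivariate Gaussian
    functions, here with diagonal covariance matrices."""
    x = np.asarray(x, dtype=float)
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        mu = np.asarray(mu, dtype=float)
        var = np.asarray(var, dtype=float)   # diagonal of the covariance matrix
        norm = np.prod(2.0 * np.pi * var) ** -0.5
        total += w * norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
    return total
```

The mixture weights should sum to one so that the mixture density itself integrates to one.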
- each Gaussian function may have a full covariance matrix.
- the diagonal covariance matrix has been mostly used in practice because of its computational advantages.
- the elements of feature vectors extracted from a speech signal are correlated.
- a linear combination of diagonal covariance Gaussian functions is capable of modeling the correlation.
- however, a large number of mixtures must be used in order to provide a good approximation of the distribution of the feature vectors extracted from a person's speech.
- Figure 1 is a block diagram of one embodiment of a speaker recognition system according to the teachings of the present invention.
- Figure 2 is a flow diagram of one embodiment of a method according to the teachings of the present invention.
- Figure 3 shows a flow diagram of one embodiment of a method according to the teachings of the present invention.
- a test signal representing a test speech is converted or transformed into a set of test feature vectors that represent the identity of a test speaker who claims a particular identity.
- the test feature vectors are then transformed using corresponding linear transform matrices associated with a speaker independent Gaussian mixture model (SIGMM) that was previously trained.
- SIGMM speaker independent Gaussian mixture model
- the system determines whether to accept or reject the claimed identity of the test speaker based upon the transformed test feature vectors representing the identity of the test speaker and the models, including the speaker dependent model representing the claimed identity and the corresponding anti-models (cohort models or a background model).
- the speaker dependent model in one embodiment, is represented by a speaker dependent GMM (SDGMM) which was constructed using linear transform matrices associated with corresponding mixtures in the speaker independent model.
- SDGMM speaker dependent GMM
- the training feature vectors used to construct the SDGMM are first transformed by the corresponding linear transform matrices associated with the speaker independent model and the parameters of each mixture of the respective SDGMM are trained based upon the transformed training feature vectors.
- the speaker independent GMM is constructed using speech samples provided by a large set of speakers in a training corpus. After the speaker independent GMM is trained, a linear transform matrix is computed for each mixture of the speaker independent model and the resultant linear transform matrices are utilized for the training of the speaker dependent models.
- the speaker independent model is trained using the expectation-maximization (EM) method.
- the speaker dependent models are then trained using the maximum a posteriori (MAP) adaptation method.
- the linear transform matrices computed for the speaker independent model are shared by the speaker dependent model mixtures that are adapted from the same mixtures of the speaker independent model.
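A common form of MAP mean adaptation for GMM speaker models is the relevance-factor update. The patent does not give this exact formula, so the relevance factor, names, and shapes below are illustrative assumptions rather than the patent's method:

```python
import numpy as np

def map_adapt_means(prior_means, responsibilities, features, relevance=16.0):
    """Relevance-style MAP adaptation of GMM mean vectors toward speaker data.

    prior_means: (M, D) speaker-independent mixture means
    responsibilities: (T, M) posterior p(mixture m | frame t)
    features: (T, D) training feature vectors of the enrolling speaker
    """
    n = responsibilities.sum(axis=0)                       # soft frame counts per mixture
    ex = (responsibilities.T @ features) / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]                 # data-dependent interpolation weight
    # Mixtures with much data move toward the data mean; mixtures with little
    # data stay close to the speaker-independent prior mean.
    return alpha * ex + (1.0 - alpha) * prior_means
```

Note that only the means are adapted here, consistent with the observation that the covariance is usually adapted far less than the mean.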
- the teachings of the present invention are applicable to any scheme, method and system for speaker recognition that employs GMMs as the probabilistic model of the underlying sounds of a speaker's voice.
- the present invention is not limited to speaker recognition systems and can be applied to other types of probabilistic and data modeling in speech recognition and in other fields or disciplines including, but not limited to, image processing, signal processing, geometric modeling, computer-aided-design (CAD), computer-aided-manufacturing (CAM), etc.
- the present invention provides a method and a system that combines orthogonal GMM with maximum a posteriori (MAP) adaptation.
- MAP maximum a posteriori
- the correlation of feature vectors can be modeled much better than in diagonal GMM-based speaker verification systems.
- in GMM-based speaker verification systems, the distribution of feature vectors extracted from a speaker's speech is modeled by a Gaussian mixture density, which is a weighted sum of several multivariate Gaussian functions.
- a Gaussian function has the form:

  N(x; μ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

- where μ is the mean vector, Σ is the covariance matrix, and d is the dimension of the feature vector x.
- with Y denoting the space obtained by applying the orthogonal transform (the matrix of eigenvectors of the covariance matrix) to the original feature space X, a diagonal Gaussian function in Y space is equivalent to a Gaussian function with a full covariance matrix in X space. Accordingly, a diagonal GMM in Y space provides a better approximation to the distribution of feature vectors than a diagonal GMM in X space.
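This equivalence can be checked numerically: transforming correlated features by the matrix of covariance eigenvectors yields, up to machine precision, a diagonal covariance in the transformed space. A small sketch with synthetic data (the data and names are illustrative, not from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D features in X space (synthetic, for illustration only)
X = rng.standard_normal((5000, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])
cov_x = np.cov(X, rowvar=False)

# Orthogonal transform: the columns of A are eigenvectors of the covariance
eigvals, A = np.linalg.eigh(cov_x)
Y = X @ A                          # map features into the eigenvector space
cov_y = np.cov(Y, rowvar=False)    # equals A.T @ cov_x @ A, hence diagonal

off_diag = abs(cov_y[0, 1])        # vanishes up to numerical error
```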
- the GMM with orthogonal transform is referred to as orthogonal GMM herein.
- the present invention provides a method to model the correlation of feature vectors more accurately than diagonal GMM-based speaker verification systems. First, a speaker independent model is trained using the EM algorithm, and the eigenvectors of the covariance matrix for each mixture are calculated.
- the linear transform matrix for each mixture in the speaker independent model is composed of these eigenvectors.
- a speaker dependent model for each new speaker enrolled in the system is trained using the MAP method.
- the linear transform matrices computed for the speaker independent model are shared by the mixtures of the speaker dependent model that are adapted from the same mixture of the speaker independent model.
- the covariance matrices in transformed spaces are more diagonal. Accordingly, the diagonal Gaussian functions in transformed spaces will provide a better approximation to the distribution of feature vectors.
- in MAP adaptation, the covariance is usually adapted much less than the mean; some systems therefore adapt only the means. Sharing the transformation matrix makes MAP adaptation more effective because the transformation matrix is computed from the corresponding covariance matrix.
- the feature vectors are first transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker- independent mixture.
- the probability or similarity measure between the input speech and the speaker dependent model is then computed using the transformed vectors for each speaker in the set.
- FIG. 1 illustrates a block diagram of one embodiment of a speaker verification system 100 according to the teachings of the present invention.
- the system 100 includes an analog to digital converter (A/D) 110, a feature extractor or spectral analysis unit 120, a similarity measurement unit 130, a speaker dependent model or reference database 140, and a decision making unit 150.
- An input signal 101 representing a sample speech of a speaker whose claimed identity is to be verified by the system (also referred to as the test speaker herein) is first digitized using the A/D 110.
- the digital signal is sliced into frames of a suitable duration (e.g., 10, 15, or 20 ms).
- the digital signal is then converted or transformed into a set of feature vectors containing acoustic parameters that convey the identity characteristics of the test speaker.
- the feature vectors are then inputted to the similarity measurement unit 130 which computes a similarity measure between the identity of the test speaker as represented by the feature vectors and the claimed identity that is represented by a previously constructed model stored in the speaker dependent model database 140.
- the decision-making unit 150 then compares the similarity measure computed by the similarity measurement unit 130 to a predetermined value or threshold and decides whether to accept or reject the claimed identity of the test speaker.
- the test feature vectors are transformed using the corresponding linear transform matrices computed from a previously trained speaker independent GMM.
- the similarity measurement unit 130 computes the similarity measure based upon the transformed test feature vectors and a set of speaker dependent models stored in the database 140.
- the parameters of the speaker dependent models are trained using transformed training feature vectors that are obtained by transforming the training feature vectors extracted from training speech samples using the corresponding linear transform matrices associated with the previously trained speaker independent GMM.
- Figure 2 shows a flow diagram of a method 200 for performing speaker verification according to the teachings of the present invention.
- the method starts at block 201 and proceeds to block 210.
- a test signal representing a test speech is converted into a set of test feature vectors.
- the test speech is provided by a test speaker who claims a particular identity.
- the test feature vectors are transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker-independent mixture by using the corresponding linear transform matrices that were previously computed for the speaker independent GMM.
- the speaker dependent model is represented by a speaker dependent GMM that was constructed using the corresponding linear transform matrices associated with the speaker independent model.
- a likelihood ratio test is used to determine whether to accept or reject the claimed speaker.
- the likelihood ratio test is well known in the art. Assuming an utterance or sample speech X given by a speaker who claims to be a particular speaker Y that has a corresponding model M, the likelihood ratio is:

  L(X) = p(X | M) / p(X | M̄)

- where M̄ denotes the corresponding anti-model (cohort models or a background model). The likelihood ratio is then compared to a threshold value θ.
- the claimed speaker is accepted if the likelihood ratio exceeds the threshold value and is rejected if it is less than the threshold value.
- the decision threshold value can be set to adjust the tradeoff between false rejection errors and false acceptance errors.
- a cohort normalization method is used to compute the likelihood ratio.
- instead of computing the likelihood ratio using the average score of the entire set of speaker dependent models, only the average score of the subset having the highest scores (excluding the score of the claimed identity model) is used. For example, assume that 100 speaker dependent models represent a set of 100 speakers enrolled in the system (one model per speaker). During the verification phase, the probability of the input speech (as represented by the transformed feature vectors described above) being generated from each model (including the model that represents the claimed identity) is computed. The average score of a predetermined number of the top scores (e.g., the top 10 scores), excluding the score of the claimed identity, is then computed and compared with the score of the claimed identity to generate the likelihood ratio.
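This cohort-normalized scoring can be sketched as follows. The log-domain formulation, the names, and the toy scores are illustrative assumptions, not taken verbatim from the patent:

```python
import numpy as np

def cohort_log_likelihood_ratio(scores, claimed_id, top_n=10):
    """Cohort-normalized score: the claimed model's log-likelihood minus the
    mean of the top-N competing (cohort) model log-likelihoods.

    scores: dict mapping speaker id -> log-likelihood of the test utterance
    """
    claimed = scores[claimed_id]
    # Rank all other enrolled models and keep only the best-scoring subset
    cohort = sorted((s for k, s in scores.items() if k != claimed_id), reverse=True)
    return claimed - float(np.mean(cohort[:top_n]))

def accept(scores, claimed_id, threshold, top_n=10):
    """Accept the claimed identity if the normalized score exceeds the threshold."""
    return cohort_log_likelihood_ratio(scores, claimed_id, top_n) > threshold
```

Raising the threshold reduces false acceptances at the cost of more false rejections, matching the tradeoff described above.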
- FIG. 3 illustrates a flow diagram of one embodiment of a method 300 according to the teachings of the present invention.
- the method 300 starts at block 301 and proceeds to block 305 to perform speaker independent model training.
- a speaker independent GMM having M mixtures is trained using the expectation-maximization (EM) technique.
- EM expectation-maximization
- the linear transform matrix for each mixture of the speaker independent GMM is computed. In one embodiment, this is done by calculating the eigenvectors of the covariance matrix for each mixture.
- the linear transform matrix for each mixture is composed of the corresponding eigenvectors.
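Computing the per-mixture linear transform matrices can be sketched as follows, with illustrative covariance values standing in for those of a trained speaker independent GMM (the values and names are assumptions, not the patent's data):

```python
import numpy as np

# Per-mixture covariance matrices of a trained speaker-independent GMM
# (illustrative values; a real system obtains these from EM training)
covariances = [
    np.array([[2.0, 0.5], [0.5, 1.0]]),
    np.array([[1.5, -0.3], [-0.3, 0.8]]),
]

# One linear transform matrix per mixture, composed of the eigenvectors
# of that mixture's covariance matrix
transforms = []
for cov in covariances:
    _, vecs = np.linalg.eigh(cov)   # columns are orthonormal eigenvectors
    transforms.append(vecs)
```

These matrices are then shared by every speaker dependent mixture adapted from the corresponding speaker independent mixture.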
- the method 300 proceeds to block 313 to perform speaker dependent model training (enrolling new speakers).
- the feature vectors of the training speech provided by a speaker being enrolled are transformed to the spaces spanned by the corresponding linear transform matrices that were computed previously based on the mixtures of the speaker independent model.
- the linear transform matrices computed from the speaker independent model are shared by the mixtures that are adapted from the same mixture of the speaker independent model. With this shared transformation, the covariance matrices in the transformed spaces are more nearly diagonal.
- the parameters of each mixture of the speaker dependent GMM are trained in the transformed spaces using the MAP algorithm. This speaker dependent model training is performed for each speaker enrolled in the system. The method 300 then proceeds to block 331 to perform the speaker verification task.
- the feature vectors extracted from a test speech of a speaker who claims a particular identity are transformed to the corresponding spaces by the corresponding transform matrices. In other words, these feature vectors are transformed to the spaces spanned by the eigenvectors of the covariance matrix of each corresponding speaker independent mixture.
- the probabilities of the feature vectors with respect to the speaker dependent models are calculated in the corresponding spaces to obtain verification results (i.e., whether to accept or reject the claimed identity).
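Putting these pieces together, the per-frame score of an orthogonal GMM, where each mixture first maps the feature vector into its own eigenvector space and then evaluates a diagonal Gaussian there, can be sketched as follows (all names are illustrative, not from the patent):

```python
import numpy as np

def diag_gauss_logpdf(y, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def orthogonal_gmm_loglik(x, weights, transforms, means, variances):
    """Per-frame log-likelihood under an orthogonal GMM: each mixture m maps
    x into its own eigenvector space (y = A_m^T x), then evaluates a diagonal
    Gaussian in that transformed space."""
    logs = []
    for w, A, mu, var in zip(weights, transforms, means, variances):
        y = A.T @ x
        logs.append(np.log(w) + diag_gauss_logpdf(y, mu, var))
    m = max(logs)                                   # log-sum-exp for stability
    return m + np.log(sum(np.exp(l - m) for l in logs))
```

Summing this quantity over all frames of the test utterance gives the utterance-level score that the likelihood ratio test compares against the anti-model score.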
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Game Theory and Decision Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Complex Calculations (AREA)
Abstract
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2000276401A AU2000276401A1 (en) | 2000-09-30 | 2000-09-30 | Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm) |
| PCT/CN2000/000303 WO2002029785A1 (fr) | 2000-09-30 | 2000-09-30 | Procede, appareil et systeme permettant la verification du locuteur s'inspirant d'un modele de melanges de gaussiennes (gmm) |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2000/000303 WO2002029785A1 (fr) | 2000-09-30 | 2000-09-30 | Procede, appareil et systeme permettant la verification du locuteur s'inspirant d'un modele de melanges de gaussiennes (gmm) |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2002029785A1 true WO2002029785A1 (fr) | 2002-04-11 |
Family
ID=4574716
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2000/000303 Ceased WO2002029785A1 (fr) | 2000-09-30 | 2000-09-30 | Procede, appareil et systeme permettant la verification du locuteur s'inspirant d'un modele de melanges de gaussiennes (gmm) |
Country Status (2)
| Country | Link |
|---|---|
| AU (1) | AU2000276401A1 (fr) |
| WO (1) | WO2002029785A1 (fr) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5555320A (en) * | 1992-11-27 | 1996-09-10 | Kabushiki Kaisha Toshiba | Pattern recognition system with improved recognition rate using nonlinear transformation |
| WO1999023643A1 (fr) * | 1997-11-03 | 1999-05-14 | T-Netix, Inc. | Systeme d'adaptation de modele et procede de verification de locuteur |
2000
- 2000-09-30 AU AU2000276401A patent/AU2000276401A1/en not_active Abandoned
- 2000-09-30 WO PCT/CN2000/000303 patent/WO2002029785A1/fr not_active Ceased
Cited By (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2005055200A1 (fr) * | 2003-12-05 | 2005-06-16 | Queensland University Of Technology | Systeme et procede d'adaptation de modele destines a la reconnaissance du locuteur |
| US10257191B2 (en) | 2008-11-28 | 2019-04-09 | Nottingham Trent University | Biometric identity verification |
| US9311546B2 (en) | 2008-11-28 | 2016-04-12 | Nottingham Trent University | Biometric identity verification for access control using a trained statistical classifier |
| GB2465782B (en) * | 2008-11-28 | 2016-04-13 | Univ Nottingham Trent | Biometric identity verification |
| GB2465782A (en) * | 2008-11-28 | 2010-06-02 | Univ Nottingham Trent | Biometric identity verification utilising a trained statistical classifier, e.g. a neural network |
| US20110010171A1 (en) * | 2009-07-07 | 2011-01-13 | General Motors Corporation | Singular Value Decomposition for Improved Voice Recognition in Presence of Multi-Talker Background Noise |
| US9177557B2 (en) * | 2009-07-07 | 2015-11-03 | General Motors Llc. | Singular value decomposition for improved voice recognition in presence of multi-talker background noise |
| US8433567B2 (en) | 2010-04-08 | 2013-04-30 | International Business Machines Corporation | Compensation of intra-speaker variability in speaker diarization |
| CN102237089A (zh) * | 2011-08-15 | 2011-11-09 | 哈尔滨工业大学 | 一种减少文本无关说话人识别系统误识率的方法 |
| WO2017045429A1 (fr) * | 2015-09-18 | 2017-03-23 | 广州酷狗计算机科技有限公司 | Procédé et système de détection de données audio, et support d'informations |
| CN111027453A (zh) * | 2019-12-06 | 2020-04-17 | 西北工业大学 | 基于高斯混合模型的非合作水中目标自动识别方法 |
| US11611581B2 (en) | 2020-08-26 | 2023-03-21 | ID R&D, Inc. | Methods and devices for detecting a spoofing attack |
| CN119441902A (zh) * | 2024-10-18 | 2025-02-14 | 华中科技大学 | 一种无监督降维的电压互感器二次回路异常在线监测方法 |
| CN119441902B (zh) * | 2024-10-18 | 2025-10-17 | 华中科技大学 | 一种无监督降维的电压互感器二次回路异常在线监测方法 |
Also Published As
| Publication number | Publication date |
|---|---|
| AU2000276401A1 (en) | 2002-04-15 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
|
| AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
| REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
| 122 | Ep: pct application non-entry in european phase | ||
| NENP | Non-entry into the national phase |
Ref country code: JP |