TW201638931A - Speech recognition device and rescoring device - Google Patents
- Publication number
- TW201638931A (publication) / TW104129304A (application)
- Authority
- TW
- Taiwan
- Prior art keywords
- language model
- learning
- recognition
- word
- result
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
Description
The present invention relates to a speech recognition device and a rescoring device, and more particularly to the use of a language model based on a recurrent neural network (RNN).
In speech recognition, it is well known that performance improves substantially when a recurrent neural network (RNN) is used as the language model (LM), i.e., an RNN-LM. See, for example, T. Mikolov, M. Karafiat, L. Burget, J. Cernocky and S. Khudanpur, "Recurrent neural network based language model", INTERSPEECH 2010, pp. 1045-1048.
It is also known to rescore the candidates of a speech recognition result with a discriminative language model based on an n-gram model. See, for example, B. Roark, M. Saraclar, M. Collins and M. Johnson, "Discriminative language modeling with conditional random fields and the perceptron algorithm", ACL 2004, pp. 47-54, and Japanese Patent Laid-Open Publication No. 2014-089247.
An ordinary n-gram language model cannot take long-range context into account. In contrast, an RNN-LM can, in principle, retain context of unbounded length. This architecture is shown in Fig. 1. The input x is a 1-of-N (one-hot) representation over a dictionary of N words. The output y gives the posterior probability of each of the N words. The hidden layer holds a low-dimensional vector s. A projection matrix U connects the input layer to the hidden layer, and a projection matrix V connects the hidden layer to the output layer. The hidden layer of the previous time step is copied back to the input layer, which is how context is retained. An LM with this structure can consider a longer context than an n-gram LM can, and therefore produces better recognition hypotheses. Moreover, because the mapping to the hidden layer is onto a low-dimensional vector, similarity between words is captured: for example, the words "dog" and "cat" are interchangeable in many contexts, and in such cases the cosine similarity between their vectors s becomes high.
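As an illustrative, non-limiting sketch of the Fig. 1 structure (not part of the patent text), the forward step of such an Elman-style RNN-LM can be written as follows; the dimensions, the sigmoid nonlinearity and the random initialization are assumptions for the example only.

```python
import numpy as np

# Minimal sketch of the RNN-LM forward step of Fig. 1: one-hot input x,
# low-dimensional hidden state s, projection matrices U (input + recurrent
# state -> hidden) and V (hidden -> output). Sizes are toy values.
rng = np.random.default_rng(0)
N, H = 10, 4                                  # vocabulary size, hidden size
U = rng.normal(scale=0.1, size=(H, N + H))    # input (+ copied state) -> hidden
V = rng.normal(scale=0.1, size=(N, H))        # hidden -> output

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnnlm_step(word_id, s_prev):
    """One time step: posterior y_t over next words and new hidden state s_t."""
    x = np.zeros(N)
    x[word_id] = 1.0                          # 1-of-N input
    s = 1.0 / (1.0 + np.exp(-U @ np.concatenate([x, s_prev])))  # sigmoid
    y = softmax(V @ s)                        # posterior over the N words
    return y, s

s = np.zeros(H)
for w in [3, 1, 7]:                           # a toy word-id sequence
    y, s = rnnlm_step(w, s)
assert np.isclose(y.sum(), 1.0)               # a proper distribution
```

Copying `s_prev` into the input is what lets the context extend, in principle, over the whole history rather than a fixed n-gram window.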
Compared with n-gram models implemented as conventional lookup tables, an RNN-LM requires long processing time, so it is mainly used for rescoring. The configuration used for rescoring is shown in Fig. 2. Recognition means 4 receives speech 1 as input, scores a plurality of candidate sequences using acoustic model 2 and language model 3 for recognition, and provides the scored result as output recognition result 5. Rescoring means 6 then receives recognition result 5 as input and, using language model 7 for rescoring, returns recognition result 8 with the candidates reordered in descending order of likelihood. Language model 7 for rescoring is an RNN-LM. Because language model 7 can take long-range context into account, the recognition performance of the corrected recognition result 8 can be expected to be better than that of recognition result 5.
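A hedged sketch of this two-pass flow of Fig. 2 follows; the second-pass LM is a stand-in callable, and the interpolation weight `lm_weight` is an illustrative parameter, not something the patent specifies.

```python
# Two-pass rescoring sketch: a first pass supplies N-best candidates with
# acoustic and LM scores; a second-pass LM rescores and reorders them.
def rescore_nbest(nbest, second_pass_lm, lm_weight=0.5):
    """nbest: list of (words, acoustic_logp, first_pass_lm_logp) tuples."""
    rescored = []
    for words, am_logp, _ in nbest:
        lm_logp = second_pass_lm(words)       # e.g. RNN-LM log-probability
        rescored.append((am_logp + lm_weight * lm_logp, words))
    rescored.sort(key=lambda t: t[0], reverse=True)   # best candidate first
    return [w for _, w in rescored]

# Toy second-pass LM that prefers shorter hypotheses.
toy_lm = lambda words: -0.5 * len(words)
nbest = [(["a", "b", "c"], -10.0, -3.0), (["a", "b"], -11.0, -2.0)]
order = rescore_nbest(nbest, toy_lm)
assert order[0] == ["a", "b", "c"]            # reordering may differ from pass 1
```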
Since any word that the recognition means 4 can recognize may appear in the recognition result 5, the vocabulary of the rescoring means 6 should preferably cover that of the recognition means 4. However, by modeling unknown words (unk) as a class, the vocabulary of the rescoring means 6 can be made smaller than that of the recognition means 4.
In an RNN-LM, the posterior probability of the next word w_t+1 is computed from the word sequence w_1, w_2, ..., w_t observed so far. The vocabulary to be recognized contains |V| words, and each word is assigned a distinct word number, denoted n (1 ≤ n ≤ |V|). Word numbers may also be assigned according to the result of classifying the words by some criterion. When the word number of the t-th word appearing in the speech is denoted c_t, the evaluation function for learning under the cross-entropy (CE) criterion is given by equation (1).
Here C is the word-number sequence obtained by converting the word sequence appearing in the speech (the correct-answer sequence), and c_t is the word number of its t-th word; that is, C is the ordered sequence c_1, c_2, c_3, .... δ is the Kronecker delta. For y, the Softmax function of equation (2) is normally used.
Here a is the activation, e.g. a = V·s_t. The learning rule, equation (3), is obtained by differentiating F_CE with respect to a.
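The equation images do not survive in this extracted text. A standard reconstruction of equations (1)-(3), consistent with the surrounding description (CE objective over the correct-answer sequence, Softmax output, and the resulting output-layer error), would be; the exact notation is an assumption:

```latex
% Reconstructed from the surrounding prose; the original images are lost.
\begin{align}
F_{\mathrm{CE}}(C) &= \sum_{t} \sum_{n=1}^{|V|} \delta(n, c_t)\,\log y_t(n) \tag{1}\\
y_t(n) &= \frac{\exp(a_n)}{\sum_{m=1}^{|V|} \exp(a_m)} \tag{2}\\
\frac{\partial F_{\mathrm{CE}}}{\partial a_n} &= \delta(n, c_t) - y_t(n) = \varepsilon_t(n) \tag{3}
\end{align}
```

Equation (3) is exactly the error ε_t(n) described below: the difference between the correct-answer indicator and the currently estimated probability.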
During learning, the posterior probability y_t(n) of the next word is computed with the current word given as input, x_t(c_t) = 1. Since the correct answer is given by δ(n, c_t), the difference between the correct answer δ(n, c_t) and the currently estimated probability y_t(n) is the error ε_t(n), and the parameters of the neural network (NN) are updated by backpropagation.
The NN parameters to be learned include at least one element of the projection matrices U and/or V of Fig. 1. They may also include the components of the bias vectors added in the projections by U and V. Backpropagation is carried out, for example, to find parameters that minimize the error ε_t(n); well-known methods and formulas may be used as the concrete procedure.
A concrete example of the conventional rescoring means 6 uses a discriminative language model. Such a model is trained from learning data using the correct answers or the N-best recognition results. An N-best recognition result is, for example, the recognition result listing the top N of all candidates in order of decreasing score.
The score is expressed, for example, as a function of the acoustic model score and the language model score, such as their weighted sum. The discriminative language model is learned with the (averaged) perceptron algorithm: the correct-answer sequence, or the candidate with the fewest recognition errors among the N-best results, is taken as the correct answer; the candidate with the most recognition errors among the N-best results is taken as the incorrect answer; and learning proceeds over the n-grams contained in each. Examples of this method are described in Roark 2004 and Japanese Patent Laid-Open Publication No. 2014-089247, cited above.
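A minimal sketch of this perceptron-style discriminative LM (after Roark et al. 2004): n-gram feature weights are promoted toward the fewest-errors candidate and demoted away from the most-errors candidate. The bigram feature extraction and single update step are illustrative choices, not the patent's or the paper's exact formulation.

```python
from collections import Counter

def bigram_feats(words):
    """Bigram counts with sentence-boundary markers."""
    return Counter(zip(["<s>"] + words, words + ["</s>"]))

def perceptron_update(weights, oracle, worst):
    """One perceptron step: promote oracle n-grams, demote worst n-grams."""
    for ng, cnt in bigram_feats(oracle).items():
        weights[ng] += cnt
    for ng, cnt in bigram_feats(worst).items():
        weights[ng] -= cnt
    return weights

w = Counter()
perceptron_update(w, oracle=["the", "cat"], worst=["a", "cat"])
assert w[("the", "cat")] == 1 and w[("a", "cat")] == -1
```

An averaged variant would keep a running sum of the weight vectors over updates and use the average at test time.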
The first drawback of such conventional methods is that context beyond the n-gram order is not considered: a bigram model considers no context longer than a bigram, and a trigram model no context longer than a trigram.
Second, n-grams that never appear in the N-best recognition results cannot be scored. The method is therefore effective when the domain of the learning data is close to that of the evaluation data, but may be ineffective when the two diverge (for example, when the learning data is read-aloud news text while the evaluation data is freely composed e-mail).
Third, when combined with an RNN-LM, a second rescoring pass is required: in addition to the rescoring by the rescoring means 6 (using the discriminative language model), rescoring with the RNN-LM must be performed before or after it.
The present invention has been made to solve the above problems. Its object is to provide a speech recognition device and a rescoring device that, by introducing a discriminative effect into the RNN-LM, reduce recognition errors, can consider longer context than a discriminative n-gram language model, and remain reasonably robust even to unseen contexts.
To solve the above problems, a speech recognition device according to the present invention stores a discriminatively trained language model. The discriminatively trained language model is trained on the basis of a word-level alignment between the correct-answer sequence and the candidate sequences, and is based on a recurrent neural network. The alignment can be obtained, for example, by dynamic programming, by maximizing the agreement between the word sequences.
Further, a rescoring device according to this invention rescores the candidate sequences of speech recognition using a discriminatively trained language model. The discriminatively trained language model is trained on the basis of a word-level alignment between the correct-answer sequence and the candidate sequences, and is based on a recurrent neural network.
In the rescoring device, a weighted average between the parameters of the original language model and the parameters of the discriminatively trained language model may be taken.
A reliability value may be attached to each word of a candidate sequence. When the discriminatively trained language model is learned, it may place greater emphasis on words with higher reliability.
A first result containing candidate sequences may be obtained with the original language model, a second result containing candidate sequences may be obtained with the discriminatively trained language model, and the first and second results may be merged.
According to this invention, a speech recognition device and a rescoring device are provided that reduce recognition errors, can consider longer context than a discriminative language model, and remain reasonably robust even to unseen contexts.
1‧‧‧speech
2‧‧‧acoustic model
3‧‧‧language model
4‧‧‧recognition means
5‧‧‧N-best recognition result
6‧‧‧rescoring means
7‧‧‧language model
8‧‧‧reordered N-best recognition result
10‧‧‧speech recognition device
20‧‧‧computing means
21‧‧‧recognition means
22‧‧‧alignment means
23‧‧‧discriminative learning means
24‧‧‧rescoring means
25‧‧‧weighting means
26‧‧‧result merging means
30‧‧‧memory means
31‧‧‧acoustic model
32‧‧‧first language model
33‧‧‧N-best recognition result
34‧‧‧correct-answer transcript
35‧‧‧second language model
36‧‧‧original language model
37‧‧‧word reliability
38‧‧‧incorrect-answer candidates
40‧‧‧speech input means
50‧‧‧result output means
60‧‧‧speech
70‧‧‧reordered N-best recognition result
121‧‧‧recognition means
123‧‧‧discriminative learning means
224‧‧‧first rescoring means
225‧‧‧second rescoring means
270‧‧‧reordered N-best recognition result
271‧‧‧reordered N-best recognition result
322‧‧‧alignment means
323‧‧‧model learning means
325‧‧‧weighting means
335‧‧‧second language model
421‧‧‧recognition means
423‧‧‧discriminative learning means
432‧‧‧updated language model
N-best‧‧‧recognition result
[Fig. 1] illustrates a language model based on a recurrent neural network;
[Fig. 2] is a functional block diagram of a conventional speech recognition device;
[Fig. 3] illustrates the alignment between a correct-answer sequence and candidate sequences;
[Fig. 4] shows an example hardware configuration of the speech recognition device according to the first embodiment;
[Fig. 5] is a flowchart of the learning processing carried out by the speech recognition device of Fig. 4;
[Fig. 6] is a flowchart of the application processing carried out by the speech recognition device of Fig. 4;
[Fig. 7] is a functional block diagram of the speech recognition device of Fig. 4;
[Fig. 8] is a functional block diagram of the speech recognition device according to the second embodiment;
[Fig. 9] is a functional block diagram of the speech recognition device according to the third embodiment;
[Fig. 10] is a functional block diagram of the speech recognition device according to the fourth embodiment;
[Fig. 11] is a functional block diagram of the speech recognition device according to the fifth embodiment; and
[Fig. 12] is a functional block diagram of the speech recognition device according to the sixth embodiment.
Embodiments of the invention are described below with reference to the accompanying drawings.
The first embodiment uses an RNN-LM trained under a discriminative criterion. The object is to improve recognition performance by training the RNN-LM discriminatively. Since one important purpose of a language model is to convert the speech to be recognized into correct text, it is desirable to construct a language model that can correct conventional speech recognition results.
Accordingly, in addition to the correct-answer labels c_t, the hypotheses h_t produced by speech recognition are used, and the RNN-LM is constructed discriminatively. As the objective function in this case, a likelihood ratio at the word level is used, as in equation (4) below. Evaluation functions well suited to discriminative training, such as maximum mutual information or minimum phone error, may also be used.
Here H is the ordered hypothesis sequence h_1, h_2, h_3, ..., and β is a discount factor. Differentiating with respect to a in the same way yields the learning rule of equation (5).
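The images for equations (4) and (5) likewise do not survive here. A reconstruction consistent with the described word-level likelihood ratio and with the discounting behaviour of Fig. 3(b) would be the following; the precise form is an assumption, not a verbatim copy of the patent:

```latex
% Reconstructed: word-level log likelihood ratio between correct answers
% and recognizer hypotheses, discounted by beta.
\begin{align}
F(C, H) &= \sum_{t} \bigl( \log y_t(c_t) - \beta \log y_t(h_t) \bigr) \tag{4}\\
\frac{\partial F}{\partial a_n} &= \delta(n, c_t) - \beta\,\delta(n, h_t)
  - (1 - \beta)\, y_t(n) \tag{5}
\end{align}
```

Note that when c_t = h_t (the recognizer was correct), equation (5) reduces to (1 - β)(δ(n, c_t) - y_t(n)), i.e. the ordinary CE gradient scaled by the discounted weight 1 - β, which matches the weighting described for Fig. 3(b).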
This procedure is explained concretely with Fig. 3. Suppose the correct-answer sequence is A, B, C, D, and that insertion (I), deletion (@) and substitution (S) errors occur. The correct-answer sequence C is first aligned with the speech recognition result H, giving the correspondence of Fig. 3(a).
In ordinary RNN-LM training, the weights of A, B, C and D are each 1; the error ε is computed and the RNN-LM parameters are updated according to equation (3). In the first embodiment, by contrast, as shown in Fig. 3(b), the weights of the correctly recognized words (words A and D in this example) are discounted, so that higher weight is placed on learning the recognition results for the incorrectly recognized positions. In this example, the weight of a correct position is reduced from the weight 1 of an incorrect position by the discount factor β, which gives B and C a higher weight in learning. This is the intent of equation (5).
Insertion errors require special handling. For example, for the correct-answer sequence of Fig. 3(a), a candidate sequence ABCID containing an erroneously inserted word I may be obtained, and no correct-answer word corresponds to I. In such a case the word I may simply be ignored and the candidate treated as "ABCD", or, as in Fig. 3(b), it may be handled by repeating the word C of the previous time step.
When there are two or more candidates (e.g. an N-best recognition result), each candidate may be processed in the same way. In the 2-best case, for example, the alignment processing of Fig. 3 is performed for the first-ranked candidate and the RNN-LM parameters are updated; the same alignment processing and update are then performed for the second-ranked candidate.
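A hedged sketch of the discounted-weight output error described around Fig. 3(b): the correct word is pushed up with weight 1 and the recognizer's hypothesis is pushed down with weight β, so positions where the recognizer was already correct receive the reduced effective weight 1 - β. This implements the reconstructed form of equation (5) given earlier in this text, which is itself an assumption about the lost equation image.

```python
import numpy as np

def output_error(y, correct_id, hyp_id, beta=0.5):
    """eps_t(n) = delta(n,c_t) - beta*delta(n,h_t) - (1-beta)*y_t(n)."""
    eps = -(1.0 - beta) * y.copy()
    eps[correct_id] += 1.0        # push the correct word up
    eps[hyp_id] -= beta           # push the recognizer's hypothesis down
    return eps

y = np.full(4, 0.25)              # uniform toy posterior over 4 words
err_wrong = output_error(y, correct_id=0, hyp_id=2)   # recognizer erred
err_right = output_error(y, correct_id=0, hyp_id=0)   # recognizer correct
# Where the recognizer was right, the update is the CE error scaled by 1-beta.
assert np.allclose(err_right, (1 - 0.5) * (np.eye(4)[0] - y))
```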
Fig. 4 shows an example hardware configuration of the speech recognition device 10 according to the first embodiment of the present invention. The speech recognition device 10 can be implemented, for example, on a well-known computer, and comprises computing means 20, memory means 30, speech input means 40 and result output means 50. The computing means 20 includes a processor, and the memory means 30 includes storage media such as semiconductor memory and an HDD (hard disk drive). A program (not shown) stored in the memory means 30 is executed by the computing means 20 to realize the functions of the speech recognition device 10 described in this specification. The program may also be recorded on a non-transitory information storage medium.
The speech input means 40 is, for example, a microphone, and accepts input of speech 60 containing a word sequence. Alternatively, the speech input means 40 may be an electronic data input device that accepts the speech 60 as electronic data. The result output means 50 is, for example, a liquid crystal display, a printer or a network interface, and outputs the reordered N-best recognition result 70.
Figs. 5 and 6 are flowcharts showing the processing carried out by the speech recognition device 10.
Fig. 5 is the flowchart for learning. When the speech recognition device 10 operates according to the flow of Fig. 5, it may be regarded as a speech recognition learning device. First, the speech recognition device 10 accepts input of training speech 60 (step S1). Next, it performs speech recognition processing on the speech 60 and obtains an N-best recognition result (step S2). Next, it aligns each candidate sequence contained in the N-best recognition result with the correct-answer sequence (step S3). Next, it discriminatively trains the language model on the basis of the alignment result (step S4). Finally, it outputs the discriminatively trained language model (step S5). Learning is normally performed with many correct-answer sequences, but the invention can be practiced with at least one correct-answer sequence and at least one candidate sequence.
Fig. 6 is the flowchart for application. When the speech recognition device 10 operates according to the flowchart of Fig. 6, it may be called a rescoring device. First, the speech recognition device 10 accepts input of the speech 60 to be recognized (step S6). Next, it performs speech recognition processing on the speech 60 and obtains an N-best recognition result (step S7). Next, it rescores each candidate sequence contained in the N-best recognition result according to the discriminatively trained language model (step S8). Finally, it outputs the reordered N-best recognition result 70 according to the rescoring result (step S9). A plurality of candidate sequences is normally output, but the invention is applicable as long as at least one candidate sequence is output.
Fig. 7 is a functional block diagram of the speech recognition device 10. The computing means 20 of the speech recognition device 10 functions as recognition means 21, alignment means 22, discriminative learning means 23 and rescoring means 24. The memory means 30 of the speech recognition device 10 can store an acoustic model 31, a first language model 32, an N-best recognition result 33, a correct-answer transcript 34 and a second language model 35. The first language model 32 is, for example, a language model configured for speech recognition, and the second language model 35 is, for example, a language model configured for rescoring.
The recognition means 21, the acoustic model 31 and the first language model 32 may be of conventional configuration; that is, they may be the recognition means 4 of Fig. 2 using the acoustic model 2 and the language model 3 for recognition.
The configuration of Fig. 7 adds the correct-answer transcript 34, the alignment means 22, the discriminative learning means 23 and the second language model 35 to the conventional configuration of Fig. 2.
The alignment means 22 aligns the N-best recognition result 33 with the correct-answer transcript 34. To align means to put, for example, each word contained in the correct-answer sequence into correspondence with a word contained in a candidate sequence. In the example of Fig. 3(a), the words A, B and D of the correct-answer sequence are put into correspondence with the words A, S and D of the candidate sequence. For words left without a correspondence, an insertion or a deletion is assumed: in the example of Fig. 3(a), the word C has been deleted and the word I has been inserted. The alignment that maximizes the agreement can be found, for example, by dynamic programming.
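The dynamic-programming alignment described here is the usual Levenshtein (edit-distance) construction; a self-contained sketch, with each position labelled correct (C), substitution (S), insertion (I) or deletion (D), might look like this. The tie-breaking order in the backtrace is an implementation choice.

```python
def align(ref, hyp):
    """Word-level edit-distance alignment of a hypothesis against a reference."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    ops, i, j = [], n, m                      # backtrace for the op labels
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append("C" if ref[i - 1] == hyp[j - 1] else "S")
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append("I")                   # word inserted in the hypothesis
            j -= 1
        else:
            ops.append("D")                   # reference word deleted
            i -= 1
    return ops[::-1]

# Fig. 3-style example: correct sequence ABCD, candidate ABCID with inserted I.
assert align(list("ABCD"), list("ABCID")) == ["C", "C", "C", "I", "C"]
```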
辨識學習手段23,根據對準處理的結果,進行辨識學習產生或更新第2語言模型35。第2語言模型35,根據RNN而構成。第2語言模型35的辨識學習,例如以使用上述的式(5)的逆傳播進行,藉此更新RNN的參數。這以與習知的學習中的逆傳播同樣的方法可以進行。於是,根據正確答案列與候補列之間的對準學習第2語言模型35。 The recognition learning means 23 performs the recognition learning to generate or update the second language model 35 based on the result of the alignment processing. The second language model 35 is constructed based on RNN. The identification learning of the second language model 35 is performed, for example, by inverse propagation using the above equation (5), thereby updating the parameters of the RNN. This can be done in the same way as the reverse propagation in the conventional learning. Then, the second language model 35 is learned based on the alignment between the correct answer column and the candidate column.
The adjustment means 24 adjusts the N-best recognition results 33 based on the second language model 35 to obtain re-ranked N-best recognition results 70. "Adjustment" here means, for example, rescoring candidate sequences that have already been scored once; in the first embodiment, the initial scores are those produced by the recognition means 21.
For example, when the score of each candidate is expressed by an acoustic-model score and a language-model score, the adjustment means 24 replaces the language-model score of each candidate contained in the N-best recognition results 33 with the language-model score inferred by the NN, or takes a weighted average with the original language-model score. By using the discriminatively trained second language model 35, the adjustment can take into account the error tendencies of the speech recognition performed by the recognition means 21.
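The rescoring just described can be sketched as follows. `rnn_lm_score` is a hypothetical stand-in for the score assigned by the second language model 35; the score combination shown is one common choice, not necessarily the patented one.

```python
def rescore_nbest(nbest, rnn_lm_score, lm_weight=1.0, interp=1.0):
    """Re-rank an N-best list.  Each candidate is (words, am_score, lm_score)
    in the log domain.  `rnn_lm_score(words)` is assumed to return the score
    of the word sequence under the discriminatively trained RNN-LM.
    interp=1.0 replaces the original LM score outright; 0 < interp < 1
    takes a weighted average with it instead."""
    rescored = []
    for words, am_score, lm_score in nbest:
        new_lm = interp * rnn_lm_score(words) + (1.0 - interp) * lm_score
        rescored.append((words, am_score + lm_weight * new_lm))
    # Best-scoring candidate first
    return sorted(rescored, key=lambda cand: cand[1], reverse=True)
```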
For example, when a car navigation system composes short text messages by speech recognition, learning the error tendencies of a particular user in advance yields more accurate text. Likewise, for speech whose vocabulary or syntax is restricted, such as fixed commands, building the second language model 35 from the relevant commands in advance has the advantage that a general-purpose model can be used as the first language model 32.
As described above, because the RNN-LM benefits from discriminative training, it can correct the N-best recognition results more effectively than the conventional configuration.
Compared with the conventional configuration combined with a discriminative language model, the present invention can generalize to contexts that do not appear in the training data, and is therefore more robust, for example, to differences in domain. The words "dog" and "cat", say, are interchangeable in many contexts; when such words are mapped to low-dimensional vectors s, the cosine similarity between their vectors becomes high. The training effect of an occurrence of "dog" in the training data therefore resembles the training effect of an occurrence of "cat", giving a generalization from nearby contexts that contain interchangeable words. Such an effect cannot be obtained with a conventional discriminative language model. The dimensionality of the vector s can generally be designed to be appropriately smaller than |V|.
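The effect described here rests on interchangeable words receiving nearby embedding vectors. A minimal sketch of the cosine-similarity comparison, with made-up toy embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings: words used in similar contexts end up with nearby
# vectors, so a parameter update driven by "dog" also transfers to "cat".
dog = [0.9, 0.1, 0.8]
cat = [0.8, 0.2, 0.9]
car = [-0.7, 0.9, 0.1]
print(cosine(dog, cat))  # high
print(cosine(dog, car))  # low
```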
Moreover, compared with the conventional configuration that uses an RNN-LM together with a discriminative language model, a single adjustment pass suffices. Of course, performance can be improved further by additionally applying another discriminative language model in a later stage: for example, an additional adjustment means may be placed after the adjustment means 24 to adjust the re-ranked N-best recognition results 70 based on another discriminative language model.
Although each embodiment in this specification performs training and application with a single device, training and application may be performed with separate devices (different computers and the like). For example, the training device need not include the adjustment means 24, and the application device need not include the alignment means 22 or the discriminative learning means 23. The application device may also be a conventional speech recognition device (e.g., a device configured as in Fig. 2), provided that the adjustment uses the second language model 35.
The first embodiment uses the discriminatively trained second language model 35 as it is. In the second embodiment, parameters obtained as a weighted average of the original language model 36 and the second language model 35 are used instead. This configuration reduces the influence of over-fitting.
The original language model 36 is the second language model 35 before the NN parameter updates by the discriminative learning means 23, that is, it is identical to the second language model 35 in its initial state. In other words, the second language model 35 is produced by applying discriminative learning to the original language model 36.
The configuration of the second embodiment is shown in Fig. 8; a weighting means 25 is added, and the computation means 20 of the speech recognition device 10 functions as the weighting means 25. The weighting means 25 takes a weighted average of the parameters of the original language model 36 and the parameters of the second language model 35. For the configuration of Fig. 1, for example, this is expressed by Eq. (6):
{U, V} ← τ{U_CE, V_CE} + (1 − τ){U_LR, V_LR}   (6)
Here U_CE and V_CE are the parameters of the model trained with cross entropy, U_LR and V_LR are the parameters of the discriminatively trained model, and τ is a smoothing coefficient. Each language model generally contains multiple parameters, but the weighted average can be taken for any language model containing at least one parameter.
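The parameter smoothing of Eq. (6) can be sketched as follows; the flat-list representation of U and V is a simplification (real models would use weight matrices):

```python
def interpolate_params(params_ce, params_lr, tau):
    """Smooth the discriminatively trained parameters toward the original
    cross-entropy model, as in Eq. (6):
        {U, V} <- tau * {U_CE, V_CE} + (1 - tau) * {U_LR, V_LR}
    Each argument maps a parameter name ("U", "V", ...) to a flat list
    of weights."""
    return {name: [tau * ce + (1.0 - tau) * lr
                   for ce, lr in zip(params_ce[name], params_lr[name])]
            for name in params_ce}
```

Note that choosing τ > 1 makes the coefficient on the second model negative, which is how the fifth embodiment penalizes a model trained only on incorrect candidates.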
As described above, by averaging the original language model 36 with the discriminatively trained second language model 35, the over-fitting that discriminative training tends to cause is reduced, yielding a more stable training effect.
The third embodiment uses an RNN-LM trained with a discriminative criterion that exploits word confidence.
The configuration of the third embodiment is shown in Fig. 9. In this example, a recognition means 121 is provided in place of the recognition means 21 of the first and second embodiments, and a discriminative learning means 123 is provided in place of the discriminative learning means 23 of the first and second embodiments. The computation means 20 of the speech recognition device 10 functions as the recognition means 121 and the discriminative learning means 123.
The recognition means 121 outputs the N-best recognition results 33 and, at the same time, computes a confidence for each word contained in them, outputting these as word confidences 37. The word confidences 37 are stored, for example, in the memory means 30 of the speech recognition device 10. The discriminative learning means 123 performs discriminative learning to create or update the second language model 35 based on the word confidences 37 in addition to the result of the alignment processing.
Many methods for obtaining word confidence are well known. For example, the confidence of a particular candidate at a given time can be taken as the proportion of its likelihood in the sum of the likelihoods of all candidates at that time. That is, when the word candidates at time t are w_t^i (1 ≤ i ≤ I), using the likelihood p(w_t^i) of each candidate, the confidence can be expressed as

ν(w_t^i) = p(w_t^i) / Σ_{j=1}^{I} p(w_t^j)
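The time-synchronous confidence just described, the likelihood of a candidate normalized by the likelihoods of all candidates competing at that time, can be sketched as:

```python
def word_confidences(likelihoods):
    """Confidence of each word candidate at one time step: its likelihood
    divided by the sum of the likelihoods of all I candidates competing
    at that time, so the confidences sum to 1."""
    total = sum(likelihoods)
    return [p / total for p in likelihoods]

# Three competing candidates at one time step
print(word_confidences([0.6, 0.3, 0.1]))
```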
Because an error on a word with high confidence is considered more serious than an error on a word with low confidence, the discount rate can be varied according to the word confidence, computed, for example, by Eq. (7).
When a word of an incorrect candidate has the highest confidence (e.g., ν_t = 1), it is learned with the maximum weight (e.g., 1). Conversely, when a word of an incorrect candidate has the lowest confidence (e.g., ν_t = 0), learning from that word would have little effect, so it is learned with a discounted weight (e.g., 1 − β), the same as for a correct answer.
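The exact form of Eq. (7) is not reproduced in this text, so the linear form below is an assumption; it is simply one weighting consistent with the two boundary cases just described (weight 1 at ν_t = 1, weight 1 − β at ν_t = 0):

```python
def learning_weight(confidence, beta):
    """Learning weight for an incorrectly recognised word.  Consistent with
    the boundary cases in the text: a confident error (confidence 1) gets
    full weight 1, while a low-confidence error (confidence 0) gets the
    discounted weight 1 - beta.  The linear interpolation in between is
    an assumption, not the patent's Eq. (7)."""
    return 1.0 - beta * (1.0 - confidence)
```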
Thus, in the third embodiment, each word of a candidate sequence has a confidence, and the second language model 35 is trained with greater emphasis on words with higher confidence.
With this configuration, even identical word errors can be learned with different weights, so that important words in particular are learned with larger weights. As described above, training with word confidence allows learning to reflect the severity of each recognition error.
In Fig. 9, the weighting means 25 and the original language model 36 are provided as in the second embodiment, but they may be omitted as in the first embodiment.
In the first and second embodiments, the results of training are integrated at the language-model level. In the fourth embodiment, by contrast, the results of training are integrated at the recognition-result level.
Fig. 10 shows the configuration of the fourth embodiment. A first adjustment means 224 and a second adjustment means 225 are provided in place of the adjustment means 24 of the first and second embodiments; the computation means 20 of the speech recognition device 10 may function as the first adjustment means 224 and the second adjustment means 225.
The first adjustment means 224 adjusts the N-best recognition results 33 based on the original language model 36 to obtain re-ranked N-best recognition results 270 (a first result). The second adjustment means 225 adjusts the N-best recognition results 33 based on the discriminatively trained second language model 35 to obtain re-ranked N-best recognition results 271 (a second result). The re-ranked N-best recognition results 270 and 271 may be stored in the memory means 30 of the speech recognition device 10.
The fourth embodiment further provides a result integration means 26; the computation means 20 of the speech recognition device 10 may function as the result integration means 26. The result integration means 26 merges the re-ranked N-best recognition results 270 and 271 to obtain the final re-ranked N-best recognition results 70.
The integration may be performed, for example, by comparing the candidates by score and selecting the higher-scoring candidate.
Alternatively, the integration may be performed by majority voting. The specific voting scheme can be designed freely: for example, majority voting over three or more systems may be used, and when the systems output different candidates, the comparison may fall back to scores.
When integrating, the scores of any of the language models may first be discounted appropriately. For example, a weight smaller than 1 (e.g., 0.8) may be applied to a language model known to be less reliable before the candidates are compared by score.
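The score-level integration with per-model discounting described above can be sketched as follows. Keeping each candidate's best weighted score and re-sorting is one reasonable reading of "compare by score and select the higher-scoring candidate"; it is an illustration, not the patent's exact procedure.

```python
def combine_results(result_lists, model_weights):
    """Merge several re-ranked N-best lists at the recognition-result
    level.  Each list maps a candidate (e.g. a transcription string) to
    its score; each model's scores are first discounted by its weight,
    then each candidate keeps its best weighted score, and the merged
    list is re-sorted, best first."""
    best = {}
    for scores, weight in zip(result_lists, model_weights):
        for cand, score in scores.items():
            weighted = weight * score
            if cand not in best or weighted > best[cand]:
                best[cand] = weighted
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

# Second model is considered less reliable, so its scores get weight 0.8
merged = combine_results([{"A": 10.0, "B": 8.0},
                          {"A": 6.0, "C": 12.0}], [1.0, 0.8])
print(merged)
```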
Such integration processing can of course also be applied to a configuration that uses word confidence, as in the third embodiment.
As described above, adjusting independently with multiple language models allows more robust adjustment than using a single (or averaged) language model.
The fifth embodiment is a configuration in which only incorrect-answer hypotheses are used for the discriminative training of the language model.
The first to fourth embodiments train on both correct-answer candidates and incorrect-answer candidates. To obtain the effect of discriminative training more simply, however, a language model trained only on incorrect-answer hypotheses can be used.
Fig. 11 shows the configuration of the fifth embodiment. An alignment means 322 is provided in place of the alignment means 22 of the second embodiment. The alignment means 322 extracts incorrect-answer candidates 38 from the N-best recognition results 33 and aligns them.
A model learning means 323 is provided in place of the discriminative learning means 23 of the second embodiment. Based on the result of the alignment, the model learning means 323 trains on the incorrect-answer candidates 38 to create or update a second language model 335. This training itself need not be discriminative; for example, the model learning means 323 trains by updating the NN parameters according to Eq. (3).
A weighting means 325 is provided in place of the weighting means 25 of the second embodiment. To penalize the incorrect-answer candidates, the weighting means 325 takes a weighted average of the parameters of the original language model 36 and the parameters of the second language model 335. For example, the weighting means 325 weights the parameters of the second language model 335 negatively (i.e., τ in Eq. (6) is made larger than 1).
Thus, even though the training of the language model itself is not discriminative, the speech recognition device 10 as a whole achieves discriminative learning by combining the original language model with a language model trained on the incorrect candidates.
The computation means 20 of the speech recognition device 10 may function as the alignment means 322, the model learning means 323, and the weighting means 325. The incorrect-answer candidates 38 and the second language model 335 may be stored in the memory means 30 of the speech recognition device 10.
As described above, by using, in addition to the original language model 36, a second language model 335 trained only on incorrect answers, the effect of discriminative training is obtained simply, without changing the language-model training method itself.
In the first embodiment, the first language model 32 used for speech recognition is not itself subject to discriminative training. In the sixth embodiment, by contrast, the language model used for speech recognition is trained as an RNN-LM.
Fig. 12 shows the configuration of the sixth embodiment. In the sixth embodiment, a discriminative learning means 423 is provided in place of the discriminative learning means 23 of the first embodiment; it performs discriminative training to update the language model 432 based on the result of the alignment processing. A recognition means 421 is provided in place of the recognition means 21 of the first embodiment; it performs speech recognition using the discriminatively trained language model 432 and outputs the N-best recognition results 33.
With this configuration, the effect of discriminative training is obtained as in the first embodiment.
S1‧‧‧Input speech
S2‧‧‧Obtain N-best recognition results by speech recognition
S3‧‧‧Align the N-best recognition results
S4‧‧‧Perform discriminative learning
S5‧‧‧Output the discriminatively trained language model