TWI731921B - Speech recognition method and device - Google Patents
- Publication number: TWI731921B
- Application number: TW106102245A
- Authority: TW (Taiwan)
- Prior art keywords: word, sequence, verified, preset, user
Abstract
This application discloses a speech recognition method, comprising: using preset speech knowledge sources to generate a search space that incorporates client-side preset information and is used for decoding speech signals; extracting the feature vector sequence of the speech signal to be recognized; calculating the probabilities that the feature vectors correspond to the basic units of the search space; and, taking these probabilities as input, performing a decoding operation in the search space to obtain the word sequence corresponding to the feature vector sequence. This application also provides a speech recognition device, as well as another speech recognition method and device. Because client-side preset information is incorporated when the search space used for decoding is generated, the method provided by this application can recognize client-related information relatively accurately when recognizing speech signals collected on the client, thereby improving the accuracy of speech recognition and the user experience.
Description
This application relates to speech recognition technology, and specifically to a speech recognition method and device. This application also relates to another speech recognition method and device.
Speech is the acoustic expression of language. It is the most natural, effective, and convenient means for humans to exchange information, as well as a vehicle for human thought. Automatic Speech Recognition (ASR) generally refers to the process by which computers and other devices recognize and understand speech, converting spoken language into corresponding output text or commands. Its core framework is as follows: on the basis of statistical modeling, given a feature sequence O extracted from the speech signal to be recognized, the following Bayesian decision criterion is used to solve for the optimal word sequence W* corresponding to that signal: W* = argmax_W P(O|W)P(W)
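Spelled out, the criterion follows from Bayes' rule: the word-sequence posterior P(W|O) is maximized, the denominator P(O) is dropped because it does not depend on W, and what remains is an acoustic-model term P(O|W) and a language-model term P(W):

```latex
\begin{align*}
W^{*} &= \operatorname*{argmax}_{W} P(W \mid O)
       = \operatorname*{argmax}_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
       = \operatorname*{argmax}_{W} P(O \mid W)\,P(W)
\end{align*}
```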
In a concrete implementation, the above process of solving for the optimal word sequence is called decoding (the module that implements the decoding function is usually called the decoder); that is, the optimal word sequence given by the formula above is searched for in a search space composed of multiple knowledge sources such as the pronunciation dictionary and the language model.
With the development of technology, the computing power and storage capacity of hardware have advanced greatly, speech recognition systems have gradually been adopted in industry, and various applications that use speech as the human-computer interaction medium have appeared on client devices. For example, with a phone-dialing application on a smartphone, the user only needs to issue a voice instruction (such as "Call Zhang San") to place the call automatically.
Current speech recognition applications generally adopt one of two modes. The first is a client-server mode: the client collects speech and uploads it to a server over the network; the server decodes the speech into text and sends the text back to the client. This mode is used because the client's computing power is relatively weak and its memory is limited, while the server has clear advantages in both respects; however, without network access the client cannot perform speech recognition at all. To address this problem, a second mode has emerged that relies only on the client: the models and search space originally stored on the server are scaled down and placed locally on the client device, and the client itself performs both speech collection and decoding.
In practice, under either mode, speech recognition based on the general framework described above usually cannot effectively recognize content in the speech signal that relates to information local to the client device, such as contact names in the address book. This leads to low recognition accuracy, inconveniences the user, and degrades the user experience.
The embodiments of this application provide a speech recognition method and device to solve the problem that existing speech recognition technology has low recognition accuracy for information local to the client. The embodiments of this application also provide another speech recognition method and device.
This application provides a speech recognition method, comprising: using preset speech knowledge sources to generate a search space that incorporates client-side preset information and is used for decoding speech signals; extracting the feature vector sequence of the speech signal to be recognized; calculating the probabilities that the feature vectors correspond to the basic units of the search space; and, taking these probabilities as input, performing a decoding operation in the search space to obtain the word sequence corresponding to the feature vector sequence.
Optionally, the search space comprises: a weighted finite-state transducer.
Optionally, the basic units of the search space comprise: context-dependent triphones; the preset knowledge sources comprise: a pronunciation dictionary, a language model, and a triphone state-tying table.
Optionally, using preset speech knowledge sources to generate a search space that incorporates client-side preset information and is used for decoding speech signals comprises: adding, by label replacement, client-side preset information corresponding to a preset topic category to a pre-generated weighted finite-state transducer built at least from the language model, and obtaining a single weighted finite-state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model; wherein the language model is pre-trained as follows: the preset named entities in the text used for training the language model are replaced with labels corresponding to the preset topic category, and the language model is trained on that text.
Optionally, adding, by label replacement, client-side preset information corresponding to a preset topic category to a pre-generated weighted finite-state transducer built at least from the language model, and obtaining a single weighted finite-state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model, comprises: adding, by label replacement, the client-side preset information corresponding to the preset topic category to a pre-generated weighted finite-state transducer based on the language model; and composing (merging) the weighted finite-state transducer to which the client-side preset information has been added with a pre-generated weighted finite-state transducer based on the triphone state-tying table and the pronunciation dictionary, to obtain the single weighted finite-state transducer.
Optionally, the text used for training the language model refers to text targeted at the preset topic category.
Optionally, there are at least two preset topic categories; the number of language models and the number of weighted finite-state transducers built at least from a language model are each equal to the number of preset topic categories. Adding, by label replacement, client-side preset information corresponding to a preset topic category to a pre-generated weighted finite-state transducer built at least from the language model comprises: determining the preset topic category to which the speech signal to be recognized belongs; selecting the pre-generated weighted finite-state transducer, built at least from a language model, that corresponds to that preset topic category; and adding client-side preset information to the selected weighted finite-state transducer by replacing the corresponding labels with the client-side preset information corresponding to that preset topic category.
Optionally, determining the preset topic category to which the speech signal to be recognized belongs is implemented as follows: the preset topic category is determined according to the type of client, or the application program, that collects the speech signal.
Optionally, the preset topic categories include: making a call or sending a text message; playing music; or setting instructions. The corresponding client-side preset information includes: contact names in the address book; song titles in the music library; or instructions in the instruction set.
Optionally, the composition operation comprises: performing the composition using a prediction-based method.
Optionally, the vocabulary used for pre-training the language model is consistent with the words contained in the pronunciation dictionary.
Optionally, calculating the probabilities that the feature vectors correspond to the basic units of the search space comprises: using a pre-trained DNN model to calculate the probabilities that a feature vector corresponds to each triphone state; and, from the probabilities that the feature vector corresponds to each triphone state, using a pre-trained HMM model to calculate the probabilities that the feature vector corresponds to each triphone.
Optionally, the execution speed of the step of using the pre-trained DNN model to calculate the probabilities that a feature vector corresponds to each triphone state is increased as follows: by exploiting the data-parallel processing capability provided by the hardware platform.
Optionally, extracting the feature vector sequence of the speech signal to be recognized comprises: framing the speech signal to be recognized according to a preset frame length to obtain multiple audio frames; and extracting a feature vector from each audio frame to obtain the feature vector sequence.
Optionally, extracting a feature vector from each audio frame comprises: extracting MFCC features, PLP features, or LPC features.
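As an illustration of the framing and feature extraction described above, the following is a minimal sketch assuming the librosa library; the 25 ms frame length, 10 ms frame shift, and 13 coefficients are illustrative values, not figures taken from this application.

```python
# Hedged sketch: frame the signal and extract one MFCC vector per frame.
import librosa

def extract_feature_sequence(wav_path, n_mfcc=13):
    signal, sr = librosa.load(wav_path, sr=16000)   # 16 kHz mono audio
    frame_len = int(0.025 * sr)                     # 25 ms frame length
    hop_len = int(0.010 * sr)                       # 10 ms frame shift
    # librosa frames the signal internally; each column is one frame's MFCCs
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop_len)
    return mfcc.T                                   # shape: (num_frames, n_mfcc)
```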
Optionally, after the word sequence corresponding to the feature vector sequence is obtained, the following operation is performed: the accuracy of the word sequence is verified by text matching against the client-side preset information, and the corresponding speech recognition result is generated according to the verification result.
Optionally, verifying the accuracy of the word sequence by text matching against the client-side preset information and obtaining the corresponding speech recognition result according to the verification result comprises: selecting from the word sequence the word to be verified that corresponds to the client-side preset information; looking up the word to be verified in the client-side preset information; if it is found, deciding that the accuracy verification is passed and taking the word sequence as the speech recognition result; otherwise, correcting the word sequence by pinyin-based fuzzy matching and taking the corrected word sequence as the speech recognition result.
Optionally, correcting the word sequence by pinyin-based fuzzy matching comprises: converting the word to be verified into a pinyin sequence to be verified; converting each word in the client-side preset information into a comparison pinyin sequence; computing in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and selecting from the client-side preset information the words ranked highest when sorted by similarity from high to low; and replacing the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.
Optionally, the similarity comprises: a similarity computed on the basis of edit distance.
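To illustrate this correction, the following sketch ranks candidate words by an edit-distance-based similarity between pinyin sequences. The to_pinyin converter is a hypothetical placeholder (a pinyin lexicon or library would supply it), and the length normalization is one common choice rather than something this application prescribes.

```python
# Hedged sketch of pinyin-based fuzzy matching; `to_pinyin` is a hypothetical
# placeholder for a word-to-pinyin converter returning syllable lists.
def edit_distance(a, b):
    """Levenshtein distance between two sequences of pinyin syllables."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def best_match(word_to_verify, preset_words, to_pinyin):
    target = to_pinyin(word_to_verify)               # pinyin sequence to verify
    def similarity(w):
        cand = to_pinyin(w)                          # comparison pinyin sequence
        d = edit_distance(target, cand)
        return 1.0 - d / max(len(target), len(cand)) # normalize to [0, 1]
    return max(preset_words, key=similarity)         # highest-ranked word
```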
Optionally, the method is implemented on a client device; the client device includes: a smart mobile terminal, a smart speaker, or a robot.
Correspondingly, this application also provides a speech recognition device, comprising: a search space generation unit, configured to use preset speech knowledge sources to generate a search space that incorporates client-side preset information and is used for decoding speech signals; a feature vector extraction unit, configured to extract the feature vector sequence of the speech signal to be recognized; a probability calculation unit, configured to calculate the probabilities that the feature vectors correspond to the basic units of the search space; and a decoding search unit, configured to take these probabilities as input and perform a decoding operation in the search space to obtain the word sequence corresponding to the feature vector sequence.
Optionally, the search space generation unit is specifically configured to add, by label replacement, client-side preset information corresponding to a preset topic category to a pre-generated weighted finite-state transducer built at least from the language model, and to obtain a single weighted finite-state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model. The language model is pre-generated by a language model training unit, which is configured to replace the preset named entities in the text used for training the language model with labels corresponding to the preset topic category, and to train the language model on that text.
Optionally, the search space generation unit comprises: a first client-side information adding subunit, configured to add, by label replacement, client-side preset information corresponding to the preset topic category to a pre-generated weighted finite-state transducer based on the language model; and a weighted finite-state transducer composition subunit, configured to compose the weighted finite-state transducer to which the client-side preset information has been added with a pre-generated weighted finite-state transducer based on the triphone state-tying table and the pronunciation dictionary, to obtain the single weighted finite-state transducer.
Optionally, the decoding space generation unit comprises: a second client-side information adding subunit, configured to add, by label replacement, client-side preset information corresponding to the preset topic category to a pre-generated weighted finite-state transducer built at least from the language model; and a unified weighted finite-state transducer acquisition subunit, configured to obtain, after the second client-side information adding subunit completes the adding operation, a single weighted finite-state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model. The second client-side information adding subunit comprises: a topic determination subunit, configured to determine the preset topic category to which the speech signal to be recognized belongs; a weighted finite-state transducer selection subunit, configured to select the pre-generated weighted finite-state transducer, built at least from a language model, that corresponds to the preset topic category; and a label replacement subunit, configured to add client-side preset information to the selected weighted finite-state transducer by replacing the corresponding labels with the client-side preset information corresponding to the preset topic category.
Optionally, the topic determination subunit is specifically configured to determine the preset topic category according to the type of client, or the application program, that collects the speech signal.
Optionally, the weighted finite-state transducer composition subunit is specifically configured to perform the composition operation using a prediction-based method and obtain the single weighted finite-state transducer.
Optionally, the probability calculation unit comprises: a triphone-state probability calculation subunit, configured to use a pre-trained DNN model to calculate the probabilities that a feature vector corresponds to each triphone state; and a triphone probability calculation subunit, configured to use a pre-trained HMM model to calculate, from the probabilities that the feature vector corresponds to each triphone state, the probabilities that the feature vector corresponds to each triphone.
Optionally, the feature vector extraction unit comprises: a framing subunit, configured to frame the speech signal to be recognized according to a preset frame length to obtain multiple audio frames; and a feature extraction subunit, configured to extract a feature vector from each audio frame to obtain the feature vector sequence.
Optionally, the device comprises: an accuracy verification unit, configured to verify, after the decoding search unit obtains the word sequence corresponding to the feature vector sequence, the accuracy of the word sequence by text matching against the client-side preset information, and to generate the corresponding speech recognition result according to the verification result.
Optionally, the accuracy verification unit comprises: a to-be-verified word selection subunit, configured to select from the word sequence the word to be verified that corresponds to the client-side preset information; a lookup subunit, configured to look up the word to be verified in the client-side preset information; a recognition result confirmation subunit, configured to decide, after the lookup subunit finds the word to be verified, that the accuracy verification is passed and to take the word sequence as the speech recognition result; and a recognition result correction subunit, configured to correct the word sequence by pinyin-based fuzzy matching when the lookup subunit does not find the word to be verified, and to take the corrected word sequence as the speech recognition result.
Optionally, the recognition result correction subunit comprises: a to-be-verified pinyin sequence conversion subunit, configured to convert the word to be verified into a pinyin sequence to be verified; a comparison pinyin sequence conversion subunit, configured to convert each word in the client-side preset information into a comparison pinyin sequence; a similarity calculation and selection subunit, configured to compute in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and to select from the client-side preset information the words ranked highest when sorted by similarity from high to low; and a to-be-verified word replacement subunit, configured to replace the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.
In addition, this application also provides another speech recognition method, comprising: obtaining, by decoding, the word sequence corresponding to the speech signal to be recognized; verifying the accuracy of the word sequence by text matching against client-side preset information; and generating the corresponding speech recognition result according to the verification result.
Optionally, verifying the accuracy of the word sequence by text matching against client-side preset information and generating the corresponding speech recognition result according to the verification result comprises: selecting from the word sequence the word to be verified that corresponds to the client-side preset information; looking up the word to be verified in the client-side preset information; if it is found, deciding that the accuracy verification is passed and taking the word sequence as the speech recognition result; otherwise, correcting the word sequence by pinyin-based fuzzy matching and taking the corrected word sequence as the speech recognition result.
Optionally, correcting the word sequence by pinyin-based fuzzy matching comprises: converting the word to be verified into a pinyin sequence to be verified; converting each word in the client-side preset information into a comparison pinyin sequence; computing in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and selecting from the client-side preset information the words ranked highest when sorted by similarity from high to low; and replacing the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.
Correspondingly, this application also provides another speech recognition device, comprising: a word sequence acquisition unit, configured to obtain, by decoding, the word sequence corresponding to the speech signal to be recognized; and a word sequence verification unit, configured to verify the accuracy of the word sequence by text matching against client-side preset information and to generate the corresponding speech recognition result according to the verification result.
Optionally, the word sequence verification unit comprises: a to-be-verified word selection subunit, configured to select from the word sequence the word to be verified that corresponds to the client-side preset information; a lookup subunit, configured to look up the word to be verified in the client-side preset information; a recognition result confirmation subunit, configured to decide, after the lookup subunit finds the word to be verified, that the accuracy verification is passed and to take the word sequence as the speech recognition result; and a recognition result correction subunit, configured to correct the word sequence by pinyin-based fuzzy matching when the lookup subunit does not find the word to be verified, and to take the corrected word sequence as the speech recognition result.
Optionally, the recognition result correction subunit comprises: a to-be-verified pinyin sequence conversion subunit, configured to convert the word to be verified into a pinyin sequence to be verified; a comparison pinyin sequence conversion subunit, configured to convert each word in the client-side preset information into a comparison pinyin sequence; a similarity calculation and selection subunit, configured to compute in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and to select from the client-side preset information the words ranked highest when sorted by similarity from high to low; and a to-be-verified word replacement subunit, configured to replace the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.
Compared with the prior art, this application has the following advantages. In the speech recognition method provided by this application, on the basis of using preset speech knowledge sources to generate a search space that incorporates client-side preset information and is used for decoding speech signals, the probabilities that the feature vectors extracted from the speech signal to be recognized correspond to the basic units of the search space are calculated, and a decoding operation is performed in the search space according to those probabilities, yielding the word sequence corresponding to the speech signal to be recognized. Because client-side preset information is incorporated when the search space used for decoding is generated, the above method can recognize client-related information relatively accurately when recognizing speech signals collected on the client, thereby improving the accuracy of speech recognition and the user experience.
901: search space generation unit
902: feature vector extraction unit
903: probability calculation unit
904: decoding search unit
1101: word sequence acquisition unit
1102: word sequence verification unit
Figure 1 is a flowchart of an embodiment of a speech recognition method of this application;
Figure 2 is a flowchart of the process, provided by an embodiment of this application, of generating a search space that incorporates client-side preset information and is used for decoding speech signals;
Figure 3 is a schematic diagram, provided by an embodiment of this application, of a G-structure WFST before the replacement operation is performed;
Figure 4 is a schematic diagram, provided by an embodiment of this application, of the G-structure WFST after the replacement operation is performed;
Figure 5 is a flowchart of the process, provided by an embodiment of this application, of extracting the feature vector sequence of the speech signal to be recognized;
Figure 6 is a flowchart of the process, provided by an embodiment of this application, of calculating the probabilities that a feature vector corresponds to each triphone;
Figure 7 is a flowchart of the process, provided by an embodiment of this application, of verifying the accuracy of the word sequence by text matching and generating the corresponding speech recognition result according to the verification result;
Figure 8 is an overall framework diagram of speech recognition provided by an embodiment of this application;
Figure 9 is a schematic diagram of an embodiment of a speech recognition device of this application;
Figure 10 is a flowchart of another embodiment of a speech recognition method of this application;
Figure 11 is a schematic diagram of an embodiment of another speech recognition device of this application.
Many specific details are set forth in the following description to facilitate a full understanding of this application. However, this application can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from the substance of this application; therefore, this application is not limited by the specific implementations disclosed below.
This application provides a speech recognition method and corresponding device, as well as another speech recognition method and corresponding device, each described in detail in the embodiments below. To aid understanding, before the embodiments are described, the technical solution of this application, related technical terms, and the organization of the embodiments are briefly explained.
The speech recognition method provided by this application is typically applied in applications that use speech as the human-computer interaction medium. Such applications recognize the collected speech signal as text and then perform the corresponding operation according to the text; the speech signal usually involves preset information local to the client (for example, contact names in the address book). Existing speech recognition technology decodes such signals with a generic search space, which does not account for the differences in such applications across clients and therefore usually cannot effectively recognize content related to local client information, resulting in low recognition accuracy. To address this problem, the technical solution of this application incorporates client-side preset information into the process of constructing the search space used to decode speech signals, which amounts to customizing the search space for the client's specific recognition needs, so that local client-related information can be recognized effectively and recognition accuracy improved.
In a speech recognition system, the process of obtaining the word sequence that best matches the speech signal to be recognized is called decoding. The search space for decoding speech signals described in this application refers to the space of all possible speech recognition results covered by the speech knowledge sources involved in the speech recognition system (for example, the acoustic model, the pronunciation dictionary, and the language model). Correspondingly, decoding is the process of searching and matching in the search space for the speech signal to be recognized and obtaining the best-matching word sequence.
The search space can take various forms. It can consist of relatively independent search spaces, one per knowledge source at its own level, in which case decoding computes and searches layer by layer; or it can be based on a weighted finite-state transducer (WFST), organically integrating the various knowledge sources into a unified WFST network (also called a WFST search space). Because the latter makes it easy to introduce different knowledge sources and can improve search efficiency, it is the preferred way for the technical solution of this application to perform speech recognition, so the embodiments provided in this application focus on the WFST-network-based implementation.
The core of the WFST search space is to use weighted finite-state transducers to model the grammatical structure of the language and the related acoustic properties. The concrete procedure is: represent the knowledge sources at different levels as individual WFSTs, then use the properties of WFSTs and composition algorithms to integrate these WFSTs into a single WFST network, which constitutes the search space for speech recognition.
The basic unit of the WFST network (that is, the basic input unit that drives the WFST through its state transitions) can be chosen according to specific requirements. Considering the influence of phonetic context on phoneme pronunciation, and in order to obtain higher recognition accuracy, the embodiments provided in this application use context-dependent triphones (triphones for short) as the basic unit of the WFST network; the corresponding knowledge sources for constructing the WFST search space are the triphone state-tying table, the pronunciation dictionary, and the language model.
The triphone state-tying table generally records the tying relations among triphones based on their pronunciation characteristics. When training an acoustic model with triphones as the modeling unit, the number of possible triphone combinations is very large; to reduce the demand on training data, different triphones can be clustered under the maximum likelihood criterion using decision-tree clustering based on pronunciation characteristics, and triphones with the same pronunciation characteristics are tied together so that they share parameters, yielding the triphone state-tying table. The pronunciation dictionary generally contains the correspondence between phonemes and words; it is the bridge between the acoustic layer (physical layer) and the semantic layer, coupling the contents of the two layers together. The language model provides knowledge of linguistic structure and is used to compute the probability of a word sequence occurring in natural language; in concrete implementations an n-gram language model is generally used, built by estimating from statistics the likelihood of words following one another.
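For instance, under a bigram (n = 2) model the probability of a word sequence factorizes into conditional probabilities estimated from corpus counts. The following minimal sketch shows the maximum-likelihood estimation on a toy corpus; a production model would add smoothing, which this application does not detail.

```python
# Minimal bigram language model estimation from a toy corpus; real systems
# would add smoothing (e.g., Kneser-Ney) and a proper vocabulary.
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

probs = train_bigram([["call", "Zhang", "San"], ["call", "Li", "Si"]])
print(probs[("call", "Zhang")])   # 0.5
```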
When a WFST network constructed from the above knowledge sources is used for speech recognition, in order to drive the WFST through the required search, the feature vector sequence of the speech signal to be recognized can first be extracted; pre-trained models are then used to compute the probabilities that the feature vectors correspond to each triphone, and based on those probabilities a decoding operation is performed in the WFST search space to obtain the word sequence corresponding to the speech signal to be recognized.
It should be noted that the embodiments provided in this application use context-dependent triphones as the basic unit of the WFST network; other implementations may use other speech units as the basic unit, for example monophones or triphone states. Different basic units lead to certain differences in how the search space is constructed and how probabilities are computed from the feature vectors: if triphone states are the basic unit, for instance, the HMM-based acoustic model can be folded into the WFST network when it is constructed, and during recognition the probabilities that the feature vectors correspond to each triphone state can be computed. These are all variations of the specific implementation; as long as client-side preset information is incorporated into the process of constructing the search space, the technical solution of this application can still be realized without departing from its technical core, and such variations fall within the scope of protection of this application.
The embodiments of this application are described in detail below. Please refer to Figure 1, a flowchart of an embodiment of a speech recognition method of this application. The method comprises steps 101 to 104. In a concrete implementation, to improve execution efficiency, the related preparatory work can usually be completed before step 101 (this stage may also be called the preparation stage): generating the class-based language models, the WFSTs of preset structures, the acoustic models used for speech recognition, and so on, in readiness for step 101. The preparation stage is described in detail first.
In the preparation stage, the language model can be trained as follows: the preset named entities in the text used for training the language model are replaced with labels corresponding to the preset topic category, and the language model is trained on that text. A named entity generally refers to an entity of a specific category in text, for example a person's name, a song title, an organization name, or a place name.
A phone-dialing application serves as an example. The preset topic category is making a call, the corresponding label is "$CONTACT", and the preset named entity is a person's name. When pre-training the language model, the person's names in the training text can be replaced with the corresponding label: for instance, "Xiaoming" in "I want to call Xiaoming" is replaced with "$CONTACT", giving the training text "I want to call $CONTACT". Training the language model on text after this entity replacement yields a class-based language model. On the basis of the trained language model, a WFST based on the language model, hereinafter called a G-structure WFST, can also be generated in advance.
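A minimal sketch of this preprocessing is shown below; the entity lists and plain string matching are illustrative assumptions, since a real system might detect named entities with a recognizer rather than a fixed list.

```python
# Illustrative preprocessing for class-based LM training: replace known
# named entities with class labels before training.
ENTITY_LABELS = {                      # hypothetical class definitions
    "$CONTACT": ["Xiaoming", "Zhang San", "Li Si"],
    "$SONG": ["March of the Volunteers"],
}

def relabel(sentence):
    for label, entities in ENTITY_LABELS.items():
        for entity in entities:
            sentence = sentence.replace(entity, label)
    return sentence

print(relabel("I want to call Xiaoming"))   # -> "I want to call $CONTACT"
```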
Preferably, in order to reduce the size of the language model and of the corresponding G-structure WFST, the language model can be pre-trained on text targeted at the preset topic category (also called class-based training text). For example, if the preset topic category is making a call, text for that category might include: "I want to call Xiaoming", "Give Xiaoming a call", and so on.
Considering the diversity of client devices and of applications that use speech as the human-computer interaction medium, two or more topic categories can be preset, and for each topic category a class-based language model is pre-trained and a G-structure WFST based on that language model is constructed.
In the preparation stage, a WFST based on the pronunciation dictionary (an L-structure WFST for short) and a WFST based on the triphone state-tying table (a C-structure WFST for short) can also be constructed in advance, and these WFSTs can be composed selectively and purposefully in preset ways. For example, the C-structure and L-structure WFSTs can be composed into a CL-structure WFST, the L-structure and G-structure WFSTs can be composed into an LG-structure WFST, or the C-structure, L-structure, and G-structure WFSTs can be composed into a CLG-structure WFST. In this embodiment, a CL-structure WFST and a G-structure WFST are generated in the preparation stage (for the composition operation, see the relevant text under step 101).
In addition, the acoustic model used for speech recognition can be pre-trained in the preparation stage. In this embodiment, each triphone is represented by an HMM (Hidden Markov Model) whose hidden states are the states of the triphone (each triphone typically comprises three states), and a GMM (Gaussian Mixture Model) determines the emission probability with which each hidden state of the HMM outputs each feature vector. Taking feature vectors extracted from a large amount of speech data as training samples, the parameters of the GMM and HMM models are learned with the Baum-Welch algorithm, yielding a GMM for each state and an HMM for each triphone. In the subsequent step 103, the pre-trained GMM and HMM models could then be used to compute the probabilities that the feature vectors correspond to each triphone.
To improve recognition accuracy, this embodiment replaces the GMM with a DNN (Deep Neural Network) model when performing speech recognition. Correspondingly, a DNN model that outputs, for an input feature vector, the probabilities of each triphone state can be pre-trained in the preparation stage. In a concrete implementation, on the basis of the trained GMM and HMM models, labels corresponding to the triphone states can be attached to the training samples by forced alignment, and the labeled training samples are then used to train the DNN model.
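The posterior-to-likelihood conversion below is the standard hybrid DNN-HMM recipe and is an assumption here, since this application does not spell out the scoring formula: the DNN's state posteriors are divided by the state priors to obtain scaled likelihoods used as HMM emission scores.

```python
# Hedged sketch of hybrid DNN-HMM scoring: turn DNN state posteriors into
# scaled likelihoods P(x|s) proportional to P(s|x) / P(s), in the log domain.
# `dnn_forward` and `log_state_priors` are placeholders for the trained model.
import numpy as np

def scaled_log_likelihoods(features, dnn_forward, log_state_priors):
    """features: (num_frames, dim); returns (num_frames, num_states)."""
    logits = dnn_forward(features)                       # DNN output layer
    log_post = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    return log_post - log_state_priors                   # subtract log prior
```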
It should be noted that, in a concrete implementation, because the computation in the preparation stage is heavy and demands relatively large memory and high computing speed, the operations of the preparation stage are usually completed on the server side. So that the speech recognition function can still be performed without network access, the method provided by this application is usually implemented on the client device; the WFSTs generated in the preparation stage and the models used for acoustic probability computation can therefore be installed on the client device in advance, for example packaged together with the application and installed on the client.
This concludes the detailed description of the preparation stage of this embodiment; steps 101 to 104 of this embodiment are described in detail below.
Step 101: use preset speech knowledge sources to generate a search space that incorporates client-side preset information and is used for decoding speech signals.
This step constructs the WFST search space in preparation for the subsequent speech recognition. In a concrete implementation, this step is usually executed in the startup stage (also called the initialization stage) of the client application that uses speech as the human-computer interaction medium: client-side preset information corresponding to the preset topic category is added, by label replacement, to a pre-generated weighted finite-state transducer built at least from the language model, and a single weighted finite-state transducer based on the triphone state-tying table, the pronunciation dictionary, and the language model is obtained.
The processing of this step may comprise the following steps 101-1 to 101-4, further explained below with reference to Figure 2.
Step 101-1: determine the preset topic category to which the speech signal to be recognized belongs.
In a concrete implementation, the preset topic category can be determined according to the type of client, or the application program, that collects the speech signal. The preset topic categories include: making a call, sending a text message, playing music, setting instructions, or topic categories related to other application scenarios. The client-side preset information corresponding to making a call or sending a text message includes contact names in the address book; the client-side preset information corresponding to playing music includes song titles in the music library; the client-side preset information corresponding to setting instructions includes instructions in the instruction set. Topic categories for other application scenarios can likewise correspond to the client-side preset information involved in those scenarios, which is not enumerated further here.
For example, for a smartphone, the preset topic category of the speech signal to be recognized can be determined from the client type to be making a call or sending a text message, with the corresponding client-side preset information being the contact names in the address book; for a smart speaker, the topic category can be determined to be playing music, with the corresponding client-side preset information being the song titles in the music library; for a robot, the topic category can be determined to be setting instructions, with the corresponding client-side preset information being the instructions in the instruction set.
Considering that a client device may simultaneously host multiple applications that use speech as the human-computer interaction medium, and that different applications involve different client-side preset information (for example, a smartphone may also have a voice-interactive music player installed), in such cases the preset topic category to which the speech signal to be recognized belongs can be determined according to the currently launched application.
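One possible realization of this determination is a simple lookup keyed first by the active application and then by the client type; the category names and mappings below are illustrative assumptions, not part of this application.

```python
# Illustrative topic-category dispatch; categories and mappings are assumed.
DEVICE_TOPICS = {"smartphone": "call_or_sms", "smart_speaker": "play_music",
                 "robot": "set_instruction"}
APP_TOPICS = {"dialer": "call_or_sms", "music_player": "play_music"}

def topic_category(device_type, active_app=None):
    if active_app in APP_TOPICS:          # the running app takes precedence
        return APP_TOPICS[active_app]
    return DEVICE_TOPICS[device_type]     # fall back to the client type

print(topic_category("smartphone", "music_player"))  # -> play_music
```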
Step 101-2: Select the pre-generated G-structure WFST corresponding to the preset topic category.
When there are multiple preset topic categories, multiple G-structure WFSTs are usually generated during the preparation stage, each corresponding to a different preset topic category. This step selects, from the multiple pre-generated G-structure WFSTs, the one corresponding to the preset topic category determined in step 101-1.
Step 101-3: Add the client preset information to the selected G-structure WFST by replacing the corresponding tags with the client preset information corresponding to the preset topic category.
When the class-based language model was trained for each preset topic category during the preparation stage, the preset named entities in the training text were replaced with the tag corresponding to that topic category. For example, if the topic category is making a call or sending a text message, person names in the training text are replaced with the "$CONTACT" tag; if the topic category is playing music, song titles in the training text are replaced with the "$SONG" tag. The generated G-structure WFST therefore usually contains the tag information corresponding to the preset topic category. This step replaces the corresponding tags in the G-structure WFST selected in step 101-2 with the client preset information corresponding to the preset topic category determined in step 101-1, thereby adding the client preset information to the selected G-structure WFST.
For example, if the topic category is making a call or sending a text message, the "$CONTACT" tag in the G-structure WFST can be replaced with the contact names in the client's local address book, such as "張三" (Zhang San) and "李四" (Li Si); if the topic category is playing music, the "$SONG" tag can be replaced with the song titles in the client's local music library, such as "義勇軍進行曲" (March of the Volunteers). The replacement itself can be implemented by replacing the state-transition link corresponding to the tag with several groups of parallel state-transition links. FIG. 3 and FIG. 4 give an example of replacement with contacts from the client address book: FIG. 3 is a schematic diagram of the G-structure WFST before replacement, and FIG. 4 is a schematic diagram of the G-structure WFST after replacement with "張三" and "李四" from the address book.
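The sketch below illustrates the parallel-link replacement just described. The arc representation (src, dst, ilabel, olabel, weight) is an illustrative simplification of a real WFST, the uniform weight split is an assumption, and multi-word entries would normally expand into chains of arcs rather than single arcs.

```python
# Sketch of step 101-3: replace every arc carrying a tag with parallel arcs,
# one per entry of the client preset information. Arc tuples and the uniform
# weight split are illustrative assumptions, not the patented representation.
import math

def replace_tag(arcs, tag, entries):
    """Replace each arc labeled `tag` with parallel arcs sharing its endpoints."""
    out = []
    for (src, dst, ilabel, olabel, weight) in arcs:
        if olabel == tag:
            # Split probability mass uniformly across the parallel links
            # (negative-log weights, so the split adds log(len(entries))).
            extra = math.log(len(entries))
            for entry in entries:
                out.append((src, dst, entry, entry, weight + extra))
        else:
            out.append((src, dst, ilabel, olabel, weight))
    return out

g_arcs = [(0, 1, "call", "call", 0.0), (1, 2, "$CONTACT", "$CONTACT", 0.0)]
print(replace_tag(g_arcs, "$CONTACT", ["Zhang San", "Li Si"]))
```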
Step 101-4: Compose the G-structure WFST to which the client preset information has been added with the pre-generated CL-structure WFST to obtain a single WFST network.
In this embodiment, the knowledge sources used for speech recognition range from the language layer (the language model) down to the physical layer (the triphone state-tying field table). The task of this step is to compose (also called combining or merging) the WFSTs at different levels into a single WFST network.
For two WFSTs, the basic condition for composition is that the output symbols of one WFST are a subset of the input symbol set of the other. Under this premise, if two WFSTs, say A and B, are composed into a new WFST C, then every state of C consists of a state of A and a state of B, and every successful path P of C consists of a successful path Pa of A and a successful path Pb of B, with input i[P] = i[Pa] and output o[P] = o[Pb]; its weight is obtained by the corresponding operation on the weights of Pa and Pb. The resulting C combines the finite state transducer properties shared by A and B and their joint search space. In a specific implementation, the composition algorithm provided by the OpenFst library can be used to compose the two WFSTs.
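A toy composition of two epsilon-free transducers follows, illustrating only the pairing rule described above; a real implementation would use OpenFst. Arcs are (src, dst, ilabel, olabel, weight) tuples in the tropical semiring, an assumption made so that path weights combine by addition.

```python
# Toy WFST composition: states of C are (state-of-A, state-of-B) pairs,
# and an arc of C is built whenever an A-arc's output label equals a
# B-arc's input label. Weights are negative log probabilities (added).
def compose(arcs_a, arcs_b):
    out = []
    for (sa, da, ia, oa, wa) in arcs_a:
        for (sb, db, ib, ob, wb) in arcs_b:
            if oa == ib:  # A's output feeds B's input
                out.append(((sa, sb), (da, db), ia, ob, wa + wb))
    return out

# A maps a phone-like unit to a word; B rescores the word (G-like acceptor).
a = [(0, 1, "ph1", "wordA", 0.5)]
b = [(0, 1, "wordA", "wordA", 1.2)]
print(compose(a, b))  # [((0, 0), (1, 1), 'ph1', 'wordA', 1.7)]
```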
Specific to this embodiment, this can be understood as follows: the L-structure WFST can be regarded as the correspondence between monophones and words, while the C-structure WFST establishes the correspondence between triphones and monophones, so its output matches the input of the L-structure WFST and the two can be composed. The CL-structure WFST has already been obtained by composition in the preparation stage of this embodiment. In this step, the CL-structure WFST is composed with the G-structure WFST to which the client preset information was added in step 101-3, yielding a WFST network whose input is triphone probabilities and whose output is word sequences. WFSTs at different levels, each corresponding to a different knowledge source, are thus integrated into a single WFST network that constitutes the search space used for speech recognition.
Preferably, in order to speed up the composition of the CL-structure WFST and the G-structure WFST and reduce the time consumed by initialization, this embodiment does not use the conventional WFST composition method when performing the composition but a prediction-based (lookahead) composition method. In lookahead composition, during the composition of the two WFSTs, the future path is predicted in order to judge whether the composition operation currently being performed would lead to a non-coaccessible final state; if so, the current operation is blocked and no subsequent composition operations are performed along it. Prediction allows unnecessary composition operations to be terminated early, which not only saves composition time but also reduces the size of the resulting WFST and its storage footprint. In a specific implementation, the lookahead-capable filter provided by the OpenFst library can be used to realize this predictive filtering.
Preferably, also to speed up the composition of the CL-structure WFST and the G-structure WFST, the vocabulary used in pre-training the language model in this embodiment is kept consistent with the words contained in the pronunciation dictionary. Generally, the number of words in a vocabulary is larger than the number of words in a pronunciation dictionary, and the vocabulary size directly determines the size of the G-structure WFST; if the G-structure WFST is large, composing it with the CL-structure WFST is time-consuming. Therefore, when training the language model in the preparation stage, this embodiment reduces the vocabulary so that its words coincide with the words in the pronunciation dictionary, thereby shortening the time needed to compose the CL-structure and G-structure WFSTs.
At this point, through steps 101-1 to 101-4, the initialization of the present technical solution is complete, and a WFST search space containing the client preset information has been generated.
It should be noted that in this embodiment the composition of the CL-structure WFST is completed in advance in the preparation stage and the G-structure WFST is also generated there, while in step 101 the client preset information is added to the G-structure WFST and the CL structure and G structure are composed into a single WFST. Other embodiments may adopt other composition strategies. For example, the composition of an LG-structure WFST can be completed in advance in the preparation stage, the client preset information added to that WFST in step 101, and the result then composed with the C-structure WFST generated in the preparation stage; alternatively, the composition of a CLG-structure WFST can be completed directly in the preparation stage and the client preset information added to it in step 101. Considering that the WFSTs generated in the preparation stage occupy storage space on the client, in application scenarios with multiple class-based language models (and correspondingly multiple G-structure WFSTs), composing every G-structure WFST with the other WFSTs in the preparation stage would occupy considerable storage. The composition scheme adopted in this embodiment is therefore the preferred one, since it reduces the client storage occupied by the WFSTs generated in the preparation stage.
Step 102: Extract the feature vector sequence of the speech signal to be recognized.
The speech signal to be recognized is usually a time-domain signal. This step obtains a feature vector sequence that characterizes the speech signal through two processes, framing and feature vector extraction, which are further described below with reference to FIG. 5.
Step 102-1: Frame the speech signal to be recognized according to a preset frame length to obtain multiple audio frames.
In a specific implementation, the frame length can be preset as required, for example to 10 ms or 15 ms, and the speech signal to be recognized is then segmented frame by frame according to that length, splitting it into multiple audio frames. Depending on the segmentation strategy adopted, adjacent audio frames may or may not overlap.
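A minimal framing sketch for step 102-1 follows; the frame length and hop size are illustrative, and a hop smaller than the frame length produces the overlapping case mentioned above.

```python
# Slice a waveform into fixed-length, optionally overlapping frames.
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 10.0, hop_ms: float = 10.0) -> np.ndarray:
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)  # hop < frame_len => overlap
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

sr = 16000
frames = frame_signal(np.random.randn(sr), sr, frame_ms=10.0, hop_ms=10.0)
print(frames.shape)  # (100, 160): one hundred 10 ms frames at 16 kHz
```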
Step 102-2: Extract the feature vector of each audio frame to obtain the feature vector sequence.
After the speech signal to be recognized has been split into multiple audio frames, feature vectors that characterize the speech signal can be extracted frame by frame. Because the descriptive power of a speech signal in the time domain is relatively weak, a Fourier transform is usually applied to each audio frame and frequency-domain features are extracted as the frame's feature vector. For example, MFCC (Mel Frequency Cepstrum Coefficient) features, PLP (Perceptual Linear Predictive) features, or LPC (Linear Predictive Coding) features can be extracted.
Taking the extraction of MFCC features for one audio frame as an example, the feature vector extraction process is as follows. The time-domain signal of the audio frame is first passed through an FFT (Fast Fourier Transform) to obtain the corresponding spectral information; the spectral information is passed through a Mel filter bank to obtain the Mel spectrum; cepstral analysis is then performed on the Mel spectrum, the core of which is generally an inverse transform using the DCT (Discrete Cosine Transform); finally a preset number N of coefficients is taken (for example, N = 12 or 38), giving the feature vector of the audio frame, i.e., its MFCC features. Processing every audio frame in this way yields a series of feature vectors characterizing the speech signal, namely the feature vector sequence described in this application.
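The per-frame sketch below mirrors the pipeline just described (FFT, Mel filter bank, log, DCT, keep N coefficients). The triangular filter-bank construction is a standard textbook recipe simplified for illustration; a production system would use a tuned implementation such as librosa or Kaldi.

```python
# Per-frame MFCC sketch: FFT -> Mel filter bank -> log -> DCT -> N coeffs.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=12, n_fft=512):
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2           # power spectrum
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):                                # triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energy = np.log(fbank @ power + 1e-10)                # Mel spectrum
    return dct(mel_energy, type=2, norm='ortho')[:n_coeffs]   # cepstral coeffs

sr = 16000
print(mfcc_frame(np.random.randn(160), sr).shape)  # (12,)
```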
Step 103: Compute the probability that the feature vectors correspond to the basic units of the search space.
In this embodiment, the basic unit of the WFST search space is the triphone, so this step computes the probability that a feature vector corresponds to each triphone. To improve the accuracy of speech recognition, this embodiment computes the probabilities with an HMM model together with a DNN model, which has strong feature extraction capability. Other implementations may use other approaches; for example, computing the probabilities with traditional GMM and HMM models can likewise realize the technical solution of this application and also falls within its scope of protection.
In a specific implementation, the probability that a feature vector corresponds to each triphone can be computed on the basis of the probability that the feature vector corresponds to each triphone state. The processing of this step is further described below with reference to FIG. 6.
Step 103-1: Use the pre-trained DNN model to compute the probability that a feature vector corresponds to each triphone state.
The DNN model has been trained in advance in the preparation stage of this embodiment. This step takes the feature vector extracted in step 102 as the input of the DNN model and obtains the probability that the feature vector corresponds to each triphone state. For example, if there are 1000 triphones and each triphone contains 3 states, there are 3000 triphone states in total; the DNN model in this step outputs the probability that the feature vector corresponds to each of the 3000 triphone states.
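The sketch below shows the shape of such a forward pass: a feature vector in, a posterior over all triphone states out. The layer sizes, input dimension, and random weights are placeholders; a deployed model would be trained and sized to the client hardware.

```python
# Step 103-1 sketch: feed-forward DNN producing a softmax over 3000
# triphone states (1000 triphones x 3 states). Weights are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
N_STATES = 3000
layers = [(rng.standard_normal((40, 256)) * 0.05, np.zeros(256)),
          (rng.standard_normal((256, 256)) * 0.05, np.zeros(256)),
          (rng.standard_normal((256, N_STATES)) * 0.05, np.zeros(N_STATES))]

def dnn_state_posteriors(feat: np.ndarray) -> np.ndarray:
    h = feat
    for w, b in layers[:-1]:
        h = np.maximum(h @ w + b, 0.0)          # ReLU hidden layers
    w, b = layers[-1]
    logits = h @ w + b
    e = np.exp(logits - logits.max())           # softmax over all states
    return e / e.sum()

post = dnn_state_posteriors(rng.standard_normal(40))
print(post.shape, round(post.sum(), 6))  # (3000,) 1.0
```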
Preferably, since the amount of computation involved in the DNN model is usually large, this embodiment speeds up the DNN computation by exploiting the data-parallel processing capability provided by the hardware platform. For example, embedded and mobile devices currently mostly use ARM architecture platforms, and most current ARM platforms provide the NEON instruction set with SIMD (single instruction, multiple data) support, which can process multiple data items in a single instruction and thus offers a degree of data parallelism. In this embodiment, vectorized programming is used to form a single-instruction-stream, multiple-data-stream programming pattern, so that the data-parallel processing capability of the hardware platform can be fully exploited to accelerate the DNN computation.
When the technical solution of this application is implemented on a client device, the scale of the DNN model is usually reduced so as to match the hardware capability of the client, which often lowers the accuracy of the DNN model and, with it, the ability to recognize different speech content. Because this embodiment exploits the hardware acceleration mechanism, the DNN model need not be shrunk, or need be shrunk only minimally, so the accuracy of the DNN model is preserved to the greatest extent and the recognition accuracy is improved.
Step 103-2: Based on the probability that the feature vectors correspond to each triphone state, use the pre-trained HMM model to compute the probability that the feature vectors correspond to each triphone.
An HMM model for each triphone has already been trained in the preparation stage. In this step, from the probabilities that several consecutively input feature vectors correspond to each triphone state, the HMM model is used to compute the transition probability corresponding to each triphone, thereby obtaining the probability that the feature vectors correspond to each triphone.
This computation is in effect the process of computing the corresponding transition probabilities from the propagation of consecutive feature vectors through each HMM. Taking the computation of the probability for a particular triphone (comprising 3 states) as an example, where pe(i,j) denotes the emission probability of the i-th frame's feature vector in the j-th state and pt(h,k) denotes the transition probability from state h to state k:
1) The feature vector of the first frame corresponds to state 1 of the corresponding HMM, with emission probability pe(1,1);
2) For the feature vector of the second frame, staying in state 1 of the HMM has probability pe(1,1)*pt(1,1)*pe(2,1), while moving from state 1 to state 2 has probability pe(1,1)*pt(1,2)*pe(2,2); which transition is taken is decided by comparing these probabilities;
3) The feature vectors of the third and subsequent frames are handled in the same way, until the model transitions out of state 3, at which point propagation through the HMM ends. This yields the probability of the consecutive frames' feature vectors for this HMM, i.e., the probability corresponding to the triphone that the HMM represents.
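The following greedy sketch walks three frames through one 3-state left-to-right triphone HMM; emit[i][j] plays the role of pe(i+1, j+1) and trans[h][k] of pt(h+1, k+1). All numbers are made up, and with more frames the walk would continue until the model is exited from state 3.

```python
# Greedy propagation through one 3-state triphone HMM (step 103-2 sketch).
import numpy as np

trans = np.array([[0.6, 0.3, 0.0],   # pt: stay or advance one state
                  [0.0, 0.6, 0.3],
                  [0.0, 0.0, 0.6]])  # unassigned mass exits the HMM
emit = np.array([[0.90, 0.05, 0.05], # pe: per-frame state emissions
                 [0.40, 0.55, 0.05],
                 [0.05, 0.35, 0.60]])

def propagate(emit, trans):
    state, prob = 0, emit[0, 0]                # frame 1 enters state 1
    for i in range(1, len(emit)):              # frames 2, 3, ...
        stay = prob * trans[state, state] * emit[i, state]
        advance = 0.0
        if state + 1 < trans.shape[0]:
            advance = prob * trans[state, state + 1] * emit[i, state + 1]
        prob, state = max((stay, state), (advance, state + 1))
    return prob * (1.0 - trans[state].sum())   # weight of leaving the model

print(propagate(emit, trans))  # probability of this triphone for 3 frames
```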
For the continuously input feature vectors, the transition probabilities of propagation through each HMM are computed in the above manner, yielding the probability corresponding to each triphone.
Step 104: Taking the probabilities as input, perform a decoding operation in the search space to obtain the word sequence corresponding to the feature vector sequence.
Based on the probabilities, output in step 103, that the feature vectors correspond to each triphone, a decoding operation is performed in the WFST network to obtain the word sequence corresponding to the feature vector sequence. This process is usually a graph search that finds the highest-scoring path. A commonly used search method is the Viterbi algorithm, whose advantages are that its dynamic programming saves computation and that it supports time-synchronous decoding.
Considering that in actual decoding the search space is huge and the Viterbi algorithm still requires a large amount of computation, the decoding process does not expand all possible successor paths of every path; to reduce computation and increase speed, only the paths near the best path are expanded. That is, an appropriate pruning strategy can be applied during the Viterbi search to improve search efficiency, for example Viterbi beam search or histogram pruning.
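A minimal beam-pruned token-passing sketch of this idea follows. The toy graph in `arcs` and the constant `unit_cost` function are stand-ins: in a real decoder the graph is the composed WFST network and the per-unit costs come from the acoustic scores of step 103.

```python
# Beam-pruned token passing over a toy decoding graph (step 104 sketch).
# Costs are negative log probabilities, so lower is better.
import heapq

def beam_decode(arcs, unit_cost, n_frames, start=0, beam_width=4):
    beam = [(0.0, start, [])]                  # (cost, state, words)
    for t in range(n_frames):
        expanded = []
        for cost, state, words in beam:
            for nxt, unit, word, arc_cost in arcs.get(state, []):
                new_words = words + [word] if word else words
                expanded.append((cost + arc_cost + unit_cost(t, unit),
                                 nxt, new_words))
        # Keep only the `beam_width` cheapest hypotheses (beam pruning).
        beam = heapq.nsmallest(beam_width, expanded, key=lambda h: h[0])
    return min(beam, key=lambda h: h[0])

arcs = {0: [(1, "t1", None, 0.1), (2, "t2", None, 0.3)],
        1: [(1, "t1", None, 0.2), (3, "t3", "hello", 0.1)],
        2: [(3, "t3", "hi", 0.2)],
        3: [(3, "t4", None, 0.1)]}
print(beam_decode(arcs, lambda t, u: 0.5, n_frames=3))
```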
At this point, the word sequence corresponding to the feature vector sequence has been obtained through decoding, i.e., the recognition result corresponding to the speech signal to be recognized has been acquired. Because the client preset information was added when the search space for speech recognition was built in step 101, the above speech recognition process can usually identify speech content related to the client's local information fairly accurately.
Considering that the client's local information may be modified or deleted by the user, in order to further guarantee the accuracy of the word sequence obtained through the above decoding process, this embodiment also provides a preferred implementation: the accuracy of the word sequence is verified by text matching against the client preset information, and the corresponding speech recognition result is generated according to the verification result.
In a specific implementation, the above preferred implementation may include steps 104-1 to 104-4 listed below, further described with reference to FIG. 7.
Step 104-1: Select, from the word sequence, the word to be verified that corresponds to the client preset information.
For example, for a phone-call application, the client preset information is the contact names in the address book. If the speech recognition result is the word sequence "給小明打電話" ("Call Xiaoming"), then template matching or a grammar parsing process can determine that "小明" (Xiaoming) in the word sequence is the word to be verified that corresponds to the client preset information.
Step 104-2: Look up the word to be verified in the client preset information. If it is found, the accuracy verification is judged to have passed and step 104-3 is executed; otherwise step 104-4 is executed.
This step determines, through exact matching at the text level, whether the word to be verified belongs to the corresponding client preset information, thereby verifying the accuracy of the word sequence.
Continuing the example from step 104-1, this step looks in the client address book for the contact "小明", i.e., checks whether the information related to contact names in the address book contains the string "小明". If it does, the accuracy verification is judged to have passed and step 104-3 is executed next; otherwise execution moves to step 104-4.
Step 104-3: Take the word sequence as the speech recognition result.
Reaching this step means that the word to be verified contained in the decoded word sequence matches the client preset information, so the word sequence can be output as the speech recognition result, triggering the application that consumes the result to perform the corresponding operation.
Step 104-4: Correct the word sequence through pinyin-based fuzzy matching, and take the corrected word sequence as the speech recognition result.
Reaching this step usually means that the word to be verified contained in the decoded word sequence does not match the client preset information. If that word sequence were output as the speech recognition result, the related application would usually be unable to perform the correct operation; in this case, the necessary correction can be applied to the word sequence through fuzzy matching at the pinyin level.
In a specific implementation, the above correction can be realized as follows: by looking up the pronunciation dictionary, convert the word to be verified into a pinyin sequence to be verified, and convert each word in the client preset information into a comparison pinyin sequence; then compute in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, select from the client preset information the words ranked highest by similarity, and finally replace the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.
In a specific implementation, the similarity between two pinyin sequences can be computed in different ways. This embodiment computes the similarity based on the edit distance, for example taking the reciprocal of the edit distance plus 1 as the similarity. The edit distance is the minimum number of editing operations required to transform one string into the other, where an editing operation is replacing one character with another, inserting a character, or deleting a character. In general, the smaller the edit distance, the greater the similarity of the two strings.
Still following the example from step 104-1 above, the word sequence is "給小明打電話" and the word to be verified is "小明". If "小明" is not found among the contacts in the client address book, the pronunciation dictionary can be looked up to convert "小明" into the pinyin sequence to be verified, "xiaoming", and each contact name in the address book is likewise converted into its pinyin sequence, i.e., a comparison pinyin sequence. The edit distance between "xiaoming" and each comparison pinyin sequence is then computed in turn, and the contact name corresponding to the comparison pinyin sequence with the smallest edit distance (highest similarity) is selected (for example "小敏", whose pinyin is "xiaomin") to replace the word to be verified in the word sequence. This completes the correction of the word sequence, and the corrected word sequence can serve as the final speech recognition result.
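A sketch of the whole pinyin-level correction follows: a standard Levenshtein edit distance with similarity = 1 / (edit distance + 1), as described above. The contact list and its name-to-pinyin table are illustrative stand-ins for the address book and the pronunciation dictionary.

```python
# Step 104-4 sketch: rank contacts by pinyin similarity to the decoded word.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def best_match(query_pinyin: str, contacts: dict) -> str:
    """Return the contact whose pinyin sequence is most similar to the query."""
    return max(contacts,
               key=lambda name: 1.0 / (edit_distance(query_pinyin,
                                                     contacts[name]) + 1))

contacts = {"小敏": "xiaomin", "李四": "lisi"}   # name -> pinyin sequence
print(best_match("xiaoming", contacts))          # 小敏
```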
In a specific implementation, it is also possible to first compute the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, rank them from high to low, select the words corresponding to the several (for example, three) top-ranked comparison pinyin sequences, and present these words to the client user, for example via screen output; the user then selects the correct word from them, and the word to be verified in the word sequence is replaced according to the word selected by the user.
At this point, the specific implementation of the speech recognition method provided by this application has been described in detail through steps 101 to 104. For ease of understanding, refer to FIG. 8, an overall framework diagram of the speech recognition process provided by this embodiment, in which the dashed box corresponds to the preparation stage described in this embodiment and the solid box corresponds to the actual speech recognition process.
It should be noted that step 101 described in this embodiment may be executed every time the client application that uses speech as its interaction medium starts, i.e., the search space containing the client preset information and used for decoding the speech signal is regenerated at each startup. Alternatively, the search space may be generated and stored only when the client application starts for the first time and then updated periodically. This reduces the time spent generating the search space at every application startup (the previously generated search space can be used directly), improves the execution efficiency of speech recognition, and improves the user experience.
In addition, the method provided by this application is usually implemented on a client device, including a smart mobile terminal, a smart speaker, a robot, or any other device capable of running the method; this embodiment describes the implementation of the method on the client. In other implementations, however, the method provided by this application can also be implemented in application scenarios based on a client-server model. In that case, the WFSTs generated in the preparation stage and the models used for acoustic probability computation need not be pre-installed on the client device; each time the client application starts, the corresponding client preset information can be uploaded to the server, and the subsequently collected speech signal to be recognized is also uploaded to the server. The method provided by this application is then carried out on the server side and the decoded word sequence is returned to the client, which likewise realizes the technical solution of this application and obtains the corresponding beneficial effects.
In summary, because the speech recognition method provided by this application includes the client preset information when generating the search space used for decoding the speech signal, it can recognize information related to the client's local data relatively accurately when recognizing speech signals collected by the client, thereby improving the accuracy of speech recognition and the user experience.
In particular, when the method provided by this application is used for speech recognition on a client device, the addition of the client's local information compensates, to a certain extent, for the drop in recognition accuracy caused by shrinking the probability computation models and the search space. It can thus both satisfy the need for speech recognition in environments without network access and achieve a reasonable recognition accuracy. Furthermore, after the word sequence is obtained by decoding, the accuracy of speech recognition can be improved further by applying the text-level and pinyin-level matching and verification scheme given in this embodiment. Actual test results show that the character error rate (CER) of conventional speech recognition methods is around 20%, while with the method provided by this application the character error rate is below 3%; these figures fully demonstrate that the beneficial effect of this method is significant.
The above embodiment provides a speech recognition method; correspondingly, this application also provides a speech recognition device. Refer to FIG. 9, a schematic diagram of an embodiment of a speech recognition device of this application. Since the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, see the corresponding description of the method embodiment. The device embodiment described below is merely illustrative.
The speech recognition device of this embodiment includes: a search space generation unit 901 for generating, from preset speech knowledge sources, a search space that contains client preset information and is used for decoding speech signals; a feature vector extraction unit 902 for extracting the feature vector sequence of the speech signal to be recognized; a probability calculation unit 903 for computing the probability that the feature vectors correspond to the basic units of the search space; and a decoding search unit 904 for taking the probabilities as input and performing a decoding operation in the search space to obtain the word sequence corresponding to the feature vector sequence.
Optionally, the search space generation unit is specifically configured to add, by replacing tags, client preset information corresponding to the preset topic category to a pre-generated weighted finite state transducer based at least on the language model, and to obtain a single weighted finite state transducer based on the triphone state-tying field table, the pronunciation dictionary, and the language model. The language model is generated in advance by a language model training unit, which is configured to replace the preset named entities in the text used for training the language model with tags corresponding to the preset topic categories, and to train the language model with that text.
Optionally, the search space generation unit includes: a first client information adding subunit for adding, by replacing tags, client preset information corresponding to the preset topic category to a pre-generated weighted finite state transducer based on the language model; and a weighted finite state transducer composition subunit for composing the weighted finite state transducer to which the client preset information has been added with a pre-generated weighted finite state transducer based on the triphone state-tying field table and the pronunciation dictionary, to obtain the single weighted finite state transducer.
Optionally, the decoding space generation unit includes: a second client information adding subunit for adding, by replacing tags, client preset information corresponding to the preset topic category to a pre-generated weighted finite state transducer based at least on the language model; and a unified weighted finite state transducer obtaining subunit for obtaining, after the second client information adding subunit completes the adding operation, the single weighted finite state transducer based on the triphone state-tying field table, the pronunciation dictionary, and the language model. The second client information adding subunit includes: a topic determination subunit for determining the preset topic category to which the speech signal to be recognized belongs; a weighted finite state transducer selection subunit for selecting the pre-generated weighted finite state transducer, based at least on the language model, that corresponds to the preset topic category; and a tag replacement subunit for adding the client preset information to the selected weighted finite state transducer by replacing the corresponding tags with the client preset information corresponding to the preset topic category.
Optionally, the topic determination subunit is specifically configured to determine the preset topic category according to the type of client that collects the speech signal, or according to the application.
Optionally, the weighted finite state transducer composition subunit is specifically configured to perform the composition operation using a prediction-based method and to obtain the single weighted finite state transducer.
Optionally, the probability calculation unit includes: a triphone state probability calculation subunit for computing, with a pre-trained DNN model, the probability that a feature vector corresponds to each triphone state; and a triphone probability calculation subunit for computing, with a pre-trained HMM model and based on the probability that the feature vectors correspond to each triphone state, the probability that the feature vectors correspond to each triphone.
Optionally, the feature vector extraction unit includes: a framing subunit for framing the speech signal to be recognized according to a preset frame length to obtain multiple audio frames; and a feature extraction subunit for extracting the feature vector of each audio frame to obtain the feature vector sequence.
Optionally, the device includes: an accuracy verification unit for verifying, after the decoding search unit obtains the word sequence corresponding to the feature vector sequence, the accuracy of the word sequence by text matching against the client preset information, and for generating the corresponding speech recognition result according to the verification result.
Optionally, the accuracy verification unit includes: a to-be-verified word selection subunit for selecting, from the word sequence, the word to be verified that corresponds to the client preset information; a lookup subunit for looking up the word to be verified in the client preset information; a recognition result confirmation subunit for judging, after the lookup subunit finds the word to be verified, that the accuracy verification has passed and taking the word sequence as the speech recognition result; and a recognition result correction subunit for correcting the word sequence through pinyin-based fuzzy matching after the lookup subunit fails to find the word to be verified, and taking the corrected word sequence as the speech recognition result.
Optionally, the recognition result correction subunit includes: a to-be-verified pinyin sequence conversion subunit for converting the word to be verified into a pinyin sequence to be verified; a comparison pinyin sequence conversion subunit for converting each word in the client preset information into a comparison pinyin sequence; a similarity calculation and selection subunit for computing in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and selecting from the client preset information the words ranked highest by similarity; and a to-be-verified word replacement subunit for replacing the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.
In addition, this application provides another speech recognition method. Refer to FIG. 10, a flowchart of an embodiment of the other speech recognition method provided by this application. The parts of this embodiment that are identical to the previously provided method embodiment are not repeated; the description below focuses on the differences. The other speech recognition method provided by this application includes:
Step 1001: Obtain, through decoding, the word sequence corresponding to the speech signal to be recognized.
For speech recognition, decoding is the process of searching the search space used for speech recognition to obtain the best word sequence corresponding to the speech signal to be recognized. The search space may be a WFST network based on various knowledge sources or a search space of another form; it may or may not contain client preset information, and this embodiment places no specific limitation on this.
Step 1002: Verify the accuracy of the word sequence by text matching against the client preset information, and generate the corresponding speech recognition result according to the verification result.
This step includes the following operations: select, from the word sequence, the word to be verified that corresponds to the client preset information; look up the word to be verified in the client preset information; if it is found, judge that the accuracy verification has passed and take the word sequence as the speech recognition result; otherwise, correct the word sequence through pinyin-based fuzzy matching and take the corrected word sequence as the speech recognition result.
Correcting the word sequence through pinyin-based fuzzy matching includes: converting the word to be verified into a pinyin sequence to be verified; converting each word in the client preset information into a comparison pinyin sequence; computing in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and selecting from the client preset information the words ranked highest by similarity; and replacing the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.
The conversion into pinyin sequences can be realized by looking up the pronunciation dictionary, and the similarity can be computed from the edit distance between the two pinyin sequences.
The method provided by this application is typically applied in applications that use speech as their interaction medium. The speech to be recognized collected by such applications may involve client information, and the method provided by this application can verify the accuracy of the word sequence by text-matching the decoded word sequence against the client preset information, thereby providing a basis for making the necessary corrections to the word sequence. Furthermore, by applying fuzzy matching at the pinyin level, the word sequence can be corrected, improving the accuracy of speech recognition.
The above embodiment provides another speech recognition method; correspondingly, this application also provides another speech recognition device. Refer to FIG. 11, a schematic diagram of an embodiment of the other speech recognition device of this application. Since the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, see the corresponding description of the method embodiment. The device embodiment described below is merely illustrative.
The speech recognition device of this embodiment includes: a word sequence obtaining unit 1101 for obtaining, through decoding, the word sequence corresponding to the speech signal to be recognized; and a word sequence verification unit 1102 for verifying the accuracy of the word sequence by text matching against the client preset information and generating the corresponding speech recognition result according to the verification result.
Optionally, the word sequence verification unit includes: a to-be-verified word selection subunit for selecting, from the word sequence, the word to be verified that corresponds to the client preset information; a lookup subunit for looking up the word to be verified in the client preset information; a recognition result confirmation subunit for judging, after the lookup subunit finds the word to be verified, that the accuracy verification has passed and taking the word sequence as the speech recognition result; and a recognition result correction subunit for correcting the word sequence through pinyin-based fuzzy matching after the lookup subunit fails to find the word to be verified, and taking the corrected word sequence as the speech recognition result.
Optionally, the recognition result correction subunit includes: a to-be-verified pinyin sequence conversion subunit for converting the word to be verified into a pinyin sequence to be verified; a comparison pinyin sequence conversion subunit for converting each word in the client preset information into a comparison pinyin sequence; a similarity calculation and selection subunit for computing in turn the similarity between the pinyin sequence to be verified and each comparison pinyin sequence, and selecting from the client preset information the words ranked highest by similarity; and a to-be-verified word replacement subunit for replacing the word to be verified in the word sequence with the selected word to obtain the corrected word sequence.
Although this application is disclosed above through preferred embodiments, they are not intended to limit it. Any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of this application; therefore, the scope of protection of this application shall be that defined by the claims of this application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and can realize information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media, and may be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
2. Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing computer-usable program code.