WO2021051544A1

WO2021051544A1 - Voice recognition method and device

Info

Publication number: WO2021051544A1
Application number: PCT/CN2019/117326
Authority: WO
Inventors: 刘博卿; 王健宗; 贾雪丽
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-09-16
Filing date: 2019-11-12
Publication date: 2021-03-25
Anticipated expiration: 2022-03-16
Also published as: CN110706690B; CN110706690A

Abstract

Provided are a voice recognition method and device, relating to the technical field of artificial intelligence. The method comprises: obtaining a target voice to be recognized; extracting waveform characteristics and pitch characteristics corresponding to each frame of target speech; inputting, in order, the waveform characteristics and pitch characteristics corresponding to each frame of target speech into a trained speech recognition model, the speech recognition model comprising an encoding sub-model, a first decoding sub-model, and a second decoding sub-model; the encoding sub-model comprises a convolutional neural network and bidirectional long- and short-term memory networks; the first decoding sub-model comprises a speech–text matching unit, the second decoding sub-model comprises a text context matching unit, and the voice–text matching unit comprises a CTC loss function and attention model; according to the output of the voice recognition model, generating text corresponding to the target voice to be recognized. Thus by combining voice–text matching and text context matching, the accuracy of voice recognition is improved.

Description

Speech recognition method and device

本申请要求于2019年9月16日提交中国专利局、申请号为201910869774.0、申请名称为“语音识别方法及其装置”的中国专利申请的优先权，其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office, the application number is 201910869774.0, and the application name is "Voice Recognition Method and Device" on September 16, 2019, the entire content of which is incorporated into this application by reference.

【Technical Field】

本申请涉及人工智能技术领域，尤其涉及一种语音识别方法及其装置。This application relates to the field of artificial intelligence technology, and in particular to a voice recognition method and device.

【Background technique】

语音识别，也被称为自动语音识别(Automatic Speech Recognition,ASR)，其目标是让机器通过识别和理解，把语音信号变成文字，是现代人工智能发展的重要分支。Speech recognition, also known as Automatic Speech Recognition (ASR), whose goal is to allow machines to recognize and understand speech signals into text, is an important branch of the development of modern artificial intelligence.

传统的语音识别技术是基于隐马尔科夫模型(Hidden Markov Model；以下简称：HMM)、高斯混合模型(Gaussian Mixture Model；以下简称：GMM)以及深度神经网络-隐马尔科夫模型(Deep Neural Network-Hidden Markov Model；以下简称：DNN-HMM)进行声学模型的建立。然而，上述基于上述声学模型的语音识别准确度还有待于进一步提高。The traditional speech recognition technology is based on Hidden Markov Model (Hidden Markov Model; HMM), Gaussian Mixture Model (GMM), and Deep Neural Network-Hidden Markov Model (Deep Neural Network). -Hidden Markov Model; hereinafter referred to as: DNN-HMM) to establish an acoustic model. However, the accuracy of the above-mentioned speech recognition based on the above-mentioned acoustic model needs to be further improved.

【申请内容】【Content of Application】

本申请实施例提供了一种语音识别方法及其装置，可以提高语音识别的准确度。The embodiments of the present application provide a voice recognition method and device, which can improve the accuracy of voice recognition.

第一方面，本申请实施例提供了一种语音识别方法，包括以下步骤：获取待识别的目标语音；提取每一帧所述目标语音对应的波形特征和音调特征；将每一帧所述目标语音对应的所述波形特征和所述音调特征顺序输入训练完的语音识别模型中；其中，所述语音识别模型包括编码子模型，第一解码子模型和第二解码子模型，所述编码子模型包括卷积神经网络和双向长短期记忆网络，所述第一解码子模型包括语音-文字匹配单元，所述第二解码子模型包括文字上下文匹配单元，所述语音-文字匹配单元包括CTC损失函数和注意力模型；根据所述语音识别模型的输出，生成所述待识别的目标语音对应的文字。In the first aspect, an embodiment of the present application provides a speech recognition method, which includes the following steps: acquiring the target speech to be recognized; extracting the waveform characteristics and pitch characteristics corresponding to the target speech in each frame; The waveform features and the tone features corresponding to the speech are sequentially input into the trained speech recognition model; wherein, the speech recognition model includes an encoding sub-model, a first decoding sub-model and a second decoding sub-model, the encoding sub-model The model includes a convolutional neural network and a bidirectional long and short-term memory network, the first decoding sub-model includes a speech-text matching unit, the second decoding sub-model includes a text context matching unit, and the speech-text matching unit includes a CTC loss Function and attention model; according to the output of the voice recognition model, the text corresponding to the target voice to be recognized is generated.

第二方面，本申请实施例还提供了一种语音识别装置，包括：第一获取模块，用于获取待识别的目标语音；第一提取模块，用于提取每一帧所述目标语音对应的波形特征和音调特征；第一输入模块，用于将每一帧所述目标语音对应的所述波形特征和所述音调特征顺序输入训练完的语音识别模型中；其中，所述语音识别模型包括编码子模型，第一解码子模型和第二解码子模型，所述编码子模型包括卷积神经网络和双向长短期记忆网络，所述第一解码子模型包括语音-文字匹配单元，所述第二解码子模型包括文字上下文匹配单元，所述语音-文字匹配单元包括CTC损失函数和注意力模型；生成模块，用于根据所述语音识别模型的输出，生成所述待识别的目标语音对应的文字。In a second aspect, an embodiment of the present application also provides a voice recognition device, including: a first acquisition module, configured to acquire a target voice to be recognized; a first extraction module, configured to extract the target voice corresponding to each frame Waveform feature and pitch feature; a first input module for sequentially inputting the waveform feature and the pitch feature corresponding to each frame of the target speech into the trained speech recognition model; wherein, the speech recognition model includes An encoding sub-model, a first decoding sub-model and a second decoding sub-model, the encoding sub-model includes a convolutional neural network and a two-way long short-term memory network, the first decoding sub-model includes a voice-text matching unit, the first The second decoding sub-model includes a text context matching unit, and the voice-text matching unit includes a CTC loss function and an attention model; a generating module is used to generate the corresponding target voice to be recognized according to the output of the voice recognition model Text.

第三方面，本申请实施例还提供了一种计算机非易失性可读存储介质，所述计算机非易失性可读存储介质包括存储的程序，其中，在程序运行时控制所述计算机非易失性可读存储介质所在设备执行上述语音识别方法。In the third aspect, the embodiments of the present application also provide a computer non-volatile readable storage medium. The computer non-volatile readable storage medium includes a stored program, wherein the non-volatile computer is controlled when the program is running. The device where the volatile readable storage medium is located executes the above voice recognition method.

第四方面，本申请实施例还提供了一种计算机设备，包括存储器和处理器，所述存储器用于存储包括程序指令的信息，所述处理器用于控制程序指令的执行，所述程序指令被处理器加载并执行时实现上述语音识别方法。In a fourth aspect, an embodiment of the present application also provides a computer device, including a memory and a processor, the memory is used to store information including program instructions, the processor is used to control the execution of the program instructions, and the program instructions are The above voice recognition method is realized when the processor is loaded and executed.

以上技术方案中，在语音-文字匹配过程中，结合了CTC损失函数和注意力模型，使得语音特征和文字能够对齐，在进行语音识别的过程中，结合了语音-文字匹配和文字上下文匹配，提高了语音识别的准确度。In the above technical solution, in the process of voice-text matching, the CTC loss function and attention model are combined, so that the voice features and text can be aligned. In the process of speech recognition, voice-text matching and text context matching are combined. Improve the accuracy of speech recognition.

【Explanation of the drawings】

为了更清楚地说明本申请实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍，显而易见地，下面描述中的附图是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to more clearly describe the technical solutions in the embodiments of the present application or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description These are some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative labor.

图1为本申请实施例所提供的一种语音识别方法的流程示意图；FIG. 1 is a schematic flowchart of a voice recognition method provided by an embodiment of this application;

图2为本申请实施例所提供的语音识别方法的一个示例的流程图；FIG. 2 is a flowchart of an example of a voice recognition method provided by an embodiment of the application;

图3为本申请实施例所提供的一种语音识别装置的结构示意图；FIG. 3 is a schematic structural diagram of a speech recognition device provided by an embodiment of this application;

图4为本申请计算机设备一个实施例的结构示意图。Fig. 4 is a schematic structural diagram of an embodiment of a computer device of this application.

【detailed description】

为了更好的理解本申请的技术方案，下面结合附图对本申请实施例进行详细描述。In order to better understand the technical solutions of the present application, the embodiments of the present application will be described in detail below with reference to the accompanying drawings.

应当明确，所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其它实施例，都属于本申请保护的范围。It should be clear that the described embodiments are only a part of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of this application.

在本申请实施例中使用的术语是仅仅出于描述特定实施例的目的，而非旨在限制本申请。在本申请实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式，除非上下文清楚地表示其他含义。The terms used in the embodiments of the present application are only for the purpose of describing specific embodiments, and are not intended to limit the present application. The singular forms of "a", "said" and "the" used in the embodiments of the present application and the appended claims are also intended to include plural forms, unless the context clearly indicates other meanings.

图1为本申请实施例所提供的一种语音识别方法的流程示意图，如图1所示，上述方法可以包括：FIG. 1 is a schematic flowchart of a voice recognition method provided by an embodiment of the application. As shown in FIG. 1, the above method may include:

步骤S101，获取待识别的目标语音。Step S101: Obtain a target voice to be recognized.

具体来说，本实施例中的目标语音通常是通过交互应用获取到用户输入的音频数据。Specifically, the target voice in this embodiment is usually audio data input by the user obtained through an interactive application.

步骤S102，提取每一帧目标语音对应的波形特征和音调特征。Step S102: Extract the waveform feature and pitch feature corresponding to each frame of target speech.

具体地，波形特征和音调特征是语音识别中语音特征的一种。其中，在提取波形特征和音调特征之前，一般需要对目标语音进行预处理。具体地，首先对输入的目标语音进行预加重处理，通过使用一个高通滤波器提升语音信号中的高频部分，使得频谱更平滑，然后将经过预加重处理的目标语音进行分帧加窗，从而将非平稳的语音信号转变为短时平稳的信号，接着通过端点检测，区分语音与噪声，并提取出有效的语音部分。Specifically, the waveform feature and the pitch feature are one of the voice features in speech recognition. Among them, before extracting waveform features and pitch features, it is generally necessary to preprocess the target speech. Specifically, the input target voice is pre-emphasized first, and the high-frequency part of the voice signal is improved by using a high-pass filter to make the frequency spectrum smoother, and then the pre-emphasized target voice is framed and windowed, thereby The non-stationary speech signal is transformed into a short-term stable signal, and then the end point is detected to distinguish the speech and noise, and extract the effective speech part.

其中，波形特征通过以下步骤进行提取：将经过预处理的目标语音进行快速傅里叶变化，从而将时域的语音信号转换为频域的能量谱进行分析，然后将能量谱通过一组梅尔尺度的三角滤波器组，突出语音的共振峰特征，之后计算每个滤波器组输出的对数能量，得到目标语音对应的波形特征。Among them, the waveform feature is extracted through the following steps: the pre-processed target speech is subjected to fast Fourier change, thereby converting the speech signal in the time domain into an energy spectrum in the frequency domain for analysis, and then passing the energy spectrum through a set of mel The triangular filter bank of the scale highlights the formant characteristics of the speech, and then the logarithmic energy output by each filter bank is calculated to obtain the waveform characteristics corresponding to the target speech.

音调特征通过以下步骤进行提取：将经过预处理的目标语音中，频率在50-450Hz之间的语音进行提取，以得到目标语音对应的音调特征。需要说明的是，人类的音调在50-450Hz之间，与环境背景音的音调有较大差异，通过对50-450Hz之间的语音进行提取，能够滤除较大部分的语音静音片段，提高后续语音识别效率。The tonal feature is extracted by the following steps: the preprocessed target speech, the voice with a frequency between 50-450 Hz is extracted to obtain the tonal feature corresponding to the target speech. It should be noted that the human tone is between 50-450Hz, which is quite different from the tone of the environmental background sound. By extracting the voice between 50-450Hz, it is possible to filter out a large part of the silent voice segment and improve Subsequent speech recognition efficiency.

步骤S103，将每一帧目标语音对应的波形特征和音调特征顺序输入训练完的语音识别模型中。In step S103, the waveform characteristics and pitch characteristics corresponding to each frame of target speech are sequentially input into the trained speech recognition model.

其中，语音识别模型包括编码子模型，第一解码子模型和第二解码子模型，编码子模型包括卷积神经网络和双向长短期记忆网络，第一解码子模型包括语音-文字匹配单元，第二解码子模型包括文字上下文匹配单元，语音-文字匹配单元包括CTC损失函数和注意力模型。Among them, the speech recognition model includes an encoding sub-model, a first decoding sub-model and a second decoding sub-model. The encoding sub-model includes a convolutional neural network and a two-way long and short-term memory network. The first decoding sub-model includes a speech-text matching unit. The second decoding sub-model includes a text context matching unit, and the speech-text matching unit includes a CTC loss function and an attention model.

需要特别说明的是，本申请实施例所提供的语音识别模型包括编码子模型，第一解码子模型和第二解码子模型，编码子模型用于将波形特征和音调特征进行编码处理，生成抽象特征向量，以便于与文字库中的文字进行匹配。It should be particularly noted that the speech recognition model provided by the embodiments of this application includes an encoding sub-model, a first decoding sub-model and a second decoding sub-model. The encoding sub-model is used to encode waveform features and tonal features to generate abstractions. Feature vector to facilitate matching with the text in the text library.

第一解码子模型用于根据特征向量，确定目标语音对应的文字。可以理解，语音具有连续性的特点，因此需要不停地将目标语音帧输入语音识别模型，语音识别模型不断地对每一帧目标语音进行语音识别。即编码子模型不断地对每一帧目标语音对应的波形特征和音调特征进行特征编码，第一解码子模型不断地对特征编码进行识别，生成每一帧目标语音对应的文字。The first decoding sub-model is used to determine the text corresponding to the target speech according to the feature vector. It can be understood that speech has the characteristic of continuity, so it is necessary to continuously input target speech frames into the speech recognition model, and the speech recognition model continuously performs speech recognition on each frame of target speech. That is, the coding sub-model continuously performs feature coding on the waveform characteristics and tonal characteristics corresponding to each frame of target speech, and the first decoding sub-model continuously recognizes the feature codes to generate text corresponding to each frame of target speech.

具体地，每一帧目标语音对应的波形特征和音调特征是一个80维的特征向量。编码子模型包括卷积神经网络和双向长短期记忆网络，卷积神经网络包括4层卷积层和2层最大池化层，用于对波形特征和音调特征进行卷积，以生成目标语音的卷积特征，将每一帧目标语音对应的卷积特征顺序输入双向长短期记忆网络，以获取目标语音的时序特征。第一解码子模型包括CTC损失函数和注意力模型，CTC损失函数使用了帧级别的字母序列和一个空白标签，能够对目标语音的时序特征进行分隔，以便于与文字进行匹配。注意力模型能够将对应于同一个文字、同一个词组的时序特征进行分组，以便于对时序特征进行对齐。其中，CTC损失函数通过确定目标语音中空白标签的位置来对目标语音的时序特征进行分隔，无需考虑前后相关语音，而注意力模型基于通过确定目标语音的关键信息，确定目标语音中的有效语音部分，依赖语音之间的关联关系。Specifically, the waveform feature and pitch feature corresponding to each frame of target speech is an 80-dimensional feature vector. The coding sub-model includes a convolutional neural network and a bidirectional long-short-term memory network. The convolutional neural network includes 4 layers of convolutional layers and 2 layers of maximum pooling layers, which are used to convolve waveform features and pitch features to generate the target speech. Convolution feature, the convolution feature corresponding to each frame of target speech is sequentially input into the bidirectional long and short-term memory network to obtain the time sequence feature of the target speech. The first decoding sub-model includes a CTC loss function and an attention model. The CTC loss function uses a frame-level letter sequence and a blank label, which can separate the timing features of the target speech to facilitate matching with the text. The attention model can group the temporal features corresponding to the same text and the same phrase in order to align the temporal features. Among them, the CTC loss function separates the timing characteristics of the target voice by determining the position of the blank label in the target voice, without considering the related voice before and after, and the attention model is based on determining the key information of the target voice to determine the effective voice in the target voice Partly, it depends on the relationship between voices.

需要说明的是，本申请实施例中将时序特征分别输入CTC损失函数和注意力模型，再将二者的输出进行结合，以实现对目标语音的时序特征的分隔和对齐。It should be noted that in the embodiment of the present application, the timing features are respectively input to the CTC loss function and the attention model, and the outputs of the two are combined to achieve separation and alignment of the timing features of the target speech.

可以理解，在对目标语音的时序特征进行分隔和对齐之后，就将目标语音的时序特征进行了拆分，用于与多个文字或者词组进行匹配，得到第一解码子模型对应的文字输出。It can be understood that after separating and aligning the timing features of the target speech, the timing characteristics of the target speech are split and used to match multiple words or phrases to obtain the text output corresponding to the first decoding sub-model.

此外，本申请实施例还包括第二解码子模型，用于根据已经生成的文字，对下一帧目标语音对应的文字进行预测。需要说明的是，由于待识别的目标语音是人类的语音，识别出的文字之间具有前后逻辑，而本申请中对目标语音的识别是顺序实现的，因此可以借助文字之间上下文的联系，通过已经识别出的文字，对尚未识别出的文字进行预测。In addition, the embodiment of the present application also includes a second decoding sub-model, which is used to predict the text corresponding to the next frame of target speech based on the generated text. It should be noted that, since the target voice to be recognized is human voice, the recognized text has front-to-back logic, and the recognition of the target voice in this application is implemented sequentially, so the contextual connection between texts can be used. Predict the characters that have not been recognized based on the characters that have been recognized.

具体地，第二解码子模型包括文字上下文匹配单元，文字上下文匹配单元包括循环神经网络语言模型，循环神经网络语言模型通过大量的文本内容进行训练，训练后的循环神经网络语言模型能够根据输入的文本内容的上文，对文本内容的下文进行预测。Specifically, the second decoding sub-model includes a text context matching unit, and the text context matching unit includes a cyclic neural network language model. The cyclic neural network language model is trained through a large amount of text content, and the trained cyclic neural network language model can be based on the input Above the text content, predict the following text content.

步骤S104，根据语音识别模型的输出，生成待识别的目标语音对应的文字。Step S104: Generate text corresponding to the target voice to be recognized according to the output of the voice recognition model.

综上所述，本申请实施例所提供的语音识别方法，获取待识别的目标语音，提取每一帧目标语音对应的波形特征和音调特征。将每一帧目标语音对应的波形特征和音调特征顺序输入训练完的语音识别模型中。其中，语音识别模型包括编码子模型，第一解码子模型和第二解码子模型，编码子模型包括卷积神经网络和双向长短期记忆网络，第一解码子模型包括语音-文字匹配单元，第二解码子模型包括文字上下文匹配单元，语音-文字匹配单元包括CTC损失函数和注意力模型。根据语音识别模型的输出，生成待识别的目标语音对应的文字。由此，结合了语音-文字匹配和文字上下文匹配，提高了语音识别的准确度。In summary, the voice recognition method provided by the embodiments of the present application obtains the target voice to be recognized, and extracts the waveform characteristics and pitch characteristics corresponding to each frame of the target voice. The waveform characteristics and pitch characteristics corresponding to each frame of target speech are sequentially input into the trained speech recognition model. Among them, the speech recognition model includes an encoding sub-model, a first decoding sub-model and a second decoding sub-model. The encoding sub-model includes a convolutional neural network and a two-way long and short-term memory network. The first decoding sub-model includes a speech-text matching unit. The second decoding sub-model includes a text context matching unit, and the speech-text matching unit includes a CTC loss function and an attention model. According to the output of the voice recognition model, the text corresponding to the target voice to be recognized is generated. Thus, the combination of voice-text matching and text context matching improves the accuracy of speech recognition.

应当理解，语音识别模型需要经过预先训练才能用于语音识别，本申请实施例所提供的语音识别模型通过以下步骤进行训练：It should be understood that the speech recognition model needs to be pre-trained before it can be used for speech recognition. The speech recognition model provided in the embodiment of the present application is trained through the following steps:

步骤S11，获取参考语音和参考语音对应的参考文本。Step S11: Obtain the reference voice and the reference text corresponding to the reference voice.

其中，参考语音和参考语音对应的参考文本是用于训练语音识别模型的标准信息，参考语音对应的参考文本是参考语音的正确识别结果。Among them, the reference voice and the reference text corresponding to the reference voice are standard information used to train the voice recognition model, and the reference text corresponding to the reference voice is the correct recognition result of the reference voice.

步骤S12，提取每一帧参考语音对应的波形特征和音调特征。Step S12: Extract the waveform feature and pitch feature corresponding to each frame of reference speech.

在训练过程中，首先将参考语音进行分帧处理，再提取每一帧参考语音对应的波形特征和音调特征。In the training process, the reference speech is first processed into frames, and then the waveform characteristics and pitch characteristics corresponding to each frame of the reference speech are extracted.

步骤S13，将一帧参考语音对应的波形特征和音调特征输入卷积神经网络，以获取参考语音的卷积特征。Step S13: Input the waveform feature and pitch feature corresponding to a frame of reference speech into the convolutional neural network to obtain the convolutional feature of the reference speech.

其中，卷积神经网络包括4层卷积层和2层最大池化层。Among them, the convolutional neural network includes 4 convolutional layers and 2 maximum pooling layers.

步骤S14，将每一帧参考语音对应的卷积特征顺序输入双向长短期记忆网络，以获取参考语音的时序特征。In step S14, the convolutional features corresponding to each frame of the reference voice are sequentially input into the bidirectional long and short-term memory network to obtain the time sequence features of the reference voice.

步骤S15，将参考语音的时序特征分别输入CTC损失函数和注意力模型，以顺序输出每一帧参考语音对应的文字。In step S15, the time sequence characteristics of the reference voice are input into the CTC loss function and the attention model respectively, and the text corresponding to each frame of the reference voice is sequentially output.

步骤S16，将已经输出的参考语音对应的文字输入训练后的文字上下文匹配单元。Step S16, input the text corresponding to the output reference voice into the text context matching unit after training.

步骤S17，根据语音-文字匹配单元的输出，对文字上下文匹配单元，以及参考文本，对卷积神经网络、双向长短期记忆网络、语音-文字匹配单元、文字上下文匹配单元的参数进行训练。Step S17, according to the output of the voice-text matching unit, the parameters of the text context matching unit and the reference text are trained on the parameters of the convolutional neural network, the bidirectional long-short-term memory network, the voice-text matching unit, and the text context matching unit.

基于前述说明，可以知道，参考语音对应的波形特征和音调特征可以用于区分不同的参考语音，经过卷积神经网络和双向长短期记忆网络，能够将特征进行抽象，突出不同参考语音对应的特征之间的差异，利用不同参考语音对应的特征之间的差异。通过CTC损失函数和注意力模型，来确定参考语音对应的文字。Based on the foregoing description, it can be known that the waveform characteristics and pitch characteristics corresponding to the reference voice can be used to distinguish different reference voices. Through the convolutional neural network and the two-way long and short-term memory network, the features can be abstracted and the features corresponding to different reference voices can be highlighted. Use the difference between the features corresponding to different reference voices. Through the CTC loss function and the attention model, the text corresponding to the reference voice is determined.

将参考语音对应的文字与参考文本进行比较，调整文字上下文匹配单元，以及参考文本，对卷积神经网络、双向长短期记忆网络、语音-文字匹配单元、文字上下文匹配单元的参数，从而完成对语音识别模型的训练。Compare the text corresponding to the reference voice with the reference text, adjust the text context matching unit, and the reference text, and compare the parameters of the convolutional neural network, two-way long and short-term memory network, voice-text matching unit, and text context matching unit to complete the alignment. Training of speech recognition model.

此外，文字上下文匹配单元包括循环神经网络语言模型，文字上下文匹配单元通过以下步骤进行训练：In addition, the text context matching unit includes a recurrent neural network language model, and the text context matching unit is trained through the following steps:

步骤S21，获取参考文本。Step S21: Obtain the reference text.

步骤S22，将参考文本顺序输入循环神经网络语言模型。In step S22, the reference text is sequentially input into the cyclic neural network language model.

步骤S23，根据参考文本的上下文，对循环神经网络语言模型的参数进行训练。Step S23, training the parameters of the recurrent neural network language model according to the context of the reference text.

需要说明的是，循环神经网络语言模型能够根据参考文本的上文，预测参考文本的下文，根据预测结果与真实结果，对循环神经网络语言模型的参数进行训练。It should be noted that the cyclic neural network language model can predict the following text of the reference text based on the upper part of the reference text, and train the parameters of the cyclic neural network language model according to the predicted result and the real result.

由于在训练语音识别模型时需要使用训练完的文字上下文匹配单元，因此需要对循环神经网络语言模型进行单独训练，所使用的参考文本可以是用于训练语音识别模型的参考语音对应的参考文本，也可以是其他参考文本，本申请实施例对此不做限定。Since the trained text context matching unit needs to be used when training the speech recognition model, the recurrent neural network language model needs to be trained separately. The reference text used can be the reference text corresponding to the reference speech used to train the speech recognition model. It may also be other reference texts, which are not limited in the embodiments of the present application.

可以理解，本申请实施例所提供的语音识别模型包括第一解码子模型和第二解码子模型，用于对双向长短期记忆网络的输出进行解码，以确定目标语音对应的文字。It can be understood that the speech recognition model provided by the embodiment of the present application includes a first decoding sub-model and a second decoding sub-model, which are used to decode the output of the two-way long and short-term memory network to determine the text corresponding to the target speech.

需要说明的是，本申请实施例所提供的第一解码子模型能够根据目标语音的时序特征，确定目标语音对应的文字，第二解码子模型能够根据已经识别出的文字，对目标语音的下文进行预测，将二者的输出进行结合，能够提升语音识别的准确度。但是第一解码子模型和第二解码子模型对于最终的识别结果的影响力不同，具体可以通过权重来实现。换句话说，语音-文字匹配单元的参数包括语音-文字匹配单元对应的第一权重，文字上下文匹配单元的参数包括文字上下文匹配单元对应的第二权重。It should be noted that the first decoding sub-model provided by the embodiments of the present application can determine the text corresponding to the target speech according to the time sequence characteristics of the target speech, and the second decoding sub-model can determine the following text of the target speech according to the recognized text. Predicting and combining the output of the two can improve the accuracy of speech recognition. However, the influence of the first decoding sub-model and the second decoding sub-model on the final recognition result is different, which can be specifically implemented by weighting. In other words, the parameters of the speech-text matching unit include the first weight corresponding to the speech-text matching unit, and the parameters of the text context matching unit include the second weight corresponding to the text context matching unit.

第一权重越大，则第一解码子模型的识别结果对语音识别模型的识别结果的影响力越大，第二权重越大，则第二解码子模型的识别结果对语音识别模型的识别结果的影响力越大。The greater the first weight, the greater the influence of the recognition result of the first decoding sub-model on the recognition result of the speech recognition model, the greater the second weight, the greater the influence of the recognition result of the second decoding sub-model on the recognition result of the speech recognition model The greater the influence.

进一步地，语音-文字匹配单元包括CTC损失函数和注意力模型，CTC损失函数和注意力模型分别从不同侧重点对目标语音的时序特征进行识别，将二者的输出进行结合，能够提升时序特征识别的准确度。但是CTC损失函数和注意力模型对于时序特征的识别结果的影响力不同，具体也可以通过权重来实现。具体来说，语音-文字匹配单元的参数还包括CTC损失函数对于的第三权重，注意力模型对应的第四权重。Furthermore, the speech-text matching unit includes a CTC loss function and an attention model. The CTC loss function and the attention model recognize the timing characteristics of the target speech from different focuses. Combining the output of the two can improve the timing characteristics. Accuracy of recognition. However, the CTC loss function and the attention model have different influences on the recognition results of time series features, and they can also be implemented by weights. Specifically, the parameters of the speech-text matching unit also include the third weight of the CTC loss function and the fourth weight corresponding to the attention model.

第三权重越大，则CTC损失函数的识别结果对时序特征的识别结果的影响力越大，第四权重越大，则注意力模型的识别结果对时序特征的识别结果的影响力越大。The greater the third weight, the greater the influence of the recognition result of the CTC loss function on the recognition result of the time series feature, and the greater the fourth weight, the greater the influence of the recognition result of the attention model on the recognition result of the time series feature.

为了更加清楚地说明本申请实施例所提供的语音识别方法，下面进行举例说明，图2为本申请实施例所提供的语音识别方法的一个示例的流程图。如图2所示，将待识别的目标语音进行分帧，得到N帧目标语音。将N帧目标语音分别输入卷积神经网络中，得到N帧目标语音对应的卷积特征，将N帧目标语音对应的卷积特征输入双向长短期记忆网络，得到N帧目标语音对应的时序特征。In order to more clearly explain the voice recognition method provided by the embodiment of the present application, an example is given below. FIG. 2 is a flowchart of an example of the voice recognition method provided by the embodiment of the present application. As shown in Figure 2, the target voice to be recognized is divided into frames to obtain N frames of target voice. Input N frames of target speech into the convolutional neural network respectively to obtain the convolutional features corresponding to N frames of target speech, and input the convolutional features corresponding to N frames of target speech into the bidirectional long and short-term memory network to obtain the timing characteristics corresponding to N frames of target speech .

将N帧目标语音对应的时序特征分别输入CTC损失函数和注意力模型中，以及使用循环神经网络语音模型对文字进行预测。The timing features corresponding to the N frames of target speech are input into the CTC loss function and the attention model respectively, and the cyclic neural network speech model is used to predict the text.

举例来说，对于第二个文字，CTC损失函数根据时序特征确定可能性最大的文字A1，注意力模型根据时序特征确定可能性最大的文字是A2，循环神经网络语音模型根据第一个文字，确定可能性最大的文字是A3。结合CTC损失函数对应的第三权重、注意力模型对应的第四权重、语音-文字匹配单元对应的第一权重、文字上下文匹配单元对应的第二权重，确定第二个文字。For example, for the second text, the CTC loss function determines the most likely text A1 based on the time series features, the attention model determines the most likely text A2 based on the time series features, and the cyclic neural network speech model based on the first text, The most likely text is A3. Combining the third weight corresponding to the CTC loss function, the fourth weight corresponding to the attention model, the first weight corresponding to the speech-text matching unit, and the second weight corresponding to the text context matching unit, determine the second text.

为了实现上述实施例，本申请实施例还提出了一种语音识别装置，图3为本申请实施例所提供的一种语音识别装置的结构示意图，如图3所示，该装置包括：第一获取模块210，第一提取模块220，第一输入模块230，生成模块240。In order to implement the above-mentioned embodiment, the embodiment of the present application also proposes a voice recognition device. FIG. 3 is a schematic structural diagram of a voice recognition device provided by an embodiment of this application. As shown in FIG. 3, the device includes: The acquisition module 210, the first extraction module 220, the first input module 230, and the generation module 240.

第一获取模块210，用于获取待识别的目标语音。The first acquiring module 210 is used to acquire the target voice to be recognized.

第一提取模块220，用于提取每一帧目标语音对应的波形特征和音调特征。The first extraction module 220 is used for extracting waveform characteristics and pitch characteristics corresponding to each frame of target speech.

第一输入模块230，用于将每一帧目标语音对应的波形特征和音调特征顺序输入训练完的语音识别模型中。The first input module 230 is configured to sequentially input the waveform characteristics and pitch characteristics corresponding to each frame of target speech into the trained speech recognition model.

生成模块240，用于根据语音识别模型的输出，生成待识别的目标语音对应的文字。The generating module 240 is configured to generate text corresponding to the target voice to be recognized according to the output of the voice recognition model.

进一步地，为了对语音识别模型进行训练，该装置还包括：第二获取模块310，用于获取参考语音和参考语音对应的参考文本。第二提取模块320，用于提取每一帧参考语音对应的波形特征和音调特征。第二输入模块330，用于将一帧参考语音对应的波形特征和音调特征输入卷积神经网络，以获取参考语音的卷积特征。其中，卷积神经网络包括4层卷积层和2层最大池化层。第三输入模块340，用于将每一帧参考语音对应的卷积特征顺序输入双向长短期记忆网络，以获取参考语音的时序特征。第四输入模块350，用于将参考语音的时序特征分别输入CTC损失函数和注意力模型，以顺序输出每一帧参考语音对应的文字。第五输入模块360，用于将已经输出的参考语音对应的文字输入训练完的文字上下文匹配单元。第一训练模块370，用于根据语音-文字匹配单元的输出，对文字上下文匹配单元，以及参考文本，对卷积神经网络、双向长短期记忆网络、语音-文字匹配单元、文字上下文匹配单元的参数进行训练。Further, in order to train the voice recognition model, the device further includes: a second obtaining module 310, configured to obtain the reference voice and the reference text corresponding to the reference voice. The second extraction module 320 is used to extract the waveform characteristics and pitch characteristics corresponding to each frame of reference speech. The second input module 330 is configured to input the waveform characteristics and pitch characteristics corresponding to a frame of reference speech into the convolutional neural network to obtain the convolutional characteristics of the reference speech. Among them, the convolutional neural network includes 4 convolutional layers and 2 maximum pooling layers. The third input module 340 is configured to sequentially input the convolutional features corresponding to each frame of the reference voice into the bidirectional long and short-term memory network to obtain the time sequence features of the reference voice. The fourth input module 350 is configured to input the timing characteristics of the reference voice into the CTC loss function and the attention model respectively, and sequentially output the text corresponding to each frame of the reference voice. The fifth input module 360 is used for inputting the text corresponding to the reference voice that has been output into the trained text context matching unit. The first training module 370 is used for matching the text context unit and the reference text according to the output of the speech-text matching unit, and the convolutional neural network, the bidirectional long-short-term memory network, the speech-text matching unit, and the text context matching unit. Parameters for training.

进一步地，为了对文字上下文匹配单元进行训练，文字上下文匹配单元包括循环神经网络语言模型，所述装置还包括：第三获取模块410，用于获取参考文本。第六输入模块420，用于将参考文本顺序输入循环神经网络语言模型。第二训练模块430，用于根据参考文本的上下文，对循环神经网络语言模型的参数进行训练。Further, in order to train the text context matching unit, the text context matching unit includes a cyclic neural network language model, and the device further includes: a third obtaining module 410 for obtaining reference text. The sixth input module 420 is used for sequentially inputting the reference text into the cyclic neural network language model. The second training module 430 is used to train the parameters of the recurrent neural network language model according to the context of the reference text.

需要说明的是，前述对语音识别方法实施例的解释说明也适用于该实施例的语音识别装置，此处不再赘述。It should be noted that the foregoing explanation of the embodiment of the voice recognition method is also applicable to the voice recognition device of this embodiment, and will not be repeated here.

综上所述，本申请实施例所提供的语音识别装置，获取待识别的目标语音，提取每一帧目标语音对应的波形特征和音调特征。将每一帧目标语音对应的波形特征和音调特征顺序输入训练完的语音识别模型中。其中，语音识别模型包括编码子模型，第一解码子模型和第二解码子模型，编码子模型包括卷积神经网络和双向长短期记忆网络，第一解码子模型包括语音-文字匹配单元，第二解码子模型包括文字上下文匹配单元，语音-文字匹配单元包括CTC损失函数和注意力模型。根据语音识别模型的输出，生成待识别的目标语音对应的文字。由此，结合了语音-文字匹配和文字上下文匹配，提高了语音识别的准确度。In summary, the voice recognition device provided by the embodiment of the present application obtains the target voice to be recognized, and extracts the waveform characteristics and pitch characteristics corresponding to each frame of the target voice. The waveform characteristics and pitch characteristics corresponding to each frame of target speech are sequentially input into the trained speech recognition model. Among them, the speech recognition model includes an encoding sub-model, a first decoding sub-model and a second decoding sub-model. The encoding sub-model includes a convolutional neural network and a two-way long and short-term memory network. The first decoding sub-model includes a speech-text matching unit. The second decoding sub-model includes a text context matching unit, and the speech-text matching unit includes a CTC loss function and an attention model. According to the output of the voice recognition model, the text corresponding to the target voice to be recognized is generated. Thus, the combination of voice-text matching and text context matching improves the accuracy of speech recognition.

图4为本申请计算机设备一个实施例的结构示意图，上述计算机设备可以包括存储器、处理器及存储在上述存储器上并可在上述处理器上运行的计算机程序，上述处理器执行上述计算机程序时，可以实现本申请实施例提供的语音识别方法。Fig. 4 is a schematic structural diagram of an embodiment of a computer device of this application. The computer device may include a memory, a processor, and a computer program stored in the memory and capable of running on the processor. When the processor executes the computer program, The voice recognition method provided in the embodiments of the present application can be implemented.

其中，上述计算机设备可以为服务器，例如：云服务器，或者上述计算机设备也可以为电子设备，例如：智能手机、智能手表、个人计算机(Personal Computer；以下简称：PC)、笔记本电脑或平板电脑等智能设备，本实施例对上述计算机设备的具体形态不作限定。Among them, the above-mentioned computer equipment may be a server, such as a cloud server, or the above-mentioned computer equipment may also be an electronic device, such as a smart phone, a smart watch, a personal computer (Personal Computer; hereinafter referred to as: PC), a notebook computer or a tablet computer, etc. Smart device, this embodiment does not limit the specific form of the above-mentioned computer device.

图4示出了适于用来实现本申请实施方式的示例性计算机设备52的框图。图4显示的计算机设备52仅仅是一个示例，不应对本申请实施例的功能和使用范围带来任何限制。Fig. 4 shows a block diagram of an exemplary computer device 52 suitable for implementing the embodiments of the present application. The computer device 52 shown in FIG. 4 is only an example, and should not bring any limitation to the functions and scope of use of the embodiments of the present application.

如图4所示，计算机设备52以通用计算设备的形式表现。计算机设备52的组件可以包括但不限于：一个或者多个处理器或者处理单元56，系统存储器78，连接不同系统组件(包括系统存储器78和处理单元56)的总线58。As shown in FIG. 4, the computer device 52 is in the form of a general-purpose computing device. The components of the computer device 52 may include, but are not limited to: one or more processors or processing units 56, a system memory 78, and a bus 58 connecting different system components (including the system memory 78 and the processing unit 56).

总线58表示几类总线结构中的一种或多种，包括存储器总线或者存储器控制器，外围总线，图形加速端口，处理器或者使用多种总线结构中的任意总线结构的局域总线。举例来说，这些体系结构包括但不限于工业标准体系结构(Industry Standard Architecture；以下简称：ISA)总线，微通道体系结构(Micro Channel Architecture；以下简称：MAC)总线，增强型ISA总线、视频电子标准协会(Video Electronics Standards Association；以下简称：VESA)局域总线以及外围组件互连(Peripheral Component Interconnection；以下简称：PCI)总线。The bus 58 represents one or more of several types of bus structures, including a memory bus or a memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any bus structure among multiple bus structures. For example, these architectures include but are not limited to industry standard architecture (Industry Standard Architecture; hereinafter referred to as ISA) bus, Micro Channel Architecture (hereinafter referred to as MAC) bus, enhanced ISA bus, video electronics Standards Association (Video Electronics Standards Association; hereinafter referred to as VESA) local bus and Peripheral Component Interconnection (hereinafter referred to as PCI) bus.

计算机设备52典型地包括多种计算机系统可读介质。这些介质可以是任何能够被计算机设备52访问的可用介质，包括易失性和非易失性介质，可移动的和不可移动的介质。The computer device 52 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the computer device 52, including volatile and non-volatile media, removable and non-removable media.

系统存储器78可以包括易失性存储器形式的计算机系统可读介质，例如随机存取存储器(Random Access Memory；以下简称：RAM)70和/或高速缓存存储器72。计算机设备52可以进一步包括其它可移动/不可移动的、易失性/非易失性计算机系统存储介质。仅作为举例，存储系统74可以用于读写不可移动的、非易失性磁介质(图4未显示，通常称为“硬盘驱动器”)。尽管图4中未示出，可以提供用于对可移动非易失性磁盘(例如“软盘”)读写的磁盘驱动器，以及对可移动非易失性光盘(例如：光盘只读存储器(Compact Disc Read Only Memory；以下简称：CD-ROM)、数字多功能只读光盘(Digital Video Disc Read Only Memory；以下简称：DVD-ROM)或者其它光介质)读写的光盘驱动器。在这些情况下，每个驱动器可以通过一个或者多个数据介质接口与总线58相连。存储器78可以包括至少一个程序产品，该程序产品具有一组(例如至少一个)程序模块，这些程序模块被配置以执行本申请各实施例的功能。The system memory 78 may include a computer system readable medium in the form of a volatile memory, such as a random access memory (Random Access Memory; hereinafter referred to as RAM) 70 and/or a cache memory 72. The computer device 52 may further include other removable/non-removable, volatile/non-volatile computer system storage media. For example only, the storage system 74 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 4, usually referred to as a "hard drive"). Although not shown in FIG. 4, a disk drive for reading and writing to a removable non-volatile disk (such as a "floppy disk") and a removable non-volatile optical disk (such as a compact disk read-only memory) can be provided. Disc Read Only Memory; hereinafter referred to as CD-ROM), Digital Video Disc Read Only Memory (hereinafter referred to as DVD-ROM) or other optical media) read and write optical disc drives. In these cases, each drive may be connected to the bus 58 through one or more data media interfaces. The memory 78 may include at least one program product, and the program product has a set of (for example, at least one) program modules, and these program modules are configured to perform the functions of the embodiments of the present application.

具有一组(至少一个)程序模块82的程序/实用工具80，可以存储在例如存储器78中，这样的程序模块82包括——但不限于——操作系统、一个或者多个应用程序、其它程序模块以及程序数据，这些示例中的每一个或某种组合中可能包括网络环境的实现。程序模块82通常执行本申请所描述的实施例中的功能和/或方法。A program/utility tool 80 having a set of (at least one) program module 82 may be stored in, for example, the memory 78. Such program module 82 includes, but is not limited to, an operating system, one or more application programs, and other programs Modules and program data, each of these examples or some combination may include the realization of a network environment. The program module 82 usually executes the functions and/or methods in the embodiments described in this application.

计算机设备52也可以与一个或多个外部设备54(例如键盘、指向设备、显示器64等)通信，还可与一个或者多个使得用户能与该计算机设备52交互的设备通信，和/或与使得该计算机设备52能与一个或多个其它计算设备进行通信的任何设备(例如网卡，调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口62进行。并且，计算机设备52还可以通过网络适配器60与一个或者多个网络(例如局域网(Local Area Network；以下简称：LAN)，广域网(Wide Area Network；以下简称：WAN)和/或公共网络，例如因特网)通信。如图4所示，网络适配器60通过总线58与计算机设备52的其它模块通信。应当明白，尽管图4中未示出，可以结合计算机设备52使用其它硬件和/或软件模块，包括但不限于：微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。The computer device 52 may also communicate with one or more external devices 54 (such as a keyboard, pointing device, display 64, etc.), and may also communicate with one or more devices that enable a user to interact with the computer device 52, and/or communicate with Any device (such as a network card, modem, etc.) that enables the computer device 52 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 62. In addition, the computer device 52 can also communicate with one or more networks (such as a local area network (Local Area Network; hereinafter referred to as: LAN), a wide area network (hereinafter referred to as WAN)) and/or a public network, such as the Internet, through the network adapter 60. ) Communication. As shown in FIG. 4, the network adapter 60 communicates with other modules of the computer device 52 through the bus 58. It should be understood that although not shown in FIG. 4, other hardware and/or software modules can be used in conjunction with the computer device 52, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tapes Drives and data backup storage systems, etc.

处理单元56通过运行存储在系统存储器78中的程序，从而执行各种功能应用以及数据处理，例如实现本申请实施例提供的语音识别方法。The processing unit 56 executes various functional applications and data processing by running programs stored in the system memory 78, such as implementing the voice recognition method provided in the embodiments of the present application.

本申请实施例还提供一种计算机非易失性可读存储介质，其上存储有计算机程序，上述计算机程序被处理器执行时可以实现本申请实施例提供的语音识别方法。The embodiment of the present application also provides a computer non-volatile readable storage medium on which a computer program is stored, and the above-mentioned computer program can implement the voice recognition method provided by the embodiment of the present application when the computer program is executed by a processor.

上述计算机非易失性可读存储介质可以采用一个或多个计算机可读的介质的任意组合。计算机非易失性可读介质可以是计算机可读信号介质或者计算机可读存储介质。计算机非易失性可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件，或者任意以上的组合。计算机非易失性可读存储介质的更具体的例子(非穷举的列表)包括：具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机存取存储器(RAM)、只读存储器(Read Only Memory；以下简称：ROM)、可擦式可编程只读存储器(Erasable Programmable Read Only Memory；以下简称：EPROM)或闪存、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本文件中，计算机非易失性可读存储介质可以是任何包含或存储程序的有形介质，该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。The above-mentioned non-volatile computer-readable storage medium may adopt any combination of one or more computer-readable media. The computer non-volatile readable medium may be a computer readable signal medium or a computer readable storage medium. The computer non-volatile readable storage medium may be, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or a combination of any of the above, for example. More specific examples (non-exhaustive list) of computer non-volatile readable storage media include: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (Read Only Memory; hereinafter referred to as ROM), Erasable Programmable Read Only Memory (Erasable Programmable Read Only Memory; hereinafter referred to as EPROM) or flash memory, optical fiber, portable compact disk read-only memory (CD-ROM), optical storage Components, magnetic storage devices, or any suitable combination of the above. In this document, a computer non-volatile readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.

计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号，其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式，包括——但不限于——电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机非易失性可读存储介质以外的任何计算机可读介质，该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。The computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and computer-readable program code is carried therein. This propagated data signal can take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium. The computer-readable medium can be sent, propagated, or transmitted for use by or with the instruction execution system, apparatus, or device. Combined program.

计算机可读介质上包含的程序代码可以用任何适当的介质传输，包括——但不限于——无线、电线、光缆、RF等等，或者上述的任意合适的组合。The program code contained on the computer-readable medium can be transmitted by any suitable medium, including, but not limited to, wireless, wire, optical cable, RF, etc., or any suitable combination of the above.

可以以一种或多种程序设计语言或其组合来编写用于执行本申请操作的计算机程序代码，程序设计语言包括面向对象的程序设计语言—诸如Java、Smalltalk、C++，还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中，远程计算机可以通过任意种类的网络——包括局域网(Local Area Network；以下简称：LAN)或广域网(Wide Area Network；以下简称：WAN)连接到用户计算机，或者，可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。The computer program code used to perform the operations of this application can be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages—such as Java, Smalltalk, C++, and also conventional procedural programming languages. Programming language-such as "C" language or similar programming language. The program code can be executed entirely on the user's computer, partly on the user's computer, executed as an independent software package, partly on the user's computer and partly executed on a remote computer, or entirely executed on the remote computer or server. In the case of a remote computer, the remote computer can be connected to the user's computer through any kind of network, including local area network (Local Area Network; hereinafter referred to as: LAN) or Wide Area Network (hereinafter referred to as: WAN), or Connect to an external computer (for example, use an Internet service provider to connect via the Internet).

在本说明书的描述中，参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本申请的至少一个实施例或示例中。在本说明书中，对上述术语的示意性表述不必须针对的是相同的实施例或示例。而且，描述的具体特征、结构、材料或者特点可以在任一个或多个实施例或示例中以合适的方式结合。此外，在不相互矛盾的情况下，本领域的技术人员可以将本说明书中描述的不同实施例或示例以及不同实施例或示例的特征进行结合和组合。In the description of this specification, descriptions with reference to the terms "one embodiment", "some embodiments", "examples", "specific examples", or "some examples" etc. mean specific features described in conjunction with the embodiment or example , The structure, materials, or characteristics are included in at least one embodiment or example of the present application. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment or example. Moreover, the described specific features, structures, materials or characteristics may be combined in any one or more embodiments or examples in a suitable manner. In addition, those skilled in the art can combine and combine the different embodiments or examples and the features of the different embodiments or examples described in this specification without contradicting each other.

此外，术语“第一”、“第二”仅用于描述目的，而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此，限定有“第一”、“第二”的特征可以明示或者隐含地包括至少一个该特征。在本申请的描述中，“多个”的含义是至少两个，例如两个，三个等，除非另有明确具体的限定。In addition, the terms "first" and "second" are only used for descriptive purposes, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Therefore, the features defined with "first" and "second" may explicitly or implicitly include at least one of the features. In the description of the present application, "a plurality of" means at least two, such as two, three, etc., unless specifically defined otherwise.

流程图中或在此以其他方式描述的任何过程或方法描述可以被理解为，表示包括一个或更多个用于实现定制逻辑功能或过程的步骤的可执行指令的代码的模块、片段或部分，并且本申请的优选实施方式的范围包括另外的实现，其中可以不按所示出或讨论的顺序，包括根据所涉及的功能按基本同时的方式或按相反的顺序，来执行功能，这应被本申请的实施例所属技术领域的技术人员所理解。Any process or method description described in the flowchart or described in other ways herein can be understood as a module, segment or part of code that includes one or more executable instructions for implementing custom logic functions or steps of the process , And the scope of the preferred embodiments of the present application includes additional implementations, which may not be in the order shown or discussed, including performing functions in a substantially simultaneous manner or in the reverse order according to the functions involved. This should It is understood by those skilled in the art to which the embodiments of the present application belong.

取决于语境，如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”或“响应于检测”。类似地，取决于语境，短语“如果确定”或“如果检测(陈述的条件或事件)”可以被解释成为“当确定时”或“响应于确定”或“当检测(陈述的条件或事件)时”或“响应于检测(陈述的条件或事件)”。Depending on the context, the word "if" as used herein can be interpreted as "when" or "when" or "in response to determination" or "in response to detection". Similarly, depending on the context, the phrase "if determined" or "if detected (statement or event)" can be interpreted as "when determined" or "in response to determination" or "when detected (statement or event) )" or "in response to detection (statement or event)".

需要说明的是，本申请实施例中所涉及的终端可以包括但不限于个人计算机(Personal Computer；以下简称：PC)、个人数字助理(Personal Digital Assistant；以下简称：PDA)、无线手持设备、平板电脑(Tablet Computer)、手机、MP3播放器、MP4播放器等。It should be noted that the terminals involved in the embodiments of the present application may include, but are not limited to, personal computers (Personal Computer; hereinafter referred to as PC), Personal Digital Assistants (Personal Digital Assistant; hereinafter referred to as PDA), wireless handheld devices, and tablets Computer (Tablet Computer), mobile phone, MP3 player, MP4 player, etc.

在本申请所提供的几个实施例中，应该理解到，所揭露的系统、装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如，多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be combined or may be Integrate into another system, or some features can be ignored or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.

另外，在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用硬件加软件功能单元的形式实现。In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.

上述以软件功能单元的形式实现的集成的单元，可以存储在一个计算机非易失性可读存储介质中。上述软件功能单元存储在一个存储介质中，包括若干指令用以使得一台计算机装置(可以是个人计算机，服务器，或者网络装置等)或处理器(Processor)执行本申请各个实施例方法的部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(Read-Only Memory；以下简称：ROM)、随机存取存储器(Random Access Memory；以下简称：RAM)、磁碟或者光盘等各种可以存储程序代码的介质。The above-mentioned integrated unit implemented in the form of a software functional unit may be stored in a computer non-volatile readable storage medium. The above-mentioned software functional unit is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (Processor) execute part of the steps of the methods in the various embodiments of the present application . The aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory; hereinafter referred to as ROM), random access memory (Random Access Memory; hereinafter referred to as RAM), magnetic disks or optical disks, etc. A medium that can store program codes.

以上仅为本申请的较佳实施例而已，并不用以限制本申请，凡在本申请的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本申请保护的范围之内。The above are only preferred embodiments of this application, and are not intended to limit this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the protection of this application. Within range.

Claims

A speech recognition method, characterized in that the method includes:

Obtain the target voice to be recognized;

Extracting waveform characteristics and pitch characteristics corresponding to the target speech in each frame;

The waveform feature and the pitch feature corresponding to each frame of the target speech are sequentially input into the trained speech recognition model; wherein, the speech recognition model includes an encoding sub-model, a first decoding sub-model and a second decoding A sub-model, the encoding sub-model includes a convolutional neural network and a bidirectional long short-term memory network, the first decoding sub-model includes a speech-text matching unit, the second decoding sub-model includes a text context matching unit, the speech -The text matching unit includes CTC loss function and attention model;

According to the output of the voice recognition model, the text corresponding to the target voice to be recognized is generated.

The method of claim 1, wherein the speech recognition model is trained through the following steps:

Acquiring a reference voice and a reference text corresponding to the reference voice;

Extracting the waveform feature and the pitch feature corresponding to the reference voice in each frame;

The waveform feature and the pitch feature corresponding to one frame of the reference voice are input to the convolutional neural network to obtain the convolutional feature of the reference voice; wherein, the convolutional neural network includes 4 layers of convolution Layer and 2 maximum pooling layer;

Sequentially input the convolution features corresponding to the reference speech in each frame to the two-way long and short-term memory network, so as to obtain the time sequence characteristics of the reference speech;

Inputting the timing characteristics of the reference voice into the CTC loss function and the attention model respectively, and sequentially outputting the text corresponding to each frame of the reference voice;

Input the text corresponding to the reference voice that has been output into the trained text context matching unit;

According to the output of the speech-text matching unit, for the text context matching unit and the reference text, the convolutional neural network, the two-way long-short-term memory network, the speech-text matching unit, and the reference text The parameters of the text context matching unit are trained.

3. The method of claim 2, wherein the text context matching unit comprises a cyclic neural network language model, and the text context matching unit is trained through the following steps:

Get reference text;

Sequentially inputting the reference text into the recurrent neural network language model;

Training the parameters of the recurrent neural network language model according to the context of the reference text.

The method according to claim 3, wherein the parameters of the speech-text matching unit include a first weight corresponding to the speech-text matching unit, and the parameters of the text context matching unit include all The second weight corresponding to the text context matching unit.

The method of claim 4, wherein the parameters of the speech-text matching unit further include a third weight corresponding to the CTC loss function and a fourth weight corresponding to the attention model.

A speech recognition device, characterized in that the device includes:

The first acquiring module is used to acquire the target voice to be recognized;

The first extraction module is used to extract the waveform characteristics and pitch characteristics corresponding to the target speech in each frame;

The first input module is used to input the waveform feature and the pitch feature corresponding to each frame of the target speech into the trained speech recognition model sequentially; wherein, the speech recognition model includes an encoding sub-model, and the first A decoding sub-model and a second decoding sub-model, the coding sub-model includes a convolutional neural network and a bidirectional long-short-term memory network, the first decoding sub-model includes a speech-text matching unit, and the second decoding sub-model includes a text A context matching unit, where the speech-text matching unit includes a CTC loss function and an attention model;

The generating module is used to generate the text corresponding to the target voice to be recognized according to the output of the voice recognition model.

The device according to claim 6, wherein the device further comprises:

The second acquiring module is configured to acquire a reference voice and a reference text corresponding to the reference voice;

The second extraction module is configured to extract the waveform feature and the pitch feature corresponding to the reference voice in each frame;

The second input module is configured to input the waveform feature and the pitch feature corresponding to one frame of the reference voice into the convolutional neural network to obtain the convolution feature of the reference voice; wherein, the convolution The neural network includes 4 convolutional layers and 2 maximum pooling layers;

A third input module, configured to sequentially input the convolutional features corresponding to the reference voice in each frame into the bidirectional long-term and short-term memory network to obtain the timing features of the reference voice;

A fourth input module, configured to input the timing characteristics of the reference voice into the CTC loss function and the attention model, respectively, to sequentially output the text corresponding to each frame of the reference voice;

A fifth input module, configured to input the text corresponding to the reference voice that has been output into the trained text context matching unit;

The first training module is configured to perform, according to the output of the speech-text matching unit, to the text context matching unit, and the reference text, to the convolutional neural network, the bidirectional long short-term memory network, and the The parameters of the speech-text matching unit and the text context matching unit are trained.

8. The device according to claim 7, wherein the text context matching unit comprises a cyclic neural network language model, and the device further comprises:

The third obtaining module is used to obtain the reference text;

A sixth input module, configured to sequentially input the reference text into the cyclic neural network language model;

The second training module is used to train the parameters of the recurrent neural network language model according to the context of the reference text.

The device of claim 8, wherein the parameter of the speech-text matching unit includes a first weight corresponding to the speech-text matching unit, and the parameter of the text context matching unit includes all The second weight corresponding to the text context matching unit.

9. The device of claim 9, wherein the parameters of the speech-text matching unit further comprise a third weight corresponding to the CTC loss function, and a fourth weight corresponding to the attention model.

A computer non-volatile readable storage medium, wherein the computer non-volatile readable storage medium includes a stored program, wherein the computer non-volatile readable storage medium is controlled when the program is running Perform the following steps on the device:

Obtain the target voice to be recognized;

11. The computer non-volatile readable storage medium according to claim 11, wherein when the program is running, controlling the device where the computer non-volatile readable storage medium is located further executes the following steps:

The computer non-volatile readable storage medium according to claim 12, wherein, when the program is running, controlling the device where the computer non-volatile readable storage medium is located further executes the following steps:

Get reference text;

The computer non-volatile readable storage medium according to claim 13, wherein the parameter of the speech-text matching unit includes a first weight corresponding to the speech-text matching unit, and the text context The parameter of the matching unit includes a second weight corresponding to the text context matching unit.

The computer non-volatile readable storage medium according to claim 14, wherein the parameters of the speech-text matching unit include a first weight corresponding to the speech-text matching unit, and the text context The parameter of the matching unit includes a second weight corresponding to the text context matching unit.

A computer device includes a memory and a processor, the memory is used to store information including program instructions, the processor is used to control the execution of the program instructions, and the feature is that the program instructions are implemented when the program instructions are loaded and executed by the processor The following steps:

Obtain the target voice to be recognized;

The computer device according to claim 16, wherein the following steps are further implemented when the program instructions are loaded and executed by the processor:

The computer device according to claim 17, wherein the program instructions further implement the following steps when being loaded and executed by the processor:

Get reference text;

The computer device according to claim 18, wherein the parameter of the speech-text matching unit comprises a first weight corresponding to the speech-text matching unit, and the parameter of the text context matching unit comprises The second weight corresponding to the text context matching unit.

19. The computer device of claim 19, wherein the parameters of the speech-text matching unit further comprise a third weight corresponding to the CTC loss function, and a fourth weight corresponding to the attention model.