
WO2021051572A1 - Voice recognition method and apparatus, and computer device - Google Patents

Voice recognition method and apparatus, and computer device

Info

Publication number
WO2021051572A1
WO2021051572A1 (PCT/CN2019/117761)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
frame
voice
segment
voiceprint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/117761
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of WO2021051572A1 publication Critical patent/WO2021051572A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/04 — Training, enrolment or model building
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 — Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/45 — Speech or voice analysis techniques characterised by the type of analysis window

Definitions

  • This application relates to the field of speech recognition technology, and in particular to a speech recognition method, device, computer equipment, and non-volatile computer-readable storage medium.
  • Voice recognition is a biometric recognition technology that automatically identifies the user corresponding to a voice, based on parameters in the speech waveform that reflect the speaker's physiological or behavioral characteristics.
  • Such recognition generally uses the voiceprint features in the voice signal.
  • Existing windowing processes apply windows such as the Hanning, Hamming, triangular, or Gaussian window to the voice data.
  • The inventor realized that these existing windowing methods almost always modify the original speech signal, which loses part of the voiceprint feature information and reduces the accuracy of speech recognition.
  • To address this, the present application proposes a speech recognition method, device, computer equipment, and non-volatile computer-readable storage medium. The method obtains a speech segment and divides it into frames to obtain each frame of speech data; windows each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; extracts the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; calculates the distance between the MFCC and a voiceprint discrimination vector; and, when the distance is less than a preset threshold, determines that the recognition result of the speech segment is passed.
  • First, the present application provides a voice recognition method, which includes: acquiring a voice segment and dividing it into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is passed.
  • The present application further provides a voice recognition device, which includes: a framing module for acquiring a voice segment and dividing it into frames to obtain each frame of voice data; a windowing module for sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment; an extraction module for extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames; a calculation module for calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and a recognition module for determining, when the distance is less than a preset threshold, that the recognition result of the voice segment is passed.
  • The present application further proposes a computer device, which includes a memory and a processor. The memory stores computer-readable instructions that can run on the processor, and when executed by the processor, the instructions implement the steps of: acquiring a voice segment and dividing it into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is passed.
  • The present application also provides a non-volatile computer-readable storage medium storing computer-readable instructions executable by at least one processor, so that the at least one processor executes the steps of: acquiring a voice segment and dividing it into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is passed.
  • The speech recognition method, device, computer equipment, and non-volatile computer-readable storage medium proposed in this application obtain a speech segment, divide it into frames to obtain each frame of speech data, window each frame according to a preset smooth windowing algorithm to obtain windowed speech frames, extract the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames, and calculate the distance between the MFCC and a voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be passed.
  • Because the smooth windowing only slightly modifies the speech signal, the feature vector of the speech segment can be calculated more accurately, thereby improving the accuracy of speech recognition.
  • Fig. 1 is a schematic diagram of an optional hardware architecture of the computer device of the present application.
  • Fig. 2 is a schematic diagram of the program modules of an embodiment of the speech recognition device of the present application.
  • Fig. 3 is a schematic flowchart of an embodiment of the speech recognition method of the present application.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the computer device 1 of the present application.
  • the computer device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can communicate with each other through a system bus.
  • The computer device 1 is connected to a network through the network interface 13 (not shown in Fig. 1), through which it connects to other terminal devices such as mobile terminals, mobile telephones, user equipment (UE), handsets, portable equipment, and PC terminals.
  • The network may be an intranet, the Internet, a GSM, WCDMA, 4G, or 5G network, Bluetooth, Wi-Fi, a telephone network, or another wireless or wired network.
  • FIG. 1 only shows the computer device 1 with components 11-13, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • The memory 11 includes at least one type of non-volatile computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, or optical discs.
  • the memory 11 may be an internal storage unit of the computer device 1, for example, a hard disk or a memory of the computer device 1.
  • The memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card.
  • the memory 11 may also include both the internal storage unit of the computer device 1 and its external storage device.
  • the memory 11 is generally used to store an operating system and various application software installed in the computer device 1, such as the program code of the voice recognition apparatus 200.
  • the memory 11 can also be used to temporarily store various types of data that have been output or will be output.
  • The memory 11 stores computer-readable instructions executable by at least one processor, so that the at least one processor executes the steps of: acquiring a voice segment and dividing it into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is passed.
  • the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 12 is generally used to control the overall operation of the computer device 1, such as performing data interaction or communication-related control and processing.
  • the processor 12 is configured to run the program code or process data stored in the memory 11, for example, to run the voice recognition device 200.
  • The network interface 13 may include a wireless or wired network interface, and is usually used to establish a communication connection between the computer device 1 and other terminal devices such as mobile terminals, mobile telephones, user equipment, handsets, portable equipment, and PC terminals.
  • When a voice recognition device 200 is installed and running in the computer device 1, it can acquire a voice segment and divide it into frames to obtain each frame of voice data, then window each frame of voice data according to the preset smooth windowing algorithm to obtain windowed voice frames; it then extracts the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames of the voice segment and calculates the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than a preset threshold, it determines that the recognition result of the voice information is passed.
  • Accordingly, this application proposes a voice recognition device 200.
  • FIG. 2 is a program module diagram of an embodiment of the speech recognition device 200 of the present application.
  • The speech recognition device 200 includes a series of computer-readable instructions stored in the memory 11; when these instructions are executed by the processor 12, the speech recognition functions of the various embodiments of the present application can be implemented.
  • The speech recognition device 200 may be divided into one or more modules based on the specific operations implemented by the various parts of the computer-readable instructions. For example, in Fig. 2, the speech recognition device 200 is divided into a framing module 201, a windowing module 202, an extraction module 203, a calculation module 204, and a recognition module 205, wherein:
  • the framing module 201 is used to obtain a voice segment, and framing the voice segment to obtain each frame of voice data.
  • The computer device 1 is connected to a user terminal, such as a mobile phone, a mobile terminal, or a PC terminal, and obtains the user's voice information through that terminal.
  • Alternatively, the computer device 1 may directly provide a pickup unit to collect the user's voice data.
  • The voice data includes at least one voice segment, so the framing module 201 can obtain the voice segment. After obtaining the speech segment, the framing module 201 divides it into frames to obtain the speech data of each frame. Due to the physiological characteristics of the human vocal tract, the high-frequency part of a speech segment is often attenuated; therefore, in other embodiments, the framing module 201 also performs pre-emphasis processing on the speech segment to compensate for its high-frequency components.
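The framing and pre-emphasis steps above can be sketched as follows. The patent gives no concrete parameters; the 0.97 pre-emphasis coefficient and the 25 ms frame / 10 ms hop at 16 kHz used here are conventional assumptions, not values from the source.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """High-frequency compensation: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.random.default_rng(0).normal(size=16000)   # one second of placeholder audio
frames = frame_signal(pre_emphasize(x))
print(frames.shape)   # (98, 400)
```

Each row of `frames` is one frame of voice data, ready for the windowing step.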
  • the windowing module 202 is configured to sequentially window each frame of speech data of the speech segment according to a preset smooth windowing algorithm to obtain a windowed speech frame of the speech segment.
  • After the speech segment is divided into frames, the windowing module 202 sequentially windows each frame of speech data of the speech segment according to the preset smooth windowing algorithm, obtaining the windowed speech frames of the speech segment.
  • In one embodiment, the smooth windowing algorithm is given by a formula in which:
  • T1 is the time length of the windowed speech frame, and
  • w(t) represents the weighting applied to the speech signal at time t within the time range of the frame.
  • When the computer device 1 windows each frame of speech data, it first obtains the frequency distribution of the environmental noise in the speech data, automatically adjusts the variable K accordingly, and then windows the frame in segments according to K.
  • The segmented windowing is as follows: a cosine-like taper is applied to the beginning and end of the speech frame to reduce environmental noise interference in the low-frequency part, while a rectangle-like (flat) window is applied to the middle of the speech frame to avoid the high-frequency noise that an abrupt transition would generate.
  • Specifically, the computer device 1 may randomly select two speech sub-frames from the speech segment in advance, convert them to the frequency domain by Fourier transform, detect the frequency distribution of the environmental noise in them, and then set K·T1 above the position corresponding to the maximum frequency of the environmental noise.
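The patent's window formula is not reproduced in this text, but the segmented scheme it describes — cosine-like tapers at both ends, a flat rectangular middle, with the taper extent controlled by K — matches a tapered-cosine (Tukey-style) window. A sketch under that assumption, with a hypothetical taper fraction `k` standing in for the patent's K:

```python
import numpy as np

def smooth_window(n, k=0.25):
    """Tapered-cosine window: cosine ramps over the first and last k*n samples,
    flat (rectangular, value 1) in the middle, per the segmented scheme above."""
    w = np.ones(n)
    taper = int(k * n)
    if taper > 0:
        ramp = 0.5 * (1 - np.cos(np.pi * np.arange(taper) / taper))
        w[:taper] = ramp          # rising cosine taper at the start
        w[-taper:] = ramp[::-1]   # falling cosine taper at the end
    return w

w = smooth_window(400, k=0.25)
# the flat middle leaves the original speech samples there unmodified,
# which is the stated motivation for this window over Hamming/Hanning
assert np.all(w[100:300] == 1.0)
```

A larger `k` widens the tapered region; the patent ties this choice to the detected environmental-noise frequencies.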
  • the extraction module 203 is configured to extract the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment.
  • After all speech sub-frames of the speech segment are windowed, the extraction module 203 further processes the windowed speech frames of the speech segment to extract the Mel-frequency cepstral coefficient vector MFCC.
  • Specifically, the extraction module 203 first performs a discrete Fourier transform on each windowed speech frame to convert it from the time domain to the frequency domain, and then proceeds according to the following formula:
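The Mel-filterbank formula itself is elided above. As a hedged sketch of the conventional MFCC pipeline the surrounding text describes — power spectrum, triangular Mel filterbank, log, then DCT — the following may help; the 512-point FFT, 26 filters, and 13 coefficients are standard choices, not values taken from the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def dct2_ortho(x):
    """Orthonormal DCT-II, conventionally applied to the log Mel energies."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * (np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x)
    c[0] /= np.sqrt(2.0)
    return c

def mfcc(frame, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Per-frame MFCC: power spectrum -> triangular Mel filterbank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_energy = np.log(fbank @ spec + 1e-10)
    return dct2_ortho(log_energy)[:n_ceps]

coeffs = mfcc(np.random.default_rng(0).normal(size=400))
print(coeffs.shape)   # (13,)
```

In the device described here, this per-frame vector (or a pooled version of it over the segment's frames) is what gets compared against the voiceprint discrimination vector.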
  • the calculation module 204 is configured to calculate the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training.
  • the recognition module 205 is configured to determine that the recognition result of the speech segment is passed when the distance is less than a preset threshold.
  • The computer device 1 samples the user's voice information in advance and inputs the sampled voice information into a voiceprint feature training model for training, obtaining the voiceprint discrimination vector corresponding to the user. Therefore, after the extraction module 203 extracts the MFCC of the speech segment, the calculation module 204 further calculates the distance between the MFCC and the voiceprint discrimination vector.
  • In this embodiment, the distance is the cosine distance, calculated according to the cosine-distance formula in which:
  • x represents the standard voiceprint discrimination vector, and
  • y represents the current voiceprint discrimination vector.
  • The calculation module 204 uses the cosine-distance formula to calculate the distance between the MFCC of the speech segment and the preset voiceprint discrimination vector, and the recognition module 205 then compares this distance with a preset threshold; when the distance is less than the threshold, the recognition result of the speech segment is determined to be passed.
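The cosine-distance formula is not reproduced in this text; the standard form is distance = 1 − (x·y)/(‖x‖‖y‖), which is 0 for identical directions and grows as the vectors diverge. A minimal sketch of the compare-and-threshold step, with a hypothetical threshold of 0.3 (the patent does not give a value):

```python
import numpy as np

def cosine_distance(x, y):
    """1 - cos(theta): 0 for identical directions, up to 2 for opposite ones."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def recognize(mfcc_vec, voiceprint_vec, threshold=0.3):
    """Pass when the segment's MFCC vector is close enough to the enrolled voiceprint."""
    return cosine_distance(mfcc_vec, voiceprint_vec) < threshold

v = np.array([1.0, 2.0, 3.0])
assert recognize(v, v)        # identical vectors: distance 0, passes
assert not recognize(v, -v)   # opposite vectors: distance 2, rejected
```

Note that cosine distance ignores vector magnitude, which is why it suits comparing feature vectors extracted from recordings of differing loudness.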
  • In another embodiment, the computer device 1 trains the voiceprint discrimination vectors of different users through a GMM in advance and calculates the distance from the MFCC to each of them, then selects the first voiceprint discrimination vector corresponding to the smallest distance that is below the preset threshold, and takes the first user corresponding to that vector as the target user of the voice segment.
  • In this embodiment, the computer device 1 also pre-trains a higher-accuracy GMM (Gaussian Mixture Model) that serves as a universal background model (UBM) and can be used to extract the voiceprint discrimination vector from speech. The GMM is trained on a series of sample data to improve the training accuracy of the voiceprint discrimination vector.
  • GMM: Gaussian Mixture Model
  • UBM: Universal Background Model
  • Voice data samples can be collected from the voices of different people in different environments (each corresponding to a voiceprint discrimination vector); such samples are used to train a universal background model that characterizes general speech characteristics.
  • Each voice data sample is processed separately to extract its preset type of voiceprint feature, and a voiceprint feature vector is constructed for each sample based on that feature;
  • If the verification accuracy reaches a preset accuracy rate (for example, 98.5%), the model training ends; otherwise, the number of voice data samples is increased and the above steps B2, B3, B4, and B5 are re-executed on the enlarged sample set.
  • In this way, the computer device 1 first trains on the collected user voice information with the trained GMM to obtain the corresponding voiceprint discrimination vector, and the calculation module 204 then uses this voiceprint discrimination vector to calculate the distance to the MFCC corresponding to the voice segment, improving accuracy.
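The GMM-UBM training described above can be sketched with scikit-learn. The patent does not specify how the discrimination vector is derived from the trained GMM; the posterior-weighted per-component means below (a crude supervector-style construction) are one common approach and purely illustrative, as are the 8 components and 13-dimensional features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder MFCC features pooled from many speakers (rows = frames)
ubm_features = rng.normal(size=(2000, 13))

# Universal background model: one GMM fit on everyone's speech
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
ubm.fit(ubm_features)

# A per-user voiceprint vector: posterior-weighted mean of the user's frames
user_features = rng.normal(loc=0.5, size=(300, 13))
post = ubm.predict_proba(user_features)                     # (300, 8) responsibilities
weighted_means = post.T @ user_features / (post.sum(axis=0)[:, None] + 1e-10)
voiceprint = weighted_means.ravel()                         # concatenated: 8 * 13 = 104 dims
print(voiceprint.shape)   # (104,)
```

At recognition time, the same construction applied to a new segment yields a vector that can be compared to the enrolled `voiceprint` by cosine distance.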
  • In this way, the computer device 1 can obtain a voice segment, divide it into frames to obtain each frame of voice data, window each frame according to the preset smooth windowing algorithm to obtain windowed voice frames, extract the Mel-frequency cepstral coefficient vector MFCC of the windowed voice frames, and calculate the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the voice segment is determined to be passed.
  • Because the speech signal is only slightly modified, the feature vector of the speech segment can be calculated more accurately, thereby improving the accuracy of speech recognition.
  • In addition, this application proposes a voice recognition method, which is applied to a computer device.
  • FIG. 3 is a schematic flowchart of an embodiment of a speech recognition method according to the present application.
  • the execution order of the steps in the flowchart shown in FIG. 3 can be changed, and some steps can be omitted.
  • Step S500: Acquire a voice segment, divide the voice segment into frames, and obtain each frame of voice data.
  • The computer device is connected to a user terminal, such as a mobile phone, a mobile terminal, or a PC terminal, and obtains the user's voice information through that terminal.
  • Alternatively, the computer device may directly provide a pickup unit to collect the user's voice data, and the voice data includes at least one voice segment, so the computer device can obtain the voice segment. After acquiring the voice segment, the computer device divides it into frames to obtain the voice data of each frame.
  • Due to the physiological characteristics of the human vocal tract, the high-frequency part of a speech segment is often attenuated; therefore, the computer device also performs pre-emphasis processing on the speech segment to compensate for its high-frequency components.
  • Step S502: Sequentially window each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment.
  • After dividing the speech segment into frames, the computer device sequentially windows each frame of speech data of the speech segment according to the preset smooth windowing algorithm, obtaining the windowed speech frames of the speech segment.
  • In one embodiment, the smooth windowing algorithm is given by a formula in which:
  • T1 is the time length of the windowed speech frame, and
  • w(t) represents the weighting applied to the speech signal at time t within the time range of the frame.
  • When the computer device windows each frame of speech data, it first obtains the frequency distribution of the environmental noise in the speech data, automatically adjusts the variable K accordingly, and then windows the frame in segments according to K.
  • The segmented windowing is as follows: a cosine-like taper is applied to the start and end of the speech frame to reduce environmental noise interference in the low-frequency part, while a rectangle-like (flat) window is applied to the middle of the speech frame to avoid the high-frequency noise that an abrupt transition would generate.
  • Specifically, the computer device may randomly select two speech sub-frames from the speech segment in advance, convert them to the frequency domain by Fourier transform, detect the frequency distribution of the environmental noise in them, and then set K·T1 above the position corresponding to the maximum frequency of the environmental noise.
  • Step S504: Extract the Mel-frequency cepstral coefficient vector MFCC of the windowed speech frames of the speech segment.
  • After the computer device windows all speech sub-frames of the speech segment, it further processes the windowed speech frames of the speech segment to extract the Mel-frequency cepstral coefficient vector MFCC.
  • Specifically, the computer device first performs a discrete Fourier transform on each windowed speech frame to convert it from the time domain to the frequency domain, and then proceeds according to the following formula:
  • Step S506: Calculate the distance between the MFCC and the voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance.
  • Step S508: When the distance is less than a preset threshold, determine that the recognition result of the speech segment is passed.
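The S500–S508 flow can be sketched end-to-end. This is a simplified stand-in: a plain averaged log spectrum replaces the full MFCC, the tapered-cosine interpretation of the smooth window is assumed, and the 0.3 threshold and framing parameters are hypothetical.

```python
import numpy as np

def extract_feature(signal, frame_len=400, hop=160, k=0.25, n_fft=512):
    """S500 + S502 + S504 collapsed: frame, taper-window, average log spectrum.
    (A simplified spectral feature stands in for the full MFCC here.)"""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])
    w = np.ones(frame_len)
    t = int(k * frame_len)
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(t) / t))
    w[:t], w[-t:] = ramp, ramp[::-1]
    spec = np.abs(np.fft.rfft(frames * w, n_fft)) ** 2
    return np.log(spec + 1e-10).mean(axis=0)

def recognize(signal, voiceprint, threshold=0.3):
    """S506 + S508: cosine distance against the enrolled voiceprint, thresholded."""
    f = extract_feature(signal)
    d = 1.0 - f @ voiceprint / (np.linalg.norm(f) * np.linalg.norm(voiceprint))
    return d < threshold

rng = np.random.default_rng(1)
x = rng.normal(size=16000)
enrolled = extract_feature(x)     # enrol with the segment's own feature
print(recognize(x, enrolled))     # True: distance is ~0 against itself
```

A real system would enrol `voiceprint` from separately recorded sampled speech (via the trained voiceprint feature model), not from the test segment itself; the self-comparison here only demonstrates the pass path.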
  • The computer device samples the user's voice information in advance and inputs the sampled voice information into the voiceprint feature training model for training, obtaining the voiceprint discrimination vector corresponding to the user. Therefore, after extracting the MFCC of the speech segment, the computer device further calculates the distance between the MFCC and the voiceprint discrimination vector.
  • In this embodiment, the distance is the cosine distance, calculated according to the cosine-distance formula in which:
  • x represents the standard voiceprint discrimination vector, and
  • y represents the current voiceprint discrimination vector.
  • The computer device uses the cosine-distance formula to calculate the distance between the MFCC of the speech segment and the preset voiceprint discrimination vector, and then compares this distance with a preset threshold; when the distance is less than the threshold, the recognition result of the speech segment is determined to be passed.
  • In another embodiment, the computer device trains the voiceprint discrimination vectors of different users through the GMM in advance and calculates the distance from the MFCC to each of them, then selects the first voiceprint discrimination vector corresponding to the smallest distance that is below the preset threshold, and takes the first user corresponding to that vector as the target user of the voice segment.
  • In this embodiment, the computer device also pre-trains a higher-accuracy GMM (Gaussian Mixture Model) that serves as a universal background model (UBM) and can be used to extract the voiceprint discrimination vector from speech. The GMM is trained on a series of sample data to improve the training accuracy of the voiceprint discrimination vector.
  • GMM: Gaussian Mixture Model
  • UBM: Universal Background Model
  • Voice data samples can be collected from the voices of different people in different environments (each corresponding to a voiceprint discrimination vector); such samples are used to train a universal background model that characterizes general speech characteristics.
  • Each voice data sample is processed separately to extract its preset type of voiceprint feature, and a voiceprint feature vector is constructed for each sample based on that feature;
  • If the verification accuracy reaches a preset accuracy rate (for example, 98.5%), the model training ends; otherwise, the number of voice data samples is increased and the above steps B2, B3, B4, and B5 are re-executed on the enlarged sample set.
  • In this way, the computer device first trains on the collected user voice information with the trained GMM to obtain the corresponding voiceprint discrimination vector, and then uses this voiceprint discrimination vector to calculate the distance to the MFCC corresponding to the voice segment, thereby improving accuracy.
  • The speech recognition method proposed in this embodiment obtains a speech segment, divides it into frames to obtain each frame of speech data, windows each frame according to a preset smooth windowing algorithm to obtain windowed speech frames, extracts the Mel-frequency cepstral coefficient vector MFCC of the windowed speech frames, and calculates the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be passed.
  • Because the speech signal is only slightly modified, the feature vector of the speech segment can be calculated more accurately, thereby improving the accuracy of speech recognition.
  • The technical solution of this application, in essence or in the part that contributes over the existing technology, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice recognition method and apparatus, a computer device and a non-volatile computer-readable storage medium. Said method comprises: acquiring a voice segment, and framing the voice segment to obtain each frame of voice data (S500); sequentially windowing, according to a preset smooth windowing algorithm, each frame of voice data of the voice segment, so as to obtain the windowed voice frames of the voice segment (S502); extracting a Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames of the voice segment (S504); calculating the distance between the MFCC and a voiceprint discrimination vector (S506); and when the distance is less than a preset threshold, determining that the recognition result of the voice segment is a pass (S508). The voice recognition method can calculate the feature vectors in a voice segment more accurately, thereby improving the accuracy of voice recognition.

Description

Speech recognition method, apparatus, and computer device

This application claims priority to Chinese patent application No. 201910871726.5, filed with the Chinese Patent Office on September 16, 2019 and entitled "Speech recognition method, apparatus, and computer device", the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to the field of speech recognition technology, and in particular to a speech recognition method and apparatus, a computer device, and a non-volatile computer-readable storage medium.

Background

Speech recognition is a type of biometric technology that automatically identifies the user corresponding to a voice, based on speech parameters in the waveform that reflect physiological or behavioral characteristics of the speech. In the prior art, speech recognition generally relies on the voiceprint features in the speech signal. In the voiceprint feature extraction stage, existing windowing methods apply a Hanning, Hamming, triangular, or Gaussian window to the speech data. The inventor realized that almost all of these windowing methods modify the original speech signal, causing the loss of part of the voiceprint feature information and reducing the accuracy of speech recognition.

Summary

In view of this, this application proposes a speech recognition method and apparatus, a computer device, and a non-volatile computer-readable storage medium, which can acquire a speech segment, divide it into frames to obtain each frame of speech data, and window each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment is then extracted, and the distance between the MFCC and a voiceprint discrimination vector is calculated; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be a pass. In this way, the feature vectors of a speech segment can be computed more accurately while only slightly modifying the speech signal, thereby improving the accuracy of speech recognition.

First, to achieve the above objective, this application provides a speech recognition method, which includes:

acquiring a speech segment and dividing it into frames to obtain each frame of speech data; windowing each frame of speech data of the speech segment in turn according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training; and when the distance is less than a preset threshold, determining that the recognition result of the speech segment is a pass.

In addition, to achieve the above objective, this application also provides a speech recognition apparatus, which includes:

a framing module, configured to acquire a speech segment and divide it into frames to obtain each frame of speech data; a windowing module, configured to window each frame of speech data of the speech segment in turn according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment; an extraction module, configured to extract the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; a calculation module, configured to calculate the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training; and a recognition module, configured to determine that the recognition result of the speech segment is a pass when the distance is less than a preset threshold.

Further, this application also proposes a computer device that includes a memory and a processor. The memory stores computer-readable instructions executable on the processor, and when executed by the processor the instructions implement the following steps:

acquiring a speech segment and dividing it into frames to obtain each frame of speech data; windowing each frame of speech data of the speech segment in turn according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training; and when the distance is less than a preset threshold, determining that the recognition result of the speech segment is a pass.

Further, to achieve the above objective, this application also provides a non-volatile computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor, so that the at least one processor performs the following steps:

acquiring a speech segment and dividing it into frames to obtain each frame of speech data; windowing each frame of speech data of the speech segment in turn according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training; and when the distance is less than a preset threshold, determining that the recognition result of the speech segment is a pass.

The speech recognition method and apparatus, computer device, and non-volatile computer-readable storage medium proposed in this application can acquire a speech segment and divide it into frames to obtain each frame of speech data, and then window each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment is then extracted, and the distance between the MFCC and a voiceprint discrimination vector is calculated; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be a pass. In this way, the feature vectors of a speech segment can be computed more accurately while only slightly modifying the speech signal, thereby improving the accuracy of speech recognition.

Description of the Drawings

Fig. 1 is a schematic diagram of an optional hardware architecture of the computer device of this application;

Fig. 2 is a schematic diagram of the program modules of an embodiment of the speech recognition apparatus of this application;

Fig. 3 is a schematic flowchart of an embodiment of the speech recognition method of this application.

Detailed Description

In order to make the purpose, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of this application.

It should be noted that descriptions involving "first", "second", and the like in this application are for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments can be combined with one another, but only on the basis that a person of ordinary skill in the art can realize the combination; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that such a combination does not exist and does not fall within the scope of protection claimed by this application.

Refer to Fig. 1, which is a schematic diagram of an optional hardware architecture of the computer device 1 of this application.

In this embodiment, the computer device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can communicate with each other through a system bus.

The computer device 1 is connected to a network (not shown in Fig. 1) through the network interface 13, and through the network to other terminal devices such as mobile terminals, mobile telephones, user equipment (UE), handsets, portable equipment, and PCs. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephone network.

It should be pointed out that Fig. 1 only shows the computer device 1 with components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

The memory 11 includes at least one type of non-volatile computer-readable storage medium, which includes flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and so on. In some embodiments, the memory 11 may be an internal storage unit of the computer device 1, for example a hard disk or internal memory of the computer device 1. In other embodiments, the memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card with which the computer device 1 is equipped. Of course, the memory 11 may also include both the internal storage unit of the computer device 1 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system and various application software installed on the computer device 1, such as the program code of the speech recognition apparatus 200. In addition, the memory 11 can also be used to temporarily store various types of data that have been output or are to be output.

The memory 11 stores computer-readable instructions that can be executed by at least one processor, so that the at least one processor performs the following steps:

acquiring a speech segment and dividing it into frames to obtain each frame of speech data; windowing each frame of speech data of the speech segment in turn according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training; and when the distance is less than a preset threshold, determining that the recognition result of the speech segment is a pass.

In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the computer device 1, such as performing control and processing related to data exchange or communication. In this embodiment, the processor 12 is used to run the program code or process the data stored in the memory 11, for example to run the speech recognition apparatus 200.

The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish communication connections between the computer device 1 and other terminal devices such as mobile terminals, mobile telephones, user equipment, handsets, portable equipment, and PCs.

In this embodiment, when the speech recognition apparatus 200 is installed and running on the computer device 1, it can acquire a speech segment and divide it into frames to obtain each frame of speech data, and window each frame of speech data according to the preset smooth windowing algorithm to obtain windowed speech frames; it then extracts the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment and calculates the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than the preset threshold, it determines that the recognition result of the voice information is a pass. In this way, the feature vectors of a speech segment can be computed more accurately, thereby improving the accuracy of speech recognition.

So far, the application environment of the embodiments of this application and the hardware structures and functions of the related devices have been described in detail. Hereinafter, various embodiments of this application are proposed based on the above application environment and related devices.

First, this application proposes a speech recognition apparatus 200.

Refer to Fig. 2, which is a program module diagram of an embodiment of the speech recognition apparatus 200 of this application.

In this embodiment, the speech recognition apparatus 200 includes a series of computer-readable instructions stored in the memory 11; when these instructions are executed by the processor 12, the speech recognition functions of the various embodiments of this application can be implemented. In some embodiments, the speech recognition apparatus 200 may be divided into one or more modules based on the specific operations implemented by the respective parts of the computer-readable instructions. For example, in Fig. 2, the speech recognition apparatus 200 may be divided into a framing module 201, a windowing module 202, an extraction module 203, a calculation module 204, and a recognition module 205, where:

The framing module 201 is used to acquire a speech segment and divide it into frames to obtain each frame of speech data.

In this embodiment, the computer device 1 is connected to a user terminal, such as a mobile phone, mobile terminal, or PC, and acquires the user's voice information through the user terminal. Of course, in other embodiments, the computer device 1 may also directly provide a pickup unit to collect the user's voice data, which includes at least one speech segment; the framing module 201 can thus obtain a speech segment. After obtaining the speech segment, the framing module 201 further divides it into frames to obtain the speech data of each frame. In addition, owing to the physiological characteristics of the human body, the high-frequency part of a speech segment is often suppressed; therefore, in other embodiments, the framing module 201 also performs pre-emphasis processing on the speech segment to compensate for its high-frequency components.
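The pre-emphasis and framing steps described above can be sketched as follows. The application does not specify a frame length, frame shift, or pre-emphasis coefficient; the 25 ms frames, 10 ms shift, and alpha = 0.97 below are common defaults used purely for illustration.

```python
import numpy as np

def preemphasize(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order pre-emphasis, y[n] = x[n] - alpha * x[n-1],
    boosting the high-frequency components of the speech segment."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D speech segment into overlapping frames; each row of
    the result is one frame of speech data."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    num_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(num_frames)])
```

With a one-second signal at 16 kHz, this produces 98 frames of 400 samples each.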

The windowing module 202 is used to window each frame of speech data of the speech segment in turn according to the preset smooth windowing algorithm, obtaining the windowed speech frames of the speech segment.

Specifically, after the framing module 201 divides the speech segment into frames, the windowing module 202 further windows each frame of speech data of the speech segment. In this embodiment, the windowing module 202 windows each frame of speech data in turn according to the preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment, where the smooth windowing algorithm is:

[Formula image PCTCN2019117761-appb-000001: piecewise definition of the window weighting function w(t)]

where T1 is the duration of the windowed speech frame, w(t) is the weighting value applied to the speech signal at time t within the frame, and K and K′ are constants with K < K′ and K + K′ = 1; K is set according to the environmental noise.

In this embodiment, when windowing each frame of speech data, the computer device 1 first obtains the frequency distribution of the environmental noise in the speech data and automatically adjusts the variable K accordingly, and then applies the window to the frame in segments according to K: the head and tail of the speech frame are windowed with a cosine-like taper to reduce the interference of low-frequency environmental noise, while the middle of the speech frame is windowed with a rectangular-like shape to avoid the high-frequency noise produced by abrupt changes. To adjust K automatically, the computer device 1 may randomly select two frames from the speech segment in advance, convert them to the frequency domain by Fourier transform, detect the frequency distribution of the environmental noise therein, and set KT1 above the maximum frequency of the environmental noise.
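The exact window formula is given only as an image in the published application and is not reproduced here. The sketch below merely illustrates the behavior the text describes (cosine-like tapers at the head and tail of the frame, a flat rectangular-like middle) using a Tukey-style construction in which the taper fraction k plays the role of the constant K; it is an approximation, not the patent's formula.

```python
import numpy as np

def smooth_window(frame_len: int, k: float = 0.25) -> np.ndarray:
    """Illustrative 'smooth' window: raised-cosine tapers over the first
    and last k * frame_len samples, flat (rectangular) in between."""
    taper = int(k * frame_len)
    w = np.ones(frame_len)
    if taper > 0:
        ramp = 0.5 * (1 - np.cos(np.pi * np.arange(taper) / taper))
        w[:taper] = ramp           # cosine-like rise at the frame head
        w[-taper:] = ramp[::-1]    # cosine-like fall at the frame tail
    return w
```

A larger k lengthens the tapers, which is consistent with the text's idea of raising K when the detected environmental noise occupies higher frequencies.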

The extraction module 203 is used to extract the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment.

Specifically, after the windowing module 202 has windowed all the frames of the speech segment, the extraction module 203 further processes the windowed speech frames of the speech segment to extract the Mel-frequency cepstral coefficient vector (MFCC). In this embodiment, the extraction module 203 first performs a discrete Fourier transform on each windowed speech frame to convert it from the time domain to the frequency domain, and then, according to the formula:

[Formula image PCTCN2019117761-appb-000002: mapping from the linear frequency domain to the Mel frequency domain]

maps the linear spectral domain of the windowed speech frame to the Mel spectral domain. The result is then fed into a bank of Mel triangular filters, the logarithmic energy of the signal output by each filter band is computed to obtain a log-energy sequence, and a discrete cosine transform is applied to the log-energy sequence, thereby extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frame.
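The Mel mapping formula also appears only as an image in the published application. The sketch below assumes the standard Mel scale, mel(f) = 2595 * log10(1 + f/700), and uses illustrative choices of 26 triangular filters and 13 cepstral coefficients; none of these values are specified by this application.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters with centers spaced uniformly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):
            fbank[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fbank[i - 1, j] = (right - j) / max(right - center, 1)
    return fbank

def mfcc_from_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """DFT -> Mel filterbank log energies -> DCT-II -> cepstral vector."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft
    log_e = np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ power + 1e-10)
    # DCT-II of the log-energy sequence yields the cepstral coefficients
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct @ log_e
```

Each windowed frame thus yields one 13-dimensional cepstral vector.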

The calculation module 204 is used to calculate the distance between the MFCC and the voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training. The recognition module 205 is used to determine that the recognition result of the speech segment is a pass when the distance is less than the preset threshold.

Specifically, the computer device 1 samples the user's voice information in advance and inputs the sampled voice information into the voiceprint feature training model for training, thereby obtaining the voiceprint discrimination vector corresponding to the user. Therefore, after the extraction module 203 extracts the MFCC of the speech segment, the calculation module 204 further calculates the distance between the MFCC and the voiceprint discrimination vector. The distance is a cosine distance, calculated by the formula:

[Formula image PCTCN2019117761-appb-000003: cosine distance between vectors x and y]

where x represents the standard voiceprint discrimination vector and y represents the current voiceprint discrimination vector. In this embodiment, the calculation module 204 uses the cosine-distance formula to calculate the distance between the MFCC of the speech segment and the preset voiceprint discrimination vector; the recognition module 205 then compares this distance with the preset threshold, and when the distance is less than the threshold, determines that the recognition result of the speech segment is a pass.

Specifically, the computer device 1 calculates the distance between the MFCC and each of the voiceprint discrimination vectors of different users trained in advance with a GMM, selects the first voiceprint discrimination vector corresponding to the smallest distance below the preset threshold, and takes the first user corresponding to the first voiceprint discrimination vector as the target user of the speech segment.
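A minimal sketch of the distance computation and target-user selection described above. Since the formula image is not reproduced here, the usual cosine-distance convention is assumed (1 minus cosine similarity, so smaller means more similar); the enrolled-vector dictionary and the threshold value are illustrative, not from the patent.

```python
import numpy as np

def cosine_distance(x: np.ndarray, y: np.ndarray) -> float:
    """1 - cosine similarity between two voiceprint vectors."""
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def identify(mfcc: np.ndarray, enrolled: dict, threshold: float = 0.3):
    """Return the enrolled user whose voiceprint discrimination vector is
    closest to `mfcc`, provided that distance is below `threshold`;
    otherwise return None (recognition fails)."""
    best_user, best_dist = None, float("inf")
    for user, vector in enrolled.items():
        d = cosine_distance(mfcc, vector)
        if d < best_dist:
            best_user, best_dist = user, d
    return best_user if best_dist < threshold else None
```

When only one user is enrolled, this reduces to the pass/fail verification of the recognition module 205.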

Of course, in other embodiments, the computer device 1 also pre-trains a high-accuracy GMM (Gaussian mixture model), which serves as a universal background model (UBM) and can be used to extract the voiceprint discrimination vector from speech; the GMM can be trained on a series of sample data to improve the training accuracy of the voiceprint discrimination vector. The training process of the GMM is as follows:

B1. Obtain a preset number (for example, 100,000) of voice data samples, each of which may be collected from the voice of a different person in a different environment (i.e., corresponding to one voiceprint discrimination vector); such voice data samples are used to train a universal background model that can characterize general speech properties.

B2. Process each voice data sample separately to extract the preset type of voiceprint features corresponding to that sample, and construct the voiceprint feature vector corresponding to each sample based on those features;

B3. Divide all the constructed voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 100%;

B4. Use the voiceprint feature vectors in the training set to train the model, and after training is complete, use the validation set to verify the accuracy of the trained model;

B5. If the accuracy is greater than a preset accuracy (for example, 98.5%), the model training ends; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4, and B5 based on the increased voice data samples.
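Steps B2 through B5 can be sketched as follows, with scikit-learn's GaussianMixture standing in for the training model. Because a UBM is trained without labels, held-out average log-likelihood is used here as a stand-in for the patent's "accuracy" check, and the split ratio, component count, and stopping target are all assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features: np.ndarray, train_frac: float = 0.8,
              n_components: int = 8, target_score: float = -40.0,
              max_rounds: int = 3) -> GaussianMixture:
    """Sketch of B2-B5: split the voiceprint feature vectors into a
    training set and a validation set, fit a GMM (the UBM), and check
    it against the validation set, looping if the check fails."""
    rng = np.random.default_rng(0)
    gmm = None
    for _ in range(max_rounds):
        idx = rng.permutation(len(features))          # B3: random split
        split = int(train_frac * len(features))
        train, valid = features[idx[:split]], features[idx[split:]]
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(train)                                # B4: train on training set
        if gmm.score(valid) >= target_score:          # B5: good enough, stop
            return gmm
        # otherwise more samples would be collected and training redone
    return gmm
```

In practice, the pass on step B5 would add freshly collected samples to `features` before retraining rather than merely reshuffling.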

因此，所述计算机设备1先根据训练好的GMM对采集的用户的语音信息进行训练，得到对应的声纹鉴别向量，然后所述计算模块204利用所述声纹鉴别向量计算与所述语音片段对应的MFCC的距离，从而提升精确度。Therefore, the computer device 1 first processes the collected user voice information with the trained GMM to obtain the corresponding voiceprint discrimination vector, and then the calculation module 204 uses the voiceprint discrimination vector to calculate the distance to the MFCC corresponding to the speech segment, thereby improving accuracy.

从上文可知，所述计算机设备1能够获取语音片段之后进行分帧得到每一帧语音数据，然后根据预设的平稳加窗算法对每一帧语音数据进行加窗以得到加窗语音帧；接着，再提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC，并计算所述MFCC与声纹鉴别向量的距离；当所述距离小于预设阈值时，判断所述语音片段的识别结果为通过。通过以上方式，能够在对语音信号进行少量修改的情况下更加精确地计算出语音片段中的特征向量，从而提升语音识别的精度。As can be seen from the above, the computer device 1 can acquire a speech segment and divide it into frames to obtain each frame of speech data, then window each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; next, it extracts the Mel-frequency cepstral feature vector MFCC of the windowed speech frames of the speech segment and calculates the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be passed. In this way, the feature vectors in the speech segment can be computed more accurately with only minor modification of the speech signal, thereby improving the accuracy of speech recognition.

此外,本申请还提出一种语音识别方法,所述方法应用于计算机设备。In addition, this application also proposes a voice recognition method, which is applied to computer equipment.

参阅图3所示,是本申请语音识别方法一实施例的流程示意图。在本实施例中,根据不同的需求,图3所示的流程图中的步骤的执行顺序可以改变,某些步骤可以省略。Refer to FIG. 3, which is a schematic flowchart of an embodiment of a speech recognition method according to the present application. In this embodiment, according to different requirements, the execution order of the steps in the flowchart shown in FIG. 3 can be changed, and some steps can be omitted.

步骤S500,获取语音片段,对所述语音片段进行分帧,得到每一帧语音数据。Step S500: Acquire a voice segment, divide the voice segment into frames, and obtain each frame of voice data.

在本实施例中，所述计算机设备与用户终端，比如手机，移动终端，PC端等设备连接，然后通过用户终端获取用户的语音信息。当然，在其他实施例中，所述计算机设备也可以直接提供拾音器单元采集用户的语音数据，所述语音数据包括至少一个语音片段，因此，所述计算机设备可以获取语音片段。所述计算机设备获取到语音片段之后，则进一步对所述语音片段进行分帧，得到每一帧的语音数据。当然，由于人体的生理特性，语音片段中的高频部分往往被压抑，因此，在其他实施例中，所述计算机设备还会对所述语音片段进行预加重处理，从而补偿语音片段中的高频成分。In this embodiment, the computer device is connected to a user terminal, such as a mobile phone, mobile terminal, or PC, and obtains the user's voice information through the user terminal. Of course, in other embodiments, the computer device may also directly provide a pickup unit to collect the user's voice data, which includes at least one voice segment; therefore, the computer device can obtain a voice segment. After acquiring the voice segment, the computer device further divides it into frames to obtain each frame of voice data. Of course, due to the physiological characteristics of the human body, the high-frequency part of a speech segment is often suppressed; therefore, in other embodiments, the computer device also performs pre-emphasis on the speech segment to compensate for its high-frequency components.
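A minimal sketch of the pre-emphasis and framing described above, for illustration only. The pre-emphasis coefficient (0.97) and the 25 ms frame / 10 ms hop are conventional defaults assumed here, not values stated in the application.

```python
import numpy as np

def preemphasize_and_frame(signal, sr, alpha=0.97, frame_ms=25, hop_ms=10):
    """Pre-emphasis (boost high frequencies) followed by framing.

    alpha, frame_ms and hop_ms are conventional defaults, not values
    taken from the application text.
    """
    # Pre-emphasis: y[t] = x[t] - alpha * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)

    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

sr = 16000
signal = np.random.default_rng(0).normal(size=sr)  # 1 s of fake audio
frames = preemphasize_and_frame(signal, sr)
```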

步骤S502,根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,得到所述语音片段的加窗语音帧。In step S502, each frame of voice data of the voice segment is sequentially windowed according to a preset smooth windowing algorithm to obtain a windowed voice frame of the voice segment.

具体地,所述计算机设备将语音片段分帧之后,进一步对所述语音片段的每一帧语音数据进行加窗。在本实施例中,所述计算机设备根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,然后得到所述语音片段的加窗语音帧。其中,所述平稳加窗算法为:Specifically, after the computer device divides the speech segment into frames, it further performs windowing on each frame of speech data of the speech segment. In this embodiment, the computer device sequentially windows each frame of speech data of the speech segment according to a preset smooth windowing algorithm, and then obtains a windowed speech frame of the speech segment. Wherein, the stable windowing algorithm is:

Figure PCTCN2019117761-appb-000004
Figure PCTCN2019117761-appb-000004

其中，T1为加窗语音帧的时长，w(t)表示在语音帧的时长范围内的t时刻的需对t时刻语音信号进行加窗的加权值，K和K′是常数变量，K<K′且K+K′=1，K是根据环境噪声进行设置的。Among them, T1 is the duration of the windowed speech frame, w(t) represents the weighting value applied when windowing the speech signal at time t within the duration of the speech frame, K and K′ are constant variables, K < K′ and K + K′ = 1, and K is set according to the environmental noise.

在本实施例中，所述计算机设备对每一帧语音数据进行加窗时，首先获取语音数据中的环境噪声的频率分布信息，然后自动调整变量K，再根据变量K对所述分帧进行分段加窗，包括：对于语音帧的帧首和帧尾采用类似余弦波形的加窗，减少低频部分的环境噪声干扰；对于语音帧的中间部分采用类似矩形的加窗，从而避免突发变异产生的高频噪声。其中，对于自动调整变量K的过程，所述计算机设备可以预先随机从所述语音片段中的语音分帧中选择两个语音分帧，然后经傅里叶变换转换到频域，检测其中的环境噪声的频率分布，然后将所述KT1设置在高于所述环境噪声的最大频率的位置。In this embodiment, when the computer device windows each frame of speech data, it first obtains the frequency distribution information of the environmental noise in the speech data, automatically adjusts the variable K accordingly, and then applies segmented windowing to the frame based on K: a cosine-like window is applied to the head and tail of the speech frame to reduce environmental noise interference in the low-frequency part, and a rectangle-like window is applied to the middle of the speech frame to avoid high-frequency noise caused by abrupt transitions. As for automatically adjusting the variable K, the computer device may randomly select two speech frames from the speech segment in advance, convert them to the frequency domain via the Fourier transform, detect the frequency distribution of the environmental noise therein, and then set KT1 above the maximum frequency of the environmental noise.
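The exact windowing formula is in an unreproduced figure, but the behavior described above — a cosine-shaped rise and fall at the frame head and tail with a flat, rectangle-like middle, controlled by K — matches a Tukey-style (tapered cosine) window. The sketch below assumes K is the per-edge taper fraction of the frame; the noise-driven adjustment of K is not modeled.

```python
import numpy as np

def smooth_window(n, K=0.1):
    """Tukey-style window: cosine taper over the first and last K fraction
    of the frame, flat (rectangular) in the middle.

    K is assumed to be the per-edge taper fraction; the application derives
    it from the measured noise spectrum, which this sketch does not model.
    """
    w = np.ones(n)
    taper = int(K * n)
    if taper > 0:
        ramp = 0.5 * (1 - np.cos(np.pi * np.arange(taper) / taper))
        w[:taper] = ramp           # cosine rise at the frame head
        w[-taper:] = ramp[::-1]    # cosine fall at the frame tail
    return w

w = smooth_window(400, K=0.1)
windowed_frame = np.random.default_rng(0).normal(size=400) * w
```

`scipy.signal.windows.tukey` provides an equivalent ready-made window if SciPy is available.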

步骤S504,提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC。Step S504: Extract the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment.

具体地,所述计算机设备对所述语音片段的所有语音分帧进行加窗之后,还进一步对所述语音片段的加窗语音帧进行处理,提取梅尔频率倒谱特征向量MFCC。在本实施例中,所述计算机设备首先对加窗语音帧进行离散傅里叶变换,从时域转换到频域;接着再根据公式:Specifically, after the computer device performs windowing on all voice sub-frames of the voice segment, it further processes the windowed voice frames of the voice segment to extract the Mel frequency cepstrum feature vector MFCC. In this embodiment, the computer device first performs discrete Fourier transform on the windowed speech frame, from the time domain to the frequency domain; then according to the formula:

Figure PCTCN2019117761-appb-000005
Figure PCTCN2019117761-appb-000005

将加窗语音帧的线性频谱域映射到梅尔频谱域；最后再输入到一组梅尔三角滤波器组，计算每个频段的滤波器输出的信号对数能量，得到一个对数能量序列；再将所述对数能量序列做离散余弦变换，从而提取出所述加窗语音帧的梅尔频率倒谱特征向量MFCC。Map the linear spectral domain of the windowed speech frame to the Mel spectral domain; then feed the result into a bank of Mel triangular filters and compute the log energy of each filter's output to obtain a log-energy sequence; finally, apply the discrete cosine transform to the log-energy sequence to extract the Mel-frequency cepstral feature vector MFCC of the windowed speech frame.
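A compact sketch of the extraction chain described above (DFT → Mel filterbank → per-band log energy → DCT), for illustration only. The Mel mapping image is not reproduced here, so the standard mapping mel(f) = 2595·log10(1 + f/700) is assumed, as are the filter count (26) and the number of kept coefficients (13).

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_windowed_frames(frames, sr, n_filters=26, n_ceps=13):
    """DFT -> Mel triangular filterbank -> log energy -> DCT, per frame.

    Uses the standard mapping mel(f) = 2595*log10(1 + f/700); the filter
    count and number of kept coefficients are common defaults, not values
    from the application.
    """
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # linear spectrum

    # Triangular Mel filterbank, evenly spaced on the Mel scale.
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_pts = np.linspace(0, mel_max, n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_energy = np.log(power @ fbank.T + 1e-10)           # per-band log energy
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

frames = np.random.default_rng(0).normal(size=(98, 400))   # windowed frames
mfcc = mfcc_from_windowed_frames(frames, sr=16000)
```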

步骤S506,计算所述MFCC与声纹鉴别向量的距离,其中,所述声纹鉴别向量是预先将所述用户的采样语音信息输入到声纹特征训练模型进行训练得到。Step S506: Calculate the distance between the MFCC and the voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training.

步骤S508,当所述距离小于预设阈值时,判断所述语音片段的识别结果为通过。Step S508: When the distance is less than a preset threshold, it is determined that the recognition result of the speech segment is passed.

具体地，所述计算机设备预先对所述用户进行语音信息采样，然后将采样语音信息输入到声纹特征训练模型进行训练，从而获得所述用户对应的声纹鉴别向量。因此，在所述计算机设备提取到所述语音片段的MFCC之后，还会进一步计算所述MFCC与所述声纹鉴别向量的距离。所述距离为余弦距离，所述距离对应的计算公式为：Specifically, the computer device samples the user's voice information in advance and then inputs the sampled voice information into the voiceprint feature training model for training, thereby obtaining the voiceprint discrimination vector corresponding to the user. Therefore, after the computer device extracts the MFCC of the speech segment, it further calculates the distance between the MFCC and the voiceprint discrimination vector. The distance is the cosine distance, and its calculation formula is:

Figure PCTCN2019117761-appb-000006
Figure PCTCN2019117761-appb-000006

其中，x代表标准声纹鉴别向量，y代表当前声纹鉴别向量。在本实施例中，所述计算机设备通过余弦距离公式计算出所述语音片段的MFCC与预设的声纹鉴别向量之间的距离，然后所述计算机设备将所述距离与预先设定的阈值进行比较；当所述距离小于所述阈值时，则判断所述语音片段的识别结果为通过。Among them, x represents the standard voiceprint discrimination vector, and y represents the current voiceprint discrimination vector. In this embodiment, the computer device calculates the distance between the MFCC of the speech segment and the preset voiceprint discrimination vector using the cosine distance formula, and then compares the distance with a preset threshold; when the distance is less than the threshold, the recognition result of the speech segment is determined to be passed.
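The decision step above can be sketched as follows, for illustration only. The cosine formula image is not reproduced, so the usual cosine similarity x·y / (‖x‖‖y‖) is assumed, with distance taken as 1 − similarity; the 0.3 threshold is purely illustrative.

```python
import numpy as np

def cosine_distance(x, y):
    """1 - cosine similarity; the exact convention in the application is in
    an unreproduced figure, so the usual definition is assumed here."""
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def verify(mfcc_vector, enrolled_vector, threshold=0.3):
    """Pass the segment iff its distance to the enrolled voiceprint
    discrimination vector is below the threshold (illustrative value)."""
    return cosine_distance(mfcc_vector, enrolled_vector) < threshold

enrolled = np.array([1.0, 2.0, 3.0])
same = np.array([1.1, 2.1, 2.9])     # close to the enrolled vector
other = np.array([-3.0, 1.0, -1.0])  # far from it
```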

具体地，所述计算机设备预先通过将GMM训练出不同用户的声纹鉴别向量与将所述MFCC分别进行距离计算，从而选择出小于预设阈值且最小的距离所对应的第一声纹鉴别向量，将所述第一声纹鉴别向量对应的第一用户作为所述语音片段对应的目标用户。Specifically, the computer device trains voiceprint discrimination vectors of different users with the GMM in advance and calculates the distance between each of them and the MFCC, then selects the first voiceprint discrimination vector corresponding to the smallest distance that is below the preset threshold, and takes the first user corresponding to the first voiceprint discrimination vector as the target user corresponding to the voice segment.
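The selection step above (pick the enrolled user whose voiceprint discrimination vector yields the smallest distance below the threshold, otherwise reject) can be sketched as follows; the user names and the 0.3 threshold are illustrative assumptions.

```python
import numpy as np

def identify_speaker(mfcc_vector, enrolled, threshold=0.3):
    """Return the enrolled user with the smallest cosine distance below the
    threshold, or None if nobody qualifies. Names are illustrative."""
    def dist(x, y):
        return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    scored = {user: dist(mfcc_vector, vec) for user, vec in enrolled.items()}
    best_user = min(scored, key=scored.get)
    return best_user if scored[best_user] < threshold else None

enrolled = {
    'alice': np.array([1.0, 2.0, 3.0]),
    'bob':   np.array([-1.0, 0.5, -2.0]),
}
speaker = identify_speaker(np.array([1.1, 2.1, 2.9]), enrolled)
```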

当然，在其他实施例中，所述计算机设备还会预先训练一个准确度较高的GMM(Gaussian Mixture Model,高斯混合模型)，其中，所述GMM作为通用背景模型(UBM,Universal Background Model)，可以用于提取语音中的声纹鉴别向量，其中，所述GMM可以经过一系列的样本数据训练，从而能够提升声纹鉴别向量的训练准确度。其中，所述GMM的训练过程如下：Of course, in other embodiments, the computer device also pre-trains a high-accuracy GMM (Gaussian Mixture Model). The GMM serves as a universal background model (UBM, Universal Background Model) and can be used to extract the voiceprint discrimination vector from speech; the GMM can be trained on a series of sample data, improving the training accuracy of the voiceprint discrimination vector. The training process of the GMM is as follows:

B1、获取预设数量(例如,10万个)的语音数据样本，每个语音数据样本可以采集自不同的人在不同环境中的语音(即对应一个声纹鉴别向量)，这样的语音数据样本用来训练能够表征一般语音特性的通用背景模型。B1. Obtain a preset number (for example, 100,000) of voice data samples. Each voice data sample can be collected from the voices of different people in different environments (i.e., each corresponds to one voiceprint discrimination vector); such voice data samples are used to train a universal background model that characterizes general speech properties.

B2、分别对各个语音数据样本进行处理以提取出各个语音数据样本对应的预设类型声纹特征，并基于各个语音数据样本对应的预设类型声纹特征构建各个语音数据样本对应的声纹特征向量；B2. Process each voice data sample to extract its corresponding preset type of voiceprint features, and construct the voiceprint feature vector of each voice data sample based on those preset-type voiceprint features;

B3、将构建出的所有预设类型声纹特征向量分为第一百分比的训练集和第二百分比的验证集，所述第一百分比和第二百分比之和小于或等于100%；B3. Divide all the constructed preset-type voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, where the sum of the first percentage and the second percentage is less than or equal to 100%;

B4、利用训练集中的声纹特征向量对所述第一模型进行训练,并在训练完成之后利用验证集对训练的所述第一模型的准确率进行验证;B4. Use the voiceprint feature vectors in the training set to train the first model, and use the verification set to verify the accuracy of the trained first model after the training is completed;

B5、若准确率大于预设准确率(例如,98.5%)，则模型训练结束，否则，增加语音数据样本的数量，并基于增加后的语音数据样本重新执行上述步骤B2、B3、B4、B5。B5. If the accuracy rate is greater than the preset accuracy rate (for example, 98.5%), model training ends; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4 and B5 above based on the enlarged sample set.

因此，所述计算机设备先根据训练好的GMM对采集的用户的语音信息进行训练，得到对应的声纹鉴别向量，然后利用所述声纹鉴别向量计算与所述语音片段对应的MFCC的距离，从而提升精确度。Therefore, the computer device first processes the collected user voice information with the trained GMM to obtain the corresponding voiceprint discrimination vector, and then uses the voiceprint discrimination vector to calculate the distance to the MFCC corresponding to the speech segment, thereby improving accuracy.

本实施例所提出的语音识别方法能够获取语音片段之后进行分帧得到每一帧语音数据，然后根据预设的平稳加窗算法对每一帧语音数据进行加窗以得到加窗语音帧；接着，再提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC，并计算所述MFCC与声纹鉴别向量的距离；当所述距离小于预设阈值时，判断所述语音片段的识别结果为通过。通过以上方式，能够在对语音信号进行少量修改的情况下更加精确地计算出语音片段中的特征向量，从而提升语音识别的精度。The speech recognition method proposed in this embodiment can acquire a speech segment and divide it into frames to obtain each frame of speech data, then window each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; next, it extracts the Mel-frequency cepstral feature vector MFCC of the windowed speech frames of the speech segment and calculates the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be passed. In this way, the feature vectors in the speech segment can be computed more accurately with only minor modification of the speech signal, thereby improving the accuracy of speech recognition.

上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the foregoing embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the method described in each embodiment of the present application.

以上仅为本申请的优选实施例，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。The above are only the preferred embodiments of this application and do not thereby limit the patent scope of this application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

一种语音识别方法,所述方法包括步骤:A speech recognition method, the method includes the steps: 获取语音片段,对所述语音片段进行分帧,得到每一帧语音数据;Acquiring a voice segment, framing the voice segment to obtain each frame of voice data; 根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,得到所述语音片段的加窗语音帧;Sequentially windowing each frame of speech data of the speech segment according to a preset smooth windowing algorithm to obtain a windowed speech frame of the speech segment; 提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC;Extracting the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment; 计算所述MFCC与声纹鉴别向量的距离,其中,所述声纹鉴别向量是预先将所述用户的采样语音信息输入到声纹特征训练模型进行训练得到;Calculating the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training; 当所述距离小于预设阈值时,判断所述语音片段的识别结果为通过。When the distance is less than the preset threshold, it is determined that the recognition result of the speech segment is passed. 如权利要求1所述的语音识别方法,所述平稳加窗算法为:The speech recognition method according to claim 1, wherein the stable windowing algorithm is:
Figure PCTCN2019117761-appb-100001
Figure PCTCN2019117761-appb-100001
其中，T1为加窗语音帧的时长，w(t)表示在语音帧的时长范围内的t时刻的需对t时刻语音信号进行加窗的加权值，K和K′是常数变量，K<K′且K+K′=1，K是根据环境噪声进行设置的。Among them, T1 is the duration of the windowed speech frame, w(t) represents the weighting value applied when windowing the speech signal at time t within the duration of the speech frame, K and K′ are constant variables, K < K′ and K + K′ = 1, and K is set according to the environmental noise.
如权利要求2所述的语音识别方法,所述方法还包括:The speech recognition method according to claim 2, the method further comprising: 对每一帧语音数据进行加窗时,获取语音数据中的环境噪声的频率分布信息,再根据噪声的最高频率分布调整所述K。When each frame of speech data is windowed, the frequency distribution information of the environmental noise in the speech data is obtained, and then the K is adjusted according to the highest frequency distribution of the noise. 如权利要求2所述的语音识别方法,所述平稳加窗算法包括:The speech recognition method according to claim 2, wherein the smooth windowing algorithm comprises: 对于语音帧频率分布的帧首和帧尾采用余弦波形的加窗,减少低频部分的环境噪声干扰;For the frame start and end of the speech frame frequency distribution, use cosine waveform windowing to reduce the environmental noise interference in the low frequency part; 对于语音帧频率分布的中间部分采用矩形的加窗,从而避免突发变异产生的高频噪声。For the middle part of the frequency distribution of the speech frame, a rectangular window is used to avoid high frequency noise caused by sudden mutation. 如权利要求1所述的语音识别方法,所述声纹特征训练模型为高斯混合模型GMM,所述方法还包括:The speech recognition method according to claim 1, wherein the voiceprint feature training model is a Gaussian mixture model GMM, and the method further comprises: 通过将GMM训练出不同用户的声纹鉴别向量与将所述MFCC分别进行距离计算;The voiceprint identification vectors of different users are trained by GMM and the distance calculation is performed on the MFCC respectively; 选择出小于预设阈值且最小的距离所对应的第一声纹鉴别向量;Selecting the first voiceprint discrimination vector corresponding to the smallest distance smaller than the preset threshold; 将所述第一声纹鉴别向量对应的第一用户作为所述语音片段对应的目标用户。The first user corresponding to the first voiceprint discrimination vector is used as the target user corresponding to the voice segment. 如权利要求1所述的语音识别方法,所述距离为余弦距离,所述距离对应的计算公式为:The speech recognition method according to claim 1, wherein the distance is a cosine distance, and the calculation formula corresponding to the distance is:
Figure PCTCN2019117761-appb-100002
Figure PCTCN2019117761-appb-100002
其中,x代表标准声纹鉴别向量,y代表当前声纹鉴别向量。Among them, x represents the standard voiceprint discrimination vector, and y represents the current voiceprint discrimination vector.
如权利要求1所述的语音识别方法,在所述对所述语音片段进行分帧之前,所述方法还包括:5. The speech recognition method according to claim 1, before said framing said speech segment, said method further comprises: 对所述语音片段进行预加重处理,补偿语音片段中的高频成分。Pre-emphasis is performed on the voice segment to compensate for high frequency components in the voice segment. 一种语音识别装置,所述装置包括:A speech recognition device, the device includes: 分帧模块,用于获取语音片段,对所述语音片段进行分帧,得到每一帧语音数据;The framing module is used to obtain speech fragments, and divide the speech fragments into frames to obtain each frame of speech data; 加窗模块,用于根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,得到所述语音片段的加窗语音帧;The windowing module is configured to sequentially window each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain a windowed voice frame of the voice segment; 提取模块,用于提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC;An extraction module for extracting the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment; 计算模块,用于计算所述MFCC与声纹鉴别向量的距离,其中,所述声纹鉴别向量是预先将所述用户的采样语音信息输入到声纹特征训练模型进行训练得到;A calculation module for calculating the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training; 识别模块,用于当所述距离小于预设阈值时,判断所述语音片段的识别结果为通过。The recognition module is configured to determine that the recognition result of the speech segment is passed when the distance is less than a preset threshold. 如权利要求8所述的语音识别装置,所述平稳加窗算法为:8. The speech recognition device of claim 8, wherein the smooth windowing algorithm is:
Figure PCTCN2019117761-appb-100003
Figure PCTCN2019117761-appb-100003
其中，T1为加窗语音帧的时长，w(t)表示在语音帧的时长范围内的t时刻的需对t时刻语音信号进行加窗的加权值，K和K′是常数变量，K<K′且K+K′=1，K是根据环境噪声进行设置的。Among them, T1 is the duration of the windowed speech frame, w(t) represents the weighting value applied when windowing the speech signal at time t within the duration of the speech frame, K and K′ are constant variables, K < K′ and K + K′ = 1, and K is set according to the environmental noise.
如权利要求9所述的语音识别装置，所述加窗模块还用于：The speech recognition device according to claim 9, wherein the windowing module is further used for: 对每一帧语音数据进行加窗时，获取语音数据中的环境噪声的频率分布信息，再根据噪声的最高频率分布调整所述K。When each frame of speech data is windowed, the frequency distribution information of the environmental noise in the speech data is obtained, and then the K is adjusted according to the highest frequency distribution of the noise. 如权利要求8所述的语音识别装置，所述声纹特征训练模型为高斯混合模型GMM，8. The speech recognition device according to claim 8, wherein the voiceprint feature training model is a Gaussian mixture model GMM, 所述计算模块，还用于通过将GMM训练出不同用户的声纹鉴别向量与将所述MFCC分别进行距离计算；The calculation module is further configured to train the voiceprint discrimination vectors of different users through GMM and calculate the distances of the MFCC respectively; 所述识别模块，还用于选择出小于预设阈值且最小的距离所对应的第一声纹鉴别向量；并将所述第一声纹鉴别向量对应的第一用户作为所述语音片段对应的目标用户。The recognition module is further configured to select a first voiceprint identification vector corresponding to a distance smaller than a preset threshold and the smallest distance; and use the first user corresponding to the first voiceprint identification vector as the target user corresponding to the voice segment. 如权利要求8所述的语音识别装置，所述距离为余弦距离，所述距离对应的计算公式为：8. The speech recognition device according to claim 8, wherein the distance is a cosine distance, and the calculation formula corresponding to the distance is:
Figure PCTCN2019117761-appb-100004
Figure PCTCN2019117761-appb-100004
其中,x代表标准声纹鉴别向量,y代表当前声纹鉴别向量。Among them, x represents the standard voiceprint discrimination vector, and y represents the current voiceprint discrimination vector.
一种计算机设备,所述计算机设备包括存储器、处理器,所述存储器上存储有可在所述处理器上运行的计算机可读指令,所述计算机可读指令被所述处理器执行时实现步骤:A computer device, the computer device includes a memory and a processor, the memory stores computer-readable instructions that can run on the processor, and the computer-readable instructions implement steps when executed by the processor : 获取语音片段,对所述语音片段进行分帧,得到每一帧语音数据;Acquiring a voice segment, framing the voice segment to obtain each frame of voice data; 根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,得到所述语音片段的加窗语音帧;Sequentially windowing each frame of speech data of the speech segment according to a preset smooth windowing algorithm to obtain a windowed speech frame of the speech segment; 提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC;Extracting the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment; 计算所述MFCC与声纹鉴别向量的距离,其中,所述声纹鉴别向量是预先将所述用户的采样语音信息输入到声纹特征训练模型进行训练得到;Calculating the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training; 当所述距离小于预设阈值时,判断所述语音片段的识别结果为通过。When the distance is less than the preset threshold, it is determined that the recognition result of the speech segment is passed. 如权利要求13所述的计算机设备,所述平稳加窗算法为:The computer device according to claim 13, wherein the smooth windowing algorithm is:
Figure PCTCN2019117761-appb-100005
Figure PCTCN2019117761-appb-100005
其中，T1为加窗语音帧的时长，w(t)表示在语音帧的时长范围内的t时刻的需对t时刻语音信号进行加窗的加权值，K和K′是常数变量，K<K′且K+K′=1，K是根据环境噪声进行设置的。Among them, T1 is the duration of the windowed speech frame, w(t) represents the weighting value applied when windowing the speech signal at time t within the duration of the speech frame, K and K′ are constant variables, K < K′ and K + K′ = 1, and K is set according to the environmental noise.
如权利要求14所述的计算机设备,所述计算机可读指令被所述处理器执行时还实现步骤:The computer device according to claim 14, wherein the computer-readable instructions when executed by the processor further implement the steps: 对每一帧语音数据进行加窗时,获取语音数据中的环境噪声的频率分布信息,再根据噪声的最高频率分布调整所述K。When each frame of speech data is windowed, the frequency distribution information of the environmental noise in the speech data is obtained, and then the K is adjusted according to the highest frequency distribution of the noise. 如权利要求13所述的计算机设备,所述声纹特征训练模型为高斯混合模型GMM,所述计算机可读指令被所述处理器执行时还实现步骤:The computer device according to claim 13, wherein the voiceprint feature training model is a Gaussian mixture model GMM, and the computer-readable instructions further implement the steps when executed by the processor: 通过将GMM训练出不同用户的声纹鉴别向量与将所述MFCC分别进行距离计算;The voiceprint identification vectors of different users are trained by GMM and the distance calculation is performed on the MFCC respectively; 选择出小于预设阈值且最小的距离所对应的第一声纹鉴别向量;Selecting the first voiceprint discrimination vector corresponding to the smallest distance smaller than the preset threshold; 将所述第一声纹鉴别向量对应的第一用户作为所述语音片段对应的目标用户。The first user corresponding to the first voiceprint discrimination vector is used as the target user corresponding to the voice segment. 
一种非易失性计算机可读存储介质,所述非易失性计算机可读存储介质存储有计算机可读指令,所述计算机可读指令可被至少一个处理器执行,以使所述至少一个处理器执行步骤:A non-volatile computer-readable storage medium, the non-volatile computer-readable storage medium stores computer-readable instructions, the computer-readable instructions can be executed by at least one processor, so that the at least one Processor execution steps: 获取语音片段,对所述语音片段进行分帧,得到每一帧语音数据;Acquiring a voice segment, framing the voice segment to obtain each frame of voice data; 根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,得到所述语音片段的加窗语音帧;Sequentially windowing each frame of speech data of the speech segment according to a preset smooth windowing algorithm to obtain a windowed speech frame of the speech segment; 提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC;Extracting the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment; 计算所述MFCC与声纹鉴别向量的距离,其中,所述声纹鉴别向量是预先将所述用户的采样语音信息输入到声纹特征训练模型进行训练得到;Calculating the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training; 当所述距离小于预设阈值时,判断所述语音片段的识别结果为通过。When the distance is less than the preset threshold, it is determined that the recognition result of the speech segment is passed. 如权利要求17所述的非易失性计算机可读存储介质,所述平稳加窗算法为:The non-volatile computer-readable storage medium of claim 17, wherein the smooth windowing algorithm is:
Figure PCTCN2019117761-appb-100006
Figure PCTCN2019117761-appb-100006
其中，T1为加窗语音帧的时长，w(t)表示在语音帧的时长范围内的t时刻的需对t时刻语音信号进行加窗的加权值，K和K′是常数变量，K<K′且K+K′=1，K是根据环境噪声进行设置的。Among them, T1 is the duration of the windowed speech frame, w(t) represents the weighting value applied when windowing the speech signal at time t within the duration of the speech frame, K and K′ are constant variables, K < K′ and K + K′ = 1, and K is set according to the environmental noise.
如权利要求18所述的非易失性计算机可读存储介质,所述计算机可读指令被所述处理器执行时还实现步骤:The non-volatile computer-readable storage medium of claim 18, wherein the computer-readable instructions further implement the steps when executed by the processor: 对每一帧语音数据进行加窗时,获取语音数据中的环境噪声的频率分布信息,再根据噪声的最高频率分布调整所述K。When each frame of speech data is windowed, the frequency distribution information of the environmental noise in the speech data is obtained, and then the K is adjusted according to the highest frequency distribution of the noise. 如权利要求17所述的非易失性计算机可读存储介质,所述声纹特征训练模型为高斯混合模型GMM,所述计算机可读指令被所述处理器执行时还实现步骤:The non-volatile computer-readable storage medium according to claim 17, wherein the voiceprint feature training model is a Gaussian Mixture Model GMM, and the computer-readable instructions further implement the steps when executed by the processor: 通过将GMM训练出不同用户的声纹鉴别向量与将所述MFCC分别进行距离计算;The voiceprint identification vectors of different users are trained by GMM and the distance calculation is performed on the MFCC respectively; 选择出小于预设阈值且最小的距离所对应的第一声纹鉴别向量;Selecting the first voiceprint discrimination vector corresponding to the smallest distance smaller than the preset threshold; 将所述第一声纹鉴别向量对应的第一用户作为所述语音片段对应的目标用户。The first user corresponding to the first voiceprint discrimination vector is used as the target user corresponding to the voice segment.
PCT/CN2019/117761 2019-09-16 2019-11-13 Voice recognition method and apparatus, and computer device Ceased WO2021051572A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910871726.5A CN110556126B (en) 2019-09-16 2019-09-16 Speech recognition method and device and computer equipment
CN201910871726.5 2019-09-16

Publications (1)

Publication Number Publication Date
WO2021051572A1 true WO2021051572A1 (en) 2021-03-25

Family

ID=68740361

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117761 Ceased WO2021051572A1 (en) 2019-09-16 2019-11-13 Voice recognition method and apparatus, and computer device

Country Status (2)

Country Link
CN (1) CN110556126B (en)
WO (1) WO2021051572A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744759A (en) * 2021-09-17 2021-12-03 广州酷狗计算机科技有限公司 Tone template customizing method and device, equipment, medium and product thereof
CN115497511A (en) * 2022-10-31 2022-12-20 广州方硅信息技术有限公司 Training and detection method, device, equipment and medium of voice activity detection model
CN115641871A (en) * 2022-09-29 2023-01-24 福建国电风力发电有限公司 Fan blade abnormity detection method based on voiceprint
CN116092457A (en) * 2023-02-03 2023-05-09 上海哔哩哔哩科技有限公司 Audio signal processing method and system
CN117577137A (en) * 2024-01-15 2024-02-20 宁德时代新能源科技股份有限公司 Cutter health assessment method, device, equipment and storage medium
CN119724245A (en) * 2025-02-26 2025-03-28 联通沃音乐文化有限公司 Intelligent call shorthand method, system and medium based on speech recognition
CN120877732A (en) * 2025-09-22 2025-10-31 珠海格力电器股份有限公司 Electrical equipment control method and device, electrical equipment, medium and product

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210829B (en) * 2020-02-19 2024-07-30 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, system, device and computer readable storage medium
CN111508498B (en) * 2020-04-09 2024-01-30 携程计算机技术(上海)有限公司 Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
CN111933153B (en) * 2020-07-07 2024-03-08 北京捷通华声科技股份有限公司 Voice segmentation point determining method and device
CN113098850A (en) * 2021-03-24 2021-07-09 北京嘀嘀无限科技发展有限公司 Voice verification method and device and electronic equipment
CN115240646A (en) * 2022-05-07 2022-10-25 广州博冠信息科技有限公司 Live voice information processing method, device, device and storage medium
CN115129923B (en) * 2022-05-17 2023-10-20 荣耀终端有限公司 Voice searching method, device and storage medium
CN114945099B (en) * 2022-05-18 2024-04-26 广州博冠信息科技有限公司 Voice monitoring method, device, electronic equipment and computer readable medium
CN116129901B (en) * 2022-08-30 2025-07-18 马上消费金融股份有限公司 Speech recognition method, device, electronic equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8160877B1 (en) * 2009-08-06 2012-04-17 Narus, Inc. Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
US20120158809A1 (en) * 2010-12-17 2012-06-21 Toshifumi Yamamoto Compensation Filtering Device and Method Thereof
CN105232064A (en) * 2015-10-30 2016-01-13 科大讯飞股份有限公司 System and method for predicting influence of music on behaviors of driver
CN107527620A (en) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 Electronic device, identity authentication method and computer-readable storage medium
CN109040913A (en) * 2018-08-06 2018-12-18 中国船舶科学研究中心(中国船舶重工集团公司第七0二研究所) Beamforming method for a window-function-weighted electroacoustic transducer transmit array
CN110197657A (en) * 2019-05-22 2019-09-03 大连海事大学 Dynamic sound feature extraction method based on cosine similarity

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993071A (en) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 Electronic device, auth method and storage medium based on vocal print
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
CN110047490A (en) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, equipment and computer readable storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744759A (en) * 2021-09-17 2021-12-03 广州酷狗计算机科技有限公司 Tone template customizing method and device, equipment, medium and product thereof
CN113744759B (en) * 2021-09-17 2023-09-22 广州酷狗计算机科技有限公司 Tone color template customizing method and device, equipment, medium and product thereof
CN115641871A (en) * 2022-09-29 2023-01-24 福建国电风力发电有限公司 Voiceprint-based wind turbine blade anomaly detection method
CN115497511A (en) * 2022-10-31 2022-12-20 广州方硅信息技术有限公司 Training and detection method, device, equipment and medium of voice activity detection model
CN115497511B (en) * 2022-10-31 2025-01-07 广州方硅信息技术有限公司 Training and detecting method, device, equipment and medium for voice activity detection model
CN116092457A (en) * 2023-02-03 2023-05-09 上海哔哩哔哩科技有限公司 Audio signal processing method and system
CN117577137A (en) * 2024-01-15 2024-02-20 宁德时代新能源科技股份有限公司 Cutter health assessment method, device, equipment and storage medium
CN117577137B (en) * 2024-01-15 2024-05-28 宁德时代新能源科技股份有限公司 Cutter health assessment method, device, equipment and storage medium
CN119724245A (en) * 2025-02-26 2025-03-28 联通沃音乐文化有限公司 Intelligent call shorthand method, system and medium based on speech recognition
CN120877732A (en) * 2025-09-22 2025-10-31 珠海格力电器股份有限公司 Electrical equipment control method and device, electrical equipment, medium and product

Also Published As

Publication number Publication date
CN110556126A (en) 2019-12-10
CN110556126B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
WO2021051572A1 (en) Voice recognition method and apparatus, and computer device
CN107527620B (en) Electronic device, identity authentication method and computer-readable storage medium
CN106683680B (en) Speaker recognition method and apparatus, computer equipment and computer readable medium
US9343067B2 (en) Speaker verification
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
WO2021042537A1 (en) Voice recognition authentication method and system
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
WO2019100606A1 (en) Electronic device, voiceprint-based identity verification method and system, and storage medium
JP2019510248A (en) Voiceprint identification method, apparatus and background server
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
WO2018166187A1 (en) Server, identity verification method and system, and a computer-readable storage medium
WO2019136912A1 (en) Electronic device, identity authentication method and system, and storage medium
EP3989217A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
WO2019136911A1 (en) Voice recognition method for updating voiceprint data, terminal device, and storage medium
CN108630208B (en) Server, voiceprint-based identity authentication method and storage medium
WO2019136909A1 (en) Voice living-body detection method based on deep learning, server and storage medium
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
CN109065043B (en) Command word recognition method and computer storage medium
WO2019196305A1 (en) Electronic device, identity verification method, and storage medium
CN113436633A (en) Speaker recognition method, speaker recognition device, computer equipment and storage medium
WO2019218512A1 (en) Server, voiceprint verification method, and storage medium
CN112992174A (en) Voice analysis method and voice recording device thereof
CN114171032A (en) Cross-channel voiceprint model training method, identification method, device and readable medium
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945851

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945851

Country of ref document: EP

Kind code of ref document: A1