
WO2021051572A1 - Voice recognition method and apparatus, and computer device - Google Patents

Voice recognition method and apparatus, and computer device

Info

Publication number
WO2021051572A1
WO2021051572A1 (PCT/CN2019/117761)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
frame
voice
segment
voiceprint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/117761
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of WO2021051572A1 publication Critical patent/WO2021051572A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/04 — Training, enrolment or model building
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 — Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/45 — Speech or voice analysis techniques characterised by the type of analysis window

Definitions

  • This application relates to the field of speech recognition technology, and in particular to a speech recognition method, device, computer equipment, and non-volatile computer-readable storage medium.
  • Voice recognition is a biometric recognition technology that automatically identifies the user corresponding to a voice, based on parameters in the speech waveform that reflect the speaker's physiological or behavioral characteristics.
  • Such recognition generally uses the voiceprint features in the voice signal.
  • Existing windowing processes apply windows such as the Hanning, Hamming, triangular, or Gaussian window to the voice data.
  • The inventor realized that these existing windowing methods almost always modify the original speech signal, which loses part of the voiceprint feature information and reduces the accuracy of speech recognition.
  • To address this, the present application proposes a speech recognition method, device, computer equipment, and non-volatile computer-readable storage medium. The method obtains a speech segment and divides it into frames to obtain each frame of speech data; windows each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; extracts the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; calculates the distance between the MFCC and a voiceprint discrimination vector; and, when the distance is less than a preset threshold, determines that the recognition result of the speech segment is passed.
  • First, the present application provides a voice recognition method, which includes: acquiring a voice segment and dividing it into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is passed.
  • The present application further provides a voice recognition device, which includes: a framing module for acquiring a voice segment and dividing it into frames to obtain each frame of voice data; a windowing module for sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment; an extraction module for extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames; a calculation module for calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and a recognition module for determining, when the distance is less than a preset threshold, that the recognition result of the voice segment is passed.
  • The present application further proposes a computer device, which includes a memory and a processor. The memory stores computer-readable instructions that can run on the processor, and when executed by the processor, the instructions implement the steps of: acquiring a voice segment and dividing it into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is passed.
  • The present application also provides a non-volatile computer-readable storage medium storing computer-readable instructions executable by at least one processor, so that the at least one processor executes the steps of: acquiring a voice segment and dividing it into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is passed.
  • The speech recognition method, device, computer equipment, and non-volatile computer-readable storage medium proposed in this application obtain a speech segment, divide it into frames to obtain each frame of speech data, window each frame according to a preset smooth windowing algorithm to obtain windowed speech frames, extract the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames, and calculate the distance between the MFCC and a voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be passed.
  • Because the smooth windowing only slightly modifies the speech signal, the feature vector of the speech segment can be calculated more accurately, thereby improving the accuracy of speech recognition.
  • Fig. 1 is a schematic diagram of an optional hardware architecture of the computer device of the present application.
  • Fig. 2 is a schematic diagram of the program modules of an embodiment of the speech recognition device of the present application.
  • Fig. 3 is a schematic flowchart of an embodiment of the speech recognition method of the present application.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the computer device 1 of the present application.
  • the computer device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can communicate with each other through a system bus.
  • The computer device 1 is connected to a network through the network interface 13 (not shown in Fig. 1), through which it connects to other terminal devices such as mobile terminals, mobile telephones, user equipment (UE), handsets, portable equipment, and PC terminals.
  • The network may be an intranet, the Internet, a GSM, WCDMA, 4G, or 5G network, Bluetooth, Wi-Fi, a telephone network, or another wireless or wired network.
  • FIG. 1 only shows the computer device 1 with components 11-13, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • The memory 11 includes at least one type of non-volatile computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, or optical discs.
  • the memory 11 may be an internal storage unit of the computer device 1, for example, a hard disk or a memory of the computer device 1.
  • The memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card.
  • the memory 11 may also include both the internal storage unit of the computer device 1 and its external storage device.
  • the memory 11 is generally used to store an operating system and various application software installed in the computer device 1, such as the program code of the voice recognition apparatus 200.
  • the memory 11 can also be used to temporarily store various types of data that have been output or will be output.
  • The memory 11 stores computer-readable instructions executable by at least one processor, so that the at least one processor executes the steps of: acquiring a voice segment and dividing it into frames to obtain each frame of voice data; sequentially windowing each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance; and, when the distance is less than a preset threshold, determining that the recognition result of the voice segment is passed.
  • the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 12 is generally used to control the overall operation of the computer device 1, such as performing data interaction or communication-related control and processing.
  • the processor 12 is configured to run the program code or process data stored in the memory 11, for example, to run the voice recognition device 200.
  • The network interface 13 may include a wireless or wired network interface, and is usually used to establish a communication connection between the computer device 1 and other terminal devices such as mobile terminals, mobile telephones, user equipment, handsets, portable equipment, and PC terminals.
  • When a voice recognition device 200 is installed and running in the computer device 1, it can acquire a voice segment and divide it into frames to obtain each frame of voice data, then window each frame of voice data according to the preset smooth windowing algorithm to obtain windowed voice frames; it then extracts the Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames of the voice segment and calculates the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than a preset threshold, it determines that the recognition result of the voice information is passed.
  • Accordingly, this application proposes a voice recognition device 200.
  • FIG. 2 is a program module diagram of an embodiment of the speech recognition device 200 of the present application.
  • The speech recognition device 200 includes a series of computer-readable instructions stored in the memory 11; when these instructions are executed by the processor 12, the speech recognition functions of the various embodiments of the present application can be implemented.
  • The speech recognition device 200 may be divided into one or more modules based on the specific operations implemented by the various parts of the computer-readable instructions. For example, in Fig. 2, the speech recognition device 200 is divided into a framing module 201, a windowing module 202, an extraction module 203, a calculation module 204, and a recognition module 205, wherein:
  • the framing module 201 is used to obtain a voice segment, and framing the voice segment to obtain each frame of voice data.
  • The computer device 1 is connected to a user terminal, such as a mobile phone, a mobile terminal, or a PC terminal, and obtains the user's voice information through that terminal.
  • Alternatively, the computer device 1 may directly provide a pickup unit to collect the user's voice data.
  • The voice data includes at least one voice segment, so the framing module 201 can obtain the voice segment. After obtaining the speech segment, the framing module 201 divides it into frames to obtain the speech data of each frame. Due to the physiological characteristics of the human vocal tract, the high-frequency part of a speech segment is often attenuated; therefore, in other embodiments, the framing module 201 also performs pre-emphasis processing on the speech segment to compensate for its high-frequency components.
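The framing and pre-emphasis steps above can be sketched as follows. The patent gives no concrete parameters; the 0.97 pre-emphasis coefficient and the 25 ms frame / 10 ms hop at 16 kHz used here are conventional assumptions, not values from the source.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """High-frequency compensation: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.random.default_rng(0).normal(size=16000)   # one second of placeholder audio
frames = frame_signal(pre_emphasize(x))
print(frames.shape)   # (98, 400)
```

Each row of `frames` is one frame of voice data, ready for the windowing step.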
  • the windowing module 202 is configured to sequentially window each frame of speech data of the speech segment according to a preset smooth windowing algorithm to obtain a windowed speech frame of the speech segment.
  • After the speech segment is divided into frames, the windowing module 202 sequentially windows each frame of speech data of the speech segment according to the preset smooth windowing algorithm, obtaining the windowed speech frames of the speech segment.
  • In one embodiment, the smooth windowing algorithm is given by a formula in which:
  • T1 is the time length of the windowed speech frame, and
  • w(t) represents the weighting applied to the speech signal at time t within the time range of the frame.
  • When the computer device 1 windows each frame of speech data, it first obtains the frequency distribution of the environmental noise in the speech data, automatically adjusts the variable K accordingly, and then windows the frame in segments according to K.
  • The segmented windowing is as follows: a cosine-like taper is applied to the beginning and end of the speech frame to reduce environmental noise interference in the low-frequency part, while a rectangle-like (flat) window is applied to the middle of the speech frame to avoid the high-frequency noise that an abrupt transition would generate.
  • Specifically, the computer device 1 may randomly select two speech sub-frames from the speech segment in advance, convert them to the frequency domain by Fourier transform, detect the frequency distribution of the environmental noise in them, and then set K·T1 above the position corresponding to the maximum frequency of the environmental noise.
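The patent's window formula is not reproduced in this text, but the segmented scheme it describes — cosine-like tapers at both ends, a flat rectangular middle, with the taper extent controlled by K — matches a tapered-cosine (Tukey-style) window. A sketch under that assumption, with a hypothetical taper fraction `k` standing in for the patent's K:

```python
import numpy as np

def smooth_window(n, k=0.25):
    """Tapered-cosine window: cosine ramps over the first and last k*n samples,
    flat (rectangular, value 1) in the middle, per the segmented scheme above."""
    w = np.ones(n)
    taper = int(k * n)
    if taper > 0:
        ramp = 0.5 * (1 - np.cos(np.pi * np.arange(taper) / taper))
        w[:taper] = ramp          # rising cosine taper at the start
        w[-taper:] = ramp[::-1]   # falling cosine taper at the end
    return w

w = smooth_window(400, k=0.25)
# the flat middle leaves the original speech samples there unmodified,
# which is the stated motivation for this window over Hamming/Hanning
assert np.all(w[100:300] == 1.0)
```

A larger `k` widens the tapered region; the patent ties this choice to the detected environmental-noise frequencies.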
  • the extraction module 203 is configured to extract the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment.
  • After all speech sub-frames of the speech segment are windowed, the extraction module 203 further processes the windowed speech frames of the speech segment to extract the Mel-frequency cepstral coefficient vector MFCC.
  • Specifically, the extraction module 203 first performs a discrete Fourier transform on each windowed speech frame to convert it from the time domain to the frequency domain, and then proceeds according to the following formula:
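The Mel-filterbank formula itself is elided above. As a hedged sketch of the conventional MFCC pipeline the surrounding text describes — power spectrum, triangular Mel filterbank, log, then DCT — the following may help; the 512-point FFT, 26 filters, and 13 coefficients are standard choices, not values taken from the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def dct2_ortho(x):
    """Orthonormal DCT-II, conventionally applied to the log Mel energies."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * (np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x)
    c[0] /= np.sqrt(2.0)
    return c

def mfcc(frame, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Per-frame MFCC: power spectrum -> triangular Mel filterbank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_energy = np.log(fbank @ spec + 1e-10)
    return dct2_ortho(log_energy)[:n_ceps]

coeffs = mfcc(np.random.default_rng(0).normal(size=400))
print(coeffs.shape)   # (13,)
```

In the device described here, this per-frame vector (or a pooled version of it over the segment's frames) is what gets compared against the voiceprint discrimination vector.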
  • the calculation module 204 is configured to calculate the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training.
  • the recognition module 205 is configured to determine that the recognition result of the speech segment is passed when the distance is less than a preset threshold.
  • The computer device 1 samples the user's voice information in advance and inputs the sampled voice information into a voiceprint feature training model for training, obtaining the voiceprint discrimination vector corresponding to the user. Therefore, after the extraction module 203 extracts the MFCC of the speech segment, the calculation module 204 further calculates the distance between the MFCC and the voiceprint discrimination vector.
  • In this embodiment, the distance is the cosine distance, calculated according to the cosine-distance formula in which:
  • x represents the standard voiceprint discrimination vector, and
  • y represents the current voiceprint discrimination vector.
  • The calculation module 204 uses the cosine-distance formula to calculate the distance between the MFCC of the speech segment and the preset voiceprint discrimination vector, and the recognition module 205 then compares this distance with a preset threshold; when the distance is less than the threshold, the recognition result of the speech segment is determined to be passed.
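The cosine-distance formula is not reproduced in this text; the standard form is distance = 1 − (x·y)/(‖x‖‖y‖), which is 0 for identical directions and grows as the vectors diverge. A minimal sketch of the compare-and-threshold step, with a hypothetical threshold of 0.3 (the patent does not give a value):

```python
import numpy as np

def cosine_distance(x, y):
    """1 - cos(theta): 0 for identical directions, up to 2 for opposite ones."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def recognize(mfcc_vec, voiceprint_vec, threshold=0.3):
    """Pass when the segment's MFCC vector is close enough to the enrolled voiceprint."""
    return cosine_distance(mfcc_vec, voiceprint_vec) < threshold

v = np.array([1.0, 2.0, 3.0])
assert recognize(v, v)        # identical vectors: distance 0, passes
assert not recognize(v, -v)   # opposite vectors: distance 2, rejected
```

Note that cosine distance ignores vector magnitude, which is why it suits comparing feature vectors extracted from recordings of differing loudness.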
  • In another embodiment, the computer device 1 trains the voiceprint discrimination vectors of different users through a GMM in advance and calculates the distance from the MFCC to each of them, then selects the first voiceprint discrimination vector corresponding to the smallest distance that is below the preset threshold, and takes the first user corresponding to that vector as the target user of the voice segment.
  • In this embodiment, the computer device 1 also pre-trains a higher-accuracy GMM (Gaussian Mixture Model) that serves as a universal background model (UBM) and can be used to extract the voiceprint discrimination vector from speech. The GMM is trained on a series of sample data to improve the training accuracy of the voiceprint discrimination vector.
  • GMM: Gaussian Mixture Model
  • UBM: Universal Background Model
  • Voice data samples can be collected from the voices of different people in different environments (each corresponding to a voiceprint discrimination vector); such samples are used to train a universal background model that characterizes general speech characteristics.
  • Each voice data sample is processed separately to extract its preset type of voiceprint feature, and a voiceprint feature vector is constructed for each sample based on that feature;
  • If the verification accuracy reaches a preset accuracy rate (for example, 98.5%), the model training ends; otherwise, the number of voice data samples is increased and the above steps B2, B3, B4, and B5 are re-executed on the enlarged sample set.
  • In this way, the computer device 1 first trains on the collected user voice information with the trained GMM to obtain the corresponding voiceprint discrimination vector, and the calculation module 204 then uses this voiceprint discrimination vector to calculate the distance to the MFCC corresponding to the voice segment, improving accuracy.
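The GMM-UBM training described above can be sketched with scikit-learn. The patent does not specify how the discrimination vector is derived from the trained GMM; the posterior-weighted per-component means below (a crude supervector-style construction) are one common approach and purely illustrative, as are the 8 components and 13-dimensional features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder MFCC features pooled from many speakers (rows = frames)
ubm_features = rng.normal(size=(2000, 13))

# Universal background model: one GMM fit on everyone's speech
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
ubm.fit(ubm_features)

# A per-user voiceprint vector: posterior-weighted mean of the user's frames
user_features = rng.normal(loc=0.5, size=(300, 13))
post = ubm.predict_proba(user_features)                     # (300, 8) responsibilities
weighted_means = post.T @ user_features / (post.sum(axis=0)[:, None] + 1e-10)
voiceprint = weighted_means.ravel()                         # concatenated: 8 * 13 = 104 dims
print(voiceprint.shape)   # (104,)
```

At recognition time, the same construction applied to a new segment yields a vector that can be compared to the enrolled `voiceprint` by cosine distance.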
  • In this way, the computer device 1 can obtain a voice segment, divide it into frames to obtain each frame of voice data, window each frame according to the preset smooth windowing algorithm to obtain windowed voice frames, extract the Mel-frequency cepstral coefficient vector MFCC of the windowed voice frames, and calculate the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the voice segment is determined to be passed.
  • Because the speech signal is only slightly modified, the feature vector of the speech segment can be calculated more accurately, thereby improving the accuracy of speech recognition.
  • In addition, this application proposes a voice recognition method, which is applied to a computer device.
  • FIG. 3 is a schematic flowchart of an embodiment of a speech recognition method according to the present application.
  • the execution order of the steps in the flowchart shown in FIG. 3 can be changed, and some steps can be omitted.
  • Step S500: Acquire a voice segment, divide the voice segment into frames, and obtain each frame of voice data.
  • The computer device is connected to a user terminal, such as a mobile phone, a mobile terminal, or a PC terminal, and obtains the user's voice information through that terminal.
  • Alternatively, the computer device may directly provide a pickup unit to collect the user's voice data, and the voice data includes at least one voice segment, so the computer device can obtain the voice segment. After acquiring the voice segment, the computer device divides it into frames to obtain the voice data of each frame.
  • Due to the physiological characteristics of the human vocal tract, the high-frequency part of a speech segment is often attenuated; therefore, the computer device also performs pre-emphasis processing on the speech segment to compensate for its high-frequency components.
  • Step S502: Sequentially window each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain windowed voice frames of the voice segment.
  • After dividing the speech segment into frames, the computer device sequentially windows each frame of speech data of the speech segment according to the preset smooth windowing algorithm, obtaining the windowed speech frames of the speech segment.
  • In one embodiment, the smooth windowing algorithm is given by a formula in which:
  • T1 is the time length of the windowed speech frame, and
  • w(t) represents the weighting applied to the speech signal at time t within the time range of the frame.
  • When the computer device windows each frame of speech data, it first obtains the frequency distribution of the environmental noise in the speech data, automatically adjusts the variable K accordingly, and then windows the frame in segments according to K.
  • The segmented windowing is as follows: a cosine-like taper is applied to the start and end of the speech frame to reduce environmental noise interference in the low-frequency part, while a rectangle-like (flat) window is applied to the middle of the speech frame to avoid the high-frequency noise that an abrupt transition would generate.
  • Specifically, the computer device may randomly select two speech sub-frames from the speech segment in advance, convert them to the frequency domain by Fourier transform, detect the frequency distribution of the environmental noise in them, and then set K·T1 above the position corresponding to the maximum frequency of the environmental noise.
  • Step S504: Extract the Mel-frequency cepstral coefficient vector MFCC of the windowed speech frames of the speech segment.
  • After the computer device windows all speech sub-frames of the speech segment, it further processes the windowed speech frames of the speech segment to extract the Mel-frequency cepstral coefficient vector MFCC.
  • Specifically, the computer device first performs a discrete Fourier transform on each windowed speech frame to convert it from the time domain to the frequency domain, and then proceeds according to the following formula:
  • Step S506: Calculate the distance between the MFCC and the voiceprint discrimination vector, where the voiceprint discrimination vector is obtained by inputting sampled voice information of the user into a voiceprint feature training model for training in advance.
  • Step S508: When the distance is less than a preset threshold, determine that the recognition result of the speech segment is passed.
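The S500–S508 flow can be sketched end-to-end. This is a simplified stand-in: a plain averaged log spectrum replaces the full MFCC, the tapered-cosine interpretation of the smooth window is assumed, and the 0.3 threshold and framing parameters are hypothetical.

```python
import numpy as np

def extract_feature(signal, frame_len=400, hop=160, k=0.25, n_fft=512):
    """S500 + S502 + S504 collapsed: frame, taper-window, average log spectrum.
    (A simplified spectral feature stands in for the full MFCC here.)"""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])
    w = np.ones(frame_len)
    t = int(k * frame_len)
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(t) / t))
    w[:t], w[-t:] = ramp, ramp[::-1]
    spec = np.abs(np.fft.rfft(frames * w, n_fft)) ** 2
    return np.log(spec + 1e-10).mean(axis=0)

def recognize(signal, voiceprint, threshold=0.3):
    """S506 + S508: cosine distance against the enrolled voiceprint, thresholded."""
    f = extract_feature(signal)
    d = 1.0 - f @ voiceprint / (np.linalg.norm(f) * np.linalg.norm(voiceprint))
    return d < threshold

rng = np.random.default_rng(1)
x = rng.normal(size=16000)
enrolled = extract_feature(x)     # enrol with the segment's own feature
print(recognize(x, enrolled))     # True: distance is ~0 against itself
```

A real system would enrol `voiceprint` from separately recorded sampled speech (via the trained voiceprint feature model), not from the test segment itself; the self-comparison here only demonstrates the pass path.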
  • The computer device samples the user's voice information in advance and inputs the sampled voice information into the voiceprint feature training model for training, obtaining the voiceprint discrimination vector corresponding to the user. Therefore, after extracting the MFCC of the speech segment, the computer device further calculates the distance between the MFCC and the voiceprint discrimination vector.
  • In this embodiment, the distance is the cosine distance, calculated according to the cosine-distance formula in which:
  • x represents the standard voiceprint discrimination vector, and
  • y represents the current voiceprint discrimination vector.
  • The computer device uses the cosine-distance formula to calculate the distance between the MFCC of the speech segment and the preset voiceprint discrimination vector, and then compares this distance with a preset threshold; when the distance is less than the threshold, the recognition result of the speech segment is determined to be passed.
  • In another embodiment, the computer device trains the voiceprint discrimination vectors of different users through the GMM in advance and calculates the distance from the MFCC to each of them, then selects the first voiceprint discrimination vector corresponding to the smallest distance that is below the preset threshold, and takes the first user corresponding to that vector as the target user of the voice segment.
  • In this embodiment, the computer device also pre-trains a higher-accuracy GMM (Gaussian Mixture Model) that serves as a universal background model (UBM) and can be used to extract the voiceprint discrimination vector from speech. The GMM is trained on a series of sample data to improve the training accuracy of the voiceprint discrimination vector.
  • GMM: Gaussian Mixture Model
  • UBM: Universal Background Model
  • Voice data samples can be collected from the voices of different people in different environments (each corresponding to a voiceprint discrimination vector); such samples are used to train a universal background model that characterizes general speech characteristics.
  • Each voice data sample is processed separately to extract its preset type of voiceprint feature, and a voiceprint feature vector is constructed for each sample based on that feature;
  • If the verification accuracy reaches a preset accuracy rate (for example, 98.5%), the model training ends; otherwise, the number of voice data samples is increased and the above steps B2, B3, B4, and B5 are re-executed on the enlarged sample set.
  • In this way, the computer device first trains on the collected user voice information with the trained GMM to obtain the corresponding voiceprint discrimination vector, and then uses this voiceprint discrimination vector to calculate the distance to the MFCC corresponding to the voice segment, thereby improving accuracy.
  • The speech recognition method proposed in this embodiment obtains a speech segment, divides it into frames to obtain each frame of speech data, windows each frame according to a preset smooth windowing algorithm to obtain windowed speech frames, extracts the Mel-frequency cepstral coefficient vector MFCC of the windowed speech frames, and calculates the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be passed.
  • Because the speech signal is only slightly modified, the feature vector of the speech segment can be calculated more accurately, thereby improving the accuracy of speech recognition.
  • The technical solution of this application, in essence or in the part that contributes over the existing technology, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice recognition method and apparatus, a computer device and a non-volatile computer-readable storage medium. Said method comprises: acquiring a voice segment, and framing the voice segment to obtain each frame of voice data (S500); sequentially windowing, according to a preset smooth windowing algorithm, each frame of voice data of the voice segment, so as to obtain the windowed voice frames of the voice segment (S502); extracting a Mel-frequency cepstral coefficient vector (MFCC) of the windowed voice frames of the voice segment (S504); calculating the distance between the MFCC and a voiceprint discrimination vector (S506); and when the distance is less than a preset threshold, determining that the recognition result of the voice segment is a pass (S508). The voice recognition method can calculate the feature vectors in a voice segment more accurately, thereby improving the accuracy of voice recognition.

Description

Speech recognition method, apparatus, and computer device

This application claims priority to Chinese patent application No. 201910871726.5, filed with the Chinese Patent Office on September 16, 2019 and entitled "Speech recognition method, apparatus, and computer device", the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to the field of speech recognition technology, and in particular to a speech recognition method and apparatus, a computer device, and a non-volatile computer-readable storage medium.

Background

Speech recognition is a type of biometric technology that automatically identifies the user corresponding to a voice, based on speech parameters in the waveform that reflect physiological or behavioral characteristics of the speech. In the prior art, speech recognition generally relies on the voiceprint features in the speech signal. In the voiceprint feature extraction stage, existing windowing methods apply a Hanning, Hamming, triangular, or Gaussian window to the speech data. The inventor realized that almost all of these windowing methods modify the original speech signal, causing the loss of part of the voiceprint feature information and reducing the accuracy of speech recognition.

Summary

In view of this, this application proposes a speech recognition method and apparatus, a computer device, and a non-volatile computer-readable storage medium, which can acquire a speech segment, divide it into frames to obtain each frame of speech data, and window each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment is then extracted, and the distance between the MFCC and a voiceprint discrimination vector is calculated; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be a pass. In this way, the feature vectors of a speech segment can be computed more accurately while only slightly modifying the speech signal, thereby improving the accuracy of speech recognition.

First, to achieve the above objective, this application provides a speech recognition method, which includes:

acquiring a speech segment and dividing it into frames to obtain each frame of speech data; windowing each frame of speech data of the speech segment in turn according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training; and when the distance is less than a preset threshold, determining that the recognition result of the speech segment is a pass.

In addition, to achieve the above objective, this application also provides a speech recognition apparatus, which includes:

a framing module, configured to acquire a speech segment and divide it into frames to obtain each frame of speech data; a windowing module, configured to window each frame of speech data of the speech segment in turn according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment; an extraction module, configured to extract the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; a calculation module, configured to calculate the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training; and a recognition module, configured to determine that the recognition result of the speech segment is a pass when the distance is less than a preset threshold.

Further, this application also proposes a computer device that includes a memory and a processor. The memory stores computer-readable instructions executable on the processor, and when executed by the processor the instructions implement the following steps:

acquiring a speech segment and dividing it into frames to obtain each frame of speech data; windowing each frame of speech data of the speech segment in turn according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training; and when the distance is less than a preset threshold, determining that the recognition result of the speech segment is a pass.

Further, to achieve the above objective, this application also provides a non-volatile computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor, so that the at least one processor performs the following steps:

acquiring a speech segment and dividing it into frames to obtain each frame of speech data; windowing each frame of speech data of the speech segment in turn according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training; and when the distance is less than a preset threshold, determining that the recognition result of the speech segment is a pass.

The speech recognition method and apparatus, computer device, and non-volatile computer-readable storage medium proposed in this application can acquire a speech segment and divide it into frames to obtain each frame of speech data, and then window each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment is then extracted, and the distance between the MFCC and a voiceprint discrimination vector is calculated; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be a pass. In this way, the feature vectors of a speech segment can be computed more accurately while only slightly modifying the speech signal, thereby improving the accuracy of speech recognition.

Description of the Drawings

Fig. 1 is a schematic diagram of an optional hardware architecture of the computer device of this application;

Fig. 2 is a schematic diagram of the program modules of an embodiment of the speech recognition apparatus of this application;

Fig. 3 is a schematic flowchart of an embodiment of the speech recognition method of this application.

Detailed Description

In order to make the purpose, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of this application.

It should be noted that descriptions involving "first", "second", and the like in this application are for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments can be combined with one another, but only on the basis that a person of ordinary skill in the art can realize the combination; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that such a combination does not exist and does not fall within the scope of protection claimed by this application.

Refer to Fig. 1, which is a schematic diagram of an optional hardware architecture of the computer device 1 of this application.

In this embodiment, the computer device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can communicate with each other through a system bus.

The computer device 1 is connected to a network (not shown in Fig. 1) through the network interface 13, and through the network to other terminal devices such as mobile terminals, mobile telephones, user equipment (UE), handsets, portable equipment, and PCs. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephone network.

It should be pointed out that Fig. 1 only shows the computer device 1 with components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

The memory 11 includes at least one type of non-volatile computer-readable storage medium, which includes flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and so on. In some embodiments, the memory 11 may be an internal storage unit of the computer device 1, for example a hard disk or internal memory of the computer device 1. In other embodiments, the memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card with which the computer device 1 is equipped. Of course, the memory 11 may also include both the internal storage unit of the computer device 1 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system and various application software installed on the computer device 1, such as the program code of the speech recognition apparatus 200. In addition, the memory 11 can also be used to temporarily store various types of data that have been output or are to be output.

The memory 11 stores computer-readable instructions that can be executed by at least one processor, so that the at least one processor performs the following steps:

acquiring a speech segment and dividing it into frames to obtain each frame of speech data; windowing each frame of speech data of the speech segment in turn according to a preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment; extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment; calculating the distance between the MFCC and a voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training; and when the distance is less than a preset threshold, determining that the recognition result of the speech segment is a pass.

In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the computer device 1, such as performing control and processing related to data exchange or communication. In this embodiment, the processor 12 is used to run the program code or process the data stored in the memory 11, for example to run the speech recognition apparatus 200.

The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish communication connections between the computer device 1 and other terminal devices such as mobile terminals, mobile telephones, user equipment, handsets, portable equipment, and PCs.

In this embodiment, when the speech recognition apparatus 200 is installed and running on the computer device 1, it can acquire a speech segment and divide it into frames to obtain each frame of speech data, and window each frame of speech data according to the preset smooth windowing algorithm to obtain windowed speech frames; it then extracts the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment and calculates the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than the preset threshold, it determines that the recognition result of the voice information is a pass. In this way, the feature vectors of a speech segment can be computed more accurately, thereby improving the accuracy of speech recognition.

So far, the application environment of the embodiments of this application and the hardware structures and functions of the related devices have been described in detail. Hereinafter, various embodiments of this application are proposed based on the above application environment and related devices.

First, this application proposes a speech recognition apparatus 200.

Refer to Fig. 2, which is a program module diagram of an embodiment of the speech recognition apparatus 200 of this application.

In this embodiment, the speech recognition apparatus 200 includes a series of computer-readable instructions stored in the memory 11; when these instructions are executed by the processor 12, the speech recognition functions of the various embodiments of this application can be implemented. In some embodiments, the speech recognition apparatus 200 may be divided into one or more modules based on the specific operations implemented by the respective parts of the computer-readable instructions. For example, in Fig. 2, the speech recognition apparatus 200 may be divided into a framing module 201, a windowing module 202, an extraction module 203, a calculation module 204, and a recognition module 205, where:

The framing module 201 is used to acquire a speech segment and divide it into frames to obtain each frame of speech data.

In this embodiment, the computer device 1 is connected to a user terminal, such as a mobile phone, mobile terminal, or PC, and acquires the user's voice information through the user terminal. Of course, in other embodiments, the computer device 1 may also directly provide a pickup unit to collect the user's voice data, which includes at least one speech segment; the framing module 201 can thus obtain a speech segment. After obtaining the speech segment, the framing module 201 further divides it into frames to obtain the speech data of each frame. In addition, owing to the physiological characteristics of the human body, the high-frequency part of a speech segment is often suppressed; therefore, in other embodiments, the framing module 201 also performs pre-emphasis processing on the speech segment to compensate for its high-frequency components.
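The pre-emphasis and framing steps described above can be sketched as follows. The application does not specify a frame length, frame shift, or pre-emphasis coefficient; the 25 ms frames, 10 ms shift, and alpha = 0.97 below are common defaults used purely for illustration.

```python
import numpy as np

def preemphasize(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order pre-emphasis, y[n] = x[n] - alpha * x[n-1],
    boosting the high-frequency components of the speech segment."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D speech segment into overlapping frames; each row of
    the result is one frame of speech data."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    num_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(num_frames)])
```

With a one-second signal at 16 kHz, this produces 98 frames of 400 samples each.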

The windowing module 202 is used to window each frame of speech data of the speech segment in turn according to the preset smooth windowing algorithm, obtaining the windowed speech frames of the speech segment.

Specifically, after the framing module 201 divides the speech segment into frames, the windowing module 202 further windows each frame of speech data of the speech segment. In this embodiment, the windowing module 202 windows each frame of speech data in turn according to the preset smooth windowing algorithm to obtain the windowed speech frames of the speech segment, where the smooth windowing algorithm is:

[Formula image PCTCN2019117761-appb-000001: piecewise definition of the window weighting function w(t)]

where T1 is the duration of the windowed speech frame, w(t) is the weighting value applied to the speech signal at time t within the frame, and K and K′ are constants with K < K′ and K + K′ = 1; K is set according to the environmental noise.

In this embodiment, when windowing each frame of speech data, the computer device 1 first obtains the frequency distribution of the environmental noise in the speech data and automatically adjusts the variable K accordingly, and then applies the window to the frame in segments according to K: the head and tail of the speech frame are windowed with a cosine-like taper to reduce the interference of low-frequency environmental noise, while the middle of the speech frame is windowed with a rectangular-like shape to avoid the high-frequency noise produced by abrupt changes. To adjust K automatically, the computer device 1 may randomly select two frames from the speech segment in advance, convert them to the frequency domain by Fourier transform, detect the frequency distribution of the environmental noise therein, and set KT1 above the maximum frequency of the environmental noise.
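The exact window formula is given only as an image in the published application and is not reproduced here. The sketch below merely illustrates the behavior the text describes (cosine-like tapers at the head and tail of the frame, a flat rectangular-like middle) using a Tukey-style construction in which the taper fraction k plays the role of the constant K; it is an approximation, not the patent's formula.

```python
import numpy as np

def smooth_window(frame_len: int, k: float = 0.25) -> np.ndarray:
    """Illustrative 'smooth' window: raised-cosine tapers over the first
    and last k * frame_len samples, flat (rectangular) in between."""
    taper = int(k * frame_len)
    w = np.ones(frame_len)
    if taper > 0:
        ramp = 0.5 * (1 - np.cos(np.pi * np.arange(taper) / taper))
        w[:taper] = ramp           # cosine-like rise at the frame head
        w[-taper:] = ramp[::-1]    # cosine-like fall at the frame tail
    return w
```

A larger k lengthens the tapers, which is consistent with the text's idea of raising K when the detected environmental noise occupies higher frequencies.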

The extraction module 203 is used to extract the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frames of the speech segment.

Specifically, after the windowing module 202 has windowed all the frames of the speech segment, the extraction module 203 further processes the windowed speech frames of the speech segment to extract the Mel-frequency cepstral coefficient vector (MFCC). In this embodiment, the extraction module 203 first performs a discrete Fourier transform on each windowed speech frame to convert it from the time domain to the frequency domain, and then, according to the formula:

[Formula image PCTCN2019117761-appb-000002: mapping from the linear frequency domain to the Mel frequency domain]

maps the linear spectral domain of the windowed speech frame to the Mel spectral domain. The result is then fed into a bank of Mel triangular filters, the logarithmic energy of the signal output by each filter band is computed to obtain a log-energy sequence, and a discrete cosine transform is applied to the log-energy sequence, thereby extracting the Mel-frequency cepstral coefficient vector (MFCC) of the windowed speech frame.
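The Mel mapping formula also appears only as an image in the published application. The sketch below assumes the standard Mel scale, mel(f) = 2595 * log10(1 + f/700), and uses illustrative choices of 26 triangular filters and 13 cepstral coefficients; none of these values are specified by this application.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters with centers spaced uniformly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):
            fbank[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fbank[i - 1, j] = (right - j) / max(right - center, 1)
    return fbank

def mfcc_from_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """DFT -> Mel filterbank log energies -> DCT-II -> cepstral vector."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft
    log_e = np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ power + 1e-10)
    # DCT-II of the log-energy sequence yields the cepstral coefficients
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct @ log_e
```

Each windowed frame thus yields one 13-dimensional cepstral vector.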

The calculation module 204 is used to calculate the distance between the MFCC and the voiceprint discrimination vector, where the voiceprint discrimination vector is obtained in advance by inputting the user's sampled voice information into a voiceprint feature training model for training. The recognition module 205 is used to determine that the recognition result of the speech segment is a pass when the distance is less than the preset threshold.

Specifically, the computer device 1 samples the user's voice information in advance and inputs the sampled voice information into the voiceprint feature training model for training, thereby obtaining the voiceprint discrimination vector corresponding to the user. Therefore, after the extraction module 203 extracts the MFCC of the speech segment, the calculation module 204 further calculates the distance between the MFCC and the voiceprint discrimination vector. The distance is a cosine distance, calculated by the formula:

[Formula image PCTCN2019117761-appb-000003: cosine distance between vectors x and y]

where x represents the standard voiceprint discrimination vector and y represents the current voiceprint discrimination vector. In this embodiment, the calculation module 204 uses the cosine-distance formula to calculate the distance between the MFCC of the speech segment and the preset voiceprint discrimination vector; the recognition module 205 then compares this distance with the preset threshold, and when the distance is less than the threshold, determines that the recognition result of the speech segment is a pass.

Specifically, the computer device 1 calculates the distance between the MFCC and each of the voiceprint discrimination vectors of different users trained in advance with a GMM, selects the first voiceprint discrimination vector corresponding to the smallest distance below the preset threshold, and takes the first user corresponding to the first voiceprint discrimination vector as the target user of the speech segment.
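A minimal sketch of the distance computation and target-user selection described above. Since the formula image is not reproduced here, the usual cosine-distance convention is assumed (1 minus cosine similarity, so smaller means more similar); the enrolled-vector dictionary and the threshold value are illustrative, not from the patent.

```python
import numpy as np

def cosine_distance(x: np.ndarray, y: np.ndarray) -> float:
    """1 - cosine similarity between two voiceprint vectors."""
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def identify(mfcc: np.ndarray, enrolled: dict, threshold: float = 0.3):
    """Return the enrolled user whose voiceprint discrimination vector is
    closest to `mfcc`, provided that distance is below `threshold`;
    otherwise return None (recognition fails)."""
    best_user, best_dist = None, float("inf")
    for user, vector in enrolled.items():
        d = cosine_distance(mfcc, vector)
        if d < best_dist:
            best_user, best_dist = user, d
    return best_user if best_dist < threshold else None
```

When only one user is enrolled, this reduces to the pass/fail verification of the recognition module 205.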

Of course, in other embodiments, the computer device 1 also pre-trains a high-accuracy GMM (Gaussian mixture model), which serves as a universal background model (UBM) and can be used to extract the voiceprint discrimination vector from speech; the GMM can be trained on a series of sample data to improve the training accuracy of the voiceprint discrimination vector. The training process of the GMM is as follows:

B1. Obtain a preset number (for example, 100,000) of voice data samples, each of which may be collected from the voice of a different person in a different environment (i.e., corresponding to one voiceprint discrimination vector); such voice data samples are used to train a universal background model that can characterize general speech properties.

B2. Process each voice data sample separately to extract the preset type of voiceprint features corresponding to that sample, and construct the voiceprint feature vector corresponding to each sample based on those features;

B3. Divide all the constructed voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 100%;

B4. Use the voiceprint feature vectors in the training set to train the model, and after training is complete, use the validation set to verify the accuracy of the trained model;

B5. If the accuracy is greater than a preset accuracy (for example, 98.5%), the model training ends; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4, and B5 based on the increased voice data samples.
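Steps B2 through B5 can be sketched as follows, with scikit-learn's GaussianMixture standing in for the training model. Because a UBM is trained without labels, held-out average log-likelihood is used here as a stand-in for the patent's "accuracy" check, and the split ratio, component count, and stopping target are all assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features: np.ndarray, train_frac: float = 0.8,
              n_components: int = 8, target_score: float = -40.0,
              max_rounds: int = 3) -> GaussianMixture:
    """Sketch of B2-B5: split the voiceprint feature vectors into a
    training set and a validation set, fit a GMM (the UBM), and check
    it against the validation set, looping if the check fails."""
    rng = np.random.default_rng(0)
    gmm = None
    for _ in range(max_rounds):
        idx = rng.permutation(len(features))          # B3: random split
        split = int(train_frac * len(features))
        train, valid = features[idx[:split]], features[idx[split:]]
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(train)                                # B4: train on training set
        if gmm.score(valid) >= target_score:          # B5: good enough, stop
            return gmm
        # otherwise more samples would be collected and training redone
    return gmm
```

In practice, the pass on step B5 would add freshly collected samples to `features` before retraining rather than merely reshuffling.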

因此，所述计算机设备1先根据训练好的GMM对采集的用户的语音信息进行训练，得到对应的声纹鉴别向量，然后所述计算模块204利用所述声纹鉴别向量计算与所述语音片段对应的MFCC的距离，从而提升精确度。Therefore, the computer device 1 first processes the collected user voice information with the trained GMM to obtain the corresponding voiceprint discrimination vector, and then the calculation module 204 uses the voiceprint discrimination vector to calculate the distance to the MFCC corresponding to the speech segment, thereby improving accuracy.

从上文可知，所述计算机设备1能够获取语音片段之后进行分帧得到每一帧语音数据，然后根据预设的平稳加窗算法对每一帧语音数据进行加窗以得到加窗语音帧；接着，再提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC，并计算所述MFCC与声纹鉴别向量的距离；当所述距离小于预设阈值时，判断所述语音片段的识别结果为通过。通过以上方式，能够在对语音信号进行少量修改的情况下更加精确地计算出语音片段中的特征向量，从而提升语音识别的精度。As can be seen from the above, the computer device 1 can acquire a speech segment and divide it into frames to obtain each frame of speech data, then window each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; next, it extracts the Mel-frequency cepstral feature vector MFCC of the windowed speech frames of the speech segment and calculates the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be passed. In this way, the feature vectors in the speech segment can be computed more accurately with only minor modification of the speech signal, thereby improving the accuracy of speech recognition.

此外,本申请还提出一种语音识别方法,所述方法应用于计算机设备。In addition, this application also proposes a voice recognition method, which is applied to computer equipment.

参阅图3所示,是本申请语音识别方法一实施例的流程示意图。在本实施例中,根据不同的需求,图3所示的流程图中的步骤的执行顺序可以改变,某些步骤可以省略。Refer to FIG. 3, which is a schematic flowchart of an embodiment of a speech recognition method according to the present application. In this embodiment, according to different requirements, the execution order of the steps in the flowchart shown in FIG. 3 can be changed, and some steps can be omitted.

步骤S500,获取语音片段,对所述语音片段进行分帧,得到每一帧语音数据。Step S500: Acquire a voice segment, divide the voice segment into frames, and obtain each frame of voice data.

在本实施例中，所述计算机设备与用户终端，比如手机，移动终端，PC端等设备连接，然后通过用户终端获取用户的语音信息。当然，在其他实施例中，所述计算机设备也可以直接提供拾音器单元采集用户的语音数据，所述语音数据包括至少一个语音片段，因此，所述计算机设备可以获取语音片段。所述计算机设备获取到语音片段之后，则进一步对所述语音片段进行分帧，得到每一帧的语音数据。当然，由于人体的生理特性，语音片段中的高频部分往往被压抑，因此，在其他实施例中，所述计算机设备还会对所述语音片段进行预加重处理，从而补偿语音片段中的高频成分。In this embodiment, the computer device is connected to a user terminal, such as a mobile phone, mobile terminal, or PC, and obtains the user's voice information through the user terminal. Of course, in other embodiments, the computer device may also directly provide a pickup unit to collect the user's voice data, which includes at least one voice segment; therefore, the computer device can obtain a voice segment. After acquiring the voice segment, the computer device further divides it into frames to obtain each frame of voice data. Of course, due to the physiological characteristics of the human body, the high-frequency part of a speech segment is often suppressed; therefore, in other embodiments, the computer device also performs pre-emphasis on the speech segment to compensate for its high-frequency components.
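A minimal sketch of the pre-emphasis and framing described above, for illustration only. The pre-emphasis coefficient (0.97) and the 25 ms frame / 10 ms hop are conventional defaults assumed here, not values stated in the application.

```python
import numpy as np

def preemphasize_and_frame(signal, sr, alpha=0.97, frame_ms=25, hop_ms=10):
    """Pre-emphasis (boost high frequencies) followed by framing.

    alpha, frame_ms and hop_ms are conventional defaults, not values
    taken from the application text.
    """
    # Pre-emphasis: y[t] = x[t] - alpha * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)

    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

sr = 16000
signal = np.random.default_rng(0).normal(size=sr)  # 1 s of fake audio
frames = preemphasize_and_frame(signal, sr)
```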

步骤S502,根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,得到所述语音片段的加窗语音帧。In step S502, each frame of voice data of the voice segment is sequentially windowed according to a preset smooth windowing algorithm to obtain a windowed voice frame of the voice segment.

具体地,所述计算机设备将语音片段分帧之后,进一步对所述语音片段的每一帧语音数据进行加窗。在本实施例中,所述计算机设备根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,然后得到所述语音片段的加窗语音帧。其中,所述平稳加窗算法为:Specifically, after the computer device divides the speech segment into frames, it further performs windowing on each frame of speech data of the speech segment. In this embodiment, the computer device sequentially windows each frame of speech data of the speech segment according to a preset smooth windowing algorithm, and then obtains a windowed speech frame of the speech segment. Wherein, the stable windowing algorithm is:

Figure PCTCN2019117761-appb-000004
Figure PCTCN2019117761-appb-000004

其中，T1为加窗语音帧的时长，w(t)表示在语音帧的时长范围内的t时刻的需对t时刻语音信号进行加窗的加权值，K和K′是常数变量，K<K′且K+K′=1，K是根据环境噪声进行设置的。Among them, T1 is the duration of the windowed speech frame, w(t) represents the weighting value applied when windowing the speech signal at time t within the duration of the speech frame, K and K′ are constant variables, K < K′ and K + K′ = 1, and K is set according to the environmental noise.

在本实施例中，所述计算机设备对每一帧语音数据进行加窗时，首先获取语音数据中的环境噪声的频率分布信息，然后自动调整变量K，再根据变量K对所述分帧进行分段加窗，包括：对于语音帧的帧首和帧尾采用类似余弦波形的加窗，减少低频部分的环境噪声干扰；对于语音帧的中间部分采用类似矩形的加窗，从而避免突发变异产生的高频噪声。其中，对于自动调整变量K的过程，所述计算机设备可以预先随机从所述语音片段中的语音分帧中选择两个语音分帧，然后经傅里叶变换转换到频域，检测其中的环境噪声的频率分布，然后将所述KT1设置在高于所述环境噪声的最大频率的位置。In this embodiment, when the computer device windows each frame of speech data, it first obtains the frequency distribution information of the environmental noise in the speech data, automatically adjusts the variable K accordingly, and then applies segmented windowing to the frame based on K: a cosine-like window is applied to the head and tail of the speech frame to reduce environmental noise interference in the low-frequency part, and a rectangle-like window is applied to the middle of the speech frame to avoid high-frequency noise caused by abrupt transitions. As for automatically adjusting the variable K, the computer device may randomly select two speech frames from the speech segment in advance, convert them to the frequency domain via the Fourier transform, detect the frequency distribution of the environmental noise therein, and then set KT1 above the maximum frequency of the environmental noise.
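The exact windowing formula is in an unreproduced figure, but the behavior described above — a cosine-shaped rise and fall at the frame head and tail with a flat, rectangle-like middle, controlled by K — matches a Tukey-style (tapered cosine) window. The sketch below assumes K is the per-edge taper fraction of the frame; the noise-driven adjustment of K is not modeled.

```python
import numpy as np

def smooth_window(n, K=0.1):
    """Tukey-style window: cosine taper over the first and last K fraction
    of the frame, flat (rectangular) in the middle.

    K is assumed to be the per-edge taper fraction; the application derives
    it from the measured noise spectrum, which this sketch does not model.
    """
    w = np.ones(n)
    taper = int(K * n)
    if taper > 0:
        ramp = 0.5 * (1 - np.cos(np.pi * np.arange(taper) / taper))
        w[:taper] = ramp           # cosine rise at the frame head
        w[-taper:] = ramp[::-1]    # cosine fall at the frame tail
    return w

w = smooth_window(400, K=0.1)
windowed_frame = np.random.default_rng(0).normal(size=400) * w
```

`scipy.signal.windows.tukey` provides an equivalent ready-made window if SciPy is available.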

步骤S504,提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC。Step S504: Extract the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment.

具体地,所述计算机设备对所述语音片段的所有语音分帧进行加窗之后,还进一步对所述语音片段的加窗语音帧进行处理,提取梅尔频率倒谱特征向量MFCC。在本实施例中,所述计算机设备首先对加窗语音帧进行离散傅里叶变换,从时域转换到频域;接着再根据公式:Specifically, after the computer device performs windowing on all voice sub-frames of the voice segment, it further processes the windowed voice frames of the voice segment to extract the Mel frequency cepstrum feature vector MFCC. In this embodiment, the computer device first performs discrete Fourier transform on the windowed speech frame, from the time domain to the frequency domain; then according to the formula:

Figure PCTCN2019117761-appb-000005
Figure PCTCN2019117761-appb-000005

将加窗语音帧的线性频谱域映射到梅尔频谱域；最后再输入到一组梅尔三角滤波器组，计算每个频段的滤波器输出的信号对数能量，得到一个对数能量序列；再将所述对数能量序列做离散余弦变换，从而提取出所述加窗语音帧的梅尔频率倒谱特征向量MFCC。Map the linear spectral domain of the windowed speech frame to the Mel spectral domain; then feed the result into a bank of Mel triangular filters and compute the log energy of each filter's output to obtain a log-energy sequence; finally, apply the discrete cosine transform to the log-energy sequence to extract the Mel-frequency cepstral feature vector MFCC of the windowed speech frame.
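A compact sketch of the extraction chain described above (DFT → Mel filterbank → per-band log energy → DCT), for illustration only. The Mel mapping image is not reproduced here, so the standard mapping mel(f) = 2595·log10(1 + f/700) is assumed, as are the filter count (26) and the number of kept coefficients (13).

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_windowed_frames(frames, sr, n_filters=26, n_ceps=13):
    """DFT -> Mel triangular filterbank -> log energy -> DCT, per frame.

    Uses the standard mapping mel(f) = 2595*log10(1 + f/700); the filter
    count and number of kept coefficients are common defaults, not values
    from the application.
    """
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # linear spectrum

    # Triangular Mel filterbank, evenly spaced on the Mel scale.
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_pts = np.linspace(0, mel_max, n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_energy = np.log(power @ fbank.T + 1e-10)           # per-band log energy
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

frames = np.random.default_rng(0).normal(size=(98, 400))   # windowed frames
mfcc = mfcc_from_windowed_frames(frames, sr=16000)
```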

步骤S506,计算所述MFCC与声纹鉴别向量的距离,其中,所述声纹鉴别向量是预先将所述用户的采样语音信息输入到声纹特征训练模型进行训练得到。Step S506: Calculate the distance between the MFCC and the voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training.

步骤S508,当所述距离小于预设阈值时,判断所述语音片段的识别结果为通过。Step S508: When the distance is less than a preset threshold, it is determined that the recognition result of the speech segment is passed.

具体地，所述计算机设备预先对所述用户进行语音信息采样，然后将采样语音信息输入到声纹特征训练模型进行训练，从而获得所述用户对应的声纹鉴别向量。因此，在所述计算机设备提取到所述语音片段的MFCC之后，还会进一步计算所述MFCC与所述声纹鉴别向量的距离。所述距离为余弦距离，所述距离对应的计算公式为：Specifically, the computer device samples the user's voice information in advance and then inputs the sampled voice information into the voiceprint feature training model for training, thereby obtaining the voiceprint discrimination vector corresponding to the user. Therefore, after the computer device extracts the MFCC of the speech segment, it further calculates the distance between the MFCC and the voiceprint discrimination vector. The distance is the cosine distance, and its calculation formula is:

Figure PCTCN2019117761-appb-000006
Figure PCTCN2019117761-appb-000006

其中，x代表标准声纹鉴别向量，y代表当前声纹鉴别向量。在本实施例中，所述计算机设备通过余弦距离公式计算出所述语音片段的MFCC与预设的声纹鉴别向量之间的距离，然后所述计算机设备将所述距离与预先设定的阈值进行比较；当所述距离小于所述阈值时，则判断所述语音片段的识别结果为通过。Among them, x represents the standard voiceprint discrimination vector, and y represents the current voiceprint discrimination vector. In this embodiment, the computer device calculates the distance between the MFCC of the speech segment and the preset voiceprint discrimination vector using the cosine distance formula, and then compares the distance with a preset threshold; when the distance is less than the threshold, the recognition result of the speech segment is determined to be passed.
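The decision step above can be sketched as follows, for illustration only. The cosine formula image is not reproduced, so the usual cosine similarity x·y / (‖x‖‖y‖) is assumed, with distance taken as 1 − similarity; the 0.3 threshold is purely illustrative.

```python
import numpy as np

def cosine_distance(x, y):
    """1 - cosine similarity; the exact convention in the application is in
    an unreproduced figure, so the usual definition is assumed here."""
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def verify(mfcc_vector, enrolled_vector, threshold=0.3):
    """Pass the segment iff its distance to the enrolled voiceprint
    discrimination vector is below the threshold (illustrative value)."""
    return cosine_distance(mfcc_vector, enrolled_vector) < threshold

enrolled = np.array([1.0, 2.0, 3.0])
same = np.array([1.1, 2.1, 2.9])     # close to the enrolled vector
other = np.array([-3.0, 1.0, -1.0])  # far from it
```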

具体地，所述计算机设备预先通过将GMM训练出不同用户的声纹鉴别向量与将所述MFCC分别进行距离计算，从而选择出小于预设阈值且最小的距离所对应的第一声纹鉴别向量，将所述第一声纹鉴别向量对应的第一用户作为所述语音片段对应的目标用户。Specifically, the computer device trains voiceprint discrimination vectors of different users with the GMM in advance and calculates the distance between each of them and the MFCC, then selects the first voiceprint discrimination vector corresponding to the smallest distance that is below the preset threshold, and takes the first user corresponding to the first voiceprint discrimination vector as the target user corresponding to the voice segment.
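The selection step above (pick the enrolled user whose voiceprint discrimination vector yields the smallest distance below the threshold, otherwise reject) can be sketched as follows; the user names and the 0.3 threshold are illustrative assumptions.

```python
import numpy as np

def identify_speaker(mfcc_vector, enrolled, threshold=0.3):
    """Return the enrolled user with the smallest cosine distance below the
    threshold, or None if nobody qualifies. Names are illustrative."""
    def dist(x, y):
        return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    scored = {user: dist(mfcc_vector, vec) for user, vec in enrolled.items()}
    best_user = min(scored, key=scored.get)
    return best_user if scored[best_user] < threshold else None

enrolled = {
    'alice': np.array([1.0, 2.0, 3.0]),
    'bob':   np.array([-1.0, 0.5, -2.0]),
}
speaker = identify_speaker(np.array([1.1, 2.1, 2.9]), enrolled)
```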

当然，在其他实施例中，所述计算机设备还会预先训练一个准确度较高的GMM(Gaussian Mixture Model,高斯混合模型)，其中，所述GMM作为通用背景模型(UBM,Universal Background Model)，可以用于提取语音中的声纹鉴别向量，其中，所述GMM可以经过一系列的样本数据训练，从而能够提升声纹鉴别向量的训练准确度。其中，所述GMM的训练过程如下：Of course, in other embodiments, the computer device also pre-trains a high-accuracy GMM (Gaussian Mixture Model). The GMM serves as a universal background model (UBM, Universal Background Model) and can be used to extract the voiceprint discrimination vector from speech; the GMM can be trained on a series of sample data, improving the training accuracy of the voiceprint discrimination vector. The training process of the GMM is as follows:

B1、获取预设数量(例如,10万个)的语音数据样本，每个语音数据样本可以采集自不同的人在不同环境中的语音(即对应一个声纹鉴别向量)，这样的语音数据样本用来训练能够表征一般语音特性的通用背景模型。B1. Obtain a preset number (for example, 100,000) of voice data samples. Each voice data sample can be collected from the voices of different people in different environments (i.e., each corresponds to one voiceprint discrimination vector); such voice data samples are used to train a universal background model that characterizes general speech properties.

B2、分别对各个语音数据样本进行处理以提取出各个语音数据样本对应的预设类型声纹特征，并基于各个语音数据样本对应的预设类型声纹特征构建各个语音数据样本对应的声纹特征向量；B2. Process each voice data sample to extract its corresponding preset type of voiceprint features, and construct the voiceprint feature vector of each voice data sample based on those preset-type voiceprint features;

B3、将构建出的所有预设类型声纹特征向量分为第一百分比的训练集和第二百分比的验证集，所述第一百分比和第二百分比之和小于或等于100%；B3. Divide all the constructed preset-type voiceprint feature vectors into a training set of a first percentage and a validation set of a second percentage, where the sum of the first percentage and the second percentage is less than or equal to 100%;

B4、利用训练集中的声纹特征向量对所述第一模型进行训练,并在训练完成之后利用验证集对训练的所述第一模型的准确率进行验证;B4. Use the voiceprint feature vectors in the training set to train the first model, and use the verification set to verify the accuracy of the trained first model after the training is completed;

B5、若准确率大于预设准确率(例如,98.5%)，则模型训练结束，否则，增加语音数据样本的数量，并基于增加后的语音数据样本重新执行上述步骤B2、B3、B4、B5。B5. If the accuracy rate is greater than the preset accuracy rate (for example, 98.5%), model training ends; otherwise, increase the number of voice data samples and re-execute steps B2, B3, B4 and B5 above based on the enlarged sample set.

因此，所述计算机设备先根据训练好的GMM对采集的用户的语音信息进行训练，得到对应的声纹鉴别向量，然后利用所述声纹鉴别向量计算与所述语音片段对应的MFCC的距离，从而提升精确度。Therefore, the computer device first processes the collected user voice information with the trained GMM to obtain the corresponding voiceprint discrimination vector, and then uses the voiceprint discrimination vector to calculate the distance to the MFCC corresponding to the speech segment, thereby improving accuracy.

本实施例所提出的语音识别方法能够获取语音片段之后进行分帧得到每一帧语音数据，然后根据预设的平稳加窗算法对每一帧语音数据进行加窗以得到加窗语音帧；接着，再提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC，并计算所述MFCC与声纹鉴别向量的距离；当所述距离小于预设阈值时，判断所述语音片段的识别结果为通过。通过以上方式，能够在对语音信号进行少量修改的情况下更加精确地计算出语音片段中的特征向量，从而提升语音识别的精度。The speech recognition method proposed in this embodiment can acquire a speech segment and divide it into frames to obtain each frame of speech data, then window each frame of speech data according to a preset smooth windowing algorithm to obtain windowed speech frames; next, it extracts the Mel-frequency cepstral feature vector MFCC of the windowed speech frames of the speech segment and calculates the distance between the MFCC and the voiceprint discrimination vector; when the distance is less than a preset threshold, the recognition result of the speech segment is determined to be passed. In this way, the feature vectors in the speech segment can be computed more accurately with only minor modification of the speech signal, thereby improving the accuracy of speech recognition.

上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the foregoing embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the method described in each embodiment of the present application.

以上仅为本申请的优选实施例，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。The above are only the preferred embodiments of this application and do not thereby limit the patent scope of this application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

一种语音识别方法,所述方法包括步骤:A speech recognition method, the method includes the steps: 获取语音片段,对所述语音片段进行分帧,得到每一帧语音数据;Acquiring a voice segment, framing the voice segment to obtain each frame of voice data; 根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,得到所述语音片段的加窗语音帧;Sequentially windowing each frame of speech data of the speech segment according to a preset smooth windowing algorithm to obtain a windowed speech frame of the speech segment; 提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC;Extracting the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment; 计算所述MFCC与声纹鉴别向量的距离,其中,所述声纹鉴别向量是预先将所述用户的采样语音信息输入到声纹特征训练模型进行训练得到;Calculating the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training; 当所述距离小于预设阈值时,判断所述语音片段的识别结果为通过。When the distance is less than the preset threshold, it is determined that the recognition result of the speech segment is passed. 如权利要求1所述的语音识别方法,所述平稳加窗算法为:The speech recognition method according to claim 1, wherein the stable windowing algorithm is:
Figure PCTCN2019117761-appb-100001
Figure PCTCN2019117761-appb-100001
其中，T1为加窗语音帧的时长，w(t)表示在语音帧的时长范围内的t时刻的需对t时刻语音信号进行加窗的加权值，K和K′是常数变量，K<K′且K+K′=1，K是根据环境噪声进行设置的。Among them, T1 is the duration of the windowed speech frame, w(t) represents the weighting value applied when windowing the speech signal at time t within the duration of the speech frame, K and K′ are constant variables, K < K′ and K + K′ = 1, and K is set according to the environmental noise.
如权利要求2所述的语音识别方法,所述方法还包括:The speech recognition method according to claim 2, the method further comprising: 对每一帧语音数据进行加窗时,获取语音数据中的环境噪声的频率分布信息,再根据噪声的最高频率分布调整所述K。When each frame of speech data is windowed, the frequency distribution information of the environmental noise in the speech data is obtained, and then the K is adjusted according to the highest frequency distribution of the noise. 如权利要求2所述的语音识别方法,所述平稳加窗算法包括:The speech recognition method according to claim 2, wherein the smooth windowing algorithm comprises: 对于语音帧频率分布的帧首和帧尾采用余弦波形的加窗,减少低频部分的环境噪声干扰;For the frame start and end of the speech frame frequency distribution, use cosine waveform windowing to reduce the environmental noise interference in the low frequency part; 对于语音帧频率分布的中间部分采用矩形的加窗,从而避免突发变异产生的高频噪声。For the middle part of the frequency distribution of the speech frame, a rectangular window is used to avoid high frequency noise caused by sudden mutation. 如权利要求1所述的语音识别方法,所述声纹特征训练模型为高斯混合模型GMM,所述方法还包括:The speech recognition method according to claim 1, wherein the voiceprint feature training model is a Gaussian mixture model GMM, and the method further comprises: 通过将GMM训练出不同用户的声纹鉴别向量与将所述MFCC分别进行距离计算;The voiceprint identification vectors of different users are trained by GMM and the distance calculation is performed on the MFCC respectively; 选择出小于预设阈值且最小的距离所对应的第一声纹鉴别向量;Selecting the first voiceprint discrimination vector corresponding to the smallest distance smaller than the preset threshold; 将所述第一声纹鉴别向量对应的第一用户作为所述语音片段对应的目标用户。The first user corresponding to the first voiceprint discrimination vector is used as the target user corresponding to the voice segment. 如权利要求1所述的语音识别方法,所述距离为余弦距离,所述距离对应的计算公式为:The speech recognition method according to claim 1, wherein the distance is a cosine distance, and the calculation formula corresponding to the distance is:
Figure PCTCN2019117761-appb-100002
Figure PCTCN2019117761-appb-100002
其中,x代表标准声纹鉴别向量,y代表当前声纹鉴别向量。Among them, x represents the standard voiceprint discrimination vector, and y represents the current voiceprint discrimination vector.
如权利要求1所述的语音识别方法,在所述对所述语音片段进行分帧之前,所述方法还包括:5. The speech recognition method according to claim 1, before said framing said speech segment, said method further comprises: 对所述语音片段进行预加重处理,补偿语音片段中的高频成分。Pre-emphasis is performed on the voice segment to compensate for high frequency components in the voice segment. 一种语音识别装置,所述装置包括:A speech recognition device, the device includes: 分帧模块,用于获取语音片段,对所述语音片段进行分帧,得到每一帧语音数据;The framing module is used to obtain speech fragments, and divide the speech fragments into frames to obtain each frame of speech data; 加窗模块,用于根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,得到所述语音片段的加窗语音帧;The windowing module is configured to sequentially window each frame of voice data of the voice segment according to a preset smooth windowing algorithm to obtain a windowed voice frame of the voice segment; 提取模块,用于提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC;An extraction module for extracting the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment; 计算模块,用于计算所述MFCC与声纹鉴别向量的距离,其中,所述声纹鉴别向量是预先将所述用户的采样语音信息输入到声纹特征训练模型进行训练得到;A calculation module for calculating the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training; 识别模块,用于当所述距离小于预设阈值时,判断所述语音片段的识别结果为通过。The recognition module is configured to determine that the recognition result of the speech segment is passed when the distance is less than a preset threshold. 如权利要求8所述的语音识别装置,所述平稳加窗算法为:8. The speech recognition device of claim 8, wherein the smooth windowing algorithm is:
Figure PCTCN2019117761-appb-100003
Figure PCTCN2019117761-appb-100003
其中，T1为加窗语音帧的时长，w(t)表示在语音帧的时长范围内的t时刻的需对t时刻语音信号进行加窗的加权值，K和K′是常数变量，K<K′且K+K′=1，K是根据环境噪声进行设置的。Among them, T1 is the duration of the windowed speech frame, w(t) represents the weighting value applied when windowing the speech signal at time t within the duration of the speech frame, K and K′ are constant variables, K < K′ and K + K′ = 1, and K is set according to the environmental noise.
如权利要求9所述的语音识别装置，所述加窗模块还用于：The speech recognition device according to claim 9, wherein the windowing module is further used for: 对每一帧语音数据进行加窗时，获取语音数据中的环境噪声的频率分布信息，再根据噪声的最高频率分布调整所述K。When each frame of speech data is windowed, the frequency distribution information of the environmental noise in the speech data is obtained, and then the K is adjusted according to the highest frequency distribution of the noise. 如权利要求8所述的语音识别装置，所述声纹特征训练模型为高斯混合模型GMM，8. The speech recognition device according to claim 8, wherein the voiceprint feature training model is a Gaussian mixture model GMM, 所述计算模块，还用于通过将GMM训练出不同用户的声纹鉴别向量与将所述MFCC分别进行距离计算；The calculation module is further configured to train the voiceprint discrimination vectors of different users through GMM and calculate the distances of the MFCC respectively; 所述识别模块，还用于选择出小于预设阈值且最小的距离所对应的第一声纹鉴别向量；并将所述第一声纹鉴别向量对应的第一用户作为所述语音片段对应的目标用户。The recognition module is further configured to select a first voiceprint identification vector corresponding to a distance smaller than a preset threshold and the smallest distance; and use the first user corresponding to the first voiceprint identification vector as the target user corresponding to the voice segment. 如权利要求8所述的语音识别装置，所述距离为余弦距离，所述距离对应的计算公式为：8. The speech recognition device according to claim 8, wherein the distance is a cosine distance, and the calculation formula corresponding to the distance is:
Figure PCTCN2019117761-appb-100004
Figure PCTCN2019117761-appb-100004
其中,x代表标准声纹鉴别向量,y代表当前声纹鉴别向量。Among them, x represents the standard voiceprint discrimination vector, and y represents the current voiceprint discrimination vector.
一种计算机设备,所述计算机设备包括存储器、处理器,所述存储器上存储有可在所述处理器上运行的计算机可读指令,所述计算机可读指令被所述处理器执行时实现步骤:A computer device, the computer device includes a memory and a processor, the memory stores computer-readable instructions that can run on the processor, and the computer-readable instructions implement steps when executed by the processor : 获取语音片段,对所述语音片段进行分帧,得到每一帧语音数据;Acquiring a voice segment, framing the voice segment to obtain each frame of voice data; 根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,得到所述语音片段的加窗语音帧;Sequentially windowing each frame of speech data of the speech segment according to a preset smooth windowing algorithm to obtain a windowed speech frame of the speech segment; 提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC;Extracting the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment; 计算所述MFCC与声纹鉴别向量的距离,其中,所述声纹鉴别向量是预先将所述用户的采样语音信息输入到声纹特征训练模型进行训练得到;Calculating the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training; 当所述距离小于预设阈值时,判断所述语音片段的识别结果为通过。When the distance is less than the preset threshold, it is determined that the recognition result of the speech segment is passed. 如权利要求13所述的计算机设备,所述平稳加窗算法为:The computer device according to claim 13, wherein the smooth windowing algorithm is:
Figure PCTCN2019117761-appb-100005
Figure PCTCN2019117761-appb-100005
其中，T1为加窗语音帧的时长，w(t)表示在语音帧的时长范围内的t时刻的需对t时刻语音信号进行加窗的加权值，K和K′是常数变量，K<K′且K+K′=1，K是根据环境噪声进行设置的。Among them, T1 is the duration of the windowed speech frame, w(t) represents the weighting value applied when windowing the speech signal at time t within the duration of the speech frame, K and K′ are constant variables, K < K′ and K + K′ = 1, and K is set according to the environmental noise.
如权利要求14所述的计算机设备,所述计算机可读指令被所述处理器执行时还实现步骤:The computer device according to claim 14, wherein the computer-readable instructions when executed by the processor further implement the steps: 对每一帧语音数据进行加窗时,获取语音数据中的环境噪声的频率分布信息,再根据噪声的最高频率分布调整所述K。When each frame of speech data is windowed, the frequency distribution information of the environmental noise in the speech data is obtained, and then the K is adjusted according to the highest frequency distribution of the noise. 如权利要求13所述的计算机设备,所述声纹特征训练模型为高斯混合模型GMM,所述计算机可读指令被所述处理器执行时还实现步骤:The computer device according to claim 13, wherein the voiceprint feature training model is a Gaussian mixture model GMM, and the computer-readable instructions further implement the steps when executed by the processor: 通过将GMM训练出不同用户的声纹鉴别向量与将所述MFCC分别进行距离计算;The voiceprint identification vectors of different users are trained by GMM and the distance calculation is performed on the MFCC respectively; 选择出小于预设阈值且最小的距离所对应的第一声纹鉴别向量;Selecting the first voiceprint discrimination vector corresponding to the smallest distance smaller than the preset threshold; 将所述第一声纹鉴别向量对应的第一用户作为所述语音片段对应的目标用户。The first user corresponding to the first voiceprint discrimination vector is used as the target user corresponding to the voice segment. 
一种非易失性计算机可读存储介质,所述非易失性计算机可读存储介质存储有计算机可读指令,所述计算机可读指令可被至少一个处理器执行,以使所述至少一个处理器执行步骤:A non-volatile computer-readable storage medium, the non-volatile computer-readable storage medium stores computer-readable instructions, the computer-readable instructions can be executed by at least one processor, so that the at least one Processor execution steps: 获取语音片段,对所述语音片段进行分帧,得到每一帧语音数据;Acquiring a voice segment, framing the voice segment to obtain each frame of voice data; 根据预设的平稳加窗算法依次对所述语音片段的每一帧语音数据进行加窗,得到所述语音片段的加窗语音帧;Sequentially windowing each frame of speech data of the speech segment according to a preset smooth windowing algorithm to obtain a windowed speech frame of the speech segment; 提取所述语音片段的加窗语音帧的梅尔频率倒谱特征向量MFCC;Extracting the Mel frequency cepstrum feature vector MFCC of the windowed speech frame of the speech segment; 计算所述MFCC与声纹鉴别向量的距离,其中,所述声纹鉴别向量是预先将所述用户的采样语音信息输入到声纹特征训练模型进行训练得到;Calculating the distance between the MFCC and a voiceprint identification vector, where the voiceprint identification vector is obtained by pre-inputting sampled voice information of the user into a voiceprint feature training model for training; 当所述距离小于预设阈值时,判断所述语音片段的识别结果为通过。When the distance is less than the preset threshold, it is determined that the recognition result of the speech segment is passed. 如权利要求17所述的非易失性计算机可读存储介质,所述平稳加窗算法为:The non-volatile computer-readable storage medium of claim 17, wherein the smooth windowing algorithm is:
Figure PCTCN2019117761-appb-100006
Figure PCTCN2019117761-appb-100006
其中，T1为加窗语音帧的时长，w(t)表示在语音帧的时长范围内的t时刻的需对t时刻语音信号进行加窗的加权值，K和K′是常数变量，K<K′且K+K′=1，K是根据环境噪声进行设置的。Among them, T1 is the duration of the windowed speech frame, w(t) represents the weighting value applied when windowing the speech signal at time t within the duration of the speech frame, K and K′ are constant variables, K < K′ and K + K′ = 1, and K is set according to the environmental noise.
如权利要求18所述的非易失性计算机可读存储介质,所述计算机可读指令被所述处理器执行时还实现步骤:The non-volatile computer-readable storage medium of claim 18, wherein the computer-readable instructions further implement the steps when executed by the processor: 对每一帧语音数据进行加窗时,获取语音数据中的环境噪声的频率分布信息,再根据噪声的最高频率分布调整所述K。When each frame of speech data is windowed, the frequency distribution information of the environmental noise in the speech data is obtained, and then the K is adjusted according to the highest frequency distribution of the noise. 如权利要求17所述的非易失性计算机可读存储介质,所述声纹特征训练模型为高斯混合模型GMM,所述计算机可读指令被所述处理器执行时还实现步骤:The non-volatile computer-readable storage medium according to claim 17, wherein the voiceprint feature training model is a Gaussian Mixture Model GMM, and the computer-readable instructions further implement the steps when executed by the processor: 通过将GMM训练出不同用户的声纹鉴别向量与将所述MFCC分别进行距离计算;The voiceprint identification vectors of different users are trained by GMM and the distance calculation is performed on the MFCC respectively; 选择出小于预设阈值且最小的距离所对应的第一声纹鉴别向量;Selecting the first voiceprint discrimination vector corresponding to the smallest distance smaller than the preset threshold; 将所述第一声纹鉴别向量对应的第一用户作为所述语音片段对应的目标用户。The first user corresponding to the first voiceprint discrimination vector is used as the target user corresponding to the voice segment.
PCT/CN2019/117761 2019-09-16 2019-11-13 Voice recognition method and apparatus, and computer device Ceased WO2021051572A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910871726.5A CN110556126B (en) 2019-09-16 2019-09-16 Speech recognition method and device and computer equipment
CN201910871726.5 2019-09-16

Publications (1)

Publication Number Publication Date
WO2021051572A1 true WO2021051572A1 (en) 2021-03-25

Family

ID=68740361

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117761 Ceased WO2021051572A1 (en) 2019-09-16 2019-11-13 Voice recognition method and apparatus, and computer device

Country Status (2)

Country Link
CN (1) CN110556126B (en)
WO (1) WO2021051572A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744759A (en) * 2021-09-17 2021-12-03 广州酷狗计算机科技有限公司 Tone template customizing method and device, equipment, medium and product thereof
CN115497511A (en) * 2022-10-31 2022-12-20 广州方硅信息技术有限公司 Training and detection method, device, equipment and medium of voice activity detection model
CN115641871A (en) * 2022-09-29 2023-01-24 福建国电风力发电有限公司 Fan blade abnormity detection method based on voiceprint
CN116092457A (en) * 2023-02-03 2023-05-09 上海哔哩哔哩科技有限公司 Audio signal processing method and system
CN117577137A (en) * 2024-01-15 2024-02-20 宁德时代新能源科技股份有限公司 Cutter health assessment method, device, equipment and storage medium
CN119724245A (en) * 2025-02-26 2025-03-28 联通沃音乐文化有限公司 Intelligent call shorthand method, system and medium based on speech recognition
CN120877732A (en) * 2025-09-22 2025-10-31 珠海格力电器股份有限公司 Electrical equipment control method and device, electrical equipment, medium and product

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210829B (en) * 2020-02-19 2024-07-30 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, system, device and computer readable storage medium
CN111508498B (en) * 2020-04-09 2024-01-30 携程计算机技术(上海)有限公司 Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
CN111933153B (en) * 2020-07-07 2024-03-08 北京捷通华声科技股份有限公司 Voice segmentation point determining method and device
CN113098850A (en) * 2021-03-24 2021-07-09 北京嘀嘀无限科技发展有限公司 Voice verification method and device and electronic equipment
CN115240646A (en) * 2022-05-07 2022-10-25 广州博冠信息科技有限公司 Live voice information processing method, device, device and storage medium
CN115129923B (en) * 2022-05-17 2023-10-20 荣耀终端有限公司 Voice searching method, device and storage medium
CN114945099B (en) * 2022-05-18 2024-04-26 广州博冠信息科技有限公司 Voice monitoring method, device, electronic equipment and computer readable medium
CN116129901B (en) * 2022-08-30 2025-07-18 马上消费金融股份有限公司 Speech recognition method, device, electronic equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8160877B1 (en) * 2009-08-06 2012-04-17 Narus, Inc. Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
US20120158809A1 (en) * 2010-12-17 2012-06-21 Toshifumi Yamamoto Compensation Filtering Device and Method Thereof
CN105232064A (en) * 2015-10-30 2016-01-13 科大讯飞股份有限公司 System and method for predicting influence of music on behaviors of driver
CN107527620A (en) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 Electronic device, identity authentication method and computer-readable storage medium
CN109040913A (en) * 2018-08-06 2018-12-18 中国船舶科学研究中心(中国船舶重工集团公司第七0二研究所) Beamforming method for a window-function-weighted electroacoustic transducer transmit array
CN110197657A (en) * 2019-05-22 2019-09-03 大连海事大学 Dynamic sound feature extraction method based on cosine similarity

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993071A (en) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 Electronic device, auth method and storage medium based on vocal print
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
CN110047490A (en) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, equipment and computer readable storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744759A (en) * 2021-09-17 2021-12-03 广州酷狗计算机科技有限公司 Tone template customizing method and device, equipment, medium and product thereof
CN113744759B (en) * 2021-09-17 2023-09-22 广州酷狗计算机科技有限公司 Tone color template customizing method and device, equipment, medium and product thereof
CN115641871A (en) * 2022-09-29 2023-01-24 福建国电风力发电有限公司 Voiceprint-based wind turbine blade anomaly detection method
CN115497511A (en) * 2022-10-31 2022-12-20 广州方硅信息技术有限公司 Training and detection method, device, equipment and medium of voice activity detection model
CN115497511B (en) * 2022-10-31 2025-01-07 广州方硅信息技术有限公司 Training and detecting method, device, equipment and medium for voice activity detection model
CN116092457A (en) * 2023-02-03 2023-05-09 上海哔哩哔哩科技有限公司 Audio signal processing method and system
CN117577137A (en) * 2024-01-15 2024-02-20 宁德时代新能源科技股份有限公司 Cutter health assessment method, device, equipment and storage medium
CN117577137B (en) * 2024-01-15 2024-05-28 宁德时代新能源科技股份有限公司 Cutter health assessment method, device, equipment and storage medium
CN119724245A (en) * 2025-02-26 2025-03-28 联通沃音乐文化有限公司 Intelligent call shorthand method, system and medium based on speech recognition
CN120877732A (en) * 2025-09-22 2025-10-31 珠海格力电器股份有限公司 Electrical equipment control method and device, electrical equipment, medium and product

Also Published As

Publication number Publication date
CN110556126A (en) 2019-12-10
CN110556126B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
WO2021051572A1 (en) Voice recognition method and apparatus, and computer device
CN107527620B (en) Electronic device, identity authentication method and computer-readable storage medium
CN106683680B (en) Speaker recognition method and apparatus, computer equipment and computer readable medium
US9343067B2 (en) Speaker verification
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
WO2021042537A1 (en) Voice recognition authentication method and system
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
WO2019100606A1 (en) Electronic device, voiceprint-based identity verification method and system, and storage medium
JP2019510248A (en) Voiceprint identification method, apparatus and background server
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
WO2018166187A1 (en) Server, identity verification method and system, and a computer-readable storage medium
WO2019136912A1 (en) Electronic device, identity authentication method and system, and storage medium
EP3989217A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
WO2019136911A1 (en) Voice recognition method for updating voiceprint data, terminal device, and storage medium
CN108630208B (en) Server, voiceprint-based identity authentication method and storage medium
WO2019136909A1 (en) Voice living-body detection method based on deep learning, server and storage medium
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
CN109065043B (en) Command word recognition method and computer storage medium
WO2019196305A1 (en) Electronic device, identity verification method, and storage medium
CN113436633A (en) Speaker recognition method, speaker recognition device, computer equipment and storage medium
WO2019218512A1 (en) Server, voiceprint verification method, and storage medium
CN112992174A (en) Voice analysis method and voice recording device thereof
CN114171032A (en) Cross-channel voiceprint model training method, identification method, device and readable medium
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945851

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945851

Country of ref document: EP

Kind code of ref document: A1