
WO2020073839A1 - Voice wake-up method, apparatus and system, and electronic device - Google Patents

Info

Publication number
WO2020073839A1
WO2020073839A1 (application PCT/CN2019/108828; priority CN2019108828W)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
rhyme
signal sequence
voice
wake
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/108828
Other languages
English (en)
Chinese (zh)
Inventor
曹元斌
张智超
风翮
王刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Publication of WO2020073839A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L2015/088 Word spotting

Definitions

  • This specification relates to the field of computer technology, and in particular to a voice wake-up method, apparatus, system, and electronic device.
  • As the basic interaction method of intelligent devices, voice recognition technology plays an increasingly important role.
  • Voice recognition technology involves many aspects, including waking a device through voice commands, controlling the operation of the device, man-machine dialogue with the device, and voice command control of multiple devices.
  • Efficient and accurate voice recognition technology and a fast, convenient wake-up mode are important development directions for smart devices.
  • The main performance bottleneck of custom wake-up is that computing resources on the terminal (terminal device) are limited, and the number of classes output by the core classifier over the voice features directly affects the speed and accuracy of wake-up.
  • The traditional pinyin-granularity classification strategy classifies over the full pinyin syllables of commonly used Chinese characters, with more than 1,200 classes with tones and more than 400 with tones removed, and can achieve an accuracy rate of about 80%.
  • To go further, it is necessary to improve on-device computing performance and perform a large amount of post-processing work.
  • The invention provides a voice wake-up method, apparatus, system, and electronic device that can quickly and accurately identify wake-up words and improve the speed at which the device is woken up.
  • a voice wake-up method including:
  • another voice wake-up method including:
  • a voice wake-up device including:
  • the signal acquisition module is used to acquire the first voice signal
  • the signal recognition module is used to recognize the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal;
  • the signal comparison module is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;
  • the speech recognition module is used to perform automatic speech recognition processing on the full speech signal corresponding to the third rhyme signal sequence in the first speech signal, to determine whether the full speech signal is the speech signal corresponding to the wake-up word.
  • another voice wake-up device including:
  • the signal acquisition module is used to acquire the first voice signal
  • the signal recognition module is used to recognize the vowel signal contained in the first voice signal to obtain the first vowel signal sequence corresponding to the first voice signal;
  • the signal comparison module is used to compare the first vowel signal sequence with the preset second vowel signal sequence of the wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence with the same content as the second vowel signal sequence;
  • the voice recognition module is configured to perform automatic voice recognition processing on the full speech signal corresponding to the third vowel signal sequence in the first voice signal, and determine whether the full speech signal is the speech signal corresponding to the wake-up word.
  • a voice wake-up system including:
  • the server is configured to perform automatic speech recognition processing on the full speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full speech signal is the speech signal corresponding to the wake-up word.
  • a voice wake-up method including:
  • the terminal acquires the first voice signal; recognizes the pinyin rhyme signals included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; and compares the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to extract a third rhyme signal sequence with the same content as the second rhyme signal sequence from the first rhyme signal sequence;
  • the full speech signal corresponding to the third rhyme signal sequence is sent to the server;
  • the server performs automatic speech recognition processing on the full speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full speech signal is the speech signal corresponding to the wake-up word.
  • an electronic device including:
  • a processor, coupled to the memory, is configured to execute the program to:
  • another electronic device including:
  • a processor, coupled to the memory, is configured to execute the program to:
  • The invention provides a voice wake-up method, apparatus, system, and electronic device. After the first voice signal to be recognized is acquired, the pinyin rhyme signals included in the first voice signal are first recognized to obtain the first rhyme signal sequence corresponding to the first voice signal; then, the first rhyme signal sequence is compared with the preset second rhyme signal sequence of the wake-up word to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence; finally, automatic speech recognition is performed on the full speech signal corresponding to the third rhyme signal sequence in the first speech signal to determine whether the full speech signal is the speech signal corresponding to the wake-up word, and thereby whether the first speech signal contains the wake-up word.
  • In this way, the rhyme signals in the speech signal to be recognized are first compared with the rhyme part of the wake-up word to extract the part of the speech signal that matches the rhyme part of the wake-up word; that part of the speech signal is then processed through automatic speech recognition to determine whether it contains the wake-up word, so as to achieve fast and accurate recognition of wake-up words and improve the speed at which the device is woken up.
  • Figure 1 is a schematic logic diagram of the basic flow of voice wake-up;
  • Figure 2 is a schematic diagram of the processing logic of the on-device wake-up engine in the basic flow of voice wake-up;
  • FIG. 3 is a schematic diagram of the processing logic of a wake-up engine according to an embodiment of the present invention;
  • FIG. 4 is a structural diagram of a voice wake-up system according to an embodiment of the present invention;
  • FIG. 5 is a first flowchart of a voice wake-up method according to an embodiment of the present invention;
  • FIG. 6 is a second flowchart of a voice wake-up method according to an embodiment of the present invention;
  • FIG. 7 is a first flowchart of a rhyme class training method according to an embodiment of the present invention;
  • FIG. 8 is a second flowchart of a rhyme class training method according to an embodiment of the present invention;
  • FIG. 9 is a first structural diagram of a voice wake-up device according to an embodiment of the invention;
  • FIG. 10 is a second structural diagram of a voice wake-up device according to an embodiment of the present invention;
  • FIG. 11 is a first structural diagram of a rhyme class training device according to an embodiment of the present invention;
  • FIG. 12 is a second structural diagram of a rhyme class training device according to an embodiment of the present invention;
  • FIG. 13 is a third flowchart of a voice wake-up method according to an embodiment of the present invention;
  • FIG. 14 is a first schematic structural diagram of an electronic device according to an embodiment of the present invention;
  • FIG. 15 is a second schematic structural diagram of an electronic device according to an embodiment of the present invention.
  • After receiving the voice signal, the voice device first performs signal processing (mainly noise reduction and echo cancellation) and feature extraction on the voice signal, thereby converting the original input audio signal into features, that is, the frequency spectrum of the voice, that can be recognized by the wake-up engine on the terminal; the features are then fed into the wake-up engine for comparison and recognition of wake-up words; when a wake-up word is hit, the device instructs the server to execute subsequent instructions, such as playing songs, crosstalk, and so on.
  • The on-device wake-up engine can be considered the core part of performing wake-up.
  • This on-device wake-up engine mainly includes two parts: a classifier and a post-processing part.
  • The classifier is used to convert continuous speech features into different categories. This part of the computation is often the most expensive part of all wake-up work; the number of classes output by the last layer of the neural network usually directly determines the computation scale of the entire network.
  • The traditional hidden Markov model plus deep neural network (HMM-DNN) approach models the probability density functions (PDFs) of phones; production-level availability requires at least 6,000 to 8,000 classes. Classification at pinyin granularity still requires more than 1,200 classes with tones (more than 400 without tones).
  • Second, there is a post-processing part in wake-up word detection. The traditional method detects the entire word: after smoothing the classifier output, a dynamic time warping (DTW) algorithm can be used to recognize whether the speech is the same as the wake-up word, or automatic speech recognition (ASR) technology can be used to recognize whether the speech hits the wake-up word.
  • With the traditional approach the classification network is huge, and high computing performance needs to be configured on the device.
  • The embodiment of the present invention addresses the defect in the prior art that a huge classification network requires higher computing resources on the device in order to perform voice wake-up accurately and quickly.
  • The core idea is to split the core work of voice wake-up into two wake-up word recognition passes.
  • The first wake-up word recognition pass is completed on the terminal. This pass only classifies and recognizes the pinyin rhyme part of the wake-up word, completing a preliminary screening of the speech signal to be recognized. Then, the full speech signal corresponding to the preliminarily selected rhyme signals that match the rhyme signals of the wake-up word is sent to the cloud, and the cloud recognizes the entire speech signal again to determine whether the speech signal hits the wake-up word.
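The two-pass flow described above can be sketched in a few lines. This is a hypothetical illustration, not the patent's implementation: the wake word, its rhyme (final) sequence, and the `cloud_asr` placeholder are all assumptions.

```python
# Hypothetical sketch of the two-pass wake-up flow: the on-device pass
# matches only rhyme (final) sequences; matching segments are forwarded
# to a cloud-side full recognizer. `cloud_asr` stands in for server-side
# ASR and is an assumption, not the patent's API.

WAKE_WORD = "tian mao jing ling"           # assumed example wake word
WAKE_RHYMES = ["ian", "ao", "ing", "ing"]  # rhyme (final) of each syllable

def on_device_pass(rhyme_sequence):
    """Pass 1 (terminal): True if the recognized rhyme sequence contains
    the wake word's rhyme sequence as a contiguous sub-sequence."""
    n = len(WAKE_RHYMES)
    return any(rhyme_sequence[i:i + n] == WAKE_RHYMES
               for i in range(len(rhyme_sequence) - n + 1))

def cloud_asr(speech_segment):
    # Placeholder for the server-side full ASR verification pass.
    return speech_segment == WAKE_WORD

def wake(rhyme_sequence, speech_segment):
    # Cheap on-device screening first; only survivors reach the cloud.
    if not on_device_pass(rhyme_sequence):
        return False
    return cloud_asr(speech_segment)
```

Most non-wake speech is rejected by the first `if`, so the expensive full recognition only runs on candidate segments.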
  • FIG. 3 is a schematic diagram of the processing logic of a wake-up engine according to an embodiment of the present invention, and involves the two parties that perform voice wake-up: a device side (a terminal that can receive and recognize voice, such as a smart speaker) and a cloud side (where a server is provided).
  • On the device side, the speech signal to be recognized first undergoes the first wake-up word recognition pass. This pass only performs rhyme-class recognition on the rhyme signals of the speech signal through a pre-trained classifier; then the recognized rhyme signal sequence is compared with the rhyme part of the wake-up word in post-processing to determine whether the rhyme part of the wake-up word is hit in the voice signal, and the full speech signal that hits the rhyme part of the wake-up word is transmitted to the cloud.
  • On the cloud side, the speech signal to be recognized is the full speech signal whose rhyme signals match the rhyme part of the wake-up word. This recognition pass recognizes the entire voice signal; for example, ASR technology is used to identify whether the voice signal hits the wake-up word.
  • FIG. 4 is a structural diagram of a voice wake-up system provided by an embodiment of the present invention. As shown in FIG. 4, the system includes a terminal 410 and a server 420, where:
  • Terminal 410 includes:
  • a signal acquisition module for acquiring a first voice signal
  • the first voice signal is, for example, a Chinese voice signal
  • the signal recognition module is used for recognizing the pinyin rhyme signal included in the first speech signal to obtain the first rhyme signal sequence corresponding to the first speech signal;
  • the signal comparison module is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;
  • the server 420 includes:
  • the speech recognition module is used to perform automatic speech recognition processing on the full speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full speech signal is the speech signal corresponding to the wake-up word.
  • FIG. 5 is the first flowchart of the voice wake-up method shown in an embodiment of the present invention.
  • the method may be executed by modules deployed in the terminal 410 and the server 420 in FIG. 4.
  • steps S510 to S530 can be executed on the device side (terminal)
  • step S540 can be executed on the cloud (server).
  • the voice wake-up method includes the following steps:
  • The first voice signal may be a voice signal received through the voice device; wake-up word recognition is performed on the voice signal to further wake up the target device.
  • S520 Identify the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal.
  • In Chinese pinyin, each syllable is separated into an initial part and a rhyme part: for example, tian -> t, ian; mao -> m, ao.
  • The initial part (the "voice" part) is a short-lived peak or trough in the signal; essentially all the sustained sound lies in the pinyin rhyme part (the "rhyme part").
  • In traditional triphone modeling, it is often necessary to combine the preceding and following phones to achieve good recognition accuracy.
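The initial/final split in the example above (tian -> t, ian; mao -> m, ao) can be sketched as follows. The initial inventory is the standard pinyin shengmu list; this is an illustrative sketch, not the patent's code, and it ignores tones.

```python
# Split a toneless pinyin syllable into its initial (shengmu) and
# rhyme/final (yunmu), matching the examples in the text.

INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
     "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True)  # try two-letter initials (zh, ch, sh) first

def split_pinyin(syllable):
    for ini in INITIALS:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return ini, syllable[len(ini):]
    # Zero-initial syllable such as "an": the whole syllable is the final.
    return "", syllable
```

For example, `split_pinyin("tian")` yields `("t", "ian")` and `split_pinyin("mao")` yields `("m", "ao")`.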
  • The rhyme signals included in the first speech signal are identified to obtain a first rhyme signal sequence corresponding to the first speech signal.
  • the first rhyme signal sequence includes a time sequence and a rhyme signal located at each time point in the time sequence.
  • The traditional wake-up word recognition method detects the entire word. To reduce the amount of on-device computation, this scheme only recognizes the rhyme of each word on the device; that is, the above first rhyme signal sequence is compared with the preset second rhyme signal sequence of the wake-up word to extract a third rhyme signal sequence with the same content as the second rhyme signal sequence from the first rhyme signal sequence.
  • This comparison filters out the parts of the voice signal whose rhymes differ from the rhyme part of the wake-up word.
  • The advantage of this is that most non-wake-up words are filtered out on the device, and the cloud only needs to do the final verification to recognize the real wake-up word. The computation is thus balanced between the device and the server, which yields a high accuracy rate while avoiding the high latency that a large on-device model would cause.
  • With the above voice wake-up method, after the first voice signal to be recognized is acquired, the pinyin rhyme signals included in the first voice signal are first recognized to obtain the first rhyme signal sequence corresponding to the first voice signal; then, the first rhyme signal sequence is compared with the preset second rhyme signal sequence of the wake-up word to extract a third rhyme signal sequence with the same content as the second rhyme signal sequence from the first rhyme signal sequence; finally, automatic speech recognition processing is performed on the full speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full speech signal is the speech signal corresponding to the wake-up word, and thereby whether the first speech signal contains the wake-up word.
  • In this way, the rhyme signals in the speech signal to be recognized are first compared with the rhyme part of the wake-up word to extract the part of the speech signal that matches the rhyme part of the wake-up word; that part of the speech signal is then processed through automatic speech recognition to determine whether it contains the wake-up word, so as to achieve fast and accurate recognition of wake-up words and improve the speed at which the device is woken up.
  • FIG. 6 is the second flowchart of a voice wake-up method according to an embodiment of the present invention.
  • Compared with the method shown in FIG. 5, a preprocessing step is added, and steps S520 and S530 are refined.
  • the voice wake-up method includes the following steps:
  • S610 Acquire a first voice signal, where the first voice signal is, for example, a Chinese voice signal.
  • The content of step S610 is the same as that of step S510.
  • S620 Perform pre-processing for denoising the first speech signal.
  • the first speech signal may be subjected to pre-processing such as noise reduction and echo cancellation to maximize the retention of the effective signal ratio in the first speech signal.
  • The so-called feature spectrum means that, when performing classification recognition or classification training, the speech signal to be processed needs to be converted into a spectrum signal that meets certain feature requirements.
  • For example, the audio is cut into frame spectrum signals of about 20 ms at a fixed time length, which are used as the feature spectrum for subsequent classification recognition.
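The 20 ms framing step can be sketched as below. The 16 kHz sample rate is an assumption for illustration; a real front end would additionally apply a window function and an FFT to each frame to obtain its spectrum.

```python
# Cut an audio signal into fixed-length frames of about 20 ms, as
# described above. Sample rate and non-overlapping frames are assumptions.

def frame_signal(samples, sample_rate=16000, frame_ms=20):
    frame_len = sample_rate * frame_ms // 1000  # samples per frame (320)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

One second of 16 kHz audio thus yields 50 frames of 320 samples each; production front ends usually also use a hop smaller than the frame length so frames overlap.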
  • S640 Perform classification calculation on the characteristic spectrum of the first speech signal by using a rhyme classifier to obtain a first rhyme signal sequence corresponding to the first speech signal.
  • The rhyme classifier may be a speech classification model generated in advance that only classifies the rhyme signals in the speech signal and outputs the sequence values of the corresponding rhyme signals.
  • Steps S630 to S640 are refinements of the above step S520.
  • The rhyme class training method shown in FIG. 7 may be adopted to train and generate the above-mentioned rhyme classifier.
  • the method includes:
  • The labeled pinyin rhyme signals are used as training samples, and a neural network algorithm together with a connectionist temporal classification joint model algorithm is used to train and generate the rhyme classifier.
  • the training process mainly includes two processing links, one is how to accurately classify the characteristic spectrum signals of different rhyme parts; the other is how to place the classified rhyme parts into the correct position in the speech signal.
  • For example, a neural network algorithm can be used to accurately classify the feature spectrum signals of different rhyme parts, combined with the connectionist temporal classification (CTC) algorithm to lock the correct position of each classified rhyme in the speech signal.
  • These two model algorithms are jointly modeled to generate the rhyme classifier from the training samples.
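As a concrete illustration of what CTC contributes at inference time, the sketch below shows the CTC greedy decoding rule that turns per-frame classifier outputs into a positioned rhyme sequence: repeated frame labels are merged and blank symbols dropped. The blank symbol and the rhyme labels are illustrative assumptions, not the patent's label set.

```python
# Greedy CTC path collapse: merge consecutive duplicates, drop blanks.
# This is how a per-frame label path maps to the final label sequence.

BLANK = "-"

def ctc_collapse(frame_labels):
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out
```

Note the blank between two identical labels is what lets CTC emit a genuinely repeated rhyme, as the second test below shows.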
  • The rhyme class training method shown in FIG. 8 can also be used to train and generate the above-mentioned rhyme classifier.
  • the method includes:
  • The labeled pinyin rhyme signals are used as training samples, and a hidden Markov model plus deep neural network (HMM-DNN) joint model algorithm is used to train and generate the rhyme classifier.
  • the training process mainly includes two processing links, one is how to accurately classify the characteristic spectrum signals of different rhyme parts; the other is how to place the classified rhyme parts into the correct position in the speech signal.
  • Unlike the traditional HMM-DNN classifier, the classifier in this solution is a rhyme classifier that classifies only the pinyin rhyme part.
  • A dynamic time warping algorithm is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word according to the time sequence, so as to extract a third rhyme signal sequence with the same content as the second rhyme signal sequence from the first rhyme signal sequence.
  • Because the durations of the same wake-up word read by different people differ, the dynamic time warping (DTW) algorithm can be used to align the two signal sequences being compared; the comparison is then performed according to the timing correspondence to extract the third rhyme signal sequence with the same content as the second rhyme signal sequence from the first rhyme signal sequence.
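A minimal pure-Python DTW sketch over rhyme label sequences, assuming unit substitution cost: a distance of 0 means the two sequences match up to timing (duration) differences. This illustrates the alignment idea, not the patent's implementation.

```python
# Classic O(n*m) DTW over symbol sequences. Cost 0 for matching rhymes,
# 1 otherwise; repeated labels caused by slow speech align at no cost.

def dtw_distance(seq_a, seq_b):
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match/substitution
    return d[n][m]
```

For instance, a slowly spoken `["ian", "ian", "ao"]` is distance 0 from the template `["ian", "ao"]`, while an unrelated rhyme sequence gets a positive distance and is rejected.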
  • S660 Perform automatic speech recognition processing on the full speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full speech signal is the speech signal corresponding to the wake-up word.
  • Step S660 is the same as step S540.
  • the first speech signal is preprocessed to retain the effective signal ratio in the first speech signal to the greatest extent.
  • the pinyin rhyme signal contained in the first speech signal is recognized through a pre-trained rhyme classifier to obtain a first rhyme signal sequence corresponding to the first speech signal, so as to realize rapid recognition.
  • The neural network algorithm with the connectionist temporal classification joint model algorithm, or the hidden Markov model with the deep neural network joint model algorithm, is used for training and modeling to ensure the accuracy of the trained rhyme classifier.
  • A dynamic time warping algorithm is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to quickly and accurately obtain the third rhyme signal sequence.
  • FIG. 9 is the first structural diagram of a voice wake-up device according to an embodiment of the present invention.
  • The voice wake-up device may be installed in the voice wake-up system shown in FIG. 4 to perform the method steps shown in FIG. 5. It includes:
  • the signal obtaining module 910 is used to obtain a first voice signal
  • the signal recognition module 920 is used for recognizing the pinyin rhyme signal included in the first speech signal to obtain the first rhyme signal sequence corresponding to the first speech signal;
  • the signal comparison module 930 is configured to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;
  • the speech recognition module 940 is configured to perform automatic speech recognition processing on the full speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full speech signal is the speech signal corresponding to the wake-up word.
  • the signal recognition module 920 may include:
  • the feature obtaining unit 101 is used to obtain a feature spectrum of the first voice signal
  • the signal recognition unit 102 is configured to perform classification calculation on the feature spectrum of the first speech signal by using a rhyme classifier to obtain a first rhyme signal sequence corresponding to the first speech signal.
  • the voice wake-up device shown in FIG. 10 may further include:
  • the pre-processing module 103 is used to perform pre-processing for denoising the first speech signal.
  • the above-mentioned signal comparison module 930 may be specifically used for,
  • a dynamic time warping algorithm is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word according to the time sequence, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence.
  • the voice wake-up device shown in FIG. 10 can be used to perform the method steps shown in FIG. 6.
  • the above voice wake-up device may further include:
  • the first spectrum acquisition module 111 is used to acquire the characteristic spectrum of the speech signal used for model training
  • the first signal labeling module 112 is used to label the pinyin rhyme signal in the characteristic spectrum
  • the first training module 113 is configured to use the labeled pinyin rhyme signals as training samples, and use a neural network algorithm together with a connectionist temporal classification joint model algorithm to train and generate the rhyme classifier.
  • the foregoing voice wake-up device may further include:
  • the second spectrum acquisition module 121 is used to acquire the characteristic spectrum of the speech signal used for model training
  • the second signal labeling module 122 is used to label the pinyin rhyme signal in the feature spectrum
  • the second training module 123 is configured to use the marked Pinyin rhyme signal as a training sample, and use a hidden Markov model and a deep neural network joint model algorithm to train and generate the rhyme classifier.
  • The devices shown in FIGS. 11 and 12 can be used to correspondingly execute the method steps shown in FIGS. 7 and 8.
  • With the above voice wake-up device, after the first voice signal to be recognized is acquired, the pinyin rhyme signals included in the first voice signal are first recognized to obtain the first rhyme signal sequence corresponding to the first voice signal; then, the first rhyme signal sequence is compared with the preset second rhyme signal sequence of the wake-up word to extract a third rhyme signal sequence with the same content as the second rhyme signal sequence from the first rhyme signal sequence; finally, automatic speech recognition processing is performed on the full speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full speech signal is the speech signal corresponding to the wake-up word, and thereby whether the first speech signal contains the wake-up word.
  • In this way, the rhyme signals in the speech signal to be recognized are first compared with the rhyme part of the wake-up word to extract the part of the speech signal that matches the rhyme part of the wake-up word; that part of the speech signal is then processed through automatic speech recognition to determine whether it contains the wake-up word, so as to achieve fast and accurate recognition of wake-up words and improve the speed at which the device is woken up.
  • the first speech signal is preprocessed to retain the effective signal ratio in the first speech signal to the greatest extent.
  • the pinyin rhyme signal included in the first speech signal is recognized by a pre-trained rhyme classifier to obtain a first rhyme signal sequence corresponding to the first speech signal, so as to realize rapid recognition.
  • The neural network algorithm with the connectionist temporal classification joint model algorithm, or the hidden Markov model with the deep neural network joint model algorithm, is used for training and modeling to ensure the accuracy of the trained rhyme classifier.
  • a dynamic time warping algorithm is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to quickly and accurately obtain the third rhyme signal sequence.
  • FIG. 13 is the third flowchart of the voice wake-up method shown in an embodiment of the present invention.
  • The method may be executed by modules deployed in the terminal 410 and the server 420 in FIG. 4. Among them, steps S131 to S133 can be executed on the device side (terminal), and step S134 can be executed on the cloud (server).
  • the voice wake-up method includes the following steps:
  • the language type of the first voice signal is not limited, for example, it may be Chinese, English, Japanese, and so on.
  • the first voice signal may be a voice signal received through a voice device, and wake-up word recognition is performed on the voice signal to further wake up the target device.
  • S132 Identify the vowel signal included in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal.
  • Natural speech can be divided into phonological categories, namely vowels and consonants.
  • In Chinese, vowels correspond to the finals (rhymes) in Pinyin and consonants correspond to the initials. English, for example, contains 5 vowels (a, e, i, o, u) and 21 consonants; Japanese contains 5 vowels, represented by the five kana あ, い, う, え, お.
  • the first vowel signal sequence corresponding to the first voice signal can be obtained by identifying the vowel signals contained in a first voice signal of any language type. For example, when the first speech signal is a Chinese speech signal, the first vowel signal sequence corresponding to the first speech signal may be the first rhyme signal sequence in the method shown in FIG. 5.
  • step S530 may be performed, in which the first rhyme signal sequence is compared with the second rhyme signal sequence of the wake-up word, so as to extract, from the first rhyme signal sequence, a third rhyme signal sequence with the same content as the second rhyme signal sequence.
  • S134 Perform automatic voice recognition processing on the full-volume voice signal corresponding to the third vowel signal sequence in the first voice signal to determine whether the full-volume voice signal is the voice signal corresponding to the wake-up word.
  • the full-volume voice signal corresponding to the third vowel signal sequence in the first voice signal consists of all voice signals within the interval of the first voice signal corresponding to the third vowel signal sequence.
  • that is, the full-volume voice signal is the full Pinyin voice signal in the first voice signal that corresponds to the third rhyme signal sequence.
  • the vowel signal included in the first voice signal may specifically be a voice signal corresponding to a vowel in a single syllable included in the language type to which the first voice signal belongs.
  • For Chinese, the vowel signal included in the first speech signal is the speech signal corresponding to the rhyme part (final) of a single character.
  • the voice wake-up device may include all the modules shown in FIG. 9 for performing the method steps shown in FIG. 13, which include:
  • the signal obtaining module 910 is used to obtain a first voice signal
  • the signal recognition module 920 is used to recognize the vowel signal contained in the first voice signal to obtain the first vowel signal sequence corresponding to the first voice signal;
  • the signal comparison module 930 is used to compare the first vowel signal sequence with the preset second vowel signal sequence of the wake word, to extract, from the first vowel signal sequence, a third vowel signal sequence with the same content as the second vowel signal sequence;
  • the voice recognition module 940 is configured to perform automatic voice recognition processing on the full-volume voice signal corresponding to the third vowel signal sequence in the first voice signal, and determine whether the full-volume voice signal is a voice signal corresponding to the wake-up word.
  • the vowel signal included in the first voice signal may be a voice signal corresponding to a vowel in a monosyllable included in the language type to which the first voice signal belongs.
  • the language type to which the first voice signal belongs may include: Chinese, English, Japanese, and so on.
  • the voice wake-up device in this embodiment may perform the method steps shown in FIG. 5.
  • This embodiment provides a voice wake-up system, including:
  • the terminal is used to: obtain a first voice signal, for example, a Chinese voice signal; identify the Pinyin rhyme signal included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence with the same content as the second rhyme signal sequence; and send the full Pinyin speech signal corresponding to the third rhyme signal sequence to the server;
  • the server is configured to perform automatic speech recognition processing on the full Pinyin speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether it is the speech signal corresponding to the wake-up word.
  • this embodiment also provides a voice wake-up method, that is, the voice wake-up method is described from the execution flow on both sides of the terminal and the server.
  • the method includes:
  • the terminal acquires a first voice signal, for example, a Chinese voice signal; recognizes the Pinyin rhyme signal included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compares the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence with the same content as the second rhyme signal sequence; and sends the full Pinyin speech signal corresponding to the third rhyme signal sequence to the server;
  • the server performs automatic speech recognition processing on the full Pinyin speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determines whether it is the speech signal corresponding to the wake-up word.
  • The first part performs initial wake-word recognition on the terminal side by recognizing the rhyme signal in the first voice signal; the second part, on the server side, performs automatic speech recognition on the full Pinyin speech signal corresponding to the rhyme signal refined by the initial recognition, thereby completing the recognition of whether the speech signal hits the wake-up word.
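The terminal/cloud split described above can be sketched as follows; the JSON message format, the helper names, and the stubbed ASR check are illustrative assumptions, not part of the claimed method:

```python
# Toy sketch of the terminal/server split: the terminal runs the cheap
# rhyme-based pre-filter and ships only the candidate segment to the
# server, which runs the full (here stubbed) ASR confirmation.
import json

WAKE_RHYMES = ["i", "ao"]
WAKE_WORD = "nihao"

def terminal_side(speech):
    """Runs on the device: extract rhymes and locate the candidate span."""
    rhymes = [final for _, final in speech]
    n, m = len(rhymes), len(WAKE_RHYMES)
    for start in range(n - m + 1):
        if rhymes[start:start + m] == WAKE_RHYMES:
            segment = speech[start:start + m]
            return json.dumps({"segment": segment})  # "send" to the server
    return None  # nothing resembling the wake word; stay asleep

def server_side(message):
    """Runs in the cloud: full ASR on the received segment (stubbed)."""
    segment = json.loads(message)["segment"]
    return "".join(syllable for syllable, _ in segment) == WAKE_WORD

msg = terminal_side([("ni", "i"), ("hao", "ao"), ("ma", "a")])
print(msg is not None and server_side(msg))  # True
```

Only the short candidate segment crosses the network, which is why the pre-filter on the terminal reduces both latency and server load.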
  • Embodiment 3 describes the overall architecture of a voice wake-up device.
  • the functions of the device can be implemented by means of an electronic device.
  • As shown in FIG. 14, which is a schematic structural diagram of an electronic device according to an embodiment of the present invention, the electronic device specifically includes a memory 141 and a processor 142.
  • the memory 141 is used to store programs.
  • the memory 141 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method for operating on the electronic device, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 141 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the processor 142, coupled to the memory 141, is used to execute the program in the memory 141 to:
  • Perform automatic speech recognition processing on the full Pinyin speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether it is the speech signal corresponding to the wake-up word.
  • the electronic device may further include: a communication component 143, a power component 144, an audio component 145, a display 146, and other components. Only some components are schematically shown in FIG. 14, which does not mean that the electronic device includes only the components shown in FIG. 14.
  • the communication component 143 is configured to facilitate wired or wireless communication between the electronic device and other devices.
  • Electronic devices can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 143 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 143 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the power supply component 144 provides power for various components of the electronic device.
  • the power component 144 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic devices.
  • the audio component 145 is configured to output and / or input audio signals.
  • the audio component 145 includes a microphone (MIC).
  • When the electronic device is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals.
  • the received audio signal may be further stored in the memory 141 or transmitted via the communication component 143.
  • the audio component 145 further includes a speaker for outputting audio signals.
  • the display 146 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.
  • Embodiment 5 describes the overall architecture of a voice wake-up device.
  • the functions of the device can be implemented by means of an electronic device.
  • As shown in FIG. 15, which is a schematic structural diagram of an electronic device according to an embodiment of the present invention, the electronic device specifically includes a memory 151 and a processor 152.
  • the memory 151 is used to store programs.
  • the memory 151 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method for operating on the electronic device, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 151 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the processor 152, coupled to the memory 151, is used to execute the program in the memory 151 to:
  • Automatic speech recognition processing is performed on the full-volume voice signal corresponding to the third vowel signal sequence in the first voice signal to determine whether the full-volume voice signal is a voice signal corresponding to the wake-up word.
  • the electronic device may further include: a communication component 153, a power component 154, an audio component 155, a display 156, and other components.
  • FIG. 15 only schematically shows some components, which does not mean that the electronic device includes only the components shown in FIG. 15.
  • the communication component 153 is configured to facilitate wired or wireless communication between the electronic device and other devices.
  • Electronic devices can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 153 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 153 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the power supply component 154 provides power for various components of the electronic device.
  • the power supply component 154 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic devices.
  • the audio component 155 is configured to output and / or input audio signals.
  • the audio component 155 includes a microphone (MIC).
  • When the electronic device is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals.
  • the received audio signal may be further stored in the memory 151 or transmitted via the communication component 153.
  • the audio component 155 further includes a speaker for outputting audio signals.
  • the display 156 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.
  • All or part of the steps of the foregoing method embodiments may be completed by a program instructing relevant hardware.
  • the aforementioned program may be stored in a computer-readable storage medium.
  • When the program is executed, the steps of the foregoing method embodiments are performed; the foregoing storage medium includes various media that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention relates to a voice wake-up method, apparatus and system, and an electronic device. The method comprises: obtaining a first voice signal (S510); recognizing a Pinyin rhyme signal included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal (S520); comparing the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence having the same content as the second rhyme signal sequence (S530); and performing automatic speech recognition processing on a full Pinyin speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determining whether the full Pinyin speech signal is a speech signal corresponding to the wake-up word (S540). The method can recognize the wake-up word quickly and accurately and improve the wake-up speed of a device.
PCT/CN2019/108828 2018-10-11 2019-09-29 Procédé, appareil et système de réveil vocal et dispositif électronique Ceased WO2020073839A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811186019.4 2018-10-11
CN201811186019.4A CN111048068B (zh) 2018-10-11 2018-10-11 语音唤醒方法、装置、系统及电子设备

Publications (1)

Publication Number Publication Date
WO2020073839A1 true WO2020073839A1 (fr) 2020-04-16

Family

ID=70164846

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/108828 Ceased WO2020073839A1 (fr) 2018-10-11 2019-09-29 Procédé, appareil et système de réveil vocal et dispositif électronique

Country Status (2)

Country Link
CN (1) CN111048068B (fr)
WO (1) WO2020073839A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782005A (zh) * 2021-01-18 2021-12-10 北京沃东天骏信息技术有限公司 语音识别方法及装置、存储介质及电子设备
US20230054011A1 (en) * 2021-08-20 2023-02-23 Beijing Xiaomi Mobile Software Co., Ltd. Voice collaborative awakening method and apparatus, electronic device and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116259319A (zh) * 2023-03-09 2023-06-13 四川长虹电器股份有限公司 一种降低远场语音误激活的方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08314490A (ja) * 1995-05-23 1996-11-29 Nippon Hoso Kyokai <Nhk> ワードスポッティング型音声認識方法と装置
CN107221325A (zh) * 2016-03-22 2017-09-29 华硕电脑股份有限公司 有向性关键字验证方法以及使用该方法的电子装置
WO2018151772A1 (fr) * 2017-02-14 2018-08-23 Google Llc Fonction hotword côté serveur

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1741131B (zh) * 2004-08-27 2010-04-14 中国科学院自动化研究所 一种非特定人孤立词语音识别方法
CN101819772B (zh) * 2010-02-09 2012-03-28 中国船舶重工集团公司第七○九研究所 一种基于语音分段的孤立词识别方法
CN102208186B (zh) * 2011-05-16 2012-12-19 南宁向明信息科技有限责任公司 汉语语音识别方法
CN102970618A (zh) * 2012-11-26 2013-03-13 河海大学 基于音节识别的视频点播方法
CN103745722B (zh) * 2014-02-10 2017-02-08 上海金牌软件开发有限公司 一种语音交互智能家居系统及语音交互方法
KR101459050B1 (ko) * 2014-05-08 2014-11-12 주식회사 소니스트 단말기의 잠금 설정 및 해제 장치 및 그 방법
CN106847273B (zh) * 2016-12-23 2020-05-05 北京云知声信息技术有限公司 语音识别的唤醒词选择方法及装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08314490A (ja) * 1995-05-23 1996-11-29 Nippon Hoso Kyokai <Nhk> ワードスポッティング型音声認識方法と装置
CN107221325A (zh) * 2016-03-22 2017-09-29 华硕电脑股份有限公司 有向性关键字验证方法以及使用该方法的电子装置
WO2018151772A1 (fr) * 2017-02-14 2018-08-23 Google Llc Fonction hotword côté serveur

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782005A (zh) * 2021-01-18 2021-12-10 北京沃东天骏信息技术有限公司 语音识别方法及装置、存储介质及电子设备
CN113782005B (zh) * 2021-01-18 2024-03-01 北京沃东天骏信息技术有限公司 语音识别方法及装置、存储介质及电子设备
US20230054011A1 (en) * 2021-08-20 2023-02-23 Beijing Xiaomi Mobile Software Co., Ltd. Voice collaborative awakening method and apparatus, electronic device and storage medium
US12008993B2 (en) * 2021-08-20 2024-06-11 Beijing Xiaomi Mobile Software Co., Ltd. Voice collaborative awakening method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN111048068A (zh) 2020-04-21
CN111048068B (zh) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110364143B (zh) 语音唤醒方法、装置及其智能电子设备
CN110718223B (zh) 用于语音交互控制的方法、装置、设备和介质
CN110706690A (zh) 语音识别方法及其装置
US11676625B2 (en) Unified endpointer using multitask and multidomain learning
US9589564B2 (en) Multiple speech locale-specific hotword classifiers for selection of a speech locale
US20150325240A1 (en) Method and system for speech input
WO2020043123A1 (fr) Procédé de reconnaissance d&#39;entité nommée, appareil et dispositif de reconnaissance d&#39;entité nommée et support
CN112102850B (zh) 情绪识别的处理方法、装置、介质及电子设备
CN111210829B (zh) 语音识别方法、装置、系统、设备和计算机可读存储介质
CN110570873B (zh) 声纹唤醒方法、装置、计算机设备以及存储介质
CN111341325A (zh) 声纹识别方法、装置、存储介质、电子装置
CN108305616A (zh) 一种基于长短时特征提取的音频场景识别方法及装置
CN110223687B (zh) 指令执行方法、装置、存储介质及电子设备
WO2014048113A1 (fr) Procédé et dispositif de reconnaissance vocale
CN113611316A (zh) 人机交互方法、装置、设备以及存储介质
CN111128134A (zh) 声学模型训练方法和语音唤醒方法、装置及电子设备
CN112037772A (zh) 基于多模态的响应义务检测方法、系统及装置
CN104538025A (zh) 手势到汉藏双语语音转换方法及装置
CN114120979A (zh) 语音识别模型的优化方法、训练方法、设备及介质
CN110689887B (zh) 音频校验方法、装置、存储介质及电子设备
CN113129867B (zh) 语音识别模型的训练方法、语音识别方法、装置和设备
WO2020073839A1 (fr) Procédé, appareil et système de réveil vocal et dispositif électronique
CN112185357A (zh) 一种同时识别人声和非人声的装置及方法
US11769491B1 (en) Performing utterance detection using convolution
CN115691478A (zh) 语音唤醒方法、装置、人机交互设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19870531

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19870531

Country of ref document: EP

Kind code of ref document: A1