
CN111986653A - Method, apparatus and device for speech intent recognition - Google Patents


Info

Publication number
CN111986653A
CN111986653A
Authority
CN
China
Prior art keywords
pinyin
sample
recognized
phoneme
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010785605.1A
Other languages
Chinese (zh)
Other versions
CN111986653B (en)
Inventor
陈展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202010785605.1A priority Critical patent/CN111986653B/en
Publication of CN111986653A publication Critical patent/CN111986653A/en
Priority to PCT/CN2021/110134 priority patent/WO2022028378A1/en
Application granted granted Critical
Publication of CN111986653B publication Critical patent/CN111986653B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract



The present application provides a speech intent recognition method, apparatus and device. The method includes: determining a to-be-recognized phoneme set according to the to-be-recognized speech; obtaining a to-be-recognized phoneme vector corresponding to the to-be-recognized phoneme set; and inputting the to-be-recognized phoneme vector into a trained target network model, so that the target network model outputs the speech intent corresponding to the to-be-recognized phoneme vector, where the target network model is used to record the mapping relationship between phoneme vectors and speech intents. With the technical solution of the present application, the user's speech intent can be recognized accurately, effectively improving the accuracy of speech intent recognition.


Description

Voice intention recognition method, device and equipment
Technical Field
The present application relates to the field of voice interaction, and in particular, to a method, an apparatus, and a device for recognizing a voice intention.
Background
With the rapid development of artificial intelligence technology and its wide use in daily life, voice interaction has become an important bridge for communication between people and machines. A robot system needs to converse with a user and complete specific tasks, and one of its core technologies is recognition of the speech intent: after the user speaks a to-be-recognized speech to the robot system, the system must judge the user's speech intent from that speech.
In the related art, speech intent recognition comprises two stages: a speech recognition stage and an intent recognition stage. In the speech recognition stage, the speech to be recognized is converted into text by Automatic Speech Recognition (ASR) technology. Then, in the intent recognition stage, the text is semantically understood by Natural Language Processing (NLP) technology to obtain keyword information, and the user's speech intent is recognized based on that keyword information.
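The two-stage, text-based pipeline described above can be sketched roughly as follows. This is an illustrative mock, not the related art's actual implementation: `asr_transcribe` is a hypothetical placeholder standing in for a real ASR engine, and the NLP stage is reduced to simple keyword matching (the intent names and keywords are invented for the example).

```python
# Hypothetical keyword table for the intent recognition stage (invented).
INTENT_KEYWORDS = {
    "turn_on_ac": ["turn on", "air conditioner"],
    "open_door": ["open", "door"],
}

def asr_transcribe(audio):
    # Placeholder for the speech recognition stage; here the audio object
    # is assumed to already carry its reference transcript.
    return audio["transcript"]

def recognize_intent(text):
    # Intent recognition stage: match keyword information against the text.
    for intent, keywords in INTENT_KEYWORDS.items():
        if all(kw in text for kw in keywords):
            return intent
    return "unknown"

audio = {"transcript": "please turn on the air conditioner"}
print(recognize_intent(asr_transcribe(audio)))  # turn_on_ac
```

Any ASR error in stage one (e.g. a mistranscribed keyword) propagates directly into stage two, which is exactly the weakness the application targets.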
In this text-based intent recognition approach, the overall accuracy depends heavily on the accuracy of converting speech into text. When the speech-to-text accuracy is low, the speech intent recognition accuracy is correspondingly low, and the user's speech intent cannot be recognized reliably. For example, the speech may contain the word "trees" (shù, 树), but the converted text may read "numbers" (shù, 数), a Mandarin homophone, resulting in a speech intent recognition error.
Disclosure of Invention
The application provides a voice intention recognition method, which comprises the following steps:
determining a phoneme set to be recognized according to the speech to be recognized;
acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
inputting the phoneme vector to be recognized to a trained target network model so that the target network model outputs a speech intention corresponding to the phoneme vector to be recognized;
wherein the target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
In a possible implementation manner, the phoneme set to be recognized includes a plurality of phonemes to be recognized, and the acquiring the phoneme vector to be recognized corresponding to the phoneme set to be recognized includes:
for each phoneme to be recognized, determining a phoneme characteristic value corresponding to the phoneme to be recognized; and acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized based on the phoneme characteristic value corresponding to each phoneme to be recognized, wherein the phoneme vector to be recognized comprises the phoneme characteristic value corresponding to each phoneme to be recognized.
In a possible implementation, before the inputting the phoneme vector to be recognized to the trained target network model to make the target network model output the speech intention corresponding to the phoneme vector to be recognized, the method further includes: acquiring sample voice and a sample intention corresponding to the sample voice;
determining a sample phoneme set according to the sample voice;
acquiring a sample phoneme vector corresponding to the sample phoneme set;
and inputting the sample phoneme vector and the sample intention into an initial network model, and training the initial network model through the sample phoneme vector and the sample intention to obtain the target network model.
In one possible implementation, the obtaining a sample phoneme vector corresponding to the sample phoneme set includes:
for each sample phoneme, determining a phoneme characteristic value corresponding to the sample phoneme;
and acquiring a sample phoneme vector corresponding to the sample phoneme set based on the phoneme characteristic value corresponding to each sample phoneme, wherein the sample phoneme vector comprises the phoneme characteristic value corresponding to each sample phoneme.
The application provides a voice intention recognition method, which comprises the following steps:
determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be recognized corresponding to the pinyin set to be recognized;
inputting the pinyin vector to be recognized to a trained target network model so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
In a possible implementation manner, the pinyin set to be recognized includes a plurality of pinyins to be recognized, and the obtaining a pinyin vector to be recognized corresponding to the pinyin set to be recognized includes:
for each pinyin to be recognized, determining a pinyin characteristic value corresponding to the pinyin to be recognized; and acquiring a pinyin vector to be recognized corresponding to the pinyin set to be recognized based on the pinyin characteristic value corresponding to each pinyin to be recognized, wherein the pinyin vector to be recognized comprises the pinyin characteristic value corresponding to each pinyin to be recognized.
In one possible implementation, before the inputting the pinyin vector to be recognized to the trained target network model so that the target network model outputs the voice intention corresponding to the pinyin vector to be recognized, the method further includes: acquiring sample voice and a sample intention corresponding to the sample voice;
determining a sample pinyin set according to the sample voice;
acquiring a sample pinyin vector corresponding to the sample pinyin set;
and inputting the sample pinyin vector and the sample intention to an initial network model, and training the initial network model through the sample pinyin vector and the sample intention to obtain the target network model.
In one possible implementation, the obtaining a sample pinyin vector corresponding to the sample pinyin set includes:
aiming at each sample pinyin, determining a pinyin characteristic value corresponding to the sample pinyin;
and acquiring a sample pinyin vector corresponding to the sample pinyin set based on the pinyin characteristic value corresponding to each sample pinyin, wherein the sample pinyin vector comprises the pinyin characteristic value corresponding to each sample pinyin.
The present application provides a voice intention recognition apparatus, the apparatus including:
the determining module is used for determining a phoneme set to be recognized according to the speech to be recognized;
the acquisition module is used for acquiring the phoneme vector to be recognized corresponding to the phoneme set to be recognized;
the processing module is used for inputting the phoneme vector to be recognized to a trained target network model so as to enable the target network model to output a voice intention corresponding to the phoneme vector to be recognized;
wherein the target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
The present application provides a voice intention recognition apparatus, the apparatus including:
the determining module is used for determining a pinyin set to be recognized according to the voice to be recognized;
the acquisition module is used for acquiring the pinyin vector to be recognized corresponding to the pinyin set to be recognized;
the processing module is used for inputting the pinyin vector to be recognized to a trained target network model so as to enable the target network model to output a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
The present application provides a speech intention recognition apparatus including: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute machine executable instructions to perform the steps of:
determining a phoneme set to be recognized according to the speech to be recognized;
acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
inputting the phoneme vector to be recognized to a trained target network model so that the target network model outputs a speech intention corresponding to the phoneme vector to be recognized;
wherein, the target network model is used for recording the mapping relation between the phoneme vector and the voice intention;
or determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be recognized corresponding to the pinyin set to be recognized;
inputting the pinyin vector to be recognized to a trained target network model so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
According to the technical scheme, the speech intent is recognized based on the phonemes to be recognized rather than based on text, so the approach does not depend on the accuracy of converting speech into text. A phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed from the pronunciation action, and one action forms one phoneme. As a result, the phonemes to be recognized can be determined from the speech to be recognized with very high accuracy, so the speech intent can be recognized accurately and the recognition accuracy is effectively improved. The intent recognition is therefore more reliable, no large language-model algorithm library for speech recognition is needed, and both performance and memory usage are greatly optimized.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description cover only some embodiments of the present application; those skilled in the art can obtain other drawings from these drawings.
FIG. 1 is a flow chart illustrating a method for recognizing speech intent according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a method for recognizing a speech intent in one embodiment of the present application;
FIG. 3 is a flow chart illustrating a method for recognizing speech intent according to an embodiment of the present application;
FIG. 4 is a flow chart illustrating a method for recognizing a speech intent in one embodiment of the present application;
FIG. 5A is a schematic diagram of a voice intent recognition apparatus according to an embodiment of the present application;
FIG. 5B is a schematic diagram of a voice intent recognition apparatus according to an embodiment of the present application;
FIG. 6 is a hardware configuration diagram of a speech intention recognition apparatus according to an embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein is meant to encompass any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Moreover, depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Before the technical solutions of the present application are introduced, concepts related to the embodiments of the present application are introduced.
Machine learning: machine learning is a way to implement artificial intelligence, and is used to study how a computer simulates or implements human learning behaviors to acquire new knowledge or skills, and reorganize an existing knowledge structure to continuously improve its performance. Deep learning, which is a subclass of machine learning, is a process of modeling a specific problem in the real world using a mathematical model to solve similar problems in the field. The neural network is an implementation of deep learning, and for convenience of description, the structure and function of the neural network are described herein by taking the neural network as an example, and for other subclasses of machine learning, the structure and function of the neural network are similar.
A neural network: the neural network includes, but is not limited to, a Convolutional Neural Network (CNN), a cyclic neural network (RNN), a fully-connected network, and the like, and the structural units of the neural network may include, but are not limited to, a convolutional layer (Conv), a pooling layer (Pool), an excitation layer, a fully-connected layer (FC), and the like, which is not limited thereto.
In practical application, one or more convolution layers, one or more pooling layers, one or more excitation layers, and one or more fully-connected layers may be combined to construct a neural network according to different requirements.
In the convolutional layer, the input data features are enhanced by performing a convolution operation on the input data features using a convolution kernel, the convolution kernel may be a matrix of m × n, the input data features of the convolutional layer are convolved with the convolution kernel, the output data features of the convolutional layer may be obtained, and the convolution operation is actually a filtering process.
In the pooling layer, the input data features (such as the output of the convolutional layer) are subjected to operations of taking the maximum value, taking the minimum value, taking the average value and the like, so that the input data features are sub-sampled by utilizing the principle of local correlation, the processing amount is reduced, the feature invariance is kept, and the operation of the pooling layer is actually a down-sampling process.
In the excitation layer, the input data features can be mapped using an activation function (e.g., a nonlinear function), thereby introducing a nonlinear factor such that the neural network enhances expressive power through a combination of nonlinearities.
The activation function may include, but is not limited to, a ReLU (Rectified Linear Unit) function that is used to set features less than 0 to 0, while features greater than 0 remain unchanged.
In the fully-connected layer, the fully-connected layer is configured to perform fully-connected processing on all data features input to the fully-connected layer, so as to obtain a feature vector, and the feature vector may include a plurality of data features.
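The layer operations described above (convolution as filtering, pooling as down-sampling, ReLU as the excitation function, and the fully-connected layer producing a feature vector) can be illustrated with toy, dependency-free one-dimensional versions; the inputs and kernel below are arbitrary example values, not taken from the patent.

```python
def conv1d(xs, kernel):
    # Slide the kernel over the input; each window is a weighted sum
    # (the filtering process described for the convolutional layer).
    k = len(kernel)
    return [sum(xs[i + j] * kernel[j] for j in range(k))
            for i in range(len(xs) - k + 1)]

def max_pool(xs, size=2):
    # Keep the maximum of each non-overlapping window (sub-sampling).
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

def relu(xs):
    # Set features below 0 to 0; features above 0 pass through unchanged.
    return [max(0, x) for x in xs]

def fully_connected(xs, weights):
    # One output per weight row: a weighted sum over all input features.
    return [sum(w * x for w, x in zip(row, xs)) for row in weights]

feats = conv1d([1, -2, 3, -4, 5, -6], kernel=[1, 1])   # [-1, 1, -1, 1, -1]
feats = relu(feats)                                    # [0, 1, 0, 1, 0]
feats = max_pool(feats)                                # [1, 1]
print(fully_connected(feats, [[0.5, 0.5]]))            # [1.0]
```

Stacking these four operations in sequence, as in the last four lines, mirrors how one or more of each layer type are combined to construct a neural network.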
Network model: a model constructed using a machine learning algorithm (e.g., a deep learning algorithm), such as a model constructed using a neural network; that is, the network model may be composed of one or more convolutional layers, one or more pooling layers, one or more excitation layers, and one or more fully-connected layers. For the sake of differentiation, the untrained network model is referred to as the initial network model, and the trained network model is referred to as the target network model.
In the training process of the initial network model, sample data is used to train various network parameters in the initial network model, such as convolutional layer parameters (e.g., convolutional kernel parameters), pooling layer parameters, excitation layer parameters, full-link layer parameters, and the like, which is not limited herein. And fitting the initial network model to obtain the mapping relation between input and output by training each network parameter in the initial network model. After the training of the initial network model is completed, the trained initial network model is the target network model, and the voice intention is recognized through the target network model.
Phoneme: phonemes are the smallest units of speech that are divided according to the natural properties of the speech, and are analyzed according to the pronunciation actions in the syllables, with one action constituting a phoneme. For example, the chinese syllable o (a) has only one phoneme (a), ai (ai) has two phonemes (a and i), dai (dai) has three phonemes (d, a and i), and so on. For another example, a Chinese syllable tree (shumu) has five phonemes (s, h, u, m, and u), and so on.
Spelling: pinyin is a combination of more than one phoneme into a compound sound, e.g. a generation (dai) has three phonemes (d, a and i) which make up a pinyin (dai). As another example, a tree (shumu) has five phonemes (s, h, u, m, and u), which constitute two pinyins (shu and mu).
In the related art, the recognition manner of the voice intention includes: a speech recognition stage and an intent recognition stage. In the speech recognition stage, speech recognition is carried out on the speech to be recognized through an automatic speech recognition technology, and the speech to be recognized is converted into a text. In the intention recognition stage, semantic understanding is carried out on the text through a natural language processing technology to obtain keywords, and the voice intention of the user is recognized based on the keywords. In the text-based intention recognition mode, the accuracy depends on the accuracy of converting the voice into the text, and the accuracy of converting the voice into the text is low, so that the accuracy of recognizing the voice intention is low, and the voice intention of the user cannot be recognized accurately.
In view of the above findings, in the embodiment of the present application, the speech intention is recognized based on the phoneme to be recognized, instead of recognizing the speech intention based on the text, so that the accuracy of converting speech into text is not required to be relied on.
The technical solutions of the embodiments of the present application are described below with reference to specific embodiments.
The embodiment of the application provides a voice intention recognition method, which can be applied to a man-machine interaction application scene and is mainly used for controlling equipment according to a voice intention. For example, the method may be applied to any device that needs to be controlled according to a voice intention, such as an access control device, a screen projection device, an IPC (IP Camera), a server, an intelligent terminal, a robot system, an air conditioning device, and the like, without limitation.
The embodiment of the application relates to a training process of an initial network model and an identification process based on a target network model. In the training process of the initial network model, the initial network model can be trained to obtain a trained target network model. In the target network model-based recognition process, the speech intent may be recognized based on the target network model. The training process of the initial network model and the recognition process based on the target network model can be realized in the same equipment or different equipment. For example, the device a performs a training process of an initial network model to obtain a target network model, and recognizes a voice intention based on the target network model. For another example, the device a1 implements a training process of the initial network model to obtain a target network model, deploys the target network model to the device a2, and recognizes the voice intention by the device a2 based on the target network model.
Referring to fig. 1, for a training process of an initial network model, a speech intention recognition method is provided in an embodiment of the present application, where the method may implement training of the initial network model, and the method includes:
step 101, a sample voice and a sample intention corresponding to the sample voice are obtained.
For example, without limitation, a plurality of sample voices may be obtained from the historical data and/or a plurality of sample voices may be received from the user, where the sample voices represent sounds made during speaking. For example, if the voice uttered during speaking is "turn on the air conditioner", the sample voice is "turn on the air conditioner".
For each sample voice, a voice intention corresponding to the sample voice may be obtained, and for the sake of convenience of distinction, the voice intention corresponding to the sample voice may be referred to as a sample intention (i.e., a sample voice intention). For example, if the sample voice is "turn on air conditioner", the sample intention may be "turn on air conditioner".
Step 102, determining a sample phoneme set according to the sample voice.
For example, for each sample speech, a sample phoneme set may be determined from the sample speech, and the sample phoneme set may include a plurality of sample phonemes. Determining the sample phonemes from the sample speech is a process of recognizing each phoneme in the sample speech; for convenience of distinction, each recognized phoneme is referred to as a sample phoneme. The recognition process itself is not limited, as long as the plurality of sample phonemes can be recognized from the sample speech.
For example, for the sample speech "turn on air conditioner", the sample phone set may include the following sample phones "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i".
Step 103, obtaining a sample phoneme vector corresponding to the sample phoneme set.
Illustratively, for each sample phoneme in the sample phoneme set, a phoneme feature value corresponding to the sample phoneme is determined, and a sample phoneme vector corresponding to the sample phoneme set is obtained based on the phoneme feature value corresponding to each sample phoneme, wherein the sample phoneme vector includes the phoneme feature value corresponding to each sample phoneme.
For example, the mapping relationship between each of all phonemes and the phoneme characteristic value is maintained in advance, and assuming that there are 50 phonemes in total, the mapping relationship between phoneme 1 and phoneme characteristic value 1, the mapping relationship between phoneme 2 and phoneme characteristic value 2, …, and the mapping relationship between phoneme 50 and phoneme characteristic value 50 may be maintained.
In step 103, for each sample phoneme in the sample phoneme set, the mapping relationship is queried to obtain a phoneme feature value corresponding to the sample phoneme, and the phoneme feature values corresponding to each sample phoneme in the sample phoneme set are combined to obtain the sample phoneme vector.
For example, for the sample phoneme set "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i", the sample phoneme vector is a 15-dimensional feature vector that sequentially contains the phoneme feature values corresponding to "b", "a", "k", "o", "n", "g", "t", "i", "a", "o", "d", "a", "k", "a", and "i".
In a possible implementation, all phonemes may be ordered. Assuming there are 50 phonemes in total, their sequence numbers are 1-50, and the phoneme feature value corresponding to each phoneme may be a 50-bit value. If the sequence number of a phoneme is M, then in the phoneme feature value corresponding to that phoneme, the value of the Mth bit is a first value, and the values of all bits other than the Mth bit are second values. For example, for the phoneme with sequence number 1, the 1st bit takes the first value and bits 2 to 50 take the second value; for the phoneme with sequence number 2, the 2nd bit takes the first value and bits 1 and 3 to 50 take the second value; and so on.
In summary, for the sample phoneme set "b, a, k, o, n, g, t, i, a, o, d, a, k, a, i", the sample phoneme vector may be a 15 × 50 dimensional feature vector; the feature vector includes 15 rows and 50 columns, each row representing the phoneme feature value corresponding to one phoneme, and details thereof are not repeated.
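The construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the phoneme numbering is hypothetical, and only the phonemes appearing in the example are indexed (first value = 1, second value = 0).

```python
# Sketch of the one-hot phoneme encoding described above. Assumptions:
# a 50-phoneme inventory with hypothetical sequence numbers, first value = 1,
# second value = 0.
def build_phoneme_vector(phoneme_set, phoneme_index, num_phonemes=50):
    """Map each phoneme to its 50-bit feature value and stack the rows."""
    vector = []
    for phoneme in phoneme_set:
        m = phoneme_index[phoneme]      # sequence number M (1-based)
        row = [0] * num_phonemes
        row[m - 1] = 1                  # the Mth bit is the first value
        vector.append(row)
    return vector

# Hypothetical numbering covering only the phonemes of the example.
phoneme_index = {p: i + 1 for i, p in enumerate("abdgiknot")}
sample_set = ["b", "a", "k", "o", "n", "g", "t", "i", "a", "o",
              "d", "a", "k", "a", "i"]
sample_vector = build_phoneme_vector(sample_set, phoneme_index)
# sample_vector has 15 rows of 50 bits, one row per phoneme in the set.
```

Each row contains exactly one first value, so the stacked result is the 15 × 50 feature vector described above.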
In the above embodiment, the first value and the second value may be configured empirically, and are not limited thereto, for example, the first value is 1, and the second value is 0, or the first value is 0 and the second value is 1, or the first value is 255 and the second value is 0, or the first value is 0 and the second value is 255.
Step 104, inputting the sample phoneme vector and a sample intention corresponding to the sample phoneme vector into the initial network model, and training the initial network model through the sample phoneme vector and the sample intention to obtain a trained target network model. Illustratively, since the initial network model is trained by using the sample phoneme vector and the sample intention (i.e., the sample voice intention), a trained target network model is obtained; therefore, the target network model can be used to record the mapping relationship between the phoneme vector and the voice intention.
Referring to the above embodiment, a large number of sample voices may be obtained, and for each sample voice, the sample intention corresponding to the sample voice and the sample phoneme vector corresponding to the sample phoneme set of the sample voice are obtained; that is, a sample phoneme vector and a sample intention are obtained for each sample voice (the sample intention participates in training as label information of the sample phoneme vector). Based on this, a large number of sample phoneme vectors and the sample intention (i.e., label information) corresponding to each sample phoneme vector can be input to the initial network model, so that each network parameter in the initial network model is trained by using the sample phoneme vectors and the sample intentions; the training process is not limited herein. After the initial network model training is completed, the initial network model that has completed training is the target network model.
For example, a large number of sample phoneme vectors and sample intentions may be input to a first network layer of the initial network model, the first network layer processes the data to obtain output data of the first network layer, the output data of the first network layer is input to a second network layer of the initial network model, and so on until the data is input to a last network layer of the initial network model, the last network layer processes the data to obtain output data, and the output data is recorded as a target feature vector.
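The layer-by-layer pass described above can be sketched as follows; the toy layers are assumptions for illustration, not the actual network layers of the model:

```python
def model_forward(layers, phoneme_vector):
    """Sketch of the pass described above: each network layer processes the
    output of the previous layer, and the last layer's output is recorded
    as the target feature vector. 'layers' is any list of callables."""
    data = phoneme_vector
    for layer in layers:
        data = layer(data)      # output of this layer feeds the next layer
    return data                 # target feature vector

# Toy two-layer "model" for illustration only.
target = model_forward([lambda x: [v * 2 for v in x],
                        lambda x: [v + 1 for v in x]], [1, 2, 3])
# target == [3, 5, 7]
```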
After the target feature vector is obtained, whether the initial network model has converged is determined based on the target feature vector. And if the initial network model is converged, determining the converged initial network model as a trained target network model, and finishing the training process of the initial network model. And if the initial network model is not converged, adjusting the network parameters of the initial network model which is not converged to obtain the adjusted initial network model.
Based on the adjusted initial network model, a large number of sample phoneme vectors and sample intentions can be input to the adjusted initial network model, so that the adjusted initial network model is retrained; for the specific training process, refer to the above embodiments, which is not repeated herein. This process is repeated until the initial network model converges, and the converged initial network model is determined as the trained target network model.
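As an illustration only (not the patent's actual model or training procedure), the train-adjust-retrain loop can be sketched with a single-layer softmax classifier on stand-in data; all sizes, the learning rate, and the convergence threshold are assumptions:

```python
import numpy as np

# Illustrative sketch: stand-in "sample phoneme vectors" (flattened 15 x 50
# features) and stand-in sample intentions as labels. Not the actual network.
rng = np.random.default_rng(0)
num_features, num_intents = 15 * 50, 3
X = rng.random((20, num_features))         # stand-in sample phoneme vectors
y = rng.integers(0, num_intents, size=20)  # stand-in sample intentions

W = np.zeros((num_features, num_intents))  # network parameters to train
for step in range(200):
    logits = X @ W                         # forward pass through the model
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)      # output probabilities per intention
    loss = -np.log(p[np.arange(len(y)), y]).mean()
    if loss <= 0.05:                       # converged: loss not above threshold
        break
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1.0
    W -= 0.01 * (X.T @ grad) / len(y)      # adjust network parameters, retrain
```

The loop mirrors the description: process the inputs, compute a loss value, stop when the model has converged, otherwise adjust the network parameters and train again.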
In the above embodiments, determining whether the initial network model has converged based on the target feature vector may include, but is not limited to, the following: a loss function is constructed in advance; it is not limited herein and may be set according to experience. After the target feature vector is obtained, the loss value of the loss function may be determined according to the target feature vector, for example, by substituting the target feature vector into the loss function. After the loss value of the loss function is obtained, whether the initial network model has converged is determined according to the loss value.
For example, whether the initial network model has converged may be determined according to a single loss value: a loss value 1 is obtained based on the target feature vector; if the loss value 1 is not greater than a threshold, it is determined that the initial network model has converged, and if the loss value 1 is greater than the threshold, it is determined that the initial network model has not converged.
Alternatively, whether the initial network model has converged can be determined according to a plurality of loss values from a plurality of iterations. For example, in each iteration, the initial network model of the previous iteration is adjusted to obtain an adjusted initial network model, and a loss value is obtained in each iteration. A change amplitude curve of the plurality of loss values is determined; if it is determined from the change amplitude curve that the change amplitude of the loss values is stable (the loss values of several consecutive iterations do not change, or change only slightly), and the loss value of the last iteration is not greater than a threshold, it is determined that the initial network model of the last iteration has converged. Otherwise, it is determined that the initial network model of the last iteration has not converged; the next iteration is then performed to obtain the loss value of the next iteration, and the change amplitude curve of the loss values is re-determined.
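The change-amplitude check described above can be sketched as follows; the window size, amplitude tolerance, and loss threshold are assumptions chosen for illustration:

```python
def has_converged(loss_history, window=5, amplitude=1e-3, threshold=0.1):
    """Sketch of the convergence test described above: converged when the loss
    values of several consecutive iterations are stable (small change
    amplitude) AND the last loss value is not greater than the threshold."""
    if len(loss_history) < window:
        return False
    recent = loss_history[-window:]
    stable = max(recent) - min(recent) <= amplitude   # change amplitude is small
    return stable and loss_history[-1] <= threshold
```

For example, a history that has flattened out near 0.09 is reported as converged, while a history that is stable but still above the threshold is not.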
In practical applications, it may also be determined whether the initial network model has converged in other manners, which is not limited in this respect. For example, if the iteration number reaches a preset number threshold, it is determined that the initial network model has converged; for another example, if the iteration duration reaches the preset duration threshold, it is determined that the initial network model has converged.
In summary, the initial network model may be trained through the sample phoneme vector and the sample intention corresponding to the sample phoneme vector, so as to obtain the trained target network model.
Referring to fig. 2, for a recognition process based on a target network model, a speech intention recognition method is provided in the embodiment of the present application, and the method can implement recognition of a speech intention, and includes:
step 201, determining a phoneme set to be recognized according to the speech to be recognized.
For example, after the speech to be recognized is obtained, a phoneme set to be recognized may be determined according to the speech to be recognized, and the phoneme set to be recognized may include a plurality of phonemes to be recognized. The process of determining the phonemes to be recognized according to the speech to be recognized is a process of recognizing each phoneme from the speech to be recognized; for convenience of distinction, each recognized phoneme is referred to as a phoneme to be recognized. Thus, a plurality of phonemes to be recognized may be recognized from the speech to be recognized, and this recognition process is not limited as long as the plurality of phonemes to be recognized can be recognized. For example, for the speech to be recognized "turn on the air conditioner", the phoneme set to be recognized may include the following phonemes to be recognized: "k, a, i, k, o, n, g, t, i, a, o".
Step 202, obtaining a phoneme vector to be recognized corresponding to the phoneme set to be recognized. Illustratively, for each phoneme to be recognized in the phoneme set to be recognized, a phoneme feature value corresponding to the phoneme to be recognized is determined, and a phoneme vector to be recognized corresponding to the phoneme set to be recognized is obtained based on the phoneme feature value corresponding to each phoneme to be recognized, wherein the phoneme vector to be recognized comprises the phoneme feature value corresponding to each phoneme to be recognized.
For example, the mapping relationship between each of all phonemes and the phoneme characteristic value is maintained in advance, and assuming that there are 50 phonemes in total, the mapping relationship between phoneme 1 and phoneme characteristic value 1, the mapping relationship between phoneme 2 and phoneme characteristic value 2, …, and the mapping relationship between phoneme 50 and phoneme characteristic value 50 may be maintained.
In step 202, for each phoneme to be recognized in the phoneme set to be recognized, by querying the mapping relationship, a phoneme feature value corresponding to the phoneme to be recognized may be obtained, and the phoneme feature values corresponding to each phoneme to be recognized in the phoneme set to be recognized are combined to obtain the phoneme vector to be recognized.
In a possible implementation, all phonemes may be ordered. Assuming that there are 50 phonemes in total, the sequence numbers of the 50 phonemes are 1-50 respectively, and the phoneme feature value corresponding to each phoneme may be a 50-bit value. Assuming that the sequence number of a phoneme is M, then in the phoneme feature value corresponding to that phoneme, the value of the Mth bit is a first value, and the values of the bits other than the Mth bit are second values. For example, in the phoneme feature value corresponding to the phoneme with sequence number 1, the value of the 1st bit is the first value and the values of the 2nd to 50th bits are the second values; in the phoneme feature value corresponding to the phoneme with sequence number 2, the value of the 2nd bit is the first value and the values of the 1st bit and the 3rd to 50th bits are the second values, and so on.
Step 203, inputting the phoneme vector to be recognized to the trained target network model, so that the target network model outputs the speech intention corresponding to the phoneme vector to be recognized. For example, the target network model is used to record a mapping relationship between the phoneme vector and the speech intent, and after the phoneme vector to be recognized is input to the target network model, the target network model may output the speech intent corresponding to the phoneme vector to be recognized.
For example, the phoneme vector to be recognized may be input to a first network layer of the target network model, the first network layer processes the phoneme vector to be recognized to obtain output data of the first network layer, the output data of the first network layer is input to a second network layer of the target network model, and so on until the data is input to a last network layer of the target network model, the last network layer processes the data to obtain output data, and the output data is recorded as the target feature vector.
Since the target network model is used for recording the mapping relationship between the phoneme vector and the phonetic intent, after obtaining the target feature vector, the mapping relationship may be queried based on the target feature vector to obtain the phonetic intent corresponding to the target feature vector, where this phonetic intent may be the phonetic intent corresponding to the phoneme vector to be recognized, and the target network model may output the phonetic intent corresponding to the phoneme vector to be recognized.
After the voice intention corresponding to the phoneme vector to be recognized is obtained, the device can be controlled based on the voice intention; the control manner is not limited. For example, when the voice intention is "turn on the air conditioner", the air conditioner is turned on.
In one possible embodiment, when the target network model outputs the voice intention corresponding to the phoneme vector to be recognized, it may also output a probability value corresponding to the voice intention (e.g., a probability value between 0 and 1, which may also be referred to as a confidence). For example, the target network model may output voice intention 1 and a probability value 1 (e.g., 0.8) of voice intention 1, voice intention 2 and a probability value 2 (e.g., 0.1) of voice intention 2, voice intention 3 and a probability value 3 (e.g., 0.08) of voice intention 3, and so on.
Based on the above output data, the voice intention with the largest probability value may be taken as the voice intention corresponding to the phoneme vector to be recognized; for example, voice intention 1, which has the largest probability value, may be taken as the voice intention corresponding to the phoneme vector to be recognized. Alternatively, the voice intention with the largest probability value is first determined, and it is then determined whether the probability value of this voice intention (i.e., the largest probability value) is greater than a preset probability threshold; if so, this voice intention is taken as the voice intention corresponding to the phoneme vector to be recognized, and otherwise, it is not taken as the voice intention corresponding to the phoneme vector to be recognized.
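The selection logic described above can be sketched as follows; the probability threshold of 0.5 is an assumption for illustration:

```python
def select_intent(intent_probs, probability_threshold=0.5):
    """Pick the voice intention with the largest probability value; return
    None when that value is not greater than the preset probability
    threshold (i.e., no intention is taken)."""
    intent, prob = max(intent_probs.items(), key=lambda kv: kv[1])
    return intent if prob > probability_threshold else None

# Output example from the description: voice intention 1 wins with 0.8.
probs = {"voice intention 1": 0.8, "voice intention 2": 0.1,
         "voice intention 3": 0.08}
```

With these probabilities, `select_intent(probs)` returns "voice intention 1"; if every probability value were at or below the threshold, no intention would be selected.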
According to the above technical solution, the voice intention is recognized based on the phonemes to be recognized rather than based on text, so there is no requirement on the accuracy of converting speech into text. Since the accuracy of determining the phonemes to be recognized from the speech to be recognized is high, the accuracy of voice intention recognition is also high; the voice intention of the user can be recognized accurately, and the accuracy of voice intention recognition is effectively improved.
For example, a user utters the speech to be recognized "I want to see a picture of a tree". The phonemes determined by a terminal device (e.g., an IPC, a smartphone, etc.) based on the speech to be recognized are "w, o, x, i, a, n, g, k, a, n, y, o, u, s, h, u, m, u, d, e, z, h, a, o, p, i, a, n"; that is, "tree" corresponds to the phonemes "s, h, u, m, u". The voice intention is thus determined based on the phonemes, and there is no need to decide between the homophones "number" and "tree" (both pronounced "shu") when parsing the speech to be recognized, which avoids determining the voice intention from an incorrectly chosen "number" or "tree". Intention recognition is therefore more reliable, and a large number of language model algorithm libraries for speech recognition are not needed, which brings substantial optimization of performance and memory.
In another implementation manner of the embodiment of the application, the voice intention is recognized based on the pinyin to be recognized rather than based on text, so there is no requirement on the accuracy of converting speech into text.
The technical solutions of the embodiments of the present application are described below with reference to specific embodiments.
The embodiment of the application provides a voice intention recognition method, which can be applied to a man-machine interaction application scene and is mainly used for controlling equipment according to a voice intention. For example, the method may be applied to any device that needs to be controlled according to a voice intention, such as an access control device, a screen projection device, an IPC (IP Camera), a server, an intelligent terminal, a robot system, an air conditioning device, and the like, without limitation.
In the embodiment of the application, the training process of the initial network model and the identification process based on the target network model can be involved. In the training process of the initial network model, the initial network model can be trained to obtain a trained target network model. In the target network model-based recognition process, the speech intent may be recognized based on the target network model. For example, the training process of the initial network model and the recognition process based on the target network model may be implemented in the same device or in different devices.
Referring to fig. 3, for a training process of an initial network model, a speech intention recognition method is provided in the embodiment of the present application, where the method may implement training of the initial network model, and the method includes:
step 301, a sample voice and a sample intention corresponding to the sample voice are obtained.
For example, without limitation, a plurality of sample voices may be obtained from the historical data and/or a plurality of sample voices may be received from the user, where the sample voices represent sounds made during speaking. For example, if the voice uttered during speaking is "turn on the air conditioner", the sample voice is "turn on the air conditioner".
For each sample voice, a voice intention corresponding to the sample voice may be obtained, and for the sake of convenience of distinction, the voice intention corresponding to the sample voice may be referred to as a sample intention (i.e., a sample voice intention). For example, if the sample voice is "turn on air conditioner", the sample intention may be "turn on air conditioner".
Step 302, determining a sample pinyin set according to the sample voice.
For example, for each sample voice, a sample pinyin set may be determined according to the sample voice, the sample pinyin set may include a plurality of sample pinyins, the process of determining a sample pinyin according to the sample voice is a process of identifying each pinyin from the sample voice, and for convenience of distinguishing, each identified pinyin is referred to as a sample pinyin.
For example, for the sample voice "turn on the air conditioner", the sample pinyin set may include the following sample pinyins: "ba", "kong", "tiao", "da", "kai".
Step 303, obtaining a sample pinyin vector corresponding to the sample pinyin set.
Illustratively, for each sample pinyin in the sample pinyin set, a pinyin characteristic value corresponding to the sample pinyin is determined, and a sample pinyin vector corresponding to the sample pinyin set is obtained based on the pinyin characteristic value corresponding to each sample pinyin, wherein the sample pinyin vector includes the pinyin characteristic value corresponding to each sample pinyin.
For example, the mapping relationship between each pinyin in all pinyins and the pinyin characteristic value is maintained in advance, and if there are 400 pinyins in total, the mapping relationship between pinyin 1 and the pinyin characteristic value 1, the mapping relationship between pinyin 2 and the pinyin characteristic value 2, and …, the mapping relationship between pinyin 400 and the pinyin characteristic value 400 can be maintained.
On the basis, in step 303, for each sample pinyin in the sample pinyin set, the pinyin characteristic values corresponding to the sample pinyin can be obtained by querying the mapping relationship, and the pinyin characteristic values corresponding to each sample pinyin in the sample pinyin set are combined to obtain the sample pinyin vector.
For example, for the sample pinyin set "ba", "kong", "tiao", "da", and "kai", the sample pinyin vector may be a 5-dimensional feature vector, and the feature vector may sequentially include a pinyin feature value corresponding to "ba", a pinyin feature value corresponding to "kong", a pinyin feature value corresponding to "tiao", a pinyin feature value corresponding to "da", and a pinyin feature value corresponding to "kai".
In one possible implementation, all pinyins may be sorted. Assuming that there are 400 pinyins in total, the sequence numbers of the 400 pinyins are 1-400 respectively, and the pinyin feature value corresponding to each pinyin may be a 400-bit value. Assuming that the sequence number of a pinyin is N, then in the pinyin feature value corresponding to that pinyin, the value of the Nth bit is a first value, and the values of the bits other than the Nth bit are second values. For example, in the pinyin feature value corresponding to the pinyin with sequence number 1, the value of the 1st bit is the first value and the values of the 2nd to 400th bits are the second values; in the pinyin feature value corresponding to the pinyin with sequence number 2, the value of the 2nd bit is the first value and the values of the 1st bit and the 3rd to 400th bits are the second values, and so on.
In summary, for the sample pinyin set "ba", "kong", "tiao", "da", and "kai", the sample pinyin vector may be a feature vector with 5 × 400 dimensions, the feature vector includes 5 rows and 400 columns, and each row represents a pinyin feature value corresponding to a pinyin, which is not described herein again.
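The 5 × 400 construction can be sketched as follows; the pinyin numbering is hypothetical, and the inventory size of 400 is taken from the description:

```python
import numpy as np

# Sketch of the 5 x 400 sample pinyin vector described above. Assumptions:
# a 400-pinyin inventory, hypothetical sequence numbers, first value = 1,
# second value = 0.
pinyin_index = {"ba": 1, "kong": 2, "tiao": 3, "da": 4, "kai": 5}

def build_pinyin_vector(pinyin_set, index, num_pinyins=400):
    """Each row is the 400-bit pinyin feature value of one sample pinyin."""
    vector = np.zeros((len(pinyin_set), num_pinyins), dtype=np.int8)
    for row, pinyin in enumerate(pinyin_set):
        vector[row, index[pinyin] - 1] = 1   # the Nth bit is the first value
    return vector

sample_vector = build_pinyin_vector(["ba", "kong", "tiao", "da", "kai"],
                                    pinyin_index)
# sample_vector is the 5-row, 400-column feature vector described above.
```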
Step 304, inputting the sample pinyin vector and the sample intention corresponding to the sample pinyin vector to the initial network model, and training the initial network model through the sample pinyin vector and the sample intention to obtain a trained target network model. Illustratively, since the initial network model is trained by using the sample pinyin vector and the sample intention (i.e., the sample voice intention), a trained target network model is obtained, and thus, the target network model can be used for recording the mapping relationship between the pinyin vector and the voice intention.
Referring to the above embodiment, a large number of sample voices may be obtained, and for each sample voice, the sample intention corresponding to the sample voice and the sample pinyin vector corresponding to the sample pinyin set of the sample voice are obtained; that is, a sample pinyin vector and a sample intention are obtained for each sample voice (the sample intention participates in training as label information of the sample pinyin vector). Based on this, a large number of sample pinyin vectors and the sample intention (i.e., label information) corresponding to each sample pinyin vector can be input to the initial network model, so that each network parameter in the initial network model is trained by using the sample pinyin vectors and the sample intentions; the training process is not limited herein. After the initial network model training is completed, the initial network model that has completed training is the target network model.
For example, a large number of sample pinyin vectors and sample intentions may be input to a first network layer of the initial network model, the first network layer processes the data to obtain output data of the first network layer, the output data of the first network layer is input to a second network layer of the initial network model, and so on until the data is input to a last network layer of the initial network model, the last network layer processes the data to obtain output data, and the output data is recorded as a target feature vector.
After the target feature vector is obtained, whether the initial network model has converged is determined based on the target feature vector. And if the initial network model is converged, determining the converged initial network model as a trained target network model, and finishing the training process of the initial network model. And if the initial network model is not converged, adjusting the network parameters of the initial network model which is not converged to obtain the adjusted initial network model.
Based on the adjusted initial network model, a large number of sample pinyin vectors and sample intentions can be input to the adjusted initial network model, so that the adjusted initial network model is retrained; for the specific training process, refer to the above embodiments, which is not repeated herein. This process is repeated until the initial network model converges, and the converged initial network model is determined as the trained target network model.
In the above embodiments, determining whether the initial network model has converged based on the target feature vector may include, but is not limited to, the following: a loss function is constructed in advance; it is not limited herein and may be set according to experience. After the target feature vector is obtained, the loss value of the loss function may be determined according to the target feature vector, for example, by substituting the target feature vector into the loss function. After the loss value of the loss function is obtained, whether the initial network model has converged is determined according to the loss value.
In practical applications, it may also be determined whether the initial network model has converged in other manners, which is not limited in this respect. For example, if the iteration number reaches a preset number threshold, it is determined that the initial network model has converged; for another example, if the iteration duration reaches the preset duration threshold, it is determined that the initial network model has converged.
In summary, the initial network model may be trained through the sample pinyin vector and the sample intention corresponding to the sample pinyin vector, so as to obtain the trained target network model.
Referring to fig. 4, for a recognition process based on a target network model, a speech intention recognition method is provided in the embodiment of the present application, and the method can implement recognition of a speech intention, and includes:
step 401, determining a pinyin set to be recognized according to the voice to be recognized.
For example, after the speech to be recognized is obtained, the pinyin set to be recognized may be determined according to the speech to be recognized, and the pinyin set to be recognized may include a plurality of pinyins to be recognized. The process of determining the pinyins to be recognized according to the speech to be recognized is a process of recognizing each pinyin from the speech to be recognized; for convenience of distinction, each recognized pinyin may be referred to as a pinyin to be recognized. For example, for the speech to be recognized "turn on the air conditioner", the pinyin set to be recognized may include the following pinyins to be recognized: "kai", "kong", "tiao".
Step 402, obtaining the pinyin vector to be identified corresponding to the pinyin set to be identified. Illustratively, a pinyin characteristic value corresponding to the pinyin to be identified is determined for each pinyin to be identified in the pinyin set to be identified, and a pinyin vector to be identified corresponding to the pinyin set to be identified is obtained based on the pinyin characteristic value corresponding to each pinyin to be identified, wherein the pinyin vector to be identified includes the pinyin characteristic value corresponding to each pinyin to be identified.
For example, the mapping relationship between each pinyin in all pinyins and the pinyin characteristic value is maintained in advance, and if there are 400 pinyins in total, the mapping relationship between pinyin 1 and the pinyin characteristic value 1, the mapping relationship between pinyin 2 and the pinyin characteristic value 2, and …, the mapping relationship between pinyin 400 and the pinyin characteristic value 400 can be maintained.
In step 402, for each pinyin to be recognized in the pinyin set to be recognized, by querying the mapping relationship, a pinyin feature value corresponding to the pinyin to be recognized can be obtained, and the pinyin feature values corresponding to each pinyin to be recognized in the pinyin set to be recognized are combined to obtain the pinyin vector to be recognized.
In one possible implementation, all pinyins may be sorted. Assuming that there are 400 pinyins in total, the sequence numbers of the 400 pinyins are 1-400 respectively, and the pinyin feature value corresponding to each pinyin may be a 400-bit value. Assuming that the sequence number of a pinyin is N, then in the pinyin feature value corresponding to that pinyin, the value of the Nth bit is a first value, and the values of the bits other than the Nth bit are second values. For example, in the pinyin feature value corresponding to the pinyin with sequence number 1, the value of the 1st bit is the first value and the values of the 2nd to 400th bits are the second values; in the pinyin feature value corresponding to the pinyin with sequence number 2, the value of the 2nd bit is the first value and the values of the 1st bit and the 3rd to 400th bits are the second values, and so on.
Step 403, inputting the pinyin vector to be recognized to the trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized. Illustratively, the target network model is used for recording the mapping relationship between the pinyin vector and the voice intention, and after the pinyin vector to be recognized is input to the target network model, the target network model can output the voice intention corresponding to the pinyin vector to be recognized.
For example, the pinyin vector to be recognized may be input to a first network layer of the target network model, the first network layer processes the pinyin vector to be recognized to obtain output data of the first network layer, the output data of the first network layer is input to a second network layer of the target network model, and so on until the data is input to a last network layer of the target network model, the last network layer processes the data to obtain output data, and the output data is recorded as the target feature vector.
Because the target network model is used for recording the mapping relation between the pinyin vectors and the voice intentions, after the target feature vectors are obtained, the mapping relation can be inquired based on the target feature vectors to obtain the voice intentions corresponding to the target feature vectors, the voice intentions can be the voice intentions corresponding to the pinyin vectors to be recognized, and the target network model can output the voice intentions corresponding to the pinyin vectors to be recognized.
After the voice intention corresponding to the pinyin vector to be recognized is obtained, the device can be controlled based on the voice intention; the control manner is not limited. For example, when the voice intention is "turn on the air conditioner", the air conditioner is turned on.
In one possible embodiment, when the target network model outputs the voice intention corresponding to the pinyin vector to be recognized, it may also output a probability value corresponding to the voice intention (e.g., a probability value between 0 and 1, which may also be referred to as a confidence). For example, the target network model may output voice intention 1 and a probability value 1 (e.g., 0.8) of voice intention 1, voice intention 2 and a probability value 2 (e.g., 0.1) of voice intention 2, voice intention 3 and a probability value 3 (e.g., 0.08) of voice intention 3, and so on.
Based on the above output data, the voice intention with the largest probability value may be taken as the voice intention corresponding to the pinyin vector to be recognized; for example, voice intention 1, which has the largest probability value, may be taken as the voice intention corresponding to the pinyin vector to be recognized. Alternatively, the voice intention with the largest probability value is first determined, and it is then determined whether the probability value of this voice intention (i.e., the largest probability value) is greater than a preset probability threshold; if so, this voice intention is taken as the voice intention corresponding to the pinyin vector to be recognized, and otherwise, it is not taken as the voice intention corresponding to the pinyin vector to be recognized.
According to the above technical solution, the voice intention is recognized based on the pinyins to be recognized rather than based on text, so there is no requirement on the accuracy of converting speech into text. Since the accuracy of determining the pinyins to be recognized from the speech to be recognized is high, the accuracy of voice intention recognition is also high; the voice intention of the user can be recognized accurately, and the accuracy of voice intention recognition is effectively improved. For example, a user utters the speech to be recognized "I want to see a picture of a tree". The terminal device (e.g., an IPC, a smartphone, etc.) determines the pinyins "wo, xiang, kan, you, shu, mu, de, zhao, pian" based on the speech to be recognized; that is, the pinyin corresponding to "tree" is "shu, mu". The voice intention is thus determined based on the pinyins, and there is no need to decide between the homophones "number" and "tree" (both pronounced "shu") when parsing the speech to be recognized, which avoids determining the voice intention from an incorrectly chosen "number" or "tree". Intention recognition is therefore more reliable, and a large number of language model algorithm libraries for speech recognition are not needed, which brings substantial optimization of performance and memory.
Based on the same application concept as the above method, an embodiment of the present application provides a speech intention recognition apparatus, as shown in fig. 5A, which is a schematic structural diagram of the apparatus, and the apparatus may include:
a determining module 511, configured to determine a phoneme set to be recognized according to a speech to be recognized;
an obtaining module 512, configured to obtain a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
a processing module 513, configured to input the phoneme vector to be recognized to a trained target network model, so that the target network model outputs a speech intention corresponding to the phoneme vector to be recognized;
wherein the target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
In a possible implementation manner, the phoneme set to be recognized includes a plurality of phonemes to be recognized, and the obtaining module 512 is specifically configured to:
determining a phoneme characteristic value corresponding to each phoneme to be recognized; and acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized based on the phoneme characteristic value corresponding to each phoneme to be recognized, wherein the phoneme vector to be recognized comprises the phoneme characteristic value corresponding to each phoneme to be recognized.
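The vector construction described above can be sketched as follows. The feature lookup table, pad value, and fixed input length are assumptions for illustration only; the embodiment does not fix how characteristic values are assigned:

```python
# Hedged sketch of building the vector to be recognized: each phoneme
# (or pinyin) is mapped to a numeric characteristic value through an
# assumed lookup table, and the sequence is padded to a fixed length
# expected by the network model.
FEATURE_TABLE = {"w": 1, "o": 2, "x": 3, "i": 4, "a": 5, "ng": 6}  # hypothetical
PAD_VALUE, INPUT_LEN = 0, 8  # assumed model input length

def to_feature_vector(units):
    values = [FEATURE_TABLE[u] for u in units]
    # Pad to a fixed length so every vector has the same shape.
    return values + [PAD_VALUE] * (INPUT_LEN - len(values))

vec = to_feature_vector(["w", "o", "x", "i", "a", "ng"])
print(vec)  # [1, 2, 3, 4, 5, 6, 0, 0]
```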
In a possible implementation, the determining module 511 is further configured to: acquiring sample voice and a sample intention corresponding to the sample voice; determining a sample phoneme set according to the sample voice; the obtaining module 512 is further configured to: acquiring a sample phoneme vector corresponding to the sample phoneme set; the processing module 513 is further configured to: and inputting the sample phoneme vector and the sample intention into an initial network model, and training the initial network model through the sample phoneme vector and the sample intention to obtain the target network model.
In a possible implementation manner, the sample phoneme set includes a plurality of sample phonemes, and when acquiring a sample phoneme vector corresponding to the sample phoneme set, the obtaining module 512 is specifically configured to:
for each sample phoneme, determining a phoneme characteristic value corresponding to the sample phoneme;
and acquiring a sample phoneme vector corresponding to the sample phoneme set based on the phoneme characteristic value corresponding to each sample phoneme, wherein the sample phoneme vector comprises the phoneme characteristic value corresponding to each sample phoneme.
Based on the same application concept as the above method, an embodiment of the present application provides a speech intention recognition apparatus, as shown in fig. 5B, which is a schematic structural diagram of the apparatus, and the apparatus may include:
the determining module 521 is configured to determine a pinyin set to be recognized according to the speech to be recognized;
an obtaining module 522, configured to obtain a pinyin vector to be identified corresponding to the pinyin set to be identified;
a processing module 523, configured to input the pinyin vector to be recognized to a trained target network model, so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
In a possible implementation manner, the pinyin set to be identified includes multiple pinyins to be identified, and the obtaining module 522 is specifically configured to: for each pinyin to be identified, determine a pinyin characteristic value corresponding to the pinyin to be identified; and acquire a pinyin vector to be identified corresponding to the pinyin set to be identified based on the pinyin characteristic value corresponding to each pinyin to be identified, wherein the pinyin vector to be identified includes the pinyin characteristic value corresponding to each pinyin to be identified.
In a possible implementation, the determining module 521 is further configured to: acquiring sample voice and a sample intention corresponding to the sample voice; determining a sample pinyin set according to the sample voice; the obtaining module 522 is further configured to: acquiring a sample pinyin vector corresponding to the sample pinyin set; the processing module 523 is further configured to: and inputting the sample pinyin vector and the sample intention to an initial network model, and training the initial network model through the sample pinyin vector and the sample intention to obtain the target network model.
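The training pipeline described above can be sketched as follows. Since the initial network model is not specified at this level of detail, a trivial nearest-neighbour classifier stands in for it here, purely to illustrate the (sample pinyin vector, sample intention) training flow; the labels and vectors are hypothetical:

```python
# Hedged training sketch: ToyIntentionModel is a stand-in for the initial
# network model, not the patent's actual architecture. It memorizes the
# (sample vector, sample intention) pairs and predicts by nearest neighbour.

class ToyIntentionModel:
    def fit(self, sample_vectors, sample_intentions):
        self.samples = list(zip(sample_vectors, sample_intentions))
        return self

    def predict(self, vector):
        # Squared Euclidean distance between fixed-length feature vectors.
        def dist(v):
            return sum((a - b) ** 2 for a, b in zip(v, vector))
        return min(self.samples, key=lambda s: dist(s[0]))[1]

# Hypothetical sample pinyin vectors (characteristic values, zero-padded)
# and the sample intentions annotated for them.
X = [[1, 2, 3, 0], [4, 5, 6, 0]]
y = ["play_music", "show_picture"]
model = ToyIntentionModel().fit(X, y)
print(model.predict([1, 2, 3, 0]))  # play_music
```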
In one possible implementation, the sample pinyin set includes a plurality of sample pinyins, and the obtaining module 522 is specifically configured to:
for each sample pinyin, determining a pinyin characteristic value corresponding to the sample pinyin;
and acquiring a sample pinyin vector corresponding to the sample pinyin set based on the pinyin characteristic value corresponding to each sample pinyin, wherein the sample pinyin vector comprises the pinyin characteristic value corresponding to each sample pinyin.
Based on the same application concept as the above method, the embodiment of the present application provides a speech intention recognition apparatus, which is shown in fig. 6 and includes: a processor 61 and a machine-readable storage medium 62, the machine-readable storage medium 62 storing machine-executable instructions executable by the processor 61; the processor 61 is configured to execute machine executable instructions to perform the following steps:
determining a phoneme set to be recognized according to the speech to be recognized;
acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
inputting the phoneme vector to be recognized to a trained target network model so that the target network model outputs a speech intention corresponding to the phoneme vector to be recognized;
wherein, the target network model is used for recording the mapping relation between the phoneme vector and the voice intention;
or determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be identified corresponding to the pinyin set to be identified;
inputting the pinyin vector to be recognized to a trained target network model so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
Based on the same application concept as the method, embodiments of the present application further provide a machine-readable storage medium, where several computer instructions are stored, and when the computer instructions are executed by a processor, the method for recognizing a speech intention disclosed in the above examples of the present application can be implemented.
The machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like. For example, the machine-readable storage medium may be: a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Furthermore, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (11)

1. A method of speech intent recognition, the method comprising:
determining a phoneme set to be recognized according to the speech to be recognized;
acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
inputting the phoneme vector to be recognized to a trained target network model so that the target network model outputs a speech intention corresponding to the phoneme vector to be recognized;
wherein the target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
2. The method of claim 1, wherein the phoneme set to be recognized comprises a plurality of phonemes to be recognized, and wherein the obtaining the phoneme vector to be recognized corresponding to the phoneme set to be recognized comprises:
determining a phoneme characteristic value corresponding to each phoneme to be recognized; and acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized based on the phoneme characteristic value corresponding to each phoneme to be recognized, wherein the phoneme vector to be recognized comprises the phoneme characteristic value corresponding to each phoneme to be recognized.
3. The method of claim 1,
before the inputting the phoneme vector to be recognized to the trained target network model so that the target network model outputs the speech intention corresponding to the phoneme vector to be recognized, the method further includes:
acquiring sample voice and a sample intention corresponding to the sample voice;
determining a sample phoneme set according to the sample voice;
acquiring a sample phoneme vector corresponding to the sample phoneme set;
and inputting the sample phoneme vector and the sample intention into an initial network model, and training the initial network model through the sample phoneme vector and the sample intention to obtain the target network model.
4. The method of claim 3, wherein the sample phoneme set comprises a plurality of sample phonemes, and wherein obtaining a sample phoneme vector corresponding to the sample phoneme set comprises:
for each sample phoneme, determining a phoneme characteristic value corresponding to the sample phoneme;
and acquiring a sample phoneme vector corresponding to the sample phoneme set based on the phoneme characteristic value corresponding to each sample phoneme, wherein the sample phoneme vector comprises the phoneme characteristic value corresponding to each sample phoneme.
5. A method of speech intent recognition, the method comprising:
determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be identified corresponding to the pinyin set to be identified;
inputting the pinyin vector to be recognized to a trained target network model so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
6. The method of claim 5, wherein the pinyin set to be identified includes a plurality of pinyins to be identified, and wherein the obtaining the pinyin vector to be identified corresponding to the pinyin set to be identified includes:
for each pinyin to be recognized, determining a pinyin characteristic value corresponding to the pinyin to be recognized; and acquiring a pinyin vector to be identified corresponding to the pinyin set to be identified based on the pinyin characteristic value corresponding to each pinyin to be identified, wherein the pinyin vector to be identified comprises the pinyin characteristic value corresponding to each pinyin to be identified.
7. The method of claim 5,
before the inputting the pinyin vector to be recognized to the trained target network model so that the target network model outputs the voice intention corresponding to the pinyin vector to be recognized, the method further includes:
acquiring sample voice and a sample intention corresponding to the sample voice;
determining a sample pinyin set according to the sample voice;
acquiring a sample pinyin vector corresponding to the sample pinyin set;
and inputting the sample pinyin vector and the sample intention to an initial network model, and training the initial network model through the sample pinyin vector and the sample intention to obtain the target network model.
8. The method of claim 7, wherein the sample pinyin set includes a plurality of sample pinyins, and wherein obtaining sample pinyin vectors corresponding to the sample pinyin set includes:
for each sample pinyin, determining a pinyin characteristic value corresponding to the sample pinyin;
and acquiring a sample pinyin vector corresponding to the sample pinyin set based on the pinyin characteristic value corresponding to each sample pinyin, wherein the sample pinyin vector comprises the pinyin characteristic value corresponding to each sample pinyin.
9. A speech intent recognition apparatus, the apparatus comprising:
the determining module is used for determining a phoneme set to be recognized according to the speech to be recognized;
the acquisition module is used for acquiring the phoneme vector to be identified corresponding to the phoneme set to be identified;
the processing module is used for inputting the phoneme vector to be recognized to a trained target network model so as to enable the target network model to output a voice intention corresponding to the phoneme vector to be recognized;
wherein the target network model is used for recording the mapping relation between the phoneme vector and the voice intention.
10. A speech intent recognition apparatus, the apparatus comprising:
the determining module is used for determining a pinyin set to be recognized according to the voice to be recognized;
the acquisition module is used for acquiring the pinyin vector to be identified corresponding to the pinyin set to be identified;
the processing module is used for inputting the pinyin vector to be recognized to a trained target network model so as to enable the target network model to output a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
11. A speech intent recognition device, comprising: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute machine executable instructions to perform the steps of:
determining a phoneme set to be recognized according to the speech to be recognized;
acquiring a phoneme vector to be recognized corresponding to the phoneme set to be recognized;
inputting the phoneme vector to be recognized to a trained target network model so that the target network model outputs a speech intention corresponding to the phoneme vector to be recognized;
wherein, the target network model is used for recording the mapping relation between the phoneme vector and the voice intention;
or determining a pinyin set to be recognized according to the voice to be recognized;
acquiring a pinyin vector to be identified corresponding to the pinyin set to be identified;
inputting the pinyin vector to be recognized to a trained target network model so that the target network model outputs a voice intention corresponding to the pinyin vector to be recognized;
the target network model is used for recording the mapping relation between the pinyin vector and the voice intention.
CN202010785605.1A 2020-08-06 2020-08-06 A method, device and apparatus for recognizing speech intention Active CN111986653B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010785605.1A CN111986653B (en) 2020-08-06 2020-08-06 A method, device and apparatus for recognizing speech intention
PCT/CN2021/110134 WO2022028378A1 (en) 2020-08-06 2021-08-02 Voice intention recognition method, apparatus and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010785605.1A CN111986653B (en) 2020-08-06 2020-08-06 A method, device and apparatus for recognizing speech intention

Publications (2)

Publication Number Publication Date
CN111986653A true CN111986653A (en) 2020-11-24
CN111986653B CN111986653B (en) 2024-06-25

Family

ID=73444526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010785605.1A Active CN111986653B (en) 2020-08-06 2020-08-06 A method, device and apparatus for recognizing speech intention

Country Status (2)

Country Link
CN (1) CN111986653B (en)
WO (1) WO2022028378A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022028378A1 (en) * 2020-08-06 2022-02-10 杭州海康威视数字技术股份有限公司 Voice intention recognition method, apparatus and device
CN115424632A (en) * 2022-08-17 2022-12-02 珠海格力电器股份有限公司 Intention recognition method and device, electronic equipment and readable medium
WO2023045186A1 (en) * 2021-09-23 2023-03-30 平安科技(深圳)有限公司 Intention recognition method and apparatus, and electronic device and storage medium

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6026358A (en) * 1994-12-22 2000-02-15 Justsystem Corporation Neural network, a method of learning of a neural network and phoneme recognition apparatus utilizing a neural network
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN109754789A (en) * 2017-11-07 2019-05-14 北京国双科技有限公司 Phoneme recognition method and device
KR20190103080A (en) * 2019-08-15 2019-09-04 엘지전자 주식회사 Deeplearing method for voice recognition model and voice recognition device based on artifical neural network
CN110349567A (en) * 2019-08-12 2019-10-18 腾讯科技(深圳)有限公司 The recognition methods and device of voice signal, storage medium and electronic device
CN110415687A (en) * 2019-05-21 2019-11-05 腾讯科技(深圳)有限公司 Method of speech processing, device, medium, electronic equipment
CN110610707A (en) * 2019-09-20 2019-12-24 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium
CN110674314A (en) * 2019-09-27 2020-01-10 北京百度网讯科技有限公司 Sentence recognition method and device
CN110767214A (en) * 2018-07-27 2020-02-07 杭州海康威视数字技术股份有限公司 Speech recognition method and device and speech recognition system
CN110808050A (en) * 2018-08-03 2020-02-18 蔚来汽车有限公司 Speech recognition method and smart device
WO2020057624A1 (en) * 2018-09-20 2020-03-26 杭州海康威视数字技术股份有限公司 Voice recognition method and apparatus
CN111081219A (en) * 2020-01-19 2020-04-28 南京硅基智能科技有限公司 End-to-end voice intention recognition method
CN111243603A (en) * 2020-01-09 2020-06-05 厦门快商通科技股份有限公司 Voiceprint recognition method, system, mobile terminal and storage medium
CN111274797A (en) * 2020-01-13 2020-06-12 平安国际智慧城市科技股份有限公司 Intention recognition method, device and equipment for terminal and storage medium
WO2020140612A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Convolutional neural network-based intention recognition method, apparatus, device, and medium
KR20200091738A (en) * 2019-01-23 2020-07-31 주식회사 케이티 Server, method and computer program for detecting keyword

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6408271B1 (en) * 1999-09-24 2002-06-18 Nortel Networks Limited Method and apparatus for generating phrasal transcriptions
CN107357875B (en) * 2017-07-04 2021-09-10 北京奇艺世纪科技有限公司 Voice search method and device and electronic equipment
CN111986653B (en) * 2020-08-06 2024-06-25 杭州海康威视数字技术股份有限公司 A method, device and apparatus for recognizing speech intention

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6026358A (en) * 1994-12-22 2000-02-15 Justsystem Corporation Neural network, a method of learning of a neural network and phoneme recognition apparatus utilizing a neural network
CN109754789A (en) * 2017-11-07 2019-05-14 北京国双科技有限公司 Phoneme recognition method and device
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN110767214A (en) * 2018-07-27 2020-02-07 杭州海康威视数字技术股份有限公司 Speech recognition method and device and speech recognition system
CN110808050A (en) * 2018-08-03 2020-02-18 蔚来汽车有限公司 Speech recognition method and smart device
WO2020057624A1 (en) * 2018-09-20 2020-03-26 杭州海康威视数字技术股份有限公司 Voice recognition method and apparatus
CN110931000A (en) * 2018-09-20 2020-03-27 杭州海康威视数字技术股份有限公司 Method and device for speech recognition
WO2020140612A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Convolutional neural network-based intention recognition method, apparatus, device, and medium
KR20200091738A (en) * 2019-01-23 2020-07-31 주식회사 케이티 Server, method and computer program for detecting keyword
CN110415687A (en) * 2019-05-21 2019-11-05 腾讯科技(深圳)有限公司 Method of speech processing, device, medium, electronic equipment
CN110349567A (en) * 2019-08-12 2019-10-18 腾讯科技(深圳)有限公司 The recognition methods and device of voice signal, storage medium and electronic device
KR20190103080A (en) * 2019-08-15 2019-09-04 엘지전자 주식회사 Deeplearing method for voice recognition model and voice recognition device based on artifical neural network
CN110610707A (en) * 2019-09-20 2019-12-24 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium
CN110674314A (en) * 2019-09-27 2020-01-10 北京百度网讯科技有限公司 Sentence recognition method and device
CN111243603A (en) * 2020-01-09 2020-06-05 厦门快商通科技股份有限公司 Voiceprint recognition method, system, mobile terminal and storage medium
CN111274797A (en) * 2020-01-13 2020-06-12 平安国际智慧城市科技股份有限公司 Intention recognition method, device and equipment for terminal and storage medium
CN111081219A (en) * 2020-01-19 2020-04-28 南京硅基智能科技有限公司 End-to-end voice intention recognition method

CN202010785605.1A 2020-08-06 2020-08-06 A method, device and apparatus for recognizing speech intention Active CN111986653B (en)
Also Published As

Publication number Publication date
WO2022028378A1 (en) 2022-02-10
CN111986653B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
CN105122279B (en) Deep neural network is conservatively adapted in identifying system
CN108346436B (en) Voice emotion detection method and device, computer equipment and storage medium
US10032463B1 (en) Speech processing with learned representation of user interaction history
CN109754789B (en) Phoneme recognition method and device
US9653093B1 (en) Generative modeling of speech using neural networks
CN113987179A (en) Knowledge enhancement and backtracking loss-based conversational emotion recognition network model, construction method, electronic device and storage medium
US20210375260A1 (en) Device and method for generating speech animation
CN114127849B (en) Speech emotion recognition method and device
Anand et al. Few shot speaker recognition using deep neural networks
CN110838288A (en) Voice interaction method and system and dialogue equipment
WO2017099936A1 (en) System and methods for adapting neural network acoustic models
CN111986653B (en) A method, device and apparatus for recognizing speech intention
CN111144097B (en) Modeling method and device for emotion tendency classification model of dialogue text
CN108711421A (en) A kind of voice recognition acoustic model method for building up and device and electronic equipment
CN110795549B (en) Short text conversation method, device, equipment and storage medium
CN111179917B (en) Speech recognition model training method, system, mobile terminal and storage medium
Ren et al. Two-stage training for Chinese dialect recognition
CN114333768A (en) Voice detection method, device, device and storage medium
CN112509560A (en) Voice recognition self-adaption method and system based on cache language model
CN113515586A (en) Data processing method and device
CN111414466A (en) Multi-round dialogue modeling method based on depth model fusion
Walter et al. An evaluation of unsupervised acoustic model training for a dysarthric speech interface
CN119152847A (en) Voice awakening method and device for custom awakening words
CN116579350A (en) Robustness analysis method and device for dialogue understanding model and computer equipment
CN116340487A (en) Identification method and device of dialogue behaviors, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant