WO2019037382A1 - Emotion recognition-based voice quality inspection method and device, equipment and storage medium
- Publication number
- WO2019037382A1 (PCT/CN2018/072967)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- emotion recognition
- recognition result
- training
- voice
- voice data
- Prior art date
- 2017-08-24
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/50—Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers ; Centralised arrangements for recording messages
- H04M3/51—Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
- H04M3/5175—Call or contact centers supervision arrangements
Definitions
- the present invention relates to the field of voice quality inspection technology, and in particular, to a voice quality inspection method, apparatus, device and storage medium based on emotion recognition.
- the service organization has a customer service voice question answering system.
- the agent of the service organization provides services to the customer through the customer service voice question answering system.
- Voice quality inspection monitors the calls between agents and customers to evaluate call quality, service quality, business-resolution quality, satisfaction and other inspection results, so that service quality can be improved on that basis.
- at present, voice quality inspection is mainly carried out by manual sampling, which suffers from low sampling efficiency, slow response and heavy consumption of manpower and material resources.
- the present application provides an emotion recognition-based voice quality inspection method, apparatus, device and storage medium, so as to solve the problems of the current manual-sampling approach to voice quality inspection.
- the present application provides an emotion recognition-based voice quality inspection method, including:
- the present application provides an emotion recognition-based voice quality inspection device, including:
- a to-be-tested voice data acquisition module configured to acquire voice data to be tested
- a voice data feature acquiring module configured to perform feature extraction on the voice data to be tested, and acquire a voice feature
- An emotion recognition result obtaining module configured to identify the voice feature by using an emotion recognition model, and obtain an emotion recognition result
- an emotion recognition result feedback module configured to send the emotion recognition result to the associated terminal, so that the associated terminal displays the emotion recognition result.
- the present application provides a terminal device including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer readable instructions:
- the present application provides a computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the following steps:
- in the emotion recognition-based voice quality inspection method, apparatus, device and storage medium provided by the present application, feature extraction is performed on the voice data to be tested to acquire voice features, and an emotion recognition model is then used to identify the voice features to obtain an emotion recognition result; the emotion recognition result is displayed by the associated terminal, so that the user can learn the emotion of the speaker corresponding to the voice data to be tested by viewing the result.
- the method, apparatus, device and storage medium can intelligently recognize the voice data to be tested to obtain the emotion recognition result; the recognition process is highly efficient and enables timely and comprehensive inspection of the voice data corresponding to each speaker without manual intervention, which helps save labor costs.
- FIG. 1 is a flow chart of the emotion recognition-based voice quality inspection method in Embodiment 1.
- FIG. 2 is a specific flowchart of step S30 in FIG. 1.
- FIG. 3 is another specific flowchart of step S30 in FIG. 1.
- FIG. 4 is another specific flowchart of the emotion recognition-based voice quality inspection method in Embodiment 1.
- FIG. 5 is a schematic block diagram of the emotion recognition-based voice quality inspection apparatus in Embodiment 2.
- FIG. 6 is a schematic diagram of the terminal device in Embodiment 4.
- FIG. 1 shows the emotion recognition-based voice quality inspection method in this embodiment.
- the emotion recognition-based voice quality inspection method is applied to terminal devices of financial institutions such as banks, securities firms, insurers and P2P (peer-to-peer lending) platforms, or of other institutions that need to perform emotion recognition, to identify a speaker's voice data to be tested and determine the speaker's emotion.
- the terminal devices include, but are not limited to, PCs, smart phones, tablet computers and customer service voice question answering systems.
- the terminal device is a customer service voice question answering system.
- the emotion recognition based voice quality inspection method comprises the following steps:
- the voice data to be tested refers to the voice data of the speaker collected by the terminal device.
- the voice data to be tested may be voice data in wav, mp3 or other formats. It can be understood that each voice data to be tested carries a data source ID, which is an identifier for uniquely identifying the speaker of the voice data to be tested.
- when the agent communicates with the customer by phone, the recording module integrated in the terminal device, or a recording device connected to the terminal device, collects the voice data to be tested.
- the manner in which the voice data to be tested is obtained includes online real-time acquisition and offline acquisition.
- the online real-time acquisition refers to recording the content of the call between the client and the agent during the conversation between the client and the agent to obtain the voice data to be tested.
- Offline acquisition refers to acquiring the voice data to be tested that has been saved in the database from the background of the system connected to the terminal device.
- each voice data to be tested carries a data source ID, and the speaker corresponding to the data source ID may be a customer or an agent; therefore, the data source ID may be a customer ID that uniquely identifies the customer, or an agent ID that uniquely identifies the agent.
- the customer ID may be the customer's ID number, mobile phone number, or the account number when the organization handles the business.
- the agent ID can be the agent's employee number within the organization.
- speech features include, but are not limited to, prosodic features, phonological features, spectral features, lexical features, and voiceprint features.
- the prosodic features, also known as supra-sound-quality or supra-segmental features, refer to variations in pitch, duration and intensity in speech, apart from the sound quality features.
- the prosody features include, but are not limited to, the pitch frequency, the utterance duration, the utterance amplitude, and the utterance rate in the present embodiment.
- Sound quality features include, but are not limited to, formants F1-F3, band energy distribution, harmonic signal to noise ratio, and short-term energy jitter in this embodiment.
- Spectral features, also known as vibration spectrum features, are obtained by decomposing a complex oscillation into harmonic oscillations of different amplitudes and frequencies and arranging the amplitudes of these harmonic oscillations by frequency.
- the spectral features are combined with prosodic features and sound quality features to improve the anti-noise effect of the characteristic parameters.
- the spectral features are Mel-Frequency Cepstral Coefficients (MFCC), which can reflect the auditory characteristics of the human ear.
- MFCC Mel-Frequency Cepstral Coefficients
- the vocabulary feature is a part of speech feature for embodying words in the speech data to be tested, including but not limited to positive words and negative words in the embodiment.
- the part-of-speech feature is combined with other phonetic features to facilitate the recognition of the speaker's emotion corresponding to the speech data to be tested.
- the voiceprint feature (i.e., the i-vector feature) is a speaker-related feature that, combined with other voice features, can more effectively improve recognition accuracy during speech recognition.
- the process of performing feature extraction on the voice data to be tested specifically includes: pre-emphasizing the voice data to be tested; performing framing and windowing on the processed voice data; then performing a fast Fourier transform and a logarithm operation; and finally performing a discrete cosine transform to obtain the above voice features. The voice features are then spliced into a feature vector, and the feature vector is fed into the emotion recognition model as input for emotion recognition.
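The extraction pipeline just described (pre-emphasis, framing and windowing, FFT, logarithm, DCT) is essentially a standard cepstral front end. Purely as an illustration, here is a minimal Python sketch of those steps; the sampling rate, frame sizes, pre-emphasis coefficient and the frame-averaging at the end are assumptions, and the mel filter bank of a full MFCC front end is omitted to stay close to the steps named in the text.

```python
import numpy as np
from scipy.fftpack import dct

def extract_mfcc_like(signal, sample_rate=8000, frame_len=0.025,
                      frame_step=0.010, pre_emphasis=0.97, num_ceps=13):
    """Sketch of the described pipeline: pre-emphasis -> framing and
    windowing -> FFT -> logarithm -> DCT."""
    signal = np.asarray(signal, dtype=float)

    # 1. Pre-emphasis: boost the high-frequency part of the signal.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # 2. Framing and windowing (assumes the signal is at least one frame long).
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    num_frames = 1 + (len(emphasized) - flen) // fstep
    frames = np.stack([emphasized[i * fstep:i * fstep + flen]
                       for i in range(num_frames)])
    frames = frames * np.hamming(flen)

    # 3. FFT -> power spectrum, 4. logarithm, 5. DCT -> cepstral coefficients.
    power = (np.abs(np.fft.rfft(frames, n=512)) ** 2) / 512
    log_power = np.log(power + 1e-10)
    ceps = dct(log_power, type=2, axis=1, norm='ortho')[:, :num_ceps]

    # Splice the per-frame coefficients into a single feature vector
    # (mean over frames, standing in for the splicing step in the text).
    return ceps.mean(axis=0)
```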
- the emotion recognition model is used to identify the voice features, and the emotion recognition result is obtained.
- the emotion recognition model is a model pre-trained within the institution.
- the results of the emotion recognition include, but are not limited to, recognizing emotions such as happiness, anger, sadness, trouble, and calmness of the speaker corresponding to the voice feature.
- the terminal device uses the pre-trained emotion recognition model to identify the voice feature, so as to obtain the emotion recognition results such as happiness, anger, sadness, trouble, and calmness carried by the speaker when speaking.
- the emotion recognition model is used to intelligently identify the voice features and obtain the emotion recognition result; the recognition process is highly efficient, enables timely and comprehensive spot checks of the voice data corresponding to the speaker, requires no manual intervention, and helps save labor costs.
- the emotion recognition model is used to identify the voice feature, and the emotion recognition result is obtained, which specifically includes the following steps, as shown in FIG. 2:
- S311: Use the neural network-based emotion recognition model to identify the voice features and determine whether an accurate recognition result can be output.
- the neural network-based emotion recognition model is a model for identifying emotions in voice data, obtained by training on the training voice data with a neural network.
- the neural network-based emotion recognition model includes an input layer, a hidden layer and an output layer; the voice features acquired in step S20 are fed into the input layer, identified and processed by the hidden layer, and the recognition result is then output through the output layer. The recognition result is either an accurate recognition result or a fuzzy recognition result.
- the accurate recognition result is a recognition result indicating that the voice features correspond to a specific emotion; the fuzzy recognition result is a recognition result indicating that the voice features cannot be matched to a specific emotion.
- the training process of the neural network-based emotion recognition model is as follows: first, training voice data is acquired and emotionally labeled, so that each piece of training voice data carries an emotion tag; the training voice data is the voice data used to train the emotion recognition model. 500 pieces of training voice data are selected for each emotion tag, so that the training voice data for the five emotions of happiness, anger, sadness, trouble and calmness are in equal proportion, which avoids over-fitting during training of the emotion recognition model.
- then, the training voice features carrying the emotion tags are obtained, represented as pairs of a training voice feature x and its emotion tag y, such as (training voice feature 1, happiness), (training voice feature 2, anger) ... (training voice feature x, emotion y).
- iterative calculation is then performed on all the training voice features to extract the characteristics corresponding to emotions such as happiness, anger, sadness, trouble and calmness.
- when the loss of the trained model converges, training is stopped and the final neural network-based emotion recognition model is obtained; such a model has strong nonlinear fitting ability, can map complex nonlinear relationships, and has strong robustness and memory ability.
- the neural network of this embodiment is specifically a Deep Neural Networks (DNN).
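As a rough sketch of the training loop described above (iterate over labeled voice features until the loss converges), the snippet below trains a small feed-forward network with scikit-learn. The layer sizes, iteration limit and stopping tolerance are assumptions for illustration, not parameters disclosed in the application.

```python
from sklearn.neural_network import MLPClassifier

def train_dnn_emotion_model(train_features, train_labels):
    """train_features: (n_samples, n_dims) spliced voice-feature vectors;
    train_labels: one emotion tag per sample, e.g. "happy" or "angry".
    Training stops once the loss improvement falls below tol, i.e. once
    the loss has converged."""
    model = MLPClassifier(hidden_layer_sizes=(256, 128),  # assumed sizes
                          max_iter=500, tol=1e-4)
    model.fit(train_features, train_labels)
    return model
```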
- when the neural network-based emotion recognition model is used to identify the voice features corresponding to the voice data to be tested, these voice features are fed into the input layer of the model; the hidden layer performs emotion recognition according to the characteristics learned during pre-training, acquires the corresponding emotion recognition result, and outputs it from the output layer.
- specifically, the hidden layer calculates the probabilities that the voice data to be tested is happy, angry, sad, troubled or calm, and compares whether the difference between the highest and second-highest probabilities is greater than the preset probability difference.
- the preset probability difference is a value set in advance for evaluating whether the emotion category can be determined.
- the preset probability difference is 20%
- for example, if the probabilities that the voice features are identified as happy, angry, sad, troubled or calm are 2%, 60%, 15%, 20% and 3%, respectively, the highest probability is 60% and the second-highest probability is 20%.
- the difference between the highest and second-highest probabilities is 40%, which is greater than the preset probability difference, so the emotion corresponding to the highest probability is output as the recognition result.
- by contrast, if the probabilities that the voice features are identified as happy, angry, sad, troubled or calm are 2%, 40%, 20%, 35% and 3%, respectively, the highest probability is 40% and the second-highest probability is 35%.
- the difference between the highest and second-highest probabilities is less than the preset probability difference, so an accurate recognition result cannot be output; a fuzzy recognition result is output instead.
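Putting the two worked examples into code, the decision rule compares the highest and second-highest class probabilities against the preset 20% difference. A minimal sketch follows; the model object and its predict_proba/classes_ interface are assumptions.

```python
import numpy as np

PRESET_PROB_DIFF = 0.20  # the 20% preset probability difference from the text

def recognize_with_dnn(model, feature_vector):
    """Return (emotion, True) for an accurate recognition result, or
    (None, False) for a fuzzy result that must be passed on to the
    SVM-based models."""
    probs = model.predict_proba([feature_vector])[0]
    order = np.argsort(probs)[::-1]             # class indices, highest first
    highest, second = probs[order[0]], probs[order[1]]
    if highest - second > PRESET_PROB_DIFF:     # e.g. 60% - 20% = 40% > 20%
        return model.classes_[order[0]], True   # accurate recognition result
    return None, False                          # fuzzy recognition result
```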
- if the neural network-based emotion recognition model can output an accurate recognition result, the highest of the probabilities that the speaker emotion corresponding to the voice data to be tested belongs to happiness, anger, sadness, trouble or calmness exceeds the second-highest probability by more than the preset probability difference; in that case, the recognition result output by the neural network-based emotion recognition model is sufficiently accurate, and the accurate recognition result can be directly output as the emotion recognition result.
- if the neural network-based emotion recognition model cannot output an accurate recognition result, that is, it outputs a fuzzy recognition result, the difference between the highest and second-highest of the probabilities that the speaker emotion belongs to happiness, anger, sadness, trouble or calmness is not greater than the preset probability difference; that is, a specific emotion cannot be accurately recognized. Therefore, the support vector machine-based emotion recognition model is needed to further identify the voice features and obtain the emotion recognition result, thereby further improving the accuracy of emotion recognition.
- the support vector machine (SVM)-based emotion recognition model is a model for identifying emotions in voice data, obtained by training on the training voice data with a support vector machine.
- the support vector machine-based emotion recognition model has low computational complexity and can determine the final result from a few support vectors; this helps grasp key samples and eliminate redundant samples during training, giving better robustness.
- the training process of the support vector machine-based emotion recognition model is as follows: SVM training separates the different emotion attributes, namely the five emotions of happiness, anger, sadness, trouble and calmness, with hyperplanes.
- the hyperplane space is divided according to the five emotion attributes, the dividing lines separating the five emotions are searched for within it, and the expressions of the dividing lines are obtained to complete the training of the SVM.
- specifically, the voice features of the training voice data (i.e., the prosodic features, sound quality features, spectral features, lexical features and voiceprint features) are used for training; audio feature extraction is then performed on the data to be tested, the optimal solution space corresponding to the features is obtained, and that space is mapped to the corresponding emotion, completing the emotion classification and judgment of the input data.
- the SVM is a binary classification model, and it may be implemented in the form of a binary tree, that is, each attribute is judged separately to determine whether the sample belongs to that emotion attribute or not. Because the SVM is a binary classifier, in this embodiment the support vector machine-based emotion recognition model is built from separately created SVM-based recognition models for happiness, anger, sadness, trouble and calmness. When the support vector machine-based emotion recognition model is used to identify the voice features corresponding to the voice data to be tested, the voice features are identified by the happiness, anger, sadness, trouble and calmness recognition models respectively to obtain the corresponding emotion scores; the five emotion scores are then compared, and the emotion with the highest score is selected as the emotion recognition result.
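Since each SVM is a binary classifier, the scheme builds one recognizer per emotion and keeps the highest-scoring one. The sketch below illustrates this with scikit-learn, using decision_function values as the "emotion scores"; the linear kernel and the scoring interface are assumptions, not details from the application.

```python
from sklearn.svm import SVC

EMOTIONS = ["happy", "angry", "sad", "troubled", "calm"]  # labels from the text

def train_svm_emotion_models(train_features, train_labels):
    """Train one binary SVM per emotion (emotion vs. not-that-emotion)."""
    models = {}
    for emotion in EMOTIONS:
        binary_labels = [1 if label == emotion else 0 for label in train_labels]
        models[emotion] = SVC(kernel="linear").fit(train_features, binary_labels)
    return models

def recognize_with_svms(models, feature_vector):
    """Score the feature vector with all five models and return the
    emotion whose model gives the highest score."""
    scores = {emotion: float(clf.decision_function([feature_vector])[0])
              for emotion, clf in models.items()}
    return max(scores, key=scores.get)
```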
- in this embodiment, the neural network-based emotion recognition model first identifies the voice features corresponding to the voice data to be tested, recognizing voice features with clearer emotion attributes accurately and quickly; the support vector machine-based emotion recognition model then further identifies the voice features for which the neural network-based model could not output an accurate result, which helps improve recognition accuracy.
- the terminal device obtains the data source ID of the voice data to be tested, where the data source ID indicates the speaker of the voice data to be tested.
- the emotion recognition model is used to identify the voice feature, and the emotion recognition result is obtained, which specifically includes the following steps:
- S321: Acquire a target emotion recognition model associated with the data source ID, based on the data source ID of the voice data to be tested.
- the target emotion recognition model is an emotion recognition model trained on training voice data carrying the same data source ID.
- the target emotion recognition model may be an emotion recognition model trained according to the training methods mentioned in this embodiment, with its training data restricted to a single data source ID; that is, the target emotion recognition model may be a neural network-based emotion recognition model or a support vector machine-based emotion recognition model, and differs from the emotion recognition models in steps S311-S313 in that its training voice data carries the same data source ID. It can be understood that the target emotion recognition model may be an emotion recognition model trained in advance on training voice data carrying the same data source ID and stored in the database.
- the terminal device queries the database according to the data source ID of the received voice data to be tested and determines whether a target emotion recognition model associated with the data source ID exists in the database; if it exists, step S322 is performed; if not, steps S311-S313 are performed, i.e., voice emotion recognition is carried out using the emotion recognition models not associated with any data source ID, which can identify the voice data to be tested of all speakers.
- S322: Identify the voice features by using the target emotion recognition model to obtain the emotion recognition result.
- since the target emotion recognition model is trained on data carrying the same data source ID, it is an emotion recognition model for a specific speaker; when the voice data to be tested carries that same data source ID, identifying the corresponding voice features with the target emotion recognition model makes the emotion recognition result more accurate. It can be understood that the target emotion recognition model only identifies voice data to be tested carrying the same data source ID, so it is highly targeted and its recognition results are more accurate.
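The lookup-and-fallback logic around steps S321-S322 reduces to a keyed query with a default path. A minimal sketch, in which the model store and the recognizer interfaces are assumed for illustration:

```python
def recognize_emotion(data_source_id, feature_vector,
                      target_models, generic_recognize):
    """target_models: mapping from data source ID to a speaker-specific
    model; generic_recognize: the S311-S313 cascade used when no target
    model exists for this speaker."""
    model = target_models.get(data_source_id)      # query by data source ID
    if model is not None:                          # S322: speaker-specific model
        return model.predict([feature_vector])[0]
    return generic_recognize(feature_vector)       # fall back to generic models
```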
- before step S321, the emotion recognition-based voice quality inspection method further includes pre-training the target emotion recognition model associated with the data source ID, which specifically includes the following steps:
- S331: Acquire the training voice data associated with the data source ID from the database, based on the data source ID.
- the training voice data may be voice data collected by the recording module integrated in the terminal device, or by a recording device connected to the terminal device, when the agent communicates with the customer; the training voice data is stored in a database connected to the terminal device, in association with the data source ID.
- when the target emotion recognition model associated with the data source ID needs to be trained, the database is queried to obtain all the training voice data associated with that data source ID.
- S332: Determine whether the quantity of training voice data reaches the emotion model training threshold.
- the emotion model training threshold is a preset quantity of training voice data required to train the emotion recognition model. If the quantity of training voice data reaches the threshold, the database holds enough training voice data to train a target emotion recognition model associated with the data source ID; if not, the target emotion recognition model cannot be trained.
- S333: If the quantity of training voice data reaches the emotion model training threshold, perform emotion recognition model training based on the training voice data associated with the data source ID to obtain the target emotion recognition model.
- the training process based on the training voice data associated with the data source ID is the same as the training process of the neural network-based emotion recognition model and/or the support vector machine-based emotion recognition model mentioned in steps S311-S313; to avoid repetition, it is not repeated here.
- the target emotion recognition model obtained by training on the training voice data corresponding to the data source ID better fits the emotion of the speaker corresponding to that data source ID, so its recognition of voice data to be tested carrying the same data source ID is more accurate, and errors introduced by training voice data from different speakers are effectively avoided.
- specifically, the recording module integrated in the terminal device, or the recording device connected to the terminal device, records the call and stores the acquired voice data in the database, with each piece of voice data stored in association with its data source ID; the data source ID may be a customer ID or an agent ID.
- the database periodically counts the number of voice recordings corresponding to each data source ID; when the number of voice data items corresponding to any data source ID reaches the emotion model training threshold, step S333 is performed to obtain the target emotion recognition model corresponding to that data source ID.
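Steps S331-S333 amount to counting the stored recordings per data source ID and training once the count reaches the preset threshold. A sketch under assumed storage and training interfaces (the threshold value is also an assumption; the application only says it is preset):

```python
EMOTION_MODEL_TRAIN_THRESHOLD = 500  # assumed value; the text only says
                                     # the threshold is preset

def maybe_train_target_model(db, data_source_id, train_fn, target_models):
    """S331: fetch the training voice data stored for this data source ID;
    S332: check whether the count reaches the threshold;
    S333: train and register the target model when it does."""
    samples = db.fetch_training_voice_data(data_source_id)  # assumed DB API
    if len(samples) < EMOTION_MODEL_TRAIN_THRESHOLD:
        return None                        # not enough data to train yet
    model = train_fn(samples)              # same training process as before
    target_models[data_source_id] = model
    return model
```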
- S40: Send the emotion recognition result to the associated terminal, so that the associated terminal displays the emotion recognition result.
- the associated terminal may be a terminal that performs human-computer interaction with the agent or the quality inspector, wherein the quality inspector is a person who performs quality inspection on the agent service of the institution.
- the associated terminal includes, but is not limited to, a smartphone, a PC, a tablet or another terminal capable of displaying the emotion recognition result.
- after performing emotion recognition, the terminal device of the financial institution or other institution that needs to perform emotion recognition sends the acquired emotion recognition result to the associated terminal, so that the associated terminal displays it; the agent or quality inspector using the associated terminal can thus learn the emotion of the speaker corresponding to the voice data to be tested from the emotion recognition result.
- specifically, the terminal device sends the emotion recognition result to the agent who is on a call with the customer, so that the agent can adjust the communication approach according to the customer's emotion (e.g., when the customer is angry, the agent is reminded to offer appropriate reassurance), thereby improving service quality and customer satisfaction with the institution's services.
- alternatively, the terminal device sends the emotion recognition result to the quality inspector, so that the quality inspector can monitor the communication between the agent and the customer, evaluate the agent's working state, and apply rewards and punishments, prompting agents to serve customers better.
- in the emotion recognition-based voice quality inspection method, feature extraction is performed on the voice data to be tested to obtain voice features; the emotion recognition model is then used to identify the voice features and obtain the emotion recognition result, which is displayed by the associated terminal so that the user can learn the emotion of the speaker corresponding to the voice data to be tested by viewing it.
- having the associated terminal display the emotion recognition result corresponding to the voice data to be tested helps the agent improve customer service quality, thereby improving customer satisfaction with the institution.
- the voice data to be tested can be intelligently recognized to obtain the emotion recognition result; the recognition process is highly efficient, and the voice data to be tested corresponding to each speaker can be inspected in a timely and comprehensive manner without manual intervention, which helps save labor costs.
- the step S10 specifically includes: acquiring the voice data to be tested collected by the calling terminal in real time.
- the calling terminal may be a terminal that performs voice communication with a client or a terminal that performs voice communication with an agent.
- the calling terminal can be a voice call device such as a fixed telephone, a mobile phone or a walkie-talkie.
- the calling terminal is connected to the terminal device of the financial institution or other institution that needs to perform emotion recognition, so that the terminal device can acquire, in real time, the voice data to be tested collected by the calling terminal, facilitating real-time monitoring of that data.
- specifically, the terminal device acquires, in real time, the voice data to be tested collected by the calling terminal during the call between the customer and the agent, so as to monitor the emotion of the customer or the agent on the call.
- Step S40 specifically includes: transmitting the emotion recognition result to the associated terminal in real time, so that the associated terminal displays the emotion recognition result.
- the terminal device sends the acquired emotion recognition result to the associated terminal in real time, so that the associated terminal can display, in real time, the emotion of the speaker corresponding to the voice data to be tested and prompt the agent to adjust the communication approach, thereby improving the customer's satisfaction with the agent and even with the institution's service.
- the emotion recognition-based voice quality inspection method adopts artificial-intelligence recognition; its processing efficiency is high, no professional quality inspectors are needed for spot checks, labor costs are saved and the risk of fraud is reduced.
- the emotion recognition-based voice quality inspection method acquires, in real time, the voice data to be tested collected by the calling terminal, extracts features from the voice data to obtain voice features, and then uses the emotion recognition model to identify the voice features and obtain the emotion recognition result; the result is sent to the associated terminal in real time and displayed there, so that the agent or quality inspector using the associated terminal can learn the emotion of the speaker corresponding to the voice data to be tested by viewing the result and adjust the communication approach, which helps improve the institution's service quality and, in turn, customer satisfaction with the institution.
- the voice data to be tested can be intelligently recognized to obtain the emotion recognition result; the recognition process is highly efficient, and the voice data to be tested corresponding to each speaker can be inspected in a timely and comprehensive manner without manual intervention, which helps save labor costs.
- FIG. 5 is a schematic block diagram showing the emotion recognition based voice quality inspection apparatus corresponding to the emotion recognition based voice quality inspection method in Embodiment 1.
- the emotion recognition-based voice quality inspection apparatus includes a to-be-tested voice data acquisition module 10, a voice data feature acquisition module 20, an emotion recognition result acquisition module 30 and an emotion recognition result feedback module 40.
- the to-be-tested voice data acquisition module 10, the voice data feature acquisition module 20, the emotion recognition result acquisition module 30 and the emotion recognition result feedback module 40 correspond to the steps of the emotion recognition-based voice quality inspection method in Embodiment 1; to avoid repetition, this embodiment does not describe them in detail.
- the voice data acquiring module 10 is configured to acquire voice data to be tested.
- the voice data feature acquisition module 20 is configured to perform feature extraction on the voice data to be measured, and acquire voice features.
- the emotion recognition result obtaining module 30 is configured to identify the voice feature by using the emotion recognition model, and obtain the emotion recognition result.
- the emotion recognition result feedback module 40 is configured to send the emotion recognition result to the associated terminal, so that the associated terminal displays the emotion recognition result.
- the emotion recognition result acquisition module 30 includes a recognition result output determination unit 311, a first recognition result processing unit 312, and a second recognition result processing unit 313.
- the recognition result output judging unit 311 is configured to recognize the speech feature by using the neural network-based emotion recognition model, and determine whether the accurate recognition result can be output.
- the first recognition result processing unit 312 is configured to use the accurate recognition result as the emotion recognition result when the accurate recognition result can be output.
- the second recognition result processing unit 313 is configured to identify the voice feature by using the emotion recognition model based on the support vector machine to obtain the emotion recognition result when the accurate recognition result cannot be output.
- the emotion recognition result acquisition module 30 includes a target model acquisition unit 321 and a recognition result acquisition unit 322.
- the target model obtaining unit 321 is configured to acquire a target emotion recognition model associated with the data source ID based on the data source ID of the voice data to be tested.
- the recognition result obtaining unit 322 is configured to identify the voice feature by using the target emotion recognition model, and obtain the emotion recognition result.
- the emotion recognition based voice quality checking device further comprises a target model training module 50.
- the target model training module 50 is configured to pre-train the target emotion recognition model associated with the data source ID.
- the target model training module 50 includes a training voice data acquiring unit 51, a number determining unit 52, and a target model training unit 53.
- the training voice data acquiring unit 51 is configured to acquire training voice data associated with the data source ID in the database based on the data source ID.
- the quantity determining unit 52 is configured to determine whether the number of training voice data reaches the emotional model training threshold.
- the target model training unit 53 is configured to perform the emotion recognition model training based on the training voice data associated with the data source ID to obtain the target emotion recognition model when the training voice data reaches the emotion model training threshold.
- the to-be-tested voice data acquisition module 10 is configured to acquire, in real time, the voice data to be tested collected by the calling terminal.
- the emotion recognition result feedback module 40 is configured to send the emotion recognition result to the associated terminal in real time, so that the associated terminal displays the emotion recognition result.
- the voice data acquiring module 10 can obtain the voice data to be tested in real time online, and can also obtain the voice data stored in the database offline, and meet different voice data acquisition requirements.
- the voice data feature acquisition module 20 is configured to perform feature extraction on the voice data to be measured, and acquire a voice feature.
- the identification of the voice features combines the neural network method and the support vector machine method.
- the emotion recognition result acquisition module 30 uses the emotion recognition model to identify the voice features and obtain the emotion recognition results.
- the emotion recognition model is established through both neural network-based emotion recognition model training and support vector machine-based emotion recognition model training, making the output more accurate and realistic.
- the emotion recognition result obtaining module 30 may further acquire the target emotion recognition model associated with the data source ID through the data source ID of the voice data to be tested for emotion recognition.
- the emotion recognition result feedback module 40 is configured to send the emotion recognition result to the associated terminal, so that the associated terminal displays it; the associated terminal can display the result in real time and remind the agent to adjust the dialogue strategy in time according to the customer's emotion, ensuring that the call is pleasant and smooth, while quality inspectors can also complete spot checks on the agents.
- this embodiment provides a computer readable storage medium on which computer readable instructions are stored; when the computer readable instructions are executed by a processor, the emotion recognition-based voice quality inspection method in Embodiment 1 is implemented. To avoid repetition, details are not repeated here.
- alternatively, when the computer readable instructions are executed by the processor, the functions of the modules/units of the emotion recognition-based voice quality inspection apparatus in Embodiment 2 are implemented. To avoid repetition, details are not repeated here.
- FIG. 6 is a schematic diagram of a terminal device according to an embodiment of the present application.
- the terminal device 60 of this embodiment includes a processor 61, a memory 62, and computer readable instructions 63 stored in the memory 62 and operable on the processor 61.
- when executing the computer readable instructions 63, the processor 61 implements the steps of the emotion recognition-based voice quality inspection method in Embodiment 1, such as steps S10 to S40 shown in FIG. 1.
- alternatively, when executing the computer readable instructions 63, the processor 61 implements the functions of the modules/units in the foregoing apparatus embodiments, such as the to-be-tested voice data acquisition module 10, the voice data feature acquisition module 20, the emotion recognition result acquisition module 30 and the emotion recognition result feedback module 40 shown in FIG. 5.
- the computer readable instructions 63 may be partitioned into one or more modules/units, which are stored in the memory 62 and executed by the processor 61 to complete the present application.
- the one or more modules/units may be a series of computer readable instruction segments capable of performing particular functions, the instruction segments being used to describe the execution of the computer readable instructions 63 in the terminal device 60.
- for example, the computer readable instructions 63 may be divided into the to-be-tested voice data acquisition module 10, the voice data feature acquisition module 20, the emotion recognition result acquisition module 30, the emotion recognition result feedback module 40 and the target model training module 50 of Embodiment 2; the functions of these modules are described in detail in Embodiment 2 and are not repeated here.
- the terminal device 60 can be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
- the terminal device may include, but is not limited to, the processor 61 and the memory 62. It will be understood by those skilled in the art that FIG. 6 is only an example of the terminal device 60 and does not constitute a limitation on it; the terminal device may include more or fewer components than those illustrated, combine certain components, or use different components.
- the terminal device may further include an input/output device, a network access device, a bus, and the like.
- the processor 61 may be a central processing unit (CPU), or may be other general-purpose processors, a digital signal processor (DSP), an application specific integrated circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc.
- the general purpose processor may be a microprocessor, or any other conventional processor or the like.
- the memory 62 may be an internal storage unit of the terminal device 60, such as a hard disk or memory of the terminal device 60.
- the memory 62 may also be an external storage device of the terminal device 60, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card or a flash card provided on the terminal device 60.
- the memory 62 may also include both an internal storage unit of the terminal device 60 and an external storage device.
- the memory 62 is used to store computer readable instructions as well as other programs and data required by the terminal device.
- the memory 62 can also be used to temporarily store data that has been or will be output.
- each functional unit and module in the foregoing system may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
- the specific names of the respective functional units and modules are only for the purpose of facilitating mutual differentiation, and are not intended to limit the scope of protection of the present application.
- the disclosed apparatus and method may be implemented in other manners.
- the device embodiments described above are merely illustrative.
- the division of the modules or units is only a logical function division; in actual implementation there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
- the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in electrical, mechanical or other form.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
- the integrated modules/units if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium.
- all or part of the processes in the methods of the foregoing embodiments of the present application may also be implemented by computer readable instructions, which may be stored in a computer readable storage medium.
- the computer readable instructions when executed by a processor, may implement the steps of the various method embodiments described above.
- the computer readable instructions comprise computer readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
- the computer readable medium may include any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, or a software distribution medium.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Marketing (AREA)
- Computational Linguistics (AREA)
- Business, Economics & Management (AREA)
- Hospice & Palliative Care (AREA)
- Child & Adolescent Psychology (AREA)
- Acoustics & Sound (AREA)
- Psychiatry (AREA)
- Oral & Maxillofacial Surgery (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
A voice quality inspection method, apparatus, device and storage medium based on emotion recognition. The emotion recognition-based voice quality inspection method includes: acquiring voice data to be tested; sending the voice data to be tested to a voice emotion recognition platform for emotion recognition; and sending the emotion recognition result to an associated terminal, so that the associated terminal displays the emotion recognition result. When performing emotion recognition, the method has the advantages of high efficiency and low labor cost.
Description
This patent application is based on, and claims priority from, Chinese invention patent application No. 201710734303.X, filed on August 24, 2017 and entitled "Voice quality inspection method, apparatus, device and storage medium based on emotion recognition".
The present application relates to the field of voice quality inspection technology, and in particular to a voice quality inspection method, apparatus, device and storage medium based on emotion recognition.
Service organizations such as banks, securities firms, insurers and P2P (peer-to-peer lending) platforms all operate customer service voice question answering systems, through which their agents provide services to customers. Voice quality inspection monitors the calls between agents and customers to evaluate call quality, service quality, business-resolution quality, satisfaction and other inspection results, so that service quality can be improved on that basis. At present, voice quality inspection is mainly carried out by manual sampling, which suffers from low sampling efficiency, slow response and heavy consumption of manpower and material resources.
Summary of the Invention
The present application provides an emotion recognition-based voice quality inspection method, apparatus, device and storage medium, so as to solve the problems of the current manual-sampling approach to voice quality inspection.
In a first aspect, the present application provides an emotion recognition-based voice quality inspection method, including:
obtaining voice data to be tested;
performing feature extraction on the voice data to be tested to obtain voice features;
identifying the voice features by using an emotion recognition model to obtain an emotion recognition result; and
sending the emotion recognition result to an associated terminal, so that the associated terminal displays the emotion recognition result.
In a second aspect, the present application provides an emotion recognition-based voice quality inspection apparatus, including:
a to-be-tested voice data acquisition module, configured to acquire voice data to be tested;
a voice data feature acquisition module, configured to perform feature extraction on the voice data to be tested to obtain voice features;
an emotion recognition result acquisition module, configured to identify the voice features by using an emotion recognition model to obtain an emotion recognition result; and
an emotion recognition result feedback module, configured to send the emotion recognition result to an associated terminal, so that the associated terminal displays the emotion recognition result.
In a third aspect, the present application provides a terminal device including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer readable instructions:
obtaining voice data to be tested;
performing feature extraction on the voice data to be tested to obtain voice features;
identifying the voice features by using an emotion recognition model to obtain an emotion recognition result; and
sending the emotion recognition result to the associated terminal, so that the associated terminal displays the emotion recognition result.
In a fourth aspect, the present application provides a computer readable storage medium storing computer readable instructions which, when executed by a processor, implement the following steps:
obtaining voice data to be tested;
performing feature extraction on the voice data to be tested to obtain voice features;
identifying the voice features by using an emotion recognition model to obtain an emotion recognition result; and
sending the emotion recognition result to the associated terminal, so that the associated terminal displays the emotion recognition result.
In the emotion recognition-based voice quality inspection method, apparatus, device and storage medium provided by the present application, feature extraction is performed on the voice data to be tested to obtain voice features, and an emotion recognition model is then used to identify the voice features to obtain an emotion recognition result; the emotion recognition result is displayed by the associated terminal, so that the user can learn the emotion of the speaker corresponding to the voice data to be tested by viewing the result. The method, apparatus, device and storage medium can intelligently recognize the voice data to be tested to obtain the emotion recognition result; the recognition process is highly efficient and enables timely and comprehensive inspection of the voice data corresponding to each speaker without manual intervention, which helps save labor costs.
In order to illustrate the technical solutions of the present application more clearly, the drawings needed in the description of the present application are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of the emotion recognition-based voice quality inspection method in Embodiment 1.
FIG. 2 is a specific flowchart of step S30 in FIG. 1.
FIG. 3 is another specific flowchart of step S30 in FIG. 1.
FIG. 4 is another specific flowchart of the emotion recognition-based voice quality inspection method in Embodiment 1.
FIG. 5 is a schematic block diagram of the emotion recognition-based voice quality inspection apparatus in Embodiment 2.
FIG. 6 is a schematic diagram of the terminal device in Embodiment 4.
The technical solutions in the present application will be described clearly and completely below with reference to the drawings in the present application. Obviously, the described embodiments are only a part, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
实施例1Example 1
图1示出本实施例中基于情绪识别的语音质检方法。该基于情绪识别的语音质检方法应用在银行、证券、保险和P2P(peer-to-peer lending,点对点借贷,简称P2P)等金融机构或需要进行情绪识别的其他机构的终端设备中,用于实现对说话人的待测语音数据进行识别,确定说话人的情绪。其中,终端设备包括但不限于PC端、智能手机、平板电脑和客服语音问答系统等设备。本实施例中,终端设备是客服语音问答系统。如图1所示,该基于情绪识别的语音质检方法包括如下步骤:FIG. 1 shows a voice quality inspection method based on emotion recognition in the present embodiment. The emotion recognition-based voice quality inspection method is applied to a financial institution such as a bank, a securities, an insurance, and a peer-to-peer lending (P2P), or a terminal device of another institution that needs to perform emotion recognition, for use in a terminal device of a financial institution such as a peer-to-peer lending (P2P) The speaker's voice data to be tested is identified to determine the emotion of the speaker. Among them, the terminal equipment includes but is not limited to a PC, a smart phone, a tablet computer, and a customer service voice question answering system. In this embodiment, the terminal device is a customer service voice question answering system. As shown in FIG. 1, the emotion recognition based voice quality inspection method comprises the following steps:
S10: Acquire voice data to be tested.
The voice data to be tested refers to the speaker's voice data collected by the terminal device, and may be voice data in wav, mp3 or another format. It can be understood that each piece of voice data to be tested carries a data source ID, which is an identifier for uniquely identifying the speaker of that voice data. When an agent communicates with a customer by telephone, a recording module integrated in the terminal device, or a recording device connected to the terminal device, collects the voice data to be tested.
In this embodiment, the voice data to be tested may be acquired in two ways: online in real time, or offline. Online real-time acquisition means recording the conversation between the customer and the agent while the call is in progress to obtain the voice data to be tested. Offline acquisition means obtaining voice data to be tested that has already been saved in a database from the back end of a system connected to the terminal device. It can be understood that each piece of voice data to be tested carries a data source ID, and the speaker corresponding to that ID may be either the customer or the agent. The data source ID may therefore be a customer ID for uniquely identifying the customer, or an agent ID for uniquely identifying the agent. The customer ID may be the customer's ID-card number, mobile phone number, or the account number opened when transacting business with the institution; the agent ID may be the agent's staff number within the institution.
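As an illustration only, a piece of voice data to be tested and its data source ID might be represented as follows; the `VoiceSample` type and its field names are hypothetical conveniences introduced here and do not appear in the disclosure.

```python
# A minimal, hypothetical record for a piece of voice data to be tested.
from dataclasses import dataclass

@dataclass
class VoiceSample:
    source_id: str   # customer ID (ID-card/phone/account number) or agent staff number
    audio_path: str  # wav/mp3 recording, captured online or fetched offline
    online: bool     # True if recorded in real time during the call, False if from the database

# e.g. a clip recorded live while agent 0423 is on a call:
sample = VoiceSample(source_id="agent-0423", audio_path="call_001.wav", online=True)
```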
S20: Perform feature extraction on the voice data to be tested to obtain a voice feature.
It can be understood that the voice features include, but are not limited to, prosodic features, voice quality features, spectral features, lexical features and voiceprint features. Prosodic features, also called supra-segmental features, refer to variations in pitch, duration and intensity of speech beyond voice quality; in this embodiment they include, but are not limited to, pitch frequency, utterance duration, utterance amplitude and speaking rate. Voice quality features include, but are not limited to, the formants F1-F3, band energy distribution, harmonic-to-noise ratio and short-term energy jitter. Spectral features, also called vibration spectrum features, are obtained by decomposing a complex oscillation into harmonic oscillations of different amplitudes and frequencies and arranging their amplitudes by frequency; combining spectral features with prosodic and voice quality features improves the noise robustness of the feature parameters. In this embodiment, the spectral features are Mel-Frequency Cepstral Coefficients (MFCC), which reflect the auditory characteristics of the human ear. Lexical features reflect the part-of-speech characteristics of the words in the voice data to be tested, including, but not limited to, positive and negative words in this embodiment; combined with the other voice features, they help to recognize the emotion of the speaker. Voiceprint features (i.e., i-vector features) are speaker-related features which, combined with the other voice features, can further improve recognition accuracy.
Specifically, feature extraction on the voice data to be tested includes pre-emphasis of the voice data, framing and windowing of the processed data, fast Fourier transform and logarithm operations, and finally a discrete cosine transform to obtain the above voice features. The voice features are then concatenated into a feature vector, which is fed into the emotion recognition model for emotion recognition.
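The pipeline just described (pre-emphasis, framing, windowing, FFT, logarithm, DCT) is essentially standard MFCC extraction, so a minimal sketch can lean on an off-the-shelf implementation. The use of librosa and all parameter values below are assumptions for illustration; the disclosure does not name a library.

```python
# Sketch of the described extraction chain; librosa.feature.mfcc performs the
# framing, windowing, FFT, log-mel and DCT steps internally.
import numpy as np
import librosa

def extract_features(path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])               # pre-emphasis
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # framing/window/FFT/log/DCT
    # Collapse frame-level coefficients into one utterance-level vector; in the
    # full method the prosodic, voice quality, lexical and i-vector features
    # would be concatenated here as well before feeding the recognition model.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```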
S30: Recognize the voice feature by using an emotion recognition model to obtain an emotion recognition result.
The emotion recognition model is a model pre-trained in advance within the institution for recognizing emotion in speech. The emotion recognition result includes, but is not limited to, recognizing emotions such as happiness, anger, sadness, annoyance and calm carried in the speech of the speaker corresponding to the voice feature. Specifically, the terminal device uses the pre-trained emotion recognition model to recognize the voice feature and obtain an emotion recognition result such as happiness, anger, sadness, annoyance or calm. In this embodiment, the emotion recognition model recognizes the voice feature intelligently to obtain the emotion recognition result; the recognition process is highly efficient and permits timely and comprehensive inspection of the voice data corresponding to a speaker without manual intervention, which helps to save labor costs.
In a specific implementation, recognizing the voice feature by using the emotion recognition model in S30 to obtain the emotion recognition result specifically includes the following steps, as shown in FIG. 2:
S311: Recognize the voice feature by using a neural network-based emotion recognition model, and determine whether an accurate recognition result can be output.
The neural network-based emotion recognition model is a model for recognizing emotion in voice data, obtained by training a neural network model on training voice data. It includes an input layer, a hidden layer and an output layer: the voice feature obtained in step S20 is fed into the input layer, processed and recognized by the hidden layer, and a recognition result is produced at the output layer. The recognition result is either an accurate recognition result, indicating that the voice feature corresponds to one specific emotion, or a fuzzy recognition result, indicating that the voice feature cannot be matched to one specific emotion.
Specifically, the neural network-based emotion recognition model is trained as follows. First, training voice data is acquired and labeled with emotions so that each piece of training voice data carries an emotion label; the training voice data is the voice data used to train the emotion recognition model. 500 pieces of training voice data are selected for each emotion label, so that the training voice data for the five emotions (happiness, anger, sadness, annoyance and calm) is balanced, avoiding over-fitting during training. Then, feature extraction is performed on the labeled training voice data to obtain training voice features carrying emotion labels, represented as pairs of a training voice feature x and the corresponding emotion label y, e.g. (training voice feature 1, happy), (training voice feature 2, angry), ..., (training voice feature x, emotion y). A logistic regression algorithm in the neural network model then iterates over all the training voice features so as to extract from them the characteristics corresponding to happiness, anger, sadness, annoyance and calm. When the loss of the trained model converges after 20,000 iterations, training stops, yielding the final trained neural network-based emotion recognition model, which has strong non-linear fitting capability, can map complex non-linear relationships, and has good robustness and memory capacity. Further, the neural network in this embodiment is specifically a deep neural network (DNN).
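To make the training procedure concrete, the following is a minimal sketch of a DNN classifier trained on balanced, labeled feature vectors until the loss converges. PyTorch, the layer sizes, the optimizer and the 64-dimensional feature input are assumptions; the disclosure specifies only a DNN with input, hidden and output layers, five balanced classes, and training to convergence over roughly 20,000 iterations.

```python
# Hypothetical DNN emotion classifier and training loop.
import torch
import torch.nn as nn

EMOTIONS = ["happy", "angry", "sad", "annoyed", "calm"]

model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),   # input layer -> hidden layer (sizes assumed)
    nn.Linear(128, 64), nn.ReLU(),   # hidden layer
    nn.Linear(64, len(EMOTIONS)),    # output layer: one logit per emotion
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train(features: torch.Tensor, labels: torch.Tensor,
          max_iters: int = 20000, tol: float = 1e-4) -> None:
    prev_loss = float("inf")
    for _ in range(max_iters):
        loss = loss_fn(model(features), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if abs(prev_loss - loss.item()) < tol:   # stop once the loss converges
            break
        prev_loss = loss.item()
```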
When the neural network-based emotion recognition model recognizes the voice feature corresponding to the voice data to be tested, the voice feature is fed into the input layer of the model; the hidden layer performs emotion recognition based on the characteristics learned during training, and the corresponding emotion recognition result is output from the output layer. During recognition, the hidden layer computes the probabilities that the voice data to be tested expresses happiness, anger, sadness, annoyance or calm, and compares the difference between the highest and second-highest probabilities with a preset probability difference. If the difference between the highest and second-highest probabilities is greater than the preset probability difference, the emotion with the highest probability is output as an accurate recognition result. If the difference is not greater than the preset probability difference, the neural network-based emotion recognition model cannot output an accurate recognition result and outputs a fuzzy recognition result instead. The preset probability difference is a value set in advance for assessing whether the emotion category can be determined.
For example, suppose the preset probability difference is 20%. If the probabilities of happiness, anger, sadness, annoyance and calm are recognized as 2%, 60%, 15%, 20% and 3% respectively, the highest probability is 60% and the second-highest is 20%; their difference of 40% exceeds the preset probability difference, so the emotion with the highest probability is output as the recognition result. Conversely, if the probabilities are recognized as 2%, 40%, 20%, 35% and 3%, the highest probability is 40% and the second-highest is 35%; their difference is smaller than the preset probability difference, so an accurate recognition result cannot be output and a fuzzy recognition result is output instead.
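The accurate/fuzzy decision rule above can be sketched in a few lines, reusing the `EMOTIONS` list and `model` from the training sketch; the softmax over the network outputs is an assumption about how the per-emotion probabilities are obtained.

```python
# Hypothetical decision rule: accurate recognition result vs. fuzzy result.
import torch
import torch.nn.functional as F

PRESET_DIFF = 0.20   # the 20% preset probability difference from the example

def classify(feature_vec: torch.Tensor):
    probs = F.softmax(model(feature_vec), dim=-1)
    top2 = torch.topk(probs, 2)
    if top2.values[0] - top2.values[1] > PRESET_DIFF:
        return EMOTIONS[int(top2.indices[0])]   # accurate recognition result
    return None                                 # fuzzy result: fall back to the SVM model
```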
S312: If an accurate recognition result can be output, use the accurate recognition result as the emotion recognition result.
In this embodiment, if the neural network-based emotion recognition model can output an accurate recognition result, the highest probability among happiness, anger, sadness, annoyance and calm for the speaker of the voice data to be tested is much larger than the second-highest probability, and their difference exceeds the preset probability difference. The recognition result output by the model is therefore accurate and can be output directly as the emotion recognition result.
S313: If an accurate recognition result cannot be output, recognize the voice feature by using a support vector machine-based emotion recognition model to obtain the emotion recognition result.
In this embodiment, if the neural network-based emotion recognition model cannot output an accurate recognition result, i.e., it outputs a fuzzy recognition result, the difference between the highest and second-highest probabilities among happiness, anger, sadness, annoyance and calm for the speaker of the voice data to be tested does not exceed the preset probability difference, so the speaker cannot be accurately matched to one specific emotion. A support vector machine-based emotion recognition model is therefore used to further recognize the voice feature and obtain the emotion recognition result, further improving the accuracy of emotion recognition.
The support vector machine-based emotion recognition model is a model for recognizing emotion in voice data, obtained by training a support vector machine model on training voice data. A support vector machine (SVM) is a classifier based on support vector operations and can perform both linear and non-linear classification. An SVM emotion recognition model has low computational complexity, and its final result is determined by a small number of support vectors, which helps the training process seize key samples and discard redundant ones, giving good robustness.
The SVM emotion recognition model is trained as follows. SVM training aims to separate the different emotion attribute classes, namely the five emotions of happiness, anger, sadness, annoyance and calm, on a hyperplane. In this embodiment, the hyperplane is divided into a five-dimensional hyperplane according to these five emotion attributes, the dividing lines separating the five emotions are sought within it, and SVM training is complete once the expressions of the dividing lines are obtained. Specifically, the voice features of the training voice data (i.e., the prosodic, voice quality, spectral, lexical and voiceprint features) are input, audio features are extracted from the data to be tested, and the optimal solution space corresponding to the features, which is the corresponding emotion expression space, is obtained, completing the emotion classification and judgment of the input training voice data.
An SVM is a binary classification model, and can be implemented in binary tree form, i.e., each attribute is judged separately to determine whether the sample belongs to that emotion attribute or not. Because the SVM is a binary classifier, the SVM-based emotion recognition model in this implementation comprises separately created SVM-based recognition models for happiness, anger, sadness, annoyance and calm. When the SVM-based emotion recognition model recognizes the voice feature corresponding to the voice data to be tested, the voice feature is passed through the happiness, anger, sadness, annoyance and calm recognition models to obtain the corresponding emotion scores; the five scores are then compared and the emotion with the highest score is taken as the emotion recognition result.
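A minimal sketch of this one-model-per-emotion arrangement follows, again reusing the `EMOTIONS` list; scikit-learn's SVC and the use of its decision margin as the emotion score are assumptions, since the disclosure specifies only five binary SVM models whose scores are compared.

```python
# Hypothetical one-vs-rest SVM scoring: five binary models, highest score wins.
from sklearn.svm import SVC

svms = {emotion: SVC(kernel="rbf") for emotion in EMOTIONS}

def train_svms(X, y):
    # Each model learns "this emotion" (1) vs. "any other emotion" (0).
    for emotion, clf in svms.items():
        clf.fit(X, [1 if label == emotion else 0 for label in y])

def svm_recognize(x):
    # decision_function returns a signed margin, used here as the emotion score.
    scores = {e: clf.decision_function([x])[0] for e, clf in svms.items()}
    return max(scores, key=scores.get)
```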
In this embodiment, the neural network-based emotion recognition model first recognizes the voice feature corresponding to the voice data to be tested; it recognizes voice features with clear emotion attributes accurately, and the recognition process is fast. The support vector machine-based emotion recognition model then further recognizes the voice features for which the neural network-based model cannot output an accurate recognition result, which helps to improve recognition accuracy.
In a specific implementation, the voice data to be tested acquired by the terminal device carries a data source ID indicating the speaker of that voice data. As shown in FIG. 3, in the emotion recognition-based voice quality inspection method, step S30 of recognizing the voice feature by using an emotion recognition model to obtain the emotion recognition result specifically includes the following steps:
S321: Acquire, based on the data source ID of the voice data to be tested, a target emotion recognition model associated with the data source ID.
The target emotion recognition model is an emotion recognition model trained on training voice data carrying the same data source ID. It may be an emotion recognition model trained by the training method mentioned in this embodiment and carrying its own data source ID; that is, the target emotion recognition model may be a neural network-based or a support vector machine-based emotion recognition model, differing from the models in steps S311-S313 in that its training voice data all carries the same data source ID. It can be understood that the target emotion recognition model may be trained in advance on training voice data carrying the same data source ID and stored in the database. During emotion recognition, the terminal device queries the database with the data source ID in the received voice data to be tested and determines whether a target emotion recognition model associated with that ID exists in the database. If the target emotion recognition model exists, step S322 is performed; if it does not, steps S311-S313 are performed, i.e., voice emotion recognition is carried out with an emotion recognition model not associated with a data source ID, which can be applied to the voice data to be tested of any speaker.
S322: Recognize the voice feature by using the target emotion recognition model to obtain the emotion recognition result.
Because the target emotion recognition model is trained on training voice data with the same data source ID, it is an emotion recognition model for a specific speaker, and the voice data to be tested carries that same data source ID. Using the target emotion recognition model to recognize the voice feature corresponding to the voice data to be tested therefore makes the emotion recognition result more accurate. It can be understood that the target emotion recognition model recognizes only voice data to be tested carrying the same data source ID, so it is highly targeted and its recognition results are more accurate.
In a specific implementation, since the target emotion recognition model associated with the data source ID is used in steps S321 and S322, the emotion recognition-based voice quality inspection method further includes, before step S321, a step of pre-training the target emotion recognition model associated with the data source ID. As shown in FIG. 4, this pre-training specifically includes the following steps:
S331: Acquire, based on the data source ID, the training voice data associated with the data source ID in the database.
The training voice data may be voice data collected, while an agent communicates with a customer by telephone, by a recording module integrated in the terminal device or a recording device connected to it; the training voice data is stored, in association with the data source ID, in a database connected to the terminal device. When the target emotion recognition model associated with a data source ID needs to be trained, the database is queried to obtain all the training voice data associated with that ID.
S332: Determine whether the quantity of training voice data reaches an emotion model training threshold.
The emotion model training threshold is a preset quantity of training voice data required to train an emotion recognition model. If the quantity of training voice data reaches the threshold, the database holds enough training voice data to train a target emotion recognition model associated with the data source ID; if it does not, the target emotion recognition model cannot be trained.
S333: If the training voice data reaches the emotion model training threshold, perform emotion recognition model training based on the training voice data associated with the data source ID to obtain the target emotion recognition model.
In this embodiment, the training process based on the training voice data associated with the data source ID is the same as the training of the neural network-based and/or support vector machine-based emotion recognition models mentioned in steps S311-S313, and is not repeated here. Because the target emotion recognition model is trained on training voice data corresponding to the data source ID, it fits the emotions of the speaker corresponding to that ID more closely, so the trained target model recognizes the emotions in voice data to be tested carrying the same data source ID more accurately and effectively avoids the errors caused by training voice data from different speakers.
While a customer and an agent are on a call, the recording module integrated in the terminal device or the recording device connected to it records the call and stores the acquired voice data in the database, each piece of voice data stored in association with a data source ID, which may be a customer ID or an agent ID. The database periodically counts the number of voice recordings for each data source ID; when the number for any data source ID reaches the emotion model training threshold, step S333 is performed to obtain the target emotion recognition model corresponding to that ID. When recognizing voice data to be tested, the database is first searched for a target emotion recognition model associated with the data source ID carried by the voice data; if one exists, steps S321-S322 are performed to ensure recognition accuracy; if not, steps S311-S313 are performed.
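As an illustration, the dispatch and training-trigger logic described in this and the preceding paragraphs might look as follows. The database interface, the `generic_recognize` fallback (steps S311-S313), the `train_emotion_model` helper and the threshold value are all hypothetical placeholders.

```python
# Hypothetical per-speaker model dispatch and threshold-triggered training.
TRAIN_THRESHOLD = 500          # assumed emotion model training threshold

target_models = {}             # data source ID -> target emotion recognition model

def recognize(sample, features):
    model = target_models.get(sample.source_id)
    if model is not None:                      # steps S321-S322
        return model.predict(features)
    return generic_recognize(features)         # steps S311-S313

def maybe_train_target_model(source_id, db):
    clips = db.fetch_training_voice_data(source_id)   # hypothetical DB call
    if len(clips) >= TRAIN_THRESHOLD:                 # step S332
        target_models[source_id] = train_emotion_model(clips)  # step S333
```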
S40: Send the emotion recognition result to an associated terminal, so that the associated terminal displays the emotion recognition result.
The associated terminal may be a terminal for human-computer interaction with an agent or a quality inspector, where a quality inspector is a person who inspects the quality of the service provided by the institution's agents. The terminal includes, but is not limited to, a smartphone, PC, tablet computer or other terminal capable of displaying the emotion recognition result. In this embodiment, after performing emotion recognition, the terminal device of the financial institution or other institution requiring emotion recognition sends the obtained emotion recognition result to the associated terminal for display, so that the agent or quality inspector using that terminal can learn the emotion of the speaker of the voice data to be tested from the result.
In this embodiment, if the speaker of the voice data to be tested is a customer, the terminal device sends the emotion recognition result to the agent on the call with that customer, so that the agent can adjust the manner of communication according to the customer's emotion (for example, offering appropriate reassurance when the customer is angry), thereby improving service quality and the customer's satisfaction with the institution's service. If the speaker is an agent, the terminal device sends the emotion recognition result to a quality inspector, so that the inspector can monitor the agent's communication with customers, evaluate the agent's working state, and apply rewards and penalties, encouraging agents to serve customers better.
In the emotion recognition-based voice quality inspection method provided by the present application, feature extraction is performed on the voice data to be tested to obtain a voice feature, the emotion recognition model recognizes the voice feature to obtain an emotion recognition result, and the result is displayed by the associated terminal so that the user can learn the emotion of the speaker of the voice data to be tested by viewing it. In this embodiment, displaying the emotion recognition result corresponding to the voice data to be tested on the associated terminal helps agents improve the quality of customer service and thus the customer's satisfaction with the institution. The method enables intelligent recognition of the voice data to be tested to obtain the emotion recognition result; the recognition process is highly efficient and permits timely and comprehensive inspection of a speaker's voice data without manual intervention, helping to save labor costs.
In a specific implementation, step S10 of the emotion recognition-based voice quality inspection method specifically includes: acquiring the voice data to be tested collected by a call terminal in real time.
In this embodiment, the call terminal may be a terminal for voice communication with a customer or with an agent, and may be a voice call device such as a fixed telephone, mobile phone or walkie-talkie. The call terminal communicates with the terminal device of the financial institution or other institution requiring emotion recognition, so that the terminal device can acquire the voice data to be tested collected by the call terminal in real time and monitor it in real time. Specifically, acquiring the voice data to be tested collected by the call terminal in real time means that the terminal device acquires the voice data collected in real time during the call between the customer and the agent, so that the emotion of the customer or agent on the call can be monitored.
Step S40 specifically includes: sending the emotion recognition result to the associated terminal in real time, so that the associated terminal displays the emotion recognition result.
In this embodiment, the terminal device sends the obtained emotion recognition result to the associated terminal in real time, so that the associated terminal can display the emotion of the speaker of the voice data to be tested as the call proceeds, prompting the agent to adjust the manner of communication and thereby improving the customer's satisfaction with the agent and the institution. This method uses artificial intelligence for recognition, has high processing efficiency, and requires no professional quality inspectors for sampling, saving labor costs and reducing the risk of fraud.
In the emotion recognition-based voice quality inspection method provided by the present application, the voice data to be tested collected by the call terminal is acquired in real time, feature extraction is performed on it to obtain a voice feature, the emotion recognition model recognizes the voice feature to obtain an emotion recognition result, and the result is sent to the associated terminal in real time and displayed there, so that the agent or quality inspector at the associated terminal can learn the emotion of the speaker by viewing the result and adjust the manner of communication, helping to improve the institution's service quality and, in turn, the customer's satisfaction with the institution. The method enables intelligent recognition of the voice data to be tested to obtain the emotion recognition result; the recognition process is highly efficient and permits timely and comprehensive inspection of a speaker's voice data without manual intervention, helping to save labor costs.
It should be understood that the step numbers in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not limit the implementation of the present application in any way.
Embodiment 2
Corresponding to the emotion recognition-based voice quality inspection method in Embodiment 1, FIG. 5 shows a schematic block diagram of an emotion recognition-based voice quality inspection apparatus corresponding one-to-one to that method. As shown in FIG. 5, the apparatus includes a to-be-tested voice data acquisition module 10, a voice data feature acquisition module 20, an emotion recognition result acquisition module 30 and an emotion recognition result feedback module 40. The functions implemented by these modules correspond one-to-one to the steps of the emotion recognition-based voice quality inspection method in Embodiment 1 and, to avoid repetition, are not described in detail here.
The to-be-tested voice data acquisition module 10 is configured to acquire voice data to be tested.
The voice data feature acquisition module 20 is configured to perform feature extraction on the voice data to be tested to obtain a voice feature.
The emotion recognition result acquisition module 30 is configured to recognize the voice feature by using an emotion recognition model to obtain an emotion recognition result.
The emotion recognition result feedback module 40 is configured to send the emotion recognition result to an associated terminal, so that the associated terminal displays the emotion recognition result.
Preferably, the emotion recognition result acquisition module 30 includes a recognition result output judgment unit 311, a first recognition result processing unit 312 and a second recognition result processing unit 313.
The recognition result output judgment unit 311 is configured to recognize the voice feature by using a neural network-based emotion recognition model and determine whether an accurate recognition result can be output.
The first recognition result processing unit 312 is configured to use the accurate recognition result as the emotion recognition result when an accurate recognition result can be output.
The second recognition result processing unit 313 is configured to recognize the voice feature by using a support vector machine-based emotion recognition model to obtain the emotion recognition result when an accurate recognition result cannot be output.
Preferably, the emotion recognition result acquisition module 30 includes a target model acquisition unit 321 and a recognition result acquisition unit 322.
The target model acquisition unit 321 is configured to acquire, based on the data source ID of the voice data to be tested, a target emotion recognition model associated with the data source ID.
The recognition result acquisition unit 322 is configured to recognize the voice feature by using the target emotion recognition model to obtain the emotion recognition result.
Preferably, the emotion recognition-based voice quality inspection apparatus further includes a target model training module 50.
The target model training module 50 is configured to pre-train the target emotion recognition model associated with the data source ID.
Preferably, the target model training module 50 includes a training voice data acquisition unit 51, a quantity judgment unit 52 and a target model training unit 53.
The training voice data acquisition unit 51 is configured to acquire, based on the data source ID, the training voice data associated with the data source ID in the database.
The quantity judgment unit 52 is configured to determine whether the quantity of training voice data reaches the emotion model training threshold.
The target model training unit 53 is configured to perform emotion recognition model training based on the training voice data associated with the data source ID to obtain the target emotion recognition model when the training voice data reaches the emotion model training threshold.
Preferably, the to-be-tested voice data acquisition module 10 is configured to acquire the voice data to be tested collected by the call terminal in real time.
The emotion recognition result feedback module 40 is configured to send the emotion recognition result to the associated terminal in real time, so that the associated terminal displays the emotion recognition result.
In the emotion recognition-based voice quality inspection apparatus provided by this embodiment, the to-be-tested voice data acquisition module 10 can acquire the voice data to be tested online in real time or obtain voice data stored in the database offline, meeting different acquisition requirements. The voice data feature acquisition module 20 performs feature extraction on the voice data to be tested to obtain voice features. The emotion recognition result acquisition module 30 recognizes the voice features by using the emotion recognition model to obtain the emotion recognition result, where the emotion recognition model is built by combining neural network-based emotion recognition model training and support vector machine-based emotion recognition model training, making the output more accurate and realistic. In addition, the emotion recognition result acquisition module 30 can also acquire, from the data source ID of the voice data to be tested, the target emotion recognition model associated with that ID for emotion recognition. The emotion recognition result feedback module 40 sends the emotion recognition result to the associated terminal, so that the associated terminal displays the emotion recognition result; the associated terminal can display the result in real time, reminding the agent to adjust the conversation strategy in time according to the customer's emotion so that the call proceeds pleasantly and smoothly, while quality inspectors can also inspect agents efficiently.
Embodiment 3
This embodiment provides a computer readable storage medium storing computer readable instructions which, when executed by a processor, implement the emotion recognition-based voice quality inspection method of Embodiment 1; to avoid repetition, details are not described here again. Alternatively, when executed by a processor, the computer readable instructions implement the functions of the modules/units of the emotion recognition-based voice quality inspection apparatus of Embodiment 2; to avoid repetition, details are not described here again.
Embodiment 4
FIG. 6 is a schematic diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 6, the terminal device 60 of this embodiment includes a processor 61, a memory 62, and computer readable instructions 63 stored in the memory 62 and executable on the processor 61. When executing the computer readable instructions 63, the processor 61 implements the steps of the emotion recognition-based voice quality inspection method of Embodiment 1, such as steps S10 to S40 shown in FIG. 1. Alternatively, when executing the computer readable instructions 63, the processor 61 implements the functions of the modules/units of the above apparatus embodiments, such as the to-be-tested voice data acquisition module 10, the voice data feature acquisition module 20, the emotion recognition result acquisition module 30 and the emotion recognition result feedback module 40 shown in FIG. 5.
Illustratively, the computer readable instructions 63 may be divided into one or more modules/units, which are stored in the memory 62 and executed by the processor 61 to complete the present application. The one or more modules/units may be a series of instruction segments of the computer readable instructions 63 capable of performing specific functions, and the instruction segments describe the execution of the computer readable instructions 63 in the terminal device 60. For example, the computer readable instructions 63 may be divided into the to-be-tested voice data acquisition module 10, the voice data feature acquisition module 20, the emotion recognition result acquisition module 30, the emotion recognition result feedback module 40 and the target model training module 50 of Embodiment 2, whose functions are described in detail in Embodiment 2 and are not repeated here.
The terminal device 60 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The terminal device may include, but is not limited to, the processor 61 and the memory 62. Those skilled in the art will understand that FIG. 6 is merely an example of the terminal device 60 and does not limit it; the terminal device may include more or fewer components than shown, combine certain components, or include different components; for example, it may also include input and output devices, network access devices, buses, and so on.
The processor 61 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 62 may be an internal storage unit of the terminal device 60, such as its hard disk or internal memory. The memory 62 may also be an external storage device of the terminal device 60, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the terminal device 60. Further, the memory 62 may include both an internal storage unit and an external storage device of the terminal device 60. The memory 62 is used to store the computer readable instructions and other programs and data required by the terminal device, and may also be used to temporarily store data that has been or will be output.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is given as an example; in practical applications, the above functions may be assigned to different functional units and modules as needed, i.e., the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules serve only to distinguish them from one another and do not limit the scope of protection of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the above embodiments, the descriptions of the various embodiments have different emphases; for parts not detailed or described in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present application.
In the embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the modules or units is only a logical functional division, and in actual implementation there may be other divisions, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the units may be selected as needed to achieve the purpose of the solution of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer readable storage medium. Based on this understanding, the present application implements all or part of the processes in the methods of the above embodiments, which may also be completed by computer readable instructions instructing the relevant hardware; the computer readable instructions may be stored in a computer readable storage medium and, when executed by a processor, implement the steps of the above method embodiments. The computer readable instructions include computer readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer readable media do not include electrical carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.
Claims (20)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710734303.X | 2017-08-24 | ||
| CN201710734303.XA CN107705807B (en) | 2017-08-24 | 2017-08-24 | Voice quality detecting method, device, equipment and storage medium based on Emotion identification |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019037382A1 true WO2019037382A1 (en) | 2019-02-28 |
Family
ID=61169845
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2018/072967 Ceased WO2019037382A1 (en) | 2017-08-24 | 2018-01-17 | Emotion recognition-based voice quality inspection method and device, equipment and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN107705807B (en) |
| WO (1) | WO2019037382A1 (en) |
Families Citing this family (35)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108494952B (en) * | 2018-03-05 | 2021-07-09 | Oppo广东移动通信有限公司 | Voice call processing method and related equipment |
| CN108446388A (en) * | 2018-03-22 | 2018-08-24 | 平安科技(深圳)有限公司 | Text data quality detecting method, device, equipment and computer readable storage medium |
| CN110415727B (en) * | 2018-04-28 | 2021-12-07 | 科大讯飞股份有限公司 | Pet emotion recognition method and device |
| CN108763499B (en) * | 2018-05-30 | 2024-02-23 | 平安科技(深圳)有限公司 | Call quality inspection method, device, equipment and storage medium based on intelligent voice |
| CN108899050B (en) * | 2018-06-14 | 2020-10-02 | 南京云思创智信息科技有限公司 | Voice signal analysis subsystem based on multi-modal emotion recognition system |
| CN108962255B (en) * | 2018-06-29 | 2020-12-08 | 北京百度网讯科技有限公司 | Emotion recognition method, device, server and storage medium for speech conversation |
| CN108985358B (en) * | 2018-06-29 | 2021-03-02 | 北京百度网讯科技有限公司 | Emotion recognition method, device, device and storage medium |
| CN109119069B (en) * | 2018-07-23 | 2020-08-14 | 深圳大学 | Specific crowd identification method, electronic device and computer-readable storage medium |
| CN110890089B (en) * | 2018-08-17 | 2022-08-19 | 珠海格力电器股份有限公司 | Voice recognition method and device |
| CN109243491B (en) * | 2018-10-11 | 2023-06-02 | 平安科技(深圳)有限公司 | Method, system and storage medium for emotion recognition of speech in frequency spectrum |
| CN109658923B (en) * | 2018-10-19 | 2024-01-30 | 平安科技(深圳)有限公司 | Speech quality inspection method, equipment, storage medium and device based on artificial intelligence |
| CN109286726B (en) * | 2018-10-25 | 2021-05-14 | 维沃移动通信有限公司 | Content display method and terminal equipment |
| CN109243492A (en) * | 2018-10-28 | 2019-01-18 | 国家计算机网络与信息安全管理中心 | A kind of speech emotion recognition system and recognition methods |
| CN109473106B (en) * | 2018-11-12 | 2023-04-28 | 平安科技(深圳)有限公司 | Voiceprint sample collection method, voiceprint sample collection device, voiceprint sample collection computer equipment and storage medium |
| CN109587360B (en) * | 2018-11-12 | 2021-07-13 | 平安科技(深圳)有限公司 | Electronic device, method for coping with tactical recommendation, and computer-readable storage medium |
| CN109767335A (en) * | 2018-12-15 | 2019-05-17 | 深圳壹账通智能科技有限公司 | Double recording quality inspection method, device, computer equipment and storage medium |
| CN109600526A (en) * | 2019-01-08 | 2019-04-09 | 上海上湖信息技术有限公司 | Customer service quality determining method and device, readable storage medium storing program for executing |
| CN109829415A (en) * | 2019-01-25 | 2019-05-31 | 平安科技(深圳)有限公司 | Gender identification method, device, medium and equipment based on depth residual error network |
| CN109587347A (en) * | 2019-01-28 | 2019-04-05 | 珠海格力电器股份有限公司 | Display screen parameter adjusting method, device and system and mobile terminal |
| CN109992505B (en) * | 2019-03-15 | 2024-07-02 | 平安科技(深圳)有限公司 | Application program testing method and device, computer equipment and storage medium |
| CN111739558B (en) * | 2019-03-21 | 2023-03-28 | 杭州海康威视数字技术股份有限公司 | Monitoring system, method, device, server and storage medium |
| US20200381130A1 (en) * | 2019-05-30 | 2020-12-03 | Insurance Services Office, Inc. | Systems and Methods for Machine Learning of Voice Attributes |
| CN110719370A (en) * | 2019-09-04 | 2020-01-21 | 平安科技(深圳)有限公司 | Code scanning vehicle moving method, electronic device and storage medium |
| CN110738998A (en) * | 2019-09-11 | 2020-01-31 | 深圳壹账通智能科技有限公司 | Voice-based personal credit evaluation method, device, terminal and storage medium |
| CN110580899A (en) * | 2019-10-12 | 2019-12-17 | 上海上湖信息技术有限公司 | Voice recognition method and device, storage medium and computing equipment |
| CN110634491B (en) * | 2019-10-23 | 2022-02-01 | 大连东软信息学院 | Series connection feature extraction system and method for general voice task in voice signal |
| CN112509561A (en) * | 2020-12-03 | 2021-03-16 | 中国联合网络通信集团有限公司 | Emotion recognition method, device, equipment and computer readable storage medium |
| CN112668857A (en) * | 2020-12-23 | 2021-04-16 | 深圳壹账通智能科技有限公司 | Data classification method, device, equipment and storage medium for grading quality inspection |
| CN113129927B (en) * | 2021-04-16 | 2023-04-07 | 平安科技(深圳)有限公司 | Voice emotion recognition method, device, equipment and storage medium |
| CN113347491A (en) * | 2021-05-24 | 2021-09-03 | 北京格灵深瞳信息技术股份有限公司 | Video editing method and device, electronic equipment and computer storage medium |
| CN113449967B (en) * | 2021-06-04 | 2024-02-09 | 广东昭信平洲电子有限公司 | Quality inspection management system of inductance coil |
| CN113197579A (en) * | 2021-06-07 | 2021-08-03 | 山东大学 | Intelligent psychological assessment method and system based on multi-mode information fusion |
| CN113988155A (en) * | 2021-09-27 | 2022-01-28 | 北京智象信息技术有限公司 | Electronic photo frame picture display method and system based on intelligent voice |
| CN114662499A (en) * | 2022-03-17 | 2022-06-24 | 平安科技(深圳)有限公司 | Text-based emotion recognition method, device, equipment and storage medium |
| CN117332946A (en) * | 2023-09-12 | 2024-01-02 | 上海数禾信息科技有限公司 | Case distribution method and device based on canvas |
- 2017-08-24: CN application CN201710734303.XA filed; granted as CN107705807B (en) (status: Active)
- 2018-01-17: PCT application PCT/CN2018/072967 filed; published as WO2019037382A1 (en) (status: Ceased)
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140257820A1 (en) * | 2013-03-10 | 2014-09-11 | Nice-Systems Ltd | Method and apparatus for real time emotion detection in audio interactions |
| CN104036776A (en) * | 2014-05-22 | 2014-09-10 | 毛峡 | A speech emotion recognition method applied to mobile terminals |
| CN104538043A (en) * | 2015-01-16 | 2015-04-22 | 北京邮电大学 | Real-time emotion reminder for call |
| CN105609117A (en) * | 2016-02-19 | 2016-05-25 | 郑洪亮 | Device and method for identifying voice emotion |
| CN106469560A (en) * | 2016-07-27 | 2017-03-01 | 江苏大学 | A Speech Emotion Recognition Method Based on Unsupervised Domain Adaptation |
| CN106782602A (en) * | 2016-12-01 | 2017-05-31 | 南京邮电大学 | Speech emotion recognition method based on long short-term memory network and convolutional neural networks |
Cited By (28)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113792625A (en) * | 2018-08-11 | 2021-12-14 | 昆山美卓智能科技有限公司 | A smart desk with status monitoring function, status monitoring system and server |
| CN110378562B (en) * | 2019-06-17 | 2023-07-28 | 中国平安人寿保险股份有限公司 | Voice quality inspection method, device, computer equipment and storage medium |
| CN110378562A (en) * | 2019-06-17 | 2019-10-25 | 中国平安人寿保险股份有限公司 | Voice quality detecting method, device, computer equipment and storage medium |
| CN110379445A (en) * | 2019-06-20 | 2019-10-25 | 深圳壹账通智能科技有限公司 | Method for processing business, device, equipment and storage medium based on mood analysis |
| CN112347774A (en) * | 2019-08-06 | 2021-02-09 | 北京搜狗科技发展有限公司 | A model determination method and device for user emotion recognition |
| CN110705349A (en) * | 2019-08-26 | 2020-01-17 | 深圳壹账通智能科技有限公司 | Customer satisfaction recognition method, device, terminal and medium based on micro expression |
| CN110598612A (en) * | 2019-08-30 | 2019-12-20 | 深圳智慧林网络科技有限公司 | Patient nursing method based on mobile terminal, mobile terminal and readable storage medium |
| CN110598612B (en) * | 2019-08-30 | 2023-06-09 | 深圳智慧林网络科技有限公司 | Patient nursing method based on mobile terminal, mobile terminal and readable storage medium |
| CN110827857A (en) * | 2019-11-28 | 2020-02-21 | 哈尔滨工程大学 | Speech emotion recognition method based on spectral features and ELM |
| CN110827857B (en) * | 2019-11-28 | 2022-04-12 | 哈尔滨工程大学 | Speech emotion recognition method based on spectral features and ELM |
| CN111081280A (en) * | 2019-12-30 | 2020-04-28 | 苏州思必驰信息科技有限公司 | Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method |
| CN111694938B (en) * | 2020-04-27 | 2024-05-14 | 平安科技(深圳)有限公司 | Emotion recognition-based reply method and device, computer equipment and storage medium |
| CN111694938A (en) * | 2020-04-27 | 2020-09-22 | 平安科技(深圳)有限公司 | Emotion recognition-based answering method and device, computer equipment and storage medium |
| CN111709557A (en) * | 2020-05-28 | 2020-09-25 | 武汉中海庭数据技术有限公司 | High-precision map data optimal production flow calculation method and device |
| CN111709557B (en) * | 2020-05-28 | 2023-04-28 | 武汉中海庭数据技术有限公司 | High-precision map data optimal production flow calculation method and device |
| CN112052663A (en) * | 2020-08-31 | 2020-12-08 | 平安科技(深圳)有限公司 | Customer service statement quality inspection method and related equipment |
| CN112052663B (en) * | 2020-08-31 | 2022-08-02 | 平安科技(深圳)有限公司 | Customer service statement quality inspection method and related equipment |
| CN112329437A (en) * | 2020-10-21 | 2021-02-05 | 交通银行股份有限公司 | Intelligent customer service voice quality inspection scoring method, equipment and storage medium |
| CN112329437B (en) * | 2020-10-21 | 2024-05-28 | 交通银行股份有限公司 | Intelligent customer service voice quality inspection scoring method, equipment and storage medium |
| CN112383593A (en) * | 2020-10-30 | 2021-02-19 | 中国平安人寿保险股份有限公司 | Intelligent content pushing method and device based on offline accompanying visit and computer equipment |
| CN114519596A (en) * | 2020-11-18 | 2022-05-20 | 中国移动通信有限公司研究院 | A data processing method, device and equipment |
| CN112836718A (en) * | 2020-12-08 | 2021-05-25 | 上海大学 | An Image Emotion Recognition Method Based on Fuzzy Knowledge Neural Network |
| CN112883932A (en) * | 2021-03-30 | 2021-06-01 | 中国工商银行股份有限公司 | Method, device and system for detecting abnormal behaviors of staff |
| CN112954104A (en) * | 2021-04-15 | 2021-06-11 | 北京蓦然认知科技有限公司 | Method and device for line quality inspection |
| CN113345468A (en) * | 2021-05-25 | 2021-09-03 | 平安银行股份有限公司 | Voice quality inspection method, device, equipment and storage medium |
| CN113903358A (en) * | 2021-10-15 | 2022-01-07 | 北京房江湖科技有限公司 | Voice quality inspection method, readable storage medium and computer program product |
| CN115460317A (en) * | 2022-09-05 | 2022-12-09 | 西安万像电子科技有限公司 | Emotion recognition and voice feedback method, device, medium and electronic equipment |
| CN119005213A (en) * | 2024-10-24 | 2024-11-22 | 杭州度言软件有限公司 | Intelligent strategy quality inspection method and platform based on real-time ASR voice stream |
Also Published As
| Publication number | Publication date |
|---|---|
| CN107705807A (en) | 2018-02-16 |
| CN107705807B (en) | 2019-08-27 |
Similar Documents
| Publication | Title |
|---|---|
| WO2019037382A1 (en) | Emotion recognition-based voice quality inspection method and device, equipment and storage medium |
| CN112804400B (en) | Customer service call voice quality inspection method and device, electronic equipment and storage medium |
| Hildebrand et al. | Voice analytics in business research: Conceptual foundations, acoustic feature extraction, and applications |
| US10771627B2 (en) | Personalized support routing based on paralinguistic information |
| US20190253558A1 (en) | System and method to automatically monitor service level agreement compliance in call centers |
| CN112259106A (en) | Voiceprint recognition method and device, storage medium and computer equipment |
| WO2019037205A1 (en) | Voice fraud identifying method and apparatus, terminal device, and storage medium |
| WO2019210557A1 (en) | Voice quality inspection method and device, computer device and storage medium |
| CN106504768B (en) | Phone testing audio frequency classification method and device based on artificial intelligence |
| CN109767765A (en) | Vocabulary matching method and device, storage medium, and computer equipment |
| CN107452385A (en) | A kind of voice-based data evaluation method and device |
| WO2021047319A1 (en) | Voice-based personal credit assessment method and apparatus, terminal and storage medium |
| CN113628627B (en) | Electric power industry customer service quality inspection system based on structured voice analysis |
| CN114418320A (en) | Customer service quality evaluation method, apparatus, device, medium, and program product |
| CN119181380B (en) | Speech fraud analysis method, device, equipment and storage medium |
| CN115643341A (en) | Artificial intelligence customer service response system |
| CN114925159A (en) | User sentiment analysis model training method, device, electronic device and storage medium |
| US20250232768A1 (en) | System method and apparatus for combining words and behaviors |
| CN115101053B (en) | Dialogue processing method, device, terminal and storage medium based on emotion recognition |
| CN116631412A (en) | Method for judging voice robot through voiceprint matching |
| WO2021098637A1 (en) | Voice transliteration method and apparatus, and related system and device |
| US10446138B2 (en) | System and method for assessing audio files for transcription services |
| Pandharipande et al. | A novel approach to identify problematic call center conversations |
| CN119360881B (en) | Audio data labeling method, device, equipment and medium |
| CN118982292A (en) | Method, device and equipment for quality inspection of call recordings based on large models |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 18847940; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | Ep: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.09.2020) |
| | 122 | Ep: PCT application non-entry in European phase | Ref document number: 18847940; Country of ref document: EP; Kind code of ref document: A1 |