WO2023184942A1

WO2023184942A1 - Voice interaction method and apparatus and electric appliance

Info

Publication number: WO2023184942A1
Application number: PCT/CN2022/126640
Authority: WO
Inventors: 吴岩; 张桂芳
Original assignee: Qingdao Haier Air Conditioner Gen Corp Ltd; Qingdao Haier Air Conditioning Electric Co Ltd; Haier Smart Home Co Ltd
Current assignee: Qingdao Haier Air Conditioner Gen Corp Ltd; Qingdao Haier Air Conditioning Electric Co Ltd; Haier Smart Home Co Ltd
Priority date: 2022-03-29
Filing date: 2022-10-21
Publication date: 2023-10-05
Anticipated expiration: 2024-09-29
Also published as: CN114708869A

Abstract

A voice interaction method and apparatus, and an electric appliance, the method comprising: receiving a voice input of a target user, and determining target voice information (110); performing voice recognition on the target voice information to obtain a voice recognition result, the voice recognition result comprising at least one of age information of the target user and target emotion features of the target voice information (120); and outputting a response voice, the response voice being set on the basis of the voice recognition result (130). The voice interaction method can make corresponding humanized responses in respect of ages and emotion features according to the voice recognition result, thereby improving the user experience.

Description

Voice interaction method, device and electrical appliance

相关申请的交叉引用Cross-references to related applications

本申请要求于2022年3月29日提交的申请号为202210324200.7，名称为“语音交互方法、装置及电器”的中国专利申请的优先权，其通过引用方式全部并入本文。This application claims priority to the Chinese patent application with application number 202210324200.7 and titled "Voice Interaction Method, Device and Electrical Appliance" filed on March 29, 2022, which is fully incorporated herein by reference.

Technical field

本申请涉及智能电器技术领域，尤其涉及一种语音交互方法、装置及电器。The present application relates to the technical field of smart electrical appliances, and in particular to a voice interaction method, device and electrical appliances.

Background technique

随着科技的发展，语音交互技术的应用场景也越来越丰富，通过利用人工智能以及TTS(Text To Speech，从文本到语音)等技术，能够实现与用户群体的语音交流。With the development of science and technology, the application scenarios of voice interaction technology are becoming more and more abundant. By using artificial intelligence and TTS (Text To Speech, from text to speech) and other technologies, voice communication with user groups can be achieved.

相关技术中，语音交互设备的语音应答较为生硬，无法针对不同用户进行人性化的应答，给用户造成了不好的使用体验。In related technologies, the voice response of voice interaction equipment is relatively stiff and cannot provide humanized responses to different users, resulting in a bad user experience for users.

发明内容Contents of the invention

本申请提供一种语音交互方法、装置及电器，用以解决现有技术中语音交互装置应答生硬的缺陷，实现了能针对不同用户来进行有感情地应答的效果。The present application provides a voice interaction method, device and electrical appliance to solve the problem of blunt responses in the existing voice interaction devices and achieve the effect of responding emotionally to different users.

本申请提供一种语音交互方法，包括：This application provides a voice interaction method, including:

接收目标用户的语音输入，确定目标语音信息；Receive the voice input from the target user and determine the target voice information;

对所述目标语音信息进行语音识别，得到语音识别结果；所述语音识别结果包括所述目标用户的年龄信息和所述目标语音信息的目标情感特征中的至少一个；Perform speech recognition on the target speech information to obtain a speech recognition result; the speech recognition result includes at least one of the age information of the target user and the target emotional characteristics of the target speech information;

输出应答语音，所述应答语音为基于所述语音识别结果设置的。A response voice is output, and the response voice is set based on the voice recognition result.

根据本申请提供的一种语音交互方法，所述对所述目标语音信息进行语音识别，得到语音识别结果，包括：According to a voice interaction method provided by this application, performing voice recognition on the target voice information to obtain a voice recognition result includes:

获取所述目标语音信息相关的声纹特征识别结果；Obtain the voiceprint feature recognition results related to the target voice information;

基于所述声纹特征识别结果，确定所述目标用户的年龄信息；Based on the voiceprint feature recognition result, determine the age information of the target user;

基于所述年龄信息，设置所述应答语音的第一情感特征。Based on the age information, a first emotional characteristic of the response voice is set.

根据本申请提供的一种语音交互方法，所述获取所述目标语音信息相关的声纹特征识别结果，包括：According to a voice interaction method provided by this application, obtaining the voiceprint feature recognition results related to the target voice information includes:

基于所述目标语音信息，确定目标声纹特征；Based on the target voice information, determine the target voiceprint characteristics;

在声纹特征库中查找与所述目标声纹特征相匹配的声纹特征样本，在所述声纹特征库中预先存储有声纹特征样本集；Search for voiceprint feature samples that match the target voiceprint feature in a voiceprint feature database, where a voiceprint feature sample set is pre-stored in the voiceprint feature database;

在查找到与所述目标声纹特征相匹配的声纹特征样本的情况下，确定所述声纹特征样本对应的年龄信息。When a voiceprint feature sample matching the target voiceprint feature is found, age information corresponding to the voiceprint feature sample is determined.

根据本申请提供的一种语音交互方法，所述在声纹特征库中查找与所述目标声纹特征相匹配的声纹特征样本之后，所述方法还包括：According to a voice interaction method provided by this application, after searching for a voiceprint feature sample that matches the target voiceprint feature in the voiceprint feature database, the method further includes:

在未查找到与所述目标声纹特征相匹配的声纹特征样本的情况下，将所述目标声纹特征输入至年龄预测模型，得到所述年龄预测模型输出的年龄信息，所述年龄预测模型为以年龄样本声纹特征为样本，以年龄样本声纹特征中发声者的年龄信息为标签训练得到的。When no voiceprint feature sample matching the target voiceprint feature is found, the target voiceprint feature is input to the age prediction model to obtain the age information output by the age prediction model, and the age prediction The model is trained using age sample voiceprint features as samples and the age information of the speaker in the age sample voiceprint features as labels.

根据本申请提供的一种语音交互方法，在所述获取所述目标语音信息相关的声纹特征识别结果之后，所述方法还包括：According to a voice interaction method provided by this application, after obtaining the voiceprint feature recognition results related to the target voice information, the method further includes:

在基于所述声纹特征识别结果，未确定出所述目标用户的年龄信息的情况下，对所述目标语音信息进行语音情感识别，得到所述目标语音信息的目标情感特征；If the age information of the target user is not determined based on the voiceprint feature recognition result, perform voice emotion recognition on the target voice information to obtain the target emotion feature of the target voice information;

基于所述目标情感特征，设置所述应答语音的第二情感特征。Based on the target emotional feature, a second emotional feature of the response voice is set.

根据本申请提供的一种语音交互方法，所述对所述目标语音信息进行语音情感识别，得到所述目标语音信息的目标情感特征，包括：According to a voice interaction method provided by this application, performing voice emotion recognition on the target voice information to obtain the target emotion characteristics of the target voice information includes:

基于所述目标语音信息，生成目标文本信息；Based on the target voice information, generate target text information;

提取所述目标文本信息中的目标语料；Extract target corpus from the target text information;

在情感词典库中查找到所述目标语料的情况下，确定所述目标语料对应的所述语料情感特征；When the target corpus is found in the emotional dictionary, determine the emotional characteristics of the corpus corresponding to the target corpus;

基于所述语料情感特征，确定所述目标情感特征；Based on the emotional characteristics of the corpus, determine the target emotional characteristics;

所述情感词典库包括多个语料以及与所述语料对应的情感特征。The emotion dictionary includes a plurality of corpus and emotional features corresponding to the corpus.

本申请还提供一种语音交互装置，包括：This application also provides a voice interaction device, including:

接收模块，用于接收目标用户的语音输入，确定目标语音信息；The receiving module is used to receive the voice input of the target user and determine the target voice information;

处理模块，用于对所述目标语音信息进行语音识别，得到语音识别结果；所述语音识别结果包括所述目标用户的年龄信息和所述目标语音信息的目标情感特征中的至少一个；A processing module configured to perform speech recognition on the target speech information and obtain a speech recognition result; the speech recognition result includes at least one of the age information of the target user and the target emotional characteristics of the target speech information;

输出模块，用于输出应答语音，所述应答语音为基于所述语音识别结果设置的。An output module is configured to output a response voice, where the response voice is set based on the voice recognition result.

本申请还提供一种电器，包括如上述的语音交互装置。This application also provides an electrical appliance, including the above-mentioned voice interaction device.

本申请还提供一种空调，包括室内机、室外机和设置在所述室内机或室外机中的处理器和存储器；还包括存储在所述存储器上并可在所述处理器上运行的程序或指令，所述程序或指令被所述处理器执行时执行如上述任一种所述语音交互方法。The application also provides an air conditioner, including an indoor unit, an outdoor unit, a processor and a memory provided in the indoor unit or outdoor unit; and a program stored on the memory and executable on the processor. Or an instruction, when the program or instruction is executed by the processor, any one of the above voice interaction methods is executed.

本申请还提供一种电子设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，所述处理器执行所述程序时实现如上述任一种所述语音交互方法。This application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements any one of the above voice interaction methods. .

本申请还提供一种非暂态计算机可读存储介质，其上存储有计算机程序，该计算机程序被处理器执行时实现如上述任一种所述语音交互方法。This application also provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements any one of the above voice interaction methods.

本申请还提供一种计算机程序产品，包括计算机程序，所述计算机程序被处理器执行时实现如上述任一种所述语音交互方法。The present application also provides a computer program product, which includes a computer program. When the computer program is executed by a processor, it implements any one of the above voice interaction methods.

本申请提供的语音交互方法、装置及电器，通过对目标用户的目标语音信息进行语音识别，得到目标用户的年龄或者目标语音信息的目标情感特征，进而能根据语音识别结果针对年龄和情感特征做出相应的人性化应答，从而提高了用户体验。The voice interaction method, device and electrical appliance provided by this application can obtain the age of the target user or the target emotional characteristics of the target voice information by performing speech recognition on the target voice information of the target user, and then can make decisions based on the age and emotional characteristics based on the speech recognition results. Provide corresponding humanized responses, thereby improving user experience.

Description of drawings

为了更清楚地说明本申请或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍，显而易见地，下面描述中的附图是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to explain the technical solutions in this application or the prior art more clearly, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings in the following description are of the present invention. For some embodiments of the application, those of ordinary skill in the art can also obtain other drawings based on these drawings without exerting creative efforts.

图1是本申请提供的语音交互方法的流程示意图之一；Figure 1 is one of the flow diagrams of the voice interaction method provided by this application;

图2是本申请提供的语音交互方法的流程示意图之二；Figure 2 is the second schematic flow chart of the voice interaction method provided by this application;

图3是本申请提供的语音交互装置的结构示意图；Figure 3 is a schematic structural diagram of the voice interaction device provided by this application;

图4是本申请提供的电子设备的结构示意图。Figure 4 is a schematic structural diagram of an electronic device provided by this application.

Detailed ways

为使本申请的目的、技术方案和优点更加清楚，下面将结合本申请中的附图，对本申请中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。In order to make the purpose, technical solutions and advantages of this application clearer, the technical solutions in this application will be clearly and completely described below in conjunction with the drawings in this application. Obviously, the described embodiments are part of the embodiments of this application. , not all examples. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of this application.

下面结合图1-图4描述本申请的语音交互方法、装置及电器。The voice interaction method, device and electrical appliance of the present application will be described below with reference to Figures 1-4.

本申请实施例的语音交互方法的执行主体可以是处理器，当然，在一些实施例中，本申请实施例的语音交互方法的执行主体还可以是服务器，此处不作限制。下面以执行主体为处理器为例来对本申请实施例的语音交互方法进行说明。The execution subject of the voice interaction method in the embodiment of the present application may be a processor. Of course, in some embodiments, the execution subject of the voice interaction method in the embodiment of the present application may also be a server, which is not limited here. The voice interaction method in the embodiment of the present application will be described below by taking the execution subject as a processor as an example.

如图1所示，本申请实施例的语音交互方法主要包括步骤110、步骤120和步骤130。As shown in Figure 1, the voice interaction method in the embodiment of the present application mainly includes step 110, step 120 and step 130.

步骤110，接收目标用户的语音输入，确定目标语音信息。Step 110: Receive the target user's voice input and determine the target voice information.

需要说明的是，在目标用户发出语音后，可以通过采集目标用户的语音来实现对语音输入的接收。It should be noted that after the target user sends a voice, the voice input can be received by collecting the target user's voice.

在一些实施例中，可以通过拾音器或者麦克风等装置来实现对目标用户语音的采集，当然，在其他实施例中，也可以通过其他语音采集装置来实现对目标用户的语音采集。In some embodiments, the target user's voice can be collected through devices such as pickups or microphones. Of course, in other embodiments, other voice collection devices can also be used to collect the target user's voice.

在家庭场景下，在通过拾音器来对目标用户的语音进行采集的情况下，拾音器可以与电器设备进行集成。例如，拾音器可以安装于空调的室内机外壳上。在此种情况下，可以通过空调来实现与目标用户的语音交互。In a home scenario, when the target user's voice is collected through a microphone, the microphone can be integrated with electrical equipment. For example, the pickup can be installed on the indoor unit housing of the air conditioner. In this case, voice interaction with the target user can be achieved through air conditioning.

或者，在通过麦克风来对目标用户的语音进行采集的情况下，麦克风可以是目标用户的智能手机上所配置的麦克风。在此种情况下，可以通过手机来实现与目标用户的语音交互。Alternatively, when the target user's voice is collected through a microphone, the microphone may be a microphone configured on the target user's smartphone. In this case, voice interaction with the target user can be achieved through the mobile phone.

当然，在其他一些场景下，拾音器或者麦克风还可以安装于其他位置，此处对语音交互的介质不作限制，此处对拾音器或者麦克风的安装位置以及安装方式也不作限制。Of course, in some other scenarios, the pickup or microphone can also be installed at other locations. There is no restriction on the medium of voice interaction here, and there is no restriction on the installation position and installation method of the pickup or microphone.

在一些实施例中，可以通过特殊的唤醒语音来实现语音接收的唤醒，即在接收到唤醒语音的情况下，开始采集目标用户的语音，并确定目标语音信息。In some embodiments, the wake-up of voice reception can be achieved through a special wake-up voice, that is, when the wake-up voice is received, the voice of the target user is collected and the target voice information is determined.

例如，在采集到的语音为“小优小优，打开空调！”的情况下，可以识别到“小优”为唤醒语音。在此种情况下，识别到唤醒语音后再开始采集目标用户的语音，从而实现对目标用户语音输入的接收。For example, when the collected voice is "Xiaoyou, Xiaoyou, turn on the air conditioner!", "Xiaoyou" can be recognized as the wake-up voice. In this case, the wake-up voice is recognized before starting to collect the target user's voice, thereby realizing the reception of the target user's voice input.

当然，在其他一些实施例中，也可以不间断地对环境声音进行采集，在识别到目标用户的语音后，确定目标语音信息。Of course, in some other embodiments, environmental sounds can also be collected continuously, and after the target user's voice is recognized, the target voice information is determined.

可以理解的是，在采集到目标用户的语音后，可以对采集到的语音进行简单的预处理，例如人声分离以及降噪等处理，得到有效的语音信息流，进而得到有效的目标语音信息。It can be understood that after collecting the target user's voice, simple pre-processing can be performed on the collected voice, such as voice separation and noise reduction, to obtain an effective voice information flow, and then obtain effective target voice information. .

步骤120，对目标语音信息进行语音识别，得到语音识别结果。Step 120: Perform speech recognition on the target speech information to obtain a speech recognition result.

可以理解的是，语音识别结果至少包括目标语音信息中的文本信息。It can be understood that the speech recognition result at least includes text information in the target speech information.

需要说明的是，语音识别结果还包括目标用户的年龄信息和目标语音信息的目标情感特征中的至少一个。It should be noted that the speech recognition result also includes at least one of the target user's age information and the target emotional characteristics of the target voice information.

可以理解的是，通过对目标语音信息进行语音识别，可以确定出目标语音信息对应的目标用户的年龄信息。It can be understood that by performing speech recognition on the target speech information, the age information of the target user corresponding to the target speech information can be determined.

年龄信息可以是具体的年龄值，还可以是一定的年龄大小区间。在本实施方式中，可以根据年龄信息来判断目标用户为儿童、中年人或者老年人等。The age information can be a specific age value or a certain age range. In this embodiment, the target user can be determined to be a child, a middle-aged person, an elderly person, etc. based on the age information.

例如，在确定的年龄信息的值处于区间0-14之间，可以将目标用户确定为儿童，在确定的年龄信息的值大于55，则可以将目标用户确定为老年人。For example, when the value of the determined age information is in the interval 0-14, the target user can be determined to be a child; when the value of the determined age information is greater than 55, the target user can be determined to be an elderly person.

在一些实施例中，可以通过提取目标语音信息的声纹特征来确定目标语音信息对应的目标用户的年龄信息。In some embodiments, the age information of the target user corresponding to the target voice information can be determined by extracting voiceprint features of the target voice information.

在另一些实施例中，可以通过将目标语音信息与预设的用户库中的用户语音进行匹配来确定目标用户的年龄信息。In other embodiments, the age information of the target user may be determined by matching the target voice information with user voices in a preset user library.

当然，在其他实施方式中，还可以通过其他方式来确定目标语音信息对应的目标用户的年龄信息，此处不作限制。Of course, in other implementations, the age information of the target user corresponding to the target voice information can also be determined in other ways, which is not limited here.

可以理解的是，通过对目标语音信息进行语音识别，可以确定出目标语音信息的目标情感特征。It can be understood that by performing speech recognition on the target speech information, the target emotional characteristics of the target speech information can be determined.

目标情感特征用于表示目标对象的目标语音信息所带有的情感色彩，按照不同的分类标准，可以将目标情感特征分成不同的类型。The target emotional features are used to represent the emotional color of the target voice information of the target object. According to different classification standards, the target emotional features can be divided into different types.

在一些实施例中，可以按照积极或者消极因素将目标情感特征划分为正面情感、中性情感以及负面情感。In some embodiments, the target emotional characteristics can be divided into positive emotions, neutral emotions, and negative emotions according to positive or negative factors.

正面情感中可以包括愉快、开心以及放松等情绪，负面情绪中可以包括痛苦、沮丧、伤心以及愤怒等情绪，中性情绪中可以包括严肃、坚定以及疑惑等情绪。Positive emotions can include emotions such as pleasure, joy, and relaxation, negative emotions can include emotions such as pain, frustration, sadness, and anger, and neutral emotions can include emotions such as seriousness, determination, and doubt.

可以理解的是，可以根据正面情感、中性情感以及负面情感所对应的语气信息以及文本信息构建数据库，数据库包含不同的语气信息以及文本信息所分别对应的情感特征。It can be understood that a database can be constructed based on the tone information and text information corresponding to positive emotions, neutral emotions and negative emotions, and the database contains different emotional characteristics corresponding to different tone information and text information.

在本实施方式中，可以根据目标语音信息中的语气信息、文本信息以及数据库来确定目标语音信息的目标情感特征。In this embodiment, the target emotional characteristics of the target voice information can be determined based on the tone information, text information and database in the target voice information.

例如，当目标语音信息的内容为“小优，小优，我好烦！”，该语音消息中含有消极的信息“烦”，因此可以将该目标语音信息的目标情感特征确定为负面情感。For example, when the content of the target voice message is "Xiaoyou, Xiaoyou, I'm so annoyed!", the voice message contains the negative information "annoying", so the target emotional characteristic of the target voice message can be determined as negative emotion.

在一些实施例中，可以通过将目标语音信息输入至语音情感识别模型，通过对目标语音信息进行综合识别与分析，进而得到目标语音信息的目标情感特征。In some embodiments, the target speech information can be input into the speech emotion recognition model, and the target speech information can be comprehensively recognized and analyzed to obtain the target emotion characteristics of the target speech information.

在另一些实施例中，可以通过提取目标语音信息中的相关语料来对情感词进行匹配，进而得到目标语音信息的目标情感特征。In other embodiments, the emotional words can be matched by extracting relevant corpus in the target speech information, thereby obtaining the target emotional characteristics of the target speech information.

当然，在其他实施方式中，还可以通过其他方式来确定目标语音信息的目标情感特征，此处不作限制。Of course, in other implementations, the target emotional characteristics of the target speech information can also be determined through other methods, which are not limited here.

步骤130，输出应答语音，应答语音为基于语音识别结果设置的。Step 130: Output a response voice, which is set based on the voice recognition result.

在得到目标语音信息的识别结果后，可以针对目标语音信息得到应答语音。After obtaining the recognition result of the target voice information, a response voice can be obtained for the target voice information.

需要说明的是，应答语音包括对目标语音信息中的文本信息进行回复的语音播报内容，播报语音可以基于TTS技术来进行生成。It should be noted that the response voice includes voice broadcast content that responds to the text information in the target voice information, and the broadcast voice can be generated based on TTS technology.

由于语音识别结果包括目标用户的年龄信息和目标语音信息的目标情感特征，因此，应答语音在进行播报时可以针对目标用户的年龄和语音交互的目标情感特征做出适应性地设置，进而实现了对不同类型的目标用户进行个性化的语音交互的效果。Since the speech recognition results include the age information of the target user and the target emotional characteristics of the target voice information, the response voice can be adaptively set according to the age of the target user and the target emotional characteristics of the voice interaction when broadcasting, thereby achieving The effect of personalized voice interaction for different types of target users.

根据本申请实施例的语音交互方法，通过对目标用户的目标语音信息进行语音识别，得到目标用户的年龄或者目标语音信息的目标情感特征，进而能根据语音识别结果针对年龄和情感特征做出相应的人性化应答，从而提高了用户体验。According to the voice interaction method of the embodiment of the present application, by performing speech recognition on the target voice information of the target user, the age of the target user or the target emotional characteristics of the target voice information can be obtained, and then the corresponding age and emotional characteristics can be made according to the speech recognition results. humanized responses, thereby improving user experience.

如图2所示，在一些实施例中，步骤120：对目标语音信息进行语音识别，得到语音识别结果，主要包括步骤1201、步骤1202和步骤1203。As shown in Figure 2, in some embodiments, step 120: Perform speech recognition on the target speech information to obtain a speech recognition result, which mainly includes step 1201, step 1202 and step 1203.

步骤1201，获取目标语音信息相关的声纹特征识别结果。Step 1201: Obtain the voiceprint feature recognition results related to the target voice information.

可以理解的是，可以理解的是，对目标语音信息进行声纹特征识别可以包括识别目标语音信息相应的目标用户的身份、年龄或者性别等信息。It can be understood that the voiceprint feature recognition of the target voice information may include identifying the identity, age, gender and other information of the target user corresponding to the target voice information.

步骤1202，基于声纹特征识别结果，确定目标用户的年龄信息。Step 1202: Determine the age information of the target user based on the voiceprint feature recognition result.

在一些实施例中，可以根据识别出的目标用户的身份信息来确定出目标用户的年龄信息。In some embodiments, the age information of the target user may be determined based on the identified identity information of the target user.

在另一些实施例中，可以直接通过声纹特征识别结果确定出目标用户的年龄信息。In other embodiments, the age information of the target user can be determined directly through the voiceprint feature recognition results.

在此种情况下，步骤1201：获取目标语音信息相关的声纹特征识别结果具体可以包括基于目标语音信息，确定目标声纹特征。In this case, step 1201: Obtaining the voiceprint feature recognition result related to the target voice information may specifically include determining the target voiceprint feature based on the target voice information.

可以理解的是，在对目标语音信息进行声纹特征识别之前，需要对目标语音信息进行声纹特征的提取。It can be understood that before performing voiceprint feature recognition on the target voice information, it is necessary to extract the voiceprint features of the target voice information.

在一些实施例中，可以通过预先训练好的声纹特征提取神经网络模型来对目标语音信息提取声纹特征。In some embodiments, voiceprint features can be extracted from the target voice information through a pre-trained voiceprint feature extraction neural network model.

可以理解的是，预先训练的神经网络模型可以以样本语音信息为样本，以样本语音信息中发声者的声纹特征为标签训练而成。It can be understood that the pre-trained neural network model can be trained using the sample voice information as the sample and the voiceprint characteristics of the speaker in the sample voice information as the label.

具体训练过程可以为，将样本语音信息输入至声纹特征提取神经网络模型中，输出识别出的发生者的声纹特征，将识别出的声纹特征与标签间的相似度作为损失，根据损失调整声纹特征提取神经网络模型中需要更新的量，直至损失小于预设阈值或者训练测试的数量达到预设数目。The specific training process can be as follows: input the sample voice information into the voiceprint feature extraction neural network model, output the identified voiceprint feature of the speaker, and use the similarity between the identified voiceprint feature and the label as the loss. According to the loss Adjust the amount of updates needed in the voiceprint feature extraction neural network model until the loss is less than the preset threshold or the number of training tests reaches the preset number.

在提取出目标语音信息中的目标声纹特征后，在声纹特征库中查找与目标声纹特征相匹配的声纹特征样本。After extracting the target voiceprint feature in the target voice information, a voiceprint feature sample matching the target voiceprint feature is searched for in the voiceprint feature database.

可以理解的是，在声纹特征库中预先存储有声纹特征样本集，声纹特征样本集包括各个声纹特征样本以及各个声纹特征样本所对应的年龄信息。It can be understood that a voiceprint feature sample set is pre-stored in the voiceprint feature database, and the voiceprint feature sample set includes each voiceprint feature sample and the age information corresponding to each voiceprint feature sample.

在进行查找的过程中，可以计算目标声纹特征与各声纹特征样本之间的匹配度。During the search process, the matching degree between the target voiceprint feature and each voiceprint feature sample can be calculated.

在进行匹配度的计算时，可以通过线性判别模型对目标声纹特征与各声纹特征样本之间的匹配度进行计算，在匹配度大于预设值的情况下，将该声纹特征样本所对应的年龄信息作为目标声纹特征所对应的年龄信息。When calculating the matching degree, the matching degree between the target voiceprint feature and each voiceprint feature sample can be calculated through a linear discriminant model. When the matching degree is greater than the preset value, the voiceprint feature sample will be The corresponding age information is used as the age information corresponding to the target voiceprint feature.

当然，在一些实施例中，当目标声纹特征与多个声纹特征样本的匹配度均大于预设值时，可以将满足要求的多个声纹特征样本中匹配度最大的声纹特征样本对应的年龄信息作为目标声纹特征所对应的年龄信息。Of course, in some embodiments, when the matching degree between the target voiceprint feature and multiple voiceprint feature samples is greater than the preset value, the voiceprint feature sample with the largest matching degree among the multiple voiceprint feature samples that meets the requirements can be The corresponding age information is used as the age information corresponding to the target voiceprint feature.

换言之，在查找到与目标声纹特征相匹配的声纹特征样本的情况下，可以确定声纹特征样本对应的年龄信息，进而再将声纹特征样本对应的年龄信息作为目标声纹特征所对应的年龄信息。In other words, when a voiceprint feature sample matching the target voiceprint feature is found, the age information corresponding to the voiceprint feature sample can be determined, and then the age information corresponding to the voiceprint feature sample is used as the target voiceprint feature. age information.

根据本申请实施例的语音交互方法，通过在声纹特征库中查找目标声纹特征，能快速而又准确地确认出目标声纹特征所对应的年龄信息，提高了语音交互应答内容的准确性，进而提升了用户的体验。According to the voice interaction method of the embodiment of the present application, by searching the target voiceprint feature in the voiceprint feature database, the age information corresponding to the target voiceprint feature can be quickly and accurately confirmed, thereby improving the accuracy of the voice interaction response content. , thereby improving the user experience.

在一些实施例中，在声纹特征库中查找与目标声纹特征相匹配的声纹特征样本之后，本申请实施例的语音交互方法还包括：In some embodiments, after searching for voiceprint feature samples that match the target voiceprint feature in the voiceprint feature database, the voice interaction method in the embodiment of the present application further includes:

在未查找到与目标声纹特征相匹配的声纹特征样本的情况下，将目标声纹特征输入至年龄预测模型，得到年龄预测模型输出的年龄信息。When no voiceprint feature sample matching the target voiceprint feature is found, the target voiceprint feature is input to the age prediction model to obtain the age information output by the age prediction model.

可以理解的是，年龄预测模型为以年龄样本声纹特征为样本，以年龄样本声纹特征中发声者的年龄信息为标签训练得到的。It can be understood that the age prediction model is trained using age sample voiceprint features as samples and the age information of the speaker in the age sample voiceprint features as labels.

可以理解的是，年龄预测模型可以是卷积神经网络模型、隐马尔科夫模型或者高斯混合模型等，此处对年龄预测模型的类型不作限制。It can be understood that the age prediction model can be a convolutional neural network model, a hidden Markov model or a Gaussian mixture model, etc. There is no restriction on the type of the age prediction model here.

在本实施方式中，在声纹特征库中查找不到与目标声纹特征相匹配的声纹特征样本之后，通过年龄预测模型能得到目标语音信息对应的目标信息的年龄预测值，能准确地确认出目标声纹特征所对应的年龄信息，提高了语音交互应答内容的准确性，进而提升了用户的体验。In this embodiment, after no voiceprint feature sample matching the target voiceprint feature is found in the voiceprint feature database, the age prediction value of the target information corresponding to the target voice information can be obtained through the age prediction model, which can accurately Confirming the age information corresponding to the target voiceprint characteristics improves the accuracy of the voice interactive response content, thereby improving the user experience.

步骤1203，基于年龄信息，设置应答语音的第一情感特征。Step 1203: Based on the age information, set the first emotional feature of the response voice.

可以理解的是，针对目标用户的不同年龄信息，可以设置应答语音的第一情感特征以实现更加人性化的语音交互。It is understandable that according to the different age information of the target users, the first emotional feature of the response voice can be set to achieve a more humanized voice interaction.

在一些实施例中，在确定的年龄信息的值处于区间0-14之间，可以将目标用户确定为儿童；在确定的年龄信息的值大于55，则可以将目标用户确定为老年人；在确定的年龄信息的值处于区间14-55之间，可以将目标用户确定为中年人。In some embodiments, when the value of the determined age information is in the interval 0-14, the target user can be determined to be a child; when the value of the determined age information is greater than 55, the target user can be determined to be an elderly person; The value of the determined age information is between 14 and 55, and the target user can be determined to be a middle-aged person.

根据确定的年龄信息，在确定出目标语音信息对应的目标用户为儿童的情况下，应答语音的第一情感特征可以为正面情感特征。正面情感中可以包括愉快、开心、活泼以及放松等情绪。According to the determined age information, when it is determined that the target user corresponding to the target voice information is a child, the first emotional feature of the response voice may be a positive emotional feature. Positive emotions can include emotions such as joy, joy, liveliness, and relaxation.

在此种情况下，应答语音可以设置为更加愉快、活泼以及放松的语气，应答语音的内容中可以增添更多愉快、轻松的词汇，进而能够更加贴合儿童的使用需求。In this case, the response voice can be set to a more pleasant, lively and relaxing tone, and more pleasant and relaxing words can be added to the content of the response voice, so as to better meet the needs of children.

根据确定的年龄信息，在确定出目标语音信息对应的目标用户为老年人的情况下，应答语音的第一情感特征可以为中性情感特征。中性情感中可以包括严肃以及坚定等情绪。According to the determined age information, when it is determined that the target user corresponding to the target voice information is an elderly person, the first emotional feature of the response voice may be a neutral emotional feature. Neutral emotions can include emotions such as seriousness and firmness.

在此种情况下，应答语音可以设置为更加严肃以及坚定的语气，应答语音的内容中可以增添更多正式以及肯定的词汇，进而能够更加贴合老年人的使用需求。In this case, the response voice can be set to a more serious and firm tone, and more formal and affirmative vocabulary can be added to the content of the response voice, so as to better meet the needs of the elderly.

根据确定的年龄信息，在确定出目标语音信息对应的目标用户为中年人的情况下，应答语音的第一情感特征可以为中性情感特征或者正面情感特征。According to the determined age information, when it is determined that the target user corresponding to the target voice information is a middle-aged person, the first emotional feature of the response voice may be a neutral emotional feature or a positive emotional feature.

在此种情况下，应答语音可以设置为更加严肃或者愉快的语气，应答语音的内容中可以增添更多正式以及愉快的词汇，进而能够更加贴合中年人的使用需求。In this case, the response voice can be set to a more serious or pleasant tone, and more formal and pleasant vocabulary can be added to the content of the response voice, so as to better meet the needs of middle-aged people.

当然，在其他实施例中，还可以对不同年龄阶段的目标用户设置其他类型的情感特征，不同年龄阶段的目标用户可以对应其他不同类型的情感特征，可以根据实际情况进行设置，此处对不同年龄阶段的目标用户对应的情感特征的类型不作限制。Of course, in other embodiments, other types of emotional features can also be set for target users of different ages. Target users of different ages can correspond to other different types of emotional features, which can be set according to the actual situation. Here, different types of emotional features can be set for target users of different ages. There are no restrictions on the types of emotional characteristics corresponding to target users in age groups.

根据本申请实施例的语音交互方法，通过确定出目标语音信息对应的目标用户的年龄信息，再根据年龄信息来设置输出的应答语音的情感特征，能够针对不同年龄的目标用户对应答语音在情感上进行设置，使得应答语音更加人性化，满足了不同用户的不同个性化使用需求，提升了用户体验。According to the voice interaction method of the embodiment of the present application, by determining the age information of the target user corresponding to the target voice information, and then setting the emotional characteristics of the output response voice based on the age information, the emotional characteristics of the response voice can be adjusted for target users of different ages. Settings on the phone make the voice response more humane, meet the different personalized needs of different users, and improve the user experience.

在一些实施例中，在步骤1201：在获取目标语音信息相关的声纹特征识别结果之后，本申请实施例的语音交互方法还包括：In some embodiments, in step 1201: after obtaining the voiceprint feature recognition results related to the target voice information, the voice interaction method in the embodiment of the present application also includes:

在基于声纹特征识别结果，未确定出目标用户的年龄信息的情况下，对目标语音信息进行语音情感识别，得到目标语音信息的目标情感特征。When the age information of the target user is not determined based on the voiceprint feature recognition results, speech emotion recognition is performed on the target voice information to obtain the target emotion characteristics of the target voice information.

可以理解的是，在对目标声纹特征识别时，若无法从声纹特征识别结果中确定出目标用户的年龄信息，可以通过对目标语音信息进行语音情感识别，得到目标语音信息的目标情感特征。It is understandable that when identifying target voiceprint features, if the age information of the target user cannot be determined from the voiceprint feature recognition results, the target emotional features of the target voice information can be obtained by performing voice emotion recognition on the target voice information. .

在一些实施例中，在声纹特征库中查找不到与目标声纹特征的匹配度满足预设条件的声纹特征样本，且年龄预测模型无法输出目标声纹特征对应的年龄信息的情况下，可以通过对目标语音信息进行语音情感识别。In some embodiments, a voiceprint feature sample whose matching degree with the target voiceprint feature meets the preset conditions cannot be found in the voiceprint feature database, and the age prediction model cannot output age information corresponding to the target voiceprint feature. , speech emotion recognition can be performed on the target speech information.

在一些实施例中，可以通过语音情感识别神经网络模型来对目标语音信息进行语音情感识别。In some embodiments, speech emotion recognition can be performed on the target speech information through a speech emotion recognition neural network model.

可以理解的是，语音情感识别神经网络模型为以带有情感特征的语音信息为样本，以带有情感特征的语音信息的情感特征为标签训练得到的。It can be understood that the speech emotion recognition neural network model is trained by taking speech information with emotional characteristics as samples and using the emotional characteristics of the speech information with emotional characteristics as labels.

在另一些实施例中，对目标语音信息进行语音情感识别，得到目标语音信息的目标情感特征具体包括基于目标语音信息，生成目标文本信息。In other embodiments, performing speech emotion recognition on the target speech information and obtaining the target emotion characteristics of the target speech information specifically includes generating target text information based on the target speech information.

可以理解的是，在进行语音情感识别的过程中，先将目标语音信息转化为文本信息，得到目标文本信息。It is understandable that in the process of speech emotion recognition, the target speech information is first converted into text information to obtain the target text information.

在得到目标文本信息后，提取目标文本信息中的目标语料。可以理解的是，目标语料为目标文本信息中带有情感倾向的字或者词。After obtaining the target text information, extract the target corpus in the target text information. It can be understood that the target corpus is the words or words with emotional tendencies in the target text information.

例如，当目标语音信息的目标文本信息为“小优，小优，我好烦！”，目标语料可以是“烦”和“好”。“烦”为情绪化词语，可以表示目标用户的负面情绪，“好”为程度副词，可以用于表示目标用户的负面情绪的程度很大。For example, when the target text information of the target speech information is "Xiaoyou, Xiaoyou, I'm so annoyed!", the target corpus can be "annoyed" and "good". "Annoying" is an emotional word, which can express the target user's negative emotions, and "good" is an adverb of degree, which can be used to express the degree of the target user's negative emotions.

在情感词典库中查找到目标语料的情况下，确定目标语料对应的语料情感特征。When the target corpus is found in the emotional dictionary, the emotional characteristics of the corpus corresponding to the target corpus are determined.

可以理解的是，情感词典库包括多个语料以及与语料对应的情感特征。It can be understood that the emotional dictionary includes multiple corpora and emotional features corresponding to the corpora.

在确定出目标语料对应的语料情感特征后，可以基于语料情感特征，确定目标情感特征。After determining the corpus emotional characteristics corresponding to the target corpus, the target emotional characteristics can be determined based on the corpus emotional characteristics.

在本实施方式中，可以针对目标语料在情感词典库查找对应的情感特征。在目标文本信息中只具有单个目标语料的情况下，可以直接将在情感词典库中查询到的该目标语料的情感特征作为目标情感特征。In this implementation, the corresponding emotional features can be searched in the emotional dictionary for the target corpus. When there is only a single target corpus in the target text information, the emotional features of the target corpus queried in the emotional dictionary can be directly used as the target emotional features.

在目标文本信息中具有多个目标语料的情况下，则对多个目标语料所对应的情感特征进行相应的权重处理。When there are multiple target corpora in the target text information, corresponding weight processing is performed on the emotional features corresponding to the multiple target corpora.

例如可以针对不同的目标语料所对应的不同情感特征设置不同的权重分值。在一些实施例中，针对不同的语料，可以将不同的语料分为正面词、负面词、否定词以及程度副词等。For example, different weight scores can be set for different emotional features corresponding to different target corpus. In some embodiments, for different corpus, different corpus can be divided into positive words, negative words, negative words, adverbs of degree, etc.

在此种情况下，可以将正面词权重分值做加法，将负面词权重分值做减法，将否定词权重取相反数，将程度副词权重和它修饰的词语权重相乘，得出最终的权重分值。In this case, you can add the weight scores of positive words, subtract the weight scores of negative words, take the inverse of the weight of negative words, and multiply the weight of the adverb of degree and the weight of the words it modifies to get the final result. Weight score.

根据最终的权重分值，可以确定情感特征为正面情感、中性情感以及负面情感中的某一类型。According to the final weight score, the emotional characteristics can be determined to be a certain type of positive emotion, neutral emotion, or negative emotion.

根据本申请实施例的语音交互方法，通过得到目标语音信息的目标文本信息，再对目标文本信息中的目标语料进行分析，得到目标语料的语料情感特征，进而确定出目标语音信息的目标情感特征，能够较为准确地确定出目标语音信息的目标情感特征，进而方便针对具有不同情感特征的目标用户来设置相应的应答语音。According to the voice interaction method of the embodiment of the present application, by obtaining the target text information of the target speech information, and then analyzing the target corpus in the target text information, the corpus emotional characteristics of the target corpus are obtained, and then the target emotional characteristics of the target speech information are determined. , can more accurately determine the target emotional characteristics of the target voice information, and then facilitate the setting of corresponding response voices for target users with different emotional characteristics.

可以理解的是，在确定出目标语音信息的目标情感特征后，可以基于目标情感特征，设置应答语音的第二情感特征。It can be understood that after the target emotional characteristics of the target voice information are determined, the second emotional characteristics of the response voice can be set based on the target emotional characteristics.

在一些实施例中，在确定出目标情感特征为正面情感特征的情况下，应答语音的第二情感特征可以为正面情感特征。正面情感中可以包括愉快、开心、活泼以及放松等情绪。In some embodiments, when it is determined that the target emotional feature is a positive emotional feature, the second emotional feature of the response voice may be a positive emotional feature. Positive emotions can include emotions such as joy, joy, liveliness, and relaxation.

例如，目标情感特征包括开心以及放松的情绪，应答语音可以设置为更加愉快、活泼以及放松的语气，应答语音的内容中可以增添更多愉快、轻松的词汇，进而能够更加贴合目标用户当前的心情。For example, the target emotional characteristics include happy and relaxed emotions. The response voice can be set to a more pleasant, lively and relaxed tone. More pleasant and relaxing words can be added to the content of the response voice, which can better fit the current needs of the target users. Feeling.

在一些实施例中，在确定出目标情感特征为负面情感特征的情况下，应答语音的第二情感特征可以为中性情感特征。负面情感中可以包括愤怒以及没有耐心等情绪。In some embodiments, when it is determined that the target emotional feature is a negative emotional feature, the second emotional feature of the response voice may be a neutral emotional feature. Negative emotions can include emotions such as anger and impatience.

例如，目标情感特征包括愤怒的情绪，应答语音可以设置为更加严肃和肯定的语气，应答语音的内容中可以增添更多正式、肯定的词汇，进而能够更加贴合目标用户当前的心情。For example, if the target emotional characteristics include angry emotions, the response voice can be set to a more serious and affirmative tone, and more formal and affirmative words can be added to the content of the response voice to better fit the current mood of the target user.

当然，在其他实施例中，还可以对不同目标情感特征设置其他类型的第二情感特征，第二情感特征可以根据实际情况进行设置，此处对第二情感特征的类型不作限制。Of course, in other embodiments, other types of second emotional features can also be set for different target emotional features. The second emotional features can be set according to the actual situation. The type of the second emotional feature is not limited here.

根据本申请实施例的语音交互方法，通过确定出目标语音信息对应的目标情感特征，再根据目标情感特征设置不同的第二情感特征来设置应答语音，能够针对不同目标用户的情感特征在情感上进行设置应答语音，使得应答语音更加人性化，更加贴合不同用户的不同个性化使用需求，提升了用户体验。According to the voice interaction method of the embodiment of the present application, by determining the target emotional characteristics corresponding to the target voice information, and then setting different second emotional characteristics according to the target emotional characteristics to set the response voice, it is possible to emotionally respond to the emotional characteristics of different target users. Setting the response voice makes the response voice more humane, more suitable for the different personalized needs of different users, and improves the user experience.

如图3所示，下面对本申请提供的与交互装置进行描述，下文描述的语音交互装置与上文描述的语音交互方法可相互对应参照。As shown in FIG. 3 , the interaction device provided by the present application is described below. The voice interaction device described below and the voice interaction method described above can be mutually referenced.

本申请实施例的语音交互装置包括接收模块310、处理模块320以及输出模块330。The voice interaction device in this embodiment of the present application includes a receiving module 310, a processing module 320, and an output module 330.

接收模块310用于接收目标用户的语音输入，确定目标语音信息；The receiving module 310 is used to receive the voice input of the target user and determine the target voice information;

处理模块320用于对目标语音信息进行语音识别，得到语音识别结果；语音识别结果包括目标用户的年龄信息和目标语音信息的目标情感特征中的至少一个；The processing module 320 is configured to perform speech recognition on the target speech information and obtain a speech recognition result; the speech recognition result includes at least one of the age information of the target user and the target emotional characteristics of the target speech information;

输出模块330用于输出应答语音，应答语音为基于语音识别结果设置的。The output module 330 is used to output the response voice, and the response voice is set based on the voice recognition result.

根据本申请实施例提供的语音交互装置，通过对目标用户的目标语音信息进行语音识别，得到目标用户的年龄或者目标语音信息的目标情感特征，进而能根据识别语音识别结果针对年龄和情感特征做出相应的人性化应答，从而提高了用户体验。According to the voice interaction device provided by the embodiment of the present application, by performing speech recognition on the target speech information of the target user, the age of the target user or the target emotional characteristics of the target speech information can be obtained, and then the target user's age and emotional characteristics can be determined based on the speech recognition results. Provide corresponding humanized responses, thereby improving user experience.

在一些实施例中，处理模块320还用于获取目标语音信息相关的声纹特征识别结果；基于声纹特征识别结果，确定目标用户的年龄信息；基于年龄信息，设置应答语音的第一情感特征。In some embodiments, the processing module 320 is also used to obtain the voiceprint feature recognition results related to the target voice information; determine the age information of the target user based on the voiceprint feature recognition results; and set the first emotional feature of the response voice based on the age information. .

在一些实施例中，处理模块320还用于基于目标语音信息，确定目标声纹特征；在声纹特征库中查找与目标声纹特征相匹配的声纹特征样本，在声纹特征库中预先存储有声纹特征样本集；在查找到与目标声纹特征相匹配的声纹特征样本的情况下，确定声纹特征样本对应的年龄信息。In some embodiments, the processing module 320 is also used to determine the target voiceprint features based on the target voice information; search for voiceprint feature samples that match the target voiceprint features in the voiceprint feature database, and pre-set the target voiceprint features in the voiceprint feature database. A voiceprint feature sample set is stored; when a voiceprint feature sample matching the target voiceprint feature is found, the age information corresponding to the voiceprint feature sample is determined.

在一些实施例中，处理模块320还用于在未查找到与目标声纹特征相匹配的声纹特征样本的情况下，将目标声纹特征输入至年龄预测模型，得到年龄预测模型输出的年龄信息，年龄预测模型为以年龄样本声纹特征为样本，以年龄样本声纹特征中发声者的年龄信息为标签训练得到的。In some embodiments, the processing module 320 is also configured to input the target voiceprint feature to the age prediction model when no voiceprint feature sample matching the target voiceprint feature is found, and obtain the age output by the age prediction model. Information, the age prediction model is trained using age sample voiceprint features as samples and the age information of the speaker in the age sample voiceprint features as labels.

在一些实施例中，处理模块320还用于在基于声纹特征识别结果，未确定出目标用户的年龄信息的情况下，对目标语音信息进行语音情感识别，得到目标语音信息的目标情感特征；基于目标情感特征，设置应答语音的第二情感特征。In some embodiments, the processing module 320 is also used to perform speech emotion recognition on the target voice information when the age information of the target user is not determined based on the voiceprint feature recognition result, and obtain the target emotion characteristics of the target voice information; Based on the target emotional characteristics, set the second emotional characteristics of the response voice.

在一些实施例中，处理模块320还用于基于目标语音信息，生成目标文本信息；提取目标文本信息中的目标语料；在情感词典库中查找到目标语料的情况下，确定目标语料对应的语料情感特征；基于语料情感特征，确定目标情感特征；情感词典库包括多个语料以及与语料对应的情感特征。In some embodiments, the processing module 320 is also used to generate target text information based on the target speech information; extract the target corpus in the target text information; and determine the corpus corresponding to the target corpus when the target corpus is found in the emotional dictionary. Emotional features; based on the emotional features of the corpus, determine the target emotional features; the emotional dictionary includes multiple corpora and the emotional features corresponding to the corpora.

本申请实施例还提供一种电器，电器包括上述的语音交互装置，能与用户进行语音交互。电器可以是空调、电视机、洗衣机、冰箱以及净水器等，此处对电器的类型不作限制。An embodiment of the present application also provides an electrical appliance. The electrical appliance includes the above-mentioned voice interaction device and can perform voice interaction with the user. Electrical appliances can be air conditioners, televisions, washing machines, refrigerators, water purifiers, etc. There is no restriction on the type of electrical appliances.

本申请实施例还提供一种空调，包括室内机、室外机和设置在室内机或室外机中的处理器和处理器；还包括存储在存储器上并可在处理器上运行的程序或指令，程序或指令被处理器执行时执行如上述的语音交互方法。Embodiments of the present application also provide an air conditioner, which includes an indoor unit, an outdoor unit, a processor and a processor provided in the indoor unit or the outdoor unit; and also includes a program or instructions stored in a memory and executable on the processor, When the program or instruction is executed by the processor, the voice interaction method as described above is performed.

图4示例了一种电子设备的实体结构示意图，如图4所示，该电子设备可以包括：处理器(processor)410、通信接口(Communications Interface)420、存储器(memory)430和通信总线440，其中，处理器410，通信接口420，存储器430通过通信总线440完成相互间的通信。处理器410可以调用存储器430中的逻辑指令，以执行语音交互方法，该方法包括：接收目标用户的语音输入，确定目标语音信息；对目标语音信息进行语音识别，得到语音识别结果；语音识别结果包括目标用户的年龄信息和目标语音信息的目标情感特征中的至少一个；输出应答语音，应答语音为基于语音识别结果设置的。Figure 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 4, the electronic device may include: a processor (processor) 410, a communications interface (Communications Interface) 420, a memory (memory) 430 and a communication bus 440. Among them, the processor 410, the communication interface 420, and the memory 430 complete communication with each other through the communication bus 440. The processor 410 can call logical instructions in the memory 430 to execute the voice interaction method. The method includes: receiving voice input from the target user and determining the target voice information; performing voice recognition on the target voice information to obtain a voice recognition result; the voice recognition result At least one of the target emotional characteristics including age information of the target user and target voice information; outputting a response voice, where the response voice is set based on the voice recognition result.

此外，上述的存储器430中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(ROM，Read-Only Memory)、随机存取存储器(RAM，Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。In addition, the above-mentioned logical instructions in the memory 430 can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product. Based on this understanding, the technical solution of the present application is essentially or the part that contributes to the existing technology or the part of the technical solution can be embodied in the form of a software product. The computer software product is stored in a storage medium, including Several instructions are used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in various embodiments of this application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program code. .

另一方面，本申请还提供一种计算机程序产品，所述计算机程序产品包括计算机程序，计算机程序可存储在非暂态计算机可读存储介质上，所述计算机程序被处理器执行时，计算机能够执行上述各方法所提供的语音交互方法，该方法包括：接收目标用户的语音输入，确定目标语音信息；对目标语音信息进行语音识别，得到语音识别结果；语音识别结果包括目标用户的年龄信息和目标语音信息的目标情感特征中的至少一个；输出应答语音，应答语音为基于语音识别结果设置的。On the other hand, the present application also provides a computer program product. The computer program product includes a computer program. The computer program can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can Execute the voice interaction method provided by each of the above methods. The method includes: receiving the voice input of the target user and determining the target voice information; performing voice recognition on the target voice information to obtain a voice recognition result; the voice recognition result includes the age information of the target user and At least one of the target emotional characteristics of the target voice information; output a response voice, and the response voice is set based on the speech recognition result.

又一方面，本申请还提供一种非暂态计算机可读存储介质，其上存储有计算机程序，该计算机程序被处理器执行时实现以执行上述各方法提供的语音交互方法，该方法包括：接收目标用户的语音输入，确定目标语音信息；对目标语音信息进行语音识别，得到语音识别结果；语音识别结果包括目标用户的年龄信息和目标语音信息的目标情感特征中的至少一个；输出应答语音，应答语音为基于语音识别结果设置的。On the other hand, the present application also provides a non-transitory computer-readable storage medium on which a computer program is stored. The computer program is implemented when executed by a processor to execute the voice interaction method provided by each of the above methods. The method includes: Receive the target user's voice input and determine the target voice information; perform voice recognition on the target voice information to obtain a voice recognition result; the voice recognition result includes at least one of the target user's age information and the target emotional characteristics of the target voice information; and output a response voice. , the response voice is set based on the speech recognition results.

以上所描述的装置实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下，即可以理解并实施。The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated. The components shown as units may or may not be physical units, that is, they may be located in One location, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement the method without any creative effort.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件。基于这样的理解，上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在计算机可读存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and of course, it can also be implemented by hardware. Based on this understanding, the part of the above technical solution that essentially contributes to the existing technology can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., including a number of instructions to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in various embodiments or certain parts of the embodiments.

最后应说明的是：以上实施例仅用以说明本申请的技术方案，而非对其限制；尽管参照前述实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present application, but not to limit it; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that it can still be Modifications are made to the technical solutions described in the foregoing embodiments, or equivalent substitutions are made to some of the technical features; however, these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions in the embodiments of the present application.

Claims

A voice interaction method including:

Receive the voice input from the target user and determine the target voice information;

Perform speech recognition on the target speech information to obtain a speech recognition result; the speech recognition result includes at least one of the age information of the target user and the target emotional characteristics of the target speech information;

A response voice is output, and the response voice is set based on the voice recognition result.

The voice interaction method according to claim 1, wherein said performing voice recognition on the target voice information to obtain a voice recognition result includes:

Obtain the voiceprint feature recognition results related to the target voice information;

Based on the voiceprint feature recognition result, determine the age information of the target user;

Based on the age information, a first emotional characteristic of the response voice is set.

The voice interaction method according to claim 2, wherein said obtaining the voiceprint feature recognition results related to the target voice information includes:

Based on the target voice information, determine the target voiceprint characteristics;

Search for voiceprint feature samples that match the target voiceprint feature in a voiceprint feature database, where a voiceprint feature sample set is pre-stored in the voiceprint feature database;

When a voiceprint feature sample matching the target voiceprint feature is found, age information corresponding to the voiceprint feature sample is determined.

The voice interaction method according to claim 3, wherein after searching for voiceprint feature samples matching the target voiceprint feature in the voiceprint feature database, the method further includes:

When no voiceprint feature sample matching the target voiceprint feature is found, the target voiceprint feature is input to the age prediction model to obtain the age information output by the age prediction model, and the age prediction The model is trained using age sample voiceprint features as samples and the age information of the speaker in the age sample voiceprint features as labels.

The voice interaction method according to any one of claims 2 to 4, wherein after obtaining the voiceprint feature recognition result related to the target voice information, the method further includes:

If the age information of the target user is not determined based on the voiceprint feature recognition result, perform voice emotion recognition on the target voice information to obtain the target emotion feature of the target voice information;

Based on the target emotional feature, a second emotional feature of the response voice is set.

The voice interaction method according to claim 5, wherein the performing voice emotion recognition on the target voice information to obtain the target emotion characteristics of the target voice information includes:

Based on the target voice information, generate target text information;

Extract target corpus from the target text information;

When the target corpus is found in the emotional dictionary, determine the emotional characteristics of the corpus corresponding to the target corpus;

Based on the emotional characteristics of the corpus, determine the target emotional characteristics;

The emotion dictionary includes a plurality of corpus and emotional features corresponding to the corpus.

A voice interaction device including:

The receiving module is used to receive the voice input of the target user and determine the target voice information;

A processing module configured to perform speech recognition on the target speech information and obtain a speech recognition result; the speech recognition result includes at least one of the age information of the target user and the target emotional characteristics of the target speech information;

An output module is configured to output a response voice, where the response voice is set based on the voice recognition result.

An electrical appliance including the voice interaction device according to claim 7.

An air conditioner includes an indoor unit, an outdoor unit, a processor and a memory provided in the indoor unit or outdoor unit; and also includes a program or instructions stored on the memory and executable on the processor, wherein , when the program or instruction is executed by the processor, the voice interaction method according to any one of claims 1 to 6 is executed.

An electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when the processor executes the program, any one of claims 1 to 6 is implemented. The voice interaction method described in the item.