
CN114093356B - Voice interaction method, voice interaction device, electronic equipment and storage medium - Google Patents

Voice interaction method, voice interaction device, electronic equipment and storage medium

Info

Publication number
CN114093356B
CN114093356B
Authority
CN
China
Prior art keywords
information
response
voice
voice response
target voice
Prior art date
Legal status
Active
Application number
CN202111297097.3A
Other languages
Chinese (zh)
Other versions
CN114093356A (en)
Inventor
周毅
Current Assignee
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Original Assignee
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority to CN202111297097.3A
Publication of CN114093356A
Application granted
Publication of CN114093356B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the analysis technique
    • G10L 2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
    • G10L 2015/228: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of application context

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract


This disclosure provides a voice interaction method, apparatus, electronic device, and storage medium, relating to the field of computer technology and, in particular, to the fields of autonomous driving, intelligent cockpits, and connected vehicles. A specific implementation scheme comprises: in response to receiving interactive voice from a user, performing semantic analysis on the interactive voice to obtain user intention information; determining a target voice response mode based on the user intention information; and outputting voice response information according to the target voice response mode.

Description

Voice interaction method, voice interaction device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology and the field of internet technology, in particular to the field of intelligent voice technology, and specifically to a voice interaction method, a voice interaction apparatus, an electronic device, a storage medium, and a program product.
Background
With the development of technology, voice interaction is widely applied in intelligent voice devices such as intelligent robots, smart speakers, intelligent vehicle-mounted devices, and smart appliances. Based on the interactive voice uttered by a user, an intelligent voice device may perform corresponding operations, such as answering the questions in the user's interactive voice or starting and stopping the device.
Disclosure of Invention
The present disclosure provides a method for voice interaction, an apparatus for voice interaction, an electronic device, a storage medium, and a program product.
According to one aspect of the disclosure, a voice interaction method is provided, comprising: in response to receiving interactive voice from a user, performing semantic analysis on the interactive voice to obtain user intention information; determining a target voice response mode based on the user intention information; and outputting voice response information according to the target voice response mode.
According to another aspect of the disclosure, a voice interaction apparatus is provided, comprising an intention determining module, a response mode determining module, and a response output module. The intention determining module is configured to, in response to receiving interactive voice from a user, perform semantic analysis on the interactive voice to obtain user intention information; the response mode determining module is configured to determine a target voice response mode based on the user intention information; and the response output module is configured to output voice response information according to the target voice response mode.
According to another aspect of the present disclosure, there is provided an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method as described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture to which voice interaction methods and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a voice interaction method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates an application scenario diagram of a voice interaction method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates an application scenario diagram of a user in voice interaction according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a block diagram of a voice interaction device according to an embodiment of the disclosure;
FIG. 6 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides a voice interaction method, a voice interaction apparatus, an electronic device, a storage medium, and a program product.
According to an embodiment of the present disclosure, the voice interaction method includes: in response to receiving interactive voice from a user, performing semantic analysis on the interactive voice to obtain user intention information; determining a target voice response mode based on the user intention information; and outputting voice response information according to the target voice response mode.
According to an embodiment of the present disclosure, by performing semantic analysis on the user's interactive voice to determine the user intention information, and by determining the target voice response mode based on that information, the target voice response mode can be chosen with the user's intention fully considered. Outputting the voice response information according to the target voice response mode can thus meet the user's personalized needs and improve the fun and experience of use.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user's personal information comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
Fig. 1 schematically illustrates an exemplary system architecture to which voice interaction methods and apparatus may be applied, according to embodiments of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the voice interaction method and apparatus may be applied may include a terminal device, but the terminal device may implement the voice interaction method and apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client and/or social platform software, etc. (as examples only).
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the voice interaction method provided by the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the voice interaction apparatus provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
Alternatively, the voice interaction method provided by the embodiments of the present disclosure may be performed by the server 105, and accordingly the voice interaction apparatus may be disposed in the server 105. The voice interaction method may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105; accordingly, the voice interaction apparatus may also be disposed in such a server or server cluster.
For example, when the user utters interactive voice, the terminal device 101, 102, or 103 may acquire the interactive voice from the user and send it to the server 105. The server 105 then performs semantic analysis on the interactive voice to obtain user intention information, determines a target voice response mode based on the user intention information, and outputs voice response information according to the target voice response mode. Alternatively, a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105 performs the semantic analysis on the interactive voice and finally outputs the voice response information according to the target voice response mode.
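As an illustration of this client-server split, the following is a minimal sketch in Python, assuming a hypothetical HTTP endpoint: the terminal device captures the interactive voice and uploads it, and the server returns the analyzed response. The endpoint path and payload fields are invented for illustration and are not part of the disclosure.

```python
import json
from urllib import request

SERVER_URL = "http://server.example.com/voice-interaction"  # hypothetical

def send_interactive_voice(audio_wav: bytes) -> dict:
    """Upload captured audio; receive response text plus response mode."""
    req = request.Request(SERVER_URL, data=audio_wav,
                          headers={"Content-Type": "audio/wav"})
    with request.urlopen(req) as resp:
        # e.g. {"text": "...", "mode": {"style": "...", "timbre": "..."}}
        return json.load(resp)
```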
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically illustrates a flow chart of a voice interaction method according to an embodiment of the disclosure.
As shown in FIG. 2, the method includes operations S210-S230.
In response to receiving the interactive voice from the user, semantic analysis is performed on the interactive voice to obtain user intention information in operation S210.
In operation S220, a target voice response mode is determined based on the user intention information.
In operation S230, voice response information is output in accordance with the target voice response mode.
According to embodiments of the present disclosure, the user intention information may include information reflecting the needs of the user. For example, the interactive voice from the user may be "Please play singer Axx's song." By performing semantic analysis on this interactive voice, it may be determined that the user intention information includes information reflecting the user's need to enjoy the songs of singer Axx. However, the user intention information is not limited thereto; it may reflect other needs of the user, such as a need to acquire weather information or to listen to a voiced novel.
It should be noted that the semantic analysis may be performed on the interactive voice through a neural network model in the related art; the model may be constructed based on, for example, a long short-term memory (LSTM) network, or based on a statistical algorithm. The embodiments of the present disclosure do not limit the specific technical means for performing semantic analysis on the interactive voice.
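As an illustration only, the following sketch maps a recognized transcript to user intention information (operation S210). The keyword rules stand in for the LSTM-based or statistical semantic-analysis model left unspecified above; the domain names and slot layout are assumptions, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class UserIntent:
    domain: str  # e.g. "music", "navigation", "weather"
    slots: dict  # extracted entities, e.g. {"utterance": "..."}

# Hypothetical keyword rules standing in for a trained semantic model.
_DOMAIN_KEYWORDS = {
    "music": ["play", "song", "singer"],
    "navigation": ["navigate", "route", "directions"],
    "weather": ["weather", "forecast", "temperature"],
}

def analyze_intent(transcript: str) -> UserIntent:
    """Operation S210: map recognized text to user intention information."""
    text = transcript.lower()
    for domain, keywords in _DOMAIN_KEYWORDS.items():
        if any(k in text for k in keywords):
            return UserIntent(domain=domain, slots={"utterance": transcript})
    return UserIntent(domain="chitchat", slots={"utterance": transcript})
```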
According to embodiments of the present disclosure, a voice response mode may refer to a voice response style, a speaker of the voice response, and the like. For example, the voice response style may include styles in terms of intonation, speech speed, volume, etc., and the speaker of the voice response may refer to the timbre characteristics of the speaker.
According to embodiments of the present disclosure, the target voice response mode may refer to a voice response mode that matches the user intention information. For a plurality of pieces of different user intention information, a target voice response mode matching each piece can be determined.
According to embodiments of the present disclosure, the voice response information may include voice information that responds to the user intention information. For example, where the user intention information reflects the user's desire to enjoy songs by singer Axx, the voice response information may be "OK, now playing singer Axx's song 'ABC' for you."
According to the embodiments of the present disclosure, the target voice response mode can be determined from the obtained user intention information and adjusted as the user intention information differs. The voice response information output according to the target voice response mode thus carries different voice information characteristics, such as different voice response styles and different emotional colors, thereby meeting users' personalized needs and improving the fun and experience of use.
The voice interaction method according to the embodiments of the present disclosure is further described below with reference to FIGS. 3 and 4.
According to an embodiment of the present disclosure, operation S220, determining a target voice response mode based on the user intention information, may include:
determining application scene information based on the user intention information; and determining the target voice response mode based on the application scene information.
According to embodiments of the present disclosure, application scene information may refer to category information of a terminal application, or a scenario in which an operation is performed using the terminal application, such as a map class application playing navigation information, a song play class application playing music, a voiced-novel play class application playing a novel, or a query class application querying weather information. Based on different user intention information, different application scene information can be determined; the application scene information corresponds to the user intention information so as to meet the user's needs. For example, where the user need reflected by the user intention information is to enjoy songs of singer Axx, the application scene information may be playing songs using a song play class application.
According to an embodiment of the present disclosure, when the application scene information indicates that navigation information is played using a map application, the target voice response mode may be determined to have an urgent response style, for example by accelerating the speech speed to convey urgency. Outputting the voice response information in this urgent response style can raise the user's attention to the voice response information and thus serve to alert the user.
According to the embodiment of the disclosure, the corresponding target voice response mode can be determined based on different application scene information, so that the applicability of the target voice response mode to different application scene information is improved, and the personalized requirements of a user are met.
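A minimal sketch of this scene-to-mode mapping follows; the scene categories, styles, and parameter values are invented for illustration (the urgent navigation style above becomes a raised speaking rate).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceResponseMode:
    style: str                    # e.g. "urgent", "cheerful", "neutral"
    speaking_rate: float = 1.0    # 1.0 = normal speech speed
    timbre: Optional[str] = None  # response voice feature, if any

# Hypothetical mapping from application scene category to response mode.
_SCENE_MODES = {
    "map_navigation": VoiceResponseMode(style="urgent", speaking_rate=1.2),
    "music_playback": VoiceResponseMode(style="cheerful"),
    "weather_query": VoiceResponseMode(style="neutral"),
}

def determine_response_mode(scene: str) -> VoiceResponseMode:
    """Operation S220, second half: pick the mode matching the scene."""
    return _SCENE_MODES.get(scene, VoiceResponseMode(style="neutral"))
```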
According to an embodiment of the present disclosure, the voice interaction method may further include, after operation S230, that is, after outputting the voice response information in the target voice response mode:
determining, based on the user intention information, subject content information of the operation to be performed.
According to an embodiment of the present disclosure, the subject content information of the operation to be performed may include voice information to be played, for example the song content information of singer Axx's song "ABC", but is not limited thereto; it may also include navigation content information, weather forecast information, voiced-novel content information, or the like.
According to the embodiments of the present disclosure, different user intention information can reflect different user needs. Determining the corresponding subject content information of the operation to be performed based on the user intention information can meet users' personalized needs and improve the user experience.
Fig. 3 schematically illustrates an application scenario diagram of a voice interaction method according to an embodiment of the present disclosure.
As shown in fig. 3, the interactive voice 310 from the user may be "Navigate from AA mall to BB cinema." In response to receiving the interactive voice 310, semantic analysis of the interactive voice 310 may determine that the user intention information includes information reflecting the user's need to obtain navigation information. From the user intention information, the application scene information can be determined to be playing navigation information, and the target voice response mode can then be determined based on the application scene information; for example, it may include accent information of the local dialect of the navigation region. The voice response information 320, "OK, starting navigation for you," is output in the target voice response mode of the local dialect of the navigation region. Then, the subject content information 330 of the operation to be performed is determined, e.g., "Please go straight ahead, and…".
According to embodiments of the present disclosure, by determining the user intention information, e.g. information reflecting the user's need to obtain navigation information, it may be determined that the application scene information is playing navigation information using a map class application. From the application scene information, the target voice response mode is determined to be a voice response in the local dialect of the navigation region, and the voice response information is output accordingly. This satisfies the user's need to acquire navigation information while outputting the voice response information in the local dialect of the navigation region, further meeting the user's personalized needs and improving fun and experience.
According to an embodiment of the present disclosure, the target voice response mode includes a response style.
Determining the target voice response mode based on the application scene information may include identifying emotion information in the subject content information in response to the application scene information being first predetermined scene information, and determining a response style of the target voice response mode based on the emotion information.
According to the embodiments of the present disclosure, it may be judged whether the application scene information matches the first predetermined scene information, and when the application scene information is determined to be the first predetermined scene information, the target voice response mode is determined according to the rule associated with the first predetermined scene information.
For example, the first predetermined scene information may include playing music with a song play class application, playing a novel with a voiced-novel play class application, and so forth. The subject content information may include songs, voiced novels, etc. corresponding to the first predetermined scene information. Where the application scene information is the first predetermined scene information, emotion information in the subject content information may be identified; for example, the emotion information in a song or novel to be played may be cheerful, sad, or the like.
According to embodiments of the present disclosure, the response style may include a response style that characterizes speech emotion, for example a cheerful or sad response style. By determining the response style of the target voice response mode based on the emotion information in the subject content information, the response style can be unified with that emotion information, so that the voice response information is consistent with the emotion of the subject content of the operation to be performed. The emotional color is thereby rendered in advance, improving the intelligence and fun of voice interaction.
It should be noted that the emotion information in the subject content information may be identified by analyzing semantic information of the subject content information, for example through a neural network model. The method is not limited thereto; the emotion information may also be identified by analyzing the frequency and amplitude of sound in the subject content information. The embodiments of the present disclosure do not limit the specific technical means for identifying emotion information in the subject content information.
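A minimal sketch of this first-predetermined-scene branch: emotion in the subject content is identified (here by hypothetical lexicon scoring, standing in for the neural or acoustic analysis mentioned above) and reused as the response style, so response and content emotion agree. The word lists are invented for illustration.

```python
_EMOTION_LEXICON = {
    "cheerful": {"dance", "sunshine", "happy", "party"},
    "sad": {"tears", "goodbye", "lonely", "rain"},
}

def identify_emotion(content_text: str) -> str:
    """Score each emotion by lexicon overlap; 'neutral' if nothing matches."""
    words = set(content_text.lower().split())
    scores = {e: len(words & vocab) for e, vocab in _EMOTION_LEXICON.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"

def response_style_for(content_text: str) -> str:
    # Unify the response style with the content's emotion information.
    return identify_emotion(content_text)
```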
According to an embodiment of the present disclosure, the target voice response mode includes a response voice feature.
Based on the application scene information, determining the target voice response mode comprises the following steps:
Identifying the voice characteristics of the speaker in the subject content information in response to the application scene information being second predetermined scene information; and determining response voice characteristics of the target voice response mode based on the voice characteristics of the speaker.
According to an embodiment of the present disclosure, the second predetermined scene information may include playing music with a song play class application, playing a novel with a voiced-novel play class application, and so forth. The subject content information may include song or voiced-novel voice information corresponding to the second predetermined scene information. The voice features of the speaker may include the timbre of the speaker.
For example, when the second predetermined scene information is playing a voiced novel of voice actor Byy using a voiced-novel play class application, the subject content information may be the voiced work "EDF" of voice actor Byy, whose speaker is Byy. By recognizing the voice characteristics of speaker Byy in the voiced work "EDF", those characteristics can be determined as the response voice characteristics of the target voice response mode.
It should be noted that, the voice features of the speaker in the subject content information may be identified by analyzing the voiceprint features of the voice information in the subject content information.
According to an embodiment of the present disclosure, determining the response voice feature of the target voice response mode based on the voice feature of the speaker may mean using the speaker's voice feature as the response voice feature. Outputting the voice response information with the voice characteristics of the speaker in the subject content information gives the voice response information and the subject content information the same voice characteristics, improving the intelligence of voice interaction.
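This second-predetermined-scene branch can be sketched as a lookup from the subject content's speaker to a matching TTS voice. In practice the speaker would be identified by voiceprint analysis of the content's audio and matched against a synthesis voice bank; the bank entries below are hypothetical.

```python
from typing import Dict

# Hypothetical TTS voice bank keyed by the subject content's speaker.
_VOICE_BANK: Dict[str, str] = {
    "Axx": "tts_voice_axx",
    "Byy": "tts_voice_byy",
}

def response_timbre(content_speaker: str, default: str = "tts_default") -> str:
    """Return the TTS voice matching the content's speaker, if enrolled."""
    return _VOICE_BANK.get(content_speaker, default)
```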
According to an exemplary embodiment of the present disclosure, the target voice response mode includes both a response voice feature and a response style. Where the application scene information is both the first predetermined scene information and the second predetermined scene information, emotion information in the subject content information may be identified and the response style of the target voice response mode determined based on it, while the response voice feature of the target voice response mode is determined based on the voice feature of the speaker. For example, the target voice response mode may include a cheerful response style and a female response voice feature, so that the voice response information is output according to the target voice response mode of a "cheerful girl".
It should be understood that the first predetermined scene information and the second predetermined scene information may be the same or different, and those skilled in the art may set the first predetermined scene information and the second predetermined scene information according to actual needs.
Fig. 4 schematically illustrates an application scenario diagram of a user in voice interaction according to an embodiment of the present disclosure.
As shown in fig. 4, the interactive voice 411 uttered by the user 410 may be "please play song ABC of singer Axx". The voice interaction device may be disposed in the vehicle 420, for example, the voice interaction device is an in-vehicle terminal device. It should be appreciated that the user 410 may be located within the vehicle 420, or the user 410 may also be located outside of the vehicle 420, so long as the interactive voice 411 uttered by the user 410 is receivable by the voice interactive device.
The voice interaction device of the vehicle 420 may perform semantic analysis on the interactive voice 411 and determine that the user intention information includes information reflecting the user's need to enjoy songs by singer Axx. Based on the user intention information, it may be determined that the first predetermined application scenario is playing music using a song play class application and that the subject content information is song "ABC" 422.
The emotion information in song "ABC" 422 may be identified as sad; based on that emotion information, the response style is determined to be sad, and the target voice response mode is determined to include the sad response style. Voice response information 421 with the sad response style is then output: "OK, now playing Axx's song 'ABC' for you."
Further, when the second predetermined application scenario is the same as the first, the voice characteristics of the speaker in the voice information 422 of song "ABC", that is, the timbre of singer Axx, can be identified. Based on those characteristics, the response voice feature is determined to be the timbre of singer Axx, and the target voice response mode is determined to include that timbre. The voice response information 421, "OK, now playing Axx's song 'ABC' for you," is then output in the sad response style and in the timbre of singer Axx; after the voice response information 421 is output, song "ABC" 422 can be played.
By unifying the response style with the emotion information of song "ABC" and outputting the voice response information in the timbre of the speaker of song "ABC", the user can be helped to enter the psychological state of enjoying song "ABC" in advance, before its voice information is played, further meeting the user's personalized needs and improving the user experience.
According to an embodiment of the present disclosure, outputting the voice response information in the target voice response mode includes:
in response to the target voice response mode being found among a plurality of template voice response modes, outputting the voice response information according to the target voice response mode; and in response to the target voice response mode not being found among the plurality of template voice response modes, determining one template voice response mode from the plurality of template voice response modes as the target voice response mode according to an attention recommendation rule.
According to embodiments of the present disclosure, the attention degree may include popularity or heat, and may be generated based on users' browsing, retrieving, or liking operations. The attention recommendation rule may be to determine one template voice response mode from the N template voice response modes with the highest attention degrees. For example, when the target voice response mode includes the timbre of singer Axx but no template voice response mode including that timbre is found among the plurality of template voice response modes, the template voice response mode with the highest attention degree may be determined as the target voice response mode from the plurality of template voice response modes in a cloud server, and the voice response information output accordingly. The rule is not limited thereto; a template voice response mode may also be determined from the plurality of template voice response modes according to the user's degree of interest and used as the target voice response mode.
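The template search with attention-based fallback might look like the sketch below; the template records, names, and attention scores are invented for illustration.

```python
from typing import Optional

# Hypothetical template store: each template voice response mode carries an
# attention score accumulated from browsing, retrieving, and liking operations.
_TEMPLATES = [
    {"name": "cheerful_girl", "timbre": "tts_female_young", "attention": 9120},
    {"name": "calm_male", "timbre": "tts_male_adult", "attention": 4310},
]

def _find_template(name: str) -> Optional[dict]:
    return next((t for t in _TEMPLATES if t["name"] == name), None)

def resolve_response_mode(target_name: str) -> dict:
    """Use the target template if found; otherwise fall back to the
    highest-attention template per the attention recommendation rule."""
    return _find_template(target_name) or max(
        _TEMPLATES, key=lambda t: t["attention"])
```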
The voice interaction method provided by the embodiments of the present disclosure thus offers multiple channels for determining the target voice response mode, enriching the diversity of target voice response modes, expanding the application range of the voice interaction method, and further improving the user experience.
Fig. 5 schematically illustrates a block diagram of a voice interaction apparatus according to an embodiment of the disclosure.
As shown in fig. 5, the voice interaction apparatus 500 may include an intention determination module 510, a response mode determination module 520, and a response output module 530.
The intention determining module 510 is configured to perform semantic analysis on the interactive voice to obtain user intention information in response to receiving the interactive voice from the user.
The response mode determining module 520 is configured to determine a target voice response mode based on the user intention information.
The response output module 530 is configured to output voice response information according to the target voice response mode.
According to the embodiment of the disclosure, the response mode determining module 520 includes an application scene determining sub-module and a response mode determining sub-module.
And the application scene determining sub-module is used for determining application scene information based on the user intention information.
And the response mode determining submodule is used for determining a target voice response mode based on the application scene information.
According to an embodiment of the present disclosure, the voice interaction apparatus further includes an operation determining module for determining, based on the user intention information and after the voice response information is output in the target voice response mode, subject content information of the operation to be performed.
According to an embodiment of the present disclosure, the target voice response mode includes a response style.
The response mode determining submodule includes a first identifying unit and a response style determining unit.
The first identifying unit is used for identifying emotion information in the subject content information in response to the application scene information being first predetermined scene information.
And the response style determining unit is used for determining the response style of the target voice response mode based on the emotion information.
According to an embodiment of the present disclosure, the target voice response mode includes a response voice feature;
the response mode determining submodule comprises a second identification unit and a voice characteristic determining unit.
And a second recognition unit for recognizing the voice characteristics of the speaker in the subject content information in response to the application scene information being second predetermined scene information.
And the voice characteristic determining unit is used for determining response voice characteristics of the target voice response mode based on the voice characteristics of the speaker.
According to the embodiment of the disclosure, the response output module comprises a first voice response sub-module and a second voice response sub-module.
The first voice response sub-module is used for responding to the target voice response mode searched from the plurality of template voice response modes and outputting voice response information according to the target voice response mode.
And the second voice response sub-module is used for determining one template voice response mode from the plurality of template voice response modes as a target voice response mode according to the attention recommendation rule in response to the fact that the target voice response mode is not searched from the plurality of template voice response modes.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described above.
According to an embodiment of the present disclosure, a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
FIG. 6 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including an input unit 606, e.g., keyboard, mouse, etc., an output unit 607, e.g., various types of displays, speakers, etc., a storage unit 608, e.g., magnetic disk, optical disk, etc., and a communication unit 609, e.g., network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as voice interaction methods. For example, in some embodiments, the voice interaction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, e.g., storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the voice interaction method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the voice interaction method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (11)

1. A voice interaction method, comprising:
in response to receiving an interactive voice from a user, performing semantic analysis on the interactive voice to obtain user intention information;
determining application scenario information based on the user intention information, wherein the application scenario information represents a category of a terminal application used to perform an operation;
in response to the application scenario information being second predetermined scenario information, identifying a voice feature of a speaker in subject content information of the operation to be performed, wherein the subject content information is determined based on the user intention information;
determining a response voice feature of a target voice response mode based on the voice feature of the speaker; and
outputting voice response information according to the target voice response mode.
2. The method according to claim 1, further comprising, after outputting the voice response information in the target voice response mode:
determining, based on the user intention information, subject content information of the operation to be performed.
3. The method according to claim 2, wherein the target voice response mode includes a response style, the method further comprising:
in response to the application scenario information being first predetermined scenario information, identifying emotional information in the subject content information; and
determining the response style of the target voice response mode based on the emotional information.
4. The method according to any one of claims 1 to 3, wherein outputting the voice response information according to the target voice response mode comprises:
in response to the target voice response mode being found among a plurality of template voice response modes, outputting the voice response information according to the target voice response mode; and
in response to the target voice response mode not being found among the plurality of template voice response modes, determining one template voice response mode from the plurality of template voice response modes as the target voice response mode according to an attention recommendation rule.
5. A voice interaction apparatus, comprising:
an intention determining module configured to, in response to receiving an interactive voice from a user, perform semantic analysis on the interactive voice to obtain user intention information;
a response mode determining module configured to determine a target voice response mode based on the user intention information; and
a response output module configured to output voice response information according to the target voice response mode;
wherein the response mode determining module includes:
an application scenario determining submodule configured to determine application scenario information based on the user intention information, wherein the application scenario information represents a category of a terminal application used to perform an operation; and
a response mode determining submodule configured to determine the target voice response mode based on the application scenario information;
wherein the target voice response mode includes a response voice feature, and the response mode determining submodule includes:
a second recognition unit configured to, in response to the application scenario information being second predetermined scenario information, identify a voice feature of a speaker in subject content information of the operation to be performed, wherein the subject content information is determined based on the user intention information; and
a voice feature determining unit configured to determine the response voice feature of the target voice response mode based on the voice feature of the speaker.
6. The apparatus according to claim 5, further comprising:
an operation determining module configured to determine, based on the user intention information and after the voice response information is output in the target voice response mode, subject content information of the operation to be performed.
7. The apparatus according to claim 6, wherein the target voice response mode includes a response style, and the response mode determining submodule includes:
a first identifying unit configured to, in response to the application scenario information being first predetermined scenario information, identify emotional information in the subject content information; and
a response style determining unit configured to determine the response style of the target voice response mode based on the emotional information.
8. The apparatus according to any one of claims 5 to 7, wherein the response output module includes:
a first voice response submodule configured to, in response to the target voice response mode being found among a plurality of template voice response modes, output the voice response information according to the target voice response mode; and
a second voice response submodule configured to, in response to the target voice response mode not being found among the plurality of template voice response modes, determine one template voice response mode from the plurality of template voice response modes as the target voice response mode according to an attention recommendation rule.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 4.
10. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 4.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 4.
CN202111297097.3A 2021-11-03 2021-11-03 Voice interaction method, voice interaction device, electronic equipment and storage medium Active CN114093356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111297097.3A CN114093356B (en) 2021-11-03 2021-11-03 Voice interaction method, voice interaction device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111297097.3A CN114093356B (en) 2021-11-03 2021-11-03 Voice interaction method, voice interaction device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114093356A CN114093356A (en) 2022-02-25
CN114093356B true CN114093356B (en) 2025-08-01

Family

ID=80298861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111297097.3A Active CN114093356B (en) 2021-11-03 2021-11-03 Voice interaction method, voice interaction device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114093356B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331669A (en) * 2022-08-01 2022-11-11 中国工商银行股份有限公司 Information generation method, apparatus, device, medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597509A (en) * 2018-03-30 2018-09-28 百度在线网络技术(北京)有限公司 Intelligent sound interacts implementation method, device, computer equipment and storage medium
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109346076A (en) * 2018-10-25 2019-02-15 三星电子(中国)研发中心 Voice interaction, voice processing method, device and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9786299B2 (en) * 2014-12-04 2017-10-10 Microsoft Technology Licensing, Llc Emotion type classification for interactive dialog system
KR101747896B1 (en) * 2015-12-30 2017-06-16 주식회사 미니스쿨 Device and method for bidirectional preschool education service
CN109273001B (en) * 2018-10-25 2021-06-18 珠海格力电器股份有限公司 Voice broadcasting method and device, computing device and storage medium
CN112309379B (en) * 2019-07-26 2024-05-31 北京地平线机器人技术研发有限公司 Method, device, medium and electronic equipment for realizing voice interaction
WO2021081744A1 (en) * 2019-10-29 2021-05-06 深圳市欢太科技有限公司 Voice information processing method, apparatus, and device, and storage medium
CN112309390B (en) * 2020-03-05 2024-08-06 北京字节跳动网络技术有限公司 Information interaction method and device
CN111415642A (en) * 2020-03-31 2020-07-14 广东美的制冷设备有限公司 Voice broadcast method and device of electric equipment, air conditioner and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597509A (en) * 2018-03-30 2018-09-28 百度在线网络技术(北京)有限公司 Intelligent sound interacts implementation method, device, computer equipment and storage medium
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109346076A (en) * 2018-10-25 2019-02-15 三星电子(中国)研发中心 Voice interaction, voice processing method, device and system

Also Published As

Publication number Publication date
CN114093356A (en) 2022-02-25

Similar Documents

Publication Publication Date Title
US11823659B2 (en) Speech recognition through disambiguation feedback
US9865264B2 (en) Selective speech recognition for chat and digital personal assistant systems
EP3485489B1 (en) Contextual hotwords
US10360265B1 (en) Using a voice communications device to answer unstructured questions
CN110473546B (en) Method and device for recommending media files
CN112270925B (en) Platform for creating customizable dialog system engines
US10043520B2 (en) Multilevel speech recognition for candidate application group using first and second speech commands
US20190027147A1 (en) Automatic integration of image capture and recognition in a voice-based query to understand intent
US11829433B2 (en) Contextual deep bookmarking
US20180052824A1 (en) Task identification and completion based on natural language query
CN110263142A (en) Method and apparatus for output information
CN108701454A (en) Parameter Collection and Automatic Dialogue Generation in Dialogue Systems
CN110070859B (en) Voice recognition method and device
CN107707745A (en) Method and apparatus for extracting information
CN110956955A (en) A method and device for voice interaction
US11705113B2 (en) Priority and context-based routing of speech processing
CN111324626B (en) Search method, device, computer equipment and storage medium based on speech recognition
CN113674742A (en) Man-machine interaction method, device, equipment and storage medium
US20190066669A1 (en) Graphical data selection and presentation of digital content
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN113486170B (en) Natural language processing method, device, equipment and medium based on man-machine interaction
CN114093356B (en) Voice interaction method, voice interaction device, electronic equipment and storage medium
CN104424955B (en) Generate figured method and apparatus, audio search method and the equipment of audio
CN109712606A (en) A kind of information acquisition method, device, equipment and storage medium
WO2022271555A1 (en) Early invocation for contextual data processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant