
CN114093356B - Voice interaction method, voice interaction device, electronic equipment and storage medium - Google Patents

Voice interaction method, voice interaction device, electronic equipment and storage medium

Info

Publication number
CN114093356B
CN114093356B
Authority
CN
China
Prior art keywords
information
response
voice
voice response
target voice
Prior art date
Legal status
Active
Application number
CN202111297097.3A
Other languages
Chinese (zh)
Other versions
CN114093356A (en)
Inventor
周毅
Current Assignee
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Original Assignee
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority to CN202111297097.3A
Publication of CN114093356A
Application granted
Publication of CN114093356B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the analysis technique
    • G10L 2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
    • G10L 2015/228: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of application context

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract


This disclosure provides a voice interaction method, apparatus, electronic device, and storage medium, relating to the field of computer technology and, in particular, to the fields of autonomous driving, intelligent cockpits, and connected vehicles. A specific implementation scheme comprises: in response to receiving interactive voice from a user, performing semantic analysis on the interactive voice to obtain user intention information; determining a target voice response mode based on the user intention information; and outputting voice response information according to the target voice response mode.

Description

Voice interaction method, voice interaction device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology and the field of internet technology, in particular to the field of intelligent voice technology, and specifically to a voice interaction method, a voice interaction apparatus, an electronic device, a storage medium, and a program product.
Background
With the development of technology, voice interaction is widely applied in intelligent voice devices such as intelligent robots, smart speakers, intelligent vehicle-mounted devices, and smart appliances. Based on the interactive voice uttered by a user, an intelligent voice device may perform corresponding operations, such as answering the questions in the user's interactive voice or starting and stopping the device.
Disclosure of Invention
The present disclosure provides a method for voice interaction, an apparatus for voice interaction, an electronic device, a storage medium, and a program product.
According to one aspect of the disclosure, a voice interaction method is provided, comprising: in response to receiving interactive voice from a user, performing semantic analysis on the interactive voice to obtain user intention information; determining a target voice response mode based on the user intention information; and outputting voice response information according to the target voice response mode.
According to another aspect of the disclosure, a voice interaction apparatus is provided, comprising an intention determining module, a response mode determining module, and a response output module. The intention determining module is configured to, in response to receiving interactive voice from a user, perform semantic analysis on the interactive voice to obtain user intention information; the response mode determining module is configured to determine a target voice response mode based on the user intention information; and the response output module is configured to output voice response information according to the target voice response mode.
According to another aspect of the present disclosure, there is provided an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method as described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture to which voice interaction methods and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a voice interaction method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates an application scenario diagram of a voice interaction method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates an application scenario diagram of a user in voice interaction according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a block diagram of a voice interaction device according to an embodiment of the disclosure;
FIG. 6 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides a voice interaction method, a voice interaction apparatus, an electronic device, a storage medium, and a program product.
According to an embodiment of the present disclosure, the voice interaction method includes: in response to receiving interactive voice from a user, performing semantic analysis on the interactive voice to obtain user intention information; determining a target voice response mode based on the user intention information; and outputting voice response information according to the target voice response mode.
According to an embodiment of the present disclosure, by performing semantic analysis on the user's interactive voice to determine the user intention information, and by determining the target voice response mode based on that information, the target voice response mode can be chosen with the user's intention fully considered. Outputting the voice response information according to the target voice response mode can thus meet the user's personalized needs and improve the fun and experience of use.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user's personal information comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
Fig. 1 schematically illustrates an exemplary system architecture to which voice interaction methods and apparatus may be applied, according to embodiments of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the voice interaction method and apparatus may be applied may include a terminal device, but the terminal device may implement the voice interaction method and apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client and/or social platform software, etc. (as examples only).
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the voice interaction method provided by the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the voice interaction apparatus provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
Alternatively, the voice interaction method provided by the embodiments of the present disclosure may be performed by the server 105, and accordingly the voice interaction apparatus may be disposed in the server 105. The voice interaction method may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105; accordingly, the voice interaction apparatus may also be disposed in such a server or server cluster.
For example, when the user utters interactive voice, the terminal device 101, 102, or 103 may acquire the interactive voice from the user and send it to the server 105. The server 105 then performs semantic analysis on the interactive voice to obtain user intention information, determines a target voice response mode based on the user intention information, and outputs voice response information according to the target voice response mode. Alternatively, a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105 performs the semantic analysis on the interactive voice and finally outputs the voice response information according to the target voice response mode.
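As an illustration of this client-server split, the following is a minimal sketch in Python, assuming a hypothetical HTTP endpoint: the terminal device captures the interactive voice and uploads it, and the server returns the analyzed response. The endpoint path and payload fields are invented for illustration and are not part of the disclosure.

```python
import json
from urllib import request

SERVER_URL = "http://server.example.com/voice-interaction"  # hypothetical

def send_interactive_voice(audio_wav: bytes) -> dict:
    """Upload captured audio; receive response text plus response mode."""
    req = request.Request(SERVER_URL, data=audio_wav,
                          headers={"Content-Type": "audio/wav"})
    with request.urlopen(req) as resp:
        # e.g. {"text": "...", "mode": {"style": "...", "timbre": "..."}}
        return json.load(resp)
```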
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically illustrates a flow chart of a voice interaction method according to an embodiment of the disclosure.
As shown in FIG. 2, the method includes operations S210-S230.
In response to receiving the interactive voice from the user, semantic analysis is performed on the interactive voice to obtain user intention information in operation S210.
In operation S220, a target voice response mode is determined based on the user intention information.
In operation S230, voice response information is output in accordance with the target voice response mode.
According to embodiments of the present disclosure, the user intention information may include information reflecting the needs of the user. For example, the interactive voice from the user may be "Please play singer Axx's song." By performing semantic analysis on this interactive voice, it may be determined that the user intention information includes information reflecting the user's need to enjoy the songs of singer Axx. However, the user intention information is not limited thereto; it may reflect other needs of the user, such as a need to acquire weather information or to listen to a voiced novel.
It should be noted that the semantic analysis may be performed on the interactive voice through a neural network model in the related art; the model may be constructed based on, for example, a long short-term memory (LSTM) network, or based on a statistical algorithm. The embodiments of the present disclosure do not limit the specific technical means for performing semantic analysis on the interactive voice.
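As an illustration only, the following sketch maps a recognized transcript to user intention information (operation S210). The keyword rules stand in for the LSTM-based or statistical semantic-analysis model left unspecified above; the domain names and slot layout are assumptions, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class UserIntent:
    domain: str  # e.g. "music", "navigation", "weather"
    slots: dict  # extracted entities, e.g. {"utterance": "..."}

# Hypothetical keyword rules standing in for a trained semantic model.
_DOMAIN_KEYWORDS = {
    "music": ["play", "song", "singer"],
    "navigation": ["navigate", "route", "directions"],
    "weather": ["weather", "forecast", "temperature"],
}

def analyze_intent(transcript: str) -> UserIntent:
    """Operation S210: map recognized text to user intention information."""
    text = transcript.lower()
    for domain, keywords in _DOMAIN_KEYWORDS.items():
        if any(k in text for k in keywords):
            return UserIntent(domain=domain, slots={"utterance": transcript})
    return UserIntent(domain="chitchat", slots={"utterance": transcript})
```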
According to embodiments of the present disclosure, a voice response mode may refer to a voice response style, a speaker of the voice response, and the like. For example, the voice response style may include styles in terms of intonation, speech speed, volume, etc., and the speaker of the voice response may refer to the timbre characteristics of the speaker.
According to embodiments of the present disclosure, the target voice response mode may refer to a voice response mode that matches the user intention information. For a plurality of pieces of different user intention information, a target voice response mode matching each piece can be determined.
According to embodiments of the present disclosure, the voice response information may include voice information that responds to the user intention information. For example, where the user intention information reflects the user's desire to enjoy songs by singer Axx, the voice response information may be "OK, now playing singer Axx's song 'ABC' for you."
According to the embodiments of the present disclosure, the target voice response mode can be determined from the obtained user intention information and adjusted as the user intention information differs. The voice response information output according to the target voice response mode thus carries different voice information characteristics, such as different voice response styles and different emotional colors, thereby meeting users' personalized needs and improving the fun and experience of use.
The voice interaction method according to the embodiments of the present disclosure is further described below with reference to FIGS. 3 and 4.
According to an embodiment of the present disclosure, operation S220, determining a target voice response mode based on the user intention information, may include:
determining application scene information based on the user intention information; and determining the target voice response mode based on the application scene information.
According to embodiments of the present disclosure, application scene information may refer to category information of a terminal application, or a scenario in which an operation is performed using the terminal application, such as a map class application playing navigation information, a song play class application playing music, a voiced-novel play class application playing a novel, or a query class application querying weather information. Based on different user intention information, different application scene information can be determined; the application scene information corresponds to the user intention information so as to meet the user's needs. For example, where the user need reflected by the user intention information is to enjoy songs of singer Axx, the application scene information may be playing songs using a song play class application.
According to an embodiment of the present disclosure, when the application scene information indicates that navigation information is played using a map application, the target voice response mode may be determined to have an urgent response style, for example by accelerating the speech speed to convey urgency. Outputting the voice response information in this urgent response style can raise the user's attention to the voice response information and thus serve to alert the user.
According to the embodiment of the disclosure, the corresponding target voice response mode can be determined based on different application scene information, so that the applicability of the target voice response mode to different application scene information is improved, and the personalized requirements of a user are met.
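A minimal sketch of this scene-to-mode mapping follows; the scene categories, styles, and parameter values are invented for illustration (the urgent navigation style above becomes a raised speaking rate).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceResponseMode:
    style: str                    # e.g. "urgent", "cheerful", "neutral"
    speaking_rate: float = 1.0    # 1.0 = normal speech speed
    timbre: Optional[str] = None  # response voice feature, if any

# Hypothetical mapping from application scene category to response mode.
_SCENE_MODES = {
    "map_navigation": VoiceResponseMode(style="urgent", speaking_rate=1.2),
    "music_playback": VoiceResponseMode(style="cheerful"),
    "weather_query": VoiceResponseMode(style="neutral"),
}

def determine_response_mode(scene: str) -> VoiceResponseMode:
    """Operation S220, second half: pick the mode matching the scene."""
    return _SCENE_MODES.get(scene, VoiceResponseMode(style="neutral"))
```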
According to an embodiment of the present disclosure, the voice interaction method may further include, after operation S230, that is, after outputting the voice response information in the target voice response mode:
determining, based on the user intention information, subject content information of the operation to be performed.
According to an embodiment of the present disclosure, the subject content information of the operation to be performed may include voice information to be played, for example the song content information of singer Axx's song "ABC", but is not limited thereto; it may also include navigation content information, weather forecast information, voiced-novel content information, or the like.
According to the embodiments of the present disclosure, different user intention information can reflect different user needs. Determining the corresponding subject content information of the operation to be performed based on the user intention information can meet users' personalized needs and improve the user experience.
Fig. 3 schematically illustrates an application scenario diagram of a voice interaction method according to an embodiment of the present disclosure.
As shown in fig. 3, the interactive voice 310 from the user may be "Navigate from AA mall to BB cinema." In response to receiving the interactive voice 310, semantic analysis of the interactive voice 310 may determine that the user intention information includes information reflecting the user's need to obtain navigation information. From the user intention information, the application scene information can be determined to be playing navigation information, and the target voice response mode can then be determined based on the application scene information; for example, it may include accent information of the local dialect of the navigation region. The voice response information 320, "OK, starting navigation for you," is output in the target voice response mode of the local dialect of the navigation region. Then, the subject content information 330 of the operation to be performed is determined, e.g., "Please go straight ahead, and…".
According to embodiments of the present disclosure, by determining the user intention information, e.g. information reflecting the user's need to obtain navigation information, it may be determined that the application scene information is playing navigation information using a map class application. From the application scene information, the target voice response mode is determined to be a voice response in the local dialect of the navigation region, and the voice response information is output accordingly. This satisfies the user's need to acquire navigation information while outputting the voice response information in the local dialect of the navigation region, further meeting the user's personalized needs and improving fun and experience.
According to an embodiment of the present disclosure, the target voice response mode includes a response style.
Determining the target voice response mode based on the application scene information may include identifying emotion information in the subject content information in response to the application scene information being first predetermined scene information, and determining a response style of the target voice response mode based on the emotion information.
According to the embodiments of the present disclosure, it may be judged whether the application scene information matches the first predetermined scene information, and when the application scene information is determined to be the first predetermined scene information, the target voice response mode is determined according to the rule associated with the first predetermined scene information.
For example, the first predetermined scene information may include playing music with a song play class application, playing a novel with a voiced-novel play class application, and so forth. The subject content information may include songs, voiced novels, etc. corresponding to the first predetermined scene information. Where the application scene information is the first predetermined scene information, emotion information in the subject content information may be identified; for example, the emotion information in a song or novel to be played may be cheerful, sad, or the like.
According to embodiments of the present disclosure, the response style may include a response style that characterizes speech emotion, for example a cheerful or sad response style. By determining the response style of the target voice response mode based on the emotion information in the subject content information, the response style can be unified with that emotion information, so that the voice response information is consistent with the emotion of the subject content of the operation to be performed. The emotional color is thereby rendered in advance, improving the intelligence and fun of voice interaction.
It should be noted that the emotion information in the subject content information may be identified by analyzing semantic information of the subject content information, for example through a neural network model. The method is not limited thereto; the emotion information may also be identified by analyzing the frequency and amplitude of sound in the subject content information. The embodiments of the present disclosure do not limit the specific technical means for identifying emotion information in the subject content information.
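A minimal sketch of this first-predetermined-scene branch: emotion in the subject content is identified (here by hypothetical lexicon scoring, standing in for the neural or acoustic analysis mentioned above) and reused as the response style, so response and content emotion agree. The word lists are invented for illustration.

```python
_EMOTION_LEXICON = {
    "cheerful": {"dance", "sunshine", "happy", "party"},
    "sad": {"tears", "goodbye", "lonely", "rain"},
}

def identify_emotion(content_text: str) -> str:
    """Score each emotion by lexicon overlap; 'neutral' if nothing matches."""
    words = set(content_text.lower().split())
    scores = {e: len(words & vocab) for e, vocab in _EMOTION_LEXICON.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"

def response_style_for(content_text: str) -> str:
    # Unify the response style with the content's emotion information.
    return identify_emotion(content_text)
```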
According to an embodiment of the present disclosure, the target voice response mode includes a response voice feature.
Based on the application scene information, determining the target voice response mode comprises the following steps:
Identifying the voice characteristics of the speaker in the subject content information in response to the application scene information being second predetermined scene information; and determining response voice characteristics of the target voice response mode based on the voice characteristics of the speaker.
According to an embodiment of the present disclosure, the second predetermined scene information may include playing music with a song play class application, playing a novel with a voiced-novel play class application, and so forth. The subject content information may include song or voiced-novel voice information corresponding to the second predetermined scene information. The voice features of the speaker may include the timbre of the speaker.
For example, when the second predetermined scene information is playing a voiced novel of voice actor Byy using a voiced-novel play class application, the subject content information may be the voiced work "EDF" of voice actor Byy, whose speaker is Byy. By recognizing the voice characteristics of speaker Byy in the voiced work "EDF", those characteristics can be determined as the response voice characteristics of the target voice response mode.
It should be noted that, the voice features of the speaker in the subject content information may be identified by analyzing the voiceprint features of the voice information in the subject content information.
According to an embodiment of the present disclosure, determining the response voice feature of the target voice response mode based on the voice feature of the speaker may mean using the speaker's voice feature as the response voice feature. Outputting the voice response information with the voice characteristics of the speaker in the subject content information gives the voice response information and the subject content information the same voice characteristics, improving the intelligence of voice interaction.
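This second-predetermined-scene branch can be sketched as a lookup from the subject content's speaker to a matching TTS voice. In practice the speaker would be identified by voiceprint analysis of the content's audio and matched against a synthesis voice bank; the bank entries below are hypothetical.

```python
from typing import Dict

# Hypothetical TTS voice bank keyed by the subject content's speaker.
_VOICE_BANK: Dict[str, str] = {
    "Axx": "tts_voice_axx",
    "Byy": "tts_voice_byy",
}

def response_timbre(content_speaker: str, default: str = "tts_default") -> str:
    """Return the TTS voice matching the content's speaker, if enrolled."""
    return _VOICE_BANK.get(content_speaker, default)
```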
According to an exemplary embodiment of the present disclosure, the target voice response mode includes both a response voice feature and a response style. Where the application scene information is both the first predetermined scene information and the second predetermined scene information, emotion information in the subject content information may be identified and the response style of the target voice response mode determined based on it, while the response voice feature of the target voice response mode is determined based on the voice feature of the speaker. For example, the target voice response mode may include a cheerful response style and a female response voice feature, so that the voice response information is output according to the target voice response mode of a "cheerful girl".
It should be understood that the first predetermined scene information and the second predetermined scene information may be the same or different, and those skilled in the art may set the first predetermined scene information and the second predetermined scene information according to actual needs.
Fig. 4 schematically illustrates an application scenario diagram of a user in voice interaction according to an embodiment of the present disclosure.
As shown in fig. 4, the interactive voice 411 uttered by the user 410 may be "please play song ABC of singer Axx". The voice interaction device may be disposed in the vehicle 420, for example, the voice interaction device is an in-vehicle terminal device. It should be appreciated that the user 410 may be located within the vehicle 420, or the user 410 may also be located outside of the vehicle 420, so long as the interactive voice 411 uttered by the user 410 is receivable by the voice interactive device.
The voice interaction device of the vehicle 420 may perform semantic analysis on the interactive voice 411 and determine that the user intention information includes information reflecting the user's need to enjoy songs by singer Axx. Based on the user intention information, it may be determined that the first predetermined application scenario is playing music using a song play class application and that the subject content information is song "ABC" 422.
The emotion information in song "ABC" 422 may be identified as sad; based on that emotion information, the response style is determined to be sad, and the target voice response mode is determined to include the sad response style. Voice response information 421 with the sad response style is then output: "OK, now playing Axx's song 'ABC' for you."
Further, when the second predetermined application scenario is the same as the first, the voice characteristics of the speaker in the voice information 422 of song "ABC", that is, the timbre of singer Axx, can be identified. Based on those characteristics, the response voice feature is determined to be the timbre of singer Axx, and the target voice response mode is determined to include that timbre. The voice response information 421, "OK, now playing Axx's song 'ABC' for you," is then output in the sad response style and in the timbre of singer Axx; after the voice response information 421 is output, song "ABC" 422 can be played.
By unifying the response style with the emotion information of song "ABC" and outputting the voice response information in the timbre of the speaker of song "ABC", the user can be helped to enter the psychological state of enjoying song "ABC" in advance, before its voice information is played, further meeting the user's personalized needs and improving the user experience.
According to an embodiment of the present disclosure, outputting the voice response information in the target voice response mode includes:
in response to the target voice response mode being found among a plurality of template voice response modes, outputting the voice response information according to the target voice response mode; and in response to the target voice response mode not being found among the plurality of template voice response modes, determining one template voice response mode from the plurality of template voice response modes as the target voice response mode according to an attention recommendation rule.
According to embodiments of the present disclosure, the attention degree may include popularity or heat, and may be generated based on users' browsing, retrieving, or liking operations. The attention recommendation rule may be to determine one template voice response mode from the N template voice response modes with the highest attention degrees. For example, when the target voice response mode includes the timbre of singer Axx but no template voice response mode including that timbre is found among the plurality of template voice response modes, the template voice response mode with the highest attention degree may be determined as the target voice response mode from the plurality of template voice response modes in a cloud server, and the voice response information output accordingly. The rule is not limited thereto; a template voice response mode may also be determined from the plurality of template voice response modes according to the user's degree of interest and used as the target voice response mode.
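The template search with attention-based fallback might look like the sketch below; the template records, names, and attention scores are invented for illustration.

```python
from typing import Optional

# Hypothetical template store: each template voice response mode carries an
# attention score accumulated from browsing, retrieving, and liking operations.
_TEMPLATES = [
    {"name": "cheerful_girl", "timbre": "tts_female_young", "attention": 9120},
    {"name": "calm_male", "timbre": "tts_male_adult", "attention": 4310},
]

def _find_template(name: str) -> Optional[dict]:
    return next((t for t in _TEMPLATES if t["name"] == name), None)

def resolve_response_mode(target_name: str) -> dict:
    """Use the target template if found; otherwise fall back to the
    highest-attention template per the attention recommendation rule."""
    return _find_template(target_name) or max(
        _TEMPLATES, key=lambda t: t["attention"])
```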
The voice interaction method provided by the embodiments of the present disclosure thus offers multiple channels for determining the target voice response mode, enriching the diversity of target voice response modes, expanding the application range of the voice interaction method, and further improving the user experience.
Fig. 5 schematically illustrates a block diagram of a voice interaction apparatus according to an embodiment of the disclosure.
As shown in fig. 5, the voice interaction apparatus 500 may include an intention determination module 510, a response mode determination module 520, and a response output module 530.
The intention determining module 510 is configured to perform semantic analysis on the interactive voice to obtain user intention information in response to receiving the interactive voice from the user.
The response mode determining module 520 is configured to determine a target voice response mode based on the user intention information.
The response output module 530 is configured to output voice response information according to the target voice response mode.
According to the embodiment of the disclosure, the response mode determining module 520 includes an application scene determining sub-module and a response mode determining sub-module.
And the application scene determining sub-module is used for determining application scene information based on the user intention information.
And the response mode determining submodule is used for determining a target voice response mode based on the application scene information.
According to an embodiment of the present disclosure, the voice interaction apparatus further includes an operation determining module for determining, based on the user intention information and after the voice response information is output in the target voice response mode, subject content information of the operation to be performed.
According to an embodiment of the present disclosure, the target voice response mode includes a response style.
The response mode determining submodule includes a first identifying unit and a response style determining unit.
The first identifying unit is used for identifying emotion information in the subject content information in response to the application scene information being first predetermined scene information.
And the response style determining unit is used for determining the response style of the target voice response mode based on the emotion information.
According to an embodiment of the present disclosure, the target voice response mode includes a response voice feature;
the response mode determining submodule comprises a second identification unit and a voice characteristic determining unit.
And a second recognition unit for recognizing the voice characteristics of the speaker in the subject content information in response to the application scene information being second predetermined scene information.
And the voice characteristic determining unit is used for determining response voice characteristics of the target voice response mode based on the voice characteristics of the speaker.
According to the embodiment of the disclosure, the response output module comprises a first voice response sub-module and a second voice response sub-module.
The first voice response sub-module is used for responding to the target voice response mode searched from the plurality of template voice response modes and outputting voice response information according to the target voice response mode.
And the second voice response sub-module is used for determining one template voice response mode from the plurality of template voice response modes as a target voice response mode according to the attention recommendation rule in response to the fact that the target voice response mode is not searched from the plurality of template voice response modes.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described above.
According to an embodiment of the present disclosure, a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
FIG. 6 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including an input unit 606, e.g., keyboard, mouse, etc., an output unit 607, e.g., various types of displays, speakers, etc., a storage unit 608, e.g., magnetic disk, optical disk, etc., and a communication unit 609, e.g., network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as voice interaction methods. For example, in some embodiments, the voice interaction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, e.g., storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the voice interaction method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the voice interaction method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (11)

1. A voice interaction method, comprising:
in response to receiving an interactive voice from a user, performing semantic analysis on the interactive voice to obtain user intention information;
determining application scenario information based on the user intention information, wherein the application scenario information represents a category of a terminal application used to perform an operation;
in response to the application scenario information being second predetermined scenario information, identifying a voice feature of a speaker in subject content information of the operation to be performed, wherein the subject content information is determined based on the user intention information;
determining a response voice feature of a target voice response mode based on the voice feature of the speaker; and
outputting voice response information according to the target voice response mode.
2. The method according to claim 1, further comprising, after outputting the voice response information in the target voice response mode:
determining, based on the user intention information, subject content information of the operation to be performed.
3. The method according to claim 2, wherein the target voice response mode includes a response style, the method further comprising:
in response to the application scenario information being first predetermined scenario information, identifying emotional information in the subject content information; and
determining the response style of the target voice response mode based on the emotional information.
4. The method according to any one of claims 1 to 3, wherein outputting the voice response information according to the target voice response mode comprises:
in response to the target voice response mode being found among a plurality of template voice response modes, outputting the voice response information according to the target voice response mode; and
in response to the target voice response mode not being found among the plurality of template voice response modes, determining one template voice response mode from the plurality of template voice response modes as the target voice response mode according to an attention recommendation rule.
5. A voice interaction apparatus, comprising:
an intention determining module configured to, in response to receiving an interactive voice from a user, perform semantic analysis on the interactive voice to obtain user intention information;
a response mode determining module configured to determine a target voice response mode based on the user intention information; and
a response output module configured to output voice response information according to the target voice response mode;
wherein the response mode determining module includes:
an application scenario determining submodule configured to determine application scenario information based on the user intention information, wherein the application scenario information represents a category of a terminal application used to perform an operation; and
a response mode determining submodule configured to determine the target voice response mode based on the application scenario information;
wherein the target voice response mode includes a response voice feature, and the response mode determining submodule includes:
a second recognition unit configured to, in response to the application scenario information being second predetermined scenario information, identify a voice feature of a speaker in subject content information of the operation to be performed, wherein the subject content information is determined based on the user intention information; and
a voice feature determining unit configured to determine the response voice feature of the target voice response mode based on the voice feature of the speaker.
6. The apparatus according to claim 5, further comprising:
an operation determining module configured to determine, based on the user intention information and after the voice response information is output in the target voice response mode, subject content information of the operation to be performed.
7. The apparatus according to claim 6, wherein the target voice response mode includes a response style, and the response mode determining submodule includes:
a first identifying unit configured to, in response to the application scenario information being first predetermined scenario information, identify emotional information in the subject content information; and
a response style determining unit configured to determine the response style of the target voice response mode based on the emotional information.
8. The apparatus according to any one of claims 5 to 7, wherein the response output module includes:
a first voice response submodule configured to, in response to the target voice response mode being found among a plurality of template voice response modes, output the voice response information according to the target voice response mode; and
a second voice response submodule configured to, in response to the target voice response mode not being found among the plurality of template voice response modes, determine one template voice response mode from the plurality of template voice response modes as the target voice response mode according to an attention recommendation rule.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 4.
10. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 4.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 4.
CN202111297097.3A 2021-11-03 2021-11-03 Voice interaction method, voice interaction device, electronic equipment and storage medium Active CN114093356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111297097.3A CN114093356B (en) 2021-11-03 2021-11-03 Voice interaction method, voice interaction device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111297097.3A CN114093356B (en) 2021-11-03 2021-11-03 Voice interaction method, voice interaction device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114093356A CN114093356A (en) 2022-02-25
CN114093356B true CN114093356B (en) 2025-08-01

Family

ID=80298861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111297097.3A Active CN114093356B (en) 2021-11-03 2021-11-03 Voice interaction method, voice interaction device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114093356B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331669A (en) * 2022-08-01 2022-11-11 中国工商银行股份有限公司 Information generation method, apparatus, device, medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597509A (en) * 2018-03-30 2018-09-28 百度在线网络技术(北京)有限公司 Intelligent sound interacts implementation method, device, computer equipment and storage medium
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109346076A (en) * 2018-10-25 2019-02-15 三星电子(中国)研发中心 Voice interaction, voice processing method, device and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9786299B2 (en) * 2014-12-04 2017-10-10 Microsoft Technology Licensing, Llc Emotion type classification for interactive dialog system
KR101747896B1 (en) * 2015-12-30 2017-06-16 주식회사 미니스쿨 Device and method for bidirectional preschool education service
CN109273001B (en) * 2018-10-25 2021-06-18 珠海格力电器股份有限公司 Voice broadcasting method and device, computing device and storage medium
CN112309379B (en) * 2019-07-26 2024-05-31 北京地平线机器人技术研发有限公司 Method, device, medium and electronic equipment for realizing voice interaction
WO2021081744A1 (en) * 2019-10-29 2021-05-06 深圳市欢太科技有限公司 Voice information processing method, apparatus, and device, and storage medium
CN112309390B (en) * 2020-03-05 2024-08-06 北京字节跳动网络技术有限公司 Information interaction method and device
CN111415642A (en) * 2020-03-31 2020-07-14 广东美的制冷设备有限公司 Voice broadcast method and device of electric equipment, air conditioner and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597509A (en) * 2018-03-30 2018-09-28 百度在线网络技术(北京)有限公司 Intelligent sound interacts implementation method, device, computer equipment and storage medium
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109346076A (en) * 2018-10-25 2019-02-15 三星电子(中国)研发中心 Voice interaction, voice processing method, device and system

Also Published As

Publication number Publication date
CN114093356A (en) 2022-02-25

Similar Documents

Publication Publication Date Title
US11823659B2 (en) Speech recognition through disambiguation feedback
US9865264B2 (en) Selective speech recognition for chat and digital personal assistant systems
EP3485489B1 (en) Contextual hotwords
US10360265B1 (en) Using a voice communications device to answer unstructured questions
CN110473546B (en) Method and device for recommending media files
CN112270925B (en) Platform for creating customizable dialog system engines
US10043520B2 (en) Multilevel speech recognition for candidate application group using first and second speech commands
US20190027147A1 (en) Automatic integration of image capture and recognition in a voice-based query to understand intent
US11829433B2 (en) Contextual deep bookmarking
US20180052824A1 (en) Task identification and completion based on natural language query
CN110263142A (en) Method and apparatus for output information
CN108701454A (en) Parameter Collection and Automatic Dialogue Generation in Dialogue Systems
CN110070859B (en) Voice recognition method and device
CN107707745A (en) Method and apparatus for extracting information
CN110956955A (en) A method and device for voice interaction
US11705113B2 (en) Priority and context-based routing of speech processing
CN111324626B (en) Search method, device, computer equipment and storage medium based on speech recognition
CN113674742A (en) Man-machine interaction method, device, equipment and storage medium
US20190066669A1 (en) Graphical data selection and presentation of digital content
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN113486170B (en) Natural language processing method, device, equipment and medium based on man-machine interaction
CN114093356B (en) Voice interaction method, voice interaction device, electronic equipment and storage medium
CN104424955B (en) Generate figured method and apparatus, audio search method and the equipment of audio
CN109712606A (en) A kind of information acquisition method, device, equipment and storage medium
WO2022271555A1 (en) Early invocation for contextual data processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant