Disclosure of Invention
The present application provides a digital person image driving method, apparatus, device, storage medium, and program product, which enrich the image change content of the digital person, improve the matching between the image of the digital person and the content broadcast by the digital person, and improve the user experience.
To achieve the above objective, the present application adopts the following technical solutions:
In a first aspect, a digital person image driving method is provided, applied to a digital person image driving engine. The method includes: obtaining dialogue text, the dialogue text being text with which a digital person responds based on user input; performing semantic analysis on the dialogue text to obtain image feature data matched with the semantic content of the dialogue text, the image feature data being used to indicate image change features of the digital person; and controlling the digital person to perform an image animation matched with the semantic content of the dialogue text based on the image feature data.
According to the solution provided by the present application, by semantically analyzing the text to be responded with, the content expressed by the digital person is matched with the image changes of the digital person. This enriches the image change content of the digital person, avoids relying on a small set of fixed actions, improves the personification degree of the digital person, and improves the user experience.
In one possible implementation, the image feature data includes one or more of mouth feature data, limb action feature data, or facial expression feature data.
In yet another possible implementation, the digital person image driving engine includes a text-to-speech engine. Performing semantic analysis on the dialogue text to obtain image feature data matched with the semantic content of the dialogue text may be specifically implemented as follows: performing speech conversion on the dialogue text through the text-to-speech engine to obtain speech information conforming to the character setting of the digital person; decomposing the words in the dialogue text into phonemes through the text-to-speech engine to obtain the phonemes corresponding to each word, and recording the start time and end time of the phonemes corresponding to each word in the speech information, where a phoneme is the pronunciation unit corresponding to a word; and time-aligning the speech information with the phonemes corresponding to each word according to the start time and end time of the phonemes in the speech information, to obtain the mouth feature data.
In another possible implementation, controlling the digital person to perform the image animation matched with the semantic content of the dialogue text based on the image feature data may be specifically implemented as follows: controlling the digital person to output the speech information based on the mouth feature data, and controlling the mouth of the digital person to make the mouth shape actions corresponding to the phonemes of the respective words.
In another possible implementation, the digital person image driving engine includes a behavior reasoning engine. Performing semantic analysis on the dialogue text to obtain the image feature data matched with the semantic content of the dialogue text may be specifically implemented as follows: decomposing the words in the dialogue text through the behavior reasoning engine to obtain text fields; and selecting from a behavior tree according to the text fields to obtain the limb action feature data matched with the semantic content of the dialogue text, where the behavior tree includes the action information corresponding to different text fields.
In still another possible implementation, controlling the digital person to perform the image animation matched with the semantic content of the dialogue text based on the image feature data may be specifically implemented as follows: controlling the limbs of the digital person to perform the limb actions corresponding to the text fields based on the limb action feature data.
In a further possible implementation, the digital person image driving engine includes an emotion analysis engine. Performing semantic analysis on the dialogue text to obtain image feature data matched with the semantic content of the dialogue text may be specifically implemented as follows: screening out the emotion text in the dialogue text through the emotion analysis engine, where the emotion text is text used to express emotion; and obtaining the facial expression feature data according to the type and emotion degree of the emotion text.
In still another possible implementation, obtaining the facial expression feature data according to the type and emotion degree of the emotion text may be specifically implemented as follows: determining the facial expression corresponding to the emotion text according to the type of the emotion text; obtaining the expression amplitude of the facial expression according to the emotion degree of the emotion text; and obtaining the facial expression feature data based on the facial expression and the expression amplitude.
In still another possible implementation, controlling the digital person to perform the image animation matched with the semantic content of the dialogue text based on the image feature data may be specifically implemented as follows: controlling the face of the digital person to make the facial expression corresponding to the emotion text based on the facial expression feature data.
In another possible implementation, the digital person image driving engine is deployed at the vehicle end and/or the cloud end.
In a further possible implementation, the digital person image driving method further includes: when the image feature data includes at least two of the mouth feature data, the limb action feature data, or the facial expression feature data, performing data synchronization on the at least two kinds of feature data according to a timestamp; and controlling the digital person to perform the image animation matched with the semantic content of the dialogue text according to the at least two kinds of feature data after data synchronization.
In a second aspect, a digital person image driving apparatus is provided, applied to a digital person image driving engine. The apparatus includes an acquisition module, a parsing module, and a driving module. Wherein:
the acquisition module is configured to obtain dialogue text, where the dialogue text is text with which the digital person responds based on user input;
the parsing module is configured to perform semantic analysis on the dialogue text to obtain image feature data matched with the semantic content of the dialogue text, where the image feature data is used to indicate image change features of the digital person; and
the driving module is configured to control the digital person to perform an image animation matched with the semantic content of the dialogue text based on the image feature data.
In one possible implementation, the image feature data includes one or more of mouth feature data, limb action feature data, or facial expression feature data.
In yet another possible implementation, the digital person image driving engine includes a text-to-speech engine. The parsing module is further configured to: perform speech conversion on the dialogue text through the text-to-speech engine to obtain speech information conforming to the character setting of the digital person; decompose the words in the dialogue text into phonemes through the text-to-speech engine to obtain the phonemes corresponding to each word, and record the start time and end time of the phonemes corresponding to each word in the speech information, where a phoneme is the pronunciation unit corresponding to a word; and time-align the speech information with the phonemes corresponding to each word according to the start time and end time of the phonemes in the speech information, to obtain the mouth feature data.
In another possible implementation, the driving module is further configured to control the digital person to output the speech information based on the mouth feature data, and to control the mouth of the digital person to make the mouth shape actions corresponding to the phonemes of the respective words.
In another possible implementation, the digital person image driving engine includes a behavior reasoning engine. The parsing module is further configured to: decompose the words in the dialogue text through the behavior reasoning engine to obtain text fields; and select from a behavior tree according to the text fields to obtain the limb action feature data matched with the semantic content of the dialogue text, where the behavior tree includes the action information corresponding to different text fields.
In still another possible implementation, the driving module is further configured to control the limbs of the digital person to perform the limb actions corresponding to the text fields based on the limb action feature data.
In a further possible implementation, the digital person image driving engine includes an emotion analysis engine. The parsing module is further configured to: screen out the emotion text in the dialogue text through the emotion analysis engine, where the emotion text is text used to express emotion; and obtain the facial expression feature data according to the type and emotion degree of the emotion text.
In still another possible implementation, the parsing module is further configured to: determine the facial expression corresponding to the emotion text according to the type of the emotion text; obtain the expression amplitude of the facial expression according to the emotion degree of the emotion text; and obtain the facial expression feature data based on the facial expression and the expression amplitude.
In still another possible implementation, the driving module is further configured to control the face of the digital person to make the facial expression corresponding to the emotion text based on the facial expression feature data.
In another possible implementation, the digital person image driving engine is deployed at the vehicle end and/or the cloud end.
In still another possible implementation, the parsing module is further configured to: when the image feature data includes at least two of the mouth feature data, the limb action feature data, or the facial expression feature data, perform data synchronization on the at least two kinds of feature data according to a timestamp; and control the digital person to perform the image animation matched with the semantic content of the dialogue text according to the at least two kinds of feature data after data synchronization.
The digital person image driving apparatus provided in the second aspect is configured to perform the digital person image driving method provided in the first aspect or any possible implementation of the first aspect. For the technical effects corresponding to any implementation of the second aspect, reference may be made to the technical effects corresponding to the first aspect or any implementation of the first aspect, and details are not repeated here.
In a third aspect, a computer device is provided, including a processor and a memory, where at least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor to implement the digital person image driving method according to the first aspect or any possible implementation of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, where at least one computer program is stored in the storage medium, and the at least one computer program is loaded and executed by a processor to implement the digital person image driving method according to the first aspect or any implementation of the first aspect.
In a fifth aspect, a computer program product is provided, where the computer program product includes a computer program or instructions which, when executed by a processor, implement the digital person image driving method according to the first aspect or any implementation of the first aspect.
In a sixth aspect, an embodiment of the present application provides a chip system, including at least one processor and at least one interface circuit, where the at least one interface circuit is configured to perform a transceiving function and send instructions to the at least one processor, and when the at least one processor executes the instructions, the at least one processor implements the digital person image driving method according to the first aspect or any implementation of the first aspect.
In a seventh aspect, an embodiment of the present application provides a vehicle. The vehicle includes a display screen in which a digital person is displayed, and the vehicle drives the digital person based on the digital person image driving method according to the first aspect or any possible implementation of the first aspect.
The foregoing third aspect to seventh aspect are solutions for implementing the method provided by the first aspect, and their specific implementation details are not described in detail here. For the technical effects corresponding to any implementation of the third aspect to the seventh aspect, reference may be made to the technical effects corresponding to the first aspect or any implementation of the first aspect, and details are not repeated here.
It should be noted that the various possible implementations of any of the above aspects can be combined provided that the solutions are not contradictory.
Detailed Description
In the embodiments of the present application, to facilitate a clear description of the technical solutions, the words "first", "second", and the like are used to distinguish between identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit quantity or execution order, and do not indicate that the items are necessarily different. Technical features described with "first" and "second" carry no order of sequence or magnitude.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate serving as an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" should not be construed as preferred or more advantageous than other embodiments or designs. Rather, the use of such words is intended to present related concepts in a concrete and easily understood manner.
In the embodiments of the present application, "at least one" may also be described as one or more, and "a plurality" may be two, three, four, or more, which is not limited in the present application.
In addition, the network architectures and scenarios described in the embodiments of the present application are intended to describe the technical solutions of the embodiments more clearly and do not constitute a limitation on them. Those skilled in the art will appreciate that, with the evolution of network architectures and the emergence of new service scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
It should be noted that the information (including but not limited to device information, personal information of objects, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.), and signals involved in the present application are all authorized by the objects concerned or fully authorized by all parties, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the dialogue text referred to in the present application is obtained with full authorization.
With the rapid development of new energy intelligent vehicles, in-vehicle voice assistants have gradually evolved from the original virtual figures into digital persons with human images. During interaction with the digital person, the digital person can act accordingly on a voice instruction. When a user issues an explicit voice instruction, the digital person can not only execute the voice instruction but also perform related actions while executing it.
Fig. 1 is a schematic diagram of triggering a digital person to perform an action in a conventional method. The process of triggering the digital person to execute the action includes: user voice input, local signal processing, local or cloud speech processing, dialogue management, digital person driving processing, returning dialogue text, and locally converting the text into speech, where the speech is output to the user.
Specifically, the user provides voice input through the in-vehicle voice assistant and triggers the in-vehicle voice assistant to start working. After receiving the voice input, the in-vehicle voice assistant first performs signal processing on the voice input locally (at the vehicle end), where the signal processing includes preprocessing operations such as noise reduction and framing. The voice input after signal processing can be processed locally or in the cloud.
The cloud or local speech processing includes speech recognition (automatic speech recognition, ASR) and semantic understanding (natural language understanding, NLU).
When the voice is processed locally, the vehicle-end processing engine performs ASR and NLU on the voice input, and, after understanding the intent and semantics of the voice input, responds to the voice input to obtain dialogue text. According to the dialogue text of the response, the vehicle-end processing engine converts the dialogue text into speech through text-to-speech (TTS) and plays it. On the other hand, the vehicle-end processing engine invokes a local large model and looks up the image action of the digital person in a preset action library, so that the image action is executed while the digital person plays the speech.
For example, the user speaks into the car microphone "small A, I want to hear the song of Star C. The local ASR module receives the voice signal of the user and converts the voice signal into text, i want to listen to the song of the star C. The "local NLU module analyzes this text, understands that the user's intent is" play music ", and designates the artist" star C ". It extracts key information, intent = play music, artist = star C. The vehicle side processing engine generates a proper dialogue text based on the NLU understanding result. For example, "good, a song of Star C is being played for you. After the local TTS module receives the dialogue text, and plays the song of the star C for you, the local TTS module synthesizes the dialogue text into voice and plays the voice through the vehicle-mounted sound equipment. Meanwhile, the vehicle-side processing engine processes the image action of the digital human small A. By simple semantic analysis of the dialog text (e.g., recognition of words such as "good", "play", etc.), a "action library" is then called, which is stored in advance in the vehicle side. In the action library, based on the word "good" indicating agreement or confirmation, and the action "play", it may find several preset actions, such as: and a small A slightly points the head, the corners of the mouth are raised, and the hands make a gesture of starting. Action b small A slightly leaning forward, eye concentration, and hand-to-hand ratio drawing a select/confirm action in the air. At this time, the user hears that small A says "good with synthesized sound, playing a song of Star C for you. At the same time, the user sees the digital person small a pico nod on the screen, smiles, and both hands make a gesture to start playing.
When the cloud-side processing engine processes the voice input, the cloud-side processing engine performs ASR and NLU on the voice input, and, after understanding the intent and semantics of the voice input, responds to the voice input to obtain dialogue text. At the same time, it drives the image of the digital person by invoking a large model, i.e., drives the image of the digital person to act, such as mouth shape driving and limb driving. According to the dialogue text of the response, the cloud-side processing engine converts the dialogue text into speech through TTS and plays it. On the other hand, the cloud-side processing engine processes the image actions of the digital person found in the preset action library and drives the mouth shape and limbs of the digital person, so that the image actions are executed while the digital person plays the speech.
For example, the user speaks into the car microphone "small A, helping me find the nearest gas station". The voice signal 'small A' helps me find the nearest gas station 'and uploads the nearest gas station' to the cloud side processing engine through network connection (such as 4G/5G/Wi-Fi) of the vehicle side. After receiving the voice signal, the ASR module in the cloud side processing engine converts the voice signal into a text, namely 'help me find the nearest gas station'. The NLU module in the "cloud side processing engine analyzes this text, understanding that the user's intent is" find place ", specifically" gas station ", and with the constraint of" nearest ". The cloud side processing engine generates a proper dialogue text based on the NLU understanding result. For example, "good, is looking for nearby gas stations for you, and is immediately good. The cloud side processing engine sends the dialogue text to the TTS module at the vehicle terminal, and the TTS module synthesizes the dialogue text into voice after the TTS module receives the dialogue text and searches nearby gas stations for you and immediately gets good, and plays the voice through the vehicle-mounted sound equipment. Meanwhile, the cloud side processing engine processes the image action of the digital person small A. The cloud side processing engine calls the large model to analyze the dialogue text, generates an accurate mouth-shaped transformation instruction, and ensures perfect synchronization of mouth animation of the digital person small A and voice synthesized by TTS. And, the large model further analyzes the semantics and emotion of the text. It recognizes that "good" indicates consent. Based on these analyses, the large model looks up or generates appropriate actions in real time from a more massive and finer action library in the cloud. It is possible to select a nodding action that indicates "confirm/agree". The cloud side processing engine transmits the dialogue text and the generated animation data (the mouth shape driving instruction and the limb driving instruction) to the vehicle side. And the audio player at the vehicle terminal receives and plays the voice converted from the dialogue text. The vehicle machine end renders the digital person and executes the mouth shape driving instruction and the limb driving instruction, so that the mouth shape and the limb synchronously act according to the cloud side processing engine. At this time, the user hears that the small A is "good" in the synthesized voice, and is looking for nearby gas stations for you, just now. At the same time, the user sees the digital person small A smiling nodding head on the screen, and the shape of the mouth changes accurately with speaking content (mouth shape synchronization).
However, in the above technology, the response actions of the digital person are all fixed; that is, the action corresponding to a trigger instruction is executed whenever that trigger instruction occurs. As a result, the image of the digital person is single and programmed, and the image of the digital person matches the broadcast content of the digital person poorly.
In view of this, the present application provides a digital person image driving method, which obtains dialogue text with which a digital person responds based on user input, performs semantic analysis on the dialogue text to obtain image feature data matched with the semantic content of the dialogue text, and controls the digital person to perform an image animation matched with the semantic content of the dialogue text based on the image feature data. According to the method, by semantically analyzing the text to be responded with, the content expressed by the digital person is matched with the image changes of the digital person, which enriches the image change content of the digital person, avoids relying on a small set of fixed actions, improves the personification degree of the digital person, and improves the user experience.
The following describes the embodiments of the present application with reference to the drawings.
The solution provided by the present application can be applied to the computing system shown in Fig. 2. As shown in Fig. 2, the computing system includes a digital person image driving engine. The digital person image driving engine may run in different locations, such as the vehicle end and the cloud end.
At the vehicle end, the digital person image driving engine runs locally in the vehicle and can operate even in a network-free environment, which is suitable for basic vehicle control and for poor network conditions. At the cloud end, the digital person image driving engine runs in a remote server; in this case the vehicle and the cloud are connected by a network, and the digital person image driving engine in the cloud can only run when the network is available.
Optionally, the vehicle may be a new energy vehicle, a fuel vehicle, a motorcycle, a tricycle, an autonomous logistics vehicle, an electric truck, or the like, but is not limited thereto, and the embodiments of the present application are not specifically limited in this respect.
The vehicle supports an in-vehicle voice assistant, i.e., the vehicle is capable of supporting voice interaction. The vehicle further includes a display screen in which a digital person is displayed, and the digital person can interact with the user.
Optionally, the cloud may be an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, cloud databases, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data.
Fig. 3 is a schematic diagram of a digital person image driving method according to an embodiment of the present application, and the method is applied to a digital person image driving engine. The digital person image driving engine may run at the vehicle end in the computing system illustrated in Fig. 2, or at the cloud end in the computing system illustrated in Fig. 2.
As shown in Fig. 3, the digital person image driving method may include the following steps.
Step 302: the digital person image driving engine obtains dialogue text.
The dialogue text is text with which the digital person responds based on user input, also referred to as "response text".
Illustratively, the user interacts with the in-vehicle voice assistant on the vehicle, and the dialogue text is the text that replies to the user input. For example, the user says to the in-vehicle voice assistant, "Small A, I want to hear a song by Star C." The in-vehicle voice assistant replies, "OK, playing a song by Star C for you." Here, "OK, playing a song by Star C for you" is the dialogue text.
Optionally, the dialogue text may be a reply to the user's voice input or a reply to the user's instruction.
The digital person image driving engine is a software system running at the vehicle end, in the cloud, or in another computing environment. It generates the actions, expressions, and language feedback of the digital person by analyzing the user's instructions or the dialogue text, thereby realizing natural interaction between the digital person and the user.
A digital person is a constructed virtual character. Digital persons not only possess the appearance and form of an anthropomorphic or real human being, but also interact with humans through speech, expressions, and limb movements. Optionally, the digital person may be a 3D human figure or a 2D human figure.
In addition, the form of the digital person is not limited to a human; that is, the digital person may also be an animated dog, an animated cat, a plant, or the like, but is not limited thereto, and the embodiments of the present application are not specifically limited in this respect.
Step 304: the digital person image driving engine performs semantic analysis on the dialogue text to obtain image feature data matched with the semantic content of the dialogue text.
The image feature data is used to indicate image change features of the digital person.
Illustratively, the digital person image driving engine obtains the semantic content of the dialogue text by semantically parsing the dialogue text, such as whether the dialogue text is happy, surprised, a query, or a complaint, and generates a set of image feature data accordingly, thereby driving the image of the digital person.
Optionally, the image feature data includes one or more of mouth feature data, limb action feature data, or facial expression feature data, but is not limited thereto, and the embodiments of the present application are not specifically limited in this respect.
The mouth feature data is data used to indicate the mouth morphological features of the digital person. That is, the mouth morphology of the digital person can be determined from the mouth feature data.
Optionally, the mouth morphology includes the mouth shape, the degree of mouth opening and closing, the state of the mouth corners, and the like, but is not limited thereto, and the embodiments of the present application are not limited in this respect.
The mouth shape is the core part of the mouth morphology and is mainly used to match the speech. Different phonemes require different mouth shapes. For example, when producing the /a/ sound, the mouth is usually open in a round or oval shape. When producing the /i/ sound, the corners of the mouth stretch to both sides and the lips become slightly thinner. When producing the /u/ sound, the mouth closes tightly and protrudes forward into a round shape. When producing the /s/ or /f/ sound, the lips come close to or separate from each other, forming a narrow gap. A closed mouth is used for transitions between vowels or for silent portions. The digital person image driving engine selects or generates the corresponding mouth shape according to the phoneme to be produced.
The degree of mouth opening and closing refers to how wide the mouth is open.
The state of the mouth corners includes: raised, which usually represents positive emotions such as smiling, happiness, and delight; drooping, which may represent negative emotions such as sadness, dissatisfaction, and fatigue; level, which is neutral or represents a serious, neutral statement; and asymmetric, which may represent confusion or mockery, or may simply be a natural distortion when speaking.
The limb action feature data is data used to indicate the limb action features of the digital person. That is, the limb movements of the digital person can be determined from the limb action feature data.
Optionally, the limbs include the hands, feet, upper body, lower body, and so on of the digital person. The limb morphology refers to the specific posture or motion state exhibited by the limbs of the digital person. It includes the following main aspects:
(1) Joint angles/rotations.
Joint angles/rotations are the most central data of limb movements. Each movable joint (e.g., shoulder, elbow, wrist, hip, knee, ankle) has one or more angular values defining its flexion, extension, rotation, and so on. For example: arm, the degree of bending of the elbow joint (bent vs. straight), the lifting angle of the shoulder joint, and the forward/backward swing angle; leg, the degree of bending of the knee joint (bent vs. straight) and the lifting, lowering, and forward/backward swing angles of the hip joint; trunk, the bending and twisting angles of the spine (leaning forward, leaning backward, twisting left and right); fingers and wrist, the degree of bending of the fingers (fist vs. open) and the rotation angle of the wrist. Body posture/balance refers to the posture of the whole body relative to the ground, including the position of the center of gravity and the angle of inclination of the body, for example, standing straight or leaning forward slightly.
(2) Gestures/hand actions.
Gestures/hand actions include pointing, waving, making a fist, making specific gestures (e.g., to indicate "OK" or "stop"), taking or manipulating objects, and so on.
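As an illustration of how such limb morphology information might be organized as limb action feature data, the following is a minimal sketch; the field names, units (degrees), and keyframe layout are assumptions made for explanation only and are not a format prescribed by the application:

```python
from dataclasses import dataclass, field

# Illustrative sketch of limb action feature data as time-stamped keyframes of
# joint rotations plus an optional named gesture.
@dataclass
class JointRotation:
    joint: str            # e.g. "elbow_right", "shoulder_left", "knee_left"
    angles_deg: tuple     # flexion/extension, abduction, rotation (in degrees)

@dataclass
class LimbActionKeyframe:
    timestamp_s: float                              # when this pose should be reached
    joints: list = field(default_factory=list)      # list of JointRotation values
    gesture: str = ""                               # e.g. "point", "wave", "fist"

# A slight nod followed by an inviting hand gesture, expressed as two keyframes:
limb_action_data = [
    LimbActionKeyframe(0.0, [JointRotation("neck", (10.0, 0.0, 0.0))]),
    LimbActionKeyframe(0.6, [JointRotation("elbow_right", (70.0, 0.0, 0.0))], gesture="invite"),
]
```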
The facial expression feature data is data used to indicate the facial expression features of the digital person. That is, the facial expression of the digital person can be determined from the facial expression feature data.
Optionally, facial expressions generally include, but are not limited to: happiness (manifestation: raised mouth corners, narrowed eyes), sadness (drooping mouth corners, possibly accompanied by tears), anger (lowered, drawn-together eyebrows, glaring eyes, lips pressed together or tightened), fear (raised eyebrows, widened eyes, slightly open lips), surprise (raised eyebrows, widened eyes, slightly open mouth), disgust (raised upper lip, slightly wrinkled nose), and contempt (manifestation: one mouth corner slightly raised), and the like.
For example, dialogue text is "about how is i just sprinkling water on the keyboard. The digital portrait drive engine analyzes the dialogue text, and the digital portrait drive engine understands that the user is possibly surprised, has a little confusion and seeks help, so that the obtained portrait feature data matched with the semantic content of the dialogue text are facial expression feature data such as raised eyebrows (representing surprised), possible slight sagging of mouth corners (representing worry or annoyance) and large eyes. And generating corresponding mouth shape changes (such as mouth shapes for emitting voices of o, water and the like) according to the dialogue text. Limb motion feature data, body slightly forward, indicates concern, one hand may be lifted, as if wiping was simulated, and the head may be slightly askew to one side, indicating thinking or confusion.
Step 306: the digital person image driving engine controls the digital person to perform an image animation matched with the semantic content of the dialogue text based on the image feature data.
The image animation is the image change animation obtained after the digital person image driving engine adjusts the image of the digital person according to the image feature data.
Illustratively, after the digital person image driving engine obtains the image feature data, it drives the image of the digital person according to the image feature data and renders the image of the digital person, so that the image of the digital person changes according to the image feature data.
For example, the obtained image feature data includes: facial expression feature data, namely raised eyebrows (indicating surprise), possibly slightly drooping mouth corners (indicating worry or annoyance), and widened eyes; mouth feature data, namely the corresponding mouth shape changes generated according to the dialogue text (such as the mouth shapes for "oh" and "water"); and limb action feature data, namely the body leaning slightly forward to indicate concern, one hand possibly lifted as if simulating wiping, and the head possibly tilted slightly to one side to indicate thinking or confusion. The digital person image driving engine controls the digital person to perform the image animation according to this image feature data. After rendering, what the user sees is that the eyebrows of the digital person are raised and its eyes are wide open; its mouth opens, closes, and changes shape along with the pronunciation of the sentence "Oh no, I just spilled water on the keyboard, what should I do?"; its body leans slightly forward, one hand is lifted, and its head tilts slightly to one side.
In summary, according to the solution provided by the present application, by semantically analyzing the text to be responded with, the content expressed by the digital person is matched with the image changes of the digital person, which enriches the image change content of the digital person, avoids relying on a small set of fixed actions, improves the personification degree of the digital person, and improves the user experience.
Further, as shown in the flowchart of the digital person image driving method in Fig. 4, since the image feature data includes mouth feature data, limb action feature data, or facial expression feature data, step 304 may be implemented as step 3041, step 3042, and step 3043 for the different kinds of image feature data, and step 306 may correspondingly be implemented as step 3061, step 3062, and step 3063.
When the image feature data is mouth feature data, steps 304 and 306 are implemented as steps 3041 and 3061.
When the image feature data is limb action feature data, steps 304 and 306 are implemented as steps 3042 and 3062.
When the image feature data is facial expression feature data, steps 304 and 306 are implemented as steps 3043 and 3063.
It should be noted that steps 3041, 3042, and 3043 may be combined, and steps 3061, 3062, and 3063 may also be combined; that is, different kinds of image feature data can be embodied together on one digital person. In other words, when the image feature data includes at least two of the mouth feature data, the limb action feature data, or the facial expression feature data, the at least two kinds of feature data are data-synchronized according to a timestamp, and the digital person is controlled to perform the image animation matched with the semantic content of the dialogue text according to the at least two kinds of feature data after data synchronization.
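The following is a minimal sketch of such timestamp-based synchronization, assuming each kind of feature data is a list of time-stamped keyframes; the 30 fps sampling step and the hold-last-value policy are assumptions made for illustration, not details specified by the application:

```python
from bisect import bisect_right

# Resample several keyframe streams (mouth, limbs, expression) onto one common
# timeline so that every rendered frame uses mutually consistent feature data.
def synchronize(streams, step=1 / 30):
    """streams: {name: [(timestamp_seconds, payload), ...]} sorted by time."""
    end = max(t for frames in streams.values() for t, _ in frames)
    timeline = []
    t = 0.0
    while t <= end + 1e-9:
        frame = {}
        for name, frames in streams.items():
            times = [ts for ts, _ in frames]
            idx = bisect_right(times, t) - 1   # last keyframe at or before time t
            frame[name] = frames[idx][1] if idx >= 0 else None
        timeline.append((round(t, 3), frame))
        t += step
    return timeline

# Mouth and expression keyframes aligned onto a single timeline:
synced = synchronize({
    "mouth": [(0.1, "n"), (0.3, "i")],
    "expression": [(0.0, "smile")],
})
```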
In one possible implementation, when the image feature data is mouth feature data, steps 304 and 306 are implemented as steps 3041 and 3061.
Step 3041: perform semantic analysis on the dialogue text to obtain mouth feature data matched with the semantic content of the dialogue text.
Illustratively, speech conversion is performed on the dialogue text through a text-to-speech engine in the digital person image driving engine to obtain speech information conforming to the character setting of the digital person. The words in the dialogue text are decomposed into phonemes through the text-to-speech engine to obtain the phonemes corresponding to each word, and the start time and end time of these phonemes in the speech information are recorded. The speech information is then time-aligned with the phonemes corresponding to each word according to the start time and end time of the phonemes in the speech information, to obtain the mouth feature data.
The text-to-speech engine is a TTS engine and is used to convert text into sound.
Speech information conforming to the character setting of the digital person refers to speech information obtained by adjusting the timbre, speed, pitch, tone, and so on of the sound according to the character setting of the digital person. For example, if Small A is set as a young, lively boy, the TTS engine will generate speech information from the dialogue text that sounds young and lively, possibly at a slightly faster pace. If Small A is set as a mature, steady girl, the TTS engine will generate speech information that is slower, lower, and softer.
A phoneme is the pronunciation unit corresponding to a word.
Specifically, while generating the speech information, the TTS engine decomposes the dialogue text into the most basic pronunciation units, i.e., phonemes (e.g., "a", "o", "e", "b", "p" in Chinese). It then precisely records from which point in time to which point in time each phoneme occurs in the generated speech information, as if each "syllable fragment" of the speech were time-stamped. The TTS engine obtains the mouth feature data based on the speech information and the phonemes.
For example, the dialog text is "hello, today's weather is good |", the TTS engine breaks the dialog text into the most basic pronunciation units, phonemes are ni, hao, jin, tian, tian, qi, zhen, hao, and the start and stop time of each phoneme in the voice file is recorded accurately. For example, "ni" starts at 0.1 seconds and ends at 0.3 seconds. "hao" starts from 0.3 seconds and ends at 0.6 seconds. "jin" starts from 0.6 seconds, ends 0.9 seconds, and so on.
Step 3061: based on the mouth feature data, control the digital person to output the speech information, and control the mouth of the digital person to make the mouth shape actions corresponding to the phonemes of the respective words.
Illustratively, based on the mouth feature data, the digital person is controlled to output the speech information, and the mouth of the digital person is controlled to make the mouth shape actions corresponding to the phonemes of the respective words.
Specifically, after obtaining the speech information and the phonemes in the mouth feature data and aligning the phonemes with the speech information, the TTS engine can use this information to control the mouth movements of the digital person. For example, when the TTS engine broadcasts an "a" sound, the mouth of the digital person takes the shape of the "a" sound (e.g., opening wide); when the TTS engine broadcasts an "n" sound, the mouth of the digital person changes into the shape of the "n" sound (e.g., the mouth corners retract slightly and the tongue tip rests against the upper gums). This process is performed in real time, ensuring that the lip movements of the digital person match the sounds it emits.
For example, the dialogue text is "hello, today's weather is good |", the TTS engine controls the mouth of the digital person to make mouth-shaped actions corresponding to phonemes corresponding to each word according to the phonemes, for example, when the voice is played for 0.1 seconds, the TTS engine knows the first phoneme "n" which should now send "ni" according to the alignment result, and then drives the mouth of the small A to take the shape of the starting "n" sound (the mouth corner may be slightly inwardly retracted, and the tongue tip is close to the upper gum). When playing for 0.3 seconds, switch to the mouth with "i" tone (flat mouth, open mouth). When playing for 0.3 seconds, "hao" starts and the mouth switches to the shape of "h" (slightly open lips, breath out). The shape of the hair "a" is switched immediately (mouth is large). Then switch to the shape of the "o" (round mouth). This process continues until the phones "o" in "hao" are spoken, and the mouth shape of the small a will change smoothly in real time according to the pronunciation requirements of each phone.
In another possible implementation, when the image feature data is limb action feature data, steps 304 and 306 are implemented as steps 3042 and 3062.
Step 3042: perform semantic analysis on the dialogue text to obtain limb action feature data matched with the semantic content of the dialogue text.
Illustratively, the words in the dialogue text are decomposed by a behavior reasoning engine in the digital person image driving engine to obtain text fields. A selection is then made from a behavior tree according to the text fields to obtain the limb action feature data matched with the semantic content of the dialogue text.
The behavior tree includes the action information corresponding to different text fields. The behavior reasoning engine is used to understand and infer the semantic content of the dialogue text, that is, to reason about what the words in the dialogue text are saying and which emotions, intents, and specific contents they contain.
For example, the dialogue text is "The weather is really nice today, let's go to the park for a picnic!" The behavior reasoning engine decomposes this sentence, and the obtained text fields are "the weather is really nice" (positive emotion) and "picnic in the park" (activity proposal). For the text field "the weather is really nice", it is known that the field expresses positive emotion, so gestures of appreciation or approval can be looked up from the behavior tree, such as nodding gently or pointing upward with a hand as if pointing to the clear sky. For the text field "picnic in the park", it is known to be a specific activity, so more specific, more vivid gestures can be looked up from the behavior tree, such as both hands making a "let's go" gesture or a finger pointing into the distance, representing "going to that place". The action data obtained from the behavior tree is summarized to obtain the limb action feature data.
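The selection from the behavior tree can be pictured roughly as follows; the tree nodes, keywords, and action names below are illustrative assumptions, not content of the application:

```python
# Hypothetical behavior tree: each node carries matching keywords and the
# action information to use when a text field matches that node.
BEHAVIOR_TREE = {
    "positive_emotion": {"keywords": ["really nice", "great"],
                         "actions": ["nod_gently", "point_to_sky"]},
    "activity_proposal": {"keywords": ["picnic", "park"],
                          "actions": ["lets_go_gesture", "point_into_distance"]},
}

def select_limb_actions(text_fields):
    """Collect the actions of every node whose keywords match a text field."""
    selected = []
    for text_field in text_fields:
        for node in BEHAVIOR_TREE.values():
            if any(keyword in text_field for keyword in node["keywords"]):
                selected.extend(node["actions"])
    return selected

limb_action_feature_data = select_limb_actions(
    ["the weather is really nice", "picnic in the park"])
# ['nod_gently', 'point_to_sky', 'lets_go_gesture', 'point_into_distance']
```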
Step 3062: based on the limb action feature data, control the limbs of the digital person to perform the limb actions corresponding to the text fields.
Illustratively, after analyzing the dialogue text, the behavior reasoning engine plans a series of suitable gesture actions and leg actions according to the analysis result. These actions need to be associated with the text fields just analyzed, such as fields expressing happiness or confusion, emphasizing a certain point, or describing a certain action. The planned actions then instruct the digital person to perform the corresponding limb actions.
For example, the dialogue text is "The weather is really nice today, let's go to the park for a picnic!" The behavior reasoning engine controls the limbs of the digital person to perform the corresponding limb actions according to the limb action feature data. What the user sees is the digital person nodding gently and pointing a finger toward the clear sky, and then making a "let's go" gesture with its hands.
In yet another possible implementation, when the image feature data is facial expression feature data, steps 304 and 306 are implemented as steps 3043 and 3063.
Step 3043: perform semantic analysis on the dialogue text to obtain facial expression feature data matched with the semantic content of the dialogue text.
Illustratively, the emotion text in the dialogue text is screened out by an emotion analysis engine, and the facial expression feature data is obtained according to the type and emotion degree of the emotion text.
The emotion text is text used to express emotion, and the emotion degree is used to represent the intensity with which the emotion is expressed.
Optionally, the facial expression feature data includes the facial expression and the expression amplitude of the facial expression. The emotion analysis engine determines the facial expression corresponding to the emotion text according to the type of the emotion text, obtains the expression amplitude of the facial expression according to the emotion degree of the emotion text, and obtains the facial expression feature data based on the facial expression and the expression amplitude.
For example, the dialogue text is "the new restaurant we eat today is too excellent for | food is delicious, the service is also good, i go beyond happy |", and the emotion analysis engine can judge that emotion is very happy by screening out emotion texts in the dialogue text, i. The emotion degree is high. Thus, the emotion analysis engine obtains facial expressions including 'large smile, eyes possibly squinting (like laughing eyes), eyebrows slightly lifting', and expression amplitude including 'mouth corner height Gao Yangqi', according to the type and emotion degree of emotion text.
Step 3063: based on the facial expression feature data, control the face of the digital person to make the facial expression corresponding to the emotion text.
Illustratively, after analyzing the emotion text in the dialogue text, the emotion analysis engine plans a suitable facial expression according to the analysis result. The facial expression needs to be associated with the emotion text just analyzed.
For example, the dialogue text is "The new restaurant we ate at today was amazing! The food was delicious, the service was great, I'm so happy!" The emotion analysis engine controls the face of the digital person to make the corresponding expression according to the facial expression and the expression amplitude. What the user sees is the mouth corners of the digital person raised high in a big smile, the eyes possibly narrowed (like smiling eyes), and the eyebrows raised slightly.
The above embodiments describe the digital person image driving method. The method is described below with specific examples.
Illustratively, the user provides voice input through the in-vehicle voice assistant and triggers the in-vehicle voice assistant to start working. After receiving the voice input, the in-vehicle voice assistant first performs signal processing on the voice input locally (at the vehicle end), where the signal processing includes preprocessing operations such as noise reduction and framing. The voice input after signal processing can be processed locally or in the cloud.
When the digital person image driving engine is deployed at the vehicle end, as shown in the schematic diagram of digital person image driving at the vehicle end in Fig. 5, the digital person image driving engine at the vehicle end performs local speech processing (ASR and NLU) on the voice input after signal processing, and, after understanding the intent and semantics of the voice input, responds to the voice input to obtain dialogue text. The digital person image driving engine at the vehicle end performs semantic analysis on the dialogue text to obtain image feature data matched with the semantic content of the dialogue text. The animation engine then generates the animation of the digital person according to the image feature data, the rendering engine renders it to obtain a rendered video file of the digital person, and the image animation of the video file is presented to the user.
When the digital person image driving engine is deployed in the cloud, as shown in the schematic diagram of digital person image driving in the cloud in Fig. 6, the digital person image driving engine in the cloud performs cloud speech processing on the voice input after signal processing (i.e., performs speech recognition in the cloud), and, after understanding the intent and semantics of the voice input, responds to the voice input to obtain dialogue text. The digital person image driving engine in the cloud performs semantic understanding on the dialogue text and determines the scenario of the dialogue text, for example, determining that the user is chatting casually with the in-vehicle voice assistant. The digital person image driving engine in the cloud performs speech conversion on the dialogue text through the text-to-speech engine to obtain speech information conforming to the character setting of the digital person. At the same time, the words in the dialogue text are decomposed into phonemes through the text-to-speech engine to obtain the phonemes corresponding to each word, and the start time and end time of the phonemes in the speech information are recorded. The speech information is time-aligned with the phonemes corresponding to each word according to the start time and end time of the phonemes in the speech information, to obtain the mouth feature data. In addition, the digital person image driving engine in the cloud decomposes the words in the dialogue text through the behavior reasoning engine to obtain text fields, and selects from the behavior tree according to the text fields to obtain the limb action feature data matched with the semantic content of the dialogue text. The digital person image driving engine in the cloud also screens out the emotion text in the dialogue text through the emotion analysis engine, and obtains the facial expression feature data according to the type and emotion degree of the emotion text. The digital person image driving engine in the cloud saves the mouth feature data, the limb action feature data, and the facial expression feature data into a material library. Then, the animation engine generates the animation of the digital person according to the mouth feature data, limb action feature data, and facial expression feature data in the material library, the rendering engine renders it to obtain a rendered video file of the digital person, and the image animation of the video file is presented to the user.
For example, the user chats with Small A through the in-vehicle microphone: the user says "The weather is nice today", and Small A replies "The weather is really nice today, let's go for a walk in the park!" The dialogue text is "The weather is really nice today, let's go for a walk in the park!"
The digital person image driving engine calculates the complete phoneme sequence needed to say "The weather is really nice today, let's go for a walk in the park!" and the corresponding precise mouth shape change time points, obtaining the mouth feature data.
The behavior reasoning engine analyzes that the content is about "nice weather" and "a walk in the park", and plans possible gestures, such as pointing out of the window (if the interface allows) or making a gesture representing "setting off" or "enjoyment" (such as opening both hands as if embracing the sunlight), obtaining the limb action feature data.
The emotion analysis engine analyzes that the words are filled with positive, pleasant emotion ("really nice", "let's go for a walk in the park"), and plans a happy expression (a smile, curved eyes) and possibly an accompanying relaxed, pleasant body posture (leaning slightly forward, a lively body), obtaining the facial expression feature data.
After receiving one or more of the mouth feature data, the limb action feature data, or the facial expression feature data, the animation engine starts to drive the digital person. For example, according to the mouth feature data, the mouth needs to make the corresponding mouth shape for each sound precisely. When saying "really", the expression needs to switch to a happy smile and the eyes also brighten. When saying "let's go for a walk in the park", a relaxed gesture can be coordinated, such as lifting a hand, bending the fingers slightly and then stretching them outward, simulating an inviting motion or a motion toward the distance. At the same time, the body posture remains relaxed and pleasant.
The final overall presentation of the digital person small A is as follows: the mouth opens and closes naturally along with the voice content while saying "The weather is really nice today, let's go for a walk in the park!", a happy smile gradually spreads across the face, and the eyes appear bright. When speaking the second half of the sentence, small A may add a relaxed, inviting gesture. Throughout the process, small A's posture is relaxed and positive, and the gaze is also pleasant.
The foregoing has mainly described the solutions provided by the present application. Correspondingly, the application also provides a digital person image driving apparatus for implementing the above method embodiments.
In some embodiments, the digital person image driving apparatus includes corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer-software-driven hardware depends on the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments of the application, the functional modules of the digital person image driving apparatus can be divided according to the above method embodiments; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or as software functional modules. It should be noted that the division of modules in the embodiments of the present application is schematic and merely a logical function division; other division manners may be used in actual implementation.
In some embodiments, the present application provides a digital person image driving apparatus, where the digital person image driving apparatus is configured to implement the functions in the above-described digital person image driving method embodiments. Fig. 7 schematically shows the structure of the digital person image driving apparatus. The digital person image driving apparatus may include an acquisition module 701, a parsing module 702, and a driving module 703.
The acquisition module 701 is configured to perform the operations of step 302 in the methods illustrated in fig. 3 and fig. 4. The parsing module 702 is configured to perform the operations of steps 304, 3041, 3042 and 3043 in the methods illustrated in fig. 3 and fig. 4. The driving module 703 is configured to perform the operations of steps 306, 3061, 3062 and 3063 in the methods illustrated in fig. 3 and fig. 4.
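For orientation only, the module division above might be mirrored in code roughly as follows; the class and method names are hypothetical and the bodies are placeholders, since the actual behavior is defined by the method embodiments.

    class DigitalPersonImageDrivingApparatus:
        """Illustrative grouping of the three modules; bodies are placeholders."""

        def acquire_dialog_text(self, user_input):
            # Acquisition module 701: obtain the dialogue text that the digital
            # person responds with based on the user input (step 302).
            ...

        def parse_avatar_features(self, dialog_text):
            # Parsing module 702: semantic analysis yielding mouth, limb action and
            # facial expression feature data (steps 304, 3041-3043).
            ...

        def drive_avatar(self, features):
            # Driving module 703: control the digital person to make the avatar
            # animation matched with the dialogue text (steps 306, 3061-3063).
            ...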
As shown in fig. 8, a computer device provided by an embodiment of the present application may include a processor 801, a bus 802, a communication interface 803, and a memory 804. The processor 801, the memory 804 and the communication interface 803 communicate via the bus 802. It should be understood that the present application does not limit the number of processors or memories in the computer device.
The bus 802 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, a UB bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus. The bus 802 may include a path for transferring information between the components of the computer device (e.g., the memory 804, the processor 801 and the communication interface 803).
The processor 801 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 804 may include volatile memory, such as random access memory (RAM). The memory 804 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid state drive (SSD).
The communication interface 803 enables communication between the computer device and other devices or communication networks by using a transceiver module such as, but not limited to, a network interface card or a transceiver.
The memory 804 stores executable program code, and the processor 801 executes the executable program code to implement the functions of the digital person image driving apparatus in the foregoing method embodiments. That is, the memory 804 stores a program for executing the above digital person image driving method.
In yet another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, and the at least one computer program is loaded and executed by a processor to implement the digital person image driving method provided by the above method embodiments.
In a further aspect, a computer program product is provided, which includes a computer program or instructions that, when executed by a processor, implement the digital person image driving method described in the above aspects.
In yet another aspect, a system on a chip is provided, including at least one processor and at least one interface circuit. The at least one interface circuit is configured to perform a transceiving function and to send instructions to the at least one processor, and when the instructions are executed by the at least one processor, the at least one processor implements the digital person image driving method described in the above aspect.
The method steps in these embodiments may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a computer device. The processor and the storage medium may also reside as discrete components in a computer device.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user device, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium such as a floppy disk, a hard disk or a magnetic tape, an optical medium such as a digital video disc (DVD), or a semiconductor medium such as a solid state drive (SSD).
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.