
CN120707709A - Digital human image driving method, device, equipment, storage medium and product - Google Patents

Digital human image driving method, device, equipment, storage medium and product

Info

Publication number
CN120707709A
Authority
CN
China
Prior art keywords
text, feature data, digital human, image, digital
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510863538.3A
Other languages
Chinese (zh)
Inventor
李春燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FAW Toyota Motor Co Ltd
Original Assignee
Tianjin FAW Toyota Motor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin FAW Toyota Motor Co Ltd filed Critical Tianjin FAW Toyota Motor Co Ltd
Priority to CN202510863538.3A priority Critical patent/CN120707709A/en
Publication of CN120707709A publication Critical patent/CN120707709A/en
Pending legal-status Critical Current

Classifications

    • G06T 13/40 — 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06F 40/35 — Handling natural language data; semantic analysis; discourse or dialogue representation
    • G06N 5/04 — Computing arrangements using knowledge-based models; inference or reasoning models
    • G06T 13/205 — 3D [Three Dimensional] animation driven by audio data
    • G10L 15/1822 — Speech recognition using natural language modelling; parsing for meaning understanding
    • G10L 15/25 — Speech recognition using non-acoustical features: position of the lips, movement of the lips or face analysis
    • G10L 15/26 — Speech to text systems
    • G10L 25/63 — Speech or voice analysis specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Psychiatry (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Hospice & Palliative Care (AREA)
  • Processing Or Creating Images (AREA)

Abstract


Disclosed are a method, apparatus, device, storage medium, and product for driving a digital human image, relating to the field of computer technology. The method comprises: obtaining a dialogue text, which refers to the text to which a digital human responds based on user input; performing semantic parsing on the dialogue text to obtain image feature data that matches the semantic content of the dialogue text; the image feature data is used to indicate the image change characteristics of the digital human; and, based on the image feature data, controlling the digital human to produce an image animation that matches the semantic content of the dialogue text. By performing semantic parsing on the responsive text, the content expressed by the digital human matches the digital human's image change items, enriching the digital human's image change content and eliminating the need for a fixed number of actions. This enhances the digital human's anthropomorphism and improves the user experience.

Description

Digital human image driving method, device, equipment, storage medium and product
Technical Field
Embodiments of the present application relate to the field of computer technology, and in particular to a digital human image driving method, device, equipment, storage medium, and product.
Background
With the rapid development of new energy intelligent vehicles, the vehicle-mounted voice assistant has gradually evolved from an original virtual figure into an anthropomorphic digital human. During interaction with the digital human, the digital human performs several fixed action changes in response to voice commands. When the user issues an explicit voice command, the digital human recognizes the scene type to which the command belongs and responds, performing the related actions for that scene type. For example, for the command "Small A, turn on the air conditioner", the vehicle end judges that this is an "instruction" scene, and Small A's reaction is that the air conditioner is turned on, possibly accompanied by a simple action such as nodding or smiling. For another example, for the command "Small A, play music", the vehicle end judges that this is an "instruction" scene, and Small A's reaction is that music starts playing, possibly accompanied by speech such as "OK, playing it for you".
In the related art, an action library is preset at the vehicle end, in which actions and their corresponding trigger instructions are prestored. The digital human's reply responses rely mainly on this preset, fixed action library to cope with different situations. For example, when the user says "play music", the vehicle end judges that this is an entertainment scene and may trigger a smiling action or a nodding action that follows the rhythm.
However, in the above related art, the response actions of the digital human are all fixed; that is, the action corresponding to a scene is executed according to the scene to which the trigger instruction belongs. As a result, the digital human's image is monotonous and formulaic, and the digital human's image is poorly matched to the content it broadcasts.
Disclosure of Invention
The present application provides a digital human image driving method, device, equipment, storage medium, and product, which enrich the digital human's image change content, ensure a better match between the digital human's image and the content it broadcasts, and improve the user experience.
In order to achieve the above purpose, the application adopts the following technical scheme:
In a first aspect, a digital human image driving method is provided, applied to a digital human image driving engine. The method comprises: obtaining dialogue text, the dialogue text being the text with which a digital human responds based on user input; performing semantic parsing on the dialogue text to obtain image feature data that matches the semantic content of the dialogue text, the image feature data indicating image change characteristics of the digital human; and controlling, based on the image feature data, the digital human to produce an image animation that matches the semantic content of the dialogue text.
In the solution provided by the present application, semantic parsing of the response text makes the content expressed by the digital human match the digital human's image changes. This enriches the digital human's image change content, no longer relies on a few fixed actions, increases the digital human's degree of anthropomorphism, and improves the user experience.
In one possible implementation, the image feature data includes one or more of mouth feature data, limb motion feature data, or facial expression feature data.
In yet another possible implementation, the digital human image driving engine includes a text-to-speech engine. Performing semantic parsing on the dialogue text to obtain image feature data matching its semantic content may be implemented as follows: the text-to-speech engine converts the dialogue text into speech information that conforms to the digital human's character setting; the text-to-speech engine decomposes the characters in the dialogue text into phonemes, where a phoneme is the pronunciation unit corresponding to a character, and records the start time and end time of each character's phonemes in the speech information; and the speech information is time-aligned with the phonemes of each character according to those start and end times, yielding the mouth feature data.
In another possible implementation, controlling the digital human to produce the image animation matching the semantic content of the dialogue text based on the image feature data may be implemented as follows: based on the mouth feature data, the digital human is controlled to utter the speech information while its mouth makes the mouth-shape actions corresponding to the phonemes of each character.
In another possible implementation, the digital human image driving engine includes a behavior reasoning engine. Performing semantic parsing on the dialogue text to obtain image feature data matching its semantic content may be implemented as follows: the behavior reasoning engine decomposes the characters in the dialogue text into text fields, and selects from a behavior tree according to the text fields to obtain limb motion feature data matching the semantic content of the dialogue text, the behavior tree containing different action information corresponding to different text fields.
In still another possible implementation, controlling the digital human to produce the image animation matching the semantic content of the dialogue text based on the image feature data may be implemented as follows: based on the limb motion feature data, the digital human's limbs are controlled to make the limb motions corresponding to the text fields.
In a further possible implementation, the digital human image driving engine includes an emotion analysis engine. Performing semantic parsing on the dialogue text to obtain image feature data matching its semantic content may be implemented as follows: the emotion analysis engine screens out the emotion text in the dialogue text, the emotion text being text used to express emotion, and obtains the facial expression feature data according to the type and the emotion degree of the emotion text.
In still another possible implementation, obtaining the facial expression feature data according to the type and emotion degree of the emotion text may be implemented as follows: the facial expression corresponding to the emotion text is determined according to the type of the emotion text; the expression amplitude of the facial expression is obtained according to the emotion degree of the emotion text; and the facial expression feature data are obtained from the facial expression and the expression amplitude.
In still another possible implementation, controlling the digital human to produce the image animation matching the semantic content of the dialogue text based on the image feature data may be implemented as follows: based on the facial expression feature data, the digital human's face is controlled to make the facial expression corresponding to the emotion text.
In another possible implementation manner, the digital human image driving engine is deployed at a vehicle end and/or a cloud end.
In a further possible implementation, the digital human image driving method further comprises: when the image feature data includes at least two of the mouth feature data, the limb motion feature data, or the facial expression feature data, synchronizing the at least two kinds of feature data according to timestamps, and controlling the digital human to produce the image animation matching the semantic content of the dialogue text according to the synchronized feature data.
In a second aspect, a digital human figure drive apparatus is provided for use with a digital human figure drive engine. The device comprises an acquisition module, an analysis module and a driving module. Wherein:
The acquisition module is configured to acquire dialogue text, where the dialogue text refers to the text with which the digital human responds based on user input.
The analysis module is used for carrying out semantic analysis on the dialogue text to obtain image characteristic data matched with semantic content of the dialogue text, and the image characteristic data is used for indicating image change characteristics of the digital person.
And the driving module is used for controlling the digital person to make the image animation matched with the semantic content of the dialogue text based on the image characteristic data.
In one possible implementation, the avatar characteristic data includes one or more of mouth characteristic data, limb motion characteristic data or facial expression characteristic data.
In yet another possible implementation, the digital human image driving engine includes a text-to-speech engine. The parsing module is further configured to convert the dialogue text into speech through the text-to-speech engine, obtaining speech information that conforms to the digital human's character setting; decompose the characters in the dialogue text into phonemes through the text-to-speech engine, where a phoneme is the pronunciation unit corresponding to a character, and record the start time and end time of each character's phonemes in the speech information; and time-align the speech information with the phonemes of each character according to those start and end times to obtain the mouth feature data.
In another possible implementation, the driving module is further configured to control the digital human to utter the speech information based on the mouth feature data, and to control the mouth of the digital human to make the mouth-shape actions corresponding to the phonemes of each character.
In another possible implementation manner, the digital persona driving engine comprises a behavior reasoning engine. The parsing module is further configured to disassemble the text in the dialog text through the behavior reasoning engine to obtain a text field. And selecting from a behavior tree according to the text field to obtain the limb action characteristic data matched with the semantic content of the dialogue text, wherein the behavior tree comprises different action information corresponding to different text fields.
In still another possible implementation manner, the driving module is further configured to control the limb of the digital person to make a limb motion corresponding to the text segment based on the limb motion feature data.
In a further possible implementation, the digital human image driving engine includes an emotion analysis engine. The parsing module is further configured to screen out the emotion text in the dialogue text through the emotion analysis engine, the emotion text being text used to express emotion, and to obtain the facial expression feature data according to the type and the emotion degree of the emotion text.
In still another possible implementation manner, the parsing module is further configured to determine, according to the type of the emotion text, a facial expression corresponding to the emotion text. And obtaining the expression amplitude of the facial expression according to the emotion degree of the emotion text. And obtaining the facial expression characteristic data based on the facial expression and the expression amplitude.
In still another possible implementation manner, the driving module is further configured to control the face of the digital person to make a facial expression corresponding to the emotion text based on the facial expression feature data.
In another possible implementation manner, the digital human image driving engine is deployed at a vehicle end and/or a cloud end.
In still another possible implementation, the parsing module is further configured to synchronize at least two kinds of feature data according to timestamps when the image feature data includes at least two of the mouth feature data, the limb motion feature data, or the facial expression feature data, and to control the digital human to produce an image animation matching the semantic content of the dialogue text according to the synchronized feature data.
The digital human figure driving device provided in the second aspect is configured to perform the digital human figure driving method provided in the first aspect or any one of the possible implementation manners of the first aspect, and the technical effect corresponding to any one of the implementation manners of the second aspect may refer to the technical effect corresponding to any one of the implementation manners of the first aspect, which is not described herein again.
In a third aspect, there is provided a computer device comprising a processor and a memory in which at least one computer program is stored, the at least one computer program being loaded and executed by the processor to implement a digital human figure driving method as described above in the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium having stored therein at least one computer program, the at least one computer program being loaded and executed by a processor to implement the digital human figure driving method as described above in the first aspect or any implementation of the first aspect.
In a fifth aspect, a computer program product is provided, the computer program product comprising a computer program or instructions which, when executed by a processor, implement a digital human image driving method as described in the first aspect or any implementation manner of the first aspect.
In a sixth aspect, an embodiment of the present application provides a chip system, including at least one processor and at least one interface circuit, where the at least one interface circuit is configured to perform a transceiving function, and send an instruction to the at least one processor, and when the at least one processor executes the instruction, the at least one processor executes to implement a digital portrait driving method according to the first aspect or any implementation manner of the first aspect.
In a seventh aspect, an embodiment of the present application provides a vehicle, the vehicle including a display screen in which a digital person is displayed, the vehicle driving the digital person based on the digital person image driving method according to the first aspect or any one of the possible implementation manners of the first aspect.
The third to seventh aspects above are solutions for implementing the method provided in the first aspect, and their specific implementation details are not repeated here. For the technical effects corresponding to any implementation of the third to seventh aspects, reference may be made to the technical effects corresponding to the first aspect or any implementation of the first aspect, which are not described again here.
It should be noted that the various possible implementations of any of the above aspects may be combined, provided the solutions are not contradictory.
Drawings
FIG. 1 is a schematic diagram of triggering a digital human to perform an action according to an exemplary embodiment;
FIG. 2 is a schematic diagram of the architecture of a computing system provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method of digital portrait driving provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of a method of digital portrait driving provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a digital portrait drive at the vehicle side provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of digital portrait driving at the cloud end according to an exemplary embodiment of the present application;
fig. 7 is a schematic structural view of a digital human figure driving apparatus according to an exemplary embodiment of the present application;
Fig. 8 is a schematic structural view of a computer device according to an exemplary embodiment of the present application.
Detailed Description
In the embodiments of the present application, to describe the technical solutions clearly, the words "first", "second", and so on are used to distinguish items that are identical or similar and have substantially the same function and effect. Those skilled in the art will appreciate that the words "first", "second", and so on do not limit quantity or execution order, and do not necessarily denote different items. Technical features described as "first" and "second" carry no implication of sequence or magnitude.
In embodiments of the application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion that may be readily understood.
In the embodiment of the present application, at least one may also be described as one or more, and a plurality may be two, three, four or more, and the present application is not limited thereto.
In addition, the network architecture and the scenario described in the embodiments of the present application are for more clearly describing the technical solution provided in the embodiments of the present application, and do not constitute a limitation on the technical solution provided in the embodiments of the present application, and those skilled in the art can know that, with the evolution of the network architecture and the appearance of a new service scenario, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
It should be noted that, the information (including but not limited to device information, object personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.) and signals related to the present application are all authorized by the object or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions. For example, the dialog text referred to in the present application is obtained with sufficient authorization.
With the rapid development of new energy intelligent vehicles, the vehicle-mounted voice assistant has gradually evolved from an original virtual figure into an anthropomorphic digital human. During interaction with the digital human, the digital human acts according to voice instructions: when a user issues an explicit voice instruction, the digital human not only executes the instruction but also performs related actions while doing so.
Fig. 1 is a schematic diagram of triggering a digital human to perform an action in a conventional method. The process includes: user voice input, local signal processing, local or cloud speech processing, dialogue management, digital human driving processing, returning the dialogue text, and local text-to-speech conversion, after which the speech is output to the user.
Specifically, the user provides voice input through the vehicle-mounted voice assistant, which triggers the assistant to start working. After receiving the voice input, the assistant first performs signal processing locally (at the vehicle end), including preprocessing operations such as noise reduction and framing. The voice input after signal processing can then be processed locally or in the cloud.
Cloud or local speech processing includes speech recognition (automatic speech recognition, ASR) and semantic understanding (natural language understanding, NLU).
When the voice is processed locally, the vehicle-end processing engine performs ASR and NLU on the voice input, understands its intent and semantics, and responds to it, obtaining the dialogue text. The vehicle-end processing engine then converts the dialogue text into speech via text-to-speech (TTS) and plays it. Meanwhile, the vehicle-end processing engine calls a local large model and looks up the digital human's image action in a preset action library, so that the image action is executed while the digital human plays the speech.
For example, the user speaks into the in-car microphone: "Small A, I want to hear a song by Star C." The local ASR module receives the user's voice signal and converts it into the text "I want to listen to a song by Star C." The local NLU module analyzes this text, understands that the user's intent is "play music" with the artist "Star C", and extracts the key information: intent = play music, artist = Star C. The vehicle-end processing engine generates a suitable dialogue text based on the NLU result, for example "OK, playing a song by Star C for you." After the local TTS module receives this dialogue text, it synthesizes it into speech and plays it through the vehicle audio system. Meanwhile, the vehicle-end processing engine processes the image action of the digital human Small A. Through simple semantic analysis of the dialogue text (e.g., recognizing words such as "OK" and "play"), it calls an "action library" stored in advance at the vehicle end. Based on the word "OK" indicating agreement or confirmation and the action "play", the library may return several preset actions, such as: action a, Small A nods slightly, raises the corners of the mouth, and makes a "start" gesture with the hands; action b, Small A leans slightly forward, focuses the eyes, and traces a select/confirm motion in the air with the hands. At this point the user hears Small A say, in synthesized speech, "OK, playing a song by Star C for you," and at the same time sees the digital human Small A on the screen nod slightly, smile, and make a gesture of starting playback.
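The conventional flow just described amounts to a lookup pipeline: ASR and NLU produce an intent, a template produces the dialogue text, and a fixed action library maps trigger keywords to canned animations. The following minimal sketch illustrates that related-art approach (not the method of the present application); all function and table names, such as recognize_speech and ACTION_LIBRARY, are hypothetical placeholders rather than an actual vehicle SDK.

```python
# Sketch of the related-art pipeline: a fixed action library keyed by trigger words.
# All names below are illustrative placeholders for this sketch.

def recognize_speech(audio: bytes) -> str:
    # Placeholder ASR: a real system would call a speech recognizer here.
    return "I want to listen to a song by Star C"

def understand(text: str) -> tuple[str, dict]:
    # Placeholder NLU: a real system would extract intent and slots here.
    return "play_music", {"artist": "Star C"}

ACTION_LIBRARY = {
    "ok": ["nod_slightly", "raise_mouth_corners"],
    "play": ["start_gesture"],
}

def handle_voice_input(audio: bytes) -> tuple[str, list[str]]:
    text = recognize_speech(audio)        # ASR: audio -> text
    intent, slots = understand(text)      # NLU: intent and slot extraction
    reply = f"OK, playing a song by {slots['artist']} for you."
    actions = []
    for keyword, canned in ACTION_LIBRARY.items():
        if keyword in reply.lower():      # simple keyword match against the reply text
            actions.extend(canned)
    return reply, actions                 # reply goes to TTS, actions go to the renderer

print(handle_voice_input(b""))
# ('OK, playing a song by Star C for you.', ['nod_slightly', 'raise_mouth_corners', 'start_gesture'])
```

Because the actions come from a fixed keyword table, the same few animations recur regardless of what the reply actually says, which is exactly the limitation the present application addresses.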
When the cloud-side processing engine handles the voice input, it performs ASR and NLU, understands the intent and semantics of the input, and responds to it, obtaining the dialogue text. At the same time, it drives the digital human's image by calling a large model, that is, it drives the image to act, such as mouth-shape driving and limb driving. The cloud-side processing engine converts the dialogue text into speech via TTS for playback. It also processes the digital human's image actions found in the preset action library and drives the digital human's mouth shape and limbs, so that the image actions are executed while the digital human plays the speech.
For example, the user speaks into the in-car microphone: "Small A, help me find the nearest gas station." The voice signal is uploaded to the cloud-side processing engine through the vehicle's network connection (e.g., 4G/5G/Wi-Fi). After receiving the voice signal, the ASR module in the cloud-side processing engine converts it into the text "help me find the nearest gas station". The NLU module in the cloud-side processing engine analyzes this text and understands that the user's intent is "find a place", specifically a "gas station", with the constraint "nearest". The cloud-side processing engine generates a suitable dialogue text based on the NLU result, for example "OK, searching for nearby gas stations for you, just a moment." The cloud-side processing engine sends this dialogue text to the TTS module at the vehicle end, which synthesizes it into speech and plays it through the vehicle audio system. Meanwhile, the cloud-side processing engine processes the image action of the digital human Small A. It calls the large model to analyze the dialogue text and generate precise mouth-shape transformation instructions, ensuring that Small A's mouth animation is synchronized with the TTS speech. The large model further analyzes the semantics and emotion of the text; it recognizes that "OK" indicates agreement. Based on these analyses, the large model looks up or generates appropriate actions in real time from a larger and finer-grained action library in the cloud; it may, for instance, select a nodding action that indicates "confirm/agree". The cloud-side processing engine then transmits the dialogue text and the generated animation data (mouth-shape driving instructions and limb driving instructions) to the vehicle end. The vehicle-end audio player receives and plays the speech converted from the dialogue text, and the vehicle end renders the digital human and executes the mouth-shape and limb driving instructions so that the mouth and limbs move in sync as directed by the cloud-side processing engine. At this point the user hears Small A say, in synthesized speech, "OK, searching for nearby gas stations for you, just a moment," and at the same time sees the digital human Small A on the screen smile and nod, with the mouth shape changing accurately with the spoken content (lip sync).
However, in the above technology, the response actions of the digital human are all fixed; that is, the action corresponding to a trigger instruction is executed according to that instruction. As a result, the digital human's image is monotonous and formulaic, and the digital human's image is poorly matched to the content it broadcasts.
On this basis, the present application provides a digital human image driving method: dialogue text with which a digital human responds based on user input is obtained; semantic parsing is performed on the dialogue text to obtain image feature data matching its semantic content; and, based on the image feature data, the digital human is controlled to produce an image animation matching the semantic content of the dialogue text. By semantically parsing the response text, the method makes the content expressed by the digital human match its image changes, enriches the digital human's image change content, no longer relies on a few fixed actions, increases the digital human's degree of anthropomorphism, and improves the user experience.
The following describes the embodiments of the present application with reference to the drawings.
The solution provided by the present application can be applied to the computing system shown in Fig. 2. As shown in Fig. 2, the computing system includes a digital human image driving engine, which may run in different locations, such as at the vehicle end or in the cloud.
At the vehicle end, the digital human image driving engine runs locally in the vehicle and can operate even without a network, making it suitable for basic vehicle control and for poor network conditions. In the cloud, the digital human image driving engine runs on a remote server; the vehicle and the cloud are connected by a network, and the cloud engine can only operate when a network is available.
Optionally, the vehicle may be a new energy vehicle, a fuel vehicle, a motorcycle, a tricycle, an autonomous logistics vehicle, an electric truck, or the like, but is not limited thereto, and the embodiments of the present application do not specifically limit this.
Wherein an onboard voice assistant is supported in the vehicle, i.e. the vehicle is capable of supporting voice interactions. The vehicle also includes a display screen in which a digital person is displayed, the digital person being interactable with the user.
Optionally, the cloud may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data services.
Fig. 3 is a schematic diagram of a digital human image driving method according to an embodiment of the present application. The method is applied to a digital human image driving engine, which may run at the vehicle end or in the cloud of the computing system illustrated in Fig. 2.
As shown in fig. 3, the digital human figure driving method may include:
Step 302, the digital portrait driver engine obtains dialogue text.
Dialogue text refers to the text with which the digital human responds based on user input, and is also referred to as "response text".
Illustratively, the user interacts with the vehicle-mounted voice assistant, and the dialogue text is the text that replies to the user input. For example, the user says to the vehicle-mounted voice assistant, "Small A, I want to hear a song by Star C," and the assistant replies, "OK, playing a song by Star C for you." Here, "OK, playing a song by Star C for you" is the dialogue text.
Alternatively, the dialog text may be a reply to the user's voice input or a reply to the user's instruction.
The digital person image driving engine is a software system running in a vehicle end, a cloud end or other computing environments, and generates actions, expressions and language feedback of a digital person by analyzing instructions or dialogue texts of the user, so that natural interaction between the digital person and the user is realized.
The digital person is a built virtual character. They not only possess the appearance and morphology of an anthropomorphic or real human being, but also interact with the human being through speech, expression and limb movements. Alternatively, the digital person may be a 3D human figure, or a 2D human figure.
In addition, the pattern of the digital person is not limited to only a person, that is, the digital person may be an animated dog, an animated cat, a plant, or the like, but is not limited thereto, and the embodiment of the application is not particularly limited thereto.
Step 304: the digital human image driving engine performs semantic parsing on the dialogue text to obtain image feature data matching the semantic content of the dialogue text.
Wherein the character feature data is used to indicate character change features of the digital person.
Illustratively, the digital human image driving engine obtains the semantic content of the dialogue text by semantically parsing it, for example whether the dialogue text expresses happiness, surprise, a query, or a complaint, and generates a corresponding set of image feature data, thereby driving the digital human's image.
Optionally, the image characteristic data includes one or more of mouth characteristic data, limb motion characteristic data or facial expression characteristic data, but is not limited thereto, and the embodiment of the present application is not particularly limited thereto.
Wherein the mouth characteristic data is data for indicating a mouth morphological feature of the digital person. That is, the mouth shape of the digital person can be determined from the mouth feature data.
Optionally, the mouth shape includes mouth shape, mouth opening and closing degree, mouth angle state, etc., but is not limited thereto, and the embodiment of the present application is not limited thereto.
The mouth shape is the core part of the mouth morphology and is mainly used to match the speech; different phonemes require different mouth shapes. For example, when producing an /a/ sound, the mouth is usually open in a round or oval shape. When producing an /i/ sound, the mouth corners stretch to both sides and the lips become slightly thinner. When producing a /u/ sound, the lips close up and protrude forward into a rounded shape. When producing an /s/ or /f/ sound, the lips approach or part slightly, forming a narrow gap. A closed mouth is used for transitions between vowels or for silent portions. The digital human image driving engine selects or generates the corresponding mouth shape according to the phoneme to be produced.
The opening and closing degree of the mouth refers to the opening size of the mouth.
Mouth-corner states include: raised, usually representing positive emotions such as smiling, happiness, or delight; drooping, which may express negative emotions such as sadness, dissatisfaction, or fatigue; level, which is neutral or represents a serious, neutral statement; and asymmetric, which may represent confusion or mockery, or simply the natural distortion of the mouth while speaking.
The limb-motion-feature data is data for indicating limb-motion features of the digital person. That is, the limb movement of the digital person can be determined from the limb movement feature data.
Optionally, the limb includes a digital person's hand, foot, upper body, lower body, and the like. The limb morphology refers to a specific posture or motion state exhibited by a limb of a human. It comprises the following main aspects:
(1) Joint angle/rotation.
Joint angles/rotations are the most central data of limb motion. Each movable joint (e.g., shoulder, elbow, wrist, hip, knee, ankle) has one or more angle values defining its flexion, extension, rotation, and so on. For example, for the arm: the bending degree of the elbow joint (bent vs. straight), the lifting angle of the shoulder joint, and its forward/backward swing angle. For the leg: the bending degree of the knee joint (bent vs. straight), and the lifting, lowering, and forward/backward swing angles of the hip joint. For the trunk: the bending and twisting angles of the spine (leaning forward, leaning backward, twisting left and right). For the fingers and wrist: the bending degree of the fingers (fist vs. open) and the rotation angle of the wrist. Body posture/balance refers to the posture of the whole body relative to the ground, including the position of the center of gravity and the inclination angle of the body, for example standing straight or leaning slightly forward.
(2) Gesture/hand motion.
Gestures/hand actions include pointing, waving hands, making fists, making gestures (e.g., to indicate "OK", "stop", etc.), taking or manipulating objects, and so forth.
The facial expression feature data are data indicating the facial expression features of the digital human. That is, the facial expression of the digital human can be determined from the facial expression feature data.
Optionally, facial expressions generally include, but are not limited to: happiness (raised mouth corners, narrowed eyes), sadness (drooping mouth corners, possibly accompanied by tears), anger (lowered brows, glaring eyes, lips pressed together), fear (raised brows, widened eyes, slightly open lips), surprise (raised brows, widened eyes, slightly open lips), disgust (raised upper lip, slightly wrinkled nose), and contempt (one mouth corner slightly raised).
For example, the dialogue text is "Ah, I just spilled water on the keyboard, what should I do?" The digital human image driving engine parses this text and understands that the user is probably surprised, somewhat confused, and seeking help. The image feature data matching the semantic content of the dialogue text therefore include: facial expression feature data such as raised eyebrows (surprise), mouth corners possibly drooping slightly (worry or annoyance), and widened eyes; mouth feature data generating the corresponding mouth-shape changes for the dialogue text (e.g., the mouth shapes for "ah", "water", and so on); and limb motion feature data such as the body leaning slightly forward to show concern, one hand possibly lifted as if simulating wiping, and the head possibly tilted slightly to one side to indicate thinking or confusion.
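As a concrete illustration of what "image feature data" might look like when the three kinds of features are combined, the sketch below models the output of semantic parsing for the example above as a simple timestamped structure. The field names and values are purely illustrative assumptions, not a format defined by the present application.

```python
from dataclasses import dataclass, field

@dataclass
class ImageFeatureData:
    """One parsed result: everything needed to drive the digital human for one utterance."""
    mouth: list[tuple[str, float, float]] = field(default_factory=list)  # (phoneme, start_s, end_s)
    limbs: list[tuple[str, float]] = field(default_factory=list)         # (action name, start_s)
    face: dict[str, float] = field(default_factory=dict)                 # expression -> amplitude in [0, 1]

# Hypothetical parse of "Ah, I just spilled water on the keyboard, what should I do?"
features = ImageFeatureData(
    mouth=[("a", 0.0, 0.2), ("sh", 0.2, 0.35), ("ui", 0.35, 0.6)],       # first few aligned phonemes
    limbs=[("lean_forward", 0.0), ("raise_hand_wipe", 0.4), ("tilt_head", 1.2)],
    face={"surprise": 0.8, "worry": 0.4},
)
```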
Step 306, the digital person image driving engine controls the digital person to make image animation matched with the semantic content of the dialogue text based on the image characteristic data.
The image animation is the image-change animation produced after the digital human image driving engine adjusts the digital human's image according to the image feature data.
Illustratively, after obtaining the image feature data, the digital human image driving engine drives the digital human's image according to that data and renders it, so that the image changes as the feature data specify.
For example, suppose the obtained image feature data are: facial expression feature data of raised eyebrows (surprise), mouth corners possibly drooping slightly (worry or annoyance), and widened eyes; mouth feature data generating the corresponding mouth-shape changes for the dialogue text (e.g., the mouth shapes for "ah", "water", and so on); and limb motion feature data of the body leaning slightly forward to show concern, one hand possibly lifted as if simulating wiping, and the head possibly tilted slightly to one side to indicate thinking or confusion. The digital human image driving engine controls the digital human to produce the image animation according to these feature data. After rendering, the user sees the digital human raise its eyebrows and widen its eyes; the mouth opens, closes, and changes shape along with the pronunciation of the sentence "Ah, I just spilled water on the keyboard, what should I do?"; the body leans slightly forward, one hand is lifted, and the head tilts slightly to one side.
In summary, in the solution provided by the present application, semantic parsing of the response text makes the content expressed by the digital human match its image changes, which enriches the digital human's image change content, no longer relies on a few fixed actions, increases the digital human's degree of anthropomorphism, and improves the user experience.
Further, as shown in the flowchart of the digital human figure driving method of fig. 4, since the figure feature data includes mouth feature data, limb motion feature data, or facial expression feature data, the step 304 may be implemented as steps 3041, 3042, and 3043 for different figure feature data. Step 306 may be implemented as step 3061, step 3062, and step 3063.
In the case where the character feature data is mouth feature data, steps 304 and 306 are implemented as steps 3041 and 3061.
In the case where the avatar characteristic data is limb movement characteristic data, steps 304 and 306 are implemented as steps 3042 and 3062.
In the case where the avatar characteristic data is facial expression characteristic data, steps 304 and 306 are implemented as steps 3043 and 3063.
It should be noted that steps 3041, 3042, and 3043 may be combined, and steps 3061, 3062, and 3063 may also be combined; that is, different kinds of image feature data can be embodied in one digital human at the same time. In other words, when the image feature data includes at least two of the mouth feature data, the limb motion feature data, or the facial expression feature data, the at least two kinds of feature data are synchronized according to timestamps, and the digital human is controlled to produce an image animation matching the semantic content of the dialogue text according to the synchronized feature data.
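Merging mouth, limb, and facial-expression streams on a common timeline can be sketched as below, assuming each stream is a list of (timestamp, event) pairs; the merge strategy and names are illustrative assumptions for this sketch, not the specified algorithm of the present application.

```python
import heapq

def synchronize(*streams: list[tuple[float, str]]) -> list[tuple[float, str]]:
    """Merge several sorted (timestamp_s, event) streams into one timeline ordered by timestamp.

    The renderer can then consume a single ordered list and apply mouth, limb,
    and facial events at the right moments.
    """
    return list(heapq.merge(*streams, key=lambda event: event[0]))

mouth = [(0.0, "viseme:a"), (0.2, "viseme:sh")]
limbs = [(0.0, "lean_forward"), (0.4, "raise_hand_wipe")]
face  = [(0.0, "surprise@0.8")]

timeline = synchronize(mouth, limbs, face)
# [(0.0, 'viseme:a'), (0.0, 'lean_forward'), (0.0, 'surprise@0.8'),
#  (0.2, 'viseme:sh'), (0.4, 'raise_hand_wipe')]
```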
In one possible implementation, where the image feature data are mouth feature data, steps 304 and 306 are implemented as steps 3041 and 3061.
Step 3041: semantic parsing is performed on the dialogue text to obtain mouth feature data matching the semantic content of the dialogue text.
Illustratively, the dialogue text is converted into speech by a text-to-speech engine in the digital human image driving engine, yielding speech information that conforms to the digital human's character setting. The text-to-speech engine also decomposes the characters in the dialogue text into their phonemes and records the start time and end time of each character's phonemes in the speech information. The speech information is then time-aligned with the phonemes of each character according to those start and end times, yielding the mouth feature data.
The text-to-speech engine is a TTS engine. The text-to-speech engine is used to turn text into sound.
The voice information conforming to the digital person role setting refers to the voice information conforming to the digital person role setting obtained by adjusting the tone, speed, pitch, mood, etc. of sound according to the digital person role setting. For example, if small a is set to a young active boy, the TTS engine will generate a voice message from the dialog text that sounds young, active, and may be at a slightly faster pace. If small A is set as a mature and steady girl, the TTS engine generates voice information with slower, lower, softer sound according to the dialogue text.
The phonemes are pronunciation units corresponding to the characters;
Specifically, while generating the speech information, the TTS engine decomposes the dialogue text into its most basic pronunciation units, the phonemes (e.g., "a", "o", "e", "b", "p" in Chinese). It then records precisely from which point in time to which point in time each phoneme occurs in the generated speech information, as if each "syllable fragment" of the speech were time-stamped. From the speech information and the phonemes, the TTS engine obtains the mouth feature data.
For example, for the dialogue text "Hello, the weather is really nice today!", the TTS engine decomposes it into the phonemes ni, hao, jin, tian, tian, qi, zhen, hao, and accurately records the start and end time of each phoneme in the speech: "ni" starts at 0.1 seconds and ends at 0.3 seconds; "hao" starts at 0.3 seconds and ends at 0.6 seconds; "jin" starts at 0.6 seconds and ends at 0.9 seconds; and so on.
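A minimal sketch of the data produced by this alignment step is shown below: the aligned phoneme timings are the mouth feature data that later drive the mouth shapes. The structure and helper names are illustrative assumptions; a real TTS engine would return timings from its own synthesis process.

```python
from dataclasses import dataclass

@dataclass
class AlignedPhoneme:
    phoneme: str      # pronunciation unit, e.g. "ni"
    start_s: float    # start time within the synthesized speech
    end_s: float      # end time within the synthesized speech

def mouth_feature_data(phonemes: list[str], durations_s: list[float]) -> list[AlignedPhoneme]:
    """Build the aligned timeline from phonemes and their durations (assumed given by TTS)."""
    aligned, t = [], 0.1  # assume speech starts after 0.1 s of leading silence
    for ph, dur in zip(phonemes, durations_s):
        aligned.append(AlignedPhoneme(ph, t, t + dur))
        t += dur
    return aligned

# "Hello, the weather is really nice today!" -> ni hao jin tian tian qi zhen hao
timeline = mouth_feature_data(
    ["ni", "hao", "jin", "tian", "tian", "qi", "zhen", "hao"],
    [0.2, 0.3, 0.3, 0.3, 0.3, 0.25, 0.25, 0.3],
)
# timeline[0] == AlignedPhoneme(phoneme='ni', start_s=0.1, end_s=0.3), matching the example above.
```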
Step 3061: based on the mouth feature data, the digital human is controlled to utter the speech information and its mouth is controlled to make the mouth-shape actions corresponding to the phonemes of each character.
Illustratively, based on the mouth feature data, the digital human utters the speech information while its mouth makes the mouth-shape action for each character's phonemes.
Specifically, after obtaining the speech information and phonemes in the mouth feature data and aligning the phonemes with the speech, the engine uses this information to control the digital human's mouth motion. For example, when an "a" sound is broadcast, the digital human's mouth takes the shape of an "a" (e.g., opened wide); when an "n" sound is broadcast, the mouth changes into the shape of an "n" (e.g., lips slightly parted, tongue tip against the upper gums). This happens in real time, ensuring that the digital human's lip movements match the sounds it emits.
For example, for the dialogue text "Hello, the weather is really nice today!", the engine controls the digital human's mouth to make the mouth-shape action for each character's phonemes. When playback reaches 0.1 seconds, the alignment result indicates that the first phoneme of "ni", namely "n", should now be produced, so Small A's mouth is driven into the starting "n" shape (mouth corners drawn slightly inward, tongue tip near the upper gums); it then switches to the "i" mouth shape (lips spread, mouth slightly open). At 0.3 seconds, "hao" begins and the mouth switches to the "h" shape (lips slightly open, exhaling), then immediately to the "a" shape (mouth wide open), and then to the "o" shape (rounded mouth). This continues until the final phoneme of the last "hao" is spoken, with Small A's mouth shape changing smoothly in real time according to the pronunciation requirements of each phoneme.
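The real-time behaviour described above amounts to looking up, for the current playback time, which phoneme is active and mapping it to a mouth shape (viseme). The sketch below illustrates that lookup, reusing the AlignedPhoneme timeline from the previous sketch; the viseme table and function names are illustrative, not a defined interface of the present application.

```python
# Hypothetical phoneme -> mouth-shape (viseme) table; a production system would be far richer.
VISEMES = {
    "n": "tongue_on_gums", "i": "lips_spread",
    "h": "lips_slightly_open", "a": "mouth_wide_open", "o": "lips_rounded",
}

def mouth_shape_at(timeline: list[AlignedPhoneme], playback_s: float) -> str:
    """Return the viseme to render at the given playback time, or a closed mouth if silent."""
    for item in timeline:
        if item.start_s <= playback_s < item.end_s:
            # Use the first letter of the phoneme as a crude viseme key for this sketch.
            return VISEMES.get(item.phoneme[0], "mouth_closed")
    return "mouth_closed"

# At 0.1 s the phoneme "ni" is active, so the mouth starts with the "n" shape.
print(mouth_shape_at(timeline, 0.1))   # -> "tongue_on_gums"
```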
In another possible implementation, where the image feature data are limb motion feature data, steps 304 and 306 are implemented as steps 3042 and 3062.
Step 3042: semantic parsing is performed on the dialogue text to obtain limb motion feature data matching the semantic content of the dialogue text.
Illustratively, the behavior reasoning engine in the digital human image driving engine decomposes the characters in the dialogue text into text fields, and selects from a behavior tree according to the text fields to obtain limb motion feature data matching the semantic content of the dialogue text.
The behavior tree contains different action information corresponding to different text fields. The behavior reasoning engine is used to understand and infer the semantic content of the dialogue text, that is, to reason about what the words in the dialogue text are saying and which emotions, intentions, and specific contents they contain.
For example, the dialogue text is "The weather is really nice today, let's go for a picnic in the park!" The behavior reasoning engine decomposes it into the text fields "the weather is really nice" (positive emotion) and "picnic in the park" (activity proposal). For the text field "the weather is really nice", which expresses a positive emotion, gestures of appreciation or approval can be looked up in the behavior tree, such as a gentle nod or an upward pointing motion of the hand, as if pointing at a clear sky. For the text field "picnic in the park", which is a concrete proposal, more specific and vivid gestures can be looked up in the behavior tree, such as a gesture made with both hands or fingers pointing into the distance, representing "heading to the spot". The action data obtained from the behavior tree are then assembled into the limb motion feature data.
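A behavior tree here essentially maps recognized text fields (and their categories) to candidate actions. The following is a minimal sketch of that selection step, assuming a flat dictionary stands in for the tree; the field names, action names, and matching logic are illustrative assumptions rather than the specified structure of the present application.

```python
# Hypothetical "behavior tree" flattened into category -> candidate limb actions.
BEHAVIOR_TREE = {
    "positive_emotion":  ["nod_gently", "point_upward"],
    "activity_proposal": ["gesture_with_both_hands", "point_into_distance"],
}

def limb_actions_for(text_fields: list[tuple[str, str]]) -> list[str]:
    """Select limb motion feature data for each (field_text, category) pair."""
    actions = []
    for _field_text, category in text_fields:
        actions.extend(BEHAVIOR_TREE.get(category, []))
    return actions

fields = [("the weather is really nice", "positive_emotion"),
          ("picnic in the park", "activity_proposal")]
print(limb_actions_for(fields))
# ['nod_gently', 'point_upward', 'gesture_with_both_hands', 'point_into_distance']
```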
Step 3062, based on the limb action feature data, controlling the limbs of the digital person to make the limb actions corresponding to the text segments.
For example, after analyzing the dialogue text, the behavior reasoning engine plans a series of suitable gesture and leg actions according to the analysis result. These actions need to be associated with the text segments just analyzed, such as segments expressing happiness or confusion, emphasizing a certain point, or describing a certain action. The planned actions then instruct the digital person to perform the corresponding limb actions.
For example, if the dialogue text is "The weather is really nice today, let's go for a picnic in the park!", the behavior reasoning engine controls the limbs of the digital person to make the corresponding limb actions according to the limb action feature data. The picture the user sees is the digital person gently nodding, pointing a finger toward the clear sky, and then miming the picnic with both hands.
In yet another possible implementation, where the image feature data is facial expression feature data, steps 304 and 306 are implemented as steps 3043 and 3063.
Step 3043, carrying out semantic analysis on the dialogue text to obtain facial expression feature data matched with the semantic content of the dialogue text.
Illustratively, the emotional text in the dialogue text is screened out by an emotion analysis engine, and the facial expression feature data is obtained according to the type and emotion degree of the emotional text.
The emotional text is text used to express emotion, and the emotion degree represents how strongly the emotion is expressed.
Optionally, the facial expression feature data includes a facial expression and the expression amplitude of that facial expression. The emotion analysis engine determines the facial expression corresponding to the emotional text according to the type of the emotional text, obtains the expression amplitude of the facial expression according to the emotion degree of the emotional text, and obtains the facial expression feature data based on the facial expression and the expression amplitude.
For example, suppose the dialogue text is "The new restaurant we ate at today was amazing! The food was delicious, the service was great, and I'm so happy!". By screening out the emotional text in the dialogue text ("I'm so happy"), the emotion analysis engine can judge that the emotion is happiness and that the emotion degree is high. Accordingly, the engine obtains a facial expression of "a big smile, eyes possibly squinting (like smiling eyes), eyebrows slightly raised" and an expression amplitude of "the corners of the mouth raised high" according to the type and emotion degree of the emotional text.
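The two-step mapping described above (emotion type to facial expression, emotion degree to expression amplitude) can be sketched as follows. The emotion labels, score thresholds, and expression names are illustrative assumptions only.

```python
# Minimal sketch: emotion type and emotion degree -> facial expression feature data.
# Emotion labels, thresholds, and expression names are illustrative assumptions.

EXPRESSION_BY_EMOTION = {
    "happy": "big_smile_squinted_eyes_raised_brows",
    "sad": "downturned_mouth_lowered_brows",
    "neutral": "relaxed_face",
}

def facial_expression_features(emotion_type, emotion_degree):
    """emotion_degree is assumed to be a score in [0, 1]."""
    expression = EXPRESSION_BY_EMOTION.get(emotion_type, "relaxed_face")
    if emotion_degree > 0.7:
        amplitude = "high"      # e.g. corners of the mouth raised high
    elif emotion_degree > 0.3:
        amplitude = "medium"
    else:
        amplitude = "low"
    return {"expression": expression, "amplitude": amplitude}

# "I'm so happy!" -> strong positive emotion.
print(facial_expression_features("happy", 0.9))
```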
Step 3063, based on the facial expression feature data, controlling the face of the digital person to make the facial expression corresponding to the emotional text.
For example, after analyzing the emotional text in the dialogue text, the emotion analysis engine plans a suitable facial expression according to the analysis result. These facial expressions need to be associated with the emotional text just analyzed.
For example, if the dialogue text is "The new restaurant we ate at today was amazing! The food was delicious, the service was great, and I'm so happy!", the emotion analysis engine controls the face of the digital person to make the corresponding expression according to the facial expression and the expression amplitude. The picture the user sees is the digital person's mouth corners raised high in a big smile, the eyes squinting (like smiling eyes), and the eyebrows slightly raised.
The above embodiments describe the digital human image driving method; it is described below with a specific example.
Illustratively, the user makes a voice input through the vehicle-mounted voice assistant, triggering the vehicle-mounted voice assistant to start working. After receiving the voice input, the vehicle-mounted voice assistant first performs signal processing on the voice input locally (at the vehicle side), including preprocessing operations such as noise reduction and framing. The voice input after signal processing can then be processed locally or in the cloud.
When the digital human image driving engine is deployed at the vehicle side, as shown in the schematic diagram of digital human image driving at the vehicle side in fig. 5, the driving engine in the vehicle performs local voice processing (ASR and NLU) on the voice input after signal processing, and after understanding the intention and semantics of the voice input, responds to it to obtain the dialogue text. The driving engine in the vehicle then performs semantic analysis on the dialogue text to obtain the image feature data matched with the semantic content of the dialogue text. The animation engine generates the animation of the digital person according to the image feature data, the rendering engine renders it into a video file, and the visual animation of the video file is presented to the user.
When the digital human image driving engine is deployed in the cloud, as shown in the schematic diagram of digital human image driving in the cloud in fig. 6, the driving engine in the cloud performs cloud voice processing on the voice input after signal processing (i.e., performs voice recognition in the cloud), and after understanding the intention and semantics of the voice input, responds to it to obtain the dialogue text. The driving engine in the cloud performs semantic understanding on the dialogue text and determines its scenario, for example, determining that the user is casually chatting with the vehicle-mounted voice assistant. The driving engine in the cloud then converts the dialogue text into speech through the text-to-speech engine to obtain speech information that conforms to the character setting of the digital person. Meanwhile, the text-to-speech engine decomposes the characters in the dialogue text into phonemes, obtains the phoneme corresponding to each character, and records the start time and end time of each character's phoneme in the speech information. The speech information and the phonemes of the characters are then time-aligned according to these start and end times to obtain the mouth feature data. In addition, the behavior reasoning engine in the cloud breaks down the characters in the dialogue text to obtain text segments, and limb action feature data matched with the semantic content of the dialogue text is selected from the behavior tree according to the text segments. The emotion analysis engine screens out the emotional text in the dialogue text and obtains the facial expression feature data according to the type and emotion degree of the emotional text. The driving engine in the cloud saves the mouth feature data, the limb action feature data, and the facial expression feature data into a material library. The animation engine then generates the animation of the digital person according to the mouth feature data, limb action feature data, and facial expression feature data in the material library, the rendering engine renders it into a video file, and the visual animation of the video file is displayed to the user.
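Read end to end, the cloud-side flow above is a pipeline: speech input, dialogue text, three feature extractors working from the same text, a material library, and finally animation and rendering. The sketch below strings hypothetical stand-ins together to show only the data flow; none of the function names or return values come from this application.

```python
# Minimal sketch of the cloud-side data flow described above.
# Every function here is a hypothetical stand-in, not a real engine API.

def asr_and_nlu(audio):
    """Cloud voice processing: recognize the speech and respond with dialogue text."""
    return "The weather is really nice today, let's go for a walk in the park!"

def text_to_speech(dialogue_text):
    """Return synthesized speech plus a phoneme alignment (phoneme, start, end)."""
    return b"<pcm bytes>", [("n", 0.1, 0.3), ("i", 0.3, 0.5)]

def behavior_reasoning(dialogue_text):
    """Select limb actions for the recognized text segments."""
    return ["point_out_of_window", "open_arms_toward_sunlight"]

def emotion_analysis(dialogue_text):
    """Map the emotional text to a facial expression and amplitude."""
    return {"expression": "happy_smile", "amplitude": "high"}

def drive_digital_person(audio_input):
    dialogue_text = asr_and_nlu(audio_input)

    speech, mouth_features = text_to_speech(dialogue_text)
    limb_features = behavior_reasoning(dialogue_text)
    face_features = emotion_analysis(dialogue_text)

    # "Material library": the three kinds of feature data collected together.
    material_library = {
        "mouth": mouth_features,
        "limbs": limb_features,
        "face": face_features,
    }
    # The animation engine would consume material_library together with the
    # speech audio to generate and render the digital person video.
    return speech, material_library

print(drive_digital_person(b"<mic audio>")[1])
```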
For example, the user chats with the digital person Xiao A through the in-car microphone. The user says, "The weather is really nice today." Xiao A replies, "The weather is really nice today, let's go for a walk in the park!" The dialogue text is therefore "The weather is really nice today, let's go for a walk in the park!".
The digital human image driving engine calculates the complete phoneme sequence needed to say "The weather is really nice today, let's go for a walk in the park!" together with the precise time points at which the mouth shape must change, obtaining the mouth feature data.
The behavior reasoning engine analyzes that the content is about "nice weather" and "a walk in the park" and plans possible gestures, such as pointing out of the window (if the interface allows it) or making a gesture that represents "setting off" or "enjoyment" (such as opening both hands as if embracing the sunlight), obtaining the limb action feature data.
The emotion analysis engine analyzes that the words are filled with positive, pleasant emotion ("really nice" and "let's go for a walk in the park"), and plans a happy expression (a smile with crescent-shaped eyes) and possibly an accompanying relaxed, pleasant body posture (body leaning slightly forward, lively), obtaining the facial expression feature data.
After receiving one or more of the mouth feature data, limb action feature data, or facial expression feature data, the animation engine starts to drive the digital person. For example, according to the mouth feature data, the mouth must produce each sound accurately and make the corresponding mouth shape. When the word "really" is spoken, the expression switches to a happy smile and the eyes brighten. When "let's go for a walk in the park" is spoken, it can be coordinated with a relaxed gesture, such as lifting a hand with the fingers bending slightly and then stretching outward, simulating an inviting motion or pointing into the distance. Throughout, the body posture remains relaxed and pleasant.
The final overall performance of the digital person Xiao A is that its mouth opens and closes naturally along with the speech content. As it says "The weather is really nice today, let's go for a walk in the park!", a happy smile slowly spreads across its face and its eyes brighten. While speaking the second half of the sentence, Xiao A may add a relaxed, inviting gesture. Throughout the process, Xiao A's posture is relaxed and positive, and its gaze is pleasant.
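For this combined performance, the mouth, limb, and facial tracks must share a common clock. A minimal way to picture that synchronization (the timestamp-based data synchronization also recited in claim 11) is to merge timestamped events from each track into one ordered schedule; the event names and times below are assumptions.

```python
# Minimal sketch: merging mouth, limb, and face events by timestamp into one
# schedule for the animation engine. All times and event names are illustrative.

mouth_track = [(0.1, "viseme:n"), (0.3, "viseme:i"), (0.5, "viseme:h")]
limb_track = [(0.8, "gesture:inviting_hand_lift")]
face_track = [(0.4, "expression:happy_smile_high")]

def synchronize(*tracks):
    """Merge timestamped events from several tracks into one time-ordered list."""
    merged = [event for track in tracks for event in track]
    return sorted(merged, key=lambda event: event[0])

for timestamp, event in synchronize(mouth_track, limb_track, face_track):
    print(f"{timestamp:.1f}s -> {event}")
```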
The foregoing has mainly described the solutions provided by the present application. Correspondingly, the application also provides a digital human image driving apparatus for implementing the above method embodiments.
In some embodiments, the digital human image driving apparatus includes corresponding hardware structures and/or software modules for performing the functions. Those skilled in the art will readily appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is implemented as hardware or as computer-software-driven hardware depends on the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiments of the application may divide the digital human image driving apparatus into functional modules according to the method embodiments; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or as software functional modules. It should be noted that the division of modules in the embodiments of the application is schematic and is merely a logical function division; other division manners may be used in actual implementation.
In some embodiments, the present application provides a digital human image driving apparatus, which is used to implement the functions of the digital human image driving engine in the above-described digital human image driving method embodiments. Fig. 7 schematically shows the structure of the digital human image driving apparatus, which may include an acquisition module 701, a parsing module 702, and a driving module 703.
The obtaining module 701 is configured to perform the operations of step 302 in the methods illustrated in fig. 3 and 4. The parsing module 702 is configured to perform the operations of steps 304, 3041, 3042, 3043 in the methods illustrated in fig. 3 and 4. The driving module 703 is used to perform the operations of steps 306, 3061, 3062, 3063 of the methods illustrated in fig. 3, 4.
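The module split in fig. 7 mirrors the three method steps. The sketch below assumes nothing about the real implementation beyond that correspondence; all class and method names are hypothetical.

```python
# Minimal sketch of the apparatus in fig. 7: three modules mapped to steps 302/304/306.
# All class and method names are illustrative assumptions.

class AcquisitionModule:
    """Corresponds to step 302: obtain the dialogue text."""
    def get_dialogue_text(self, user_input):
        return f"response to: {user_input}"

class ParsingModule:
    """Corresponds to steps 304/3041/3042/3043: semantic analysis into feature data."""
    def parse(self, dialogue_text):
        return {"mouth": [], "limbs": [], "face": []}  # image feature data

class DrivingModule:
    """Corresponds to steps 306/3061/3062/3063: drive the digital person."""
    def drive(self, feature_data):
        print("driving the digital person with:", feature_data)

class DigitalHumanImageDrivingApparatus:
    def __init__(self):
        self.acquisition = AcquisitionModule()   # module 701
        self.parsing = ParsingModule()           # module 702
        self.driving = DrivingModule()           # module 703

    def run(self, user_input):
        text = self.acquisition.get_dialogue_text(user_input)
        features = self.parsing.parse(text)
        self.driving.drive(features)

DigitalHumanImageDrivingApparatus().run("How is the weather today?")
```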
As shown in fig. 8, a computer device provided by an embodiment of the present application may include a processor 801, a bus 802, a communication interface 803, and a memory 804. The processor 801, the memory 804 and the communication interface 803 communicate via a bus 802. It should be understood that the present application is not limited to the number of processors, memories in a computer device.
The bus 802 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, a UB bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 8, but this does not mean there is only one bus or only one type of bus. The bus 802 may include a path for transferring information between various components of the computer device (e.g., the memory 804, the processor 801, and the communication interface 803).
The processor 801 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 804 may include volatile memory, such as random access memory (RAM). The memory 804 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The communication interface 803 enables communication between the computer device and other devices or communication networks using a transceiver module such as, but not limited to, a network interface card, transceiver, etc.
The memory 804 stores executable program code, and the processor 801 executes the program code to implement the functions of the digital human image driving apparatus in the foregoing method embodiments. That is, the memory 804 stores a program for executing the above digital human image driving method.
In yet another aspect, a computer readable storage medium having at least one computer program stored therein is provided, the at least one computer program being loaded and executed by a processor to implement a digital human image driving method as provided by the above method embodiments.
In a further aspect, there is provided a computer program product comprising a computer program or instructions which, when executed by a processor, implement a digital human image driving method as described in the above aspects.
In yet another aspect, a system on a chip is provided, including at least one processor and at least one interface circuit, where the at least one interface circuit is configured to perform a transceiving function and send instructions to the at least one processor, and when the instructions are executed by the at least one processor, the digital human image driving method described in the above aspect is implemented.
The method steps in the embodiments may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a computer device. The processor and the storage medium may also reside as discrete components in a computer device.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user device, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium such as a floppy disk, hard disk, or magnetic tape, an optical medium such as a digital video disc (DVD), or a semiconductor medium such as a solid state drive (SSD).

While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (16)

1. A digital human image driving method, characterized in that the method is applied to a digital human image driving engine, and the method comprises: acquiring a dialogue text, wherein the dialogue text refers to the text with which a digital human responds based on user input; performing semantic analysis on the dialogue text to obtain image feature data matching the semantic content of the dialogue text, wherein the image feature data is used to indicate image change characteristics of the digital human; and based on the image feature data, controlling the digital human to produce an image animation matching the semantic content of the dialogue text.

2. The method according to claim 1, characterized in that the image feature data comprises one or more of mouth feature data, limb action feature data, or facial expression feature data.

3. The method according to claim 2, characterized in that the digital human image driving engine comprises a text-to-speech engine, and the performing semantic analysis on the dialogue text to obtain image feature data matching the semantic content of the dialogue text comprises: converting the dialogue text into speech through the text-to-speech engine to obtain speech information conforming to the character setting of the digital human; decomposing the characters in the dialogue text into phonemes through the text-to-speech engine to obtain the phoneme corresponding to each character, and recording the start time and end time of each character's phoneme in the speech information, wherein a phoneme refers to the pronunciation unit corresponding to a character; and time-aligning the speech information and the phonemes of the characters according to the start times and end times of the characters' phonemes in the speech information, to obtain the mouth feature data.

4. The method according to claim 3, characterized in that the controlling the digital human, based on the image feature data, to produce an image animation matching the semantic content of the dialogue text comprises: based on the mouth feature data, controlling the digital human to emit the speech information, and controlling the mouth of the digital human to make the mouth-shape actions corresponding to the phonemes of the characters.

5. The method according to claim 2, characterized in that the digital human image driving engine comprises a behavior reasoning engine, and the performing semantic analysis on the dialogue text to obtain image feature data matching the semantic content of the dialogue text comprises: breaking down the characters in the dialogue text through the behavior reasoning engine to obtain text segments; and selecting from a behavior tree according to the text segments to obtain the limb action feature data matching the semantic content of the dialogue text, wherein the behavior tree comprises different action information corresponding to different text segments.

6. The method according to claim 5, characterized in that the controlling the digital human, based on the image feature data, to produce an image animation matching the semantic content of the dialogue text comprises: based on the limb action feature data, controlling the limbs of the digital human to make the limb actions corresponding to the text segments.

7. The method according to claim 2, characterized in that the digital human image driving engine comprises an emotion analysis engine, and the performing semantic analysis on the dialogue text to obtain image feature data matching the semantic content of the dialogue text comprises: screening out the emotional text in the dialogue text through the emotion analysis engine, wherein the emotional text is text used to express emotion; and obtaining the facial expression feature data according to the type and emotion degree of the emotional text.

8. The method according to claim 7, characterized in that the obtaining the facial expression feature data according to the type and emotion degree of the emotional text comprises: determining the facial expression corresponding to the emotional text according to the type of the emotional text; obtaining the expression amplitude of the facial expression according to the emotion degree of the emotional text; and obtaining the facial expression feature data based on the facial expression and the expression amplitude.

9. The method according to claim 8, characterized in that the controlling the digital human, based on the image feature data, to produce an image animation matching the semantic content of the dialogue text comprises: based on the facial expression feature data, controlling the face of the digital human to make the facial expression corresponding to the emotional text.

10. The method according to any one of claims 1 to 9, characterized in that the digital human image driving engine is deployed at the vehicle side and/or in the cloud.

11. The method according to any one of claims 2 to 9, characterized in that the method further comprises: when the image feature data includes at least two of the mouth feature data, the limb action feature data, or the facial expression feature data, synchronizing the at least two kinds of feature data according to timestamps; and controlling the digital human, according to the at least two kinds of synchronized feature data, to produce an image animation matching the semantic content of the dialogue text.

12. A digital human image driving apparatus, characterized in that the apparatus comprises: an acquisition module, configured to acquire a dialogue text, wherein the dialogue text refers to the text with which a digital human responds based on user input; a parsing module, configured to perform semantic analysis on the dialogue text to obtain image feature data matching the semantic content of the dialogue text, wherein the image feature data is used to indicate image change characteristics of the digital human; and a driving module, configured to control, based on the image feature data, the digital human to produce an image animation matching the semantic content of the dialogue text.

13. A vehicle, characterized in that the vehicle comprises a display screen on which a digital human is displayed, and the vehicle drives the digital human based on the digital human image driving method according to any one of claims 1 to 11.

14. A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing at least one computer program, and the at least one computer program being loaded and executed by the processor to implement the digital human image driving method according to any one of claims 1 to 11.

15. A computer-readable storage medium, characterized in that at least one computer program is stored in the computer-readable storage medium, and the at least one computer program is loaded and executed by a processor to implement the digital human image driving method according to any one of claims 1 to 11.

16. A computer program product, characterized in that the computer program product comprises a computer program or instructions which, when executed by a processor, implement the digital human image driving method according to any one of claims 1 to 11.
CN202510863538.3A 2025-06-25 2025-06-25 Digital human image driving method, device, equipment, storage medium and product Pending CN120707709A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510863538.3A CN120707709A (en) 2025-06-25 2025-06-25 Digital human image driving method, device, equipment, storage medium and product


Publications (1)

Publication Number Publication Date
CN120707709A true CN120707709A (en) 2025-09-26

Family

ID=97121150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510863538.3A Pending CN120707709A (en) 2025-06-25 2025-06-25 Digital human image driving method, device, equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN120707709A (en)

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN114357135B (en) Interaction method, interaction device, electronic equipment and storage medium
CN106653052B (en) Virtual human face animation generation method and device
EP1269465B1 (en) Character animation
KR102116309B1 (en) Synchronization animation output system of virtual characters and text
JP2022518721A (en) Real-time generation of utterance animation
CN108492817A (en) A kind of song data processing method and performance interactive system based on virtual idol
Albrecht et al. Automatic generation of non-verbal facial expressions from speech
KR101089184B1 (en) Character utterance and emotion expression system and method
CN116597858A (en) Speech lip-shape matching method, device, storage medium and electronic equipment
CN114173188B (en) Video generation method, electronic device, storage medium and digital person server
CN108052250A (en) Virtual idol deductive data processing method and system based on multi-modal interaction
CN117523088A (en) Personalized three-dimensional digital human holographic interaction forming system and method
WO2022242706A1 (en) Multimodal based reactive response generation
US20040054519A1 (en) Language processing apparatus
CN117174067A (en) Speech processing method, device, electronic equipment and computer-readable medium
CN119441436A (en) A method and system for training intelligent digital human based on multimodal interaction
CN118807208A (en) An intelligent NPC system that interacts with players
US20250384537A1 (en) System and method for an audio-visual avatar evaluation
US20250265756A1 (en) Method for generating audio-based animation with controllable emotion values and electronic device for performing the same.
CN120707709A (en) Digital human image driving method, device, equipment, storage medium and product
CN115730048A (en) A session processing method, device, electronic equipment and readable storage medium
CN112907706A (en) Multi-mode-based sound-driven animation video generation method, device and system
CN119763546A (en) Speech synthesis method, system, electronic device and storage medium
CN119131205A (en) Somatosensory dance game method for generating virtual characters based on user timbre

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination