
CN119399371A - Personalized audio-visual fusion intelligent interaction system based on multimodal data and its use method - Google Patents


Info

Publication number
CN119399371A
CN119399371A (application CN202411491891.5A)
Authority
CN
China
Prior art keywords
target user
module
audio
interaction
virtual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411491891.5A
Other languages
Chinese (zh)
Inventor
薛佳晖
(Second inventor's name withheld upon request)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gansu North Guorun Technology Co ltd
Original Assignee
Gansu North Guorun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gansu North Guorun Technology Co ltd filed Critical Gansu North Guorun Technology Co ltd
Priority to CN202411491891.5A
Publication of CN119399371A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T2200/00 Indexing scheme for image data processing or generation, in general
    • G06T2200/04 Indexing scheme for image data processing or generation, in general involving 3D image data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract


The present application discloses a personalized audio-visual fusion intelligent interaction system based on multi-modal data and a method of using the same. The system includes a multi-modal data receiving module, a facial feature extraction module, a three-dimensional model generation module, an audio feature extraction module, a virtual sound generation module, an avatar integration module and an interaction module. A user uploads pictures, video and audio; the system extracts facial and audio features, generates an avatar and a virtual sound, and integrates and adapts them. The interaction module displays the avatar, the user inputs dialogue content, the system generates replies through an algorithm, and the avatar delivers the replies as combined video and audio interaction. The system realizes avatar and sound generation and interaction based on multi-modal data, thereby providing a highly personalized and intelligent user experience. It not only generates highly realistic avatars and sounds, but also achieves natural and smooth user interaction through the intelligent interaction module, greatly improving the quality of the user experience and user satisfaction.

Description

Personalized audio-visual fusion intelligent interaction system based on multi-modal data and method of using the same
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a personalized audio-visual fusion intelligent interaction system based on multi-modal data and a method of using the same.
Background
In the current digital age, avatar generation and interaction technology has become a bridge connecting reality and the digital world, providing an unprecedented immersive experience for users. However, conventional avatar generation and interaction systems mostly rely on predefined templates and static rules, which greatly limit the degree of personalization and the level of intelligence of the system. When using such systems, users often find that the avatars lack distinctive personal characteristics and that the interaction feels mechanical and monotonous, so the demand for highly realistic, natural and personalized interaction experiences is difficult to satisfy.
With the rapid development of deep learning, computer vision, speech recognition, natural language processing and related fields, the precision and efficiency of image and audio processing technology have improved remarkably, providing strong technical support for innovation in avatar generation and interaction technology. In particular, the application of deep learning algorithms enables a system to capture a user's facial features, voice characteristics and even emotional changes more accurately, laying a solid foundation for generating highly personalized avatars. Therefore, developing a new system based on deep learning technology and multi-modal fusion interaction not only has important technical value, but also brings a brand-new digital experience to users and promotes the rapid development of fields such as virtual reality and artificial intelligence.
Disclosure of Invention
By providing a personalized audio-visual fusion intelligent interaction system based on multi-modal data and a method of using the same, the embodiments of the present application address the problems that the prior art lacks sufficient personal characteristics and struggles to meet users' demands for highly realistic, natural and personalized interaction experiences.
In order to achieve the above object, the technical solutions of the embodiments of the present invention are as follows:
An embodiment of the invention provides a personalized audio-visual fusion intelligent interaction system based on multi-modal data, comprising a multi-modal data receiving module, a facial feature extraction module, a three-dimensional model generation module, an audio feature extraction module, a virtual sound generation module, an avatar integration module and an interaction module. The multi-modal data receiving module is used for receiving multi-modal data uploaded by a first target user, the multi-modal data at least comprising pictures, videos and audio files containing portrait information and audio information of the first target user. The facial feature extraction module is used for performing facial recognition and analysis on the first target user according to the multi-modal data to obtain facial feature data of the first target user, the facial feature data at least comprising the facial contour and the shapes of the facial features of the first target user. The three-dimensional model generation module is used for generating a corresponding three-dimensional face model according to the extracted facial feature data of the first target user and performing detail optimization and rendering to generate an avatar of the first target user. The audio feature extraction module is used for extracting voice features of the first target user, the voice features at least comprising the pitch, timbre and speech rate of the first target user. The virtual sound generation module is used for generating a virtual sound matching the original voice of the first target user based on the voice features of the first target user. The avatar integration module is used for integrating and matching the avatar of the first target user with the virtual sound so that the mouth shape, expression and audio of the avatar remain synchronized during video interaction. The interaction module is used for displaying the avatar through an interaction interface, receiving dialogue content input by a second target user through the interaction interface, generating reply content from the dialogue content through a preset intelligent algorithm, and, via the avatar integration module, delivering vivid and natural interactive feedback on the reply content through the avatar in a combined video and audio form.
In some possible implementations, the multi-modal data receiving module further has data preprocessing functions, including but not limited to data compression, denoising and format conversion, so as to improve data processing efficiency and accuracy.
In some possible implementations, the system further includes a personalized customization module for adjusting appearance characteristics of the avatar and/or the pitch, timbre and speech rate of the virtual sound according to the personal preferences of the first target user, so as to achieve a more personalized avatar generation and interaction experience, wherein the appearance characteristics include, but are not limited to, skin tone, hairstyle, clothing and accessories.
In some possible implementations, the system further includes an emotion recognition and response module configured to adjust the pitch, timbre and speech rate parameters of the virtual sound according to emotional changes in the first target user's voice using emotion recognition techniques, so as to enrich the emotional expressiveness of the virtual sound.
In a second aspect, an embodiment of the invention provides a method for using the personalized audio-visual fusion intelligent interaction system based on multi-modal data of the first aspect. The method comprises: a first target user uploads, through an interaction interface, pictures, videos and audio files containing portrait information and audio information of the first target user; facial features of the first target user are extracted from the uploaded pictures and/or videos; a three-dimensional face model is generated from the extracted facial features and optimized to obtain an avatar of the first target user; voice features of the first target user are extracted from the uploaded audio and/or video; a virtual sound of the first target user is generated based on the voice features; the avatar and the virtual sound are integrated and adapted; the avatar is displayed, input from a second target user is received, reply content is generated through a preset intelligent algorithm, and the reply content is fed back through the avatar as video and audio interaction.
In some possible implementations, after the multi-modal data uploaded by the first target user is obtained, the method further includes preprocessing the multi-modal data, including but not limited to data compression, denoising and format conversion, so as to improve data processing efficiency and accuracy.
In some possible implementations, after obtaining the avatar of the first target user, the method further includes adjusting appearance characteristics of the avatar and/or pitch, timbre, speed of the virtual sound according to personal preferences of the first target user to achieve a more personalized avatar generation and interaction experience, wherein the appearance characteristics include, but are not limited to, skin tone, hairstyle, clothing, and accessories.
In some possible implementations, after generating the virtual sound of the first target user based on the speech characteristics, the method further includes adjusting pitch, timbre, and pace parameters of the virtual sound to enrich the virtual sound with emotion expressive power according to emotion changes in the first target user's speech using emotion recognition techniques.
One or more technical solutions provided in the embodiments of the present invention at least have the following technical effects or advantages:
The embodiments of the invention enable the generation of, and interaction with, an avatar and a virtual sound based on multi-modal data, thereby providing a highly personalized and intelligent user experience. The system not only generates a highly realistic avatar and virtual sound, but also achieves natural and smooth user interaction through the intelligent interaction module, greatly improving the quality of the user experience and user satisfaction.
Drawings
In order to more clearly illustrate the embodiments of the present invention, the drawings that are required to be used in the embodiments of the present invention will be briefly described below, and it will be apparent that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a schematic diagram of a personalized audio-visual fusion intelligent interaction system based on multi-modal data according to an embodiment of the present invention;
FIG. 2 is a flowchart of an embodiment of a method for using a personalized audio-visual fusion intelligent interaction system based on multi-modal data according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the description of the present embodiment, terms such as "comprise", "include" and "have" are open-ended and are generally and preferably understood as "including but not limited to". The term "at least one" is generally and preferably understood to mean one or more, and "a plurality" means two or more. An expression such as "at least one of a, b or c" or "at least one of a, b and c" may denote a, b, c, a and b, a and c, b and c, or a, b and c, where a, b and c may each be singular or plural. The symbol "a/b" describes an alternative relationship between associated objects and generally indicates "a or b".
In the following description of the present embodiment, the terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood by those skilled in the art that, in the following description of the present embodiment, the sequence numbers of the steps do not imply an order of execution; some or all of the steps may be executed in parallel or sequentially, and the execution order of each process should be determined by its function and internal logic and should not constitute any limitation on the implementation of the embodiments of the present application.
It will be appreciated by those skilled in the art that the numerical ranges in the embodiments of the present application are to be understood as also specifically disclosing each intermediate value between the upper and lower limits of the range. Every smaller range between any stated value or stated range, and any other stated value or intermediate value within the range, is also encompassed by the application. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.
Unless otherwise defined, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present application. All documents referred to in this specification are incorporated by reference to disclose and describe the methods and/or materials in connection with which they are cited. In case of conflict with any incorporated document, the present specification controls.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
In the current digital age, avatar generation and interaction technology has become a bridge connecting reality and the digital world, providing an unprecedented immersive experience for users. However, conventional avatar generation and interaction systems mostly rely on predefined templates and static rules, which greatly limit the degree of personalization and the level of intelligence of the system. When using such systems, users often find that the avatars lack distinctive personal characteristics and that the interaction feels mechanical and monotonous, so the demand for highly realistic, natural and personalized interaction experiences is difficult to satisfy.
With the rapid development of deep learning, computer vision, speech recognition, natural language processing and related fields, the precision and efficiency of image and audio processing technology have improved remarkably, providing strong technical support for innovation in avatar generation and interaction technology. In particular, the application of deep learning algorithms enables a system to capture a user's facial features, voice characteristics and even emotional changes more accurately, laying a solid foundation for generating highly personalized avatars.
In this context, it is particularly urgent to construct a new avatar generation and interaction system that combines image processing and audio processing. Such a system aims to break through the limitations of traditional systems and, empowered by deep learning technology, achieve intelligent generation and highly personalized customization of avatars. At the same time, through multi-modal fusion interaction technology, the system can provide a more natural, smooth and emotionally expressive interaction experience, meeting users' pursuit of greater interactivity and personalization.
Therefore, developing a new system based on deep learning technology and multi-modal fusion interaction not only has important technical value, but also brings a brand-new digital experience to users and promotes the rapid development of fields such as virtual reality and artificial intelligence.
Based on the above, the embodiments of the present application provide a personalized audio-visual fusion intelligent interaction system based on multi-modal data and a method of using the same, which address the problems that the prior art lacks sufficient personal characteristics and struggles to meet users' demands for highly realistic, natural and personalized interaction experiences.
FIG. 1 is a schematic structural diagram of a personalized audio-visual fusion intelligent interaction system based on multi-modal data according to an embodiment of the present invention. Referring to FIG. 1, the personalized audio-visual fusion intelligent interaction system 10 based on multi-modal data may include:
The multi-modal data receiving module 11, configured to receive multi-modal data uploaded by the first target user;
The multi-modal data at least comprises pictures, videos and audio files containing the portrait information and the audio information of the first target user;
In some embodiments, the picture files may contain portrait information of the first target user and are typically used for facial recognition, expression analysis or generation of a personalized avatar. A wide range of file formats is supported, such as JPEG and PNG, to accommodate different users' upload habits and device compatibility requirements. Video files are a medium that can dynamically convey user behaviour, emotion and environmental information; mainstream video formats such as MP4 can be accepted, and the video data is used for processing such as extracting image information from successive frames, motion analysis and voice synchronization. Audio files may contain key information such as the user's speech content, intonation and speech rate, which is critical for applications such as speech recognition, emotion analysis and personalized speech synthesis. Supported audio formats may include WAV, MP3 and the like, ensuring that the system can capture and parse high-quality audio data.
In some embodiments, to enhance the user experience, the multi-modal data receiving module 11 may provide an intuitive and friendly user interface that enables the user to easily upload the required multi-modal data. The user interface design may follow simple, straightforward principles and at least have the following features:
Multi-format support: the user interface can automatically identify and accept uploads in various file formats, reducing the barriers users encounter due to format inconsistencies.
Convenient upload methods: a user can upload files by dragging them to a designated area or by clicking a button to select them; both methods improve the convenience of the upload operation.
Upload progress feedback: during uploading, the system provides real-time progress feedback and gives a clear prompt after the upload is finished, so that the user knows the processing state of the data.
Security guarantees: the user interface design pays attention to the security of data transmission; encrypted transmission technology may be adopted to protect user data and prevent it from being leaked or tampered with in transit.
In some embodiments, the multi-modal data receiving module 11 is responsible for receiving and storing the multi-modal data in preparation for subsequent processing. These data need to be preprocessed (e.g., data compression, denoising, format conversion) before they can be effectively used by the other functional modules. The design of the multi-modal data receiving module 11 may therefore also provide efficient storage and fast retrieval mechanisms so that the required data resources can be accessed quickly in the processing flow, as sketched below.
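As a minimal illustration of the preprocessing step described above, the sketch below resizes an uploaded picture, converts it to a common colour mode, and normalizes and lightly denoises an audio clip. The function names, target size, and median-filter width are assumptions chosen for illustration rather than parameters disclosed in this application.

```python
import numpy as np
from PIL import Image
from scipy.signal import medfilt

def preprocess_image(path: str, size=(512, 512)) -> np.ndarray:
    """Load a picture, normalize its size and colour mode (hypothetical defaults)."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0  # scale pixel values to [0, 1]

def preprocess_audio(samples: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Peak-normalize the waveform and apply a median filter as light denoising."""
    samples = samples.astype(np.float32)
    peak = float(np.max(np.abs(samples))) or 1.0
    samples = samples / peak                           # amplitude normalization
    return medfilt(samples, kernel_size=kernel_size)   # crude noise suppression
```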
The facial feature extraction module 12 is used for performing facial recognition and analysis on the first target user according to the multi-modal data to obtain facial feature data of the first target user, wherein the facial feature data at least comprises the facial contour and the shapes of the facial features (eyes, eyebrows, nose, mouth, etc.) of the first target user;
After the multi-modal data uploaded by the first target user is obtained, the facial feature extraction module 12 may preprocess the data. This step includes, but is not limited to, image enhancement (e.g., brightness adjustment, contrast optimization), noise filtering (removal of image noise using Gaussian filtering, median filtering, etc.) and image size normalization, to ensure consistency and high quality of the input data. A face detection algorithm, such as a Haar feature classifier or a deep-learning-based detection network, is then used to quickly locate the face region in the preprocessed picture or video frame. This serves as the basis for subsequent feature extraction and greatly reduces interference from non-facial areas. Once the face region is successfully detected, the system further identifies and marks key feature points of the face, including the facial contour and landmarks such as the corners of the eyes, the corners of the mouth and the tip of the nose.
Feature extraction is then performed with a convolutional neural network (CNN). A CNN model performs deep feature extraction on the marked face region: through multiple layers of convolution, pooling and activation, it automatically learns and abstracts high-level representations of the face, such as texture, shape and contour. These features capture not only the macroscopic facial contour but also subtle differences in the shapes of the facial features. The facial features extracted by the CNN are integrated to form a complete facial feature data set for the first target user, which includes, but is not limited to, an accurate description of the facial contour, specific parameters of the facial feature shapes, and possibly other high-level features such as skin texture and expression characteristics.
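The following sketch illustrates the two stages just described: a Haar-cascade detector locates the face region, and a small convolutional network maps the cropped face to a feature vector. The layer layout and the 128-dimensional output are illustrative assumptions; the application names the technique families but not a specific architecture.

```python
import cv2
import torch
import torch.nn as nn

# Stage 1: locate the face region with an OpenCV Haar cascade (shipped with OpenCV).
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(gray_image):
    """Return the first detected face box (x, y, w, h), or None if no face is found."""
    faces = cascade.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5)
    return faces[0] if len(faces) else None

# Stage 2: a small CNN that abstracts a cropped face into a feature vector.
class FaceFeatureCNN(nn.Module):
    def __init__(self, feature_dim: int = 128):   # 128-d embedding is an assumption
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, feature_dim)

    def forward(self, x):                          # x: (batch, 3, H, W) face crops
        return self.head(self.backbone(x).flatten(1))
```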
The three-dimensional model generating module 13 is configured to generate a corresponding three-dimensional face model according to the extracted facial feature data of the first target user, and perform detail optimization and rendering to generate an avatar of the first target user;
The three-dimensional model generation module 13 is a key element for constructing a personalized avatar, and creates a highly realistic and detailed three-dimensional face model based on detailed facial feature data of the first target user acquired from the facial feature extraction module.
Specifically, after the facial feature data of the first target user is acquired, the three-dimensional model generation module 13 first analyses the received data to confirm the accuracy of each feature point and its relative position in space, providing precise guidance for subsequent three-dimensional modelling. Three-dimensional space mapping and geometric structure creation are then carried out: the two-dimensional facial feature points are mapped into three-dimensional space by an algorithm to form the basic outline of the face and the preliminary positions of the facial features. This step is the basis for constructing the three-dimensional model and determines its accuracy and realism. Based on the results of the spatial mapping, a three-dimensional geometry of the face is created using principles of computational geometry and computer graphics, which includes defining the curved surfaces, edges and connection relationships between the parts of the face, forming a preliminary three-dimensional model framework.
A mesh generation algorithm (e.g., Delaunay triangulation, Marching Cubes) may then be employed to generate a dense mesh on the three-dimensional geometry. These meshes are the basic units that make up the three-dimensional model, and their density and quality directly affect how well the model expresses detail. The generated mesh is then refined: smoothing and edge-enhancement techniques make it smoother and more natural while preserving important facial details. Texture mapping is used to apply texture information such as skin and hair to the three-dimensional model; this includes selecting an appropriate texture image, adjusting the texture mapping (e.g., UV mapping), and performing texture fusion and transition processing to ensure that the texture fits the model surface. Further detail optimization of the model may then be performed, including adjusting the proportions and shapes of the facial features, enhancing facial expression characteristics, and adding personalized details such as moles or wrinkles. These optimization measures aim to improve the fidelity and individuality of the model.
Finally, advanced rendering techniques (e.g., ray tracing, global illumination, etc.) may be employed to render the three-dimensional model. By simulating illumination and shadow effects in the real world, the model presents a more realistic and vivid visual effect. And outputting the rendered three-dimensional model in a proper format for the subsequent application module. The output formats may include, but are not limited to, 3D model files (e.g., OBJ, FBX, etc.), scene files in a real-time rendering engine, compression formats for network transmission, etc.
In the embodiment of the present invention, through a series of precise operations such as analysis of the facial feature data, three-dimensional space mapping, geometric structure creation, mesh generation and refinement, texture mapping and detail optimization, and rendering and output, the three-dimensional model generation module 13 can generate a highly realistic, richly detailed three-dimensional avatar of the first target user.
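As a simplified illustration of the mesh-generation step, the sketch below lifts 2D landmark points into 3D with an assumed per-landmark depth and triangulates them using Delaunay triangulation from SciPy. A production pipeline would instead fit a dense parametric face model; the depth values here are placeholders.

```python
import numpy as np
from scipy.spatial import Delaunay

def build_face_mesh(landmarks_2d: np.ndarray, depths: np.ndarray):
    """landmarks_2d: (N, 2) image-plane points; depths: (N,) assumed per-point depth.

    Returns (vertices, faces): (N, 3) vertex positions and (M, 3) triangle indices.
    """
    vertices = np.column_stack([landmarks_2d, depths])    # lift points to 3D
    faces = Delaunay(landmarks_2d).simplices               # triangulate in the 2D plane
    return vertices, faces

# Example: a coarse grid of landmarks with a central bulge as a fake depth profile.
xs, ys = np.meshgrid(np.linspace(0, 1, 8), np.linspace(0, 1, 8))
pts = np.column_stack([xs.ravel(), ys.ravel()])
z = 0.2 * np.exp(-((pts - 0.5) ** 2).sum(axis=1) / 0.1)    # placeholder depth values
verts, tris = build_face_mesh(pts, z)
```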
The audio feature extraction module 14, configured to extract, from the multi-modal data, voice features of the first target user, where the voice features include at least pitch, timbre and speech rate;
the audio feature extraction module 14 may obtain audio contained within the audio files and video files uploaded by the user, which may contain the user's natural speech for subsequent feature extraction. To ensure the quality of the extracted speech features, the system may preprocess the received audio signal. This includes, but is not limited to, noise reduction processing to reduce interference of background noise with feature extraction, and normalization processing to adjust the amplitude of the audio signal to a uniform range for subsequent processing.
The pre-processed audio signal is divided into smaller frames, for example, in tens of milliseconds, before the speech feature extraction, and because many characteristics of speech (such as pitch, timbre) are stable in a short time, frame-by-frame analysis can be performed. Prior to formally using the deep learning algorithm, some conventional signal processing methods may be applied to initially screen out information that may be helpful for subsequent feature extraction, such as energy envelope, zero-crossing rate, etc. Then, in view of the advantages of long-term memory networks (LSTM) in processing sequence data, LSTM algorithms may be selected to extract key features in audio, by which long-term dependencies in the audio signal can be captured. In LSTM networks, a sequence of audio frames is taken as input, and the network extracts key audio features such as pitch (tone), timbre (timbre feature), and intensity (volume change) by learning the time dependence between audio frames. These features not only reflect the acoustic properties of the user's speech, but also contain the dynamically changing information of the speech. In the feature extraction process, some optimization strategies, such as feature selection, can be applied to select features most useful for virtual sound generation, feature dimension reduction, feature quantity reduction to improve processing efficiency, and the like, so as to further improve quality and effectiveness of the features. The extracted audio features are collated and output as basic data for the subsequent virtual sound generation module. These features, which include the uniqueness of the user's speech, are key to generating personalized virtual sounds.
The virtual sound generating module 15 is configured to generate a virtual sound matching the original sound of the first target user based on the voice feature of the first target user.
The virtual sound generation module 15 first receives audio feature data from the audio feature extraction module 14, including but not limited to key sound attributes such as pitch, timbre and speech rate. It then analyses the received feature data to understand the sound characteristics each type of feature represents and the role it plays in generating the virtual sound.
Based on the results of the feature analysis, the speech synthesis technique best suited to the current audio features may be selected. Modern speech synthesis techniques are diverse and include, but are not limited to, hidden Markov model (HMM) based synthesis, concatenative synthesis and neural network synthesis (e.g., WaveNet, Tacotron). To ensure that the generated virtual sound matches the user's original voice, a speech synthesis model may be selected or trained that already has, or can learn, the characteristics of the user's specific voice.
The analysed audio features are then used as input to drive the selected speech synthesis model to generate the virtual sound. During generation, the parameters of the synthesis model can be fine-tuned as needed to ensure that the generated virtual sound matches the user's original voice in detail. Beyond accuracy, attention may also be paid to the fluency and naturalness of the generated virtual sound: synthesis artifacts are reduced by optimizing the model or applying post-processing, so that the virtual sound sounds more natural.
The generated virtual sound can be checked by an automatic quality evaluation system to assess whether its timbre, pitch and speech rate are consistent with the extracted audio features, as well as its overall naturalness and fluency. In some cases, the module may also invite human listeners to evaluate the generated virtual sound to provide more subjective feedback. Based on the results of automatic evaluation and human feedback, the internal algorithms and model parameters can be continuously optimized, improving the quality and effect of virtual sound generation. After quality assessment, the virtual sound is output for subsequent use.
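The sketch below shows one plausible shape of this module: a speaker embedding derived from the extracted voice features conditions a text-to-speech model, and simple post-processing normalizes the waveform. The TTSModel interface is a placeholder, since the application names families of techniques (HMM, concatenative, WaveNet/Tacotron-style) rather than a concrete implementation.

```python
import numpy as np

class TTSModel:
    """Placeholder for any speaker-conditioned synthesizer (e.g. a Tacotron-style model)."""
    def synthesize(self, text: str, speaker_embedding: np.ndarray) -> np.ndarray:
        raise NotImplementedError("plug in a concrete synthesizer here")

def generate_virtual_sound(model: TTSModel, text: str,
                           voice_features: np.ndarray) -> np.ndarray:
    """Condition synthesis on the user's voice features and lightly post-process."""
    waveform = model.synthesize(text, speaker_embedding=voice_features)
    # Post-processing: peak-normalize so playback level is consistent across replies.
    peak = np.max(np.abs(waveform))
    return waveform / peak if peak > 0 else waveform
```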
An avatar integration module 16 for integrating and matching the avatar of the first target user with the virtual sound so as to synchronize the mouth shape, expression and audio of the avatar in the video interaction;
The avatar integration module 16 is responsible for perfectly integrating the avatar of the first target user with the virtual sound in constructing the personalized audio-visual integration intelligent interaction experience, and ensures that the mouth shape, expression and audio output of the avatar are highly synchronized in the video interaction process. The module may first load the avatar of the first target user that has been created or specified, including visual elements of the face appearance, hairstyle, etc. of the avatar. At the same time, the module receives virtual sound data from the virtual sound generation module 15, which data contains the user's speech content and its corresponding audio features.
To ensure synchronization of the avatar with the audio, the avatar integration module 16 needs to record time stamps for the audio data so that each audio frame can later be matched precisely to the avatar's actions. Based on the results of audio feature analysis, the speech content in the audio is mapped to the avatar's mouth movements using audio-driven facial animation techniques. This includes adjusting the shape and degree of opening and closing of the avatar's lips according to syllables and words, so as to achieve accurate lip synchronization. Beyond lip synchronization, the avatar's expression can also be adjusted according to the emotional tone of the audio (such as happiness, sadness or surprise), so that it better matches the emotion expressed by the audio content. All facial animation adjustments are made in real time, ensuring that the avatar's mouth shape, expression and the audio content remain synchronized throughout the video interaction.
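A toy version of the audio-driven mouth-shape mapping: short-time energy of the virtual sound is converted into a per-frame mouth-openness value (e.g. a blendshape weight) at the video frame rate. The frame rate and smoothing constant are assumptions; real systems typically map phonemes or visemes rather than raw energy.

```python
import numpy as np

def mouth_openness(waveform: np.ndarray, sample_rate: int = 16000,
                   fps: int = 25, smoothing: float = 0.6) -> np.ndarray:
    """Return one mouth-openness value in [0, 1] per video frame."""
    samples_per_frame = sample_rate // fps
    n_frames = len(waveform) // samples_per_frame
    frames = waveform[: n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)
    energy = np.sqrt((frames ** 2).mean(axis=1))          # RMS energy per video frame
    energy = energy / (energy.max() + 1e-8)               # normalize to [0, 1]
    # Exponential smoothing so the mouth does not flicker from frame to frame.
    out = np.zeros_like(energy)
    for i, e in enumerate(energy):
        out[i] = e if i == 0 else smoothing * out[i - 1] + (1 - smoothing) * e
    return out
```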
Building on this expression adjustment, in some embodiments the system may further include an emotion recognition and response module configured to adjust the pitch, timbre and speech rate parameters of the virtual sound according to emotional changes in the first target user's voice using emotion recognition techniques, so that the virtual sound has emotional expressiveness.
In some embodiments, to enhance the realism of the interaction, the avatar integration module 16 may continuously optimize the smoothness and naturalness of the avatar's movements, reducing the mechanical or unnatural appearance. By collecting user feedback, the module can continuously adjust the optimization algorithm so as to adapt to the preferences and requirements of different users and promote the overall user experience.
After all the steps are completed, the module outputs the integrated virtual image and the virtual sound to the video interaction platform together for the user to interact in real time.
In some embodiments, to expand the scope of application, the avatar integration module 16 may also ensure that the generated video interactive content is able to be played smoothly on a variety of platforms and devices, including PCs, mobile devices, VR/AR devices, and the like.
In some embodiments, to meet the needs of different users, the system may further include a personalized customization module.
The personalized customization module is used for adjusting the appearance characteristics of the avatar and/or the pitch, timbre and speech rate of the virtual sound according to the personal preferences of the first target user, so as to achieve a more personalized avatar generation and interaction experience, wherein the appearance characteristics include, but are not limited to, skin tone, hairstyle, clothing and accessories.
For example, the user may select or fine-tune the avatar's skin tone based on their own skin tone or preference. The system provides a variety of skin tone options, supporting continuous variation from light to dark and possibly even special effects such as a sun-tanned or healthy wheat complexion. The hairstyle is an important way to express individuality, and the personalized customization module allows the user to select a preferred hairstyle from a preset hairstyle library, including short hair, long hair, curly hair, straight hair and other styles. In addition, the user can adjust the shade and glossiness of the hair colour, and even add hair accessories such as hair clips and hair bands. Likewise, the user can adjust the pitch of the virtual sound according to their preference or needs: a deep, magnetic voice, a crisp and pleasant female voice, or even a special sound effect can all be achieved by adjusting the pitch. The timbre affects the texture and expressiveness of the voice, and the personalized customization module allows the user to select or fine-tune the timbre of the virtual sound, such as warm, cool, sweet or mature, to suit different communication scenarios and emotional expressions. The speech rate also affects the comfort and efficiency of communication; the user can adjust the speech rate of the virtual sound according to their own speaking habits or the needs of the communication partner.
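One simple way to represent the customization options just described is a pair of configuration records whose fields are clamped to safe ranges before being applied to the avatar and the virtual sound. The field names and numeric ranges are illustrative assumptions, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class AppearanceConfig:
    skin_tone: float = 0.5      # 0 = lightest, 1 = darkest (assumed scale)
    hairstyle: str = "short"
    clothing: str = "casual"
    accessory: str = "none"

@dataclass
class VoiceConfig:
    pitch_shift: float = 0.0    # semitones relative to the cloned voice
    timbre_warmth: float = 0.5  # 0 = cool, 1 = warm (assumed scale)
    speech_rate: float = 1.0    # 1.0 = the user's natural rate

def clamp_voice(cfg: VoiceConfig) -> VoiceConfig:
    """Keep user-chosen values inside ranges the synthesizer can honour."""
    cfg.pitch_shift = max(-6.0, min(6.0, cfg.pitch_shift))
    cfg.timbre_warmth = max(0.0, min(1.0, cfg.timbre_warmth))
    cfg.speech_rate = max(0.5, min(2.0, cfg.speech_rate))
    return cfg
```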
The interaction module 17 is configured to display the avatar through the interaction interface, receive the dialogue content input by the second target user through the interaction interface, generate reply content from the dialogue content through a preset intelligent algorithm, and, via the avatar integration module, deliver vivid and natural interactive feedback on the reply content through the avatar in a combined video and audio form.
In the embodiment of the invention, the interaction module 17 is a core component for constructing intelligent and interactive experience, integrates a plurality of key functions such as user interface display, natural language processing, intelligent reply generation, video and audio integrated feedback and the like, and provides a vivid and natural dialogue communication platform between a user and an avatar.
The interaction module 17 first presents the generated avatar of the first target user through a user interface (UI), ensuring that the user can intuitively see and interact with it. On the interface, the avatar may be presented as a high-fidelity image or 3D model, including details of its appearance and expression, giving the user a communication experience similar to interacting with a real person.
The second target user may enter the dialogue content in a number of ways, including:
Text input: the second target user types the dialogue content they wish to express into a text box on the interface.
Voice input: the second target user can also speak directly (for example after a voice wake-up), and the interaction module 17 recognizes the speech through the built-in speech recognition module and converts it into the corresponding text content.
Upon receiving the dialogue content entered by the second target user, the interaction module 17 may immediately invoke a natural language processing (NLP) engine for analysis. NLP technology understands the meaning, context and emotion of the user's input, laying the foundation for generating a reasonable reply. Based on the results of the NLP analysis, the system can understand the semantic information of the user's input in depth, including topic, intention and emotion, and then generate corresponding reply content using a preset intelligent algorithm (such as a machine learning model or a dialogue management system). The reply content not only conforms to language conventions but also responds accurately to the user's needs and emotions.
The generated text reply may be passed to the virtual sound generation module 15 and converted into a realistic audio signal by speech synthesis techniques. At the same time, the reply content is passed to the avatar integration module 16 to drive the avatar's facial animation and body language. The avatar integration module 16 adjusts the avatar's facial expressions and mouth movements according to the reply content, ensuring that they are synchronized with the audio signal. Thus, when the avatar speaks, its mouth shape, expression and voice content are perfectly matched, presenting a natural and smooth dialogue effect. Finally, the integrated video and audio signals are presented to the user through the user interface: the user sees the avatar respond vividly to their input and hears the matching voice reply, obtaining an immersive dialogue experience.
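The following sketch ties the interaction module's steps together as a single turn of dialogue: text (or recognized speech) goes to a reply generator, the reply text is synthesized into audio, and the audio drives the avatar's mouth animation. Every component here is a stub standing in for the "preset intelligent algorithm" and the modules described above.

```python
from typing import Callable
import numpy as np

def dialogue_turn(user_text: str,
                  generate_reply: Callable[[str], str],
                  synthesize: Callable[[str], np.ndarray],
                  animate: Callable[[np.ndarray], np.ndarray]) -> dict:
    """One interaction turn: reply text -> virtual sound -> mouth-shape track."""
    reply_text = generate_reply(user_text)        # preset intelligent algorithm (stub)
    audio = synthesize(reply_text)                # virtual sound generation module
    mouth_track = animate(audio)                  # avatar integration module (lip sync)
    return {"reply_text": reply_text, "audio": audio, "mouth_track": mouth_track}

# Example wiring with trivial stand-ins:
result = dialogue_turn(
    "Hello!",
    generate_reply=lambda t: f"You said: {t}",
    synthesize=lambda t: np.zeros(16000, dtype=np.float32),   # 1 s of silence
    animate=lambda a: np.zeros(25, dtype=np.float32),          # 25 closed-mouth frames
)
```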
In some embodiments, the entire dialog process for interaction with the avatar may be performed in real-time and continuously, with feedback from the avatar immediately upon user input, and may continue multiple dialogs. Such instantaneity can enhance the fluency and interactivity of the dialog.
In some embodiments, the interaction module 17 may also have dialog management capabilities that track the history of the dialog, understand the context information, and may take these factors into account when generating the reply, enabling the dialog to be more coherent and meaningful.
In some embodiments, the second target user may be the same user as the first target user.
In the embodiment of the invention, the system not only provides a personalized experience of interacting with the avatar, but also further enhances the user's self-expression and sense of immersion. When the second target user is the same person as the first target user, the user can engage in self-reflection and introspection through a dialogue with the avatar. The avatar is a customizable dialogue partner with personalized features that can guide the user into a deeper discussion of their ideas, emotions and behaviours. This form of self-dialogue helps the user better understand themselves and discover potential problems or strengths, and thereby make more informed decisions or adjust their behavioural strategies.
In this mode, the user can treat the avatar as their own learning partner or teacher. By customizing particular appearance and voice characteristics for the avatar, the user can create an ideal learning environment for themselves. For example, the user may set the avatar up as an expert in a certain field and obtain expertise or skill guidance by talking with it. In addition, the system can dynamically adjust the avatar's reply content and teaching style according to the user's learning progress and feedback, to achieve a more personalized learning experience. For users who enjoy creating or expressing themselves, a dialogue with a customized avatar can be a way to spark inspiration: the user can explore new ideas, storylines or artistic forms through interactions with the avatar, and can continually adjust and refine the avatar's design using the system's personalized customization functions to better match their creative style and themes. This creative interaction process not only helps improve the user's creative ability, but also brings great pleasure and a sense of accomplishment.
The personalized audio-visual fusion intelligent interaction system based on multi-modal data described above can realize the generation of, and interaction with, an avatar and a virtual sound based on multi-modal data, thereby providing a highly personalized and intelligent user experience. The system not only generates a highly realistic avatar and virtual sound, but also achieves natural and smooth user interaction through the intelligent interaction module, greatly improving the quality of the user experience and user satisfaction.
Based on the same inventive concept, an embodiment of the application also provides a method for using the personalized audio-visual fusion intelligent interaction system based on multi-modal data. FIG. 2 is a flowchart of an embodiment of such a method; referring to FIG. 2, the method may include:
S201, a first target user uploads multi-modal data through an interaction interface, wherein the multi-modal data at least comprises pictures, videos and audio files containing portrait information and audio information of the first target user;
S202, extracting facial features of the face of the first target user in the uploaded pictures and/or videos;
S203, generating a three-dimensional face model according to the extracted facial features and optimizing the three-dimensional face model to obtain an avatar of the first target user;
S204, extracting voice features of the uploaded audio and/or video to obtain the voice features of the first target user;
S205, generating a virtual sound of the first target user based on the voice features;
S206, integrating and adapting the avatar and the virtual sound;
S207, displaying the avatar, receiving a second target user input, generating reply content through a preset intelligent algorithm, and carrying out video and audio interaction feedback on the reply content through the avatar.
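Read end to end, steps S201 to S207 form a simple pipeline; the sketch below strings hypothetical module interfaces together in that order to make the data flow explicit. All function arguments are placeholders for the modules described in the system embodiment, not an implementation disclosed here.

```python
def run_pipeline(images, videos, audio_files, receive, extract_face, build_avatar,
                 extract_voice, build_voice, integrate, interact):
    """S201-S207 as one pass; each callable stands in for the corresponding module."""
    data = receive(images, videos, audio_files)             # S201 upload + preprocessing
    face_features = extract_face(data)                       # S202 facial features
    avatar = build_avatar(face_features)                     # S203 3D model -> avatar
    voice_features = extract_voice(data)                     # S204 voice features
    virtual_sound = build_voice(voice_features)              # S205 virtual sound
    bound_avatar = integrate(avatar, virtual_sound)          # S206 integration/adaptation
    return interact(bound_avatar)                            # S207 display + dialogue loop
```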
In some possible implementations, after the multi-modal data uploaded by the first target user is obtained, the method further includes preprocessing the multi-modal data, including but not limited to data compression, denoising and format conversion, so as to improve data processing efficiency and accuracy.
In some possible implementations, after obtaining the avatar of the first target user, the method further includes adjusting appearance characteristics of the avatar and/or pitch, timbre, speed of the virtual sound according to personal preferences of the first target user to achieve a more personalized avatar generation and interaction experience, wherein the appearance characteristics include, but are not limited to, skin tone, hairstyle, clothing, and accessories.
In some possible implementations, after generating the virtual sound of the first target user based on the speech characteristics, the method further includes adjusting pitch, timbre, and pace parameters of the virtual sound to enrich the virtual sound with emotion expressive power according to emotion changes in the first target user's speech using emotion recognition techniques.
In this specification, each embodiment is described in a progressive manner, and the same or similar parts of each embodiment are referred to each other, and each embodiment is mainly described as a difference from other embodiments.
The foregoing embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some or all of their technical features, without departing from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (5)

1. A personalized audio-visual fusion intelligent interaction system based on multi-modal data is characterized by comprising:
a multi-modal data receiving module, configured to receive multi-modal data uploaded by a first target user, wherein the multi-modal data at least comprises pictures, videos and audio files containing portrait information and audio information of the first target user;
The facial feature extraction module is used for carrying out facial recognition and analysis on the first target user according to the multi-modal data to obtain facial feature data of the first target user, wherein the facial feature data at least comprises the facial contour and the shapes of the facial features of the first target user;
the three-dimensional model generation module is used for generating a corresponding three-dimensional face model according to the extracted facial feature data of the first target user, performing detail optimization and rendering, and generating an avatar of the first target user;
the audio feature extraction module is used for extracting voice features of the first target user from the multi-modal data, wherein the voice features at least comprise pitch, timbre and speech rate;
the virtual sound generation module is used for generating virtual sound matched with the original sound of the first target user based on the voice characteristics of the first target user;
The virtual image integration module is used for integrating and matching the virtual image of the first target user with the virtual sound so as to synchronize the mouth shape, expression and audio of the virtual image in video interaction;
And the interaction module is used for displaying the virtual image through an interaction interface, receiving dialogue content input by a second target user through the interaction interface, generating reply content through a preset intelligent algorithm according to the dialogue content, and carrying out vivid and natural interaction feedback on the reply content through the virtual image integration module in a video and audio integration mode.
2. The intelligent interactive system according to claim 1, wherein the multi-modal data receiving module further comprises data preprocessing functions including, but not limited to, data compression, denoising, format conversion, to improve data processing efficiency and accuracy.
3. The intelligent interactive system according to claim 2, wherein the system further comprises:
and the personalized customization module is used for adjusting the appearance characteristics of the virtual image and/or the tone, tone color and speech speed of the virtual sound according to the personal preference of the first target user so as to realize more personalized virtual image generation and interaction experience, wherein the appearance characteristics comprise but are not limited to skin color, hairstyle, clothing and accessories.
4. A smart interactive system according to claim 3, wherein the system further comprises:
and the emotion recognition and response module is used for adjusting the pitch, tone color and speech speed parameters of the virtual sound according to the emotion change in the first target user voice by using an emotion recognition technology so that the virtual sound has emotion expressive force.
5. A method for using a personalized audio-visual fusion intelligent interaction system based on multi-modal data, which is applied to the personalized audio-visual fusion intelligent interaction system based on multi-modal data as set forth in any one of claims 1 to 4, and is characterized by comprising the following steps:
Uploading multi-mode data by a first target user through an interactive interface, wherein the multi-mode data at least comprises pictures, videos and audio files containing portrait information and audio information of the first target user;
Extracting facial features of the face of the first target user in the uploaded image and/or video;
generating a three-dimensional face model according to the extracted facial features and optimizing the three-dimensional face model to obtain a virtual image of the first target user;
Extracting voice characteristics of the uploaded audio and/or video to obtain the voice characteristics of the first target user;
Generating a virtual sound of the first target user based on the speech features;
integrating and adapting the virtual image and the virtual sound;
And displaying the virtual image, receiving a second target user input, generating reply content through a preset intelligent algorithm, and carrying out video and audio interaction feedback on the reply content through the virtual image.
CN202411491891.5A 2024-10-24 2024-10-24 Personalized audio-visual fusion intelligent interaction system based on multimodal data and its use method Pending CN119399371A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411491891.5A CN119399371A (en) 2024-10-24 2024-10-24 Personalized audio-visual fusion intelligent interaction system based on multimodal data and its use method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411491891.5A CN119399371A (en) 2024-10-24 2024-10-24 Personalized audio-visual fusion intelligent interaction system based on multimodal data and its use method

Publications (1)

Publication Number Publication Date
CN119399371A true CN119399371A (en) 2025-02-07

Family

ID=94417894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411491891.5A Pending CN119399371A (en) 2024-10-24 2024-10-24 Personalized audio-visual fusion intelligent interaction system based on multimodal data and its use method

Country Status (1)

Country Link
CN (1) CN119399371A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination