
CN118230756B - Method and system for driving digital face by voice 3D based on depth regression network - Google Patents

Method and system for driving digital face by voice 3D based on depth regression network

Info

Publication number
CN118230756B
Authority
CN
China
Prior art keywords
voice
face
mode
processing
mel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410329686.2A
Other languages
Chinese (zh)
Other versions
CN118230756A (en)
Inventor
王文涛
孙见青
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202410329686.2A priority Critical patent/CN118230756B/en
Publication of CN118230756A publication Critical patent/CN118230756A/en
Application granted granted Critical
Publication of CN118230756B publication Critical patent/CN118230756B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/10 - Pre-processing; Data cleansing
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232 - Processing in the frequency domain
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 - Transforming into visible information
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 - Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/02 - Preprocessing
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 - Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/08 - Feature extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 - Transforming into visible information
    • G10L 2021/105 - Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Molecular Biology (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method and a system for driving a 3D digital human face by voice based on a deep regression network. A voice feature processing module is invoked: the voice signal is processed by pre-emphasis, framing, windowing, Fourier transform, a Mel frequency filter bank, a logarithm operation and a discrete cosine transform to obtain the Mel cepstral features of the voice, and the Mel cepstral features are further processed by an ASR model to remove the identity information they contain. A 3D face blendshape parameter conversion module is invoked: a network based on the Transformer structure extracts voice feature parameters and converts them into 3D face blendshape parameters. A facial expression control module is invoked: the voice feature parameters are combined with the emotion features predicted by the corresponding voice model to control the facial expression information and output the corresponding 3D face motion information. The invention solves the problems of difficult implementation, monotonous mouth shapes, poor robustness and stiff facial expressions.

Description

Method and system for driving digital face by voice 3D based on depth regression network
Technical Field
The invention relates to the technical field of voice-driven digital faces, and in particular to a method and a system for driving a 3D digital face by voice based on a deep regression network.
Background
Existing methods for driving a 3D digital human by voice preprocess the speech, extract the phoneme information corresponding to the audio, and then control the speaking of the 3D character model according to a correspondence between phonemes and mouth shapes.
At present, this approach can only be used for a single speaker, and data must be collected again whenever the speaker changes. The correspondence between phoneme information and mouth-shape information is complex, and smooth transitions between different mouth shapes are difficult to achieve. In addition, the speaker's expression is stiff: phonemes only correspond to the mouth shape, and facial expression is absent.
Therefore, complex implementation, monotonous mouth shapes, poor robustness and stiff facial expressions have become urgent problems to be solved.
Disclosure of Invention
The invention provides a method and a system for driving a 3D digital human face by voice based on a deep regression network. A deep regression network is used to predict the 3D face motion coefficients directly from voice, which removes the difficulty of implementation; a statistical approach supports voice input from any speaker, which addresses the robustness of the algorithm; and the facial expression is controlled by voice, so that fine facial expressions can be captured and the problem of stiff facial expression is solved.
In order to achieve the above purpose, the invention provides the following technical solution. The method for driving a 3D digital face by voice based on a deep regression network comprises the following steps:
Invoking a voice feature processing module: the voice signal is pre-emphasized to recover the high-frequency content of the audio; the voice signal is split into frames with a weighted window, and a sliding window is used so that adjacent frames transition smoothly; a Fourier transform is applied to the windowed voice signal to obtain its frequency-domain representation; the voice signal is filtered by a Mel frequency filter bank to remove high-frequency information; a logarithmic transform is applied to the filtered signal to obtain the log Mel spectrum; a discrete cosine transform is applied to the log Mel spectrum to obtain the Mel cepstral features; and the Mel cepstral features of the voice are processed by an ASR model to remove the identity information contained in them;
Invoking a 3D face blendshape parameter conversion module: a network based on the Transformer structure extracts voice feature parameters and converts them into 3D face blendshape parameters;
Invoking a facial expression control module: a voice model is used to predict the emotion features of the input voice and to label the voice data automatically; the emotion features in the voice are extracted by the voice model; and the voice feature parameters are combined with the emotion features predicted by the voice model to control the facial expression information and output the corresponding 3D face motion information.
As a preferred scheme of the method for driving a 3D digital face by voice based on the deep regression network, the Mel frequency filter follows the standard Mel-scale mapping:
m = 2595 · log10(1 + f / 700)
where f is the input frequency in Hz and m is the corresponding Mel frequency.
As a preferred scheme of the method for driving a 3D digital face by voice based on the deep regression network, the Mel cepstral features of the voice are processed by an ASR model to extract semantic features and remove timbre and pitch information.
As a preferred scheme of the method for driving a 3D digital face by voice based on the deep regression network, after the voice feature parameters are converted into the 3D face blendshape parameters, the jitter of the face key points during regression is handled with a Wing loss, whose expression is:
wing(x) = w · ln(1 + |x| / e), if |x| < w
wing(x) = |x| - C, otherwise
where w limits the nonlinear part to the interval [-w, w], e controls the curvature of the nonlinear region, x is the argument (the regression error), and C = w - w · ln(1 + w / e) is the constant that joins the linear and nonlinear parts.
As a preferred scheme of the method for driving a 3D digital face by voice based on the deep regression network, after the corresponding 3D face motion information is output, the face is kept in smooth transition over short time frames by smoothing the face motion information, and data are collected for a single person so that the facial expression model can be fine-tuned to that person.
The invention also provides a processing system for driving a 3D digital face by voice based on a deep regression network, which comprises:
The voice feature processing module, configured to pre-emphasize the voice signal to recover the high-frequency content of the audio, split the voice signal into frames with a weighted window and use a sliding window so that adjacent frames transition smoothly, apply a Fourier transform to the windowed voice signal to obtain its frequency-domain representation, filter the voice signal with a Mel frequency filter bank to remove high-frequency information, apply a logarithmic transform to the filtered signal to obtain the log Mel spectrum, apply a discrete cosine transform to the log Mel spectrum to obtain the Mel cepstral features, and process the Mel cepstral features of the voice with an ASR model to remove the identity information contained in them;
The 3D face blendshape parameter conversion module, configured to extract voice feature parameters with a network based on the Transformer structure and convert the voice feature parameters into 3D face blendshape parameters;
The facial expression control module, configured to predict the emotion features of the input voice with a voice model and label the voice data automatically, extract the emotion features in the voice with the voice model, combine the voice feature parameters with the emotion features predicted by the voice model to control the facial expression information, and output the corresponding 3D face motion information.
As a preferred scheme of the processing system for driving a 3D digital face by voice based on the deep regression network, in the voice feature processing module:
the Mel frequency filter follows the standard Mel-scale mapping:
m = 2595 · log10(1 + f / 700)
where f is the input frequency in Hz and m is the corresponding Mel frequency.
As a preferred scheme of the processing system for driving a 3D digital face by voice based on the deep regression network, in the voice feature processing module:
the Mel cepstral features of the voice are processed by an ASR model to extract semantic features and remove timbre and pitch information.
As a preferred scheme of the processing system for driving a 3D digital face by voice based on the deep regression network, in the 3D face blendshape parameter conversion module:
after the voice feature parameters are converted into the 3D face blendshape parameters, the jitter of the face key points during regression is handled with a Wing loss, whose expression is:
wing(x) = w · ln(1 + |x| / e), if |x| < w
wing(x) = |x| - C, otherwise
where w limits the nonlinear part to the interval [-w, w], e controls the curvature of the nonlinear region, x is the argument (the regression error), and C = w - w · ln(1 + w / e) is the constant that joins the linear and nonlinear parts.
As a preferred scheme of the processing system for driving a 3D digital face by voice based on the deep regression network, in the facial expression control module:
after the corresponding 3D face motion information is output, the face is kept in smooth transition over short time frames by smoothing the face motion information, and data are collected for a single person so that the facial expression model can be fine-tuned to that person.
The invention has the following advantages. A voice feature processing module is invoked: the voice signal is pre-emphasized to recover the high-frequency content of the audio, split into frames with a weighted window and smoothed across adjacent frames with a sliding window, transformed to the frequency domain by a Fourier transform, filtered by a Mel frequency filter bank to remove high-frequency information, converted to the log Mel spectrum by a logarithmic transform and to Mel cepstral features by a discrete cosine transform, and the Mel cepstral features are processed by an ASR model to remove the identity information they contain. A 3D face blendshape parameter conversion module is invoked: a network based on the Transformer structure extracts voice feature parameters and converts them into 3D face blendshape parameters. A facial expression control module is invoked: a voice model predicts the emotion features of the input voice and labels the voice data automatically, the emotion features in the voice are extracted by the voice model, and the voice feature parameters are combined with the predicted emotion features to control the facial expression information and output the corresponding 3D face motion information. The invention uses a deep regression network to predict the 3D face motion coefficients directly from voice and controls the facial expression by voice, so that fine facial expressions can be captured; it thus solves the problems of difficult implementation, monotonous mouth shapes, poor robustness and stiff facial expressions.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those skilled in the art from this disclosure that the drawings described below are merely exemplary and that other embodiments may be derived from the drawings provided without undue effort.
The structures, proportions, sizes, etc. shown in the present specification are shown only for the purposes of illustration and description, and are not intended to limit the scope of the invention, which is defined by the claims, so that any structural modifications, changes in proportions, or adjustments of sizes, which do not affect the efficacy or the achievement of the present invention, should fall within the scope of the invention.
Fig. 1 is a schematic flow chart of a method for driving a 3D digital face by voice based on a deep regression network according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of the architecture of a processing system for driving a 3D digital face by voice based on a deep regression network according to embodiment 2 of the present invention.
Detailed Description
Other advantages and benefits of the present invention will become apparent to those skilled in the art from the following detailed description, which illustrates the invention by means of certain specific embodiments, but not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort fall within the scope of the invention.
Example 1
Referring to fig. 1, embodiment 1 of the present invention provides a method for driving a 3D digital face by voice based on a deep regression network, comprising the following steps:
S1, invoking a voice feature processing module: the voice signal is pre-emphasized to recover the high-frequency content of the audio; the voice signal is split into frames with a weighted window, and a sliding window is used so that adjacent frames transition smoothly; a Fourier transform is applied to the windowed voice signal to obtain its frequency-domain representation; the voice signal is filtered by a Mel frequency filter bank to remove high-frequency information; a logarithmic transform is applied to the filtered signal to obtain the log Mel spectrum; a discrete cosine transform is applied to the log Mel spectrum to obtain the Mel cepstral features; and the Mel cepstral features of the voice are processed by an ASR model to remove the identity information contained in them;
Specifically, the collected voice signal is first preprocessed to reduce the influence of lip radiation, and pre-emphasis is then applied to compensate for the weakening of the high-frequency components of speech as it passes through the vocal cords and lips, recovering the high-frequency part of the audio. After this processing, the speech signal is approximately stationary over short periods, so it is split into frames with a weighted window; a sliding window is used at the same time to keep adjacent frames smooth. The high- and low-frequency content of the speech signal reflects its original characteristics, and the structure of the spectrum carries the original information of the signal. To obtain the spectral characteristics, a Fourier transform (FFT) is applied to the windowed speech signal to obtain its frequency-domain representation. A Mel frequency filter bank is then applied: as the input frequency increases, the output Mel frequency grows ever more slowly, which matches the auditory response of the human ear. Taking the logarithm of the filter-bank output gives the log Mel spectrum, and applying a discrete cosine transform (DCT) to it gives the Mel cepstral features (MFCC). The Mel cepstral features characterize the structure of the speech signal and therefore express its original characteristics more accurately.
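To make the above feature pipeline concrete, the following is a minimal NumPy sketch of the described steps (pre-emphasis, framing with a weighted sliding window, FFT, Mel filter bank, logarithm, and DCT). The specific parameter values (0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop, 26 filters, 13 coefficients) are common defaults assumed for illustration and are not taken from the patent.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # Standard Mel-scale mapping: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_fft=512, n_filters=26, n_ceps=13):
    # 1. Pre-emphasis: recover the high frequencies weakened by the vocal tract and lips.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2./3. Framing with an overlapping sliding window and a Hamming (weighted) window.
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len   # assumes signal >= one frame
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)

    # 4. FFT -> power spectrum (frequency-domain signal).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 5. Triangular Mel filter bank spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # 6. Logarithm of the filter-bank energies -> log Mel spectrum.
    log_mel = np.log(power @ fbank.T + 1e-10)

    # 7. DCT decorrelates the log Mel spectrum -> Mel cepstral features (MFCC).
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]

# Example: 1 second of dummy audio at 16 kHz -> (frames, 13) MFCC matrix.
features = mfcc(np.random.randn(16000))
```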
S2, invoking a 3D face blendshape parameter conversion module: a network based on the Transformer structure extracts voice feature parameters and converts them into 3D face blendshape parameters;
The Transformer is a network structure commonly used in deep learning; it currently performs well in most applications, makes the network more robust, and can extract more effective features. The speech features are processed by the deep network and converted into high-dimensional data features, which are then mapped one-to-one to the face blendshape features by regression.
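As an illustration of this step, the sketch below builds a small Transformer-encoder regressor from per-frame speech features to blendshape coefficients in PyTorch. The feature dimension, the number of blendshape channels (52, as in common ARKit-style rigs) and the layer sizes are assumptions made for the example; the patent does not specify an exact architecture. The sigmoid at the output is a convenient design choice that keeps the regressed blendshape weights in [0, 1].

```python
import torch
import torch.nn as nn

class SpeechToBlendshape(nn.Module):
    """Regresses a sequence of speech features to per-frame blendshape coefficients."""

    def __init__(self, feat_dim=512, n_blendshapes=52, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)            # speech features -> model dimension
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, n_blendshapes)      # regression head

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, feat_dim), e.g. per-frame ASR bottleneck features
        x = self.in_proj(speech_feats)
        x = self.encoder(x)                                    # temporal context via self-attention
        return torch.sigmoid(self.out_proj(x))                 # blendshape weights in [0, 1]

# Example: a 2-second clip with 100 feature frames per second.
feats = torch.randn(1, 200, 512)
blendshapes = SpeechToBlendshape()(feats)                      # -> (1, 200, 52)
```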
S3, invoking a facial expression control module: a voice model is used to predict the emotion features of the input voice and to label the voice data automatically; the emotion features in the voice are extracted by the voice model; and the voice feature parameters are combined with the emotion features predicted by the voice model to control the facial expression information and output the corresponding 3D face motion information.
Combining the voice feature parameters with the emotion features predicted by the corresponding voice model means that the two kinds of information are spliced along one dimension, as expressed in formula (1):
R_full(t) = R_mouth(t) ⊕ R_head(t)   (1)
wherein R_full(t) is the complete information after splicing, R_mouth(t) is the mouth-shape information, R_head(t) is the head information, t is a given moment, and ⊕ denotes the splicing and fusion operation.
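A minimal sketch of the splicing in formula (1), assuming both streams are per-frame tensors; the channel counts are purely illustrative.

```python
import torch

# r_mouth: mouth-shape (blendshape) information, r_head: head/expression information, per frame
r_mouth = torch.randn(1, 200, 52)   # (batch, frames, mouth channels)
r_head = torch.randn(1, 200, 16)    # (batch, frames, head channels)

# Splicing/fusion: concatenate along the feature dimension to form the complete information.
r_full = torch.cat([r_mouth, r_head], dim=-1)   # -> (1, 200, 68)
```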
In this embodiment, the Mel frequency filter follows the standard Mel-scale mapping, shown in formula (2):
m = 2595 · log10(1 + f / 700)   (2)
where f is the input frequency in Hz and m is the corresponding Mel frequency.
In this embodiment, the mel cepstrum features of the speech are processed by using an ASR model extraction method, semantic features are extracted, and timbre and pitch information are removed.
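One possible realization of this identity-removal step is sketched below. The patent feeds Mel cepstral features into an ASR model; here, as an assumption for illustration, a pretrained wav2vec 2.0 encoder from the Hugging Face transformers library (operating on the raw waveform) stands in for that ASR extractor, since its hidden states retain mostly linguistic content and suppress speaker timbre and pitch.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Hypothetical choice of ASR-style encoder; any speech recognition encoder could be substituted.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def semantic_features(waveform, sr=16000):
    """Return per-frame semantic (identity-suppressed) speech features."""
    inputs = processor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, frames, 768)
    return hidden

# Example: 1 second of dummy audio at 16 kHz.
feats = semantic_features(np.random.randn(16000))
```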
In this embodiment, after the voice feature parameters are converted into the 3D face blendshape parameters, the jitter of the face key points during regression is handled with a Wing loss, whose expression is:
wing(x) = w · ln(1 + |x| / e), if |x| < w
wing(x) = |x| - C, otherwise
where w limits the nonlinear part to the interval [-w, w], e controls the curvature of the nonlinear region, x is the argument (the regression error), and C = w - w · ln(1 + w / e) is the constant that joins the linear and nonlinear parts.
Specifically, in the Wing loss, e takes a small value and is chosen so that network training remains stable. In the early stage of training, the linear part guarantees that the network can be trained. In the nonlinear part, the loss is confined to the nonlinear range by w and small errors are amplified, so the voice information is mined more deeply and the performance of the network is improved.
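A small sketch of the Wing loss consistent with the definition above (a logarithmic, curvature-controlled region inside [-w, w] and a linear region outside it, joined by the constant C). The default values w = 10 and e = 2 are common choices from the Wing loss literature, not values taken from the patent.

```python
import torch

def wing_loss(pred, target, w=10.0, e=2.0):
    """Wing loss for regressing face key points / blendshape coefficients.

    Logarithmic (nonlinear) region for small errors |x| < w, linear region otherwise,
    joined continuously by the constant C = w - w * ln(1 + w / e).
    """
    x = (pred - target).abs()
    C = w - w * torch.log(torch.tensor(1.0 + w / e))
    loss = torch.where(x < w, w * torch.log(1.0 + x / e), x - C)
    return loss.mean()

# Example: per-frame blendshape predictions vs. ground truth.
pred = torch.randn(8, 52)
target = torch.randn(8, 52)
print(wing_loss(pred, target))
```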
In this embodiment, after the corresponding 3D face motion information is output, the face is kept in smooth transition over short time frames by smoothing the face motion information, and data are collected for a single person so that the facial expression model can be fine-tuned to that person.
Specifically, the information is smoothed by averaging the facial features corresponding to the blendshapes over a few frames in a sliding-window manner. A sliding window produces a smoothed output by applying a window of fixed size over a series of consecutive data points and averaging (or otherwise processing) the values within the window. In this case, for the face features and the corresponding blendshape parameters at each time point, a sliding window of size N may be used to average the most recent N frames of data, giving a smoothed result.
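A minimal sketch of this sliding-window smoothing over a blendshape sequence; the window size N = 5 is an illustrative choice.

```python
import numpy as np

def smooth_blendshapes(frames, window=5):
    """Average each blendshape channel over the most recent `window` frames.

    frames: array of shape (n_frames, n_blendshapes); returns an array of the same shape.
    """
    smoothed = np.empty_like(frames, dtype=float)
    for t in range(len(frames)):
        start = max(0, t - window + 1)       # use the most recent N frames (fewer at the start)
        smoothed[t] = frames[start:t + 1].mean(axis=0)
    return smoothed

# Example: smooth a 200-frame, 52-channel blendshape sequence.
seq = np.random.rand(200, 52)
seq_smooth = smooth_blendshapes(seq, window=5)
```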
In summary, the invention invokes a voice feature processing module that pre-emphasizes the voice signal to recover the high-frequency content of the audio, splits the signal into frames with a weighted window and smooths adjacent frames with a sliding window, applies a Fourier transform to obtain the frequency-domain signal, filters it with a Mel frequency filter bank to remove high-frequency information, applies a logarithmic transform to obtain the log Mel spectrum and a discrete cosine transform to obtain the Mel cepstral features, and processes the Mel cepstral features with an ASR model to remove the identity information they contain. It then invokes a 3D face blendshape parameter conversion module that extracts voice feature parameters with a network based on the Transformer structure and converts them into 3D face blendshape parameters, and a facial expression control module that predicts the emotion features of the input voice with a voice model, labels the voice data automatically, combines the voice feature parameters with the predicted emotion features to control the facial expression information, and outputs the corresponding 3D face motion information. The invention uses a deep regression network to predict the 3D face motion coefficients directly from voice and controls the facial expression by voice, so that fine facial expressions can be captured; it thus solves the problems of difficult implementation, monotonous mouth shapes, poor robustness and stiff facial expressions.
It should be noted that the method of the embodiments of the present disclosure may be performed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of embodiments of the present disclosure, the devices interacting with each other to accomplish the methods.
It should be noted that the foregoing describes some embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Example 2
Referring to fig. 2, embodiment 2 of the present invention further provides a processing system for driving a 3D digital face by voice based on a deep regression network, comprising:
The voice feature processing module 1, configured to pre-emphasize the voice signal to recover the high-frequency content of the audio, split the voice signal into frames with a weighted window and use a sliding window so that adjacent frames transition smoothly, apply a Fourier transform to the windowed voice signal to obtain its frequency-domain representation, filter the voice signal with a Mel frequency filter bank to remove high-frequency information, apply a logarithmic transform to the filtered signal to obtain the log Mel spectrum, apply a discrete cosine transform to the log Mel spectrum to obtain the Mel cepstral features, and process the Mel cepstral features of the voice with an ASR model to remove the identity information contained in them;
The 3D face blendshape parameter conversion module 2, configured to extract voice feature parameters with a network based on the Transformer structure and convert the voice feature parameters into 3D face blendshape parameters;
The facial expression control module 3, configured to predict the emotion features of the input voice with a voice model and label the voice data automatically, extract the emotion features in the voice with the voice model, combine the voice feature parameters with the emotion features predicted by the voice model to control the facial expression information, and output the corresponding 3D face motion information.
In this embodiment, in the speech feature processing module 1:
the Mel frequency filter follows the standard Mel-scale mapping:
m = 2595 · log10(1 + f / 700)
where f is the input frequency in Hz and m is the corresponding Mel frequency.
In this embodiment, in the speech feature processing module 1:
the Mel cepstral features of the voice are processed by an ASR model to extract semantic features and remove timbre and pitch information.
In this embodiment, in the 3D face blendshape parameter conversion module 2:
after the voice feature parameters are converted into the 3D face blendshape parameters, the jitter of the face key points during regression is handled with a Wing loss, whose expression is:
wing(x) = w · ln(1 + |x| / e), if |x| < w
wing(x) = |x| - C, otherwise
where w limits the nonlinear part to the interval [-w, w], e controls the curvature of the nonlinear region, x is the argument (the regression error), and C = w - w · ln(1 + w / e) is the constant that joins the linear and nonlinear parts.
In this embodiment, in the facial expression control module 3:
after the corresponding 3D face motion information is output, the face is kept in smooth transition over short time frames by smoothing the face motion information, and data are collected for a single person so that the facial expression model can be fine-tuned to that person.
It should be noted that, since the information interaction and execution processes between the modules of the above system are based on the same concept as the method embodiment in embodiment 1 of the present application, their technical effects are the same as those of the method embodiment; for details, reference is made to the description of the foregoing method embodiment, which is not repeated here.
Example 3
Embodiment 3 of the present invention provides a non-transitory computer-readable storage medium storing program code for the method for driving a 3D digital face by voice based on a deep regression network, the program code comprising instructions for performing the method of embodiment 1 or any possible implementation thereof.
Computer readable storage media can be any available media that can be accessed by a computer or data storage devices, such as servers, data centers, etc., that contain an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
Example 4
Embodiment 4 of the invention provides an electronic device, comprising a memory and a processor;
the processor and the memory communicate with each other through a bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method for driving a 3D digital face by voice based on the deep regression network of embodiment 1 or any possible implementation thereof.
The processor may be implemented by hardware or software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented by software, the processor may be a general-purpose processor that reads software code stored in a memory, and the memory may be integrated in the processor or located outside the processor and exist independently.
In the above embodiments, the implementation may be achieved in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable system. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.).
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented with a general-purpose computing system; they may be concentrated on a single computing system or distributed across a network of computing systems, and they may alternatively be implemented with program code executable by the computing system, so that they can be stored in a storage system and executed by the computing system, in some cases in an order different from that shown or described; alternatively, they may be fabricated separately as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.

Claims (10)

1. A method for driving a 3D digital face by voice based on a deep regression network, characterized by comprising the following steps:
invoking a voice feature processing module: the voice signal is pre-emphasized to recover the high-frequency content of the audio; the voice signal is split into frames with a weighted window, and a sliding window is used so that adjacent frames transition smoothly; a Fourier transform is applied to the windowed voice signal to obtain its frequency-domain representation; the voice signal is filtered by a Mel frequency filter bank to remove high-frequency information; a logarithmic transform is applied to the filtered signal to obtain the log Mel spectrum; a discrete cosine transform is applied to the log Mel spectrum to obtain the Mel cepstral features; and the Mel cepstral features of the voice are processed by an ASR model to remove the identity information contained in them;
invoking a 3D face blendshape parameter conversion module: a network based on the Transformer structure extracts voice feature parameters and converts them into 3D face blendshape parameters;
invoking a facial expression control module: a voice model is used to predict the emotion features of the input voice and to label the voice data automatically; the emotion features in the voice are extracted by the voice model; and the voice feature parameters are combined with the emotion features predicted by the voice model to control the facial expression information and output the corresponding 3D face motion information.
2. The method for driving a 3D digital face by voice based on a deep regression network according to claim 1, wherein the Mel frequency filter follows the standard Mel-scale mapping:
m = 2595 · log10(1 + f / 700)
where f is the input frequency in Hz and m is the corresponding Mel frequency.
3. The method for driving a 3D digital face by voice based on a deep regression network according to claim 1, wherein the Mel cepstral features of the voice are processed by an ASR model to extract semantic features and remove timbre and pitch information.
4. The method for driving a 3D digital face by voice based on a deep regression network according to claim 1, wherein after the voice feature parameters are converted into the 3D face blendshape parameters, the jitter of the face key points during regression is handled with a Wing loss, whose expression is:
wing(x) = w · ln(1 + |x| / e), if |x| < w
wing(x) = |x| - C, otherwise
where w limits the nonlinear part to the interval [-w, w], e controls the curvature of the nonlinear region, x is the argument (the regression error), and C = w - w · ln(1 + w / e) is the constant that joins the linear and nonlinear parts.
5. The method for driving a 3D digital face by voice based on a deep regression network according to claim 1, wherein after the corresponding 3D face motion information is output, the face is kept in smooth transition over short time frames by smoothing the face motion information, and data are collected for a single person so that the facial expression model can be fine-tuned to that person.
6. A processing system for driving a 3D digital face by voice based on a deep regression network, characterized by comprising:
a voice feature processing module, configured to pre-emphasize the voice signal to recover the high-frequency content of the audio, split the voice signal into frames with a weighted window and use a sliding window so that adjacent frames transition smoothly, apply a Fourier transform to the windowed voice signal to obtain its frequency-domain representation, filter the voice signal with a Mel frequency filter bank to remove high-frequency information, apply a logarithmic transform to the filtered signal to obtain the log Mel spectrum, apply a discrete cosine transform to the log Mel spectrum to obtain the Mel cepstral features, and process the Mel cepstral features of the voice with an ASR model to remove the identity information contained in them;
a 3D face blendshape parameter conversion module, configured to extract voice feature parameters with a network based on the Transformer structure and convert the voice feature parameters into 3D face blendshape parameters;
a facial expression control module, configured to predict the emotion features of the input voice with a voice model and label the voice data automatically, extract the emotion features in the voice with the voice model, combine the voice feature parameters with the emotion features predicted by the voice model to control the facial expression information, and output the corresponding 3D face motion information.
7. The processing system for driving a 3D digital face by voice based on a deep regression network according to claim 6, wherein in the voice feature processing module:
the Mel frequency filter follows the standard Mel-scale mapping:
m = 2595 · log10(1 + f / 700)
where f is the input frequency in Hz and m is the corresponding Mel frequency.
8. The processing system for driving a 3D digital face by voice based on a deep regression network according to claim 6, wherein in the voice feature processing module:
the Mel cepstral features of the voice are processed by an ASR model to extract semantic features and remove timbre and pitch information.
9. The processing system for driving a 3D digital face by voice based on a deep regression network according to claim 6, wherein in the 3D face blendshape parameter conversion module:
after the voice feature parameters are converted into the 3D face blendshape parameters, the jitter of the face key points during regression is handled with a Wing loss, whose expression is:
wing(x) = w · ln(1 + |x| / e), if |x| < w
wing(x) = |x| - C, otherwise
where w limits the nonlinear part to the interval [-w, w], e controls the curvature of the nonlinear region, x is the argument (the regression error), and C = w - w · ln(1 + w / e) is the constant that joins the linear and nonlinear parts.
10. The processing system for driving a 3D digital face by voice based on a deep regression network according to claim 6, wherein in the facial expression control module:
after the corresponding 3D face motion information is output, the face is kept in smooth transition over short time frames by smoothing the face motion information, and data are collected for a single person so that the facial expression model can be fine-tuned to that person.
CN202410329686.2A 2024-03-21 2024-03-21 Method and system for driving digital face by voice 3D based on depth regression network Active CN118230756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410329686.2A CN118230756B (en) 2024-03-21 2024-03-21 Method and system for driving digital face by voice 3D based on depth regression network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410329686.2A CN118230756B (en) 2024-03-21 2024-03-21 Method and system for driving digital face by voice 3D based on depth regression network

Publications (2)

Publication Number Publication Date
CN118230756A CN118230756A (en) 2024-06-21
CN118230756B true CN118230756B (en) 2025-11-21

Family

ID=91511444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410329686.2A Active CN118230756B (en) 2024-03-21 2024-03-21 Method and system for driving digital face by voice 3D based on depth regression network

Country Status (1)

Country Link
CN (1) CN118230756B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934926A (en) * 2023-09-15 2023-10-24 杭州优航信息技术有限公司 Recognition method and system based on multi-mode data fusion
CN117711042A (en) * 2023-11-22 2024-03-15 中移(苏州)软件技术有限公司 Method and device for generating broadcast video of digital person based on driving text

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10013787B2 (en) * 2011-12-12 2018-07-03 Faceshift Ag Method for facial animation
CN118891616A (en) * 2022-06-22 2024-11-01 海信视像科技股份有限公司 A virtual digital human driving method, device, equipment and medium
CN115937369A (en) * 2022-11-21 2023-04-07 之江实验室 Method, system, electronic device and storage medium for generating expression animation
CN116863038A (en) * 2023-07-07 2023-10-10 东博未来人工智能研究院(厦门)有限公司 Method for generating digital human voice and facial animation by text
CN117524244B (en) * 2024-01-08 2024-04-12 广州趣丸网络科技有限公司 3D digital human voice driving method, device, storage medium and related equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934926A (en) * 2023-09-15 2023-10-24 杭州优航信息技术有限公司 Recognition method and system based on multi-mode data fusion
CN117711042A (en) * 2023-11-22 2024-03-15 中移(苏州)软件技术有限公司 Method and device for generating broadcast video of digital person based on driving text

Also Published As

Publication number Publication date
CN118230756A (en) 2024-06-21

Similar Documents

Publication Publication Date Title
EP3933829B1 (en) Speech processing method and apparatus, electronic device, and computer-readable storage medium
CN112562691B (en) Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
WO2024055752A9 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
WO2019097276A1 (en) Speech model personalization via ambient context harvesting
CN111508498A (en) Conversational speech recognition method, system, electronic device and storage medium
CN113571047B (en) A method, device and equipment for processing audio data
CN112562648A (en) Adaptive speech recognition method, apparatus, device and medium based on meta learning
CN108461081B (en) Voice control method, device, equipment and storage medium
CN111508519A (en) Method and device for enhancing voice of audio signal
CN108108357A (en) Accent conversion method and device, electronic equipment
CN115497451B (en) Voice processing method, device, electronic device and storage medium
CN117059068A (en) Speech processing method, device, storage medium and computer equipment
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
CN111462736B (en) Speech-based image generation method, device and electronic device
CN113516988B (en) Audio processing method and device, intelligent equipment and storage medium
CN118230756B (en) Method and system for driving digital face by voice 3D based on depth regression network
EP4475121A1 (en) Interactive speech signal processing method, related device and system
CN113823271B (en) Training method and device for voice classification model, computer equipment and storage medium
CN110197657B (en) A dynamic sound feature extraction method based on cosine similarity
CN117133303B (en) Voice noise reduction method, electronic equipment and medium
US20250078851A1 (en) System and Method for Disentangling Audio Signal Information
CN120766654B (en) Real-time interactive voice cloning methods, devices, equipment, and media
CN119580701B (en) A speech synthesis method, apparatus, computer device, and storage medium
CN112201229B (en) A speech processing method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant