
CN111354079B - Three-dimensional face reconstruction network training and virtual face image generation method and device


Info

Publication number
CN111354079B
Authority
CN
China
Prior art keywords
face
network
dimensional face
dimensional
virtual
Prior art date
Legal status
Active
Application number
CN202010165219.2A
Other languages
Chinese (zh)
Other versions
CN111354079A
Inventor
孙爽
李琛
陈杨
戴宇荣
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202010165219.2A
Publication of CN111354079A
Application granted
Publication of CN111354079B
Status: Active

Classifications

    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06F 18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06T 15/005: 3D image rendering; general purpose rendering architectures
    • G06V 10/26: Image preprocessing; segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 40/161: Human faces, e.g. facial parts, sketches or expressions; detection; localisation; normalisation
    • G06V 40/168: Human faces, e.g. facial parts, sketches or expressions; feature extraction; face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Graphics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application relates to a three-dimensional face reconstruction network training and virtual face image generation method and device. The three-dimensional face reconstruction network training method comprises the following steps: acquiring a two-dimensional face sample picture; extracting three-dimensional face sample parameters from the two-dimensional face sample picture through a three-dimensional face reconstruction network to be trained; inputting the three-dimensional face sample parameters into a renderer simulation network for mapping to obtain a virtual face image; respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network for feature extraction, and calculating a feature loss value of the virtual face image relative to the two-dimensional face sample picture; and adjusting parameters of the three-dimensional face reconstruction network based on the feature loss value, and continuing training until a training end condition is met. By adopting the method, end-to-end unsupervised training can be realized without labeling.

Description

Three-dimensional face reconstruction network training and virtual face image generation method and device
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and apparatus for three-dimensional face reconstruction network training and virtual face image generation.
Background
With the development of artificial intelligence, image processing technology is increasingly applied to avatar scenarios such as games and social applications. Users now expect to generate a two-dimensional virtual face image of a target person from actually photographed two-dimensional (2D) face pictures of that person. To improve the fidelity of the virtual face image, the rendering process often uses, in addition to the face picture of the target person, three-dimensional (3D) face parameters of the target person, such as the face shape and the facial features.
The conventional approach mainly trains a three-dimensional face reconstruction network on two-dimensional face pictures annotated with three-dimensional face parameters, and then renders the virtual face image based on the three-dimensional face parameters output by the network. However, this approach requires a large amount of annotation, which reduces the training efficiency of the network and thus the efficiency of virtual face image generation.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a three-dimensional face reconstruction network training and virtual face image generation method, apparatus, computer device, and storage medium that can improve the network training efficiency and virtual face image generation effect.
A three-dimensional face reconstruction network training method, the method comprising:
acquiring a two-dimensional face sample picture;
extracting three-dimensional face sample parameters from the two-dimensional face sample pictures through a three-dimensional face reconstruction network to be trained;
inputting three-dimensional face sample parameters into a renderer simulation network for mapping processing to obtain a virtual face image;
respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network to perform feature extraction, and calculating to obtain a feature loss value of the virtual face image relative to the two-dimensional face sample picture;
and adjusting parameters of the three-dimensional face reconstruction network based on the feature loss value, and continuing training until a training end condition is met.
A three-dimensional face reconstruction network training apparatus, the apparatus comprising:
the sample acquisition module is used for acquiring a two-dimensional face sample picture;
the feature extraction module is used for extracting three-dimensional face sample parameters from the two-dimensional face sample pictures through a three-dimensional face reconstruction network to be trained;
the loss measurement module is used for inputting three-dimensional face sample parameters into the renderer simulation network to perform mapping processing to obtain a virtual face image; respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network to perform feature extraction, and calculating to obtain a feature loss value of the virtual face image relative to the two-dimensional face sample picture;
and the network training module is used for adjusting parameters of the three-dimensional face reconstruction network based on the feature loss value and continuing training until a training end condition is met.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a two-dimensional face sample picture;
extracting three-dimensional face sample parameters from the two-dimensional face sample pictures through a three-dimensional face reconstruction network to be trained;
inputting three-dimensional face sample parameters into a renderer simulation network for mapping processing to obtain a virtual face image;
respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network to perform feature extraction, and calculating to obtain a feature loss value of the virtual face image relative to the two-dimensional face sample picture;
and adjusting parameters of the three-dimensional face reconstruction network based on the feature loss value, and continuing training until a training end condition is met.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring a two-dimensional face sample picture;
extracting three-dimensional face sample parameters from the two-dimensional face sample pictures through a three-dimensional face reconstruction network to be trained;
inputting three-dimensional face sample parameters into a renderer simulation network for mapping processing to obtain a virtual face image;
respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network to perform feature extraction, and calculating to obtain a feature loss value of the virtual face image relative to the two-dimensional face sample picture;
and adjusting parameters of the three-dimensional face reconstruction network based on the feature loss value, and continuing training until a training end condition is met.
According to the above three-dimensional face reconstruction network training method, apparatus, computer device, and storage medium, the renderer simulation network is trained in advance, so that when the three-dimensional face parameters of the target object extracted from the two-dimensional face sample picture by the three-dimensional face reconstruction network to be trained are input into the renderer simulation network, they can be quickly and accurately mapped into a virtual face image. The virtual face image and the two-dimensional face sample picture are respectively input into the face consistency network, and the parameters of the three-dimensional face reconstruction network can be gradually adjusted according to the resulting feature loss value of the virtual face image relative to the two-dimensional face sample picture. During this parameter adjustment, the virtual face image obtained by the renderer simulation network mapping, guided by the three-dimensional face parameters output by the three-dimensional face reconstruction network, comes increasingly close to the two-dimensional face sample picture. Because the extraction of three-dimensional face parameters by the three-dimensional face reconstruction network is constrained and guided by the face consistency network, the training of the three-dimensional face reconstruction network can be completed without labeling the two-dimensional face sample pictures, achieving end-to-end unsupervised training; the trained three-dimensional face reconstruction network and the pre-trained renderer simulation network can then be used directly to convert two-dimensional face pictures into virtual face images quickly and in real time.
A virtual face image generation method, the method comprising:
acquiring a two-dimensional face target picture;
extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on a three-dimensional face reconstruction network;
rendering the three-dimensional face target parameters based on a renderer simulation network to obtain a virtual face image of the target object;
the three-dimensional face reconstruction network is obtained by performing unsupervised training by means of the pre-trained renderer simulation network and a face consistency network; the face consistency network is used for calculating a characteristic loss value of output data of the renderer simulation network relative to input data of the three-dimensional face reconstruction network; the characteristic loss value is used for guiding the adjustment of parameters of the three-dimensional face reconstruction network.
A virtual face image generation apparatus, the apparatus comprising:
the two-dimensional image acquisition module is used for acquiring a two-dimensional face target image;
the three-dimensional parameter extraction module is used for extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on a three-dimensional face reconstruction network;
the virtual image rendering module is used for rendering the three-dimensional face target parameters based on a renderer simulation network to obtain a virtual face image of the target object; the three-dimensional face reconstruction network is obtained by performing unsupervised training by means of the pre-trained renderer simulation network and a face consistency network; the face consistency network is used for calculating a characteristic loss value of output data of the renderer simulation network relative to input data of the three-dimensional face reconstruction network; the characteristic loss value is used for guiding the adjustment of parameters of the three-dimensional face reconstruction network.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a two-dimensional face target picture;
extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on a three-dimensional face reconstruction network;
rendering the three-dimensional face target parameters based on a renderer simulation network to obtain a virtual face image of the target object;
the three-dimensional face reconstruction network is obtained by performing unsupervised training by means of the pre-trained renderer simulation network and a face consistency network; the face consistency network is used for calculating a characteristic loss value of output data of the renderer simulation network relative to input data of the three-dimensional face reconstruction network; the characteristic loss value is used for guiding the adjustment of parameters of the three-dimensional face reconstruction network.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring a two-dimensional face target picture;
extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on a three-dimensional face reconstruction network;
rendering the three-dimensional face target parameters based on a renderer simulation network to obtain a virtual face image of the target object;
the three-dimensional face reconstruction network is obtained by performing unsupervised training by means of the pre-trained renderer simulation network and a face consistency network; the face consistency network is used for calculating a characteristic loss value of output data of the renderer simulation network relative to input data of the three-dimensional face reconstruction network; the characteristic loss value is used for guiding the adjustment of parameters of the three-dimensional face reconstruction network.
According to the above virtual face image generation method, apparatus, computer device, and storage medium, the network parameters are adjusted directly in the network training stage; once training has fixed the network parameters, three-dimensional face parameters are extracted directly by the trained three-dimensional face reconstruction network without per-image iteration, which greatly improves the efficiency of three-dimensional face parameter extraction and makes the method suitable for scenarios with high real-time requirements. In addition, the three-dimensional face reconstruction network is trained on a large number of two-dimensional face pictures, three-dimensional face standard parameters, and virtual face sample images, so it learns the statistical rule for mapping two-dimensional face pictures into virtual face images, can find a globally optimal solution, and generalizes well; different two-dimensional face pictures of the same target object taken under different viewing angles and illumination conditions can therefore be converted into a stable virtual face image.
Drawings
FIG. 1 is an application environment diagram of a three-dimensional face reconstruction network training and virtual face image generation method in one embodiment;
FIG. 2 is a flow chart of a three-dimensional face reconstruction network training method in one embodiment;
FIG. 3 is a schematic diagram of a two-dimensional face picture and a corresponding virtual face image in one embodiment;
FIG. 4 is a schematic diagram of a three-dimensional face reconstruction network training method according to an embodiment;
FIG. 5 is a schematic diagram of a three-dimensional face reconstruction network training method according to another embodiment;
FIG. 6 is a flowchart of a three-dimensional face reconstruction network training method in an embodiment;
FIG. 7 is a flowchart of a virtual face image generation method according to an embodiment;
FIG. 8 is a schematic diagram of converting a two-dimensional face target picture into a virtual face image using a three-dimensional face reconstruction network and a cascaded renderer simulation network in one embodiment;
FIG. 9 is a schematic diagram of converting two-dimensional face sample pictures of multiple perspectives of the same target object taken under multiple illumination environments into a virtual face object in one embodiment;
FIG. 10 is a flowchart of a method for generating a virtual face image according to an embodiment;
FIG. 11 is a schematic structural diagram of a three-dimensional face reconstruction network training device in one embodiment;
FIG. 12 is a schematic diagram of a three-dimensional face reconstruction network training device according to another embodiment;
fig. 13 is a schematic structural diagram of a virtual face image generating apparatus according to an embodiment;
fig. 14 is a schematic structural diagram of a virtual face image generating apparatus according to another embodiment;
FIG. 15 is an internal block diagram of a computer device implemented as a server in one embodiment;
fig. 16 is an internal structural diagram of a computer device when implemented as a terminal in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The three-dimensional face reconstruction network training and virtual face image generation method provided by the application can be applied to an application environment shown in fig. 1. The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or the like. The server 120 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. The terminal 110 and the server 120 may be directly or indirectly connected through wired or wireless communication, which is not limited herein. The terminal 110 and the server 120 can each be used separately to execute the three-dimensional face reconstruction network training and virtual face image generation methods provided in the embodiments of the present application, or can be used cooperatively to perform these methods.
The method provided by the embodiments of the present application relates to the computer vision (CV) technology of artificial intelligence. Computer vision is the science of studying how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track, and measure targets, and further performs graphics processing so that the result becomes an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
It should be noted that the embodiments of the present application refer to three-dimensional face models, such as the standard three-dimensional face skeleton model in a 3DMM (3D Morphable Model) library. The three-dimensional face skeleton model is a skeleton-driven three-dimensional face model. Skeleton points are bound at a plurality of positions of the face skeleton, and the morphological change of the face can be adjusted by controlling these skeleton points. In the three-dimensional face skeleton model, the face skeleton is bound to the face shape and to the positions of the facial features such as the eyes, eyebrows, nose, and mouth.
The present application also relates to various neural network models. To distinguish them from three-dimensional models of objects, all neural network models referred to in this application are hereinafter simply called networks, such as the three-dimensional face reconstruction network (Reconstruction Subnetwork), the renderer simulation network (Render Simulator Subnetwork), and the face consistency network. The face consistency network may specifically include several sub-networks, such as a facial feature segmentation sub-network (Face Parsing Subnetwork), a face recognition sub-network (Face Recognition Subnetwork), and a face landmark detection sub-network (Face Landmark Detection Subnetwork). Each network may be a machine learning network, i.e., a computational model that acquires a certain capability after learning from samples; it may specifically be a neural network such as a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network). Of course, other types of networks may also be employed.
The renderer simulation network and the face consistency network are each pre-trained, so the present application can train the three-dimensional face reconstruction network on the basis of the pre-trained renderer simulation network and face consistency network, and a converged three-dimensional face reconstruction network can be obtained with fewer training iterations. The pre-training of the renderer simulation network and the face consistency network, and the training of the three-dimensional face reconstruction network with their help, are described in detail in the following embodiments.
In one embodiment, as shown in fig. 2, a three-dimensional face reconstruction network training method is provided, and the method is applied to the computer device in fig. 1 for illustration, and the computer device may be specifically the terminal 110 or the server 120 in the above figure. Referring to fig. 2, the three-dimensional face reconstruction network training method specifically includes the following steps:
and 202, acquiring a two-dimensional face sample picture.
The two-dimensional face picture refers to a two-dimensional picture containing a face, and specifically may be a picture actually shot by a computer device, a picture drawn by the computer device based on a tool such as drawing software, a picture shot by other image acquisition devices or uploaded to the computer device after being drawn by other terminal devices, and the like.
Referring to fig. 3, fig. 3 illustrates a schematic diagram of a two-dimensional face picture and a virtual face avatar in one embodiment. As shown in fig. 3, the face in the two-dimensional face picture 302 includes the entire face region of the target object. For convenience of description, hereinafter, a two-dimensional face picture used in the training stage of the three-dimensional face reconstruction network is referred to as a two-dimensional face sample picture, and a two-dimensional face picture used in the application stage of the three-dimensional face reconstruction network is referred to as a two-dimensional face target picture.
Specifically, when the three-dimensional face reconstruction network needs to be trained, the computer device acquires two-dimensional face sample pictures and preprocesses them according to the network training requirements. The preprocessing may specifically include uniformly resizing the two-dimensional face sample pictures to a preset target size, converting color two-dimensional face pictures into grayscale images, and the like.
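A minimal sketch of such preprocessing in Python, assuming PIL, a 256 x 256 target size, and an optional grayscale conversion (the concrete target size and library are not specified above):

```python
from PIL import Image

def preprocess_sample(path, target_size=(256, 256), to_gray=False):
    """Load a two-dimensional face sample picture and normalize it for training.

    target_size and to_gray are illustrative assumptions; the text only states
    that samples are resized to a preset size and may be converted to grayscale.
    """
    img = Image.open(path).convert("L" if to_gray else "RGB")
    return img.resize(target_size, Image.BILINEAR)
```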
In one embodiment, after the computer device obtains a picture taken or drawn, it may detect whether the picture contains a human face. If the face exists, the computer equipment can further check the definition, the angle and the like of the face contained in the picture, and when the check is passed, the picture is determined to be a two-dimensional face sample picture.
In one embodiment, the two-dimensional face picture may be a partial picture taken from a portrait picture that includes a head region and a body region of a person. The head region is the region above the neck, and the body region is the rest of the portrait picture. The computer device obtains such a portrait picture, recognizes the head region and the body region in it, segments the head region from the portrait picture, and uses the extracted head region picture as the two-dimensional face picture.
In one embodiment, when the head region picture segmented from the portrait picture contains only a small face region, the computer device may further perform face recognition on the head region picture to cut out part of the picture content other than the face region, retaining a larger proportion of face region. A sketch of this cropping step is shown below.
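A minimal sketch of the face-region cropping described above, using an OpenCV Haar-cascade detector and a fixed crop margin; both are illustrative assumptions, since no particular detector is named in the text:

```python
import cv2

def crop_face_region(portrait_bgr, margin=0.2):
    """Detect the face in a portrait picture and crop an enlarged face region."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(portrait_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                       # no face: not usable as a sample
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])    # keep the largest detected face
    dx, dy = int(w * margin), int(h * margin)             # keep some context around the face
    h_img, w_img = gray.shape
    return portrait_bgr[max(0, y - dy):min(h_img, y + h + dy),
                        max(0, x - dx):min(w_img, x + w + dx)]
```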
Step 204, extracting three-dimensional face sample parameters from the two-dimensional face sample picture through the three-dimensional face reconstruction network to be trained.
The three-dimensional face reconstruction network is a neural network for extracting the three-dimensional face parameters of the target object from the two-dimensional face picture, for example a CNN (Convolutional Neural Network), a DNN (Deep Neural Network), or an RNN (Recurrent Neural Network). The three-dimensional face reconstruction network may also be a combination of multiple neural networks. The three-dimensional face reconstruction network to be trained contains network parameters, which serve in this embodiment as the initial parameters participating in training.
A convolutional neural network includes convolutional layers and pooling layers. Convolutional neural networks come in various types, such as VGG (Visual Geometry Group) networks, GoogLeNet, or ResNet (residual convolutional networks). A deep neural network comprises an input layer, hidden layers, and an output layer, with full connections between layers. A recurrent neural network models sequence data: the current output of a sequence is also related to previous outputs. Concretely, the network memorizes previous information and applies it to the computation of the current output, i.e., the nodes between hidden layers are no longer unconnected, and the input of a hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. Recurrent neural networks include, for example, LSTM (Long Short-Term Memory) networks.
In one particular embodiment, the three-dimensional face reconstruction network may be an Inception-ResNet-v2 network, a convolutional neural network composed of a plurality of residual convolutional networks (ResNet).
Specifically, the computer device extracts three-dimensional face sample parameters of multiple dimensions from the preprocessed two-dimensional face sample picture based on the initial three-dimensional face reconstruction network to be trained. The three-dimensional face parameters are the geometric structure parameters of the three-dimensional face model, including face-shape parameters such as cheekbone position and mandible position, and facial feature parameters for the eyes, eyebrows, nose, mouth, and so on. The eye parameters may specifically include the outer-corner position, the inner-corner position, and the like; the eyebrow parameters may include the eyebrow rotation angle, eyebrow thickness, and the like; the nose parameters may include the nose height, nose wing size, and the like; and the mouth parameters may include the mouth-corner position, lip shape, and the like. For convenience of description, the three-dimensional face parameters extracted from two-dimensional face sample pictures in the training stage of the three-dimensional face reconstruction network are hereinafter referred to as three-dimensional face sample parameters, and the three-dimensional face parameters extracted from two-dimensional face target pictures in the application stage are referred to as three-dimensional face target parameters.
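A sketch of a reconstruction network of this kind, written in PyTorch; a ResNet-18 backbone is substituted for the Inception-ResNet-v2 mentioned above purely for brevity, and the number of parameter dimensions n_params is an assumed value:

```python
import torch.nn as nn
from torchvision import models

class FaceReconstructionNet(nn.Module):
    """Map a two-dimensional face picture to an n_params-dimensional vector of
    three-dimensional face parameters (face shape and facial features)."""

    def __init__(self, n_params=160):
        super().__init__()
        backbone = models.resnet18(weights=None)          # stand-in backbone (assumption)
        backbone.fc = nn.Linear(backbone.fc.in_features, n_params)
        self.backbone = backbone

    def forward(self, x):        # x: (B, 3, H, W) batch of face pictures
        return self.backbone(x)  # (B, n_params) three-dimensional face parameters
```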
Step 206, inputting the three-dimensional face sample parameters into the renderer simulation network for mapping to obtain a virtual face image.
The pre-trained renderer simulation network is a neural network for mapping three-dimensional face parameters into two-dimensional virtual face images. The renderer simulation network may be a network obtained by learning a mapping relationship between each set of three-dimensional face sample parameters and the virtual face image after acquiring a plurality of sets of three-dimensional face sample parameters and the corresponding virtual face image. The 3DMM library is provided with various forms of face models. The three-dimensional face sample parameters for the pre-training renderer simulation network may be model parameters of a three-dimensional face bone model in a 3DMM library.
The virtual face image is also a two-dimensional picture. It presents the face of the target object in a display style different from that of the two-dimensional face picture, such as a cartoon character or an ancient-costume character. As shown in fig. 3, the face angle in the virtual face image 304 may be fixed; for example, the face in the output virtual face image may always be frontal regardless of the face angle in the input two-dimensional face picture. The illumination intensity of the face in the virtual face image 304 may also be fixed.
Specifically, the computer device inputs the three-dimensional face parameters output by the three-dimensional face reconstruction network into the pre-trained renderer simulation network. In one particular embodiment, the renderer simulation network may be a convolutional neural network composed of a plurality of deconvolution layers Deconvi, where Deconvi is the i-th deconvolution layer and 1 ≤ i ≤ n. Taking n = 8 as an example, i.e., a renderer simulation network composed of 8 deconvolution layers, the kernel size, stride, and number of convolution kernels employed by each deconvolution layer Deconvi may differ, as shown in Table 1 below.
[Table 1: kernel size K, stride S, and number of convolution kernels C for each deconvolution layer Deconv1 to Deconv8]
Here K is the convolution kernel size (Kernel), S is the convolution kernel stride (Stride), and C is the number of convolution kernels (Channels). The deconvolution layers are serially cascaded: the output of deconvolution layer Deconvi is the input of the next deconvolution layer Deconvi+1. The kernel size and stride of each deconvolution layer Deconvi may be the same, e.g., K = 3 and S = 2. The matrix feature output by deconvolution layer Deconvi has dimensions Si × Si × C. For example, the 1 × 1 × N three-dimensional face parameters output by the three-dimensional face reconstruction network are input into deconvolution layer Deconv1; when the number of convolution kernels of Deconv1 is 128, Deconv1 outputs a 2 × 2 × 128 matrix feature. N is the number of dimensions of the three-dimensional face parameters used; for example, the cheekbone position may be one dimension and the nose height another. The number of convolution kernels of deconvolution layer Deconv8 is C = 3, corresponding to the three RGB (Red Green Blue) channels, so the output of the last layer Deconv8 is a color virtual face image.
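A PyTorch sketch of such a renderer simulation network. Only the kernel size K = 3, stride S = 2, the 128 kernels of Deconv1, and the 3 kernels of Deconv8 are stated above; the intermediate channel counts and the output activation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RendererSimulator(nn.Module):
    """Eight serially cascaded deconvolution layers mapping a 1x1xN parameter
    vector to a 256x256 RGB virtual face image (each layer doubles the size)."""

    def __init__(self, n_params=160,
                 channels=(128, 256, 256, 128, 64, 32, 16, 3)):  # intermediate counts assumed
        super().__init__()
        layers, in_ch = [], n_params
        for i, out_ch in enumerate(channels):
            # kernel 3, stride 2, padding 1, output_padding 1 doubles H and W
            layers.append(nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3,
                                             stride=2, padding=1, output_padding=1))
            if i < len(channels) - 1:
                layers.append(nn.ReLU(inplace=True))
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, params):                     # params: (B, n_params)
        x = params.view(params.size(0), -1, 1, 1)  # reshape to a 1x1 spatial map
        return torch.sigmoid(self.net(x))          # (B, 3, 256, 256) rendering graph
```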
Step 208, respectively inputting the virtual face image and the two-dimensional face sample picture into the face consistency network for feature extraction, and calculating a feature loss value of the virtual face image relative to the two-dimensional face sample picture.
The pre-trained face consistency network is a neural network for evaluating the loss of the face region in the virtual face image output by the renderer simulation network (hereinafter referred to as the rendering graph) relative to the face region in the two-dimensional face sample picture input to the three-dimensional face reconstruction network (hereinafter referred to as the real graph). The evaluation dimension may be face-shape consistency, facial-feature consistency, key-point consistency, or the like. The key points are pre-designated key pixels of the face region in a picture, such as pixels on the edge of the face region or on the edges of the facial features.
The face consistency network may be a preset image feature extraction algorithm or a pre-trained feature extraction machine learning network, etc. The feature extraction machine learning network can have the face feature extraction capability through sample learning. The machine learning model may employ a neural network model, a dual path network model (DPN, dualPathNetwork), a support vector machine, or a logistic regression model, among others.
In one embodiment, the feature extraction machine learning network may be a general-purpose machine learning network with feature extraction capability that has already been trained. Such a general-purpose network is not effective enough when used for extraction in a specific scenario, so it must be further trained and optimized with samples specific to that scenario. In this embodiment, the computer device may obtain the network structure and network parameters of the general-purpose machine learning network and import the network parameters into the feature extraction machine learning network structure to obtain a feature extraction machine learning network with network parameters. These carried network parameters serve in this embodiment as the initial parameters of the feature extraction machine learning network and participate in training.
In one embodiment, the feature extraction machine learning network may be a complex network formed of multiple interconnected layers. It may include multiple feature extraction layers, each with corresponding network parameters (possibly several per layer); the network parameters in each feature extraction layer apply a linear or nonlinear transformation to the input face image and produce a feature map (Feature Map) as the operation result. Each feature extraction layer receives the operation result of the previous layer, performs its own operation, and outputs the result to the next layer, until the last feature extraction layer completes its linear or nonlinear transformation; the face feature for the current input image is then obtained from the output of the last feature extraction layer. The network parameters are the parameters of the network structure and reflect the correspondence between the output and the input of each layer.
Specifically, the computer equipment inputs the virtual face image output by the renderer simulation network into a pre-trained face consistency network to obtain a first face feature of the virtual face image; and inputting the two-dimensional face sample picture input into the three-dimensional face reconstruction network into a pre-trained face consistency network to obtain a second face characteristic of the two-dimensional face picture. Wherein the face features are data for reflecting the facial features of the target object. The face features may reflect one or more of the feature information of the sex of the target object, the contour of the face, the hairstyle, glasses, nose, mouth, and distance between the facial organs. In one embodiment, the facial features may include facial texture features. The facial texture features may reflect pixel depths of facial organs, including the nose, ears, eyebrows, cheeks, or lips, etc. The facial texture features may include a color value distribution and a luminance value distribution of facial image pixels. The face features may be presented in the form of vectors or matrices.
Further, the computer device compares the difference between the first face feature and the second face feature to obtain the feature loss value of the virtual face image relative to the two-dimensional face picture. The feature loss value is a numerical value output by the face consistency network, after the real graph and the rendering graph are input to it during training, that represents the degree of difference between the target object in the rendering graph and in the real graph. The feature loss value can be calculated with a distance measure or a similarity measure, such as Euclidean distance, Manhattan distance, Chebyshev distance, or cosine similarity.
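A minimal sketch of this feature loss, assuming the face consistency network returns one embedding vector per image and using the cosine similarity mentioned above (any of the listed measures could be substituted):

```python
import torch.nn.functional as F

def feature_loss(consistency_net, rendered, real):
    """Feature loss of the rendering graph relative to the real graph."""
    f_render = consistency_net(rendered)   # first face feature
    f_real = consistency_net(real)         # second face feature
    # 1 - cosine similarity: zero when the two feature vectors coincide
    return (1.0 - F.cosine_similarity(f_render, f_real, dim=1)).mean()
```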
Step 210, adjusting parameters of the three-dimensional face reconstruction network based on the feature loss value, and continuing training until a training end condition is met.
The training ending condition is a condition for ending the network training. The training ending condition may be that the preset iteration number is reached, or that the three-dimensional face reconstruction network after the parameters are adjusted can enable the face consistency index of the target object in the rendering graph and the real graph to reach the preset index.
Specifically, the computer device may compare the differences between the rendering graph and the real graph through the face consistency network, and adjust the parameters of the three-dimensional face reconstruction network in the direction that reduces these differences. If the training end condition is not satisfied after the parameters are adjusted, the process returns to step 202 and training continues until the training end condition is satisfied.
Referring to fig. 4, fig. 4 illustrates a schematic diagram of three-dimensional face reconstruction network training in one embodiment. As shown in fig. 4, the three-dimensional face reconstruction network serves as the front network and the renderer simulation network as the rear network, and the two are cascaded to form a joint network that generates a rendering graph from a real graph. The pre-trained face consistency network has fixed network parameters; it evaluates the difference between the target object in the rendering graph and in the real graph from at least one dimension, back-propagates the resulting feature loss value to the joint network in a timely manner, and the parameters of the joint network are adjusted according to the feature loss value. Within the joint network, the renderer simulation network is also pre-trained, i.e., its network parameters can be fixed, and only the network parameters of the three-dimensional face reconstruction network need to be adjusted according to the feature loss value, so that a converged three-dimensional face reconstruction network can be obtained with fewer training iterations.
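A sketch of this joint training loop; the feature_loss helper from the earlier sketch, the optimizer, and the hyperparameters are illustrative assumptions, while the frozen renderer and the unsupervised use of unlabeled sample pictures follow the description above:

```python
import torch

def train_reconstruction_net(recon_net, renderer, consistency_net, loader,
                             epochs=10, lr=1e-4, device="cpu"):
    """Unsupervised training of the three-dimensional face reconstruction network."""
    for p in renderer.parameters():
        p.requires_grad_(False)            # renderer simulation network stays fixed
    optimizer = torch.optim.Adam(recon_net.parameters(), lr=lr)
    for _ in range(epochs):
        for real in loader:                # unlabeled two-dimensional face sample pictures
            real = real.to(device)
            params = recon_net(real)       # three-dimensional face sample parameters
            rendered = renderer(params)    # virtual face image (rendering graph)
            loss = feature_loss(consistency_net, rendered, real)
            optimizer.zero_grad()
            loss.backward()                # loss guides only recon_net's parameters
            optimizer.step()
    return recon_net
```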
In one embodiment, adjusting parameters of the three-dimensional face reconstruction network may be locally adjusting network parameters. In particular, part of network parameters can be kept unchanged, and the other part of network parameters can be adjusted.
In one embodiment, the face consistency network used to measure differences in rendered graphs from real graphs may also be a cost function. A cross entropy or a mean square error function may be chosen as the cost function. The computer device may end training when the value of the cost function is less than a preset value, resulting in a three-dimensional face reconstruction network that may be used to extract three-dimensional face parameters.
In the above three-dimensional face reconstruction network training method, the renderer simulation network is trained in advance, so that when the three-dimensional face parameters of the target object extracted from the two-dimensional face sample picture by the three-dimensional face reconstruction network to be trained are input into the renderer simulation network, they can be quickly and accurately mapped into a virtual face image. The virtual face image and the two-dimensional face sample picture are respectively input into the face consistency network, and the parameters of the three-dimensional face reconstruction network can be gradually adjusted according to the resulting feature loss value of the virtual face image relative to the two-dimensional face sample picture. During this parameter adjustment, the virtual face image obtained by the renderer simulation network mapping, guided by the three-dimensional face parameters output by the three-dimensional face reconstruction network, comes increasingly close to the two-dimensional face sample picture. Because the extraction of three-dimensional face parameters by the three-dimensional face reconstruction network is constrained and guided by the face consistency network, the training of the three-dimensional face reconstruction network can be completed without labeling the two-dimensional face sample pictures, achieving end-to-end unsupervised training; the trained three-dimensional face reconstruction network and the pre-trained renderer simulation network can then be used directly to convert two-dimensional face pictures into virtual face images quickly and in real time.
In one embodiment, the step of pre-training the renderer simulation network comprises: determining the application scene and target style of the three-dimensional face reconstruction network to be trained; acquiring three-dimensional face standard parameters and virtual face sample images of the target style suitable for the application scene; and training the renderer simulation network to be trained with the three-dimensional face standard parameters as input and the virtual face sample images as output, obtaining a pre-trained renderer simulation network for rendering virtual face images of the target style in the application scene.
The application scene refers to a scene that needs to convert a two-dimensional face picture into a virtual face image by means of the joint network provided by the application, such as a game scene, a payment scene, an AR (Augmented Reality) scene, a VR (Virtual Reality) scene, or a beautification scene. For example, in a beautification scene, a user may convert a two-dimensional face picture of himself, herself, or another person into a virtual face image of a specified style based on beautification software running on the terminal.
Each application scene can support conversion into virtual face images of multiple target styles. The target style refers to the overall representative appearance that the virtual face image presents, and may specifically include a cartoon style, an ancient style, a European style, a quirky style, a flashy style, a cool style, and the like. The same target style supported by different application scenes may differ; for example, the cartoon style in a game scene may be different from the cartoon style in a beautification scene.
This embodiment provides different joint networks for the requirements of generating virtual face images of different target styles in different application scenes. In other words, when the application scene or the target style is changed, only the joint network needs to be retrained and some of its parameters adjusted to obtain a new joint network suitable for the new application scene or target style; that is, joint networks applicable to different application scenes or target styles can be trained based on different three-dimensional face skeleton models.
Specifically, when pre-training the renderer simulation network, three-dimensional face skeleton models of various forms and the virtual face sample image corresponding to each three-dimensional face skeleton model can be obtained. A virtual face sample image is a virtual face image used as a sample for training the renderer simulation network.
A set of three-dimensional face parameters can be extracted from each obtained three-dimensional face skeleton model. As described above, the three-dimensional face parameters output by the three-dimensional face reconstruction network in its training stage are referred to as three-dimensional face sample parameters; in that stage the renderer simulation network has already been pre-trained, and the three-dimensional face parameters input to it are these three-dimensional face sample parameters. For convenience of description, the three-dimensional face parameters input directly to the renderer simulation network during the renderer simulation network training phase are hereinafter referred to as three-dimensional face standard parameters.
Each three-dimensional face skeleton model has a corresponding application scene and target style. After determining the application scene and target style to be used by the joint network to be trained, the computer device acquires the three-dimensional face skeleton model for that application scene and target style and extracts the corresponding three-dimensional face standard parameters. The computer device then makes the renderer simulation network to be trained learn the mapping between the three-dimensional face standard parameters and the virtual face sample images, i.e., it trains the renderer simulation network with the three-dimensional face standard parameters as input and the virtual face sample images as output; when the training stop condition is met, a pre-trained renderer simulation network capable of converting three-dimensional face parameters into virtual face images is obtained.
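A sketch of this pre-training step; the data source `pairs`, the L1 reconstruction loss, and the hyperparameters are illustrative assumptions, since the text only states that the standard parameters are the input and the virtual face sample images are the target output:

```python
import torch
import torch.nn.functional as F

def pretrain_renderer(renderer, pairs, epochs=20, lr=1e-4, device="cpu"):
    """Teach the renderer simulation network the mapping from three-dimensional
    face standard parameters to the corresponding virtual face sample images."""
    optimizer = torch.optim.Adam(renderer.parameters(), lr=lr)
    for _ in range(epochs):
        for params, target in pairs:                  # (standard parameters, sample image)
            params, target = params.to(device), target.to(device)
            rendered = renderer(params)               # mapped virtual face image
            loss = F.l1_loss(rendered, target)        # pixel-wise match to the sample image
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return renderer
```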
After the pre-trained renderer simulation network is obtained, the three-dimensional face reconstruction network is trained as described above. When the application scene or target style needs to be changed, the renderer simulation network is simply retrained based on the three-dimensional face skeleton model of the new application scene and target style and the corresponding virtual face images, and the three-dimensional face reconstruction network to be trained is then retrained based on the retrained renderer simulation network in the manner described above.
In this embodiment, only the three-dimensional face skeleton models in the 3DMM library and the virtual image corresponding to each three-dimensional face model are needed to train a renderer simulation network applicable to a specific style. When the application scene or target style changes, only part of the parameters of the renderer simulation network and the three-dimensional face reconstruction network need to be retrained and adjusted.
In one embodiment, the face consistency network comprises a facial feature segmentation network. Respectively inputting the virtual face image and the two-dimensional face sample picture into the face consistency network for feature extraction and calculating the feature loss value of the virtual face image relative to the two-dimensional face sample picture comprises: determining, through the pre-trained facial feature segmentation network, the probability value with which each picture region in the virtual face image belongs to each preset segmentation result, obtaining a first probability map corresponding to each segmentation result; determining, through the pre-trained facial feature segmentation network, the probability value with which each picture region in the two-dimensional face sample picture belongs to each preset segmentation result, obtaining a second probability map corresponding to each segmentation result; calculating the similar distance between the first probability map and the second probability map corresponding to each segmentation result; and determining the feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distances corresponding to all segmentation results.
The face consistency network adopted when training joint networks for different application scenes and target styles can be the same. In other words, when the application scene or target style changes, only the renderer simulation network needs to be retrained and its network parameters fixed; the three-dimensional face reconstruction network can then be retrained based on the retrained renderer simulation network and the shared pre-trained face consistency network. The face consistency network comprises a facial feature segmentation network, which is a neural network for recognizing the facial features of the face in a two-dimensional face picture. In a particular embodiment, the facial feature segmentation network may be a DFANet (Deep Feature Aggregation for Real-Time Semantic Segmentation) network.
A picture region refers to a pixel or a group of contiguous pixels in the real graph or the rendering graph. A segmentation result is the outcome of identifying the background and the facial feature parts in the real graph and the rendering graph. The segmentation results preset in this embodiment are of six types: face, eyes, eyebrows, nose, mouth, and background. The facial feature segmentation network finally outputs a probability map corresponding to each segmentation result. The probability map corresponding to a segmentation result records, for each picture region in the real graph or rendering graph, the probability that the region belongs to that segmentation result; for example, the probability map corresponding to the background records the probability that each picture region is background. The recorded probability values may be binary (0 or 1) or any value between 0 and 1.
For convenience of description, a probability map obtained by dividing a rendering map by a facial feature division network is referred to as a first probability map, and a probability map obtained by dividing a real map by a facial feature division network is referred to as a second probability map.
Specifically, the computer device inputs the rendering graph into the facial feature segmentation network to obtain the first probability maps corresponding to the six segmentation results for the rendering graph, and inputs the real graph into the facial feature segmentation network to obtain the second probability maps corresponding to the six segmentation results for the real graph. A probability map can in fact be understood as a probability matrix. The computer device calculates the similar distance between the first probability map and the second probability map, and this similar distance, or a transformed value of it, is used as the feature loss value of the rendering graph relative to the real graph on the facial features. As above, the similar distance may be calculated with any distance measure or similarity measure. For example, when the feature loss value of the rendering graph relative to the real graph on the facial features is calculated with the Euclidean distance, the calculation formula may be:
L_parsing = (1/6) * Σ_k (parsing_img^(k) − parsing_render^(k))^2, where k indexes the six segmentation results
where the subscript img denotes the real map, the subscript render denotes the rendering map, and parsing denotes a probability map output by the facial feature segmentation network: parsing_img denotes a second probability map and parsing_render denotes the corresponding first probability map. The similar distances (parsing_img − parsing_render)^2 between the probability maps corresponding to the six segmentation results are averaged to obtain the characteristic loss value L_parsing of the rendering map relative to the real map on the five sense organs.
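For illustration only, the facial feature loss above may be sketched in a PyTorch-style implementation as follows; the framework, tensor shapes and function names are assumptions of this sketch and are not fixed by the embodiment:

```python
import torch

def parsing_loss(parsing_render: torch.Tensor, parsing_img: torch.Tensor) -> torch.Tensor:
    """Facial feature loss of the rendering map relative to the real map.

    parsing_render: first probability maps, shape (6, H, W) -- one map per segmentation
                    result (face, eyes, eyebrows, nose, mouth, background).
    parsing_img:    second probability maps of the real map, same shape.
    """
    # squared similar distance per segmentation result, then average over the six results
    per_result = ((parsing_img - parsing_render) ** 2).mean(dim=(1, 2))
    return per_result.mean()

# illustrative usage with random tensors standing in for segmentation-network outputs
render_probs = torch.rand(6, 256, 256)
real_probs = torch.rand(6, 256, 256)
loss = parsing_loss(render_probs, real_probs)
```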
In this embodiment, the loss of the rendering graph relative to the real graph is measured from the five sense organs, and the parameters of the three-dimensional face reconstruction network are adjusted according to the loss, so that the three-dimensional face parameters output by the three-dimensional face reconstruction network guide the renderer to simulate the virtual face image obtained by network mapping to be more and more close to the two-dimensional face sample graph, and the model training effect is improved.
In one embodiment, the facial feature segmentation network includes a plurality of convolution layers connected in series; the calculating of the similarity distance between the first probability map and the second probability map corresponding to each segmentation result comprises the following steps: acquiring first intermediate features of the virtual face image output by a convolution layer in a target sequence, and respectively fusing the first intermediate features with first probability maps corresponding to each segmentation result to obtain first fusion features corresponding to each segmentation result; acquiring second intermediate features of two-dimensional face sample pictures output by a convolution layer in a target sequence, and respectively fusing the second intermediate features with second probability maps corresponding to each segmentation result to obtain second fusion features corresponding to each segmentation result; and respectively calculating the similar distance between the first fusion feature and the second fusion feature corresponding to each segmentation result.
The face five-sense organ segmentation network comprises a plurality of serially cascaded convolution layers; each convolution layer outputs different picture features, and the final sequential convolution layer outputs the probability maps. An intermediate feature refers to the picture feature output by the target sequential convolution layer. The target sequence is any sequence other than the final sequence, and may specifically be a sequence near the middle; for example, if the facial five-sense organ segmentation network includes 8 convolution layers, the target sequential convolution layer may be the 4th convolution layer or the like. The probability map output by the final sequential convolution layer is a high-level feature, whereas the intermediate feature output by the target sequential convolution layer is a low-level feature that, for example, only identifies edge regions or the color distribution of the picture.
For convenience of description, an intermediate feature obtained by dividing a rendering image by the facial feature division network is referred to as a first intermediate feature, and an intermediate feature obtained by dividing a real image by the facial feature division network is referred to as a second intermediate feature.
Specifically, the computer equipment inputs the rendering map into the facial feature segmentation network and obtains the first intermediate feature output by the target sequential convolution layer in the facial feature segmentation network. The computer equipment inputs the real map into the facial feature segmentation network and obtains the second intermediate feature output by the target sequential convolution layer. The computer equipment fuses the first intermediate feature with the first probability map of each segmentation result, and fuses the second intermediate feature with the second probability map of each segmentation result, to obtain the fusion features corresponding to each segmentation result; the similar distances between the rendering map and the real map over the fusion features of the multiple segmentation results are then averaged to obtain the feature loss value of the rendering map relative to the real map on the five sense organs. The corresponding calculation formula may be:
L_parsing = (1/6) * Σ_k (parsing_img^(k) * feat_img − parsing_render^(k) * feat_render)^2, where k indexes the six segmentation results
where feat_img denotes the second intermediate feature, feat_render denotes the first intermediate feature, parsing_img * feat_img denotes the fusion result of the second probability map and the second intermediate feature, and parsing_render * feat_render denotes the fusion result of the first probability map and the first intermediate feature. The similar distances (parsing_img * feat_img − parsing_render * feat_render)^2 of the fusion results corresponding to the six segmentation results are calculated respectively and averaged to obtain the characteristic loss value L_parsing of the rendering map relative to the real map on the five sense organs.
In one embodiment, the computer device may also separately calculate the difference between the rendering graph and the real graph on the intermediate feature, then fuse the difference with the difference on the probability graph, and use the fused result as the feature loss value of the rendering graph and the real graph on the five sense organs. The corresponding calculation formula may be:
L_parsing = (1/6) * Σ_k (parsing_img^(k) − parsing_render^(k))^2 + (feat_img − feat_render)^2
in this embodiment, the combination of the low-level features and the high-level features can improve the stringency of the consistency test, thereby further improving the training effect of the three-dimensional face reconstruction network.
In one embodiment, the face consistency network comprises a face recognition network; respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network for feature extraction, and calculating the feature loss value of the virtual face image relative to the two-dimensional face sample picture comprises the following steps: extracting first face features from the virtual face image through a face recognition network; extracting second face features from the two-dimensional face sample picture through a face recognition network; and determining a feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distance between the first face feature and the second face feature.
The face recognition network is a neural network for recognizing the face shape of the face in a two-dimensional face picture. The face recognition network may adopt, but is not limited to, the face recognition technology built into iOS, OpenCV face recognition technology, UIsee, Face++, SenseTime face recognition technology, and the like. In a specific embodiment, the face recognition network may be an Inception-ResNet-v1 network (a convolutional neural network). The face features extracted by the face recognition network are high-dimensional features that measure the overall appearance of the target object, such as the face shape and whether the face is full.
For convenience of description, the face features obtained by the face recognition network recognizing the rendering map are called as first face features, and the face features obtained by the face recognition network recognizing the real map are called as second face features.
Specifically, the computer equipment inputs the rendering graph into a pre-trained face recognition network to obtain first face features corresponding to the rendering graph. The computer equipment inputs the real image into a face recognition network to obtain a second face feature corresponding to the real image. The computer equipment calculates the similar distance between the first face feature and the second face feature, and the similar distance or a transformation value of the similar distance is used as a feature loss value of the rendering diagram relative to the real diagram on the face shape. The corresponding calculation formula may be:
L_recog = (embed_img − embed_render)^2
where embed_img denotes the second face feature, embed_render denotes the first face feature, and L_recog denotes the feature loss value of the rendering map relative to the real map on the face shape.
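A corresponding sketch for the face shape term, assuming the recognition network outputs a fixed-length embedding vector (the 512-dimensional size below is only an illustrative assumption):

```python
import torch

def face_shape_loss(embed_render: torch.Tensor, embed_img: torch.Tensor) -> torch.Tensor:
    """L_recog: squared distance between the face features of the rendering map and the real map.

    embed_render / embed_img: 1-D embeddings (e.g. 512-dim) produced by a pre-trained
    face recognition network such as Inception-ResNet-v1.
    """
    return ((embed_img - embed_render) ** 2).mean()

# illustrative usage with placeholder embeddings
loss = face_shape_loss(torch.rand(512), torch.rand(512))
```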
In this embodiment, the loss of the rendering graph relative to the real graph is measured from the overall facial features such as the face shape, and the parameters of the three-dimensional face reconstruction network are adjusted according to the loss, so that the three-dimensional face parameters output by the three-dimensional face reconstruction network guide the renderer to simulate the network to map to obtain the virtual face image which can be more and more close to the two-dimensional face sample graph, and the model training effect is improved.
In one embodiment, the face consistency network comprises a face key point detection network; the calculating of the characteristic loss value of the virtual face image relative to the two-dimensional face sample picture through the pre-trained face consistency network comprises the following steps: extracting first key point features from the virtual face image through a face key point detection network; extracting second key point features from the two-dimensional face sample picture through a face key point detection network; and determining the feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distance between the first key point feature and the second key point feature.
The face key point detection network is a neural network for carrying out key point recognition on faces in two-dimensional face pictures. The key points are key pixel points of a face area in a pre-designated input picture, such as pixel points of the edge position of the face area, pixel points of the edge position of the five sense organs and the like. In a specific embodiment, the face keypoint detection network may be a serial concatenation of multiple convolution layers and a maximum pooling layer, such as a series of 6 3*3 convolution layers and a maximum pooling layer.
As shown in table 1 above, the multiple deconvolution layers in the renderer simulation network gradually convert the input 1×1×N three-dimensional face parameters into a high-dimensional 256×256×3 virtual face image. Convolution is the inverse operation of deconvolution. The multiple convolution layers in the face key point detection network can extract key point features of dimension 1×M, such as 1×2000, from the high-dimensional real map and rendering map respectively. The maximum pooling layer is used to reduce the 1×M-dimensional key point features and finally output key point features of the target dimension 1×O, where O is the number of key points annotated in the sample pictures adopted when training the face key point detection network, such as 106 two-dimensional face key points.
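A sketch of such a key point detection head, assuming PyTorch; the channel widths, strides and the final linear projection are illustrative choices not fixed by the embodiment:

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """Six 3x3 convolution layers followed by max pooling, mapping a 256x256x3 picture
    to a 1 x O key point feature (O = number of annotated key points, e.g. 106)."""

    def __init__(self, num_keypoints: int = 106):
        super().__init__()
        channels = [3, 16, 32, 64, 128, 256, 256]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)   # six 3x3 convolution layers
        self.pool = nn.AdaptiveMaxPool2d(1)   # maximum pooling layer
        self.out = nn.Linear(channels[-1], num_keypoints)

    def forward(self, picture: torch.Tensor) -> torch.Tensor:
        x = self.convs(picture)               # (B, 256, 4, 4) for a 256x256 input
        x = self.pool(x).flatten(1)           # (B, 256)
        return self.out(x)                    # (B, O) key point feature

kp_feature = KeypointHead()(torch.rand(1, 3, 256, 256))  # shape (1, 106)
```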
For convenience of description, the key point feature obtained by the face key point detection network detecting the rendering map is referred to as the first key point feature, and the key point feature obtained by the face key point detection network detecting the real map is referred to as the second key point feature.
Specifically, the computer equipment inputs the rendering graph into a pre-trained face key point detection network to obtain a first key point characteristic corresponding to the rendering graph. The computer equipment inputs the real image into a human face key point detection network to obtain a second key point characteristic corresponding to the real image. The computer equipment calculates the similar distance between the first key point feature and the second key point feature, and the similar distance or a transformation value of the similar distance is used as a feature loss value of the rendering diagram relative to the real diagram on the key points. The corresponding calculation formula may be:
L_kp = (kp_img − kp_render)^2
where kp_img denotes the second key point feature, kp_render denotes the first key point feature, and L_kp denotes the feature loss value of the rendering map relative to the real map on the key points.
In this embodiment, the loss of the rendering graph relative to the real graph is measured from the key points, and parameters of the three-dimensional face reconstruction network are adjusted according to the loss, so that the three-dimensional face parameters output by the three-dimensional face reconstruction network guide the renderer to simulate the virtual face image obtained by network mapping to be more and more close to the two-dimensional face sample graph, and the model training effect is improved.
In one embodiment, the face consistency network comprises a face five sense organ segmentation network, a face recognition network and a face key point detection network; respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network for feature extraction, and calculating the feature loss value of the virtual face image relative to the two-dimensional face sample picture comprises the following steps: calculating a facial feature loss value of the virtual face image relative to the two-dimensional face sample picture through a face facial segmentation network; calculating face feature loss values of the virtual face image relative to the two-dimensional face sample picture through a face recognition network; calculating a key point characteristic loss value of the virtual face image relative to the two-dimensional face sample picture through a face key point detection network; and determining the comprehensive characteristic loss value of the virtual face image relative to the two-dimensional face sample picture according to the five-sense organ characteristic loss value, the face characteristic loss value and the key point characteristic loss value.
In this embodiment, the face consistency network includes a face five-sense organ segmentation network, a face recognition network and a face key point detection network. In other embodiments, the face consistency network may also be any one of, or a combination of any of, the face five sense organ segmentation network, the face recognition network and the face key point detection network. The face consistency network may further include networks that measure the consistency of the face of the target object in the rendering map and the real map from other dimensions, which is not limited here.
As described above, the facial feature segmentation network is a neural network for performing facial feature recognition on the face in a two-dimensional face picture; by calculating the similar distance between the fusion results of the probability maps and intermediate features output by the facial feature segmentation network for the rendering map and for the real map, the feature loss value L_parsing of the rendering map relative to the real map on the five sense organs (five sense organ feature loss value for short) can be obtained. The face recognition network is a neural network for face recognition of the face in a two-dimensional face picture; by calculating the similar distance between the face features output by the face recognition network for the rendering map and for the real map, the feature loss value L_recog of the rendering map relative to the real map in overall appearance (face feature loss value for short) can be obtained. The face key point detection network is a neural network for key point recognition of the face in a two-dimensional face picture; by calculating the similar distance between the key point features output by the face key point detection network for the rendering map and for the real map, the feature loss value L_kp of the rendering map relative to the real map on the key point distribution (key point feature loss value for short) can be obtained.
Specifically, the computer device calculates the five sense organ feature loss value L_parsing, the face feature loss value L_recog and the key point feature loss value L_kp of the rendering map relative to the real map based on the face five sense organ segmentation network, the face recognition network and the face key point detection network respectively. The computer device fuses L_parsing, L_recog and L_kp, and the fusion result is used as the comprehensive feature loss value of the rendering map relative to the real map. The fusion of L_parsing, L_recog and L_kp may specifically adopt summation, averaging, or a similar approach. The corresponding calculation formula may be:
L = L_kp + L_recog + L_parsing
where L is the comprehensive feature loss value.
It is easy to understand that L_parsing, L_recog and L_kp may also be fused based on an algorithm of Bayesian decision theory, an algorithm based on sparse representation theory, or an algorithm based on deep learning theory to obtain the comprehensive feature loss value.
In this embodiment, the loss of the rendering graph relative to the real graph is measured from the five sense organs, the facial form and the multiple dimensions of the key points, and the parameters of the three-dimensional face reconstruction network are adjusted according to the loss, so that the three-dimensional face parameters output by the three-dimensional face reconstruction network guide the renderer to simulate the virtual face image obtained by network mapping to be more and more close to the two-dimensional face sample graph, and the model training effect is improved.
In one embodiment, determining the integrated feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the facial feature loss value, the face feature loss value and the key point feature loss value comprises: carrying out standardized treatment on the feature loss value of the five sense organs, the feature loss value of the facial form and the feature loss value of the key points; and fusing the standardized facial feature loss value, the face feature loss value and the key point feature loss value to obtain the comprehensive feature loss value of the virtual face image relative to the two-dimensional face sample picture.
Specifically, the computer device fuses L_parsing, L_recog and L_kp; the fusion algorithm may specifically adopt weighted summation. The corresponding calculation formula may be:
L = a*L_kp + b*L_recog + c*L_parsing
where a is the weight corresponding to the key point feature loss value, b is the weight corresponding to the face feature loss value, and c is the weight corresponding to the five sense organ feature loss value.
The feature dimensions respectively output by the face five-sense organ segmentation network, the face recognition network and the face key point detection network may be different, so that the judgment standards based on the feature loss values calculated by different networks may be different. Different weights are added for different feature loss values, so that each network can play a role in balancing the face consistency evaluation, in other words, the feature loss values of the five sense organs, the feature loss values of the facial features and the feature loss values of the key points can be balanced and reflected in the final comprehensive feature loss value. Different weights are added for different characteristic loss values, and the method can be specifically realized as a process of normalizing the characteristic loss values output by each network.
In this embodiment, since the feature dimensions output by the sub-networks differ, adding different weights to the different feature loss values first brings each feature loss value to a common judgment standard, so that each feature loss value reflects the difference of the rendering map relative to the real map more accurately, which further improves the training effect of the three-dimensional face reconstruction network.
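A minimal sketch of the weighted fusion; the default weight values below are assumptions, and in practice they would be chosen so that the three terms are normalized to comparable magnitudes as described above:

```python
import torch

def combined_loss(l_kp: torch.Tensor, l_recog: torch.Tensor, l_parsing: torch.Tensor,
                  a: float = 1.0, b: float = 1.0, c: float = 1.0) -> torch.Tensor:
    """L = a*L_kp + b*L_recog + c*L_parsing; a, b, c balance the differing feature scales."""
    return a * l_kp + b * l_recog + c * l_parsing
```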
In one embodiment, the three-dimensional face reconstruction network training method further includes: acquiring a two-dimensional face target picture; extracting three-dimensional face target parameters from a two-dimensional face target picture based on a three-dimensional face reconstruction network after training; and inputting the three-dimensional face target parameters into a renderer simulation network to obtain the virtual face image.
After the training of the three-dimensional face reconstruction network is completed, the joint network consisting of the three-dimensional face reconstruction network and the renderer simulation network can be applied to generate virtual face images. Specifically, the computer equipment acquires a two-dimensional face target picture, inputs it into the trained three-dimensional face reconstruction network, and inputs the three-dimensional face target parameters output by the three-dimensional face reconstruction network into the cascaded renderer simulation network to obtain the virtual face image mapped from the three-dimensional face target parameters.
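The application stage may be sketched as a single forward pass through the cascaded networks; the module names, call signatures and tensor shapes below are placeholders assumed for illustration:

```python
import torch

@torch.no_grad()
def generate_virtual_face(target_picture: torch.Tensor,
                          reconstruction_net: torch.nn.Module,
                          renderer_sim_net: torch.nn.Module) -> torch.Tensor:
    """Maps a two-dimensional face target picture to a virtual face image.

    target_picture:     (1, 3, H, W) two-dimensional face target picture.
    reconstruction_net: trained three-dimensional face reconstruction network.
    renderer_sim_net:   pre-trained renderer simulation network.
    """
    params = reconstruction_net(target_picture)   # three-dimensional face target parameters
    return renderer_sim_net(params)               # virtual face image
```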
In one embodiment, extracting three-dimensional face target parameters from a two-dimensional face target picture based on the trained three-dimensional face reconstruction network comprises: determining an application scene and a target style configured for a two-dimensional face target picture; acquiring a three-dimensional face reconstruction network and a pre-trained renderer simulation network which are applicable to application scenes and target styles after training is finished; and extracting three-dimensional face target parameters from the two-dimensional face target picture based on the obtained three-dimensional face reconstruction network.
As described above, in order to improve the fidelity of the virtual face image, the application scenes and the target styles may be distinguished for performing network training, that is, a set of corresponding joint networks may be trained for each target style in each application scene. The three-dimensional face reconstruction network provided by the application can be widely applied to various scenes needing three-dimensional face modeling. The joint network provided by the application can be widely applied to various scenes needing virtual face image generation.
According to the three-dimensional face reconstruction network training method, end-to-end unsupervised training can be performed without any three-dimensional face parameter labeling, and the three-dimensional face parameters of an input face and the corresponding virtual face image can be generated stably, accurately and in real time.
In a specific embodiment, as shown in fig. 6, the three-dimensional face reconstruction network training method provided in the present application specifically includes the following steps (a code sketch of these steps is given after the list):
s602, determining an application scene and a target style of a three-dimensional face reconstruction network to be trained.
S604, obtaining three-dimensional face standard parameters and virtual face sample images of a target style suitable for an application scene.
S606, training the renderer simulation network to be trained by taking the three-dimensional face standard parameters as input and the virtual face sample image as output to obtain a pre-trained renderer simulation network for rendering the virtual face image of the target style in the application scene.
And S608, acquiring a two-dimensional face sample picture.
S610, extracting three-dimensional face sample parameters from two-dimensional face sample pictures through a three-dimensional face reconstruction network to be trained.
And S612, inputting the three-dimensional face sample parameters into a renderer simulation network for mapping processing, and obtaining the virtual face image.
S614, determining probability values of each picture area belonging to each preset segmentation result in the virtual face image through a face five-sense organ segmentation network, and obtaining a first probability map corresponding to each segmentation result; the facial five-sense organ segmentation network comprises a plurality of convolution layers connected in series.
S616, determining probability values of each picture area belonging to each preset segmentation result in the two-dimensional face sample picture through the face five-sense organ segmentation network, and obtaining a second probability map corresponding to each segmentation result.
S618, obtaining first intermediate features of the virtual face image output by the convolution layer of the target sequence, and fusing the first intermediate features with first probability diagrams corresponding to each segmentation result respectively to obtain first fusion features corresponding to each segmentation result.
S620, obtaining second intermediate features of the two-dimensional face sample pictures output by the convolution layers in the target sequence, and fusing the second intermediate features with second probability diagrams corresponding to each segmentation result respectively to obtain second fusion features corresponding to each segmentation result.
S622, calculating the similar distance between the first fusion feature and the second fusion feature corresponding to each segmentation result.
S624, determining the five-sense organ characteristic loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distances corresponding to all the segmentation results.
S626, extracting first face features from the virtual face image through the face recognition network.
S628, extracting the second face features from the two-dimensional face sample picture through the face recognition network.
S630, determining face feature loss values of the virtual face image relative to the two-dimensional face sample picture according to the similar distances between the first face features and the second face features.
S632, extracting first key point characteristics from the virtual face image through the face key point detection network.
S634, extracting second key point features from the two-dimensional face sample picture through the face key point detection network.
And S636, determining the key point feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distance between the first key point feature and the second key point feature.
S638, determining the comprehensive characteristic loss value of the virtual face image relative to the two-dimensional face sample picture according to the five-sense organ characteristic loss value, the face characteristic loss value and the key point characteristic loss value.
And S640, adjusting parameters of the three-dimensional face reconstruction network based on the comprehensive characteristic loss value, and continuing training until the training is finished when the training finishing condition is met.
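Under the same PyTorch assumption, steps S608 to S640 may be sketched as the following training step, in which only the three-dimensional face reconstruction network is updated; the consistency-loss helper and all names are illustrative assumptions rather than elements fixed by the embodiment:

```python
import torch

def train_step(sample_picture, reconstruction_net, renderer_sim_net,
               consistency_losses, optimizer):
    """One unsupervised training step; renderer_sim_net and the face consistency
    networks wrapped by consistency_losses are pre-trained and kept frozen."""
    params = reconstruction_net(sample_picture)                               # S610
    rendered = renderer_sim_net(params)                                       # S612
    l_parsing, l_recog, l_kp = consistency_losses(rendered, sample_picture)   # S614-S636
    loss = l_kp + l_recog + l_parsing                                         # S638
    optimizer.zero_grad()
    loss.backward()                                                           # S640
    optimizer.step()
    return loss.item()
```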
In the three-dimensional face reconstruction network training method, the renderer simulation network is trained in advance, so that when the three-dimensional face parameters of the target object, extracted from the two-dimensional face sample picture by the three-dimensional face reconstruction network to be trained, are input into the renderer simulation network, they can be mapped into a virtual face image quickly and accurately; the virtual face image and the two-dimensional face sample picture are respectively input into the face consistency network, and the parameters of the three-dimensional face reconstruction network can be gradually adjusted according to the obtained feature loss value of the virtual face image relative to the two-dimensional face sample picture. During parameter adjustment, the virtual face image obtained by the renderer simulation network mapping, guided by the three-dimensional face parameters output by the three-dimensional face reconstruction network, becomes closer and closer to the two-dimensional face sample picture. Since the extraction of three-dimensional face parameters by the three-dimensional face reconstruction network is constrained and guided by the face consistency network, training of the three-dimensional face reconstruction network can be completed without labeling the two-dimensional face sample pictures, achieving end-to-end unsupervised training, and two-dimensional face pictures can be converted into virtual face images quickly and in real time by directly using the trained three-dimensional face reconstruction network and the pre-trained renderer simulation network.
In one embodiment, as shown in fig. 7, a virtual face image generation method is provided, and the method is applied to the computer device in fig. 1 for illustration, and the computer device may be specifically the terminal 110 or the server 120 in the above figures. Referring to fig. 7, the virtual face image generation method specifically includes the steps of:
s702, acquiring a two-dimensional face target picture.
S704, extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on the three-dimensional face reconstruction network.
And S706, rendering the three-dimensional face target parameters based on the renderer simulation network to obtain the virtual face image of the target object. The three-dimensional face reconstruction network is obtained by performing unsupervised training by means of a pre-trained renderer simulation network and a face consistency network; the face consistency network is used for calculating a characteristic loss value of output data of the renderer simulation network relative to input data of the three-dimensional face reconstruction network; the characteristic loss value is used for guiding the adjustment of parameters of the three-dimensional face reconstruction network.
After the training of the three-dimensional face reconstruction network is completed, the joint network consisting of the three-dimensional face reconstruction network and the renderer simulation network can be applied to generate virtual face images. Specifically, the computer equipment acquires a two-dimensional face target picture, inputs it into the trained three-dimensional face reconstruction network, and inputs the three-dimensional face target parameters output by the three-dimensional face reconstruction network into the cascaded renderer simulation network to obtain the virtual face image mapped from the three-dimensional face target parameters.
Referring to fig. 8, fig. 8 shows a schematic diagram of converting a two-dimensional face target picture into a virtual face image, in one embodiment, based on the three-dimensional face reconstruction network obtained by the three-dimensional face reconstruction network training provided by the present application and the renderer simulation network cascaded with it. As shown in fig. 8, the ability of the joint network to convert the two-dimensional face picture into the virtual face image is continuously constrained and strengthened by the face consistency network in the training stage, so that the consistency of the faces in the rendering map 802 and the real map 804 in dimensions such as face shape, five sense organs and key points can be ensured, and the rendering map has higher fidelity.
Referring to fig. 9, fig. 9 is a schematic diagram illustrating, in one embodiment, the conversion of two-dimensional face sample pictures of the same target object photographed from multiple viewing angles under multiple illumination environments into a virtual face object. As shown in fig. 9, the face angles in different two-dimensional face target pictures 902 may differ, for example a front face, or a side face turned by a certain angle towards some viewing-angle direction relative to the front face, and so on. The viewing-angle directions include a left viewing angle, a right viewing angle, an elevation viewing angle, a depression viewing angle, and the like. However, the turning angle of a side face should not be too large; at least facial attributes such as face shape and the five sense organs should remain clearly distinguishable. The illumination brightness of the different two-dimensional face target pictures 902 may also differ due to the shooting environment and the like.
When the renderer simulation network is trained, standard virtual face sample objects are used uniformly. A standard virtual face sample object is a virtual face sample object of the target object with a fixed face angle (such as a front face) and fixed illumination brightness. Thus, in the training phase, the renderer simulation network learns the ability to map three-dimensional face sample parameters to virtual face sample objects with a fixed face angle and fixed illumination brightness. Therefore, in the application stage, as shown in fig. 9, a two-dimensional face target picture with an arbitrary face angle and illumination brightness can be mapped into the virtual face object 904 with a unified face angle and illumination brightness based on the joint network provided by the application. In other words, even two-dimensional face sample pictures of the same target object shot from multiple viewing angles in multiple illumination environments can be converted into consistent virtual face images based on the joint network.
In this embodiment, since training in the joint network training stage is based on a large number of two-dimensional face pictures, three-dimensional face standard parameters and virtual face sample objects, the statistical rule by which two-dimensional face pictures are mapped into virtual face images is learned and a globally optimal solution can be found, giving strong generalization capability. Therefore, different two-dimensional face pictures of the same target object under different viewing angles and illumination conditions can be reliably and stably converted into nearly identical virtual face images.
In one embodiment, the virtual face image generating method further includes: determining an application scene and a target style configured for a two-dimensional face target picture; acquiring a three-dimensional face reconstruction network and a renderer simulation network which are applicable to application scenes and target styles; the three-dimensional face target parameters of the target object in the two-dimensional face target picture are extracted based on the three-dimensional face reconstruction network, and the three-dimensional face target parameters comprise: and extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on the obtained three-dimensional face reconstruction network.
The method provided by the application can be suitable for personalized three-dimensional face models of different styles. For example, a renderer simulation network A1 is obtained based on training of a three-dimensional face skeleton model of a cartoon style, a three-dimensional face reconstruction network C1 suitable for the cartoon style can be obtained based on training of the renderer simulation network A1 and a face consistency network B, and a joint network C1+A1 formed by cascading the three-dimensional face reconstruction network C1 and the renderer simulation network A1 can be used for converting a two-dimensional face picture into a virtual face image of the cartoon style. Similarly, a renderer simulation network A2 is obtained based on training of a three-dimensional face skeleton model of an ancient style, a three-dimensional face reconstruction network C2 suitable for the ancient style can be trained based on the renderer simulation network A2 and the face consistency network B, and a joint network C2+A2 formed by cascading the three-dimensional face reconstruction network C2 and the renderer simulation network A2 can be used for converting two-dimensional face pictures into virtual face images in the ancient style.
More importantly, as the three-dimensional face parameters extracted from the two-dimensional face picture by the three-dimensional face reconstruction network can reflect the characteristics of the target object, the renderer simulation network is trained and obtained based on the three-dimensional face skeleton model of a specific style, so that the personalized model characteristics of the three-dimensional face skeleton model are learned, and the virtual portrait image combining the characteristics of the person and the personalized model characteristics can be reconstructed by the method provided by the application.
A traditional scheme extracts three-dimensional face parameters through continuous iterative optimization: a two-dimensional face picture and an adjustable initial three-dimensional face parameter X0 are input into a three-dimensional face reconstruction network, which outputs a three-dimensional face parameter X1; in the first iteration, the three-dimensional face parameter X1 and the two-dimensional face picture are input into the three-dimensional face reconstruction network, which outputs a three-dimensional face parameter X2; and the iteration continues until the output three-dimensional face parameter Xn meets the iteration stopping condition. With this continuous iterative optimization, a single two-dimensional face picture usually requires 50-100 iterations and nearly 1 minute, which is difficult to adapt to scenes with high real-time requirements. In contrast, the method and the device of the present application adjust the network parameters directly in the network training stage; after training is completed and the network parameters are fixed, three-dimensional face parameters are extracted directly by the trained three-dimensional face reconstruction network without any iteration, which greatly improves the efficiency of three-dimensional face parameter extraction and allows application to scenes with high real-time requirements.
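For contrast, the two extraction schemes can be sketched as follows; both networks are placeholders with assumed call signatures, and the iteration count is only the typical figure quoted above:

```python
import torch

@torch.no_grad()
def iterative_extract(picture, iterative_net, x0, num_iters=100):
    """Traditional scheme: feed the picture and the current parameter estimate back repeatedly."""
    x = x0
    for _ in range(num_iters):        # typically 50-100 iterations per picture
        x = iterative_net(picture, x)
    return x

@torch.no_grad()
def direct_extract(picture, reconstruction_net):
    """Scheme of the present application: a single forward pass, no iteration."""
    return reconstruction_net(picture)
```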
In a specific embodiment, as shown in fig. 10, the virtual face image generating method provided in the present application specifically includes the following steps:
s1002, acquiring a two-dimensional face target picture.
S1004, determining an application scene and a target style configured for the two-dimensional face target picture.
S1006, acquiring a three-dimensional face reconstruction network and a renderer simulation network which are applicable to application scenes and target styles.
S1008, extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on the obtained three-dimensional face reconstruction network.
And S1010, rendering the three-dimensional face target parameters based on the acquired renderer simulation network to obtain the virtual face image of the target object.
The three-dimensional face reconstruction network is obtained by performing unsupervised training by means of a pre-trained renderer simulation network and a face consistency network; the face consistency network is used for calculating a characteristic loss value of output data of the renderer simulation network relative to input data of the three-dimensional face reconstruction network; the characteristic loss value is used for guiding the adjustment of parameters of the three-dimensional face reconstruction network.
According to the method, end-to-end unsupervised training can be performed without any three-dimensional face parameter labeling, and the three-dimensional face parameters of an input face and the corresponding virtual face image can be generated stably, accurately and in real time. In addition, when the application scene or target style changes, only part of the parameters of the renderer simulation network and the three-dimensional face reconstruction network need to be retrained and adjusted, so the development cost is low and the expandability is strong. Therefore, the three-dimensional face reconstruction network training method provided by the application can serve as a technical basis for a large number of avatar applications, game applications and VR/AR applications, improving the performance and effect of existing applications.
It should be understood that, although the steps in the flowcharts of fig. 2, 6, 7, and 10 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly recited herein, the steps are not strictly limited to that order of execution and may be executed in other orders. Moreover, at least some of the steps in fig. 2, 6, 7, and 10 may include multiple steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a portion of the steps or stages in other steps.
In one embodiment, as shown in fig. 11, a three-dimensional face reconstruction network training apparatus 1100 is provided, comprising: a sample acquisition module 1102, a feature extraction module 1104, a loss measurement module 1106, and a network training module 1108, wherein:
the sample obtaining module 1102 is configured to obtain a two-dimensional face sample picture.
The feature extraction module 1104 is configured to extract three-dimensional face sample parameters from two-dimensional face sample pictures through a three-dimensional face reconstruction network to be trained.
The loss measurement module 1106 is configured to input three-dimensional face sample parameters into a pre-trained renderer simulation network to obtain a virtual face image; and calculating the characteristic loss value of the virtual face image relative to the two-dimensional face sample picture through the pre-trained face consistency network.
The network training module 1108 is configured to adjust parameters of the three-dimensional face reconstruction network based on the feature loss value and continue training until the training end condition is satisfied.
In one embodiment, as shown in fig. 12, the three-dimensional face reconstruction network training apparatus 1100 further includes a renderer pre-training module 1110 for determining an application scene and a target style of the three-dimensional face reconstruction network to be trained; acquiring three-dimensional face standard parameters and virtual face sample images of a target style suitable for an application scene; and training the renderer simulation network to be trained by taking the three-dimensional face standard parameters as input and the virtual face sample image as output to obtain a pre-trained renderer simulation network for rendering the virtual face image of the target style in the application scene.
In one embodiment, the face consistency network comprises a face five sense organ segmentation network; the loss measurement module 1106 is further configured to determine, through a pre-trained facial feature segmentation network, a probability value of each picture region in the virtual face image belonging to each preset segmentation result, so as to obtain a first probability map corresponding to each segmentation result; determine probability values of each picture area belonging to each preset segmentation result in the two-dimensional face sample picture through a pre-trained face five-sense organ segmentation network, and obtain a second probability map corresponding to each segmentation result; respectively calculate the similar distance between the first probability map and the second probability map corresponding to each segmentation result; and determine the characteristic loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distances corresponding to all the segmentation results.
In one embodiment, the facial feature segmentation network includes a plurality of convolution layers connected in series; the loss measurement module 1106 is further configured to obtain a first intermediate feature of the virtual face image output by the convolution layer in the target sequence, and fuse the first intermediate feature with a first probability map corresponding to each segmentation result, so as to obtain a first fusion feature corresponding to each segmentation result; acquiring second intermediate features of two-dimensional face sample pictures output by a convolution layer in a target sequence, and respectively fusing the second intermediate features with second probability maps corresponding to each segmentation result to obtain second fusion features corresponding to each segmentation result; and respectively calculating the similar distance between the first fusion feature and the second fusion feature corresponding to each segmentation result.
In one embodiment, the face consistency network comprises a face recognition network; the loss measurement module 1106 is further configured to extract, through a face recognition network, a first face feature from the virtual face image; extracting second face features from the two-dimensional face sample picture through a face recognition network; and determining a feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distance between the first face feature and the second face feature.
In one embodiment, the face consistency network comprises a face key point detection network; the loss measurement module 1106 is further configured to extract a first key point feature from the virtual face image through the face key point detection network; extracting second key point features from the two-dimensional face sample picture through a face key point detection network; and determining the feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distance between the first key point feature and the second key point feature.
In one embodiment, the face consistency network comprises a face five sense organ segmentation network, a face recognition network and a face key point detection network; the loss measurement module 1106 is further configured to calculate, through a facial feature segmentation network, a facial feature loss value of the virtual facial image relative to the two-dimensional facial sample picture; calculating face feature loss values of the virtual face image relative to the two-dimensional face sample picture through a face recognition network; calculating a key point characteristic loss value of the virtual face image relative to the two-dimensional face sample picture through a face key point detection network; and determining the comprehensive characteristic loss value of the virtual face image relative to the two-dimensional face sample picture according to the five-sense organ characteristic loss value, the face characteristic loss value and the key point characteristic loss value.
In one embodiment, the loss measurement module 1106 is further configured to perform standardization processing on the five sense organ feature loss value, the face feature loss value and the key point feature loss value; and fuse the standardized five sense organ feature loss value, face feature loss value and key point feature loss value to obtain the comprehensive feature loss value of the virtual face image relative to the two-dimensional face sample picture.
In one embodiment, the three-dimensional face reconstruction network training apparatus 1100 further includes a virtual face image generation module 1112 configured to obtain a two-dimensional face target picture; extracting three-dimensional face target parameters from a two-dimensional face target picture based on a three-dimensional face reconstruction network after training; and inputting the three-dimensional face target parameters into a renderer simulation network to obtain the virtual face image.
In one embodiment, the virtual face image generation module 1112 is further configured to determine an application scene and a target style configured for the two-dimensional face target picture; acquiring a three-dimensional face reconstruction network and a pre-trained renderer simulation network which are applicable to application scenes and target styles after training is finished; and extracting three-dimensional face target parameters from the two-dimensional face picture based on the obtained three-dimensional face reconstruction network.
According to the three-dimensional face reconstruction network training device, the renderer simulation network is trained in advance, so that when the three-dimensional face parameters of the target object, extracted from the two-dimensional face sample picture by the three-dimensional face reconstruction network to be trained, are input into the renderer simulation network, they can be mapped into a virtual face image quickly and accurately; the virtual face image and the two-dimensional face sample picture are respectively input into the face consistency network, and the parameters of the three-dimensional face reconstruction network can be gradually adjusted according to the obtained feature loss value of the virtual face image relative to the two-dimensional face sample picture. During parameter adjustment, the virtual face image obtained by the renderer simulation network mapping, guided by the three-dimensional face parameters output by the three-dimensional face reconstruction network, becomes closer and closer to the two-dimensional face sample picture. Since the extraction of three-dimensional face parameters by the three-dimensional face reconstruction network is constrained and guided by the face consistency network, training of the three-dimensional face reconstruction network can be completed without labeling the two-dimensional face sample pictures, achieving end-to-end unsupervised training, and two-dimensional face pictures can be converted into virtual face images quickly and in real time by directly using the trained three-dimensional face reconstruction network and the pre-trained renderer simulation network.
For specific limitation of the three-dimensional face reconstruction network training device, reference may be made to the limitation of the three-dimensional face reconstruction network training method hereinabove, and no further description is given here. All or part of the modules in the three-dimensional face reconstruction network training device can be realized by software, hardware and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, as shown in fig. 13, there is provided a virtual face image generating apparatus 1300, including: a two-dimensional picture acquisition module 1302, a three-dimensional parameter extraction module 1304, and an avatar rendering module 1306, wherein:
the two-dimensional image obtaining module 1302 is configured to obtain a two-dimensional face target image.
The three-dimensional parameter extraction module 1304 is configured to extract three-dimensional face target parameters of a target object in the two-dimensional face target picture based on the three-dimensional face reconstruction network.
The virtual image rendering module 1306 is configured to perform rendering processing on the three-dimensional face target parameters based on the renderer simulation network, so as to obtain a virtual face image of the target object; the three-dimensional face reconstruction network is obtained by performing unsupervised training by means of a pre-trained renderer simulation network and a face consistency network; the face consistency network is used for calculating a characteristic loss value of output data of the renderer simulation network relative to input data of the three-dimensional face reconstruction network; the characteristic loss value is used for guiding the adjustment of parameters of the three-dimensional face reconstruction network.
In one embodiment, as shown in fig. 14, the virtual face image generating apparatus 1300 further includes a scene style determining module 1308 configured to determine an application scene and a target style configured for the two-dimensional face target picture; acquiring a three-dimensional face reconstruction network and a renderer simulation network which are applicable to application scenes and target styles; the three-dimensional parameter extraction module 1304 is further configured to extract three-dimensional face target parameters of a target object in the two-dimensional face target picture based on the obtained three-dimensional face reconstruction network; the avatar rendering module 1306 is further configured to perform rendering processing on the three-dimensional face target parameters based on the acquired renderer-simulated network, so as to obtain an avatar of the target object.
According to the virtual face image generating device, the network parameters are adjusted directly in the network training stage; after training is completed and the network parameters are fixed, three-dimensional face parameter extraction is carried out directly by the trained three-dimensional face reconstruction network without any iteration, which greatly improves the efficiency of three-dimensional face parameter extraction and makes the device suitable for scenes with high real-time requirements. In addition, since the three-dimensional face reconstruction network is trained on a large number of two-dimensional face pictures, three-dimensional face standard parameters and virtual face sample objects, the statistical rule by which two-dimensional face pictures are mapped into virtual face images is learned, a globally optimal solution can be found and the generalization capability is strong; therefore, different two-dimensional face pictures of the same target object under different viewing angles and illumination conditions can be reliably converted into stable virtual face images.
The specific limitation of the virtual face generating apparatus may be referred to the limitation of the virtual face generating method hereinabove, and will not be described herein. The modules in the virtual face generating apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 15. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer equipment is used for storing data such as a three-dimensional face skeleton model, a two-dimensional face picture, a virtual face image and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program when executed by the processor realizes a three-dimensional face reconstruction network training method and a virtual face image generation method.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 16. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication may be implemented through WIFI, an operator network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements the three-dimensional face reconstruction network training method and the virtual face image generation method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, or may be a key, a trackball, or a touch pad arranged on the housing of the computer device, or may be an external keyboard, touch pad, mouse, or the like.
It will be appreciated by those skilled in the art that the structures shown in fig. 15 and 16 are merely block diagrams of portions of structures related to the present application and do not constitute a limitation of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware, the computer program being stored on a non-transitory computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. The volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to fall within the scope of this description.
The foregoing embodiments represent only a few implementations of the present application, and their descriptions are specific and detailed but should not therefore be construed as limiting the scope of the invention. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (26)

1. A three-dimensional face reconstruction network training method, the method comprising:
acquiring a two-dimensional face sample picture;
extracting three-dimensional face sample parameters from the two-dimensional face sample pictures through a three-dimensional face reconstruction network to be trained;
inputting the three-dimensional face sample parameters into a renderer simulation network for mapping processing to obtain a virtual face image;
respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network for segmentation to obtain a first intermediate feature and a first probability map corresponding to each segmentation result, and a second intermediate feature and a second probability map corresponding to each segmentation result; fusing the first intermediate features with the first probability maps corresponding to each segmentation result respectively to obtain first fusion features corresponding to each segmentation result; fusing the second intermediate features with the second probability maps corresponding to each segmentation result respectively to obtain second fusion features corresponding to each segmentation result; respectively calculating the similar distance between the first fusion feature and the second fusion feature corresponding to each segmentation result; and determining a characteristic loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distances corresponding to all the segmentation results;
and adjusting parameters of the three-dimensional face reconstruction network based on the characteristic loss value, and continuing training until training is finished when the training finishing condition is met, so as to obtain the trained three-dimensional face reconstruction network.
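Purely as a non-limiting sketch of the feature-loss computation recited in claim 1 above: the code below assumes element-wise weighting as the fusion operation and an L1 distance as the similar distance; these choices, the function name, and the tensor shapes are illustrative assumptions, not a definition of the claimed method.

```python
import torch
import torch.nn.functional as F

def segmentation_feature_loss(feat_virtual, probs_virtual, feat_sample, probs_sample):
    """Sketch of the claim-1 feature loss.

    feat_*:  (B, C, H, W) intermediate features from the face consistency network.
    probs_*: (B, K, H, W) probability maps, one channel per segmentation result.
    """
    # Resize the probability maps to the feature resolution if they differ.
    size = feat_virtual.shape[-2:]
    probs_virtual = F.interpolate(probs_virtual, size=size, mode="bilinear", align_corners=False)
    probs_sample = F.interpolate(probs_sample, size=size, mode="bilinear", align_corners=False)

    loss = feat_virtual.new_zeros(())
    num_results = probs_virtual.shape[1]
    for k in range(num_results):
        # Fuse the intermediate features with the k-th probability map (element-wise weighting).
        fused_virtual = feat_virtual * probs_virtual[:, k:k + 1]  # first fusion feature
        fused_sample = feat_sample * probs_sample[:, k:k + 1]     # second fusion feature
        # Similar distance between the two fusion features (L1 assumed here).
        loss = loss + F.l1_loss(fused_virtual, fused_sample)
    return loss / num_results  # feature loss over all segmentation results
```

The returned value would then drive the parameter adjustment of the three-dimensional face reconstruction network described in the final step of the claim.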
2. The method of claim 1, wherein the step of the renderer simulating network pre-training comprises:
determining an application scene and a target style of the three-dimensional face reconstruction network to be trained;
acquiring three-dimensional face standard parameters and virtual face sample images of the target styles applicable to the application scene;
and training the renderer simulation network to be trained by taking the three-dimensional face standard parameters as input and the virtual face sample image as output to obtain a pre-trained renderer simulation network for rendering the virtual face image of the target style in the application scene.
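A minimal sketch of the renderer-simulation-network pre-training recited in claim 2, assuming a simple supervised regression setup; the optimizer, the pixel-wise L1 loss, the epoch count, and the data layout are assumptions for illustration, and `renderer` stands for any differentiable network mapping parameter vectors to images.

```python
import torch
import torch.nn as nn

def pretrain_renderer(renderer: nn.Module, std_params, sample_imgs, epochs=10, lr=1e-4):
    """Pre-train a renderer simulation network for one application scene and target style.

    std_params:  (N, P) tensor of three-dimensional face standard parameters (input).
    sample_imgs: (N, 3, H, W) tensor of virtual face sample images of the target style (output).
    """
    optimizer = torch.optim.Adam(renderer.parameters(), lr=lr)
    criterion = nn.L1Loss()  # pixel-wise reconstruction loss, an illustrative choice
    renderer.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        rendered = renderer(std_params)          # simulate the renderer output
        loss = criterion(rendered, sample_imgs)  # match the virtual face sample images
        loss.backward()
        optimizer.step()
    return renderer
```

Training one such network per pair of application scene and target style mirrors the claim's requirement that the pre-trained renderer simulation network renders the virtual face image of the target style in the application scene.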
3. The method of claim 1, wherein the face consistency network comprises a face five sense organ segmentation network; the face five sense organ segmentation network comprises a plurality of convolution layers connected in series; and the respectively inputting the virtual face image and the two-dimensional face sample picture into the face consistency network for segmentation to obtain a first intermediate feature and a first probability map corresponding to each segmentation result, and a second intermediate feature and a second probability map corresponding to each segmentation result comprises:
determining probability values of each picture area belonging to each preset segmentation result in the virtual face image through a pre-trained facial feature segmentation network, and obtaining a first probability map corresponding to each segmentation result; acquiring a first intermediate feature of the virtual face image output by a convolution layer of a target sequence;
determining probability values of each picture area belonging to each preset segmentation result in the two-dimensional face sample picture through a pre-trained facial feature segmentation network, and obtaining a second probability map corresponding to each segmentation result; and acquiring a second intermediate characteristic of the two-dimensional face sample picture output by the convolution layer of the target sequence.
4. The method according to claim 1, wherein the method further comprises:
acquiring a portrait picture;
identifying the portrait picture to obtain a head area and a body area in the portrait picture;
and extracting the picture of the head area from the portrait picture to serve as a two-dimensional face sample picture.
5. The method of claim 1, wherein the face consistency network comprises a face recognition network; the method further comprises the steps of:
extracting first face features from the virtual face image through the face recognition network;
extracting second face features from the two-dimensional face sample picture through the face recognition network;
and determining a feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distance between the first face feature and the second face feature.
6. The method of claim 1, wherein the face consistency network comprises a face keypoint detection network; the method further comprises the steps of:
extracting first key point features from the virtual face image through a face key point detection network;
extracting second key point features from the two-dimensional face sample picture through a face key point detection network;
and determining a feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distance between the first key point feature and the second key point feature.
7. The method of claim 1, wherein the face consistency network comprises a face five sense organ segmentation network, a face recognition network, and a face key point detection network; the method further comprises the steps of:
calculating a facial feature loss value of the virtual face image relative to the two-dimensional face sample picture through the face five sense organ segmentation network;
calculating a face feature loss value of the virtual face image relative to the two-dimensional face sample picture through the face recognition network;
calculating a key point characteristic loss value of the virtual face image relative to the two-dimensional face sample picture through the face key point detection network;
determining a comprehensive characteristic loss value of the virtual face image relative to the two-dimensional face sample picture according to the facial feature loss value, the face feature loss value and the key point characteristic loss value;
the step of adjusting parameters of the three-dimensional face reconstruction network based on the characteristic loss value and continuing training comprises the following steps:
and adjusting parameters of the three-dimensional face reconstruction network based on the comprehensive characteristic loss values and continuing training.
8. The method of claim 7, wherein determining the comprehensive feature loss value of the virtual face image relative to the two-dimensional face sample picture based on the facial feature loss value, the face feature loss value, and the key point feature loss value comprises:
carrying out standardized processing on the facial feature loss value, the face feature loss value and the key point feature loss value;
and fusing the standardized facial feature loss value, the face feature loss value and the key point feature loss value to obtain the comprehensive feature loss value of the virtual face image relative to the two-dimensional face sample picture.
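As an illustrative sketch of the standardization and fusion step in claim 8: the per-loss scale statistics and fusion weights below are hypothetical hyper-parameters, not values specified by the claims, and each input is assumed to be a scalar tensor.

```python
import torch

def comprehensive_loss(facial_feature_loss, face_feature_loss, keypoint_feature_loss,
                       scales=(1.0, 1.0, 1.0), weights=(1.0, 1.0, 1.0)):
    """Standardize the three loss values to a comparable scale, then fuse them."""
    losses = torch.stack([facial_feature_loss, face_feature_loss, keypoint_feature_loss])
    normalized = losses / losses.new_tensor(scales)             # standardized processing
    return (normalized * normalized.new_tensor(weights)).sum()  # comprehensive loss value
```

The `scales` could, for example, be running averages of each loss so that the three terms contribute on a similar order of magnitude before fusion; that choice is an assumption, not part of the claim.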
9. The method according to claim 1, wherein the method further comprises:
acquiring a two-dimensional face target picture;
extracting three-dimensional face target parameters from the two-dimensional face target picture based on the three-dimensional face reconstruction network after training;
and inputting the three-dimensional face target parameters into the renderer simulation network to obtain the virtual face image.
10. The method according to claim 9, wherein the extracting three-dimensional face target parameters from the two-dimensional face target picture based on the three-dimensional face reconstruction network after training comprises:
determining an application scene and a target style configured for the two-dimensional face target picture;
acquiring the trained three-dimensional face reconstruction network and the pre-trained renderer simulation network applicable to the application scene and the target style;
and extracting three-dimensional face target parameters from the two-dimensional face target picture based on the obtained three-dimensional face reconstruction network.
11. A virtual face image generation method, the method comprising:
acquiring a two-dimensional face target picture;
extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on a three-dimensional face reconstruction network;
rendering the three-dimensional face target parameters based on a renderer simulation network to obtain a virtual face image of the target object;
the three-dimensional face reconstruction network is obtained by performing unsupervised training by means of the pre-trained renderer simulation network and a face consistency network; the face consistency network is used for calculating a characteristic loss value of output data of the renderer simulation network relative to input data of the three-dimensional face reconstruction network; the characteristic loss value is used for guiding the adjustment of parameters of the three-dimensional face reconstruction network; the input data of the three-dimensional face reconstruction network is a two-dimensional face sample image; the output data of the renderer simulation network is a virtual face image, the virtual face image is obtained by extracting three-dimensional face sample parameters from the two-dimensional face sample pictures through a three-dimensional face reconstruction network to be trained, and inputting the three-dimensional face sample parameters into the renderer simulation network for mapping; the feature loss value is obtained by respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network for segmentation, so as to obtain a first intermediate feature and a first probability map corresponding to each segmentation result, and a second intermediate feature and a second probability map corresponding to each segmentation result; fusing the first intermediate features with the first probability graphs corresponding to each segmentation result respectively to obtain first fusion features corresponding to each segmentation result; fusing the second intermediate features with the second probability graphs corresponding to each segmentation result respectively to obtain second fusion features corresponding to each segmentation result; respectively calculating the similar distance between the first fusion feature and the second fusion feature corresponding to each segmentation result; and determining according to the similar distances corresponding to all the segmentation results.
12. The method of claim 11, wherein the method further comprises:
determining an application scene and a target style configured for the two-dimensional face target picture;
acquiring a three-dimensional face reconstruction network and a renderer simulation network applicable to the application scene and the target style;
and the extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on a three-dimensional face reconstruction network comprises: extracting the three-dimensional face target parameters of the target object in the two-dimensional face target picture based on the obtained three-dimensional face reconstruction network.
13. A three-dimensional face reconstruction network training device, the device comprising:
the sample acquisition module is used for acquiring a two-dimensional face sample picture;
the feature extraction module is used for extracting three-dimensional face sample parameters from the two-dimensional face sample pictures through a three-dimensional face reconstruction network to be trained;
the loss measurement module is used for inputting three-dimensional face sample parameters into the renderer simulation network to perform mapping processing to obtain a virtual face image; respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network for feature segmentation to obtain a first intermediate feature and a first probability map corresponding to each segmentation result, and a second intermediate feature and a second probability map corresponding to each segmentation result; fusing the first intermediate features with the first probability maps corresponding to each segmentation result respectively to obtain first fusion features corresponding to each segmentation result; fusing the second intermediate features with the second probability maps corresponding to each segmentation result respectively to obtain second fusion features corresponding to each segmentation result; respectively calculating the similar distance between the first fusion feature and the second fusion feature corresponding to each segmentation result; and determining a characteristic loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distances corresponding to all the segmentation results;
and the network training module is used for adjusting parameters of the three-dimensional face reconstruction network based on the characteristic loss value and continuing training until the training ending condition is met.
14. The apparatus of claim 13, further comprising a renderer pre-training module for:
determining an application scene and a target style of the three-dimensional face reconstruction network to be trained;
acquiring three-dimensional face standard parameters and virtual face sample images of the target styles applicable to the application scene;
and training the renderer simulation network to be trained by taking the three-dimensional face standard parameters as input and the virtual face sample image as output to obtain a pre-trained renderer simulation network for rendering the virtual face image of the target style in the application scene.
15. The apparatus of claim 13, wherein the face consistency network comprises a face five sense organ segmentation network; the facial five-sense organ segmentation network comprises a plurality of convolution layers connected in series; the loss measurement module is further configured to:
determining probability values of each picture area belonging to each preset segmentation result in the virtual face image through a pre-trained facial feature segmentation network, and obtaining a first probability map corresponding to each segmentation result; acquiring a first intermediate feature of the virtual face image output by a convolution layer of a target sequence;
determining probability values of each picture area belonging to each preset segmentation result in the two-dimensional face sample picture through a pre-trained facial feature segmentation network, and obtaining a second probability map corresponding to each segmentation result; and acquiring a second intermediate characteristic of the two-dimensional face sample picture output by the convolution layer of the target sequence.
16. The apparatus of claim 13, wherein the sample acquisition module is further configured to:
acquiring a portrait picture;
identifying the portrait picture to obtain a head area and a body area in the portrait picture;
and extracting the picture of the head area from the portrait picture to serve as a two-dimensional face sample picture.
17. The apparatus of claim 13, wherein the face consistency network comprises a face recognition network; the loss measurement module is further configured to:
extracting first face features from the virtual face image through the face recognition network;
extracting second face features from the two-dimensional face sample picture through the face recognition network;
and determining a feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distance between the first face feature and the second face feature.
18. The apparatus of claim 13, wherein the face consistency network comprises a face keypoint detection network; the loss measurement module is further configured to:
extracting first key point features from the virtual face image through a face key point detection network;
extracting second key point features from the two-dimensional face sample picture through a face key point detection network;
and determining a feature loss value of the virtual face image relative to the two-dimensional face sample picture according to the similar distance between the first key point feature and the second key point feature.
19. The apparatus of claim 13, wherein the face consistency network comprises a face five sense organ segmentation network, a face recognition network, and a face key point detection network; the loss measurement module is further configured to:
calculating a facial feature loss value of the virtual face image relative to the two-dimensional face sample picture through the face five sense organ segmentation network;
calculating a face feature loss value of the virtual face image relative to the two-dimensional face sample picture through the face recognition network;
calculating a key point characteristic loss value of the virtual face image relative to the two-dimensional face sample picture through the face key point detection network;
determining a comprehensive characteristic loss value of the virtual face image relative to the two-dimensional face sample picture according to the facial feature loss value, the face feature loss value and the key point characteristic loss value;
the network training module is further configured to:
and adjusting parameters of the three-dimensional face reconstruction network based on the comprehensive characteristic loss values and continuing training.
20. The apparatus of claim 19, wherein the loss measurement module is further configured to:
carrying out standardized processing on the facial feature loss value, the face feature loss value and the key point feature loss value;
and fusing the standardized facial feature loss value, the face feature loss value and the key point feature loss value to obtain the comprehensive feature loss value of the virtual face image relative to the two-dimensional face sample picture.
21. The apparatus of claim 13, further comprising a virtual face image generation module configured to:
acquiring a two-dimensional face target picture;
extracting three-dimensional face target parameters from the two-dimensional face target picture based on the three-dimensional face reconstruction network after training;
and inputting the three-dimensional face target parameters into the renderer simulation network to obtain the virtual face image.
22. The apparatus of claim 21, wherein the virtual face image generation module is further configured to:
determining an application scene and a target style configured for the two-dimensional face target picture;
acquiring the trained three-dimensional face reconstruction network and the pre-trained renderer simulation network applicable to the application scene and the target style;
and extracting three-dimensional face target parameters from the two-dimensional face target picture based on the obtained three-dimensional face reconstruction network.
23. A virtual face image generation apparatus, the apparatus comprising:
the two-dimensional image acquisition module is used for acquiring a two-dimensional face target image;
the three-dimensional parameter extraction module is used for extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on a three-dimensional face reconstruction network;
the virtual image rendering module is used for rendering the three-dimensional face target parameters based on a renderer simulation network to obtain a virtual face image of the target object; the three-dimensional face reconstruction network is obtained by performing unsupervised training by means of the pre-trained renderer simulation network and a face consistency network; the face consistency network is used for calculating a characteristic loss value of output data of the renderer simulation network relative to input data of the three-dimensional face reconstruction network; the characteristic loss value is used for guiding the adjustment of parameters of the three-dimensional face reconstruction network; the input data of the three-dimensional face reconstruction network is a two-dimensional face sample image; the output data of the renderer simulation network is a virtual face image, the virtual face image is obtained by extracting three-dimensional face sample parameters from the two-dimensional face sample pictures through a three-dimensional face reconstruction network to be trained, and inputting the three-dimensional face sample parameters into the renderer simulation network for mapping; the feature loss value is obtained by respectively inputting the virtual face image and the two-dimensional face sample picture into a face consistency network for segmentation, so as to obtain a first intermediate feature and a first probability map corresponding to each segmentation result, and a second intermediate feature and a second probability map corresponding to each segmentation result; fusing the first intermediate features with the first probability graphs corresponding to each segmentation result respectively to obtain first fusion features corresponding to each segmentation result; fusing the second intermediate features with the second probability graphs corresponding to each segmentation result respectively to obtain second fusion features corresponding to each segmentation result; respectively calculating the similar distance between the first fusion feature and the second fusion feature corresponding to each segmentation result; and determining according to the similar distances corresponding to all the segmentation results.
24. The apparatus of claim 23, further comprising a scene style determination module configured to:
determining an application scene and a target style configured for the two-dimensional face target picture;
acquiring a three-dimensional face reconstruction network and a renderer simulation network applicable to the application scene and the target style;
the three-dimensional parameter extraction module is further used for: and extracting three-dimensional face target parameters of a target object in the two-dimensional face target picture based on the obtained three-dimensional face reconstruction network.
25. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 12.
26. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 12 when the computer program is executed.
CN202010165219.2A 2020-03-11 2020-03-11 Three-dimensional face reconstruction network training and virtual face image generation method and device Active CN111354079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010165219.2A CN111354079B (en) 2020-03-11 2020-03-11 Three-dimensional face reconstruction network training and virtual face image generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010165219.2A CN111354079B (en) 2020-03-11 2020-03-11 Three-dimensional face reconstruction network training and virtual face image generation method and device

Publications (2)

Publication Number Publication Date
CN111354079A CN111354079A (en) 2020-06-30
CN111354079B true CN111354079B (en) 2023-05-02

Family

ID=71197336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010165219.2A Active CN111354079B (en) 2020-03-11 2020-03-11 Three-dimensional face reconstruction network training and virtual face image generation method and device

Country Status (1)

Country Link
CN (1) CN111354079B (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037320B (en) * 2020-09-01 2023-10-20 腾讯科技(深圳)有限公司 Image processing method, device, equipment and computer readable storage medium
CN111815768B (en) * 2020-09-14 2020-12-18 腾讯科技(深圳)有限公司 Three-dimensional face reconstruction method and device
CN112102461B (en) * 2020-11-03 2021-04-09 北京智源人工智能研究院 A face rendering method, device, electronic device and storage medium
CN112541963B (en) * 2020-11-09 2023-12-26 北京百度网讯科技有限公司 Three-dimensional avatar generation method, three-dimensional avatar generation device, electronic equipment and storage medium
CN112488178B (en) * 2020-11-26 2022-07-26 推想医疗科技股份有限公司 Network model training method and device, image processing method and device, and equipment
CN112668408B (en) * 2020-12-11 2024-12-20 北京大米科技有限公司 Method, device, storage medium and electronic device for generating key points of human face
CN112785670B (en) * 2021-02-01 2024-05-28 北京字节跳动网络技术有限公司 Image synthesis method, device, equipment and storage medium
CN113569614A (en) * 2021-02-23 2021-10-29 腾讯科技(深圳)有限公司 Virtual image generation method, device, device and storage medium
CN113112580B (en) * 2021-04-20 2022-03-25 北京字跳网络技术有限公司 Method, device, equipment and medium for generating virtual image
CN112990378B (en) * 2021-05-08 2021-08-13 腾讯科技(深圳)有限公司 Scene recognition method and device based on artificial intelligence and electronic equipment
CN113255511B (en) * 2021-05-21 2024-12-06 北京百度网讯科技有限公司 Method, device, equipment and storage medium for living body recognition
CN113298934B (en) * 2021-05-26 2022-07-08 重庆邮电大学 A method and system for three-dimensional reconstruction of monocular vision images based on bidirectional matching
CN117413299A (en) 2021-05-27 2024-01-16 华为技术有限公司 Neural Radiation Field Bone Rigging for Human 3D Shape and Appearance Modeling
CN113409437B (en) * 2021-06-23 2023-08-08 北京字节跳动网络技术有限公司 Virtual character face pinching method and device, electronic equipment and storage medium
CN113643412B (en) 2021-07-14 2022-07-22 北京百度网讯科技有限公司 Virtual image generation method, device, electronic device and storage medium
CN113269170A (en) * 2021-07-20 2021-08-17 北京拍立拼科技有限公司 Intelligent portrait building block matching method and system based on feature similarity measurement
CN113591969B (en) * 2021-07-28 2022-04-22 北京百度网讯科技有限公司 Facial similarity evaluation method, device, equipment and storage medium
CN113610989B (en) * 2021-08-04 2022-12-27 北京百度网讯科技有限公司 Method and device for training style migration model and method and device for style migration
CN113705368B (en) * 2021-08-09 2025-02-07 上海幻电信息科技有限公司 Facial expression transfer method, device and computer equipment
CN113808277B (en) * 2021-11-05 2023-07-18 腾讯科技(深圳)有限公司 Image processing method and related device
CN113870102B (en) * 2021-12-06 2022-03-08 深圳市大头兄弟科技有限公司 Animation method, device, equipment and storage medium of image
CN113989443A (en) * 2021-12-07 2022-01-28 南方电网电力科技股份有限公司 Virtual face image reconstruction method and related device
CN114241558B (en) * 2021-12-15 2024-06-28 平安科技(深圳)有限公司 Model training method, video generating method and device, equipment and medium
CN114519773B (en) * 2022-01-27 2025-06-27 北京致远航程软件科技有限公司 Method, device, storage medium and tutoring machine for generating three-dimensional virtual characters
CN114549293A (en) * 2022-02-24 2022-05-27 巨人移动技术有限公司 Face pinching processing method
CN114998554B (en) * 2022-05-05 2024-08-20 清华大学 Three-dimensional cartoon face modeling method and device
CN115359219B (en) * 2022-08-16 2024-04-19 支付宝(杭州)信息技术有限公司 Virtual world virtual image processing method and device
CN115311403B (en) * 2022-08-26 2023-08-08 北京百度网讯科技有限公司 Training method of deep learning network, virtual image generation method and device
CN115719436A (en) * 2022-10-17 2023-02-28 北京百度网讯科技有限公司 Model training method, target detection method, device, equipment and storage medium
CN115809696B (en) * 2022-12-01 2024-04-02 支付宝(杭州)信息技术有限公司 Virtual image model training method and device
CN115775024B (en) * 2022-12-09 2024-04-16 支付宝(杭州)信息技术有限公司 Virtual image model training method and device
CN116385827A (en) * 2023-03-27 2023-07-04 中国科学技术大学 Parametric face reconstruction model training method and key point label data generation method
CN117115331B (en) * 2023-10-25 2024-02-09 苏州元脑智能科技有限公司 Virtual image synthesizing method, synthesizing device, equipment and medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510437A (en) * 2018-04-04 2018-09-07 科大讯飞股份有限公司 A kind of virtual image generation method, device, equipment and readable storage medium storing program for executing
CN108573527A (en) * 2018-04-18 2018-09-25 腾讯科技(深圳)有限公司 A kind of expression picture generation method and its equipment, storage medium
CN109948441A (en) * 2019-02-14 2019-06-28 北京奇艺世纪科技有限公司 Model training, image processing method, device, electronic equipment and computer readable storage medium
CN110163953A (en) * 2019-03-11 2019-08-23 腾讯科技(深圳)有限公司 Three-dimensional facial reconstruction method, device, storage medium and electronic device
CN109978930A (en) * 2019-03-27 2019-07-05 杭州相芯科技有限公司 A kind of stylized human face three-dimensional model automatic generation method based on single image
CN110111418A (en) * 2019-05-15 2019-08-09 北京市商汤科技开发有限公司 Create the method, apparatus and electronic equipment of facial model
CN110443222A (en) * 2019-08-14 2019-11-12 北京百度网讯科技有限公司 Method and apparatus for training face's critical point detection model
CN110490164A (en) * 2019-08-26 2019-11-22 北京达佳互联信息技术有限公司 Generate the method, apparatus, equipment and medium of virtual expression
CN110705625A (en) * 2019-09-26 2020-01-17 北京奇艺世纪科技有限公司 Image processing method and device, electronic equipment and storage medium
CN110717977A (en) * 2019-10-23 2020-01-21 网易(杭州)网络有限公司 Method and device for processing face of game character, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111354079A (en) 2020-06-30

Similar Documents

Publication Publication Date Title
CN111354079B (en) Three-dimensional face reconstruction network training and virtual face image generation method and device
CN112085835B (en) Three-dimensional cartoon face generation method and device, electronic equipment and storage medium
CN111028330B (en) Three-dimensional expression base generation method, device, equipment and storage medium
CN114973349B (en) Facial image processing method and facial image processing model training method
WO2022095721A1 (en) Parameter estimation model training method and apparatus, and device and storage medium
CN112037320B (en) Image processing method, device, equipment and computer readable storage medium
CN113570684B (en) Image processing method, device, computer equipment and storage medium
CN108550176A (en) Image processing method, equipment and storage medium
CN107610209A (en) Human face countenance synthesis method, device, storage medium and computer equipment
CN112002009B (en) Unsupervised three-dimensional face reconstruction method based on generation of confrontation network
EP3475920A1 (en) Systems and methods for generating computer ready animation models of a human head from captured data images
KR20230085931A (en) Method and system for extracting color from face images
CN115115805B (en) Training method, device, equipment and storage medium for three-dimensional reconstruction model
CN113822965B (en) Image rendering processing method, device and equipment and computer storage medium
CN110458924B (en) Three-dimensional face model establishing method and device and electronic equipment
CN116977522A (en) Rendering method and device of three-dimensional model, computer equipment and storage medium
CN111754622B (en) Face three-dimensional image generation method and related equipment
CN115222917A (en) Training method, device and equipment for three-dimensional reconstruction model and storage medium
WO2024059374A1 (en) User authentication based on three-dimensional face modeling using partial face images
CN114494611A (en) Intelligent three-dimensional reconstruction method, device, equipment and medium based on nerve basis function
Basak et al. Learning 3D head pose from synthetic data: A semi-supervised approach
CN117237547B (en) Image reconstruction method, reconstruction model processing method and device
Wang et al. Digital twin: Acquiring high-fidelity 3D avatar from a single image
CN114820907B (en) Methods, apparatus, computer equipment and storage media for cartoonizing facial images
Marques et al. Deep spherical harmonics light probe estimator for mixed reality games

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40024873; Country of ref document: HK)
GR01 Patent grant