Disclosure of Invention
The invention aims to provide a face aging simulation method that addresses a technical problem of prior face aging methods: schemes that decouple identity information from age information weaken the person's identity features because the decoupling is difficult, so identity consistency is poorly preserved and the aging result gradually turns one person into another, a phenomenon that becomes more pronounced as the span between the input face's age and the target age bracket grows.
The invention provides a face aging simulation method, which comprises the following steps:
S1, performing image preprocessing on the adopted face age dataset with a semantic segmentation model, removing the background, and keeping only the clean face region to obtain the face image to undergo aging simulation;
S2, designing the original-target age difference code needed when aging from the original age group to the target age group, and the target-original age difference code needed when moving from the target age group back to the original age group;
S3, constructing a generative adversarial network comprising a generator, a discriminator, an age difference encoder, and a mapping network, and feeding the original-target age difference code, the target-original age difference code (from the target age group back to the original age group), and the face image to be aged into the generator and the discriminator to construct the age difference information;
S4, computing the loss functions of the generator and the discriminator, the age loss function, the identity consistency loss function, and the loss function of the generative adversarial network, wherein the discriminator's loss function includes the adversarial loss of the network; the network parameters are updated by back-propagation to complete the training of the generative adversarial network;
S5, taking the face image to be aged as the input of the generator to obtain the aged face image.
Optionally, in step S1, the specific steps include:
S11, collecting a face aging data set, wherein the face aging data set comprises face images of different age groups and age labels corresponding to the face images;
S12, adopting a pre-trained semantic segmentation network capable of distinguishing the semantic information in an input image; each picture in the dataset is fed to the network, which outputs the semantic map of that picture;
S13, using the semantic map corresponding to each face picture as a reference, keeping only the semantic regions related to the face, facial features, hair, and neck, and randomly rotating the processed face pictures, thereby completing the construction of the dataset.
Optionally, in step S11, the different age groups include 0-2 years old, 3-6 years old, 7-9 years old, 15-19 years old, 30-39 years old, and 50-69 years old.
Optionally, in step S2, assuming there are n age groups, the age difference code I has 50×2n bits in total, where each run of 50 bits represents the age difference information for converting from one age group to an adjacent age group;
first, the age difference code I is added to a noise vector of the same length drawn from a Gaussian distribution; second, taking bit 50×n as the reference bit, the age difference code from the original age group to the target age group j is constructed by adding 1 to bits 50×n through 50×(n+j)-1, with the remaining bits unchanged;
the age difference code from the target age group j back to the original age group is constructed by adding 1 to bits 50×(n-j) through 50×n-1, with the remaining bits unchanged.
Optionally, in step S3, the specific steps include:
S31, constructing the encoder structure in the generator: the encoder first applies a 7×7 convolution layer with stride 1, followed by a ReLU activation function and a pixel normalization layer; next come 3×3 convolution layers, each with stride 2 and each followed by a ReLU activation function and a pixel normalization layer; four residual blocks with stride 1 then follow, where each of the first three residual blocks is followed by a ReLU activation function and a pixel normalization layer while the last residual block omits the pixel normalization layer, thereby completing the construction of the encoder;
S32, constructing the decoder structure in the generator: the body of the decoder is built from age difference injection modules, each of which is a residual structure formed by a convolution layer and a StyleGAN style-convolution layer; the decoder contains six age difference injection modules in total plus a convolution layer with a 1×1 kernel; an upsampling layer follows the 5th and 6th age difference injection modules to restore the feature map to the size of the input image; the last layer, the 1×1 convolution layer, reduces the number of feature-map channels to 3 and is followed by a Tanh activation function, thereby completing the construction of the decoder;
S33, constructing the age difference encoder, a convolutional neural network: its first layer is a convolution layer with 7×7 kernels and stride 1, followed by 3×3 convolution layers with stride 2 and finally a convolution layer with 1×1 kernels; an LReLU activation function follows each of the first five convolution layers, and a global average pooling layer after the final convolution layer reduces the feature map to a vector;
S34, constructing the mapping network, which consists of 8 linear layers; a ReLU activation function and a pixel normalization layer follow each of the first 7 linear layers, while only a pixel normalization layer follows the last layer;
S35, constructing the discriminator part, wherein the discriminator adopts the discriminator structure proposed in StyleGAN;
S36, feeding the original-target age difference code and the face image to be aged into the generator to produce an aged face image of the target age group and a reconstructed face image of the original age group; feeding the aged face image of the target age group together with the target-original age difference code back into the generator to produce a reconstructed image of the target age group and an age-regressed face image of the original age group; feeding the aged images and the real faces into the discriminator; and feeding the real face of the target age group, the aged face image of the target age group, and the face image to be aged into the age difference encoder to construct the age difference information.
Compared with the prior art, the invention provides a face aging simulation method: image preprocessing is performed on the adopted face age dataset with a semantic segmentation model to obtain the face image to be aged; the original-target age difference code required for aging from the original age bracket to the target age bracket is designed; a generative adversarial network is constructed to build the age difference information; the network parameters are updated by back-propagation to complete the training of the network; and the face image to be aged is taken as the generator's input to obtain the aged face image. The advantages of the invention are that (1) the training time required to generate a clear aged-face effect is shorter, (2) the identity consistency of the generated aged face is better, and (3) the generated aged face is closer to the input person's real appearance in the target age range.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. The technical means used in the examples are conventional means well known to those skilled in the art unless otherwise indicated.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. Relational terms such as "first" and "second", and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "coupled," "connected," and the like are to be construed broadly and may be, for example, fixedly coupled, detachably coupled, or integrally formed, mechanically coupled, electrically coupled, or indirectly coupled via an intervening medium. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of additional identical elements in a process, method, article, or apparatus that comprises the element.
As shown in fig. 1, the present embodiment provides a face aging simulation method, which is performed according to the following steps:
S1, performing image preprocessing on the adopted face age dataset with a semantic segmentation model, removing the background, and keeping only the clean face region to obtain the face image to undergo aging simulation;
In step S1, the specific steps include:
S11, collecting a face aging data set, wherein the face aging data set comprises face images of different age groups and age labels corresponding to the face images;
In step S11, the different age groups include 0-2 years old, 3-6 years old, 7-9 years old, 15-19 years old, 30-39 years old, and 50-69 years old.
S12, adopting a pre-trained semantic segmentation network capable of distinguishing 19 kinds of semantic information in an input image, such as background, facial skin, facial features, clothes, and neck; each picture in the dataset is fed to the network, which outputs the semantic map of that picture;
S13, using the semantic map corresponding to each face picture as a reference, keeping only the semantic regions related to the face, facial features, hair, and neck, and randomly rotating the processed face pictures, thereby completing the construction of the dataset.
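The background-removal step S13 can be sketched as follows. The class IDs below are hypothetical: the actual label map depends on which face-parsing model is used, so treat `KEEP_CLASSES` as a placeholder for the face, facial-feature, hair, and neck labels of the chosen model.

```python
import numpy as np

# Hypothetical class IDs from a face-parsing network; the real label map
# depends on the segmentation model used (e.g. a 19-class face parser).
KEEP_CLASSES = {1, 2, 3, 4, 5}  # e.g. skin, eyes, nose, mouth, hair/neck

def mask_background(image: np.ndarray, semantic: np.ndarray) -> np.ndarray:
    """Zero out every pixel whose semantic class is not face-related."""
    keep = np.isin(semantic, list(KEEP_CLASSES))
    return image * keep[..., None]  # broadcast the mask over RGB channels

# Toy example: a 4x4 RGB image whose left half is labelled background (class 0).
img = np.ones((4, 4, 3), dtype=np.float32)
seg = np.zeros((4, 4), dtype=np.int64)
seg[:, 2:] = 1  # right half is "skin"
out = mask_background(img, seg)
```

The random rotation of S13 would then be applied to `out` by any standard image-augmentation routine.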
S2, designing the original-target age difference code needed when aging from the original age group to the target age group, and the target-original age difference code needed when moving from the target age group back to the original age group;
In step S2, assuming there are n age groups, the age difference code I has 50×2n bits in total, where each run of 50 bits represents the age difference information for converting from one age group to an adjacent age group;
first, the age difference code I is added to a noise vector of the same length drawn from a Gaussian distribution; second, taking bit 50×n as the reference bit, the age difference code from the original age group to the target age group j is constructed by adding 1 to bits 50×n through 50×(n+j)-1, with the remaining bits unchanged;
the age difference code from the target age group j back to the original age group is constructed by adding 1 to bits 50×(n-j) through 50×n-1, with the remaining bits unchanged.
S3, constructing a generative adversarial network comprising a generator, a discriminator, an age difference encoder, and a mapping network, and feeding the original-target age difference code, the target-original age difference code (from the target age group back to the original age group), and the face image to be aged into the generator and the discriminator to construct the age difference information, as shown in fig. 2 and 4;
in step S3, the specific steps include:
S31, constructing the encoder structure in the generator: the encoder first applies a 7×7 convolution layer with stride 1, followed by a ReLU activation function and a pixel normalization layer; next come 3×3 convolution layers, each with stride 2 and each followed by a ReLU activation function and a pixel normalization layer; four residual blocks with stride 1 then follow, where each of the first three residual blocks is followed by a ReLU activation function and a pixel normalization layer while the last residual block omits the pixel normalization layer, thereby completing the construction of the encoder;
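A minimal PyTorch sketch of this encoder. The number of stride-2 3×3 layers is not stated explicitly, so `n_down=2` and the 64-channel base width are assumptions:

```python
import torch
import torch.nn as nn

class PixelNorm(nn.Module):
    """Normalize each pixel's feature vector to unit RMS (as in ProGAN/StyleGAN)."""
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(dim=1, keepdim=True) + 1e-8)

class ResBlock(nn.Module):
    def __init__(self, ch, pixel_norm=True):
        super().__init__()
        layers = [nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU()]
        if pixel_norm:
            layers.append(PixelNorm())  # the last residual block omits this
        self.body = nn.Sequential(*layers)
    def forward(self, x):
        return x + self.body(x)

def build_encoder(in_ch=3, base=64, n_down=2):
    # n_down (number of stride-2 3x3 layers) is an assumption; the text
    # only states that the 3x3 layers each have stride 2.
    layers = [nn.Conv2d(in_ch, base, 7, 1, 3), nn.ReLU(), PixelNorm()]
    ch = base
    for _ in range(n_down):
        layers += [nn.Conv2d(ch, ch * 2, 3, 2, 1), nn.ReLU(), PixelNorm()]
        ch *= 2
    # four residual blocks; only the last one drops the pixel-norm layer
    layers += [ResBlock(ch) for _ in range(3)] + [ResBlock(ch, pixel_norm=False)]
    return nn.Sequential(*layers)

enc = build_encoder()
feat = enc(torch.randn(1, 3, 64, 64))  # downsampled feature map
```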
S32, constructing the decoder structure in the generator: the body of the decoder is built from age difference injection modules, each of which is a residual structure formed by a convolution layer and a StyleGAN style-convolution layer; the decoder contains six age difference injection modules in total plus a convolution layer with a 1×1 kernel; an upsampling layer follows the 5th and 6th age difference injection modules to restore the feature map to the size of the input image; the last layer, the 1×1 convolution layer, reduces the number of feature-map channels to 3 and is followed by a Tanh activation function, thereby completing the construction of the decoder;
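A simplified sketch of one age difference injection module. The per-channel scale/shift modulation below is a stand-in for StyleGAN's modulated convolution, not the exact operator; `w_dim` and the channel width are illustrative:

```python
import torch
import torch.nn as nn

class AgeDiffInjection(nn.Module):
    """Residual block: a plain conv path (no age information) plus a
    'style' path whose convolution output is modulated by the
    age-difference latent w. The scale/shift modulation is a simple
    stand-in for StyleGAN's modulated convolution."""
    def __init__(self, ch, w_dim):
        super().__init__()
        self.skip = nn.Conv2d(ch, ch, 3, 1, 1)   # path skipping age injection
        self.conv = nn.Conv2d(ch, ch, 3, 1, 1)   # style-convolution path
        self.affine = nn.Linear(w_dim, 2 * ch)   # w -> per-channel (scale, shift)

    def forward(self, x, w):
        scale, shift = self.affine(w).chunk(2, dim=1)
        h = self.conv(x) * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.skip(x) + h                  # residual combination

block = AgeDiffInjection(ch=64, w_dim=128)
y = block(torch.randn(2, 64, 16, 16), torch.randn(2, 128))
```

Stacking six such modules, with upsampling after the 5th and 6th and a final 1×1 convolution plus Tanh, would reproduce the decoder described above.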
S33, constructing the age difference encoder, a convolutional neural network: its first layer is a convolution layer with 7×7 kernels and stride 1, followed by 3×3 convolution layers with stride 2 and finally a convolution layer with 1×1 kernels; an LReLU activation function follows each of the first five convolution layers, and a global average pooling layer after the final convolution layer reduces the feature map to a vector, as shown in figure 3;
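A hedged sketch of the age difference encoder. The garbled middle of the description is read here as five stride-2 3×3 layers; that count, the channel widths, and the 128-dimensional output code are all assumptions:

```python
import torch
import torch.nn as nn

def build_age_diff_encoder(in_ch=3, base=32, code_dim=128):
    # Reading the description as five 3x3 stride-2 layers between the 7x7
    # and 1x1 convolutions -- an assumption, not stated exactly.
    layers = [nn.Conv2d(in_ch, base, 7, 1, 3), nn.LeakyReLU(0.2)]
    ch = base
    for _ in range(5):
        nxt = min(ch * 2, 256)
        layers += [nn.Conv2d(ch, nxt, 3, 2, 1), nn.LeakyReLU(0.2)]
        ch = nxt
    # 1x1 projection, then global average pooling reduces the map to a vector
    layers += [nn.Conv2d(ch, code_dim, 1), nn.AdaptiveAvgPool2d(1), nn.Flatten()]
    return nn.Sequential(*layers)

A = build_age_diff_encoder()
code = A(torch.randn(2, 3, 128, 128))  # one age code per input face
```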
S34, constructing the mapping network, which consists of 8 linear layers; a ReLU activation function and a pixel normalization layer follow each of the first 7 linear layers, while only a pixel normalization layer follows the last layer;
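The mapping network is straightforward to sketch. The 512-dimensional width is an assumption borrowed from StyleGAN; the text specifies only the layer count and the activation/normalization pattern:

```python
import torch
import torch.nn as nn

class PixelNorm(nn.Module):
    """Normalize each sample's feature vector to unit RMS."""
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(dim=1, keepdim=True) + 1e-8)

def build_mapping_network(dim=512, n_layers=8):
    layers = []
    for i in range(n_layers):
        layers.append(nn.Linear(dim, dim))
        if i < n_layers - 1:
            layers += [nn.ReLU(), PixelNorm()]  # first 7 layers: ReLU + pixel norm
        else:
            layers.append(PixelNorm())          # last layer: pixel norm only
    return nn.Sequential(*layers)

M = build_mapping_network()
w = M(torch.randn(4, 512))  # latent codes mapped to the w space
```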
S35, constructing the discriminator part, wherein the discriminator adopts the discriminator structure proposed in StyleGAN;
S36, feeding the original-target age difference code and the face image to be aged into the generator to produce an aged face image of the target age group and a reconstructed face image of the original age group; feeding the aged face image of the target age group together with the target-original age difference code back into the generator to produce a reconstructed image of the target age group and an age-regressed face image of the original age group; feeding the aged images and the real faces into the discriminator; and feeding the real face of the target age group, the aged face image of the target age group, and the face image to be aged into the age difference encoder to construct the age difference information.
S4, respectively calculating a loss function, an age loss function and an identity consistency loss function of the generator and the discriminator, and a loss function of the generated countermeasure network, wherein the loss function of the discriminator comprises the loss function of the generated countermeasure network, and the network parameters are updated in a counter-propagation mode so as to complete training of the generated countermeasure network;
S5, taking the face image to be aged as input of a generator to obtain a face aging image, as shown in fig. 5.
Illustratively, in this embodiment the face age dataset is divided into 6 groups: 0-2 (group 1), 3-6 (group 2), 7-9 (group 3), 15-19 (group 4), 30-39 (group 5), and 50-69 (group 6). The dataset is then preprocessed: a pre-trained DeepLabV3+ network produces the semantic map corresponding to each face picture, only the semantic regions related to the face, facial features, hair, and neck of each picture are kept on the basis of that map, and the pictures are randomly rotated, completing the preprocessing of the dataset.
Age difference codes are constructed on the basis of the 6 groups: the total code length is 50×(6×2)=600 bits, each initially set to 0. The age difference code is added to a 600-bit noise vector with mean 0 and variance 0.04. Taking bit 300 as the reference, each run of 50 bits in the forward direction represents one step of age difference information from the input face's age group to the adjacent age group, with the target age group older than that of the input picture; those 50 bits are each incremented by 1. If the aging process spans two age groups, the 100 bits forward from the reference position are incremented by 1. Likewise, each run of 50 bits in the negative direction represents one step toward the adjacent age group whose target age is less than that of the input picture.
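The 600-bit construction above can be sketched as follows, assuming the Gaussian noise has standard deviation 0.2 (variance 0.04) and is added to every bit:

```python
import numpy as np

N_GROUPS, BITS = 6, 50
BASE = N_GROUPS * BITS          # reference position: bit 300
LENGTH = 2 * N_GROUPS * BITS    # 600-bit age-difference code

def age_diff_code(span: int, rng=None) -> np.ndarray:
    """Build the age-difference code for a jump of `span` adjacent age
    groups (positive = toward an older group, negative = toward a
    younger group). Gaussian noise with variance 0.04 is added to every
    bit, then 50 bits per group step are incremented by 1."""
    rng = rng or np.random.default_rng(0)
    code = rng.normal(0.0, 0.2, size=LENGTH)      # std 0.2 -> variance 0.04
    if span > 0:                                   # forward: older target group
        code[BASE:BASE + BITS * span] += 1.0
    elif span < 0:                                 # backward: younger target group
        code[BASE + BITS * span:BASE] += 1.0
    return code

code = age_diff_code(span=2)   # e.g. group 2 (3-6) aged to group 4 (15-19)
```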
The network is constructed as follows. Table 1 is a table of encoder network structures.
Table 1 encoder network architecture table
Table 2 is a decoder network architecture table.
Table 2 decoder network architecture table
Table 3 is the mapping network structure table.
TABLE 3 mapping network structural Table
Table 4 is the age difference encoder structure table.
Table 4 age difference encoder structure table
The following describes part of the loss functions used during training. First, to constrain identity information, the network involves the input image x_orig, the aged image y_trans, the reconstructed image x_rec, and the re-generated image x_cyc of the original age group obtained by taking y_trans as input. A reconstruction loss is built from x_orig and x_rec, requiring the reconstructed face image to be sufficiently close to the original face image. A cycle consistency loss is built from x_orig and x_cyc, requiring that when the aged image is aged back to the face of the initial age group, the generated result is close to the original face image. Through these two losses, the generator as a whole strengthens identity consistency when generating age-changed images. The two loss functions are shown in formulas (1) and (2):
L_rec = ||x_orig - x_rec||_1 (1)
L_cyc = ||x_orig - x_cyc||_1 (2)
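Formulas (1) and (2) are plain L1 distances. A minimal sketch, using the mean rather than the sum for the L1 norm (a common implementation choice, assumed here):

```python
import numpy as np

def l1(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute error: the ||.||_1 term of formulas (1) and (2)."""
    return float(np.abs(a - b).mean())

x_orig = np.zeros((3, 8, 8))
x_rec  = np.full((3, 8, 8), 0.1)   # slightly imperfect reconstruction
x_cyc  = np.zeros((3, 8, 8))       # perfect cycle result

L_rec = l1(x_orig, x_rec)   # formula (1)
L_cyc = l1(x_orig, x_cyc)   # formula (2)
```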
Next, the age information is constrained, first using the values produced by the age difference encoder, which is responsible for the difference-learning work. Its inputs are the age-changed image y_trans, an original face image y_orig of age group y, and the original face image x_orig of age group x. Passing the three face pictures through the age difference encoder A yields the age information codes A(y_trans), A(y_orig), and A(x_orig) they contain. Here A(y_trans) serves as the age information code of the aged face picture, whose features are extracted from non-real data, while A(y_orig) and A(x_orig) are extracted from real data. When age group x is the original age group and the face is changed to age group y, A(x_orig) is taken as the reference and is subtracted from A(y_trans) and A(y_orig) respectively, giving two codes containing age difference information, dif_y_trans-x_orig and dif_y_orig-x_orig. Both age codes participate in the loss function calculation as values controlling l_age, so that the content of the age difference information is obtained more reliably. The whole process can be represented by formula (3):

dif_y_trans-x_orig = A(y_trans) - A(x_orig), dif_y_orig-x_orig = A(y_orig) - A(x_orig) (3)
The dif_y_trans-x_orig and dif_y_orig-x_orig obtained from the age difference encoder serve as the ground-truth values of the age constraint. dif_y_orig-x_orig is obtained by differencing two pieces of real data and can therefore provide real information to the age code I_age, so that the hidden vector l_age obtained through I_age carries a more realistic age representation. Meanwhile, dif_y_trans-x_orig, being the age difference obtained from one piece of real data and one piece of generated data, does not entirely contain a real age representation, but it does contain the age information of the generated aged picture; constraining it against the age code I_age feeds back to improve the age realism of the aged picture. The two loss functions are shown in formulas (4) and (5):
L_dif_t = ||I_age - dif_y_orig-x_orig||_1 (4)
L_dif_f = ||I_age - dif_y_trans-x_orig||_1 (5)
The final age difference loss function is the sum of the two loss functions, as shown in formula (6):
L_age = L_dif_t + L_dif_f (6)
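The age-difference constraint of formulas (3)-(6) can be sketched as follows. The encoder outputs are stand-in vectors, and the mean-absolute form of the L1 norm is an assumption:

```python
import numpy as np

def age_difference_loss(I_age, A_y_trans, A_y_orig, A_x_orig):
    """Formulas (3)-(6): the age codes of the target-group faces minus the
    code of the source face give two difference vectors; both are pulled
    toward the injected age code I_age with an L1 penalty."""
    dif_true  = A_y_orig  - A_x_orig   # formula (3), from two real images
    dif_false = A_y_trans - A_x_orig   # formula (3), one real + one generated
    L_dif_t = np.abs(I_age - dif_true).mean()    # formula (4)
    L_dif_f = np.abs(I_age - dif_false).mean()   # formula (5)
    return L_dif_t + L_dif_f                     # formula (6)

# Toy codes: both differences equal I_age exactly, so the loss is zero.
I_age = np.ones(4)
L_age = age_difference_loss(I_age, np.ones(4) * 2.0, np.ones(4) * 2.0, np.ones(4))
```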
Meanwhile, the generated aged face image and the real face image are each fed to the discriminator, which produces a vector predicting the age bracket to which the provided picture belongs; the discriminator is thus also responsible for constraining the age information of the generated picture, as shown in formula (7):
L_adv(G, D) = E_x,a[log D_a(x)] + E_x,b[log(1 - D_b(y_trans))] (7)
Here a denotes the age group of the real picture and b the age group of the generated picture. The discriminator outputs a vector in which each bit corresponds to one age group and gives the probability that the input picture belongs to that group. In the loss function, however, only the prediction bit of the relevant age group is used. For example, if the input real picture belongs to age group a, only the prediction bit of age group a enters the loss as a constraint term; if the input is a generated picture of age group b, only the prediction bit of age group b enters the loss. The loss function of the final network as a whole is shown in formula (8):
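The per-bit selection of formula (7) can be sketched as follows; the 6-element probability vectors are illustrative stand-ins for discriminator outputs:

```python
import numpy as np

def adversarial_terms(d_real: np.ndarray, d_fake: np.ndarray,
                      a: int, b: int) -> float:
    """Formula (7): the discriminator outputs one probability per age
    group; only the bit of the relevant age group enters the loss.
    d_real / d_fake are per-age-group probabilities in (0, 1)."""
    return float(np.log(d_real[a]) + np.log(1.0 - d_fake[b]))

d_real = np.array([0.1, 0.9, 0.2, 0.1, 0.1, 0.1])  # real face, age group a = 1
d_fake = np.array([0.2, 0.1, 0.3, 0.1, 0.2, 0.1])  # generated face, group b = 2
L_adv = adversarial_terms(d_real, d_fake, a=1, b=2)
```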
L = min_G max_D L_adv(G, D) + λ_rec·L_rec(G) + λ_cyc·L_cyc(G) + λ_age·L_age(G) (8)
Here λ_rec, λ_cyc, and λ_age are the hyper-parameters of the reconstruction, cycle consistency, and age difference loss functions, respectively; these three losses participate in the constraint only during training of the generator. This completes the definition of the loss functions of the age-difference-based face aging network; based on these losses, the age information and identity information of the face can be optimized simultaneously. For the hyper-parameters, λ_rec and λ_cyc are set to 10 and λ_age is set to 0.5. The optimizer is Adam, with momentum set to 0.9 and learning rate set to 0.001.
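A minimal sketch of one generator update with the stated hyper-parameters (λ_rec = λ_cyc = 10, λ_age = 0.5, Adam with momentum 0.9 read as β1 = 0.9, learning rate 0.001). The loss terms below are placeholders standing in for formulas (1), (2), and (6), not the real network outputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hyper-parameters from the embodiment.
lam_rec, lam_cyc, lam_age = 10.0, 10.0, 0.5

# Stand-in generator: a single conv layer, so the update step is visible.
G = nn.Conv2d(3, 3, 3, padding=1)
opt = torch.optim.Adam(G.parameters(), lr=1e-3, betas=(0.9, 0.999))

x = torch.randn(2, 3, 16, 16)
y = G(x)

# Placeholder loss terms standing in for formulas (1), (2), (6);
# formula (8) combines them with the adversarial term (omitted here).
L_rec = (x - y).abs().mean()
L_cyc = (x - y).abs().mean()
L_age = y.pow(2).mean()
loss = lam_rec * L_rec + lam_cyc * L_cyc + lam_age * L_age

opt.zero_grad()
loss.backward()
opt.step()
```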
Details of the training process are given below. Overall, each forward pass involves at least two face pictures, one from age group x and one from age group y, so that the mutual conversion between the two age groups is carried out on real samples within a single forward pass. The gradients computed from the loss during back-propagation therefore cover the face aging process in both directions, which improves the accuracy of bidirectional aging. The overall training is divided into a generator training stage and a discriminator training stage, described below.
The generator training stage involves both the generator and the discriminator. In one forward pass, the generator produces a reconstructed picture and an aged picture from a single input picture. It should be emphasized that in this method the two pictures are not generated separately during one forward pass but simultaneously, owing to the age difference injection structure: in each layer, the feature map is passed both through the age difference injection structure, yielding a feature map with age difference information injected, and through the convolution layer alone, skipping the age information injection. After the multi-layer injection structure, the final outputs are therefore the reconstructed face and the aged face.
After the reconstructed face and the aged face are obtained, cycle consistency training follows: the aged face is taken as the generator's input and the age operation is applied again, producing a face picture sufficiently close to the one before aging. The generator used here is the same one that generated the reconstructed and aged faces from the original face picture; that is, the network has only one generator, responsible for the bidirectional simulation of aging and rejuvenation, with the direction controlled mainly by the provided age information. The face generated in this cycle step participates only in the cycle consistency loss calculation and not in the training of the discriminator.
In the discriminator part, the aged face is taken as input to obtain the discriminator's prediction vector, which contains the age-group probabilities predicted for the input picture. When the loss function is computed, only the value in the bit corresponding to the picture's expected age group is used, and the rest are discarded. This concludes the training stage of the generator.
The discriminator training stage requires the generator to provide generated pictures. These pictures carry no gradient information; they are used only to train the discriminator and do not update the generator's weights. The aged face and the real face are each taken as input, yielding two prediction vectors. As before, when the two vectors participate in the loss calculation, only the value in the bit corresponding to each picture's expected age group is used, and the remaining bits are discarded. This completes the training of the discriminator.
After network training is completed, a test face to undergo aging can be input to the network to complete the face aging simulation; as shown in fig. 5, the left image is the input image and the right image is the output image. The prior-art result is shown in fig. 6, where the leftmost image is the input image and the rest are output images.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.