
WO2022173820A2 - System and method for class-identity-preserving data augmentation - Google Patents

System and method for class-identity-preserving data augmentation

Info

Publication number: WO2022173820A2
Application number: PCT/US2022/015806
Authority: WO (WIPO, PCT)
Prior art keywords: image, network, input, real, class
Legal status: Ceased (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: French (fr)
Other versions: WO2022173820A3 (en)
Inventors: Marios Savvides, Yutong Zheng, Yu Kai Huang
Current assignee: Carnegie Mellon University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Carnegie Mellon University
Priority date: 2021-02-15 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2022-02-09
Publication date: 2022-08-18 (A2); 2022-12-15 (A3)
Application filed by: Carnegie Mellon University
Priority: US18/259,477, published as US20240320964A1 (en)

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172: Classification, e.g. identification

Abstract

Disclosed herein is a system and method for data augmentation for general object recognition which preserves the class identity of the augmented data. The system comprises an image recognition network and an image generation network, which take as input ground truth images and classes, respectively, and which generate a predicted class and an augmented image. A discriminator evaluates the predicted class and augmented image and provides feedback to the image recognition network and the image generation network.

Description

SYSTEM AND METHOD FOR CLASS-IDENTITY-PRESERVING DATA AUGMENTATION
Related Applications
[0001] This application claims the benefit of U.S. Provisional Patent Application No. 63/149,388, filed February 15, 2021, the contents of which are incorporated herein in their entirety.
Background
[0002] Deep neural networks with large-scale training are an efficient solution for many different applications, such as image classification, object detection and image segmentation. One crucial issue in the training of deep neural networks is the overfitting problem. In a deep neural network with a large number of parameters, generalization beyond the training dataset must be considered, because the parameters can easily become overfitted to a limited training dataset.
[0003] Data augmentation is an efficient method to introduce variations into the training dataset during training, thereby increasing the size of the training dataset. Using data augmentation, the size of a training dataset for a neural network can be increased by introducing copies of existing training data that have been slightly modified or by creating synthetic training data from existing training data that are then added to the training dataset. The augmented dataset thereby acts as a regularizer and helps to reduce the overfitting problem.
[0004] Data augmentation may take many forms. For example, objects may be deformed, skewed, rotated or mirrored. In addition, semantic features such as pose, lighting, shape and texture may be modified by various means.
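As a concrete illustration only (a minimal sketch using the torchvision library, which this disclosure does not prescribe), such conventional transformations can be composed into an augmentation pipeline:

```python
# Illustrative sketch of conventional augmentation (not the disclosed system):
# each epoch sees a slightly modified copy of every training image.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),        # mirroring
    T.RandomRotation(degrees=15),         # rotation
    T.RandomAffine(degrees=0, shear=10),  # skewing
    T.ColorJitter(brightness=0.2),        # lighting variation
    T.ToTensor(),                         # PIL image -> tensor
])
```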
Summary
[0005] Disclosed herein is a system and method for data augmentation for general object recognition which preserves the class identity of the augmented data. In one instance, where facial images are the objects, the system is conditioned on the identity of the person, and changes are made to other facial semantics, such as pose, lighting, expression, makeup, etc., to achieve better accuracy in model performance. The method also addresses related AI problems such as the insufficiency of available training images and privacy concerns over the training data. It enables the training of in-the-wild recognition systems with only limited available data by creating large-scale photorealistic synthetic datasets that can be used for training any neural network.
Brief Description of the Drawings
[0006] By way of example, a specific exemplary embodiment of the disclosed system and method will now be described, with reference to the accompanying drawings, in which:
[0007] FIG. 1 is a block diagram illustrating a system to be trained.
[0008] FIG. 2 is a block diagram of the discriminator used in the training of the system of FIG. 1.
Detailed Description
[0009] The system and method of the disclosed invention will be explained in the context of a facial image generator that is to be trained to generate “real” facial images in “real” classes. In this context, “real” facial images may be images that are acceptably photorealistic faces, while “real” classes are classes of facial images that the generator has been trained to generate. For example, the facial image generator may have been trained to generate faces with or without glasses, with or without facial hair, having hair of different colors, etc. In addition, the facial image generator may have been trained to generate images in classes representing semantic features such as pose, lighting, expression, makeup, etc. As would be realized by one of skill in the art, the method may be used on an image generator for generating images depicting classes of objects other than facial images.
[0010] FIG. 1 is a block diagram of the system to be trained. The system consists of an image recognition network 102, a classifier which may be implemented as, for example, a deep neural network, and an image generation network 104, which may be, for example, a generative adversarial network (GAN). Image recognition network 102, given a real image 106 of an object (e.g., a face), predicts one or more classes 108 to which the object belongs. The image generation network 104, given one or more real classes 110, generates an augmented image 112 having the features specified by the real class 110. Preferably, the real image 106 input to recognition network 102 and the real class 110 input to image generator 104 will be ground truth inputs wherein the real image 106 exhibits the real class 110.
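The following is a minimal PyTorch sketch of the two networks of FIG. 1. The disclosure does not fix architectures, so the fully connected layers, the embedding-based class conditioning, and all dimensions below are illustrative assumptions only:

```python
import torch
import torch.nn as nn

NUM_CLASSES, NOISE_DIM = 1000, 128   # illustrative sizes, not from the patent
IMG_SHAPE = (3, 64, 64)
IMG_DIM = 3 * 64 * 64

class RecognitionNetwork(nn.Module):
    """Image recognition network 102: real image in, class logits 108 out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM, 512), nn.ReLU(),
            nn.Linear(512, NUM_CLASSES),
        )

    def forward(self, image):
        return self.net(image.flatten(1))

class GenerationNetwork(nn.Module):
    """Image generation network 104: real class 110 (plus noise) in,
    augmented image 112 out."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, 128)
        self.net = nn.Sequential(
            nn.Linear(128 + NOISE_DIM, 512), nn.ReLU(),
            nn.Linear(512, IMG_DIM), nn.Tanh(),
        )

    def forward(self, class_id, noise):
        cond = torch.cat([self.embed(class_id), noise], dim=1)
        return self.net(cond).view(-1, *IMG_SHAPE)
```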
[0011] FIG. 2 is a block diagram showing discriminator 202. Discriminator 202 takes two inputs: (1) images generated by the image generation network 104; and (2) a classification of the image generated by the image generation network 104, as predicted by the recognition network 102.
[0012] Discriminator 202 is responsible for determining the authenticity of the generated images and predicted classes according to both their quality and identity preservation, and for punishing the image recognition network 102 and/or the image generation network 104. That is, if the image input to discriminator 202 is not photorealistic (quality) or the predicted class input to discriminator 202 is not accurate (identity preservation), discriminator 202 will determine a “fake” outcome and will punish the image generation network 104 and/or the image recognition network 102 to make them more accurate. Over time, as image recognition network 102 and image generation network 104 become more and more accurate, the punishment becomes weaker.
[0013] Given the two inputs, discriminator 202 may return one of two results, either a “real” determination or a “fake” determination. FIG. 2(a) is an example of discriminator 202 returning a “real” result based on a determination that the generated image and predicted class input to discriminator 202 are both real. FIG. 2(b) shows the case wherein discriminator 202 returns a fake result based on a determination that the generated image input to discriminator 202 is a real image, but the predicted class is fake (i.e., inaccurate). Lastly, FIG. 2(c) shows the case wherein discriminator 202 returns a fake result based on a low-quality (i.e., “fake”) image generated by image generation network 104 and a real class as generated by image recognition network 102.
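Continuing the sketch under the same assumptions, discriminator 202 can be modeled as a network that consumes the generated image together with the predicted class distribution and emits a single real/fake logit, so that either a non-photorealistic image (FIG. 2(c)) or a mismatched class (FIG. 2(b)) can drive the output toward “fake”:

```python
class Discriminator(nn.Module):
    """Discriminator 202: judges a (generated image, predicted class) pair.
    Only a photorealistic image paired with an identity-preserving class
    prediction (FIG. 2(a)) should score as "real"."""
    def __init__(self):
        super().__init__()
        self.class_proj = nn.Linear(NUM_CLASSES, 128)  # project class probabilities
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + 128, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),  # single real/fake logit
        )

    def forward(self, image, class_probs):
        pair = torch.cat([image.flatten(1), self.class_proj(class_probs)], dim=1)
        return self.net(pair)
```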
[0014] As previously stated, when a “real” result is returned, the image generation network 104 and recognition network 102 will not be punished, while a “fake” output of discriminator 202 will result in image generation network 104 and/or recognition network 102 being punished. The punishment may be in the form of gradients to be backpropagated to the various layers of the respective networks.
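One way this punishment could be realized, continuing the sketch above; the binary cross-entropy losses, the Adam optimizers, and the joint update of the recognition and generation networks are standard GAN-style choices assumed for illustration rather than prescribed by the disclosure:

```python
import torch.nn.functional as F

R, G, D = RecognitionNetwork(), GenerationNetwork(), Discriminator()
opt_rg = torch.optim.Adam(list(R.parameters()) + list(G.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_image, real_class):
    noise = torch.randn(real_class.size(0), NOISE_DIM)
    fake_image = G(real_class, noise)             # augmented image 112
    pred_class = F.softmax(R(fake_image), dim=1)  # predicted classes 108

    # Discriminator update: the ground-truth (image, class) pair is treated
    # as "real"; the generated (image, predicted class) pair as "fake".
    real_score = D(real_image, F.one_hot(real_class, NUM_CLASSES).float())
    fake_score = D(fake_image.detach(), pred_class.detach())
    loss_d = (
        F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score))
        + F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score))
    )
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # "Punishment": when D scores the generated pair as fake, this loss
    # backpropagates gradients into both G (image quality) and R (identity
    # preservation), weakening as both networks improve.
    score = D(fake_image, pred_class)
    loss_rg = F.binary_cross_entropy_with_logits(score, torch.ones_like(score))
    opt_rg.zero_grad(); loss_rg.backward(); opt_rg.step()
```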
[0015] In addition to generating images exhibiting the real class 110, image generation network 104 may take noise as an additional input, applied by the image generation network 104 to further boost the variability of its output, even with a pre-specified real class 110. That is, class-independent semantics may be explicitly introduced into the generation. In the special case of face recognition, while the networks have been conditioned on facial identities, other semantics such as pose, lighting, expression, makeup, etc. can still vary independently. As such, image generation network 104 is encouraged to render images with large variations on which the image recognition network 102 trains. This is the key goal of data augmentation.
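As a short usage note on the same assumed interfaces, holding the real class fixed while resampling the noise is what yields identity-preserving variation:

```python
# Eight augmented images of one identity (class 42 is an arbitrary example);
# only class-independent semantics vary with the resampled noise.
class_id = torch.full((8,), 42, dtype=torch.long)
varied_images = G(class_id, torch.randn(8, NOISE_DIM))
```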
[0016] Taking face recognition as an example, after training, image generation network 104 will be able to generate photorealistic images with a given identity, while the generated images are expected to exhibit large variations among multiple facial semantics such as pose, lighting, expression, etc. Using facial semantic disentanglement methods, the system 100 can also generate desired faces with desired facial semantics, for example, a face at a 60-degree left yaw angle, wearing glasses and exhibiting a smiling expression.
[0017] As would be realized by one of skill in the art, the disclosed system 100 described herein can be implemented by a system further comprising a processor and memory storing software that, when executed by the processor, implements the software components comprising system 100.
[0018] As would further be realized by one of skill in the art, many variations on the implementations discussed herein which fall within the scope of the invention are possible. Moreover, it is to be understood that the features of the various embodiments described herein are not mutually exclusive and can exist in various combinations and permutations, even if such combinations or permutations are not expressly set forth herein, without departing from the spirit and scope of the invention. Accordingly, the method and apparatus disclosed herein are not to be taken as limitations on the invention but as an illustration thereof. The scope of the invention is defined by the claims which follow.

Claims

1. A system comprising: an image recognition network which takes a real image as input and predicts a class of the image; an image generation network which takes a real class as input and generates an image fitted to the input class; and a discriminator network which takes as input the generated image and the predicted class and returns a result indicating whether the predicted class is accurate and the generated image is of an acceptable quality.
2. The system of claim 1 wherein the real image input to the image recognition network and the real class input to the image generation network are ground truth inputs, wherein the real image input to the image recognition network exhibits features of the real class input to the image generation network.
3. The system of claim 2 wherein the discriminator network punishes the image recognition network and/or the image generation network based on an output of the discriminator network.
4. The system of claim 3 wherein the discriminator network returns a real result if the generated image input is real and the predicted class input is real.
5. The system of claim 3 wherein the discriminator network returns a fake result if the generated input image is real and the predicted class input is fake.
6. The system of claim 3 wherein the discriminator network returns a fake result if the generated input image is fake and the predicted class input is real.
7. The system of claim 3 wherein the discriminator network punishes the recognition network if the discriminator network returns a fake result based on the predicted class input being fake.
8. The system of claim 3 wherein the discriminator network punishes the image generation network if the discriminator network returns a fake result based on the generated input image being fake.
9. The system of claim 3 wherein the discriminator network generates a gradient to be backpropagated to the image recognition network and/or the image generation network as the punishment.
10. The system of claim 1 wherein the image generation network takes as additional input random noise to introduce class independent semantic variations into the generated image.
11. The system of claim 1 wherein images generated by the image generation network are used to train the image recognition network.
12. The system of claim 1 wherein objects recognized by the image recognition network in the real input image are facial images.
13. The system of claim 12 wherein the facial images are generated by the image generation network and preserve a class identity of the face depicted in the facial image.
14. The system of claim 1 further comprising: a processor; and memory storing software that, when executed by the processor, implements the image recognition network, the image generation network, and the discriminator network.
PCT/US2022/015806, priority date 2021-02-15, filed 2022-02-09: System and method for class-identity-preserving data augmentation. Status: Ceased. Published as WO2022173820A2 (en).

Priority Applications (1)

US18/259,477 (published as US20240320964A1, en), priority date 2021-02-15, filing date 2022-02-09: System and method for class-identity-preserving data augmentation

Applications Claiming Priority (2)

US 63/149,388 (US202163149388P), priority date 2021-02-15, filed 2021-02-15

Publications (2)

WO2022173820A2 (en), published 2022-08-18
WO2022173820A3 (en), published 2022-12-15

Family

ID=82837879

Family Applications (1)

PCT/US2022/015806 (WO2022173820A2, en), priority date 2021-02-15, filed 2022-02-09: System and method for class-identity-preserving data augmentation. Status: Ceased.

Country Status (2)

US: US20240320964A1 (en)
WO: WO2022173820A2 (en)

Also Published As

WO2022173820A3 (en), published 2022-12-15
US20240320964A1 (en), published 2024-09-26

Legal Events

121 (EP): The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22753260; country: EP; kind code: A2.
DPE1: Request for preliminary examination filed after expiration of 19th month from priority date (PCT application filed from 20040101).
WWE (WIPO information): Entry into national phase. Ref document number: 18259477; country: US.
NENP: Non-entry into the national phase. Ref country code: DE.
122 (EP): PCT application non-entry in European phase. Ref document number: 22753260; country: EP; kind code: A2.