
WO2016086330A1 - A method and a system for face recognition - Google Patents

A method and a system for face recognition

Info

Publication number
WO2016086330A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature extraction
features
extraction module
images
convolutional layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2014/001091
Other languages
French (fr)
Inventor
Xiaoou Tang
Yi Sun
Xiaogang Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201480083717.5A priority Critical patent/CN107004115B/en
Priority to PCT/CN2014/001091 priority patent/WO2016086330A1/en
Publication of WO2016086330A1 publication Critical patent/WO2016086330A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/192 - Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 - References adjustable by an adaptive method, e.g. learning


Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Collating Specific Patterns (AREA)

Abstract

Disclosed are an apparatus and a method for face recognition. The apparatus may comprise: an extractor having a plurality of cascaded feature extraction modules, wherein each of the cascaded feature extraction modules comprises a convolutional layer for extracting local features from input face images or from features extracted in a previous feature extraction module of the modules; and a fully-connected layer connected to the convolutional layer in the same feature extraction module and extracting global features from the extracted local features. The apparatus may further comprise a recognizer configured to, in accordance with distances between the extracted global features, determine: if two face images of the input images are from a same identity, or if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images.

Description

A METHOD AND A SYSTEM FOR FACE RECOGNITION
Technical Field
The present application relates to a method for face recognition and a system thereof.
Background
Only very recently has deep learning achieved great success on face recognition, significantly outperforming systems using low-level features. There are two notable breakthroughs. The first is large-scale face identification with deep neural networks. By classifying the face images into thousands or even millions of identities, the last hidden layer forms features highly discriminative to identities. The second is supervising the deep neural networks with both the identification and verification tasks. The verification task minimizes the distance between features of the same identity, and decreases intra-personal variations. By combining features learned from many face regions, joint identification-verification achieved the current state-of-the-art 99.15% face verification accuracy on the most extensively evaluated LFW face recognition dataset.
There have been works on first learning attribute classifiers and then using attribute predictions for face recognition. In addition, sparse representation-based classification was extensively studied for face recognition with occlusions. The Robust Boltzmann Machine has been proposed to distinguish corrupted pixels and learn latent representations. These methods design components that explicitly handle occlusions.
Summary
There have been works on first learning attribute classifiers and then using attribute predictions for face recognition, while this application tries the inverse: first predicting the identities, and then using the learned identity-related features to predict attributes. It is observed that the features in higher layers of the neural networks are highly selective to identities and identity-related attributes such as sex and race. When an identity (who can be outside the training data) or attribute is presented, a subset of features can be identified which are constantly excited, and another subset of features can be identified which are constantly inhibited. A feature from either of these two subsets is a strong indicator of the existence or non-existence of this identity or attribute, and this application shows that such a single feature alone has high recognition accuracy for a particular identity or attribute. In other words, features in deep neural networks have sparsity on identities and attributes. Although the deep neural networks in this application are not taught to distinguish attributes during training, they have implicitly learned such high-level concepts. Directly employing features learned by deep neural networks leads to much higher classification accuracy on identity-related attributes than widely used handcrafted features such as high-dimensional LBP (Local Binary Pattern).
Contrary to the conventional sparse representation-based classification, this application shows that deep neural networks trained by natural web face images without artificial occlusion patterns added during training have implicitly encoded invariance to occlusions.
It is observed in this application that features learned by the deep neural networks are moderately sparse. For an input face image, around half of the features in the top hidden layer are activated. On the other hand, each feature is activated on roughly half of the face images. Such sparsity distributions can maximize the discriminative power of deep neural networks as well as the distance between images. Different identities have different subsets of features activated. Two images of the same identity have similar activation patterns. This motivates this application to binarize the real-valued features in the top hidden layer of deep neural networks and use the binary code for recognition. The result is surprisingly good: the verification accuracy on LFW drops only slightly, by less than 1%. This has a significant impact on large-scale face search, since huge amounts of storage and computation time are saved. It also implies that binary activation patterns are more important than activation magnitudes in the deep neural networks.
In one aspect of the present application, disclosed is an apparatus for face recognition. The apparatus may comprise a feature extractor and a recognizer. The feature extractor is configured with a plurality of cascaded feature extraction modules, wherein each of the feature extraction modules comprises a convolutional layer for extracting local features from input face images or from features extracted in a previous feature extraction module of the modules; and a fully-connected layer connected to the convolutional layer in the same feature extraction module and extracting global features from the extracted local features. The recognizer is configured to, in accordance with distances between the extracted global features, determine: if two face images of the input images are from a same identity, or if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images.
In one embodiment of the present application, the convolutional layer in the first feature extraction module of the cascaded feature extraction modules is connected to an input face image, and the convolutional layer in each of the following feature extraction modules is connected to the convolutional layer in the previous feature extraction module. The fully-connected layer in each feature extraction module is connected to the convolutional layer in the same feature extraction module.
The apparatus may further comprise a trainer configured to update neuron weights on connections between each convolutional layer and the corresponding fully-connected layer in the same feature extraction module by back-propagating identification supervisory signals and verification supervisory signals through the cascaded feature extraction modules.
The process of the updating may comprise: inputting two face images to the neural network, respectively, to get feature representations of each of the two face images; calculating identification errors by classifying feature representations of each face image in each fully-connected layer of the neural network into one of a plurality of identities; calculating verification errors by verifying if feature representations of two face images, respectively, in each fully-connected layer are from the same identity, the identification and verification errors being treated as identification and verification supervisory signals, respectively; and back-propagating all identification and verification supervisory signals through the neural network simultaneously, so as to update the neuron weights on connections between each convolutional layer and the corresponding fully-connected layer in the same feature extraction module.
The present application discovers and proves three properties of features extracted in later feature extraction modules, i.e., sparsity, selectiveness, and robustness, all of which are critical for face recognition. Features are sparse in the sense both that features of each face image have approximately half zero values and half positive values, and that each feature is zero approximately half of the time and positive half of the time over all face images. Features are selective to both identities and identity-related attributes such as sex and race, in the sense that there are features which take either positive (activated) or zero (inhibited) values for all face images of a given identity or containing a given identity-related attribute. Features are robust to image corruptions such as occlusions, wherein feature values remain largely unchanged under moderate image corruptions.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating an apparatus for face recognition consistent with some disclosed embodiments.
Fig. 2 is a schematic diagram illustrating the sparsity, selectiveness, and robustness of features extracted in later feature extraction modules.
Fig. 3 is a schematic diagram illustrating structures of cascaded feature extraction modules in the feature extractor, as well as input face images and supervisory signals in the trainer.
Fig. 4 is schematic histograms illustrating the sparsity of activated features (neurons) on individual face images as well as the sparsity of individual features (neurons) activated on all face images.
Fig. 5 is schematic histograms illustrating the selective activation and inhibition of features on face images of particular identities.
Fig. 6 is schematic histograms illustrating the selective activation and inhibition of features on face images containing particular attributes.
Fig. 7 is a schematic diagram illustrating face images with random block occlusions, which are used to test the robustness of features extracted by the feature extractor against image corruptions.
Fig. 8 is a schematic diagram illustrating the mean feature activations over face images of individual identities under various degrees of random block occlusions.
Fig. 9 is a schematic flowchart illustrating the trainer as shown in Fig. 1 consistent with some disclosed embodiments.
Fig. 10 is a schematic flowchart illustrating the feature extractor as  shown in Fig. 1 consistent with some disclosed embodiments.
Fig. 11 is a schematic flowchart illustrating the recognizer as shown in Fig. 1 consistent with some disclosed embodiments.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a" , "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising, " when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc. ) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit, ” “module” or “system. ” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
It is further understood that the use of relational terms such as first and second, and the like, if any, are used solely to distinguish one from another entity, item, or action without necessarily requiring or implying any actual such relationship or order between such entities, items or actions.
Much of the inventive functionality and many of the inventive principles when implemented, are best supported with or in software or integrated circuits (ICs) , such as a digital signal processor and software therefore or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions or ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments.
Fig. 1 is a schematic diagram illustrating an exemplary apparatus 100 for face recognition consistent with some disclosed embodiments. As shown, the apparatus 100 may comprise a feature extractor 10 and a recognizer 20. The feature extractor 10 is configured to extract features from input face images. In one embodiment of the present application, the feature extractor 10 may comprise a neural network which may be constructed with a plurality of cascaded feature extraction modules, wherein each feature extraction module in the cascade comprises a convolutional layer and a fully-connected layer. The cascaded feature extraction modules may be implemented by software, integrated circuits (ICs) or the combination thereof. Fig. 3 illustrates a schematic diagram for structures of cascaded feature extraction modules in the feature extractor 10. As shown, the convolutional layer in the first feature extraction module of the cascaded feature extraction modules is connected to an input face image, and the convolutional layer in each of the following feature extraction modules is connected to the convolutional layer in the previous feature extraction module. The fully-connected layer in each feature extraction module is connected to the convolutional layer in the same feature extraction module.
Referring to Fig. 1, to enable the neural network to work effectively, the apparatus 100 further comprises a trainer 30 configured to update neural weights on  connections between the convolutional layer in the first feature extraction module and the input layer containing an input face image, connections between each convolutional layer in the second to the last feature extraction modules and the corresponding convolutional layer in the previous feature extraction module, and connections between each convolutional layer and the corresponding fully-connected layer in the same feature extraction module, by back-propagating identification supervisory signals and verification supervisory signals through the cascaded feature extraction modules, such that features extracted in last/highest one of the cascaded feature extraction modules are sparse, selective, and robust, which will be discussed later.
The recognizer 20 may be implemented by software, integrated circuits (ICs) or a combination thereof, and is configured to calculate distances between features extracted from different face images to determine if two face images are from the same identity for face verification, or to determine if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification.
The Feature Extractor 10
The feature extractor 10 contains a plurality of cascaded feature extraction modules, and operates to extract features hierarchically from input face images. Fig. 3 illustrates an example of structures of cascaded feature extraction modules in the feature extractor 10, which comprises, for example, four cascaded feature extraction modules, each of which contains a convolutional layer Conv-n and a fully-connected layer FC-n for n = 1, ..., 4. The convolutional layer Conv-1 in the first feature extraction module of the feature extractor 10 is connected to an input face image as an input layer, while the convolutional layer Conv-n for n > 1 in each of the following feature extraction modules of the feature extractor 10 is connected to the convolutional layer Conv- (n-1) in the previous feature extraction module. The fully-connected layer FC-n in each feature extraction module of the feature extractor 10 is connected to the convolutional layer Conv-n in the same feature extraction module.
Fig. 10 is a schematic flowchart illustrating the feature extraction process in the feature extractor 10. In step 101, the feature extractor 10 forward propagates an input face image through the convolutional layers in all feature extraction modules of the feature extractor 10. Then in step 102, the feature extractor 10 forward propagates the outputs of each of the convolutional layers to the corresponding fully-connected layer within the same feature extraction module. Finally in step 103, it takes the outputs/representations from the last one of the fully-connected layers as features, as discussed below.
Convolutional layers in the feature extractor 10 are configured to extract local facial features (i.e. features extracted from local regions of the input images or the input features) from input images (for the first convolutional layer) or from feature maps (which are the output feature maps of the previous convolutional layer, followed by max pooling, as is well known in the art) to form the output feature maps of the current convolutional layer. Each feature map is a certain kind of features organized in 2D. The features in the same output feature map, or in local regions of the same feature map, are extracted from the input feature maps with the same set of neural connection weights w between the input feature maps and the output feature maps in the previous convolutional layers (followed by max pooling) and the current convolutional layers, respectively. The convolution operation in each convolutional layer may be expressed as
$$y^{r}_{j} = \max\Bigl(0,\; b^{r}_{j} + \sum_{i} k^{r}_{ij} * x^{r}_{i}\Bigr)$$
where $x_i$ and $y_j$ are the i-th input feature map and the j-th output feature map, respectively. $k_{ij}$ is the convolution kernel between the i-th input feature map and the j-th output feature map. $*$ denotes convolution. $b_j$ is the bias of the j-th output feature map. Herein, the ReLU nonlinearity $y = \max(0, x)$ is used for neurons. Weights in higher convolutional layers of the ConvNets are locally shared; $r$ indicates a local region where weights are shared.
Each convolutional layer may be followed by max-pooling formulated as
$$y^{i}_{j,k} = \max_{0 \le m,\, n < s} \bigl\{ x^{i}_{j \cdot s + m,\; k \cdot s + n} \bigr\}$$
where each neuron in the i-th output feature map $y^{i}$ pools over an $s \times s$ non-overlapping local region in the i-th input feature map $x^{i}$.
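For concreteness, a minimal NumPy sketch of these two operations follows. It assumes single-channel 2D feature maps and a "valid" convolution window implemented as cross-correlation (as is usual in ConvNets), and it omits the local weight sharing indexed by r; all sizes and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def conv_relu(x_maps, kernels, biases):
    """Convolution per the formula above: y_j = max(0, b_j + sum_i k_ij * x_i).

    x_maps:  list of 2D input feature maps, each of shape (H, W)
    kernels: kernels[i][j] is the 2D kernel between input map i and output map j
    biases:  biases[j] is the bias of output map j
    """
    n_in, n_out = len(x_maps), len(biases)
    kh, kw = kernels[0][0].shape
    H, W = x_maps[0].shape
    out_h, out_w = H - kh + 1, W - kw + 1
    y_maps = []
    for j in range(n_out):
        acc = np.full((out_h, out_w), biases[j], dtype=float)
        for i in range(n_in):
            k = kernels[i][j]
            for r in range(out_h):          # "valid" 2D cross-correlation
                for c in range(out_w):
                    acc[r, c] += np.sum(x_maps[i][r:r + kh, c:c + kw] * k)
        y_maps.append(np.maximum(0.0, acc))  # ReLU nonlinearity
    return y_maps

def max_pool(x_map, s):
    """Max pooling over s x s non-overlapping regions of a single feature map."""
    H, W = x_map.shape
    H, W = (H // s) * s, (W // s) * s        # drop ragged borders for simplicity
    blocks = x_map[:H, :W].reshape(H // s, s, W // s, s)
    return blocks.max(axis=(1, 3))

# Tiny usage example with random maps and kernels (shapes are placeholders)
rng = np.random.default_rng(0)
x = [rng.standard_normal((8, 8)) for _ in range(2)]                        # 2 input maps
k = [[rng.standard_normal((3, 3)) for _ in range(4)] for _ in range(2)]    # 2x4 kernels
b = rng.standard_normal(4)                                                 # 4 output maps
pooled = [max_pool(m, 2) for m in conv_relu(x, k, b)]
print([m.shape for m in pooled])                                           # [(3, 3)] * 4
```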
Each of the fully-connected layers in the feature extractor 10 is configured to extract global features (features extracted from the entire region of the input feature maps) from the feature maps obtained from the convolutional layer in the same module. That is, the fully-connected layer FC-n extracts global features from the convolutional layer Conv-n. The fully-connected layers also serve as interfaces for receiving supervisory signals during training and outputting features during feature extraction. Fully-connected layers may be formulated as
$$y_{j} = \max\Bigl(0,\; \sum_{i} x_{i} \cdot w_{i,j} + b_{j}\Bigr) \qquad (4)$$
where $x_i$ represents the output of the i-th neuron in the previous convolutional layer (followed by max-pooling), $y_j$ represents the output of the j-th neuron in the current fully-connected layer, $w_{i,j}$ is the weight on the connection between the i-th neuron in the previous convolutional layer (followed by max-pooling) and the j-th neuron in the current fully-connected layer, and $b_j$ is the bias of the j-th neuron in the current fully-connected layer. $\max(0, x)$ is the ReLU non-linearity.
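A corresponding sketch of formula (4) is given below; the input and output dimensions are placeholders chosen only for illustration (the 512-dimensional output echoes the FC-4 example discussed later, but is an assumption here).

```python
import numpy as np

def fully_connected_relu(x, w, b):
    """Formula (4): y_j = max(0, sum_i x_i * w_ij + b_j)."""
    return np.maximum(0.0, x @ w + b)

rng = np.random.default_rng(0)
x = rng.standard_normal(1152)                  # e.g. flattened pooled feature maps
w = rng.standard_normal((1152, 512)) * 0.01    # 512 output neurons (illustrative)
b = np.zeros(512)
features = fully_connected_relu(x, w, b)
print(features.shape, float((features > 0).mean()))  # 512-dim, roughly half activated
```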
Features extracted in the last/highest feature extraction module of the feature extractor 10, e.g., those in the FC-4 layer as shown in Fig. 3, are sparse, selective, and robust. Features are sparse in the sense both that features of each face image have approximately half zero values and half positive values, and that each feature is zero approximately half of the time and positive half of the time over all face images. Features are selective to both identities and identity-related attributes such as sex and race, in the sense that there are features which take either positive (activated) or zero (inhibited) values for all face images of a given identity or containing a given identity-related attribute. Features are robust to image corruptions such as occlusions, wherein feature values remain largely unchanged under moderate image corruptions. The sparse features can be converted to binary code by comparing to a threshold, wherein the binary code can be used for face recognition.
Fig. 2 illustrates the three properties, sparsity, selectiveness, and robustness, of features extracted in the FC-4 layer. Fig. 2 left shows features on three face images of Bush and one face image of Powell. The second face image of Bush is partially occluded. In one embodiment of the present application, there are 512 features in the FC-4 layer, from which Fig. 2 subsamples 32 for illustration as an example. Features are sparsely activated on each face image, in which approximately half of the features are positive and half are zero. Features of face images of the same identity have similar activation patterns while being different for different identities. Features are robust in that when occlusions are present, as shown on the second face of Bush, the activation patterns of features remain largely unchanged. Fig. 2 right shows activation histograms of a few selected features over all face images (as background), all images belonging to Bush, all images with attribute “male”, and all images with attribute “female”. A feature is generally activated on about half of the face images, but it may constantly have activations (or no activation) for all images belonging to a particular identity or attribute. In this sense, features are sparse, and selective to identities and attributes.
The moderate sparsity on images makes faces of different identities maximally distinguishable, while the moderate sparsity on features gives them maximum discrimination ability. Fig. 4 left shows the histogram of the number of activated (positive) features on each of 46,594 (for example) face images in a validating dataset, and Fig. 4 right shows the histogram of the number of images on which each feature is activated (positive). The evaluation is based on features extracted by the FC-4 layer. Compared to all 512 (for example) features in the FC-4 layer in one embodiment of the present application, the mean and standard deviation of the number of activated neurons per image is 292 ± 34, while compared to all 46,594 validating images, the mean and standard deviation of the number of images on which each feature is activated is 26,565 ± 5,754, both of which are approximately centered at half of all features/images.
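The two histograms of Fig. 4 amount to simple per-image and per-feature counts, which the following sketch computes on synthetic stand-in features (the image count and feature dimensionality are placeholders, not the patent's data).

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_features = 1000, 512
# Stand-in for FC-4 outputs: ReLU of roughly zero-centred pre-activations,
# so about half of the entries are zero and half are positive.
features = np.maximum(0.0, rng.standard_normal((n_images, n_features)))

activated = features > 0
per_image = activated.sum(axis=1)     # activated features per face image (Fig. 4 left)
per_feature = activated.sum(axis=0)   # images on which each feature is activated (Fig. 4 right)

print(f"activated features per image:   {per_image.mean():.0f} +/- {per_image.std():.0f} of {n_features}")
print(f"images activating each feature: {per_feature.mean():.0f} +/- {per_feature.std():.0f} of {n_images}")
```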
The activation patterns, i.e., whether features are activated (with positive values), are more important than the precise activation values. Converting feature activations to binary code by thresholding sacrifices less than 1% face verification accuracy. This shows that the state of excitation or inhibition of features already contains the majority of the discriminative information. Binary code is economical for storage and fast for image search.
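The following sketch illustrates the thresholding described above: converting real-valued features to a binary code and comparing codes by Hamming distance. The zero threshold, 512-dimensional features, and synthetic inputs are assumptions for illustration only.

```python
import numpy as np

def binarize(features, threshold=0.0):
    """Binary code: 1 where a feature is activated (above the threshold), else 0."""
    return (features > threshold).astype(np.uint8)

def hamming_distance(code_a, code_b):
    """Number of positions where two binary codes differ."""
    return int(np.count_nonzero(code_a != code_b))

rng = np.random.default_rng(0)
a1 = rng.standard_normal(512)              # pre-activations for one face image
a2 = a1 + 0.1 * rng.standard_normal(512)   # same identity: small perturbation
a3 = rng.standard_normal(512)              # different identity: independent pattern
f1, f2, f3 = (np.maximum(0.0, a) for a in (a1, a2, a3))
c1, c2, c3 = binarize(f1), binarize(f2), binarize(f3)
print(hamming_distance(c1, c2), hamming_distance(c1, c3))  # few flipped bits vs. roughly half
```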
Fig. 5 and Fig. 6 show examples of activation histograms of features over given identities and attributes, respectively. Histograms over given identities exhibit strong selectiveness. Some features are constantly activated for a given identity, with histograms distributed over values greater than zero, as shown in the first two rows in Fig. 5, while some others are constantly inhibited, with histograms accumulated at zero or small values, as shown in the last two rows in Fig. 5. For attributes, each row of Fig. 6 shows histograms of a single feature over a few related attributes (those related to sex, race, and age). The selected features are excitatory on each of the attributes given at the left of each row. As shown in Fig. 6, features exhibit strong selectiveness to sex, race, and certain ages such as child and senior, in which features are strongly activated for a given attribute while inhibited for other attributes in the same category. For some other attributes such as youth and middle aged, the selectiveness is weak, in which there are no features solely activated for each of these attributes. This is because ages do not exactly correspond to identities. For example, in face recognition, features have to be invariant to the same identity photographed when both young and middle-aged.
Fig. 7 and Fig. 8 illustrate the robustness of features extracted in later feature extraction modules (the FC-4 layer) against image corruptions. Face images are occluded by random blocks with various sizes from 10×10 to 70×70, as illustrated in Fig. 7. Fig. 8 shows mean feature activations over images with random block occlusions, in which each column shows the mean activation over face images of a single identity given at the top of each column, with various degrees of occlusion given at the left of each row. Feature values are mapped to a color map with warm colors indicating positive values and cool colors indicating zero or small values. The order of features in each column is sorted by the mean feature activation values on the original face images of each identity, respectively. As can be seen in Fig. 8, the activation patterns remain largely unchanged (with most activated features still being activated and most inhibited features still being inhibited) until a large degree of occlusion is reached.
The Recognizer 20
The recognizer 20 operates to calculate distances between the global features extracted for different face images by the fully-connected layer of the feature extractor 10, to determine if two face images are from the same identity for face verification, or to determine if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification. Fig. 11 is a schematic flowchart illustrating the recognition process in the recognizer 20. In step 201, the recognizer 20 calculates distances between features (i.e. the global features for different face images extracted by the fully-connected layer) extracted from different face images by the feature extractor 10. Then in step 202, the recognizer 20 determines if two face images are from the same identity for face verification, or, alternatively, in step 203, it determines whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification.
In the recognizer 20, two face images are determined to belong to the same identity if their feature distance is smaller than a threshold, and the probe face image is determined to belong to the same identity as one of the gallery face images if their feature distance is the smallest among the feature distances of the probe face image to all gallery face images. The feature distances determined by the recognizer 20 could be Euclidean distances, Joint Bayesian distances, cosine distances, Hamming distances, or any other distances.
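A minimal sketch of these two decision rules, using Euclidean distance, is given below; the threshold value, gallery size, and feature dimensionality are illustrative assumptions.

```python
import numpy as np

def verify(features_a, features_b, threshold):
    """Face verification: same identity if the feature distance is below a threshold."""
    return float(np.linalg.norm(features_a - features_b)) < threshold

def identify(probe_features, gallery_features):
    """Face identification: index of the gallery image with the smallest feature distance."""
    distances = [float(np.linalg.norm(probe_features - g)) for g in gallery_features]
    return int(np.argmin(distances))

rng = np.random.default_rng(0)
gallery = [np.maximum(0.0, rng.standard_normal(512)) for _ in range(5)]
probe = gallery[3] + 0.05 * rng.standard_normal(512)   # noisy copy of gallery entry 3
print(verify(probe, gallery[3], threshold=4.0), identify(probe, gallery))  # True 3
```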
In one embodiment of the present application, Joint Bayesian distances are used as the feature distances. Joint Bayesian is a popular similarity metric for faces; it represents the extracted facial features x (after subtracting the mean) as the sum of two independent Gaussian variables
x = μ + ε,   (5)
where μ ~ N(0, Sμ) represents the face identity and ε ~ N(0, Sε) represents the intra-personal variations. Joint Bayesian models the joint probability of two faces under the intra-personal and extra-personal variation hypotheses, P(x1, x2|HI) and P(x1, x2|HE). It is readily shown from Equation (5) that these two probabilities are also Gaussian, with covariances
ΣI = [ Sμ + Sε    Sμ
       Sμ         Sμ + Sε ]
and
ΣE = [ Sμ + Sε    0
       0          Sμ + Sε ]
respectively, where ΣI and ΣE are the covariances of the stacked vector [x1; x2] under HI and HE. Sμ and Sε can be learned from data with the EM algorithm. At test time, the recognizer calculates the likelihood ratio
r(x1, x2) = log ( P(x1, x2|HI) / P(x1, x2|HE) ) ,
which has closed-form solutions and is efficient.
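A direct (though not the most efficient) way to evaluate this ratio is to stack the two mean-subtracted feature vectors and compare the two zero-mean Gaussian densities with the covariances above. The NumPy sketch below is an illustration only; it assumes Sμ and Sε have already been estimated by the EM algorithm, and in practice the closed-form solution mentioned above is preferred over building the 2d×2d covariances explicitly.

```python
import numpy as np

def joint_bayesian_ratio(x1, x2, S_mu, S_eps):
    """Log-likelihood ratio log P(x1, x2 | HI) / P(x1, x2 | HE) for
    mean-subtracted feature vectors x1 and x2 of dimension d."""
    d = x1.shape[0]
    z = np.concatenate([x1, x2])

    # Covariance of [x1; x2] under the intra-personal hypothesis HI.
    sigma_I = np.block([[S_mu + S_eps, S_mu],
                        [S_mu, S_mu + S_eps]])
    # Covariance under the extra-personal hypothesis HE (independent identities).
    zero = np.zeros((d, d))
    sigma_E = np.block([[S_mu + S_eps, zero],
                        [zero, S_mu + S_eps]])

    def log_density(v, sigma):
        # Zero-mean multivariate Gaussian log density in 2*d dimensions.
        _, logdet = np.linalg.slogdet(sigma)
        return -0.5 * (v @ np.linalg.solve(sigma, v) + logdet + 2 * d * np.log(2 * np.pi))

    return log_density(z, sigma_I) - log_density(z, sigma_E)
```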
The Trainer 30
The trainer 30 is used to update the weights w on connections between neurons in the convolutional and fully-connected layers of the feature extractor 10, by inputting initial weights on connections between neurons in the convolutional and fully-connected layers of the feature extractor, a plurality of identification supervisory signals, and a plurality of verification supervisory signals, such that the features extracted in the last one of the cascaded feature extraction modules in the extractor are sparse, selective, and robust.
As shown in Fig. 3, the identification and verification supervisory signals in the trainer 30, denoted "Id" and "Ve" respectively, are simultaneously added to each of the fully-connected layers FC-n, for n = 1, ..., 4, in each of the feature extraction modules of the feature extractor 10, and are back-propagated to the input face image so as to update the weights on connections between neurons in all of the cascaded feature extraction modules.
The identification supervisory signals "Id" are generated in the trainer 30 by classifying each of the fully-connected layer representations/outputs (i.e., formula (4)) of a single face image into one of N identities, wherein the classification errors are used as the identification supervisory signals.
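For illustration, one common way to obtain such a classification error is an N-way softmax classifier applied to each fully-connected output. The sketch below is a hypothetical NumPy version; the classifier parameters and function name are assumptions, not part of the disclosure.

```python
import numpy as np

def identification_error(fc_output, weight, bias, identity):
    """Softmax classification of one fully-connected representation into N identities.

    fc_output: (d,) FC-n representation of a single face image.
    weight, bias: parameters of the N-way softmax classifier, shapes (N, d) and (N,).
    identity: index of the ground-truth identity in [0, N).
    Returns the cross-entropy classification error used as the "Id" signal.
    """
    logits = weight @ fc_output + bias
    logits = logits - logits.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[identity])
```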
The verification supervisory signals in the trainer 30 are generated by verifying the fully-connected layer representations of two compared face images, respectively, in each of the feature extraction modules, to determine if the two compared face images belong to the same identity, wherein the verification errors are used as the verification supervisory signals. Given a pair of training face images, the feature extractor 10 extracts two feature vectors fi and fj from the two face images respectively in each of the feature extraction modules. The verification error is 
(1/2) ||fi − fj||2^2
if fi and fj are features of face images of the same identity, or 
(1/2) max (0, m − ||fi − fj||2) ^2
if fi and fj are features of face images of different identities, where ||fi − fj||2 is the Euclidean distance between the two feature vectors and m is a positive constant (margin). Errors arise if fi and fj are dissimilar for the same identity, or if fi and fj are similar for different identities.
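Written out explicitly, the verification error above takes the contrastive form sketched below; this is a NumPy illustration in which the function name is an assumption and the margin m is a design choice.

```python
import numpy as np

def verification_error(f_i, f_j, same_identity, m=1.0):
    """Verification error ("Ve" signal) between two feature vectors.

    Same identity: penalize a large Euclidean distance.
    Different identities: penalize a distance smaller than the margin m.
    """
    dist = np.linalg.norm(f_i - f_j)
    if same_identity:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, m - dist) ** 2
```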
Fig. 9 is a schematic flowchart illustrating the training process in the trainer 30. In step 101, the trainer 30 samples two face images and inputs them to the feature extractor 10, respectively, to obtain feature representations of each of the two face images in all fully-connected layers of the feature extractor 10. Then, in step 102, the trainer 30 calculates identification errors by classifying the feature representations of each face image in each fully-connected layer into one of a plurality of (N) identities. Simultaneously, in step 103, the trainer 30 calculates verification errors by verifying whether the feature representations of the two face images in each fully-connected layer are from the same identity. The identification and verification errors are used as identification and verification supervisory signals, respectively. In step 104, the trainer 30 back-propagates all identification and verification supervisory signals through the feature extractor 10 simultaneously, so as to update the weights on connections between neurons in the feature extractor 10. The identification and verification supervisory signals (or errors) simultaneously added to the fully-connected layers FC-n, for n = 1, 2, 3, 4, are back-propagated through the cascade of feature extraction modules down to the input image. During back-propagation, the errors obtained in each layer of the cascade are accumulated, and the weights on connections between neurons in the feature extractor 10 are updated according to the magnitude of the errors. Finally, in step 105, the trainer 30 judges whether the training process has converged, and repeats steps 101-104 if a convergence point has not been reached.
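A minimal sketch of steps 101-104 in a modern framework is given below. This is a hypothetical PyTorch illustration, not the disclosed implementation; the interface extractor(image) returning the list of FC-1 ... FC-4 outputs, the per-layer classifier heads, the loss weight lambda_ve, and single-image batches are all assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(extractor, classifiers, optimizer, img_a, img_b,
                  id_a, id_b, margin=1.0, lambda_ve=0.05):
    """One joint identification-verification update over a pair of face images.

    extractor(img) is assumed to return the list of FC-1 ... FC-4 outputs,
    each of shape (1, d_n); classifiers[n] maps the n-th FC output to N
    identity logits; id_a and id_b are 1-element LongTensors of identities.
    """
    optimizer.zero_grad()
    feats_a = extractor(img_a)
    feats_b = extractor(img_b)

    loss = 0.0
    for fa, fb, clf in zip(feats_a, feats_b, classifiers):
        # Identification supervisory signal: classify each FC representation.
        loss = loss + F.cross_entropy(clf(fa), id_a) + F.cross_entropy(clf(fb), id_b)

        # Verification supervisory signal: contrastive error on the pair.
        dist = torch.norm(fa - fb)
        if int(id_a) == int(id_b):
            ve = 0.5 * dist.pow(2)
        else:
            ve = 0.5 * torch.clamp(margin - dist, min=0.0).pow(2)
        loss = loss + lambda_ve * ve

    # Back-propagate all supervisory signals simultaneously and update weights.
    loss.backward()
    optimizer.step()
    return float(loss)
```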
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (21)

  1. An apparatus for face recognition, comprising:
    an extractor having a plurality of cascaded feature extraction modules, wherein each of the cascaded feature extraction modules comprises:
    a convolutional layer for extracting local features from input face images or from features extracted in a previous feature extraction module of the modules; and
    a fully-connected layer connected to the convolutional layer on a same feature extraction module and extracting global features from the extracted local features; and
    a recognizer configured to, in accordance with distances between the extracted global features, determine:
    if two face images of the input images are from a same identity, or 
    if one of the input images, as a probe face image, belongs to a same identity as one of gallery face images consisting of the input images.
  2. An apparatus of claim 1, wherein the convolutional layer in the first feature extraction module of the cascaded feature extraction modules is configured to extract the local features from the input face images, and the convolutional layer in each of following feature extraction modules is connected to the convolutional layer in a previous feature extraction module of the modules.
  3. An apparatus of claim 2, wherein the fully-connected layer in each feature extraction module is connected to the convolutional layer in a same feature extraction module of the modules.
  4. An apparatus of claim 3, further comprising:
    a trainer configured to update neural weights on connections between the convolutional layer in a first feature extraction module and an input layer containing the input face images, connections between each convolutional layer in a second to a last feature extraction modules and a corresponding convolutional layer in the previous feature extraction module, and connections between each convolutional layer and a corresponding fully-connected layer in the same feature extraction module, by  back-propagating identification supervisory signals and verification supervisory signals through the cascaded feature extraction modules.
  5. An apparatus of claim 4, wherein the features extracted in the last feature extraction module for each face image are sparsely organized in 2D with approximately half zero values and half positive values, and each of the features has approximately half of the time being zero and half of the time being positive over all face images.
  6. An apparatus of claim 4, wherein features extracted in a last feature extraction module are selective to both identities and identity-related attributes such that there are features which take either positive (activated) or zero (inhibited) values for all face images of a given identity or containing a given identity-related attribute.
  7. An apparatus of claim 6, wherein the identity-related attributes comprise sex and/or race.
  8. An apparatus of claim 4, wherein features extracted in the last feature extraction module are robust to image corruptions, wherein values of the feature remain largely unchanged under moderate image corruptions.
  9. An apparatus of claim 1, wherein the recognizer determines that two faces belong to the same identity if the determined feature distance thereof is smaller than a threshold, or one of the input images, as the probe face image, is belonging to the same identity as one of gallery face images consisting of the input images if their feature distance is the smallest compared to feature distances of the probe face image to all the other gallery face images.
  10. An apparatus of claim 9, wherein the feature distances comprise one selected from a group consisting of Euclidean distances, Joint Bayesian distances, cosine distances, and Hamming distances.
  11. An apparatus of claim 4, wherein features outputted from each fully-connected layer for a single face image are classified to one of a plurality of  identities, wherein classification errors are treated as the identification supervisory signals.
  12. An apparatus of claim 4, wherein features outputted from each fully-connected layer for two compared face images, respectively, are verified to determine if the two compared face images belong to the same identity, wherein verification errors are treated as the verification supervisory signals.
  13. A method for face recognition, comprising:
    extracting, by a trained neural network, local features of two or more input images;
    extracting, by the trained neural network, global features from the extracted local features;
    determining distances between the extracted global features; and
    determining, in accordance with the determined distances, if two face images of the input images are from the same identity for face verification or if one of the input images, as a probe face image, belongs to a same identity as one of gallery face images consisting of the input images.
  14. A method of claim 13, wherein the neural network comprises a plurality of cascaded feature extraction modules, each of the feature extraction modules having a convolutional layer, and wherein the convolutional layer in the first feature extraction module of the cascaded feature extraction modules is connected to the input face images and the convolutional layer in each of the following feature extraction modules is connected to the convolutional layer in the previous feature extraction module.
  15. A method of claim 14, wherein each of the feature extraction modules further comprises a fully-connected layer, the fully-connected layer in each feature extraction module being connected to the convolutional layer in the same feature extraction module.
  16. A method of claim 15, further comprising:
    updating neural weights on connections between the convolutional layer in the  first feature extraction module and an input layer containing an input face image, connections between each convolutional layer in a second to a last feature extraction modules and the corresponding convolutional layer in the previous feature extraction module, and connections between each convolutional layer and a corresponding fully-connected layer in the same feature extraction module, by back-propagating identification supervisory signals and verification supervisory signals through the cascaded feature extraction modules.
  17. A method of claim 16, wherein the updating further comprises:
    inputting two face images to the neural network, respectively, to get feature representations of each of the two face images;
    calculating identification errors by classifying feature representations of each face image in each fully-connected layer of the neural network into one of a plurality of identities;
    calculating verification errors by verifying if feature representations of two face images, respectively, in each fully-connected layer are from the same identity, the identification and verification errors being treated as the identification and verification supervisory signals, respectively; and
    back-propagating the identification and verification supervisory signals through the neural network simultaneously, so as to update the neural weights on connections between the convolutional layer in the first feature extraction module and the input layer containing an input face image, connections between each convolutional layer in the second to the last feature extraction modules and the corresponding convolutional layer in the previous feature extraction module, and connections between each convolutional layer and the corresponding fully-connected layer in the same feature extraction module.
  18. A method of claim 16, wherein the features extracted in the last feature extraction module for each face image are sparsely organized in 2D with approximately half zero values and half positive values, and each of the features has approximately half of the time being zero and half of the time being positive over all face images.
  19. A method of claim 16, wherein features extracted in the last feature  extraction module are selective to both identities and identity-related attributes such that there are features which take either positive (activated) or zero (inhibited) values for all face images of a given identity or containing a given identity-related attribute.
  20. A method of claim 19, wherein the identity-related attributes comprise sex and/or race.
  21. A method of claim 13, wherein the determining further comprises:
    determining that two faces belong to the same identity if the determined feature distance thereof is smaller than a threshold, or that one of the input images, as the probe face image, belongs to a same identity as one of gallery face images consisting of the input images if their feature distance is the smallest compared to feature distances of the probe face image to all the other gallery face images.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480083717.5A CN107004115B (en) 2014-12-03 2014-12-03 Method and system for face recognition
PCT/CN2014/001091 WO2016086330A1 (en) 2014-12-03 2014-12-03 A method and a system for face recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/001091 WO2016086330A1 (en) 2014-12-03 2014-12-03 A method and a system for face recognition

Publications (1)

Publication Number Publication Date
WO2016086330A1 true WO2016086330A1 (en) 2016-06-09

Family

ID=56090783

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/001091 Ceased WO2016086330A1 (en) 2014-12-03 2014-12-03 A method and a system for face recognition

Country Status (2)

Country Link
CN (1) CN107004115B (en)
WO (1) WO2016086330A1 (en)

Also Published As

Publication number Publication date
CN107004115A (en) 2017-08-01
CN107004115B (en) 2019-02-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14907199; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14907199; Country of ref document: EP; Kind code of ref document: A1)