
WO2016086330A1 - A method and a system for face recognition - Google Patents

A method and a system for face recognition

Info

Publication number
WO2016086330A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature extraction
features
extraction module
images
convolutional layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2014/001091
Other languages
French (fr)
Inventor
Xiaoou Tang
Yi Sun
Xiaogang Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201480083717.5A priority Critical patent/CN107004115B/en
Priority to PCT/CN2014/001091 priority patent/WO2016086330A1/en
Publication of WO2016086330A1 publication Critical patent/WO2016086330A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/192 - Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 - References adjustable by an adaptive method, e.g. learning


Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Collating Specific Patterns (AREA)

Abstract

Disclosed are an apparatus and a method for face recognition. The apparatus may comprise: an extractor having a plurality of cascaded feature extraction modules, wherein each of the cascaded feature extraction modules comprises a convolutional layer for extracting local features from input face images or from features extracted in a previous feature extraction module of the modules; and a fully-connected layer connected to the convolutional layer in the same feature extraction module and extracting global features from the extracted local features. The apparatus may further comprise a recognizer configured to, in accordance with distances between the extracted global features, determine: if two face images of the input images are from a same identity, or if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images.

Description

A METHOD AND A SYSTEM FOR FACE RECOGNITION
Technical Field
The present application relates to a method for face recognition and a system thereof.
Background
Only very recently has deep learning achieved great success on face recognition, significantly outperforming systems using low-level features. There are two notable breakthroughs. The first is large-scale face identification with deep neural networks. By classifying the face images into thousands or even millions of identities, the last hidden layer forms features highly discriminative to identities. The second is supervising the deep neural networks with both the identification and verification tasks. The verification task minimizes the distance between features of the same identity, and decreases intra-personal variations. By combining features learned from many face regions, joint identification-verification achieved the current state-of-the-art 99.15% face verification accuracy on the most extensively evaluated LFW face recognition dataset.
There have been works on first learning attribute classifiers and then using attribute predictions for face recognition. In addition, sparse representation-based classification was extensively studied for face recognition with occlusions. The Robust Boltzmann Machine has been proposed to distinguish corrupted pixels and learn latent representations. These methods design components that explicitly handle occlusions.
Summary
There have been works on first learning attribute classifiers and then using attribute predictions for face recognition, while this application tries the inverse: first predicting the identities, and then using the learned identity-related features to predict attributes. It is observed that the features in higher layers of the neural networks are highly selective to identities and identity-related attributes such as sex and race. When an identity (who can be outside the training data) or attribute is presented, a subset of features can be identified which are constantly excited, and another subset of features can be identified which are constantly inhibited. A feature from either of these two subsets is a strong indicator of the existence or non-existence of this identity or attribute, and this application shows that such a single feature alone has high recognition accuracy for a particular identity or attribute. In other words, features in deep neural networks have sparsity on identities and attributes. Although the deep neural networks in this application are not taught to distinguish attributes during training, they have implicitly learned such high-level concepts. Directly employing features learned by deep neural networks leads to much higher classification accuracy on identity-related attributes than widely used handcrafted features such as high-dimensional LBP (Local Binary Pattern).
Contrary to the conventional sparse representation-based classification, this application shows that deep neural networks trained by natural web face images without artificial occlusion patterns added during training have implicitly encoded invariance to occlusions.
It is observed in this application that features learned by the deep neural networks are moderately sparse. For an input face image, around half of the features in the top hidden layer are activated. On the other hand, each feature is activated on roughly half of the face images. Such sparsity distributions can maximize the discriminative power of deep neural networks as well as the distance between images. Different identities have different subsets of features activated. Two images of the same identity have similar activation patterns. This motivates this application to binarize the real-valued features in the top hidden layer of deep neural networks and use the binary code for recognition. The result is surprisingly good: the verification accuracy on LFW drops only slightly, by less than 1%. This has a significant impact on large-scale face search, since huge amounts of storage and computation time are saved. It also implies that binary activation patterns are more important than activation magnitudes in the deep neural networks.
In one aspect of the present application, disclosed is an apparatus for face recognition. The apparatus may comprise a feature extractor and a recognizer. The feature extractor is configured with a plurality of cascaded feature extraction modules, wherein each of the feature extraction modules comprises a convolutional layer for extracting local features from input face images or from features extracted in a previous feature extraction module of the modules; and a fully-connected layer connected to the convolutional layer in the same feature extraction module and extracting global features from the extracted local features. The recognizer is configured to, in accordance with distances between the extracted global features, determine: if two face images of the input images are from a same identity, or if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images.
In one embodiment of the present application, the convolutional layer in the first feature extraction module of the cascaded feature extraction modules is connected to an input face image, and the convolutional layer in each of the following feature extraction modules is connected to the convolutional layer in the previous feature extraction module. The fully-connected layer in each feature extraction module is connected to the convolutional layer in the same feature extraction module.
The apparatus may further comprise a trainer configured to update neuron weights on connections between each convolutional layer and the corresponding fully-connected layer in the same feature extraction module by back-propagating identification supervisory signals and verification supervisory signals through the cascaded feature extraction modules.
The process of the updating may comprise: inputting two face images to the neural network, respectively, to get feature representations of each of the two face images; calculating identification errors by classifying feature representations of each face image in each fully-connected layer of the neural network into one of a plurality of identities; calculating verification errors by verifying if feature representations of two face images, respectively, in each fully-connected layer are from the same identity, the identification and verification errors being treated as identification and verification supervisory signals, respectively; and back-propagating all identification and verification supervisory signals through the neural network simultaneously, so as to update the neuron weights on connections between each convolutional layer and the corresponding fully-connected layer in the same feature extraction module.
The present application discovers and proves three properties of features extracted in later feature extraction modules, i.e., sparsity, selectiveness, and robustness, all of which are critical for face recognition. Features are sparse in the sense both that features of each face image have approximately half zero values and half positive values, and that each feature is zero approximately half of the time and positive half of the time over all face images. Features are selective to both identities and identity-related attributes such as sex and race, in the sense that there are features which take either positive (activated) or zero (inhibited) values for all face images of a given identity or containing a given identity-related attribute. Features are robust to image corruptions such as occlusions, wherein feature values remain largely unchanged under moderate image corruptions.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating an apparatus for face recognition consistent with some disclosed embodiments.
Fig. 2 is a schematic diagram illustrating the sparsity, selectiveness, and robustness of features extracted in later feature extraction modules.
Fig. 3 is a schematic diagram illustrating structures of cascaded feature extraction modules in the feature extractor, as well as input face images and supervisory signals in the trainer.
Fig. 4 is schematic histograms illustrating the sparsity of activated features (neurons) on individual face images as well as the sparsity of individual features (neurons) activated on all face images.
Fig. 5 is schematic histograms illustrating the selective activation and inhibition of features on face images of particular identities.
Fig. 6 is schematic histograms illustrating the selective activation and inhibition of features on face images containing particular attributes.
Fig. 7 is a schematic diagram illustrating face images with random block occlusions, which are used to test the robustness of features extracted by the feature extractor against image corruptions.
Fig. 8 is a schematic diagram illustrating the mean feature activations over face images of individual identities under various degrees of random block occlusions.
Fig. 9 is a schematic flowchart illustrating the trainer as shown in Fig. 1 consistent with some disclosed embodiments.
Fig. 10 is a schematic flowchart illustrating the feature extractor as  shown in Fig. 1 consistent with some disclosed embodiments.
Fig. 11 is a schematic flowchart illustrating the recognizer as shown in Fig. 1 consistent with some disclosed embodiments.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a" , "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising, " when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc. ) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit, ” “module” or “system. ” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
It is further understood that the use of relational terms such as first and second, and the like, if any, are used solely to distinguish one from another entity, item, or action without necessarily requiring or implying any actual such relationship or order between such entities, items or actions.
Much of the inventive functionality and many of the inventive principles when implemented, are best supported with or in software or integrated circuits (ICs) , such as a digital signal processor and software therefore or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions or ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments.
Fig. 1 is a schematic diagram illustrating an exemplary apparatus 100 for face recognition consistent with some disclosed embodiments. As shown, the apparatus 100 may comprise a feature extractor 10 and a recognizer 20. The feature extractor 10 is configured to extract features from input face images. In one embodiment of the present application, the feature extractor 10 may comprise a neural network which may be constructed with a plurality of cascaded feature extraction modules, wherein each feature extraction module in the cascade comprises a convolutional layer and a fully-connected layer. The cascaded feature extraction modules may be implemented by software, integrated circuits (ICs) or the combination thereof. Fig. 3 illustrates a schematic diagram for structures of cascaded feature extraction modules in the feature extractor 10. As shown, the convolutional layer in the first feature extraction module of the cascaded feature extraction modules is connected to an input face image, and the convolutional layer in each of the following feature extraction modules is connected to the convolutional layer in the previous feature extraction module. The fully-connected layer in each feature extraction module is connected to the convolutional layer in the same feature extraction module.
Referring to Fig. 1, to enable the neural network to work effectively, the apparatus 100 further comprises a trainer 30 configured to update neural weights on  connections between the convolutional layer in the first feature extraction module and the input layer containing an input face image, connections between each convolutional layer in the second to the last feature extraction modules and the corresponding convolutional layer in the previous feature extraction module, and connections between each convolutional layer and the corresponding fully-connected layer in the same feature extraction module, by back-propagating identification supervisory signals and verification supervisory signals through the cascaded feature extraction modules, such that features extracted in last/highest one of the cascaded feature extraction modules are sparse, selective, and robust, which will be discussed later.
The recognizer 20 may be implemented by software, integrated circuits (ICs) or a combination thereof, and is configured to calculate distances between features extracted from different face images to determine if two face images are from the same identity for face verification, or to determine if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification.
The Feature Extractor 10
The feature extractor 10 contains a plurality of cascaded feature extraction modules, and operates to extract features hierarchically from input face images. Fig. 3 illustrates an example of structures of cascaded feature extraction modules in the feature extractor 10, which comprises, for example, four cascaded feature extraction modules, each of which contains a convolutional layer Conv-n and a fully-connected layer FC-n for n = 1, ..., 4. The convolutional layer Conv-1 in the first feature extraction module of the feature extractor 10 is connected to an input face image as an input layer, while the convolutional layer Conv-n for n > 1 in each of the following feature extraction modules of the feature extractor 10 is connected to the convolutional layer Conv- (n-1) in the previous feature extraction module. The fully-connected layer FC-n in each feature extraction module of the feature extractor 10 is connected to the convolutional layer Conv-n in the same feature extraction module.
Fig. 10 is a schematic flowchart illustrating the feature extraction process in the feature extractor 10. In step 101, the feature extractor 10 forward propagates an input face image through the convolutional layers in all feature extraction modules of the feature extractor 10. Then in step 102, the feature extractor 10 forward propagates the outputs of each of the convolutional layers to the corresponding fully-connected layer within the same feature extraction module. Finally in step 103, it takes the outputs/representations from the last one of the fully-connected layers as features, as discussed below.
Convolutional layers in the feature extractor 10 are configured to extract local facial features (i.e. features extracted from local regions of the input images or the input features) from input images (for the first convolutional layer) or from feature maps (which are the output feature maps of the previous convolutional layer, followed by max pooling, as is well known in the art) to form the output feature maps of the current convolutional layer. Each feature map is a certain kind of features organized in 2D. The features in the same output feature map, or in local regions of the same feature map, are extracted from the input feature maps with the same set of neural connection weights w between the input feature maps and the output feature maps in the previous convolutional layers (followed by max pooling) and the current convolutional layers, respectively. The convolution operation in each convolutional layer may be expressed as
$$y^{r}_{j} = \max\Bigl(0,\; b^{r}_{j} + \sum_{i} k^{r}_{ij} * x^{r}_{i}\Bigr)$$
where $x_i$ and $y_j$ are the i-th input feature map and the j-th output feature map, respectively. $k_{ij}$ is the convolution kernel between the i-th input feature map and the j-th output feature map. $*$ denotes convolution. $b_j$ is the bias of the j-th output feature map. Herein, the ReLU nonlinearity $y = \max(0, x)$ is used for neurons. Weights in higher convolutional layers of the ConvNets are locally shared; $r$ indicates a local region where weights are shared.
Each convolutional layer may be followed by max-pooling formulated as
$$y^{i}_{j,k} = \max_{0 \le m,\, n < s} \bigl\{ x^{i}_{j \cdot s + m,\; k \cdot s + n} \bigr\}$$
where each neuron in the i-th output feature map $y^{i}$ pools over an $s \times s$ non-overlapping local region in the i-th input feature map $x^{i}$.
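For concreteness, a minimal NumPy sketch of these two operations follows. It assumes single-channel 2D feature maps and a "valid" convolution window implemented as cross-correlation (as is usual in ConvNets), and it omits the local weight sharing indexed by r; all sizes and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def conv_relu(x_maps, kernels, biases):
    """Convolution per the formula above: y_j = max(0, b_j + sum_i k_ij * x_i).

    x_maps:  list of 2D input feature maps, each of shape (H, W)
    kernels: kernels[i][j] is the 2D kernel between input map i and output map j
    biases:  biases[j] is the bias of output map j
    """
    n_in, n_out = len(x_maps), len(biases)
    kh, kw = kernels[0][0].shape
    H, W = x_maps[0].shape
    out_h, out_w = H - kh + 1, W - kw + 1
    y_maps = []
    for j in range(n_out):
        acc = np.full((out_h, out_w), biases[j], dtype=float)
        for i in range(n_in):
            k = kernels[i][j]
            for r in range(out_h):          # "valid" 2D cross-correlation
                for c in range(out_w):
                    acc[r, c] += np.sum(x_maps[i][r:r + kh, c:c + kw] * k)
        y_maps.append(np.maximum(0.0, acc))  # ReLU nonlinearity
    return y_maps

def max_pool(x_map, s):
    """Max pooling over s x s non-overlapping regions of a single feature map."""
    H, W = x_map.shape
    H, W = (H // s) * s, (W // s) * s        # drop ragged borders for simplicity
    blocks = x_map[:H, :W].reshape(H // s, s, W // s, s)
    return blocks.max(axis=(1, 3))

# Tiny usage example with random maps and kernels (shapes are placeholders)
rng = np.random.default_rng(0)
x = [rng.standard_normal((8, 8)) for _ in range(2)]                        # 2 input maps
k = [[rng.standard_normal((3, 3)) for _ in range(4)] for _ in range(2)]    # 2x4 kernels
b = rng.standard_normal(4)                                                 # 4 output maps
pooled = [max_pool(m, 2) for m in conv_relu(x, k, b)]
print([m.shape for m in pooled])                                           # [(3, 3)] * 4
```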
Each of the fully-connected layers in the feature extractor 10 is configured to extract global features (features extracted from the entire region of the input feature maps) from the feature maps obtained from the convolutional layer in the same module. That is, the fully-connected layer FC-n extracts global features from the convolutional layer Conv-n. The fully-connected layers also serve as interfaces for receiving supervisory signals during training and outputting features during feature extraction. Fully-connected layers may be formulated as
$$y_{j} = \max\Bigl(0,\; \sum_{i} x_{i} \cdot w_{i,j} + b_{j}\Bigr) \qquad (4)$$
where $x_i$ represents the output of the i-th neuron in the previous convolutional layer (followed by max-pooling), $y_j$ represents the output of the j-th neuron in the current fully-connected layer, $w_{i,j}$ is the weight on the connection between the i-th neuron in the previous convolutional layer (followed by max-pooling) and the j-th neuron in the current fully-connected layer, and $b_j$ is the bias of the j-th neuron in the current fully-connected layer. $\max(0, x)$ is the ReLU non-linearity.
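A corresponding sketch of formula (4) is given below; the input and output dimensions are placeholders chosen only for illustration (the 512-dimensional output echoes the FC-4 example discussed later, but is an assumption here).

```python
import numpy as np

def fully_connected_relu(x, w, b):
    """Formula (4): y_j = max(0, sum_i x_i * w_ij + b_j)."""
    return np.maximum(0.0, x @ w + b)

rng = np.random.default_rng(0)
x = rng.standard_normal(1152)                  # e.g. flattened pooled feature maps
w = rng.standard_normal((1152, 512)) * 0.01    # 512 output neurons (illustrative)
b = np.zeros(512)
features = fully_connected_relu(x, w, b)
print(features.shape, float((features > 0).mean()))  # 512-dim, roughly half activated
```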
Features extracted in the last/highest feature extraction module of the feature extractor 10, e.g., those in the FC-4 layer as shown in Fig. 3, are sparse, selective, and robust. Features are sparse in the sense both that features of each face image have approximately half zero values and half positive values, and that each feature is zero approximately half of the time and positive half of the time over all face images. Features are selective to both identities and identity-related attributes such as sex and race, in the sense that there are features which take either positive (activated) or zero (inhibited) values for all face images of a given identity or containing a given identity-related attribute. Features are robust to image corruptions such as occlusions, wherein feature values remain largely unchanged under moderate image corruptions. The sparse features can be converted to binary code by comparing to a threshold, wherein the binary code can be used for face recognition.
Fig. 2 illustrates the three properties, sparsity, selectiveness, and robustness, of features extracted in the FC-4 layer. Fig. 2 left shows features on three face images of Bush and one face image of Powell. The second face image of Bush is partially occluded. In one embodiment of the present application, there are 512 features in the FC-4 layer, from which Fig. 2 subsamples 32 for illustration as an example. Features are sparsely activated on each face image, in which approximately half of the features are positive and half are zero. Features of face images of the same identity have similar activation patterns while being different for different identities. Features are robust in that when occlusions are present, as shown on the second face of Bush, the activation patterns of features remain largely unchanged. Fig. 2 right shows activation histograms of a few selected features over all face images (as background), all images belonging to Bush, all images with attribute “male”, and all images with attribute “female”. A feature is generally activated on about half of the face images, but it may constantly have activations (or no activation) for all images belonging to a particular identity or attribute. In this sense, features are sparse, and selective to identities and attributes.
The moderate sparsity on images makes faces of different identities maximally distinguishable, while the moderate sparsity on features gives them maximum discrimination ability. Fig. 4 left shows the histogram of the number of activated (positive) features on each of 46,594 (for example) face images in a validating dataset, and Fig. 4 right shows the histogram of the number of images on which each feature is activated (positive). The evaluation is based on features extracted by the FC-4 layer. Compared to all 512 (for example) features in the FC-4 layer in one embodiment of the present application, the mean and standard deviation of the number of activated neurons per image is 292 ± 34, while compared to all 46,594 validating images, the mean and standard deviation of the number of images on which each feature is activated is 26,565 ± 5,754, both of which are approximately centered at half of all features/images.
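The two histograms of Fig. 4 amount to simple per-image and per-feature counts, which the following sketch computes on synthetic stand-in features (the image count and feature dimensionality are placeholders, not the patent's data).

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_features = 1000, 512
# Stand-in for FC-4 outputs: ReLU of roughly zero-centred pre-activations,
# so about half of the entries are zero and half are positive.
features = np.maximum(0.0, rng.standard_normal((n_images, n_features)))

activated = features > 0
per_image = activated.sum(axis=1)     # activated features per face image (Fig. 4 left)
per_feature = activated.sum(axis=0)   # images on which each feature is activated (Fig. 4 right)

print(f"activated features per image:   {per_image.mean():.0f} +/- {per_image.std():.0f} of {n_features}")
print(f"images activating each feature: {per_feature.mean():.0f} +/- {per_feature.std():.0f} of {n_images}")
```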
The activation patterns, i.e., whether features are activated (with positive values), are more important than the precise activation values. Converting feature activations to binary code by thresholding sacrifices less than 1% face verification accuracy. This shows that the state of excitation or inhibition of features already contains the majority of the discriminative information. Binary code is economical for storage and fast for image search.
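The following sketch illustrates the thresholding described above: converting real-valued features to a binary code and comparing codes by Hamming distance. The zero threshold, 512-dimensional features, and synthetic inputs are assumptions for illustration only.

```python
import numpy as np

def binarize(features, threshold=0.0):
    """Binary code: 1 where a feature is activated (above the threshold), else 0."""
    return (features > threshold).astype(np.uint8)

def hamming_distance(code_a, code_b):
    """Number of positions where two binary codes differ."""
    return int(np.count_nonzero(code_a != code_b))

rng = np.random.default_rng(0)
a1 = rng.standard_normal(512)              # pre-activations for one face image
a2 = a1 + 0.1 * rng.standard_normal(512)   # same identity: small perturbation
a3 = rng.standard_normal(512)              # different identity: independent pattern
f1, f2, f3 = (np.maximum(0.0, a) for a in (a1, a2, a3))
c1, c2, c3 = binarize(f1), binarize(f2), binarize(f3)
print(hamming_distance(c1, c2), hamming_distance(c1, c3))  # few flipped bits vs. roughly half
```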
Fig. 5 and Fig. 6 show examples of activation histograms of features over given identities and attributes, respectively. Histograms over given identities exhibit strong selectiveness. Some features are constantly activated for a given identity, with histograms distributed over values greater than zero, as shown in the first two rows in Fig. 5, while some others are constantly inhibited, with histograms accumulated at zero or small values, as shown in the last two rows in Fig. 5. For attributes, each row of Fig. 6 shows histograms of a single feature over a few related attributes (those related to sex, race, and age). The selected features are excitatory on each of the attributes given at the left of each row. As shown in Fig. 6, features exhibit strong selectiveness to sex, race, and certain ages such as child and senior, in which features are strongly activated for a given attribute while inhibited for other attributes in the same category. For some other attributes such as youth and middle aged, the selectiveness is weak, in which there are no features solely activated for each of these attributes. This is because ages do not exactly correspond to identities. For example, in face recognition, features have to be invariant to the same identity photographed when both young and middle-aged.
Fig. 7 and Fig. 8 illustrate the robustness of features extracted in later feature extraction modules (the FC-4 layer) against image corruptions. Face images are occluded by random blocks with various sizes from 10×10 to 70×70, as illustrated in Fig. 7. Fig. 8 shows mean feature activations over images with random block occlusions, in which each column shows the mean activation over face images of a single identity given at the top of each column, with various degrees of occlusion given at the left of each row. Feature values are mapped to a color map with warm colors indicating positive values and cool colors indicating zero or small values. The order of features in each column is sorted by the mean feature activation values on the original face images of each identity, respectively. As can be seen in Fig. 8, the activation patterns remain largely unchanged (with most activated features still being activated and most inhibited features still being inhibited) until a large degree of occlusion is reached.
The Recognizer 20
The recognizer 20 operates to calculate distances between the global features extracted for different face images by the fully-connected layer of the feature extractor 10, to determine if two face images are from the same identity for face verification, or to determine if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification. Fig. 11 is a schematic flowchart illustrating the recognition process in the recognizer 20. In step 201, the recognizer 20 calculates distances between features (i.e. the global features for different face images extracted by the fully-connected layer) extracted from different face images by the feature extractor 10. Then in step 202, the recognizer 20 determines if two face images are from the same identity for face verification, or, alternatively, in step 203, it determines whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification.
In the recognizer 20, two face images are determined to belong to the same identity if their feature distance is smaller than a threshold, and the probe face image is determined to belong to the same identity as one of the gallery face images if their feature distance is the smallest among the feature distances of the probe face image to all gallery face images. The feature distances determined by the recognizer 20 could be Euclidean distances, Joint Bayesian distances, cosine distances, Hamming distances, or any other distances.
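A minimal sketch of these two decision rules, using Euclidean distance, is given below; the threshold value, gallery size, and feature dimensionality are illustrative assumptions.

```python
import numpy as np

def verify(features_a, features_b, threshold):
    """Face verification: same identity if the feature distance is below a threshold."""
    return float(np.linalg.norm(features_a - features_b)) < threshold

def identify(probe_features, gallery_features):
    """Face identification: index of the gallery image with the smallest feature distance."""
    distances = [float(np.linalg.norm(probe_features - g)) for g in gallery_features]
    return int(np.argmin(distances))

rng = np.random.default_rng(0)
gallery = [np.maximum(0.0, rng.standard_normal(512)) for _ in range(5)]
probe = gallery[3] + 0.05 * rng.standard_normal(512)   # noisy copy of gallery entry 3
print(verify(probe, gallery[3], threshold=4.0), identify(probe, gallery))  # True 3
```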
In one embodiment of the present application, Joint Bayesian distances are used as the feature distances. Joint Bayesian is a popular similarity metric for faces; it represents the extracted facial features x (after subtracting the mean) as the sum of two independent Gaussian variables
x = μ + ε,   (5)
where μ ~ N(0, Sμ) represents the face identity and ε ~ N(0, Sε) represents the intra-personal variations. Joint Bayesian models the joint probability of two faces under the intra-personal and extra-personal variation hypotheses, P(x1, x2|HI) and P(x1, x2|HE). It is readily shown from Equation (5) that these two probabilities are also Gaussian, with covariances
ΣI = [ Sμ + Sε    Sμ
       Sμ         Sμ + Sε ]
and
ΣE = [ Sμ + Sε    0
       0          Sμ + Sε ]
respectively, where ΣI and ΣE are the covariances of the stacked vector [x1; x2] under HI and HE. Sμ and Sε can be learned from data with the EM algorithm. At test time, the recognizer calculates the likelihood ratio
r(x1, x2) = log ( P(x1, x2|HI) / P(x1, x2|HE) ) ,
which has closed-form solutions and is efficient.
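A direct (though not the most efficient) way to evaluate this ratio is to stack the two mean-subtracted feature vectors and compare the two zero-mean Gaussian densities with the covariances above. The NumPy sketch below is an illustration only; it assumes Sμ and Sε have already been estimated by the EM algorithm, and in practice the closed-form solution mentioned above is preferred over building the 2d×2d covariances explicitly.

```python
import numpy as np

def joint_bayesian_ratio(x1, x2, S_mu, S_eps):
    """Log-likelihood ratio log P(x1, x2 | HI) / P(x1, x2 | HE) for
    mean-subtracted feature vectors x1 and x2 of dimension d."""
    d = x1.shape[0]
    z = np.concatenate([x1, x2])

    # Covariance of [x1; x2] under the intra-personal hypothesis HI.
    sigma_I = np.block([[S_mu + S_eps, S_mu],
                        [S_mu, S_mu + S_eps]])
    # Covariance under the extra-personal hypothesis HE (independent identities).
    zero = np.zeros((d, d))
    sigma_E = np.block([[S_mu + S_eps, zero],
                        [zero, S_mu + S_eps]])

    def log_density(v, sigma):
        # Zero-mean multivariate Gaussian log density in 2*d dimensions.
        _, logdet = np.linalg.slogdet(sigma)
        return -0.5 * (v @ np.linalg.solve(sigma, v) + logdet + 2 * d * np.log(2 * np.pi))

    return log_density(z, sigma_I) - log_density(z, sigma_E)
```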
The Trainer 30
The trainer 30 is used to update the weights w on connections between neurons in the convolutional and fully-connected layers of the feature extractor 10, by inputting initial weights on connections between neurons in the convolutional and fully-connected layers of the feature extractor, a plurality of identification supervisory signals, and a plurality of verification supervisory signals, such that the features extracted in the last one of the cascaded feature extraction modules in the extractor are sparse, selective, and robust.
As shown in Fig. 3, the identification and verification supervisory signals in the trainer 30, denoted "Id" and "Ve" respectively, are simultaneously added to each of the fully-connected layers FC-n, for n = 1, ..., 4, in each of the feature extraction modules of the feature extractor 10, and are back-propagated to the input face image so as to update the weights on connections between neurons in all of the cascaded feature extraction modules.
The identification supervisory signals "Id" are generated in the trainer 30 by classifying each of the fully-connected layer representations/outputs (i.e., formula (4)) of a single face image into one of N identities, wherein the classification errors are used as the identification supervisory signals.
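For illustration, one common way to obtain such a classification error is an N-way softmax classifier applied to each fully-connected output. The sketch below is a hypothetical NumPy version; the classifier parameters and function name are assumptions, not part of the disclosure.

```python
import numpy as np

def identification_error(fc_output, weight, bias, identity):
    """Softmax classification of one fully-connected representation into N identities.

    fc_output: (d,) FC-n representation of a single face image.
    weight, bias: parameters of the N-way softmax classifier, shapes (N, d) and (N,).
    identity: index of the ground-truth identity in [0, N).
    Returns the cross-entropy classification error used as the "Id" signal.
    """
    logits = weight @ fc_output + bias
    logits = logits - logits.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[identity])
```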
The verification supervisory signals in the trainer 30 are generated by verifying the fully-connected layer representations of two compared face images, respectively, in each of the feature extraction modules, to determine if the two compared face images belong to the same identity, wherein the verification errors are used as the verification supervisory signals. Given a pair of training face images, the feature extractor 10 extracts two feature vectors fi and fj from the two face images respectively in each of the feature extraction modules. The verification error is 
(1/2) ||fi − fj||2^2
if fi and fj are features of face images of the same identity, or 
(1/2) max (0, m − ||fi − fj||2) ^2
if fi and fj are features of face images of different identities, where ||fi − fj||2 is the Euclidean distance between the two feature vectors and m is a positive constant (margin). Errors arise if fi and fj are dissimilar for the same identity, or if fi and fj are similar for different identities.
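Written out explicitly, the verification error above takes the contrastive form sketched below; this is a NumPy illustration in which the function name is an assumption and the margin m is a design choice.

```python
import numpy as np

def verification_error(f_i, f_j, same_identity, m=1.0):
    """Verification error ("Ve" signal) between two feature vectors.

    Same identity: penalize a large Euclidean distance.
    Different identities: penalize a distance smaller than the margin m.
    """
    dist = np.linalg.norm(f_i - f_j)
    if same_identity:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, m - dist) ** 2
```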
Fig. 9 is a schematic flowchart illustrating the training process in the trainer 30. In step 101, the trainer 30 samples two face images and inputs them to the feature extractor 10, respectively, to obtain feature representations of each of the two face images in all fully-connected layers of the feature extractor 10. Then, in step 102, the trainer 30 calculates identification errors by classifying the feature representations of each face image in each fully-connected layer into one of a plurality of (N) identities. Simultaneously, in step 103, the trainer 30 calculates verification errors by verifying whether the feature representations of the two face images in each fully-connected layer are from the same identity. The identification and verification errors are used as identification and verification supervisory signals, respectively. In step 104, the trainer 30 back-propagates all identification and verification supervisory signals through the feature extractor 10 simultaneously, so as to update the weights on connections between neurons in the feature extractor 10. The identification and verification supervisory signals (or errors) simultaneously added to the fully-connected layers FC-n, for n = 1, 2, 3, 4, are back-propagated through the cascade of feature extraction modules down to the input image. During back-propagation, the errors obtained in each layer of the cascade are accumulated, and the weights on connections between neurons in the feature extractor 10 are updated according to the magnitude of the errors. Finally, in step 105, the trainer 30 judges whether the training process has converged, and repeats steps 101-104 if a convergence point has not been reached.
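A minimal sketch of steps 101-104 in a modern framework is given below. This is a hypothetical PyTorch illustration, not the disclosed implementation; the interface extractor(image) returning the list of FC-1 ... FC-4 outputs, the per-layer classifier heads, the loss weight lambda_ve, and single-image batches are all assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(extractor, classifiers, optimizer, img_a, img_b,
                  id_a, id_b, margin=1.0, lambda_ve=0.05):
    """One joint identification-verification update over a pair of face images.

    extractor(img) is assumed to return the list of FC-1 ... FC-4 outputs,
    each of shape (1, d_n); classifiers[n] maps the n-th FC output to N
    identity logits; id_a and id_b are 1-element LongTensors of identities.
    """
    optimizer.zero_grad()
    feats_a = extractor(img_a)
    feats_b = extractor(img_b)

    loss = 0.0
    for fa, fb, clf in zip(feats_a, feats_b, classifiers):
        # Identification supervisory signal: classify each FC representation.
        loss = loss + F.cross_entropy(clf(fa), id_a) + F.cross_entropy(clf(fb), id_b)

        # Verification supervisory signal: contrastive error on the pair.
        dist = torch.norm(fa - fb)
        if int(id_a) == int(id_b):
            ve = 0.5 * dist.pow(2)
        else:
            ve = 0.5 * torch.clamp(margin - dist, min=0.0).pow(2)
        loss = loss + lambda_ve * ve

    # Back-propagate all supervisory signals simultaneously and update weights.
    loss.backward()
    optimizer.step()
    return float(loss)
```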
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (21)

  1. An apparatus for face recognition, comprising:
    an extractor having a plurality of cascaded feature extraction modules, wherein each of the cascaded feature extraction modules comprises:
    a convolutional layer for extracting local features from input face images or from features extracted in a previous feature extraction module of the modules; and
    a fully-connected layer connected to the convolutional layer on a same feature extraction module and extracting global features from the extracted local features; and
    a recognizer configured to, in accordance with distances between the extracted global features, determine:
    if two face images of the input images are from a same identity, or 
    if one of the input images, as a probe face image, belongs to a same identity as one of gallery face images consisting of the input images.
  2. An apparatus of claim 1, wherein the convolutional layer in the first feature extraction module of the cascaded feature extraction modules is configured to extract the local features from the input face images, and the convolutional layer in each of following feature extraction modules is connected to the convolutional layer in a previous feature extraction module of the modules.
  3. An apparatus of claim 2, wherein the fully-connected layer in each feature extraction module is connected to the convolutional layer in a same feature extraction module of the modules.
  4. An apparatus of claim 3, further comprising:
    a trainer configured to update neural weights on connections between the convolutional layer in a first feature extraction module and an input layer containing the input face images, connections between each convolutional layer in a second to a last feature extraction modules and a corresponding convolutional layer in the previous feature extraction module, and connections between each convolutional layer and a corresponding fully-connected layer in the same feature extraction module, by  back-propagating identification supervisory signals and verification supervisory signals through the cascaded feature extraction modules.
  5. An apparatus of claim 4, wherein the features extracted in the last feature extraction module for each face image are sparsely organized in 2D with approximately half zero values and half positive values, and each of the features has approximately half of the time being zero and half of the time being positive over all face images.
  6. An apparatus of claim 4, wherein features extracted in a last feature extraction module are selective to both identities and identity-related attributes such that there are features which take either positive (activated) or zero (inhibited) values for all face images of a given identity or containing a given identity-related attribute.
  7. An apparatus of claim 6, wherein the identity-related attributes comprise sex and/or race.
  8. An apparatus of claim 4, wherein features extracted in the last feature extraction module are robust to image corruptions, wherein values of the feature remain largely unchanged under moderate image corruptions.
  9. An apparatus of claim 1, wherein the recognizer determines that two faces belong to the same identity if the determined feature distance thereof is smaller than a threshold, or one of the input images, as the probe face image, is belonging to the same identity as one of gallery face images consisting of the input images if their feature distance is the smallest compared to feature distances of the probe face image to all the other gallery face images.
  10. An apparatus of claim 9, wherein the feature distances comprise one selected from a group consisting of Euclidean distances, Joint Bayesian distances, cosine distances, and Hamming distances.
  11. An apparatus of claim 4, wherein features outputted from each fully-connected layer for a single face image are classified to one of a plurality of  identities, wherein classification errors are treated as the identification supervisory signals.
  12. An apparatus of claim 4, wherein features outputted from each fully-connected layer for two compared face images, respectively, are verified to determine if the two compared face images belong to the same identity, wherein verification errors are treated as the verification supervisory signals.
  13. A method for face recognition, comprising:
    extracting, by a trained neural network, local features of two or more input images;
    extracting, by the trained neural network, global features from the extracted local features;
    determining distances between the extracted global features; and
    determining, in accordance with the determined distances, if two face images of the input images are from the same identity for face verification or if one of the input images, as a probe face image, belongs to a same identity as one of gallery face images consisting of the input images.
  14. A method of claim 13, wherein the neural network comprises a plurality of cascaded feature extraction modules, each of the feature extraction modules having a convolutional layer, and wherein the convolutional layer in the first feature extraction module of the cascaded feature extraction modules is connected to the input face images and the convolutional layer in each of the following feature extraction modules is connected to the convolutional layer in the previous feature extraction module.
  15. A method of claim 14, wherein each of the feature extraction modules further comprises a fully-connected layer, the fully-connected layer in each feature extraction module being connected to the convolutional layer in the same feature extraction module.
  16. A method of claim 15, further comprising:
    updating neural weights on connections between the convolutional layer in the  first feature extraction module and an input layer containing an input face image, connections between each convolutional layer in a second to a last feature extraction modules and the corresponding convolutional layer in the previous feature extraction module, and connections between each convolutional layer and a corresponding fully-connected layer in the same feature extraction module, by back-propagating identification supervisory signals and verification supervisory signals through the cascaded feature extraction modules.
  17. A method of claim 16, wherein the updating further comprises:
    inputting two face images to the neural network, respectively, to get feature representations of each of the two face images;
    calculating identification errors by classifying feature representations of each face image in each fully-connected layer of the neural network into one of a plurality of identities;
    calculating verification errors by verifying if feature representations of two face images, respectively, in each fully-connected layer are from the same identity, the identification and verification errors being treated as the identification and verification supervisory signals, respectively; and
    back-propagating the identification and verification supervisory signals through the neural network simultaneously, so as to update the neural weights on connections between the convolutional layer in the first feature extraction module and the input layer containing an input face image, connections between each convolutional layer in the second to the last feature extraction modules and the corresponding convolutional layer in the previous feature extraction module, and connections between each convolutional layer and the corresponding fully-connected layer in the same feature extraction module.
  18. A method of claim 16, wherein the features extracted in the last feature extraction module for each face image are sparsely organized in 2D with approximately half zero values and half positive values, and each of the features has approximately half of the time being zero and half of the time being positive over all face images.
  19. A method of claim 16, wherein features extracted in the last feature  extraction module are selective to both identities and identity-related attributes such that there are features which take either positive (activated) or zero (inhibited) values for all face images of a given identity or containing a given identity-related attribute.
  20. A method of claim 19, wherein the identity-related attributes comprise sex and/or race.
  21. A method of claim 13, wherein the determining further comprises:
    determining that two faces belong to the same identity if the determined feature distance thereof is smaller than a threshold, or that one of the input images, as the probe face image, belongs to a same identity as one of gallery face images consisting of the input images if their feature distance is the smallest compared to feature distances of the probe face image to all the other gallery face images.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480083717.5A CN107004115B (en) 2014-12-03 2014-12-03 Method and system for face recognition
PCT/CN2014/001091 WO2016086330A1 (en) 2014-12-03 2014-12-03 A method and a system for face recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/001091 WO2016086330A1 (en) 2014-12-03 2014-12-03 A method and a system for face recognition

Publications (1)

Publication Number Publication Date
WO2016086330A1 true WO2016086330A1 (en) 2016-06-09

Family

ID=56090783

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/001091 Ceased WO2016086330A1 (en) 2014-12-03 2014-12-03 A method and a system for face recognition

Country Status (2)

Country Link
CN (1) CN107004115B (en)
WO (1) WO2016086330A1 (en)

Also Published As

Publication number Publication date
CN107004115A (en) 2017-08-01
CN107004115B (en) 2019-02-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14907199; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14907199; Country of ref document: EP; Kind code of ref document: A1)