
WO2016119076A1 - Method and system for face recognition - Google Patents

Method and system for face recognition

Info

Publication number
WO2016119076A1
Authority
WO
WIPO (PCT)
Prior art keywords
modules
convolution
inception
features
feature maps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2015/000050
Other languages
English (en)
Inventor
Xiaoou Tang
Xiaogang Wang
Yi Sun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201580074278.6A priority Critical patent/CN107209864B/zh
Priority to PCT/CN2015/000050 priority patent/WO2016119076A1/fr
Publication of WO2016119076A1 publication Critical patent/WO2016119076A1/fr
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to a method for face recognition and a system thereof.
  • DeepFace and DeepID were independently proposed to learn identity-related facial features through large-scale face identification tasks. DeepID2 made an additional improvement by learning deep facial features with joint face identification-verification tasks. DeepID2+ further improves DeepID2 by increasing the feature dimensions in each layer and adding joint identification-verification supervisory signals to previous feature extraction layers. DeepID2+ achieved the current state-of-the-art face recognition results on a number of widely evaluated face recognition datasets.
  • the network structure of DeepID2+ is still similar to conventional convolutional neural networks with interlacing convolutional and pooling layers.
  • VGG net and GoogLeNet are two representatives.
  • VGG net proposes to use continuous convolutions with small convolutional kernels. In particular, it stacks two or three layers of 3x3 convolutions together between every two pooling layers.
  • GoogLeNet incorporates multi-scale convolutions and pooling into a single feature extraction layer coined inception. To learn efficient features, an inception layer also introduces 1x1 convolutions to reduce the number of feature maps before larger convolutions and after pooling.
  • an apparatus for face recognition may comprise an extractor having a plurality of deep feature extraction hierarchies, the hierarchies extracting recognition features from one or more input images; and a recognizer in electronic communication with the extractor, which recognizes face images of the input images based on the extracted recognition features.
  • each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1.
  • a first one of the multi-convolution modules extracts local features from the input images, and the subsequent multi-convolution modules extract further local features from the extracted features outputted from a preceding pooling module, wherein each of the pooling modules receives local features from the respective multi-convolution modules and reduces dimensions of the received features.
  • the features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  • each of the pooling modules is coupled between two of adjacent multi-convolution modules, between one multi-convolution module and one adjacent multi-inception module, or between two of adjacent multi-inception modules.
  • each of the hierarchies further comprises one or more multi-inception modules.
  • Each of the multi-inception modules performs multi-scale convolutional operation on the features received from previous coupled pooling modules and reduces dimensions of the received features.
  • Each of multi-convolution and multi-inception modules in each hierarchy is followed by one of the pooling modules, and each pooling module is followed by a multi-convolution module or a multi-inception module, except for a last pooling module, a last multi-convolution module, or a last multi-inception module in the hierarchy.
  • each of the multi-inception modules may comprise a plurality of cascaded inception layers.
  • Each of the inception layers receives features outputted from a previous inception layer as its input, and the inception layers are configured to perform multi-scale convolution operations and pooling operations on the received features to obtain multi-scale convolutional feature maps and locally invariant feature maps, and to perform 1x1 convolution operations before the multi-scale convolution operations and after the pooling operations to reduce the dimensions of the features at those points.
  • the obtained multi-scale convolutional feature maps and the obtained locally invariant feature maps are stacked together to form input feature maps of the layer that follows.
  • each of the inception layers comprises: one or more first 1x1 convolution operation layers configured to receive input feature maps from a previous feature extraction layer and perform 1x1 convolution operations on the received feature maps to compress the number of feature maps; and one or more multi-scale convolution operation layers configured to perform N×N convolution operations on the compressed feature maps received from the respective 1x1 convolution operation layers to form first output feature maps, where N>1.
  • One or more pooling operation layers are configured to pool over local regions of the input feature maps from the previous layer to form locally invariant feature maps; and one or more second 1x1 convolution operation layers are configured to perform 1x1 convolution operations on the locally invariant feature maps received from the pooling operation layers to compress the number of feature maps so as to obtain second output feature maps.
  • One or more third convolution operation layers are configured to receive input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress the number of feature maps so as to obtain third feature maps.
  • the first, second and third feature maps are stacked together to form feature maps for inputting a following inception layer of the inception layers or inputting a next feature extraction module.
  • each of multi-convolution modules may comprise one or more cascaded convolution layers, each of the convolution layers receives features outputted from a previous convolution layer as its input, and each of the convolution layers is configured to perform local convolution operations on inputted features, wherein the convolutional layers share neural weights for the convolution operations only in local areas of the inputted images.
  • a trainer may be electronically communicated with the extractor to add supervisory signals on the feature extraction unit during training so as to adjust neural weights in the deep feature extraction hierarchies by back-propagating supervisory signals through the cascaded multi-convolution modules and pooling modules, or through the cascaded multi-convolution modules, pooling modules and the multi-inception modules.
  • the supervisory signals comprise one identification supervisory signal and one verification supervisory signal, wherein the identification supervisory signal is generated by classifying features in any of the modules extracted from an input face region into one of N identities in a training dataset, and taking a classification error as the supervisory signal, and wherein the verification signal is generated by comparing features in any of the modules extracted from two input face images respectively for determining if they are from the same person, and taking a verification error as the supervisory signal.
  • each of the multi-convolution modules, the pooling modules and the multi-inception modules receives a plurality of supervisory signals which are either added on said each module or back-propagated from later feature extraction modules. These supervisory signals are aggregated to adjust neural weights in each of multi-convolution and multi-inception modules during training.
  • each of the deep feature extraction hierarchies may comprise a different number of the multi-convolution modules, a different number of the multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or takes a different input face region to extract the features.
  • a method for face recognition, comprising: extracting, by an extractor having a plurality of deep feature extraction hierarchies, recognition features from one or more input images; and recognizing face images of the input images based on the extracted recognition features, wherein each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1.
  • a first one of the multi-convolution modules extracts local features from the input images, and the subsequent multi-convolution modules extract further local features from the extracted features outputted from a preceding pooling module, wherein each of the pooling modules receives local features from the respective multi-convolution modules and reduces dimensions of the received features.
  • Features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  • each of the hierarchies further comprises one or more multi-inception modules, each of which has a plurality of cascaded inception layers
  • the extracting further comprises: performing, by each of the inception layers, convolution operations on the received features to obtain multi-scale convolutional feature maps, and performing, by said each of the inception layers, pooling operations on the received features to obtain pooled feature maps (i.e. to pool over local regions of the feature maps received from the previous layer to form locally invariant feature maps) , wherein the obtained multi-scale convolutional feature maps and the pooled feature maps are stacked together to form input feature maps of the layer that follows.
  • each of the hierarchies further comprises one or more multi-inception modules, each of which has a plurality of cascaded inception layers, and wherein, during the extracting, each of the inception layers operates to: receive input feature maps from a previous feature extraction layer and perform 1x1 convolution operations on the received feature maps to compress the number of feature maps; perform N×N convolution operations on the compressed feature maps received from the respective 1x1 convolution operation layers to form first output feature maps, where N>1; and perform pooling operations on the received features from said previous layer (i.e. pool over local regions of the feature maps received from the previous layer to form locally invariant feature maps).
  • an apparatus for face recognition which may comprise: one or more memories that store executable components; and one or more processors, coupled to the memories, that execute the executable components to perform operations of the apparatus, the executable components comprising:
  • an extracting component having a plurality of deep feature extraction hierarchies configured to extract recognition features from one or more input images
  • a recognizing component recognizing face images of the input images based on the extracted recognition features
  • each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1,
  • a first one of the multi-convolution modules extracts local features from the input images, and the subsequent multi-convolution modules extract further local features from the extracted features outputted from a preceding pooling module, wherein each of the pooling modules receives local features from the respective multi-convolution modules and reduces dimensions of the received features, and
  • Fig. 1 is a schematic diagram illustrating an apparatus for face recognition consistent with some disclosed embodiments.
  • Fig. 2 is a schematic diagram illustrating an apparatus for face recognition when it is implemented in software, consistent with some disclosed embodiments.
  • Fig. 3a and 3b are two schematic diagrams illustrating two examples of deep feature extraction hierarchies in the feature extraction unit as shown in Fig. 1.
  • Fig. 4a is a schematic diagram illustrating structures of a multi-convolution module, consistent with some disclosed embodiments.
  • Fig. 4b is a schematic diagram illustrating a multi-inception module in deep feature extraction hierarchies, consistent with some disclosed embodiments.
  • Fig. 5 is a schematic diagram illustrating structures of an inception layer in multi-inception modules, consistent with some disclosed embodiments.
  • Fig. 6 is a schematic flowchart illustrating the trainer as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 7 is a schematic flowchart illustrating the extractor as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 8 is a schematic flowchart illustrating the recognizer as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 9 is a schematic flowchart illustrating the process for the inception layer as shown in Fig. 5 consistent with some disclosed embodiments.
  • the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
  • the apparatus 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
  • the apparatus 1000 may include one or more processors (processors 102, 104, 106 etc. ) , a memory 112, a storage device 116, a communication interface 114, and a bus to facilitate information exchange among various components of apparatus 1000.
  • Processors 102-106 may include a central processing unit ( “CPU” ) , a graphic processing unit ( “GPU” ) , or other suitable information processing devices.
  • processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods or run the modules that will be explained in greater detail below.
  • Memory 112 can include, among other things, a random access memory ( “RAM” ) and a read-only memory ( “ROM” ) .
  • Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106.
  • memory 112 may store one or more software applications.
  • memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106 to carry out the functions as disclosed below for the apparatus 1000. It is noted that although only one block is shown in Fig. 1, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
  • the apparatus 1000 may comprise an extractor 10 and a recognizer 20.
  • the extractor 10 is configured with a plurality of deep feature extraction hierarchies, which may be formed as a neural network configured or trained to extract recognition features from one or more input images.
  • the recognizer 20 is in electronic communication with the extractor 10 and recognizes face images of the input images based on the extracted recognition features.
  • each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1.
  • a first one of the multi-convolution modules extracts local features from the input images, and the subsequent multi-convolution modules extract further local features from the extracted features outputted from a preceding pooling module, wherein each of the pooling modules receives local features from the respective multi-convolution modules and reduces dimensions of the received features.
  • the features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  • the apparatus 1000 may further comprise a trainer 30 used to train the neural network.
  • the feature extractor 10 contains a plurality of deep feature extraction hierarchies. Each of the feature extraction hierarchies is a cascade of feature extraction modules.
  • Fig. 7 is a schematic flowchart illustrating the feature extraction process in the extractor 10, which contains three steps.
  • In step 101, the feature extractor 10 forward propagates an input face image through each of the deep feature extraction hierarchies, respectively.
  • In step 102, the extractor 10 takes the representations outputted by each of the deep feature extraction hierarchies as features.
  • In step 103, it concatenates the features of all deep feature extraction hierarchies.
  • each of the deep feature extraction hierarchies may include a plurality of multi-convolution modules, a plurality of multi-inception modules, a plurality of pooling modules, and a plurality of full-connection modules.
  • Each of the deep feature extraction hierarchies may contain a different number of cascaded multi-convolution modules, a different number of multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or may take a different input face region to extract features.
  • Fig. 3a illustrates an example of feature extraction hierarchies in the extractor 10. As shown in Fig. 3a, each of the deep feature extraction hierarchies contains alternate multi-convolution modules 21-1, 21-2, 21-3... and pooling modules 22-1, 22-2, 22-3.... For purpose of description, four multi-convolution modules 21-1, 21-2, 21-3 and 21-4 and three pooling modules 22-1, 22-2 and 22-3 are illustrated in Fig. 3a as an example. A rough sketch of such a hierarchy is given below.
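  • As an illustrative sketch only (not the patent's exact configuration), one such hierarchy could be written in PyTorch as follows. The channel counts, kernel sizes, use of max pooling, and the helper name multi_conv are assumptions; note also that standard Conv2d layers share weights globally, whereas the multi-convolution layers described below may share weights only within local regions.

        import torch.nn as nn

        def multi_conv(c_in, c_out):
            # A stand-in for a "multi-convolution module": two stacked 3x3 convolutions with ReLU.
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

        hierarchy = nn.Sequential(
            multi_conv(3, 32),    # multi-convolution module 1
            nn.MaxPool2d(2),      # pooling module 1
            multi_conv(32, 64),   # multi-convolution module 2
            nn.MaxPool2d(2),      # pooling module 2
            multi_conv(64, 96),   # multi-convolution module 3
            nn.MaxPool2d(2),      # pooling module 3
            multi_conv(96, 128),  # multi-convolution module 4 (its features are used for recognition)
        )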
  • Fig. 4a is a schematic diagram illustrating structures of each of the multi-convolution modules 21-1, 21-2, 21-3.... As shown, each multi-convolution module contains a plurality of cascaded convolutional layers. Fig. 4a shows an example of three cascaded convolutional layers, convolutional layers 1-3. However, in the present application, a multi-convolution module could contain any number of convolutional layers such as one, two, three, or more. In the extreme of a multi-convolution module containing only one convolutional layer, it degrades to a conventional convolution module. Therefore, multi-convolution modules are generalizations of conventional convolution modules. Likewise, a multi-inception module contains one or more cascaded inception layers.
  • the convolutional layers in a multi-convolution module are configured to extract local facial features from input feature maps (which are the output feature maps of the previous layer) to form output feature maps of the current layer.
  • each convolutional layer performs convolution operations on the input feature maps to form output feature maps of the current layer, and the formed output feature maps will be input to the next convolutional layer.
  • Each feature map is a certain kind of feature organized in 2D.
  • the features in the same output feature map or in local regions of the same feature map are extracted from input feature maps with the same set of neural connection weights.
  • the convolution operation in each convolutional layer may be expressed as

        y_j^r = max( 0, b_j^r + Σ_i k_ij^r * x_i^r ),    1)

  • where x_i and y_j are the i-th input feature map and the j-th output feature map, respectively; k_ij is the convolution kernel between the i-th input feature map and the j-th output feature map; b_j is the bias of the j-th output feature map; and * denotes convolution. r indicates a local region where weights are shared. In one extreme of the local region r corresponding to entire input feature maps, convolution becomes global convolution. In another extreme of the local region r corresponding to a single pixel in input feature maps, a convolutional layer degrades to a local-connection layer.
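  • A minimal NumPy sketch of formulation 1) in the global weight-sharing extreme (a single region r) is given below; a locally-shared variant would keep a separate kernel set k_ij per local region. The function name and the naive loop structure are illustrative assumptions, not part of the disclosure.

        import numpy as np

        def conv_layer(x, k, b):
            # Formulation 1 with global sharing: y_j = max(0, b_j + sum_i k_ij * x_i).
            # x: (C_in, H, W) input feature maps; k: (C_out, C_in, kh, kw) kernels; b: (C_out,) biases.
            c_out, c_in, kh, kw = k.shape
            h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
            y = np.zeros((c_out, h_out, w_out))
            for j in range(c_out):
                for u in range(h_out):
                    for v in range(w_out):
                        y[j, u, v] = b[j] + np.sum(x[:, u:u + kh, v:v + kw] * k[j])
            return np.maximum(y, 0.0)   # ReLU non-linearity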
  • 1x1 convolution operations may be carried out in inception layers (as shown in Fig. 4) to compress the number of feature maps by setting the number of output feature maps significantly smaller than the number of input feature maps, which will be discussed below.
  • Each of the pooling modules 22-1, 22-2... aims to reduce feature dimensions and form more invariant features.
  • the goal of cascading multiple convolution/inception layers is to extract hierarchical local features (i.e. features extracted from local regions of the input images or the input features) , wherein features extracted by higher convolution/inception layers have larger effective receptive field on input images and more complex non-linearity.
  • the pooling modules 22-1, 22-2... are configured to pool local facial features from input feature maps from previous layer to form output feature maps of the current layer.
  • Each of the pooling modules 22-1, 22-2... receives the feature maps from the respective connected multi-convolution/multi-inception module and then reduces the feature dimensions of the received feature maps and forms more invariant features by pooling operations, which may be formulated as

        y_i(j, k) = max_{0≤m<M, 0≤n<N} x_i(j·s+m, k·s+n),    2)

  • where each neuron in the i-th output feature map y_i pools over an M×N local region in the i-th input feature map x_i, with s as the step size.
  • the feature maps with the reduced dimensions are then input to the next cascaded convolutional module.
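  • A minimal sketch of formulation 2) with max pooling (one common choice; the function name and loop structure are illustrative assumptions):

        import numpy as np

        def pool_layer(x, M, N, s):
            # Each output neuron pools over an M x N local region of its input feature map with step size s.
            c, h, w = x.shape
            h_out, w_out = (h - M) // s + 1, (w - N) // s + 1
            y = np.zeros((c, h_out, w_out))
            for i in range(c):
                for j in range(h_out):
                    for k in range(w_out):
                        y[i, j, k] = x[i, j * s:j * s + M, k * s:k * s + N].max()
            return y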
  • each of pooling modules is in addition followed by a full-connection module 23 (23-1, 23-2 and 23-3) .
  • Features extracted in the three full-connection modules 23-1, 23-2 and 23-3 and the last multi-convolution module 21-4 (multi-convolution module 4) are supervised by supervisory signals.
  • Features in the last multi-convolution module 21-4 are used for face recognition.
  • the full-connection modules 23-1, 23-2 and 23-3 in deep feature extraction hierarchies are configured to extract global features (features extracted from the entire region of input feature maps) from previous feature extraction modules, i.e. the pooling modules 22-1, 22-2 and 22-3.
  • the fully-connected layers also serve as interfaces for receiving supervisory signals during training, which will be discussed later.
  • the full-connection modules 23-1, 23-2 and 23-3 also have the function of feature dimension reduction as pooling modules 22-1, 22-2 and 22-3 by restricting the number of neurons in them.
  • the fully-connection modules 23-1, 23-2 and 23-3 may be formulated as

        y = max( 0, w^T x ),

  • where x denotes the neural outputs (features) from the cascaded pooling module, y denotes the neural outputs (features) in the current fully-connection module, and w denotes the neural weights in the current feature extraction module (the current fully-connection module). Neurons in fully-connection modules linearly combine features in the previous feature extraction module, followed by ReLU non-linearity.
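  • A one-line sketch of such a fully-connection module (the bias term is omitted because the formulation above only names x, y and w; the function name is an illustrative assumption):

        import numpy as np

        def full_connection(x, w):
            # x: (D_in,) features from the cascaded pooling module; w: (D_in, D_out) neural weights.
            # Neurons linearly combine the previous features, followed by ReLU.
            return np.maximum(x @ w, 0.0)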
  • a feature extraction unit may contain a plurality of the deep feature extraction hierarchies.
  • Features in top feature extraction modules in all deep feature extraction hierarchies are concatenated into a long feature vector as a final feature representation for face recognition.
  • branching-out modules serve as interfaces for receiving supervisory signals during training, which will be disclosed later.
  • the top feature extraction module which extracts features for face recognition
  • all branching-out modules will be discarded and only the module cascade for extracting features for face recognition is retained at test time.
  • the hierarchy contains two multi-convolution modules 21-1 and 21-2, each of which is followed by a pooling module 22 (22-1 or 22-2) .
  • the multi-convolution module 21-1 is connected to an input face image as an input layer, and is configured to extract local facial features (i.e. features extracted from local regions of the input images) from input images by rule of formulation 1) .
  • the pooling module 22-1 is configured to pool local facial features from the previous layer (the multi-convolution module 21-1 ) to form output feature maps of the current layer. To be specific, the pooling module 22-1 receives the feature maps from the respectively connected convolutional module and then reduces the feature dimensions of the received feature maps, and forms more invariant features by pooling operations, which is formulated as formulation 2) .
  • each feature map is a certain kind of feature organized in 2D.
  • the feature extraction hierarchy further comprises two multi-inception modules 24-1 and 24-2, each of which is followed by a pooling module 22 (22-3 or 22-4) .
  • Fig. 4b shows an example of three cascaded inception layers 1-3 in each of the multi-inception modules 24-1 and 24-2.
  • the goal of cascading the inception layers is to extract multi-scale local features by incorporating convolutions of various kernel sizes as well as local pooling operations in a single layer.
  • the features extracted by higher convolution/inception layers have larger effective receptive field on input images and more complex non-linearity.
  • each of the inception layers comprises one or more first 1x1 convolution operation layers 241, one or more second 1x1 convolution operation layers 242, one or more multi-convolution operation layers (N×N convolution, N>1) 243, one or more pooling operation layers 244, and one or more third 1x1 convolution operation layers 245.
  • the number of the 1x1 convolution operation layers 241 is the same as that of the multi-scale convolution operation layers 243, and each layer 243 is coupled to a corresponding layer 241.
  • the number of the third 1x1 convolution operation layers 245 is the same as that of the pooling layers 244.
  • the second 1x1 convolution operation layers 242 are coupled to the previous inception layer.
  • the 1x1 convolution layers 241 are used to make computation efficient before the operations of the multi-convolution operation layers 243 and after pooling operation layers 244, which will be discussed below.
  • Fig. 5 just shows two first 1x1 convolution operation layers 241, one second 1x1 convolution operation layer 242, one third 1x1 convolution operation layer 245 and two multi-scale convolution operation layers 243, but the invention is not limited thereto.
  • the inception layer combines convolution operations with convolutional kernel sizes of 1x1, 3x3, and 5x5, as well as pooling operations by rule of formulation 2).
  • the first 1x1 convolution layers 241 are used to make computation efficient before 3x3 and 5x5 convolutions.
  • the number of output feature maps of a 1x1 convolution layer is set to be much smaller than its input feature maps.
  • Since 3x3 and 5x5 convolutions take output feature maps of 1x1 convolutions as their input feature maps, the number of input feature maps of 3x3 and 5x5 convolutions becomes much smaller. In this way, computations in 3x3 and 5x5 convolutions are reduced significantly.
  • the 1x1 convolution 245 after pooling helps reduce the number of output feature maps of pooling. Since output feature maps of 1x1, 3x3, and 5x5 convolutions are concatenated to form input feature maps of the next layer, a small number of output feature maps of 1x1 convolutions reduces the total number of output feature maps, and therefore reduces computation in next layer.
  • the 1x1 convolution itself does not take the majority computation due to the extremely small convolutional kernel size.
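  • As a rough worked example with illustrative channel counts (none of these numbers are specified by the patent): a 5x5 convolution mapping 192 input feature maps to 32 output maps costs about 5·5·192·32 ≈ 153,600 multiply-accumulates per output position, whereas first compressing the 192 maps to 16 with a 1x1 convolution and then applying the 5x5 convolution costs about 192·16 + 5·5·16·32 ≈ 3,072 + 12,800 = 15,872 per position, roughly a ten-fold reduction.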
  • Fig. 9 is a schematic flowchart illustrating the process for the inception layer as shown in Fig. 5 consistent with some disclosed embodiments.
  • each of the 1x1 convolution operation layers 241 operates to receive input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress the number of feature maps by rule of formula 1) as stated above.
  • the multi-scale convolution operation layer 243 performs N×N convolution operations on the compressed feature maps received from the respective 1x1 convolution operation layer 241 to form a plurality of first output feature maps.
  • the pooling operation layer 244 operates to receive the input feature maps from the previous layer and perform pooling operations on the received feature maps by rule of formula 2) .
  • the pooling operations in inception layers aim to pool over local regions of input feature maps to form locally invariant features as stated in the above.
  • pooling in inception layers may not reduce feature dimensions, which is achieved by setting step-size s equal to 1 by rule of formula 2.
  • the third 1x1 convolution operation layers 245 operate to perform 1x1 convolution operations on the feature maps received from the pooling operation layer 244 to compress the number of feature maps by rule of formula 1) as stated above, so as to obtain a plurality of second output feature maps.
  • the second 1x1 convolution operation layers 242 operate to receive the input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress the number of feature maps by rule of formula 1) so as to obtain a plurality of third feature maps.
  • the first, second and third feature maps are concatenated to form feature maps for inputting the following inception layer or inputting the following feature extraction module.
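  • A hedged PyTorch sketch of one such inception layer follows; the branch widths (channel counts) and class name are hypothetical placeholders, not values from the patent, and max pooling is assumed for the pooling branch.

        import torch
        import torch.nn as nn

        class InceptionLayer(nn.Module):
            # Sketch of the inception layer of Fig. 5: 1x1 compressions before 3x3/5x5 convolutions,
            # 1x1 compression after pooling, plus a plain 1x1 branch; outputs are stacked along channels.
            def __init__(self, c_in):
                super().__init__()
                self.branch1 = nn.Sequential(nn.Conv2d(c_in, 32, 1), nn.ReLU())             # plain 1x1 branch
                self.branch3 = nn.Sequential(nn.Conv2d(c_in, 24, 1), nn.ReLU(),
                                             nn.Conv2d(24, 32, 3, padding=1), nn.ReLU())    # 1x1 then 3x3
                self.branch5 = nn.Sequential(nn.Conv2d(c_in, 16, 1), nn.ReLU(),
                                             nn.Conv2d(16, 32, 5, padding=2), nn.ReLU())    # 1x1 then 5x5
                self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                                 nn.Conv2d(c_in, 32, 1), nn.ReLU())         # pooling then 1x1

            def forward(self, x):
                # Feature maps of all branches are concatenated to form the input of the next layer.
                return torch.cat([self.branch1(x), self.branch3(x),
                                  self.branch5(x), self.branch_pool(x)], dim=1)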
  • the recognizer 20 operates to calculate distances between features extracted by the feature extractor 10 for different face images, to determine whether two face images are from the same identity for face verification, or to determine whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images (which also consist of input images) for face identification.
  • Fig. 8 is a schematic flowchart illustrating the recognition process in the recognizer 20. In step 201, the recognizer 20 calculates distances between features extracted from different face images by the feature extractor 10.
  • In step 202, the recognizer 20 determines if two face images are from the same identity for face verification, or, alternatively, in step 203, it determines whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images for face identification.
  • two face images are determined to belong to the same identity if their feature distance is smaller than a threshold, or the probe face image is determined to belong to the same identity as one of gallery face images if their feature distance is the smallest compared to feature distances of the probe face image to all the other gallery face images, wherein feature distances determined by the recognizer 20 could be Euclidean distances, Joint Bayesian distances, cosine distances, Hamming distances, or any other distances.
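  • A minimal sketch of the two decision rules, using Euclidean distance as one of the admissible feature distances (the function names and threshold are illustrative assumptions):

        import numpy as np

        def verify(f1, f2, threshold):
            # Face verification: same identity if the feature distance is below a threshold.
            return np.linalg.norm(f1 - f2) < threshold

        def identify(probe, gallery):
            # Face identification: the probe is assigned the identity of the nearest gallery feature.
            dists = [np.linalg.norm(probe - g) for g in gallery]
            return int(np.argmin(dists))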
  • Joint Bayesian distances are used as feature distances.
  • Joint Bayesian has been a popular similarity metric for faces, which represents the extracted facial features x (after subtracting the mean) by the sum of two independent Gaussian variables, x = μ + ε, where μ ~ N(0, S_μ) represents the identity and ε ~ N(0, S_ε) represents intra-personal variations.
  • S_μ and S_ε can be learned from data with the EM algorithm. In test, it calculates the likelihood ratio between the intra-personal (same identity) and extra-personal (different identity) hypotheses, log [ P(x1, x2 | H_I) / P(x1, x2 | H_E) ], which has a closed-form solution.
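  • Assuming S_μ and S_ε have already been estimated by EM and the features are mean-subtracted, the test-time log-likelihood ratio can be sketched directly from the two Gaussian hypotheses; this naive version evaluates both joint densities explicitly, whereas practical implementations precompute closed-form matrices.

        import numpy as np
        from scipy.stats import multivariate_normal

        def joint_bayesian_ratio(x1, x2, S_mu, S_eps):
            # Log-likelihood ratio of the intra-personal vs extra-personal hypotheses.
            d = len(x1)
            joint = np.concatenate([x1, x2])
            cov_intra = np.block([[S_mu + S_eps, S_mu], [S_mu, S_mu + S_eps]])
            cov_extra = np.block([[S_mu + S_eps, np.zeros((d, d))],
                                  [np.zeros((d, d)), S_mu + S_eps]])
            return (multivariate_normal.logpdf(joint, mean=np.zeros(2 * d), cov=cov_intra)
                    - multivariate_normal.logpdf(joint, mean=np.zeros(2 * d), cov=cov_extra))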
  • The Trainer 30
  • the trainer 30 is used to update the weights w on connections between neurons in feature extraction layers (i.e. the layers of the multi-convolution modules, the multi-inception modules and the full connection modules) in the feature extractor 10 by inputting initial weights on connections between neurons in feature extraction layers in the feature extractor, a plurality of identification supervisory signals, and a plurality of verification supervisory signals.
  • the trainer 30 aims to iteratively find a set of optimized neural weights in deep feature extraction hierarchies for extracting identity-related features for face recognition.
  • the identification and verification supervisory signals in the trainer 30 are simultaneously added to each of the supervised layers in each of the feature extraction hierarchies in the feature extractor 10, and respectively back-propagated to the input face image, so as to update the weights on connections between neurons in all the cascaded feature extraction modules.
  • the identification supervisory signals are generated in the trainer 30 by classifying all of the supervised layer (layers selected for supervision, which could be those in multi-convolution modules, multi-inception modules, pooling modules, or full-connection modules) representations into one of N identities, wherein the classification errors are used as the identification supervisory signals.
  • the verification supervisory signals in the trainer 30 are generated by verifying the supervised layer representations of two compared face images, respectively, in each of the feature extraction modules, to determine if the two compared face images belong to the same identity, wherein the verification errors are used as the verification supervisory signals.
  • the feature extractor 10 extracts two feature vectors f i and f j from the two face images respectively in each of the feature extraction modules.
  • the verification error is if f i and f j are features of face images of the same identity, or if f i and f j are features of face images of different identities, where
  • 2 is Euclidean distance of the two feature vectors, m is a positive constant value.
  • f i and f j are dissimilar for the same identity, or if f i and f j are similar for different identities.
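  • A direct sketch of this verification error as a contrastive-style loss on feature pairs (the function name and default margin value are illustrative assumptions):

        import numpy as np

        def verification_error(f_i, f_j, same_identity, m=1.0):
            # m is the positive margin constant from the formulation above.
            d = np.linalg.norm(f_i - f_j)        # Euclidean distance of the two feature vectors
            if same_identity:
                return 0.5 * d ** 2              # penalizes dissimilar features of the same identity
            return 0.5 * max(0.0, m - d) ** 2    # penalizes similar features of different identities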
  • Fig. 6 is a schematic flowchart illustrating the training process in the trainer 30.
  • the trainer 30 samples two face images and inputs them to the feature extractor 10, respectively, to get feature representations of each of the two face images in all feature extraction layers of the feature extractor 10.
  • the trainer 30 calculates identification errors by classifying feature representations of each face image in each supervised layer into one of a plurality of (N) identities.
  • the trainer 30 calculates verification errors by verifying if feature representations of two face images, respectively, in each supervised layer are from the same identity.
  • the identification and verification errors are used as identification and verification supervisory signals, respectively.
  • In step 304, the trainer 30 back-propagates all identification and verification supervisory signals through the feature extractor 10 simultaneously, so as to update the weights on connections between neurons in the feature extractor 10.
  • Identification and verification supervisory signals (or errors) simultaneously added to supervised layers are back-propagated through the cascade of feature extraction modules until the input image. After back-propagation, the errors obtained in each layer in the cascade of feature extraction modules are accumulated. Weights on connections between neurons in the feature extractor 10 are updated according to the magnitude of the errors.
  • In step 305, the trainer 30 judges whether the training process has converged, and repeats steps 301-304 if a convergence point has not been reached.
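  • A simplified sketch of one training iteration (steps 301-304) is shown below. It supervises only the top features with a hypothetical weighting lam and margin m, whereas the patent adds the joint identification-verification signals to multiple supervised layers; the extractor and classifier modules are assumed to be defined elsewhere.

        import torch
        import torch.nn.functional as F

        def training_step(extractor, classifier, optimizer, img_a, img_b, id_a, id_b, lam=0.05, m=1.0):
            # One joint identification-verification update on a sampled pair of face images.
            f_a, f_b = extractor(img_a), extractor(img_b)
            ident_loss = F.cross_entropy(classifier(f_a), id_a) + F.cross_entropy(classifier(f_b), id_b)
            d = torch.norm(f_a - f_b, dim=1)
            same = (id_a == id_b).float()
            verif_loss = (same * 0.5 * d ** 2
                          + (1 - same) * 0.5 * torch.clamp(m - d, min=0) ** 2).mean()
            loss = ident_loss + lam * verif_loss
            optimizer.zero_grad()
            loss.backward()    # back-propagate both supervisory signals through the extractor
            optimizer.step()   # update neural weights according to the accumulated errors
            return loss.item()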

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

An apparatus for face recognition is disclosed. The apparatus may comprise an extractor having a plurality of deep feature extraction hierarchies, the hierarchies extracting recognition features from one or more input images; and a recognizer in electronic communication with the extractor that recognizes face images of the input images based on the extracted recognition features.
PCT/CN2015/000050 2015-01-27 2015-01-27 Procédé et système de reconnaissance faciale Ceased WO2016119076A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580074278.6A CN107209864B (zh) 2015-01-27 2015-01-27 人脸识别方法和装置
PCT/CN2015/000050 WO2016119076A1 (fr) 2015-01-27 2015-01-27 Procédé et système de reconnaissance faciale

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/000050 WO2016119076A1 (fr) 2015-01-27 2015-01-27 Procédé et système de reconnaissance faciale

Publications (1)

Publication Number Publication Date
WO2016119076A1 true WO2016119076A1 (fr) 2016-08-04

Family

ID=56542092

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/000050 Ceased WO2016119076A1 (fr) 2015-01-27 2015-01-27 Procédé et système de reconnaissance faciale

Country Status (2)

Country Link
CN (1) CN107209864B (fr)
WO (1) WO2016119076A1 (fr)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798381A (zh) * 2017-11-13 2018-03-13 河海大学 一种基于卷积神经网络的图像识别方法
WO2018090905A1 (fr) * 2016-11-15 2018-05-24 Huawei Technologies Co., Ltd. Détection automatique d'identité
CN108073876A (zh) * 2016-11-14 2018-05-25 北京三星通信技术研究有限公司 面部解析设备和面部解析方法
US10282589B2 (en) 2017-08-29 2019-05-07 Konica Minolta Laboratory U.S.A., Inc. Method and system for detection and classification of cells using convolutional neural networks
CN110648316A (zh) * 2019-09-07 2020-01-03 创新奇智(成都)科技有限公司 一种基于深度学习的钢卷端面边缘检测算法
CN110889373A (zh) * 2019-11-27 2020-03-17 中国农业银行股份有限公司 基于区块链的身份识别方法、信息保存方法及相关装置
US10621424B2 (en) 2018-03-27 2020-04-14 Wistron Corporation Multi-level state detecting system and method

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844541A (zh) * 2017-10-25 2018-03-27 北京奇虎科技有限公司 图像查重的方法及装置
WO2019095333A1 (fr) * 2017-11-17 2019-05-23 华为技术有限公司 Procédé et dispositif de traitement de données
CN109344779A (zh) * 2018-10-11 2019-02-15 高新兴科技集团股份有限公司 一种基于卷积神经网络的匝道场景下的人脸检测方法
US10740593B1 (en) * 2019-01-31 2020-08-11 StradVision, Inc. Method for recognizing face using multiple patch combination based on deep neural network with fault tolerance and fluctuation robustness in extreme situation
CN110598716A (zh) * 2019-09-09 2019-12-20 北京文安智能技术股份有限公司 一种人员属性识别方法、装置及系统
EP4058933A4 (fr) 2019-11-20 2022-12-28 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Dispositif de détection de visage, procédé et système de déverrouillage à reconnaissance faciale
CN111968264A (zh) * 2020-10-21 2020-11-20 东华理工大学南昌校区 体育项目时间登记装置
CN115035572B (zh) * 2022-05-27 2025-09-16 汤姆逊(广东)智能科技有限公司 人脸识别方法及装置


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038337A (en) * 1996-03-29 2000-03-14 Nec Research Institute, Inc. Method and apparatus for object recognition
US20080014563A1 (en) * 2004-06-04 2008-01-17 France Teleom Method for Recognising Faces by Means of a Two-Dimensional Linear Disriminant Analysis
US8345962B2 (en) * 2007-11-29 2013-01-01 Nec Laboratories America, Inc. Transfer learning methods and systems for feed-forward visual recognition systems
CN103530657A (zh) * 2013-09-26 2014-01-22 华南理工大学 一种基于加权l2抽取深度学习人脸识别方法

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SUN, YI ET AL.: "Deep learning Face Representation by Joint Identification-Verification", CORNELL UNIVERSITY LIBRARY, 18 June 2014 (2014-06-18), Retrieved from the Internet <URL:https://arxiv.org/abs/1406.4773> *
SUN, YI ET AL.: "Deep Learning Face Representation from Predicting 10000 Classes", 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 28 June 2014 (2014-06-28), pages 1891 - 1898 *
SUN, YI ET AL.: "Hybrid Deep learning for Face Verification", 2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 8 October 2013 (2013-10-08), pages 1489 - 1496 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073876A (zh) * 2016-11-14 2018-05-25 北京三星通信技术研究有限公司 面部解析设备和面部解析方法
CN108073876B (zh) * 2016-11-14 2023-09-19 北京三星通信技术研究有限公司 面部解析设备和面部解析方法
WO2018090905A1 (fr) * 2016-11-15 2018-05-24 Huawei Technologies Co., Ltd. Détection automatique d'identité
US10460153B2 (en) 2016-11-15 2019-10-29 Futurewei Technologies, Inc. Automatic identity detection
US10282589B2 (en) 2017-08-29 2019-05-07 Konica Minolta Laboratory U.S.A., Inc. Method and system for detection and classification of cells using convolutional neural networks
CN107798381A (zh) * 2017-11-13 2018-03-13 河海大学 一种基于卷积神经网络的图像识别方法
CN107798381B (zh) * 2017-11-13 2021-11-30 河海大学 一种基于卷积神经网络的图像识别方法
US10621424B2 (en) 2018-03-27 2020-04-14 Wistron Corporation Multi-level state detecting system and method
CN110648316A (zh) * 2019-09-07 2020-01-03 创新奇智(成都)科技有限公司 一种基于深度学习的钢卷端面边缘检测算法
CN110889373A (zh) * 2019-11-27 2020-03-17 中国农业银行股份有限公司 基于区块链的身份识别方法、信息保存方法及相关装置
CN110889373B (zh) * 2019-11-27 2022-04-08 中国农业银行股份有限公司 基于区块链的身份识别方法、信息保存方法及相关装置

Also Published As

Publication number Publication date
CN107209864B (zh) 2018-03-30
CN107209864A (zh) 2017-09-26

Similar Documents

Publication Publication Date Title
WO2016119076A1 (fr) Procédé et système de reconnaissance faciale
CN112561027B (zh) 神经网络架构搜索方法、图像处理方法、装置和存储介质
CN112597941B (zh) 一种人脸识别方法、装置及电子设备
EP3732619B1 (fr) Procédé de traitement d&#39;image basé sur un réseau neuronal convolutionnel et appareil de traitement d&#39;image
Paisitkriangkrai et al. Pedestrian detection with spatially pooled features and structured ensemble learning
US9811718B2 (en) Method and a system for face verification
JP6345276B2 (ja) 顔認証方法およびシステム
US20240143977A1 (en) Model training method and apparatus
CN110765860A (zh) 摔倒判定方法、装置、计算机设备及存储介质
WO2019228317A1 (fr) Procédé et dispositif de reconnaissance faciale et support lisible par ordinateur
WO2016086330A1 (fr) Procédé et système de reconnaissance faciale
CN110175671A (zh) 神经网络的构建方法、图像处理方法及装置
WO2014205231A1 (fr) Cadre d&#39;apprentissage en profondeur destiné à la détection d&#39;objet générique
CN111914908A (zh) 一种图像识别模型训练方法、图像识别方法及相关设备
Imani et al. Neural computation for robust and holographic face detection
CN111414875B (zh) 基于深度回归森林的三维点云头部姿态估计系统
CN108537235B (zh) 一种低复杂度尺度金字塔提取图像特征的方法
CN114358205B (zh) 模型训练方法、模型训练装置、终端设备及存储介质
CN111079739A (zh) 一种多尺度注意力特征检测方法
CN106803054B (zh) 人脸模型矩阵训练方法和装置
CN113762249B (zh) 图像攻击检测、图像攻击检测模型训练方法和装置
CN113536970A (zh) 一种视频分类模型的训练方法及相关装置
CN114677611B (zh) 数据识别方法、存储介质及设备
Liu et al. Self-constructing graph convolutional networks for semantic labeling
CN113255604A (zh) 基于深度学习网络的行人重识别方法、装置、设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15879288

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15879288

Country of ref document: EP

Kind code of ref document: A1