
CN119323815A - Emotion recognition method based on facial feature and key point fusion network - Google Patents

Emotion recognition method based on facial feature and key point fusion network

Info

Publication number
CN119323815A
Authority
CN
China
Prior art keywords
facial
feature
fusion
module
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411349869.7A
Other languages
Chinese (zh)
Inventor
李福芳
李志杰
钟俊赢
范禹轩
张月华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202411349869.7A priority Critical patent/CN119323815A/en
Publication of CN119323815A publication Critical patent/CN119323815A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract


The present invention discloses an emotion recognition method based on a facial feature and key point fusion network, relating to the technical field of facial recognition. The method comprises: constructing an emotion recognition model based on the facial feature and key point fusion network LFFNet; obtaining a face image to be recognized and inputting it into the emotion recognition model; extracting facial features through the processing layers, and extracting facial key points through the key point extraction module MobileFaceNet; fusing the facial features with the facial key points through a feature fusion module to obtain similarity maps; splicing and integrating all similarity maps through a multi-scale fusion module to obtain deep features; and performing full-connection processing on the deep features through a fully connected layer FC to obtain an emotion recognition result. While fusing facial key points and facial features at multiple scales, the present invention aggregates the obtained fusion information again, so that the valuable information of both is retained and utilized to the greatest extent.

Description

Emotion recognition method based on facial feature and key point fusion network
Technical Field
The invention relates to the technical field of facial recognition, and in particular to an emotion recognition method based on a facial feature and key point fusion network.
Background
Emotion recognition technology has extremely wide application prospects. In the field of education, a teaching system can analyze the emotional responses of students, helping teachers adjust teaching strategies and realize personalized teaching. In the medical field, the technology can assist in diagnosing psychological conditions such as depression and anxiety, providing doctors with an important diagnostic basis. In human-computer interaction, facial expression recognition enables a machine to understand human needs more intelligently and provides a more natural and smooth interaction experience. In the field of security, emotion recognition technology can be applied to identity verification, improving the security of the system. With the continuous progress of technology, emotion recognition will play an important role in ever more fields and bring more convenience to people's lives.
In the prior art, an emotion recognition method based on a convolutional neural network has been proposed. It comprises collecting face image datasets containing various emotion labels, designing a suitable convolutional neural network structure according to the data characteristics, training the network on a training set, inputting images into the trained network, and computing layer by layer to obtain an emotion classification result.
The defect of the prior art is that the convolutional neural network relies on a single type of feature information, so it cannot fully capture the emotion-related feature information in the face image, and the emotion recognition result is therefore inaccurate.
Disclosure of Invention
Based on the above, it is necessary to provide an emotion recognition method based on a facial feature and key point fusion network.
The embodiment of the invention provides an emotion recognition method based on a facial feature and key point fusion network, which comprises the following steps:
constructing an emotion recognition model based on the facial feature and key point fusion network LFFNet, wherein the emotion recognition model comprises a backbone network Swin Transformer, a multi-scale fusion module and a fully connected layer FC, and the backbone network Swin Transformer comprises a plurality of processing layers (blocks), a feature fusion module and a key point extraction module MobileFaceNet;
obtaining a face image to be recognized and inputting it into the emotion recognition model; extracting facial features of the face image through the processing layers, and extracting facial key points of the face image through the key point extraction module MobileFaceNet; fusing the facial features extracted by each processing layer with the facial key points extracted by each key point extraction module MobileFaceNet through the feature fusion module, to obtain a similarity map of pixel points at corresponding positions in the two feature maps extracted by adjacent processing layers; splicing and integrating all similarity maps according to the processing order of the processing layers through the multi-scale fusion module, to obtain deep features; and performing full-connection processing on the deep features through the fully connected layer FC, to obtain an emotion recognition result.
Optionally, the backbone network Swin Transformer further includes a windowed multi-head self-attention module W_MSA and a shift windowed multi-head self-attention module SW_MSA; self-attention is calculated inside each processing layer by the windowed multi-head self-attention module W_MSA, and self-attention is exchanged across processing layers by the shift windowed multi-head self-attention module SW_MSA, according to:
ẑ^l = W_MSA(LN(z^(l-1))) + z^(l-1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW_MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
where ẑ^l is the output feature of the l-th windowed multi-head self-attention module W_MSA or shift windowed multi-head self-attention module SW_MSA, z^l is the output feature of the l-th multi-layer perceptron module MLP, and LN is the layer normalization operation.
Optionally, the key point extracting module MobileFaceNet is a 68-point model, and the extracting, by the key point extracting module MobileFaceNet, facial key points of the face image includes:
detecting 68 key points of face image data including eyes, eyebrows, nose, mouth and face contours;
generating a heatmap for each key point, and determining the coordinates of each key point from the peak position of its heatmap.
Optionally, the feature fusion module fuses the facial features extracted by each processing layer and the facial key points extracted by each key point extraction module MobileFaceNet, including:
The pixel attention mechanism aligns the feature map of the facial feature x extracted by the backbone network Swin Transformer with the feature map of the facial key point y extracted by the key point extraction module MobileFaceNet by convolution;
performing point multiplication on the pixel points at corresponding positions of the two feature maps to obtain a similarity map quantifying the importance of the facial key point features relative to the facial feature x at each pixel position;
processing the similarity map through a Sigmoid function to obtain, for each pixel point on the feature map, a feature influence value between 0 and 1, where a value close to 1 indicates that the y feature at that pixel contributes more to the fusion output, and a value close to 0 indicates that the x feature at that pixel contributes more;
performing weighted addition of the aligned feature map of the facial feature x and the feature map of the facial key point y using the computed feature influence values, to obtain the final fusion feature, according to the formulas:
σ = Sigmoid(f_x(v_x) · f_y(v_y))
output = σ·v_x + (1-σ)·v_y
where v_x is the corresponding pixel in the feature map of the facial feature, v_y is the corresponding pixel in the feature map of the facial key point, f is the alignment operation, and σ is the probability that the two pixels belong to the same object.
Optionally, the multi-scale fusion module includes a plurality of 1×1 convolution layers and a plurality of 3×3 convolution layers, and performs splicing and integration on all similarity maps according to a processing sequence of the processing layers through the multi-scale fusion module, including:
Different input features are adjusted to the same size through a 1×1 convolution layer and an upsampling operation;
the result of the first processing layer is transmitted to the next layer; for each subsequent processing layer, the output of the previous processing layer is added to the input of the current processing layer and fused through a 3×3 convolution to obtain deeper features;
And splicing and integrating the outputs of all the processing layers according to the processing sequence of the processing layers to obtain the deep features of the face image.
Optionally, the fully-connected processing is performed on the deep features through the fully-connected layer FC, including:
Flattening the deep features of the face image into a one-dimensional vector, and performing feature reduction and feature abstraction on the one-dimensional vector to obtain a probability distribution over the expression categories of the face image.
Compared with the prior art, the emotion recognition method based on the facial feature and key point fusion network provided by the embodiment of the invention has the following beneficial effects:
According to the invention, the facial features extracted by each processing layer are fused with the facial key points extracted by each key point extraction module MobileFaceNet through the feature fusion module, and all similarity maps are spliced and integrated according to the processing order of the processing layers through the multi-scale fusion module. This solves the problem that a convolutional neural network relying on a single type of feature cannot fully capture the emotion-related feature information in a face image. While facial key points and facial features are fused at multiple scales, the obtained fusion information is aggregated again, so that the valuable information of both is retained and utilized to the greatest extent.
In addition, all similarity maps are spliced and integrated according to the processing order of the processing layers through the multi-scale fusion module to obtain deep features, achieving deep multi-scale feature fusion and providing stronger performance and expressiveness for the facial emotion recognition task.
Drawings
FIG. 1 is a model block diagram of a method for emotion recognition based on a facial feature and keypoint fusion network, provided in one embodiment;
FIG. 2 is a process flow diagram of the Swin Transformer in an emotion recognition method based on a facial feature and keypoint fusion network provided in one embodiment;
FIG. 3 is a schematic diagram of a feature fusion module of an emotion recognition method based on a facial feature and keypoint fusion network according to an embodiment;
FIG. 4 is a schematic diagram of a multi-scale fusion module of a method for emotion recognition based on a facial feature and keypoint fusion network, provided in one embodiment;
FIG. 5 is a recall result graph of a facial feature and keypoint fusion network based emotion recognition method provided in one embodiment;
FIG. 6 is a confusion matrix comparison diagram of an emotion recognition method based on a facial feature and keypoint fusion network, provided in one embodiment;
FIG. 7 is a comparison plot on RAFDB dataset of a facial feature and keypoint fusion network based emotion recognition method provided in one embodiment;
FIG. 8 is a comparison plot on SFEW dataset of a facial feature and keypoint fusion network based emotion recognition method provided in one embodiment;
FIG. 9 is a graph comparing recall results of an emotion recognition method based on a facial feature and keypoint fusion network, provided in one embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In one embodiment, an emotion recognition method based on a facial feature and key point fusion network is provided, the method comprising:
1. Model design
1.1 Model Structure
An emotion recognition model based on the facial feature and key point fusion network LFFNet is constructed. The overall model framework is shown in fig. 1 and comprises a backbone network Swin Transformer, a multi-scale fusion module and a fully connected layer FC. The backbone network Swin Transformer comprises a plurality of processing layers (block1, block2, block3 and block4), a feature fusion module and a key point extraction module MobileFaceNet. Four processing layers are arranged, a key point extraction module MobileFaceNet is arranged between adjacent processing layers, the adjacent processing layers and the key point extraction module MobileFaceNet are connected through the feature fusion module, and the output of each processing layer is connected with the multi-scale fusion module.
A face image to be recognized is obtained and input into the emotion recognition model; facial features of the face image are extracted through the processing layers, and facial key points of the face image are extracted through the key point extraction module MobileFaceNet; the facial features extracted by each processing layer are fused with the facial key points extracted by each key point extraction module MobileFaceNet through the feature fusion module, yielding a similarity map of pixel points at corresponding positions in the two feature maps extracted by adjacent processing layers; all similarity maps are spliced and integrated according to the processing order of the processing layers through the multi-scale fusion module, yielding deep features; and the deep features are processed by the fully connected layer FC to obtain an emotion recognition result.
The feature maps output by the different layers are sequentially transmitted and spliced, so that deeper features are obtained layer by layer and multi-scale fusion is realized. In a fine-grained recognition task such as facial expression recognition, this hierarchical fusion structure can attend to the detailed features and the global information of the face at the same time, so that the model can learn more discriminative features.
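As a concrete illustration of this data flow, a minimal PyTorch sketch of the pipeline is given below. The module placeholders (processing stages, key point branches, fusion modules), the channel handling and the class count are assumptions made for illustration only, not the exact implementation of the invention.

```python
# Minimal sketch of the LFFNet pipeline: Swin processing layers interleaved with
# key point extraction and pixel-level fusion, followed by multi-scale fusion
# and a fully connected classifier. All sub-modules are assumed to be supplied.
import torch
import torch.nn as nn

class LFFNet(nn.Module):
    def __init__(self, stages, keypoint_branches, fusions, ms_fusion, feat_dim, num_classes=7):
        super().__init__()
        self.stages = nn.ModuleList(stages)                  # block1..block4 (Swin processing layers)
        self.kp_branches = nn.ModuleList(keypoint_branches)  # MobileFaceNet-style key point extractors
        self.fusions = nn.ModuleList(fusions)                # pixel-attention feature fusion modules
        self.ms_fusion = ms_fusion                           # multi-scale fusion module
        self.fc = nn.Linear(feat_dim, num_classes)           # fully connected layer FC

    def forward(self, img):
        x = img
        fused_per_stage = []
        for stage, kp_branch, fusion in zip(self.stages, self.kp_branches, self.fusions):
            x = stage(x)                # facial features from this processing layer
            kp = kp_branch(img)         # facial key point features for the same image
            x = fusion(x, kp)           # pixel-level fusion, passed on to the next stage
            fused_per_stage.append(x)   # also collected for multi-scale fusion
        deep = self.ms_fusion(fused_per_stage)            # splice and integrate all scales
        return self.fc(torch.flatten(deep, start_dim=1))  # emotion class logits
```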
1.2 Backbone network Swin Transformer
Facial features focus mainly on the texture and color information of the face, such as skin color and facial wrinkles. They capture the visual attributes of the face and are very useful for identifying specific facial expressions or individual attributes such as age and gender. These features can be obtained either through manual design or automatically through deep learning models.
A Transformer network can capture the relation between any two positions in the input sequence without being limited by the sequence length, effectively avoiding the performance degradation of traditional sequence-processing models on long sequences. However, owing to the computational complexity of its self-attention mechanism, the conventional Transformer model faces huge computational and memory overhead when processing large images, and may lose local information due to window size limitations, resulting in an insufficient understanding of the whole image. Therefore, a local window mechanism is introduced, allowing the model to attend only to local areas in each attention operation, which avoids the complexity of global attention computation while ensuring that the model captures rich local detail information. As shown in fig. 2, the backbone network Swin Transformer further includes a windowed multi-head self-attention module (W_MSA) and a shift windowed multi-head self-attention module (Shifted Window Multi-head Self-Attention, SW_MSA). Self-attention is calculated within each processing layer through the windowed multi-head self-attention module W_MSA, and self-attention is exchanged across processing layers through the shift windowed multi-head self-attention module SW_MSA, enhancing the expressive capacity of the model. The calculation formulas are:
ẑ^l = W_MSA(LN(z^(l-1))) + z^(l-1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW_MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
where ẑ^l is the output feature of the l-th windowed multi-head self-attention module W_MSA or shift windowed multi-head self-attention module SW_MSA, z^l is the output feature of the l-th multi-layer perceptron module MLP, and LN is the layer normalization operation.
Facial expression recognition requires accurate capture of subtle changes in the face, including subtle movements in local areas such as the eyes, mouth corners and eyebrows, which are key cues for judging expressions. The Swin Transformer can focus more effectively on these fine-grained local features through its windowed self-attention mechanism. By using a local window mechanism, the Swin Transformer allows the model to attend only to local areas in each attention operation, avoiding the complexity of global attention computation while ensuring that the model can capture rich detailed information. Meanwhile, considering the requirements of facial expression recognition in practical applications, for example the need to run on mobile phones, embedded systems and other resource-limited devices, the design of the Swin Transformer remarkably reduces the demand for computing resources by limiting the self-attention range and adopting hierarchical processing. Therefore, the Swin Transformer is used as the backbone network for facial feature extraction, making full use of its advantages in capturing fine-grained features and in computational efficiency, and improving the accuracy and efficiency of recognition.
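For reference, the residual structure expressed by the formulas above can be sketched in PyTorch as follows; the W_MSA, SW_MSA and MLP sub-modules are taken as given and the windowing and shifting details are omitted, so this is a structural illustration rather than a full Swin Transformer implementation.

```python
# Two successive Swin blocks following the equations above:
# z_hat = W_MSA(LN(z)) + z ; z = MLP(LN(z_hat)) + z_hat, then the same with SW_MSA.
import torch.nn as nn

class SwinBlockPair(nn.Module):
    def __init__(self, dim, w_msa, sw_msa, mlp1, mlp2):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        self.w_msa, self.sw_msa = w_msa, sw_msa
        self.mlp1, self.mlp2 = mlp1, mlp2

    def forward(self, z):
        z_hat = self.w_msa(self.norms[0](z)) + z     # windowed multi-head self-attention
        z = self.mlp1(self.norms[1](z_hat)) + z_hat  # MLP with residual connection
        z_hat = self.sw_msa(self.norms[2](z)) + z    # shifted-window self-attention
        z = self.mlp2(self.norms[3](z_hat)) + z_hat  # MLP with residual connection
        return z
```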
1.3 Key point extraction module MobileFaceNet
Facial key points (facial landmarks) are key points on the face that generally correspond to salient facial features, such as the contours of the eyes, nose and mouth and the edges of the chin and cheeks. Facial key points provide a sparse representation of the key facial regions in facial image analysis, with low sensitivity to various inherent variations of the face, such as skin color, gender, age and different backgrounds, thereby effectively reducing intra-class differences.
The MobileFaceNet model used here is a 68-point model: 68 key points covering the eyes, eyebrows, nose, mouth and facial contour are detected in the face image data, a heatmap is generated for each key point, and the coordinates of each key point are determined from the peak position of its heatmap.
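For illustration, the peak-based decoding of key point coordinates from heatmaps can be written as follows; the output tensor shape of the key point branch, (N, 68, H, W), is an assumption.

```python
# Convert per-keypoint heatmaps to (x, y) coordinates by taking the peak of each map.
import torch

def heatmaps_to_coords(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (N, 68, H, W) -> coordinates: (N, 68, 2) ordered as (x, y)."""
    n, k, h, w = heatmaps.shape
    flat = heatmaps.view(n, k, -1)
    idx = flat.argmax(dim=-1)                              # index of the peak of each heatmap
    ys = torch.div(idx, w, rounding_mode="floor").float()  # row of the peak
    xs = (idx % w).float()                                 # column of the peak
    return torch.stack([xs, ys], dim=-1)
```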
1.4 Feature fusion Module
In the facial expression recognition task, facial key points and facial features play different roles. Facial key points indicate the key regions that deserve close attention, while facial features provide support for capturing the global information of the face. Carefully fusing the two kinds of features is very important for improving the accuracy of expression recognition. For this purpose, a feature fusion module based on Pixel Attention (PA) is used, aiming to precisely integrate the effective features originating from different branches and to fuse the valuable information of facial key points and facial features at the pixel level.
Specifically, the pixel attention mechanism first aligns, by convolution, the feature map of the facial feature x extracted by the backbone network Swin Transformer with the feature map of the facial key point y extracted by the key point extraction module MobileFaceNet, and then performs a point multiplication on the pixel points at each corresponding position of the two feature maps to obtain a similarity map that quantifies the importance of the y feature relative to x at each pixel position. After further mapping by the Sigmoid function, a value between 0 and 1 is obtained for each pixel, where a value close to 1 indicates that at this position the contribution of the y feature to the fusion output is more critical, whereas a value close to 0 means that at this position the x feature is more dominant. Finally, the two feature maps are added with the calculated weights to obtain the final fusion feature. This mechanism not only preserves the characteristics of x but also effectively integrates the supplementary information provided by y, thereby enhancing the expressive power of the model. The calculation is as follows:
σ = Sigmoid(f_x(v_x) · f_y(v_y))
output = σ·v_x + (1-σ)·v_y
where v_x is the corresponding pixel in the feature map of the facial feature, v_y is the corresponding pixel in the feature map of the facial key point, f is the alignment operation, and σ is the probability that the two pixels belong to the same object.
In facial expression recognition, facial key points typically focus on the key regions of an expression, while facial features cover a wider range of global information. When the similarity between the facial key points and the facial features at a certain position is high, the features captured by the two at that position are highly consistent and the facial key point features there are particularly important; in this case it is reasonable to rely on the information provided by the facial key points to enrich the final feature representation. Conversely, if the similarity is low, more of the facial feature information should be retained. Moreover, since facial key points are fine-grained features, fusing at the pixel level combines facial key points and facial features more effectively than other fusion methods. The basic structure of the feature fusion module is illustrated in fig. 3.
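A minimal PyTorch sketch of this pixel-attention fusion is given below. Treating the alignment operations f_x and f_y as 1×1 convolutions and assuming the two feature maps have already been brought to the same spatial size and channel count are implementation assumptions.

```python
# Pixel-attention fusion: sigma = Sigmoid(f_x(v_x) * f_y(v_y)),
# output = sigma * v_x + (1 - sigma) * v_y, applied element-wise.
import torch
import torch.nn as nn

class PixelAttentionFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f_x = nn.Conv2d(channels, channels, kernel_size=1)  # alignment of the facial-feature branch
        self.f_y = nn.Conv2d(channels, channels, kernel_size=1)  # alignment of the key-point branch

    def forward(self, v_x, v_y):
        sigma = torch.sigmoid(self.f_x(v_x) * self.f_y(v_y))     # per-pixel similarity map in (0, 1)
        return sigma * v_x + (1 - sigma) * v_y                   # weighted combination of both branches
```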
1.5 Multi-scale fusion module based on residual connection
Multi-scale fusion can comprehensively exploit the information obtained at different scales to improve recognition performance. The core idea is that data at different scales often reveal different aspects of an image: small-scale data provide rich detailed information, while large-scale data provide more context and a global view. By effectively fusing this multi-level information, a more comprehensive and accurate image analysis result can be achieved. Introducing a pooling layer that generates a fixed-size feature representation enables the network to accept input images of arbitrary size. The Pyramid Pooling Module (PPM) performs pooling operations at different scales on a single feature map and adjusts the resulting feature maps to a uniform size through upsampling so that they can be fused with the original feature map.
Inspired by this structure, a multi-scale fusion module is proposed; its structure is shown in fig. 4. Different input features are first adjusted to the same size through 1×1 convolution and upsampling operations. The result of the first processing layer is then passed to the next layer, and for each subsequent processing layer, the output of the previous layer is added to the input of the current layer and fused through a 3×3 convolution, yielding deeper features. Finally, the outputs of all processing layers are spliced and integrated according to the processing order of the layers to obtain the deep features of the face image. The deep features are flattened into a one-dimensional vector, on which feature reduction and feature abstraction are performed to obtain a probability distribution over the expression categories of the face image.
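The following PyTorch sketch mirrors this residual multi-scale structure; the common channel width, the bilinear upsampling mode and the choice of the first stage's resolution as the target size are assumptions.

```python
# Residual multi-scale fusion: 1x1 convolutions and upsampling bring every stage
# to a common size, each subsequent stage is added to the previous result and
# fused by a 3x3 convolution, and all outputs are concatenated at the end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    def __init__(self, in_channels_list, mid_channels=256):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, mid_channels, kernel_size=1) for c in in_channels_list
        )
        self.fuse = nn.ModuleList(
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
            for _ in range(len(in_channels_list) - 1)
        )

    def forward(self, features):
        target_size = features[0].shape[-2:]
        xs = [
            F.interpolate(r(f), size=target_size, mode="bilinear", align_corners=False)
            for r, f in zip(self.reduce, features)
        ]
        outs, prev = [xs[0]], xs[0]
        for fuse, x in zip(self.fuse, xs[1:]):
            prev = fuse(prev + x)       # add previous output to current input, fuse by 3x3 conv
            outs.append(prev)
        return torch.cat(outs, dim=1)   # splice all stage outputs into the deep feature
```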
In facial expression recognition, it is often necessary to precisely understand small changes of the face. The lower-level outputs of the network focus more on detailed facial features (such as edges and contours), while the higher-level outputs focus more on global information (such as the semantics of the whole expression); by fusing these different-level features, a more comprehensive facial expression representation can be obtained. Meanwhile, different expressions may differ more clearly at certain feature levels, and combining multi-level information allows the model to learn more discriminative features, thereby improving the accuracy and robustness of expression recognition.
2. Experimental results and analysis
2.1 Experimental Environment and optimization strategy
In this experiment, the Ubuntu 16.04 operating system, an Intel Core i9 CPU, 32 GB of memory and an NVIDIA RTX 3060 graphics processor with 12 GB of video memory were used, with CUDA 10.2 and cuDNN v8.4. Python 3.7 and PyTorch 1.7.1 were used.
For model optimization, SAM (Sharpness-Aware Minimization) is used. Traditional deep learning optimization methods, such as SGD or Adam, focus mainly on minimizing the loss function on the training set. However, these methods sometimes result in poor performance of the model on unseen data. SAM alleviates this problem by taking the sharpness of the loss into account, i.e. the sensitivity of the loss to changes in the model parameters. Specifically, SAM estimates the sharpness of the loss by adding a small perturbation to the model parameters; if a model is less sensitive to loss changes in the neighborhood of its parameters, it is more likely to perform well on unseen data, which promotes the generalization ability of the model. Adam is adopted as the base optimization method, the perturbation coefficient ρ is set to 0.05, the remaining hyperparameters follow the previously described strategy, the batch size is set to 64, the initial learning rate is 3.5e-5, the weight decay is 1e-4, and the total number of training iterations is set to 100. The learning rate schedule uses ExponentialLR with a gamma value of 0.98.
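For illustration, a minimal re-implementation of the two-step SAM update wrapped around an Adam base optimizer, using the hyperparameters listed above, might look as follows; this sketch follows the published SAM procedure and is not the exact training code used in the experiments.

```python
# Minimal Sharpness-Aware Minimization (SAM): perturb the weights towards the
# sharpest nearby point, recompute gradients there, then update the original weights.
import torch

class SAM(torch.optim.Optimizer):
    def __init__(self, params, base_optimizer_cls, rho=0.05, **kwargs):
        super().__init__(params, dict(rho=rho, **kwargs))
        self.base_optimizer = base_optimizer_cls(self.param_groups, **kwargs)

    @torch.no_grad()
    def first_step(self, zero_grad=False):
        grads = [p.grad for g in self.param_groups for p in g["params"] if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
        for group in self.param_groups:
            scale = group["rho"] / (grad_norm + 1e-12)
            for p in group["params"]:
                if p.grad is None:
                    continue
                e_w = p.grad * scale
                p.add_(e_w)                       # climb to the worst-case point in the rho-ball
                self.state[p]["e_w"] = e_w
        if zero_grad:
            self.zero_grad()

    @torch.no_grad()
    def second_step(self, zero_grad=False):
        for group in self.param_groups:
            for p in group["params"]:
                e_w = self.state[p].pop("e_w", None)
                if e_w is not None:
                    p.sub_(e_w)                   # restore the original weights
        self.base_optimizer.step()                # descend using gradients from the perturbed point
        if zero_grad:
            self.zero_grad()

# Hypothetical setup with the reported settings; `model` stands in for LFFNet.
model = torch.nn.Linear(10, 7)
optimizer = SAM(model.parameters(), torch.optim.Adam, rho=0.05, lr=3.5e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer.base_optimizer, gamma=0.98)
# Per batch: loss(model(x), y).backward(); optimizer.first_step(zero_grad=True)
#            loss(model(x), y).backward(); optimizer.second_step(zero_grad=True)
# scheduler.step() is then called once per epoch.
```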
2.2 Ablation experiments
To evaluate the effectiveness of LFFNet, an ablation experiment was performed on the RAFDB dataset. The experimental results comprise the selection of a backbone network, the influence of each module on the overall model and confusion matrix analysis.
(1) Backbone network selection. The performance of three different backbone networks, VGGNet, ResNet and Swin Transformer, in facial feature extraction was compared; the experimental results are shown in Table 1.
Table 1 Feature extraction effects of different backbone networks
Method              Accuracy (%)    Parameters (M)
VGGNet              86.22           138
ResNet18            86.70           11.18
Swin Transformer    87.32           28
As can be seen from the table, the Swin Transformer, with an accuracy of 87.32%, is superior to ResNet and VGGNet. This result shows that the Swin Transformer, with its ability to capture long-range dependencies, is effective in capturing more comprehensive facial features, whereas ResNet and VGGNet are conventional CNN architectures for which it is difficult to capture global dependencies in images due to the inherent limitations of convolution operations. In terms of parameter count, VGGNet is a typical deep convolutional neural network with relatively many parameters; the later ResNet model obtains a better result with fewer parameters by introducing residual connections; and the Swin Transformer, using a windowed attention method, achieves the best accuracy while maintaining an acceptable number of parameters.
(2) Effectiveness of each module. The proposed network model LFFNet fully exploits the complementary relation between facial key points and facial features, and the multi-scale fusion module exploits information at different scales to improve recognition performance. To verify the contribution of each proposed module, the feature fusion module and the multi-scale fusion module were added to the Swin Transformer backbone in sequence, and an ablation experiment was performed on the RAFDB dataset. The experimental results are shown in Table 2; it can be observed that the performance of the network model is effectively improved as the different modules are gradually introduced. Specifically, after the feature fusion module is added, the accuracy of the model increases by 1.28%; this module realizes fine fusion of facial key points and facial features at the pixel level and makes the greatest possible use of the valuable information in both. Furthermore, after the multi-scale fusion module is added, the accuracy of the model further improves by 0.54%; this module fuses the output features of each stage more deeply through residual connections and a pyramid structure, giving the model a richer feature representation.
Table 2 Results of the effectiveness verification of each module of the model
Method                                                                  Accuracy (%)
Swin Transformer                                                        87.32
Swin Transformer + feature fusion module                                88.60
Swin Transformer + feature fusion module + multi-scale fusion module    89.14
(3) Confusion matrix analysis. To further investigate the details of model performance, the proposed model LFFNet was compared with the base network Swin Transformer on each facial expression category; the comparison results are shown in Table 3 and fig. 5. The results show that LFFNet outperforms the base network in every expression category. This stems mainly from the model's effective fusion of facial features and facial key points, which provides richer cues for recognizing facial expressions. In addition, by using the multi-scale information fusion mechanism, the model can learn more discriminative features and thus obtain better results in the facial expression recognition task. Fig. 6 shows the confusion matrices of the Swin Transformer and the LFFNet model on the RAFDB dataset, confirming the improvement in the overall performance of the model.
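As a side note on the evaluation itself, per-class recall values and confusion matrices of this kind are typically computed as in the following illustrative snippet (toy labels, not the authors' evaluation code):

```python
# Per-class recall and confusion matrix from ground-truth labels and predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([0, 1, 2, 2, 1, 0, 3])   # ground-truth emotion labels (toy example)
y_pred = np.array([0, 1, 2, 1, 1, 0, 3])   # model predictions (toy example)

cm = confusion_matrix(y_true, y_pred)                          # rows: true class, columns: predicted class
per_class_recall = recall_score(y_true, y_pred, average=None)  # recall for each emotion class
print(cm)
print(per_class_recall)
```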
Table 3 Recall results of Swin Transformer and LFFNet on different expression categories
2.3 Comparison with existing models
To evaluate the effectiveness of the proposed model, its best results are compared with several existing methods on the RAFDB and SFEW datasets. The experimental results are shown in Table 4, Table 5, fig. 7 and fig. 8. On the RAFDB dataset, the proposed LFFNet model achieves the best accuracy among the compared methods, reaching 89.14%, which shows that the model has stronger performance in the expression recognition task. Similarly, LFFNet also achieves the best accuracy, 60.44%, on the SFEW dataset.
Some of these models adopt a similar approach to the expression recognition task. The RAN model decomposes the facial image into regions using facial key points and introduces a region bias loss to guide the model towards the most important regions. The ACNN model divides the input facial feature map into multiple sub-feature maps, assigns attention weights to each feature sub-map and to the global feature map, and then simply concatenates all feature vectors into one vector. The MANet model uses a dual-branch network to extract and fuse multi-scale features and local features. The VTFF model uses the RGB image and the LBP image as inputs, selects the most important features with an attention module, and then integrates these features using a Transformer encoder.
Table 4 Comparison with other methods on the RAFDB dataset
Method          Accuracy (%)    Year
DLP-CNN         84.22           2019
gACNN           85.07           2019
SCN             88.14           2020
MANet           88.36           2021
DACL            87.78           2021
VTFF            88.14           2021
RAN             86.90           2021
MixAugment      87.54           2022
LFFNet (ours)   89.14           2024
Table 5 Comparison with other methods on the SFEW dataset
Method          Accuracy (%)    Year
DLP-CNN         51.05           2019
IPFR            55.10           2019
LDL-ALSG        56.50           2020
RAN             56.40           2021
MANet           59.40           2021
LFFNet (ours)   60.44           2024
Other models explore expression recognition from different angles. The SCN model is considered to be difficult to label in high quality due to the subjectivity of labeling and uncertainty of facial images, starting from the label perspective. Thus, SCN weights each sample using the attention mechanism, giving less importance to uncertain facial images and modifying their labels. The LDL-ALSG model builds a similarity list through the facial action units and facial key points from the expression distribution point of view, and improves facial expression recognition performance by minimizing the probability of a center image and a nearest neighbor image while minimizing the actual value and the predicted value. The DACL model proposes a deep attention center loss from the point of view of the loss function, combining the attention mechanism and the sparse center loss to adaptively select the most important feature element subset. From the point of data enhancement, the MixAugment model generates a new virtual expression image by mixing different expression images so as to increase the diversity of training data.
Table 6 Recall results of each model on different expression categories
To further compare the performance differences of the different models, Table 6 and fig. 9 show the recall results of each model on the different expression categories. As can be seen, the proposed LFFNet model achieves recall rates of 88.24%, 96.20%, 90.79% and 67.57% on the neutral, happy, sad and disgust categories respectively, showing strong recognition capability, while its performance on the surprise, fear and anger categories is weaker. Overall, although each model shows some differences across the expression categories, the LFFNet model achieves excellent results in most cases, demonstrating its strong performance in the expression recognition task. Meanwhile, the exploration and optimization of each model in different respects provide rich ideas and methods for research in the field of expression recognition.
3. Summary
An emotion recognition model based on the facial feature and key point fusion network LFFNet is proposed, which aims to fully extract and fuse facial feature and facial key point information by means of a self-attention mechanism, a feature fusion method and a multi-scale technique, so as to overcome the difficulty of comprehensively capturing scene details with a single type of feature. The model adopts the Swin Transformer as its backbone network and uses the windowed self-attention mechanism to finely capture the abundant detail information of the facial image without significantly increasing the computational complexity. The feature fusion module optimizes the information fusion between facial key points and facial features at the pixel level, ensuring the maximum retention and utilization of valuable information. The multi-scale fusion module achieves deep multi-scale fusion by combining a pyramid structure with residual connections, further enriching the feature representation. Ablation experiments verify the improvement brought by each module to the model performance, and competitive results compared with the prior art are obtained on the two public facial expression recognition datasets RAFDB and SFEW.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (6)

1. An emotion recognition method based on a facial feature and key point fusion network, comprising:
constructing an emotion recognition model based on the facial feature and key point fusion network LFFNet, wherein the emotion recognition model comprises a backbone network Swin Transformer, a multi-scale fusion module and a fully connected layer FC, and the backbone network Swin Transformer comprises a plurality of processing layers (blocks), a feature fusion module and a key point extraction module MobileFaceNet;
obtaining a face image to be recognized and inputting it into the emotion recognition model; extracting facial features of the face image through the processing layers, and extracting facial key points of the face image through the key point extraction module MobileFaceNet; fusing the facial features extracted by each processing layer with the facial key points extracted by each key point extraction module MobileFaceNet through the feature fusion module, to obtain a similarity map of pixel points at corresponding positions in the two feature maps extracted by adjacent processing layers; splicing and integrating all similarity maps according to the processing order of the processing layers through the multi-scale fusion module, to obtain deep features; and performing full-connection processing on the deep features through the fully connected layer FC, to obtain an emotion recognition result.
2. The emotion recognition method based on a facial feature and key point fusion network of claim 1, wherein the backbone network Swin Transformer further comprises a windowed multi-head self-attention module W_MSA and a shift windowed multi-head self-attention module SW_MSA; self-attention is calculated inside each processing layer by the windowed multi-head self-attention module W_MSA, and self-attention is exchanged across processing layers by the shift windowed multi-head self-attention module SW_MSA, according to:
ẑ^l = W_MSA(LN(z^(l-1))) + z^(l-1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW_MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
where ẑ^l is the output feature of the l-th windowed multi-head self-attention module W_MSA or shift windowed multi-head self-attention module SW_MSA, z^l is the output feature of the l-th multi-layer perceptron module MLP, and LN is the layer normalization operation.
3. The emotion recognition method based on a facial feature and keypoint fusion network of claim 1, wherein the keypoint extraction module MobileFaceNet is a 68-point model, and the extraction of facial keypoints of a face image by the keypoint extraction module MobileFaceNet comprises:
detecting 68 key points of face image data including eyes, eyebrows, nose, mouth and face contours;
generating a heatmap for each key point, and determining the coordinates of each key point from the peak position of its heatmap.
4. The emotion recognition method based on a facial feature and keypoint fusion network of claim 1, wherein the fusing, by the feature fusion module, of the facial features extracted by each processing layer and the facial keypoints extracted by each keypoint extraction module MobileFaceNet comprises:
The pixel attention mechanism aligns, by convolution, the feature map of the facial feature x extracted by the backbone network Swin Transformer with the feature map of the facial key point y extracted by the key point extraction module MobileFaceNet;
performing point multiplication on the pixel points at corresponding positions of the two feature maps to obtain a similarity map quantifying the importance of the facial key point features relative to the facial feature x at each pixel position;
processing the similarity map through a Sigmoid function to obtain, for each pixel point on the feature map, a feature influence value between 0 and 1, where a value close to 1 indicates that the y feature at that pixel contributes more to the fusion output, and a value close to 0 indicates that the x feature at that pixel contributes more;
performing weighted addition of the aligned feature map of the facial feature x and the feature map of the facial key point y using the computed feature influence values, to obtain the final fusion feature, according to the formulas:
σ = Sigmoid(f_x(v_x) · f_y(v_y))
output = σ·v_x + (1-σ)·v_y
where v_x is the corresponding pixel in the feature map of the facial feature, v_y is the corresponding pixel in the feature map of the facial key point, f is the alignment operation, and σ is the probability that the two pixels belong to the same object.
5. The emotion recognition method based on a facial feature and keypoint fusion network of claim 1, wherein the multi-scale fusion module comprises a plurality of 1×1 convolution layers and a plurality of 3×3 convolution layers, and all similarity maps are spliced and integrated according to a processing sequence of a processing layer by the multi-scale fusion module, comprising:
Different input features are adjusted to the same size through a 1×1 convolution layer and an upsampling operation;
the result of the first processing layer is transmitted to the next layer; for each subsequent processing layer, the output of the previous processing layer is added to the input of the current processing layer and fused through a 3×3 convolution to obtain deeper features;
And splicing and integrating the outputs of all the processing layers according to the processing sequence of the processing layers to obtain the deep features of the face image.
6. The emotion recognition method based on the facial feature and key point fusion network of claim 1, wherein the performing full connection processing on the deep features through the full connection layer FC comprises:
Flattening the deep features of the face image into one-dimensional vectors, and carrying out feature reduction and feature abstraction on the one-dimensional vectors to obtain probability distribution of the expression category in the face image.
CN202411349869.7A 2024-09-26 2024-09-26 Emotion recognition method based on facial feature and key point fusion network Pending CN119323815A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411349869.7A CN119323815A (en) 2024-09-26 2024-09-26 Emotion recognition method based on facial feature and key point fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411349869.7A CN119323815A (en) 2024-09-26 2024-09-26 Emotion recognition method based on facial feature and key point fusion network

Publications (1)

Publication Number Publication Date
CN119323815A true CN119323815A (en) 2025-01-17

Family

ID=94231166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411349869.7A Pending CN119323815A (en) 2024-09-26 2024-09-26 Emotion recognition method based on facial feature and key point fusion network

Country Status (1)

Country Link
CN (1) CN119323815A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020248376A1 (en) * 2019-06-14 2020-12-17 平安科技(深圳)有限公司 Emotion detection method and apparatus, electronic device, and storage medium
CN117373095A (en) * 2023-11-02 2024-01-09 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) A facial expression recognition method and system for cross-fusion of local and global information
CN118247821A (en) * 2024-03-11 2024-06-25 江苏理工学院 A method for human emotion recognition based on hybrid attention mechanism and multi-scale feature fusion

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119672787A (en) * 2025-02-20 2025-03-21 中国人民解放军国防科技大学 A multi-scale fusion expression recognition method and system based on visual transformer
CN119672787B (en) * 2025-02-20 2025-05-13 中国人民解放军国防科技大学 Multi-scale fusion expression recognition method and system based on visual transducer

Similar Documents

Publication Publication Date Title
CN111582225B (en) A remote sensing image scene classification method and device
CN115050064B (en) Human face liveness detection method, device, equipment and medium
EP3968179A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
CN112766158A (en) Multi-task cascading type face shielding expression recognition method
CN111160533A (en) Neural network acceleration method based on cross-resolution knowledge distillation
CN113177559B (en) Image recognition method, system, equipment and medium combining breadth and dense convolutional neural network
CN110689523A (en) Personalized image information evaluation method based on meta-learning and information data processing terminal
CN112801146A (en) Target detection method and system
CN110414344A (en) A kind of human classification method, intelligent terminal and storage medium based on video
CN112084913B (en) End-to-end human body detection and attribute identification method
CN115050075B (en) Cross-granularity interactive learning micro-expression image labeling method and device
CN104933428A (en) Human face recognition method and device based on tensor description
CN116758621B (en) Self-attention mechanism-based face expression depth convolution identification method for shielding people
CN115147641A (en) A Video Classification Method Based on Knowledge Distillation and Multimodal Fusion
CN112597324A (en) Image hash index construction method, system and equipment based on correlation filtering
CN107871103A (en) Face authentication method and device
CN118823856A (en) A facial expression recognition method based on multi-scale and deep fine-grained feature enhancement
CN114863186A (en) Three-dimensional model classification method based on double transform branches
CN119323815A (en) Emotion recognition method based on facial feature and key point fusion network
CN120220210A (en) A micro-expression recognition method based on cross-domain feature center-assisted emotion intensity invariant feature extraction
CN114782983A (en) Road scene pedestrian detection method based on improved feature pyramid and boundary loss
CN116645607B (en) Remote sensing scene classification method based on context attention
CN119559691A (en) A classroom behavior detection method based on dimensional diffusion perception adaptive fusion
CN115410223B (en) A domain-generalized person re-identification method based on invariant feature extraction
CN118644674A (en) A small sample medical image segmentation method based on multi-level feature guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination