Disclosure of Invention
Based on the above, it is necessary to provide an emotion recognition method based on a facial feature and key point fusion network.
The embodiment of the invention provides an emotion recognition method based on a facial feature and key point fusion network, which comprises the following steps:
The method comprises the steps of constructing an emotion recognition model based on a facial feature and key point fusion network LFFNet, wherein the emotion recognition model comprises a backbone network Swin Transformer, a multi-scale fusion module and a fully-connected layer FC, and the backbone network Swin Transformer comprises a plurality of processing layers (blocks), feature fusion modules and key point extraction modules MobileFaceNet;
The method comprises the steps of obtaining a face image to be recognized, inputting the face image to be recognized into the emotion recognition model, extracting facial features of the face image through the processing layers, extracting facial key points of the face image through the key point extraction modules MobileFaceNet, fusing the facial features extracted by each processing layer with the facial key points extracted by each key point extraction module MobileFaceNet through the feature fusion modules to obtain a similarity map of the pixel points at corresponding positions in the two feature maps to be fused, splicing and integrating all the similarity maps according to the processing order of the processing layers through the multi-scale fusion module to obtain deep features, and performing full-connection processing on the deep features through the fully-connected layer FC to obtain an emotion recognition result.
Optionally, the backbone network Swin Transformer further includes a windowed multi-head self-attention module W_MSA and a shifted windowed multi-head self-attention module SW_MSA; self-attention is computed within each processing layer by the windowed multi-head self-attention module W_MSA, and self-attention information is exchanged between different processing layers by the shifted windowed multi-head self-attention module SW_MSA, as follows:
ẑ^l = W_MSA(LN(z^(l-1))) + z^(l-1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW_MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
wherein ẑ^l and ẑ^(l+1) are the output features of the l-th windowed multi-head self-attention module W_MSA and the l-th shifted windowed multi-head self-attention module SW_MSA, respectively, z^l is the output feature of the l-th multi-layer perceptron module MLP, and LN is the layer normalization operation.
Optionally, the key point extraction module MobileFaceNet is a 68-point model, and extracting the facial key points of the face image through the key point extraction module MobileFaceNet includes:
detecting 68 key points of the face image data, covering the eyes, eyebrows, nose, mouth and facial contour;
generating a heatmap for each key point, and determining the coordinates of each key point based on the peak position of its heatmap.
Optionally, the feature fusion module fuses the facial features extracted by each processing layer and the facial key points extracted by each key point extraction module MobileFaceNet, including:
The pixel attention mechanism aligns the feature map of the facial feature x extracted by the backbone network Swin Transformer with the feature map of the facial key point y extracted by the key point extraction module MobileFaceNet by convolution;
Performing point multiplication on the pixels at corresponding positions of the two feature maps to obtain a similarity map that quantifies the importance of the facial key point features relative to the facial features x at each pixel position;
Processing the similarity map through a Sigmoid function to obtain, for each pixel of the feature map, a feature influence value between 0 and 1, wherein a value close to 1 indicates that the y feature at that pixel contributes more to the fusion output, and a value close to 0 indicates that the x feature at that pixel contributes more;
and performing weighted addition of the aligned feature map of the facial features x and the feature map of the facial key points y using the calculated feature influence values to obtain the final fused feature, according to the following formulas:
σ = Sigmoid(f_x(v_x) · f_y(v_y))
output = σ·v_x + (1 − σ)·v_y
where v_x is the corresponding pixel in the facial feature map, v_y is the corresponding pixel in the facial key point feature map, f denotes the alignment operation, and σ is the probability that the two pixels belong to the same object.
Optionally, the multi-scale fusion module includes a plurality of 1×1 convolution layers and a plurality of 3×3 convolution layers, and performs splicing and integration on all similarity maps according to a processing sequence of the processing layers through the multi-scale fusion module, including:
adjusting the different input features to the same size through a 1×1 convolutional layer and an upsampling operation;
transmitting the result of the first processing layer to the next layer, and, for each subsequent processing layer, adding the output of the previous processing layer to the input of the current layer and fusing the result through a 3×3 convolution to obtain deeper features;
And splicing and integrating the outputs of all the processing layers according to the processing sequence of the processing layers to obtain the deep features of the face image.
Optionally, the fully-connected processing is performed on the deep features through the fully-connected layer FC, including:
Flattening the deep features of the face image into a one-dimensional vector, and performing feature reduction and feature abstraction on this vector to obtain a probability distribution over the expression categories in the face image.
Compared with the prior art, the emotion recognition method based on the facial feature and key point fusion network provided by the embodiment of the invention has the following beneficial effects:
According to the invention, the facial features extracted by each processing layer are fused with the facial key points extracted by each key point extraction module MobileFaceNet through the feature fusion module, and all similarity maps are spliced and integrated according to the processing order of the processing layers through the multi-scale fusion module. This solves the problem that a convolutional neural network in the prior art cannot fully identify the emotion feature information of a face image from a single type of feature; moreover, the fused information is aggregated again while the facial key points and facial features are fused at multiple scales, so that the valuable information of both the facial key points and the facial features is preserved and exploited to the greatest extent.
In addition, all similarity maps are spliced and integrated according to the processing order of the processing layers through the multi-scale fusion module to obtain deep features, achieving deep multi-scale feature fusion and providing stronger performance and expressive power for the facial emotion recognition task.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In one embodiment, an emotion recognition method based on a facial feature and key point fusion network is provided, the method comprising:
1. Model design
1.1 Model Structure
An emotion recognition model based on the facial feature and key point fusion network LFFNet is constructed. The overall model framework is shown in fig. 1 and comprises a backbone network Swin Transformer, a multi-scale fusion module and a fully-connected layer FC. The backbone network Swin Transformer comprises a plurality of processing layers (block1, block2, block3 and block4), feature fusion modules and key point extraction modules MobileFaceNet; four processing layers are arranged, a key point extraction module MobileFaceNet is arranged between adjacent processing layers, the adjacent processing layers and the key point extraction module MobileFaceNet are connected through a feature fusion module, and the output of each processing layer is connected with the multi-scale fusion module.
The method comprises the steps of obtaining a face image to be recognized, inputting the face image to be recognized into the emotion recognition model, extracting facial features of the face image through the processing layers, extracting facial key points of the face image through the key point extraction modules MobileFaceNet, fusing the facial features extracted by each processing layer with the facial key points extracted by each key point extraction module MobileFaceNet through the feature fusion modules to obtain a similarity map of the pixel points at corresponding positions in the two feature maps to be fused, splicing and integrating all the similarity maps according to the processing order of the processing layers through the multi-scale fusion module to obtain deep features, and performing full-connection processing on the deep features through the fully-connected layer FC to obtain an emotion recognition result.
The feature maps output by the different layers are transmitted and spliced in sequence, so that deeper features are obtained layer by layer and multi-scale fusion is realized. In the fine-grained task of facial expression recognition, this hierarchical fusion structure attends to the detailed features and the global information of the face at the same time, enabling the model to learn more discriminative features.
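For orientation, a minimal PyTorch sketch of the data flow described above is given below. It is an illustrative sketch rather than the actual implementation: the class name, constructor arguments and feature dimensions are assumptions, and for simplicity one key point branch and one fusion block are paired with every processing layer, whereas in the model described here the key point extraction modules sit between adjacent processing layers.

```python
import torch
import torch.nn as nn

class LFFNetSketch(nn.Module):
    """Illustrative sketch of the LFFNet data flow (not the actual implementation)."""
    def __init__(self, stages, kp_branches, fusion_blocks, ms_fusion,
                 feat_dim=4096, num_classes=7):
        super().__init__()
        self.stages = nn.ModuleList(stages)            # Swin Transformer processing layers (block1..block4)
        self.kp_branches = nn.ModuleList(kp_branches)  # MobileFaceNet-style key point extractors
        self.fusion = nn.ModuleList(fusion_blocks)     # pixel-attention feature fusion modules
        self.ms_fusion = ms_fusion                     # multi-scale fusion module
        self.fc = nn.Linear(feat_dim, num_classes)     # fully-connected layer FC (dimensions assumed)

    def forward(self, img):
        stage_outputs = []
        x = img
        for stage, kp, fuse in zip(self.stages, self.kp_branches, self.fusion):
            x = stage(x)                 # facial features from the current processing layer
            y = kp(img)                  # facial key point features for the same face image
            x = fuse(x, y)               # pixel-level fusion of facial features and key points
            stage_outputs.append(x)      # one fused (similarity-weighted) map per stage
        deep = self.ms_fusion(stage_outputs)   # splice and integrate all stage outputs
        deep = torch.flatten(deep, 1)          # flatten the deep features to a one-dimensional vector
        return self.fc(deep)                   # emotion recognition result (class scores)
```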
1.2 Backbone network Swin Transformer
Facial features focus mainly on the texture and color information of the face, such as skin color and facial wrinkles. They capture the visual attributes of the face and are very useful for identifying specific facial expressions or attributes of an individual such as age and gender. These features can be obtained either through manual design or automatically through deep learning models.
The Transformer network can capture the relation between any two positions in the input sequence without being limited by the sequence length, thereby effectively avoiding the performance degradation that traditional sequence models suffer on long sequences. However, owing to the computational complexity of its self-attention mechanism, the conventional Transformer model faces huge computational and memory overhead when processing large images, and may lose local information due to window size limitations, leading to an insufficient understanding of the whole image. Therefore, a local window mechanism is introduced, allowing the model to attend only to local regions in each attention operation; this avoids the complexity of global attention computation while ensuring that the model captures rich local detail information. As shown in fig. 2, the backbone network Swin Transformer further includes a windowed multi-head self-attention module (Window Multi-head Self-Attention, W_MSA) and a shifted windowed multi-head self-attention module (Shifted Window Multi-head Self-Attention, SW_MSA). Self-attention is computed within each processing layer by the windowed multi-head self-attention module W_MSA, and self-attention information is exchanged between different processing layers by the shifted windowed multi-head self-attention module SW_MSA, enhancing the expressive capacity of the model. The calculation formulas are:
ẑ^l = W_MSA(LN(z^(l-1))) + z^(l-1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW_MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
where ẑ^l and ẑ^(l+1) are the output features of the l-th windowed multi-head self-attention module W_MSA and the l-th shifted windowed multi-head self-attention module SW_MSA, respectively, z^l is the output feature of the l-th multi-layer perceptron module MLP, and LN is the layer normalization operation.
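A minimal sketch of the pair of blocks described by the formulas above, assuming the standard pre-norm residual arrangement; the w_msa and sw_msa attention modules are passed in as arguments rather than implemented here.

```python
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Sketch of two consecutive blocks: a W_MSA block followed by an SW_MSA block."""
    def __init__(self, dim, w_msa, sw_msa, mlp_ratio=4):
        super().__init__()
        self.w_msa, self.sw_msa = w_msa, sw_msa                # window / shifted-window attention modules
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        hidden = dim * mlp_ratio
        self.mlp1 = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, z):
        z_hat = self.w_msa(self.norm1(z)) + z        # ẑ^l     = W_MSA(LN(z^(l-1))) + z^(l-1)
        z = self.mlp1(self.norm2(z_hat)) + z_hat     # z^l     = MLP(LN(ẑ^l)) + ẑ^l
        z_hat = self.sw_msa(self.norm3(z)) + z       # ẑ^(l+1) = SW_MSA(LN(z^l)) + z^l
        z = self.mlp2(self.norm4(z_hat)) + z_hat     # z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
        return z
```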
Facial expression recognition requires accurately capturing subtle changes of the face, including slight movements in local regions such as the eyes, mouth corners and eyebrows, which are key cues for judging an expression. The Swin Transformer can focus more effectively on these fine-grained local features through its windowed self-attention mechanism. By using a local window mechanism, it allows the model to attend only to local regions in each attention operation, avoiding the complexity of global attention computation while ensuring that rich detail information is captured. At the same time, considering the requirements of facial expression recognition in practical applications, for example running on mobile phones, embedded systems and other resource-constrained devices, the design of the Swin Transformer significantly reduces the demand for computing resources by restricting the self-attention range and adopting hierarchical processing. Therefore, the Swin Transformer is used as the backbone network for facial feature extraction, making full use of its advantages in capturing fine-grained features and in computational efficiency, and improving the accuracy and efficiency of recognition.
1.3 Key point extraction module MobileFaceNet
Facial key points (facial landmarks) are key points on the face that generally correspond to salient facial features, such as the contours of the eyes, nose and mouth and the edges of the chin and cheeks. In facial image analysis, facial key points provide a sparse representation of the key facial regions that is relatively insensitive to inherent variations of the face such as skin color, gender, age and different backgrounds, thereby effectively reducing intra-class differences.
The MobileFaceNet model used here is a 68-point model: it detects 68 key points of the face image data, covering the eyes, eyebrows, nose, mouth and facial contour, generates a heatmap for each key point, and determines the coordinates of each key point from the peak position of its heatmap.
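A minimal sketch of the peak-based coordinate recovery described above; the heatmap tensor shape is an assumption, and sub-pixel refinement is omitted for brevity.

```python
import torch

def heatmaps_to_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
    """Convert heatmaps of shape (B, 68, H, W) into coordinates of shape (B, 68, 2).

    Each key point coordinate is taken at the peak (argmax) of its heatmap,
    as described above.
    """
    b, k, h, w = heatmaps.shape
    flat = heatmaps.view(b, k, -1)
    idx = flat.argmax(dim=-1)                 # index of the peak in each flattened heatmap
    xs = (idx % w).float()                    # column index -> x coordinate
    ys = (idx // w).float()                   # row index    -> y coordinate
    return torch.stack([xs, ys], dim=-1)
```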
1.4 Feature fusion Module
In the facial expression recognition task, facial key points and facial features each play a different role. The facial key points indicate the key regions to which careful attention should be paid, while the facial features provide support for capturing the global information of the face. Carefully fusing these two kinds of features is therefore very important for improving the accuracy of expression recognition. To this end, a feature fusion module based on Pixel Attention (PA) is used, which aims to precisely integrate the effective features originating from the different branches and to fuse the valuable information of the facial key points and facial features at the pixel level.
Specifically, the pixel attention mechanism first aligns the feature map of the facial features x extracted by the backbone network Swin Transformer with the feature map of the facial key points y extracted by the key point extraction module MobileFaceNet through convolution, and then performs a point-wise multiplication on the pixels at each corresponding position of the two feature maps to obtain a similarity map that quantifies the importance of the y features relative to x at each pixel position. After further mapping through a Sigmoid function, a value between 0 and 1 is obtained for each pixel: a value close to 1 indicates that at this position the contribution of the y feature to the fusion output is more critical, whereas a value close to 0 means that at this position the x feature is more dominant. Finally, the two feature maps are combined with the calculated weights to obtain the final fused feature. This mechanism not only preserves the characteristics of x but also effectively integrates the supplementary information provided by y, thereby enhancing the expressive power of the model. The calculation is as follows:
σ = Sigmoid(f_x(v_x) · f_y(v_y))
output = σ·v_x + (1 − σ)·v_y
where v_x is the corresponding pixel in the facial feature map, v_y is the corresponding pixel in the facial key point feature map, f denotes the alignment operation, and σ is the probability that the two pixels belong to the same object.
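A minimal sketch of this pixel-attention fusion, assuming the alignment operations f_x and f_y are 1×1 convolutions and that the two feature maps have already been brought to the same spatial size; it is an illustration of the two formulas above, not the authors' exact module.

```python
import torch
import torch.nn as nn

class PixelAttentionFusion(nn.Module):
    """Sketch of pixel-level fusion of facial features x and key point features y."""
    def __init__(self, x_channels, y_channels, out_channels):
        super().__init__()
        # f_x and f_y: alignment operations, assumed here to be 1x1 convolutions
        self.f_x = nn.Conv2d(x_channels, out_channels, kernel_size=1)
        self.f_y = nn.Conv2d(y_channels, out_channels, kernel_size=1)

    def forward(self, x, y):
        vx, vy = self.f_x(x), self.f_y(y)          # aligned feature maps
        sigma = torch.sigmoid(vx * vy)             # σ = Sigmoid(f_x(v_x) · f_y(v_y)), element-wise
        return sigma * vx + (1 - sigma) * vy       # output = σ·v_x + (1 − σ)·v_y
```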
In facial expression recognition, facial key points typically focus on the key regions of an expression, while facial features cover a wider range of global information. When the similarity between the facial key points and the facial features at a certain position is high, the features they capture at that position are highly consistent and the key point feature at that position is particularly important; in this case it is reasonable to rely on the information provided by the facial key points to enrich the final feature representation. Conversely, a low similarity indicates that the facial feature information should be retained to a greater extent. Moreover, since facial key points are fine-grained features, fusing at the pixel level combines the facial key points and facial features more effectively than other fusion methods. The basic structure of the feature fusion module is shown in fig. 3.
1.5 Multi-scale fusion module based on residual connection
Multi-scale fusion can comprehensively exploit the information obtained at different scales to improve recognition performance. The core idea is that data at different scales often reveal different characteristics of an image: small-scale data provide rich detail, while large-scale data provide more context and a global view. By effectively fusing this multi-level information, a more comprehensive and accurate image analysis result can be obtained. Introducing a pooling layer that generates a fixed-size feature representation enables the network to accept input images of arbitrary size. The Pyramid Pooling Module (PPM) performs pooling operations at different scales on a single feature map and resizes the resulting feature maps to a uniform size through upsampling so that they can be fused with the original feature map.
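For reference, a minimal sketch of the generic pyramid pooling idea just described (the design that inspired the module below, not part of LFFNet itself); the pooling scales (1, 2, 3, 6) and the branch width are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingSketch(nn.Module):
    """Illustrative PPM: pool at several scales, project, upsample, concatenate."""
    def __init__(self, in_channels, branch_channels=64, scales=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(in_channels, branch_channels, kernel_size=1))
            for s in scales)

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = [F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
                  for branch in self.branches]      # resize every pooled map back to the input size
        return torch.cat([x] + pooled, dim=1)       # fuse with the original feature map
```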
Inspired by this network structure, a multi-scale fusion module is proposed. Its structure is shown in fig. 4: the different input features are first adjusted to the same size through 1×1 convolution and upsampling operations; the result of the first processing layer is then passed to the next layer, and, for each subsequent processing layer, the output of the previous layer is added to the input of the current layer and fused through a 3×3 convolution to obtain deeper features. Finally, the outputs of all processing layers are spliced and integrated according to their processing order to obtain the deep features of the face image. These deep features are flattened into a one-dimensional vector, and feature reduction and feature abstraction are performed on this vector to obtain a probability distribution over the expression categories in the face image.
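A minimal sketch of the multi-scale fusion path just described: a 1×1 convolution and upsampling per stage, residual addition with the previous stage followed by a 3×3 convolution, and a final concatenation in processing order. The channel counts and the common spatial size are assumptions; the flatten-and-classify head is indicated only in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionSketch(nn.Module):
    """Illustrative multi-scale fusion over the outputs of the four processing layers."""
    def __init__(self, in_channels_list, mid_channels=128, out_size=(7, 7)):
        super().__init__()
        self.out_size = out_size
        self.align = nn.ModuleList(nn.Conv2d(c, mid_channels, kernel_size=1)
                                   for c in in_channels_list)           # 1x1 conv per stage
        self.smooth = nn.ModuleList(nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
                                    for _ in in_channels_list[1:])      # 3x3 conv after each residual add

    def forward(self, feats):
        # 1) bring every stage output to the same channel count and spatial size
        aligned = [F.interpolate(a(f), size=self.out_size, mode="bilinear", align_corners=False)
                   for a, f in zip(self.align, feats)]
        # 2) pass the first stage on; for each later stage, add the previous result and fuse with 3x3 conv
        outputs = [aligned[0]]
        prev = aligned[0]
        for x, conv in zip(aligned[1:], self.smooth):
            prev = conv(prev + x)
            outputs.append(prev)
        # 3) splice all stage outputs in processing order to obtain the deep feature;
        #    downstream, the result is flattened and passed through the FC layer for classification
        return torch.cat(outputs, dim=1)
```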
In facial expression recognition it is often necessary to understand small changes of the face precisely. The low-level outputs of the network focus more on detailed features of the face (such as edges and contours), while the high-level outputs focus more on global information (such as the semantics of the whole expression); by fusing these features from different levels, a more comprehensive facial expression representation can be obtained. At the same time, different expressions may differ more obviously at certain feature levels, and by combining multi-level information the model can learn more discriminative features, thereby improving the accuracy and robustness of expression recognition.
2. Experimental results and analysis
2.1 Experimental Environment and optimization strategy
In this experiment, the Ubuntu 16.04 operating system, an Intel Core i9 CPU, 32 GB of memory, and an NVIDIA RTX 3060 graphics processor with 12 GB of video memory were used, together with CUDA 10.2 and cuDNN v8.4. In terms of programming language, Python 3.7 and PyTorch 1.7.1 were used.
In terms of model optimization, SAM (Sharpness-Aware Minimization) is used as the optimization strategy. Traditional deep learning optimization methods such as SGD or Adam focus mainly on minimizing the loss function on the training set. However, these methods sometimes result in poor performance on unseen data. SAM alleviates this problem by additionally taking into account the sharpness of the loss, i.e. the sensitivity of the loss to changes in the model parameters. Specifically, SAM estimates the sharpness of the loss by adding a small perturbation to the model parameters; if a model is insensitive to changes of the loss within a neighborhood of its parameters, it is more likely to perform well on unseen data, which promotes the generalization ability of the model. Adam is adopted as the base optimization method, the perturbation coefficient ρ is set to 0.05, the remaining hyper-parameters follow the strategy of the third chapter, the batch size is set to 64, the initial learning rate is 3.5e-5, the weight decay is 1e-4, and the total number of training iterations is set to 100. The learning rate schedule uses ExponentialLR with the gamma value set to 0.98.
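A minimal sketch of the SAM two-step update with Adam as the base optimizer, using the hyper-parameters listed above (ρ = 0.05, learning rate 3.5e-5, weight decay 1e-4). This follows the published SAM procedure in simplified form and is not necessarily the authors' exact training code.

```python
import torch

class SAM(torch.optim.Optimizer):
    """Minimal Sharpness-Aware Minimization wrapper around a base optimizer (sketch)."""
    def __init__(self, params, base_optimizer_cls, rho=0.05, **kwargs):
        defaults = dict(rho=rho, **kwargs)
        super().__init__(params, defaults)
        self.base_optimizer = base_optimizer_cls(self.param_groups, **kwargs)

    @torch.no_grad()
    def first_step(self):
        # perturb the weights towards the locally "sharpest" direction: w <- w + rho * g / ||g||
        grad_norm = torch.norm(torch.stack([p.grad.norm(2) for g in self.param_groups
                                            for p in g["params"] if p.grad is not None]), 2)
        for g in self.param_groups:
            scale = g["rho"] / (grad_norm + 1e-12)
            for p in g["params"]:
                if p.grad is None:
                    continue
                e = p.grad * scale
                p.add_(e)
                self.state[p]["e"] = e          # remember the perturbation so it can be undone
        self.zero_grad()

    @torch.no_grad()
    def second_step(self):
        for g in self.param_groups:
            for p in g["params"]:
                if "e" in self.state[p]:
                    p.sub_(self.state[p]["e"])  # restore the original weights
        self.base_optimizer.step()              # update with the gradient from the perturbed point
        self.zero_grad()

# usage per batch (two forward/backward passes):
#   optimizer = SAM(model.parameters(), torch.optim.Adam, rho=0.05, lr=3.5e-5, weight_decay=1e-4)
#   loss_fn(model(x), y).backward(); optimizer.first_step()
#   loss_fn(model(x), y).backward(); optimizer.second_step()
```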
2.2 Ablation experiments
To evaluate the effectiveness of LFFNet, ablation experiments were performed on the RAFDB dataset. The experiments cover the choice of backbone network, the influence of each module on the overall model, and a confusion matrix analysis.
(1) Backbone network selection. The performance of three different backbone networks, ResNet, VGGNet and Swin Transformer, in facial feature extraction was compared; the experimental results are shown in Table 1.
Table 1 Feature extraction effects of different backbone networks

| Method | Accuracy (%) | Parameters (M) |
| --- | --- | --- |
| VGGNet | 86.22 | 138 |
| ResNet18 | 86.70 | 11.18 |
| Swin Transformer | 87.32 | 28 |
As can be seen from the table, the Swin Transformer outperforms ResNet and VGGNet with an accuracy of 87.32%. This result shows that the Swin Transformer, with its ability to capture long-range dependencies, is effective in capturing more comprehensive facial features, whereas ResNet and VGGNet, as conventional CNN architectures, find it difficult to capture global dependencies in images owing to the inherent limitations of convolution. In terms of parameter count, VGGNet is a typical deep convolutional neural network with relatively many parameters; the later ResNet obtains better results with fewer parameters by introducing residual connections, while the Swin Transformer uses windowed attention to achieve the best accuracy with an acceptable number of parameters.
(2) Effectiveness of each module. The proposed network model LFFNet fully exploits the complementary relation between facial key points and facial features, and the multi-scale fusion module uses information at different scales to improve recognition performance. To verify the contribution of each proposed module, the feature fusion module and the multi-scale fusion module were added to the Swin Transformer backbone in turn, and ablation experiments were performed on the RAFDB dataset. The results are shown in Table 2, where it can be observed that the performance of the network model improves as the different modules are gradually introduced. Specifically, after adding the feature fusion module, the accuracy of the model increases by 1.28%; this module achieves fine-grained fusion of facial key points and facial features at the pixel level and exploits their valuable information to the greatest extent. After further adding the multi-scale fusion module, the accuracy of the model improves by another 0.54%; this module fuses the output features of each stage more deeply through residual connections and a pyramid structure, giving the model a richer feature representation.
Table 2 Validity verification results of each module of the model

| Method | Accuracy (%) |
| --- | --- |
| Swin Transformer | 87.32 |
| Swin Transformer + feature fusion module | 88.60 |
| Swin Transformer + feature fusion module + multi-scale fusion module | 89.14 |
(3) Confusion matrix analysis. To further investigate the details of model performance, the proposed model LFFNet was compared with the base network Swin Transformer on each facial expression category; the comparison results are shown in Table 3 and fig. 5. The results show that LFFNet outperforms the base network in every expression category. This mainly results from the model's effective fusion of facial features and facial key points, which provides richer cues for recognizing facial expressions. In addition, through the multi-scale information fusion mechanism, the model learns more discriminative features and thus achieves better results in the facial expression recognition task. Fig. 6 shows the confusion matrices of the Swin Transformer and the model LFFNet on the RAFDB dataset, confirming the improvement in the overall performance of the model.
Table 3 Recall of Swin Transformer and LFFNet on different expression categories
2.3 Comparison with existing models
To evaluate the effectiveness of the proposed model, its best results are compared with several existing methods on the RAFDB and SFEW datasets. The experimental results are shown in Table 4, Table 5, fig. 7 and fig. 8. On the RAFDB dataset, the proposed LFFNet model achieves the best accuracy of 89.14%, surpassing the other methods, which shows that the model has stronger performance in the expression recognition task. Similarly, LFFNet also achieves the best accuracy of 60.44% on the SFEW dataset, exceeding the other compared methods.
Some of these models adopt a similar approach to the expression recognition task. The RAN model decomposes the facial image into regions based on facial key points and introduces a region bias loss to guide the model toward the most important regions. The ACNN model divides the input facial feature map into multiple sub-feature maps, assigns attention weights to each sub-feature map and to the global feature map, and then simply concatenates all feature vectors into one vector. The MANet model uses a dual-branch network to extract and fuse multi-scale features and local features. The VTFF model takes the RGB image and the LBP image as inputs, selects the most important features through an attention module, and then integrates these features with a Transformer encoder.
Table 4 Comparison with other methods on the RAFDB dataset

| Method | Accuracy (%) | Year |
| --- | --- | --- |
| DLP-CNN | 84.22 | 2019 |
| gACNN | 85.07 | 2019 |
| SCN | 88.14 | 2020 |
| MANet | 88.36 | 2021 |
| DACL | 87.78 | 2021 |
| VTFF | 88.14 | 2021 |
| RAN | 86.90 | 2021 |
| MixAugment | 87.54 | 2022 |
| LFFNet (ours) | 89.14 | 2024 |
Table 5 Comparison with other methods on the SFEW dataset

| Method | Accuracy (%) | Year |
| --- | --- | --- |
| DLP-CNN | 51.05 | 2019 |
| IPFR | 55.10 | 2019 |
| LDL-ALSG | 56.50 | 2020 |
| RAN | 56.40 | 2021 |
| MANet | 59.40 | 2021 |
| LFFNet (ours) | 60.44 | 2024 |
Other models explore expression recognition from different angles. The SCN model starts from the label perspective, arguing that high-quality labels are difficult to obtain because of the subjectivity of annotation and the uncertainty of facial images; it therefore uses an attention mechanism to weight each sample, assigning lower importance to uncertain facial images and correcting their labels. The LDL-ALSG model starts from the expression distribution perspective: it builds a similarity list using facial action units and facial key points, and improves facial expression recognition performance by minimizing the difference between a center image and its nearest-neighbor images while also minimizing the gap between predicted and actual values. The DACL model starts from the loss function perspective and proposes a deep attention center loss, combining an attention mechanism with a sparse center loss to adaptively select the most important subset of feature elements. From the data augmentation perspective, the MixAugment model generates new virtual expression images by mixing different expression images to increase the diversity of the training data.
Table 6 Recall of each model on different expression categories
To further compare the performance differences between the models, Table 6 and fig. 9 show the recall of each model on the different expression categories. As can be seen from the table, the proposed LFFNet model achieves recalls of 88.24%, 96.20%, 90.79% and 67.57% on the neutral, happy, sad and disgust categories respectively, showing strong recognition capability, while its performance on the surprise, fear and anger categories is comparatively weaker. Overall, although each model shows some differences across the expression categories, the LFFNet model achieves excellent results in most cases, demonstrating its strong performance in the expression recognition task. At the same time, the exploration and optimization of each model in different respects provide rich ideas and methods for research in the field of expression recognition.
3. Summary
An emotion recognition model based on the facial feature and key point fusion network LFFNet is proposed, which aims to fully extract and fuse facial feature and facial key point information using a self-attention mechanism, a feature fusion method and multi-scale techniques, so as to address the problem that a single type of feature has difficulty capturing scene details comprehensively. The model adopts the Swin Transformer as the backbone network and uses its windowed self-attention mechanism to finely capture the rich detail of the facial image without significantly increasing computational complexity. The feature fusion module optimizes the fusion of facial key point and facial feature information at the pixel level, ensuring that the valuable information is preserved and exploited to the greatest extent. The multi-scale fusion module achieves deep multi-scale fusion by combining a pyramid structure with residual connections, further enriching the feature representation. Ablation experiments verify the improvement brought by each module, and competitive results are obtained on the two public facial expression recognition datasets RAFDB and SFEW compared with existing methods.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.