
CN117877081A - Video facial expression recognition method, system, device, processor and storage medium based on facial key point optimization region characteristics - Google Patents

Video facial expression recognition method, system, device, processor and storage medium based on facial key point optimization region characteristics

Info

Publication number
CN117877081A
Authority
CN
China
Prior art keywords
feature
key point
module
facial
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311526318.9A
Other languages
Chinese (zh)
Inventor
朱煜
黄志伟
叶炜韬
刘燕滨
叶炯耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology filed Critical East China University of Science and Technology
Priority to CN202311526318.9A priority Critical patent/CN117877081A/en
Publication of CN117877081A publication Critical patent/CN117877081A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a video facial expression recognition method based on regional features optimized by facial key points, comprising: (1) collecting a video facial expression dataset and preprocessing the videos; (2) constructing a CNN-based spatial feature extraction module and a feature extraction module based on facial key points, which respectively extract the relevant information from the collected data; (3) reinforcing the extracted information; (4) constructing a feature fusion module to enhance the features of the images within a frame; (5) decoding the extracted information to make a decision, and constructing an overall loss function. The present invention also relates to a corresponding system, device, processor and storage medium. The method, system, device, processor and storage medium of the present invention make more effective use of facial key point features and achieve a better expression recognition effect than the baseline model.

Description

Video facial expression recognition method, system, device, processor and storage medium based on regional features optimized by facial key points

Technical Field

The present invention relates to the field of digital image technology, in particular to the field of computer vision, and specifically to a video facial expression recognition method, system, device, processor and computer-readable storage medium based on regional features optimized by facial key points.

Background Art

Facial expression recognition (FER) is an important direction for computers to understand human emotions and an important aspect of human-computer interaction. It has broad application value and social significance for understanding and improving human-computer interaction, security, robotics, automation, medical care, communication and driving. With the development of computer technology and deep learning, and the open-sourcing of large-scale spontaneous expression datasets collected in natural environments, facial expression recognition has made significant progress in recent years and has become a research hotspot in both academia and industry.

With the continuous development of deep convolutional networks, researchers can mine useful information from large-scale expression data, and various CNN-based methods have been applied to facial expressions. However, constrained by the compound nature of expressions and the class imbalance of real-world datasets, facial expression recognition remains a difficult task. For example, the invention patent application No. CN202311049424.2 discloses adding a self-attention module on top of the backbone to enhance the network's ability to extract features from specific regions; its design focuses on recognizing expressions of occluded faces in static image datasets captured in real scenes. The invention patent application No. CN202310942697.3 uses a cross-layer, multi-scale channel mutual-attention learning mechanism, focusing on convolution kernel design and attention design, and shows advantages on static image datasets from real scenes. The invention patent application No. CN202310805217.9 designs a facial expression recognition method based on an enhanced self-attention Transformer, using the lighter-weight IR50 network as the backbone and an enhanced self-attention module to strengthen the features before the final classification; its focus is on making the network lightweight. The common shortcoming of the above applications is that they do not consider how to exploit temporal information or the prior knowledge of facial key points: they all perform attention-optimized recognition and classification directly on feature maps extracted from images, without taking into account the relevance of temporal information and facial key point information to video expression recognition in real scenes, and this is the defect that urgently needs to be addressed.

Summary of the Invention

The purpose of the present invention is to overcome the shortcomings of the above prior art and to provide a video facial expression recognition method, system, device, processor and computer-readable storage medium based on regional features optimized by facial key points that can effectively exploit the correlation between adjacent video frames and the facial key point information.

To achieve the above purpose, the video facial expression recognition method, system, device, processor and computer-readable storage medium based on regional features optimized by facial key points of the present invention are as follows:

The video facial expression recognition method based on regional features optimized by facial key points is mainly characterized in that the method comprises the following steps:

(1) Collecting a video facial expression dataset and preprocessing the videos;

(2) Constructing a CNN-based spatial feature extraction module and a feature extraction module based on facial key points, which respectively extract primary spatial features and local regional features of the facial key points from the collected data;

(3) Constructing a graph convolution module guided by facial key points to reinforce the extracted information;

(4) Constructing a feature fusion module to enhance the features of the images within a frame;

(5) Using a temporal module and a classification module to decode the extracted information and make a decision, and constructing an overall loss function.

Preferably, step (1) specifically comprises the following steps:

(1.1) Downloading the DFEW and AFEW datasets from their official websites, obtaining cropped face images through video framing and a face extractor, and obtaining original video frames with a size of 256×256 pixels;

(1.2) Applying data augmentation to the original video frame sequence to construct the final 112×112-pixel training and test images.

Preferably, the data augmentation comprises:

Random sampling: two consecutive frames are sampled from a video and this is repeated backwards eight times in sequence to obtain the image group for the current training or testing pass; the preprocessed image group is randomly rotated within [-45°, 45°], then randomly flipped horizontally, randomly color-jittered, randomly Gaussian-blurred and randomly converted to grayscale, and the final training and test images are obtained through a rescaling operation.

More preferably, step (2) specifically comprises the following steps:

(2.1) Constructing the spatial feature extraction module, specifically:

A CNN model is used as the feature extraction network for the facial expression video frames, converting the 112×112×3×16-dimensional input sample into a set of 2D feature maps T, where h and w denote the height and width of the feature maps, d the feature dimension and t the number of frames in the sample; a triplet attention module is constructed to convert the 2D feature maps T into 2D feature maps F, whose dimensions are kept consistent with those of T;

(2.2) Constructing the facial key point feature extraction module, specifically:

A facial key point feature extraction network is constructed as the facial key point feature extraction module, converting the 112×112×3×16-dimensional input sample into a set of facial key point feature maps A', where N denotes the number of facial key points and h and w the height and width of the maps; 18 fixed facial key points are sampled from A', converting it into 2D feature heatmaps A whose height and width are kept consistent with the feature map T of step (2.1);

(2.3) Based on the constructed spatial feature extraction module and facial key point feature extraction module, primary spatial features and local regional features of the facial key points are extracted from the collected facial expression recognition dataset.

More preferably, step (3) comprises the following steps:

(3.1) Constructing the input features of the graph convolution module guided by facial key points, specifically:

Using the 2D feature maps T obtained by the spatial feature extraction module of step (2.1) and the 2D feature heatmaps A of the 18 facial key points constructed in step (2.2), pooling and element-wise multiplication are applied, and after each operation layer normalization is used to further adjust the output, yielding the inputs v_i^j and v_{i,g} of the graph convolution module guided by facial key points, as in the following formula:

v_{i,g} = g(T_i),  v_i^j = g(A_i^j ⊙ T_i),  V_i = (v_i^1, v_i^2, …, v_i^18, v_{i,g})

where g is the global pooling operation, ⊙ is the element-wise multiplication of vectors, ( ) is the vector concatenation operation, T_i is the 2D spatial feature map of the i-th frame, v_{i,g} is the global facial key point feature of the i-th frame, and A_i^j is the feature heatmap of the j-th key point of the i-th frame;

(3.2) Constructing the graph convolution module guided by facial key points, specifically:

Using the sampled and feature-enhanced heatmaps A, the input features V_i constructed in step (3.1) and a newly created learnable feature matrix W_l, matrix multiplication, an activation function and element-wise matrix multiplication are applied to obtain the refined key point features,

where ⊗ is the matrix multiplication operation, ⊙ is the element-wise multiplication of vectors, T is the matrix transpose operation, f is a fully connected operation, A is the feature heatmap, W_l is the learnable feature matrix, W_a is the matrix obtained from the learnable matrix W_l by matrix multiplication and activation, the first 18 facial key point features of V_i obtained in step (3.1) form one feature vector, and V_i^g is the last, global feature vector of V_i.

More preferably, step (4) specifically comprises the following steps:

(4.1) The face is divided into three regions according to the sampled facial key points, and a feature fusion operation is performed, region by region, on the output of the graph convolution module guided by facial key points constructed in step (3.2), constructing the spatial feature information V_spatial that fuses facial key point information; the fusion operation is constrained by a cross-entropy loss function, as in the following formula:

L_S = L_class(FC(V_spatial))

where FC is a fully connected operation, L_class(·) is the multi-class cross-entropy loss function, and L_S denotes the classification loss contributed by the spatial component.

More preferably, step (5) specifically comprises the following steps:

(5.1) Constructing the input of the temporal module, specifically:

The category feature, the positional encoding and the key-point-guided fusion feature V_spatial of step (3.2) together form the input feature z_0 of the temporal module, where N is the number of concatenated features and d is the feature dimension, as in the following formula:

z_0 = [x_class; V_spatial] + E_pos = [x_class; v_1; v_2; …; v_N] + E_pos

where x_class is the category feature, E_pos is the positional encoding, the symbol ';' denotes the concatenation operation, and V_spatial is the spatial feature information fused with facial key point information;

(5.2) The input features first pass through a linear mapping layer to produce a Query matrix Q, a Key matrix K and a Value matrix V; the three matrices are then fed into the multi-head self-attention mechanism MHSA to compute the weight matrix W, as shown in the following formula:

W = softmax(QK^T / √d)

where T is the matrix transpose operation and d is a normalization constant;

(5.3) The weight matrix W is multiplied with the Value matrix V and, after a residual operation and multi-layer MLP processing, the output z_l usable for classification is obtained, with the following formulas:

s_i = W V_i;  i ∈ [1, N]

z_l = W_H [s_1; s_2; …; s_N]^T + z_{l-1}

where T is the matrix transpose operation and W_H is a learnable weight matrix;

(5.4) Multiple layers of the attention mechanism and residual operation described in (5.2) and (5.3) are stacked; at the last layer, a classifier module is applied to the output, mapping the features through a fully connected network to the number of classes c and yielding a vector that represents the probabilities of the c classes; a softmax operation on this vector gives R, which participates in the label-supervised classification loss function, with the following formula:

L_T = L_class(R)

where L_class(·) is the multi-class cross-entropy loss function and L_T denotes the classification loss contributed by the temporal component.

More preferably, constructing the overall loss function in step (5) specifically comprises the following step:

An importance weight λ is used to balance the classification loss L_S contributed by the spatial component and the classification loss L_T contributed by the temporal component, as follows:

L_total = λ × L_S + (1 - λ) × L_T.

The system for video facial expression recognition based on regional features optimized by facial key points that implements the above method is mainly characterized in that the system comprises:

a spatial feature extraction module and a feature extraction module based on facial key points, which extract spatial feature data and facial key point feature data from the collected video facial expression recognition dataset;

a graph convolution module guided by facial key points, connected to the spatial feature extraction module and the feature extraction module based on facial key points, which reinforces the facial expression information within a frame with the help of the facial key points through graph convolution;

a feature fusion module, connected to the graph convolution module guided by facial key points, which compresses the high-dimensional features and determines, according to the division of face regions, the key feature vectors carrying emotional information;

a temporal module and a classification module, connected to the feature fusion module, which use the multi-head self-attention mechanism MHSA and a multi-layer MLP to build a temporal feature extraction network, and use the classification network and a cross-entropy loss function to supervise the classification result with labels, so as to obtain the final video facial expression recognition result.

The device for realizing video facial expression recognition based on regional features optimized by facial key points is mainly characterized in that the device comprises:

a processor configured to execute computer-executable instructions;

a memory storing one or more computer-executable instructions which, when executed by the processor, implement the steps of the above-described method for video facial expression recognition based on regional features optimized by facial key points.

The processor for realizing video facial expression recognition based on regional features optimized by facial key points is mainly characterized in that the processor is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the above-described method for video facial expression recognition based on regional features optimized by facial key points.

The computer-readable storage medium is mainly characterized in that a computer program is stored thereon, and the computer program can be executed by a processor to implement the steps of the above-described method for video facial expression recognition based on regional features optimized by facial key points.

With the method, system, device, processor and computer-readable storage medium for video facial expression recognition based on regional features optimized by facial key points of the present invention, a classic CNN model (for example ResNet18) and a facial key point localization model (for example the Dlib algorithm library) are used as the spatial feature extraction module and the feature extraction module based on facial key points. To strengthen the latent guiding ability of facial key point information, the present invention also innovatively introduces a graph convolution module guided by facial key points and a feature fusion module, which reinforce the facial expression information within a frame with the help of the facial key points. The temporal module and the classification module consist of the multi-head self-attention mechanism MHSA, a multi-layer MLP and a multi-class fully connected network, and the latent temporal features in the video are extracted by stacking multiple temporal layers. The classifier uses an importance weight to balance the classification loss contributed by the spatial component and the classification loss contributed by the temporal component to achieve a better result. The technical solution has been experimentally verified on the AFEW and DFEW datasets and achieves a markedly better classification and recognition performance than the baseline model.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the overall architecture of the method for video facial expression recognition based on regional features optimized by facial key points of the present invention.

FIG. 2 is a schematic diagram of the spatial feature extraction module, the feature extraction module based on facial key points and the graph convolution module guided by facial key points of the present invention.

FIG. 3 is a schematic diagram of the feature fusion module in the method for video facial expression recognition based on regional features optimized by facial key points of the present invention.

Detailed Description of the Embodiments

In order to describe the technical content of the present invention more clearly, a further description is given below in conjunction with specific embodiments.

Before describing the embodiments of the present invention in detail, it should be noted that, hereinafter, the terms "comprise", "include" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements that are not explicitly listed or that are inherent to such a process, method, article or device.

Referring to FIG. 1, the method for video facial expression recognition based on regional features optimized by facial key points comprises the following steps:

(1) Collecting a video facial expression dataset and preprocessing the videos;

(2) Constructing a CNN-based spatial feature extraction module and a feature extraction module based on facial key points, which respectively extract primary spatial features and local regional features of the facial key points from the collected data;

(3) Constructing a graph convolution module guided by facial key points to reinforce the extracted information;

(4) Constructing a feature fusion module to enhance the features of the images within a frame;

(5) Using a temporal module and a classification module to decode the extracted information and make a decision, and constructing an overall loss function.

As a preferred embodiment of the present invention, step (1) specifically comprises the following steps:

(1.1) Downloading the DFEW and AFEW datasets from their official websites, obtaining cropped face images through video framing and a face extractor, and obtaining original video frames with a size of 256×256 pixels;

(1.2) Applying data augmentation to the original video frame sequence to construct the final 112×112-pixel training and test images.

As a preferred embodiment of the present invention, the data augmentation comprises:

Random sampling: two consecutive frames are sampled from a video and this is repeated backwards eight times in sequence to obtain the image group for the current training or testing pass; the preprocessed image group is randomly rotated within [-45°, 45°], then randomly flipped horizontally, randomly color-jittered, randomly Gaussian-blurred and randomly converted to grayscale, and the final training and test images are obtained through a rescaling operation.
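
A minimal sketch of this augmentation pipeline, assuming PyTorch/torchvision as the implementation framework (the text does not name one); the flip, jitter and blur probabilities and the pair-sampling scheme are illustrative placeholders rather than values from the patent.

```python
import torch
from torchvision import transforms

# Per-frame augmentation sketch; a full implementation would share the sampled
# random parameters across all 16 frames of a clip instead of re-drawing them per frame.
train_transform = transforms.Compose([
    transforms.RandomRotation(45),                                # random rotation in [-45°, 45°]
    transforms.RandomHorizontalFlip(p=0.5),                       # random horizontal flip
    transforms.ColorJitter(0.4, 0.4, 0.4),                        # random color jitter
    transforms.RandomApply([transforms.GaussianBlur(5)], p=0.5),  # random Gaussian blur
    transforms.RandomGrayscale(p=0.2),                            # random grayscale conversion
    transforms.Resize((112, 112)),                                # rescale to the 112x112 input size
    transforms.ToTensor(),
])

def sample_clip(frames):
    """Build one 16-frame training clip from a decoded list of PIL frames."""
    indices = list(range(min(16, len(frames))))        # 8 consecutive frame pairs = 16 frames
    return torch.stack([train_transform(frames[i]) for i in indices])  # (16, 3, 112, 112)
```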

As a preferred embodiment of the present invention, step (2) specifically comprises the following steps:

(2.1) The spatial feature extraction module is constructed as follows:

A CNN model is used to build the spatial feature extraction module, which extracts spatial feature information from the input samples and computes a set of 2D feature maps T representing the preliminary spatial feature information; the module further constructs a triplet attention module and computes 2D feature maps F representing the spatially reinforced feature information, whose dimensions are kept consistent with those of T;

(2.2) The facial key point feature extraction module is constructed as follows:

The module extracts a set of facial key point feature maps A' from the input multi-frame images using a facial key point extraction algorithm, and then samples 18 fixed facial key points from A' to obtain the 2D feature heatmaps A;

(2.3) Based on the constructed spatial feature extraction module and facial key point feature extraction module, primary spatial features and local regional features of the facial key points are extracted from the collected facial expression recognition dataset.

In practical applications, the above step (2) is specifically as follows:

Step 2.1: Construct the feature encoder module:

A CNN model is used as the feature extraction network for the facial expression video frames, converting the 112×112×3×16-dimensional input sample into a set of 2D feature maps T, where h and w denote the height and width of the feature maps, d the feature dimension and t the number of frames in the sample; a triplet attention module is constructed to convert the 2D feature maps T into 2D feature maps F, whose dimensions are kept consistent with those of T;
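
A sketch of the per-frame spatial feature extraction under the assumption of a ResNet18 backbone (named later in the text as an example); the triplet attention refinement is left as an identity placeholder, since the text only states that F keeps the same dimensions as T.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SpatialFeatureExtractor(nn.Module):
    """Maps a clip to per-frame 2D feature maps T (and, after attention, F)."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # keep the convolutional trunk, drop global pooling and the classifier
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])
        self.attention = nn.Identity()   # placeholder for the triplet attention module

    def forward(self, clip):             # clip: (B, 16, 3, 112, 112)
        b, t, c, h, w = clip.shape
        feat = self.trunk(clip.reshape(b * t, c, h, w))   # (B*16, 512, 4, 4) for 112x112 inputs
        feat = self.attention(feat)                       # F keeps the same shape as T
        return feat.reshape(b, t, *feat.shape[1:])

# usage: a 16-frame 112x112 clip is mapped to per-frame feature maps
clips = torch.randn(2, 16, 3, 112, 112)
features = SpatialFeatureExtractor()(clips)               # (2, 16, 512, 4, 4)
```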

Step 2.2: Construct the label encoder module:

A facial key point feature extraction network is constructed as the facial key point feature extraction module, converting the 112×112×3×16-dimensional input sample into a set of facial key point feature maps A', where N denotes the number of facial key points and h and w the height and width of the maps; 18 fixed facial key points are sampled from A', converting it into 2D feature heatmaps A whose height and width are kept consistent with the feature map T of step (2.1).
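
A sketch of turning detected landmarks into per-key-point heatmaps whose spatial size matches the CNN feature map, assuming a 68-point detector such as the Dlib library mentioned later in the text; the 18 retained indices and the Gaussian width are hypothetical choices, not taken from the patent.

```python
import numpy as np

# Hypothetical subset of a 68-point landmark set kept as the 18 key points
# (the text fixes 18 points but does not list which ones).
KEPT_POINTS = [17, 19, 21, 22, 24, 26, 36, 39, 42, 45, 30, 31, 35, 48, 51, 54, 57, 62]

def landmark_heatmaps(landmarks, img_size=112, map_size=4, sigma=0.5):
    """landmarks: (68, 2) pixel coordinates -> (18, map_size, map_size) Gaussian heatmaps
    whose height and width match the spatial feature map T."""
    ys, xs = np.mgrid[0:map_size, 0:map_size]
    maps = np.zeros((len(KEPT_POINTS), map_size, map_size), dtype=np.float32)
    for k, idx in enumerate(KEPT_POINTS):
        cx, cy = landmarks[idx] * map_size / img_size       # rescale to feature-map coordinates
        maps[k] = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return maps
```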

As a preferred embodiment of the present invention, step (3) comprises the following steps:

(3.1) The input features of the graph convolution module guided by facial key points are constructed from the 2D feature maps T obtained by the spatial feature extraction module of step (2.1) and the 2D feature heatmaps A of the 18 facial key points constructed in step (2.2), using pooling and element-wise multiplication to obtain the inputs v_i^j and v_{i,g} of the graph convolution module, as in the following formula:

v_{i,g} = g(T_i),  v_i^j = g(A_i^j ⊙ T_i),  V_i = (v_i^1, v_i^2, …, v_i^18, v_{i,g})

where g is the global pooling operation, ⊙ is the element-wise multiplication of vectors, ( ) is the vector concatenation operation, T_i is the 2D spatial feature map of the i-th frame, v_{i,g} is the global facial key point feature of the i-th frame, and A_i^j is the feature heatmap of the j-th key point of the i-th frame;

(3.2) The graph convolution module guided by facial key points is constructed: using the sampled and feature-enhanced heatmaps A, the input features V_i and a learnable feature matrix W_l, matrix multiplication, an activation function and element-wise matrix multiplication are applied to obtain the refined key point features,

where ⊗ is the matrix multiplication operation, ⊙ is the element-wise multiplication of vectors, T is the matrix transpose operation, f is a fully connected operation, A is the feature heatmap, W_l is the learnable feature matrix, W_a is the matrix obtained from the learnable matrix W_l by matrix multiplication and activation, the first 18 facial key point features of V_i obtained in step (3.1) form one feature vector, and V_i^g is the last, global feature vector of V_i.

In practical applications, the above step (3) is specifically as follows:

Step 3.1: Construct the input features of the graph convolution module guided by facial key points:

The input of the graph convolution module consists of two parts. One part is obtained from the 2D feature maps T of the spatial feature extractor through a global average pooling operation; the other part is obtained by fusing the 2D feature heatmaps A constructed by the facial key point feature extraction module with the 2D feature maps T of the spatial feature extractor, computed by element-wise vector multiplication, and after each operation layer normalization is used to further adjust the output. This group of feature vectors is computed as:

v_{i,g} = g(T_i),  v_i^j = g(A_i^j ⊙ T_i),  V_i = (v_i^1, v_i^2, …, v_i^18, v_{i,g})

where g is the global pooling operation, ⊙ is the element-wise multiplication of vectors, and ( ) is the vector concatenation operation;
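
A sketch of this per-frame input construction: global pooling of T_i gives the global vector, heatmap-weighted pooling gives one vector per key point, and layer normalization adjusts each output; the 512-channel dimension is an assumption carried over from the ResNet18 sketch above.

```python
import torch
import torch.nn as nn

class KeypointGuidedInput(nn.Module):
    """Builds V_i = (v_i^1, ..., v_i^18, v_{i,g}) for one frame."""
    def __init__(self, dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, T, A):
        # T: (d, h, w) spatial feature map; A: (18, h, w) key-point heatmaps
        v_global = self.norm(T.mean(dim=(1, 2)))                    # g(T_i): global average pooling
        weighted = A.unsqueeze(1) * T.unsqueeze(0)                  # A_i^j ⊙ T_i -> (18, d, h, w)
        v_points = self.norm(weighted.mean(dim=(2, 3)))             # pooled per key point -> (18, d)
        return torch.cat([v_points, v_global.unsqueeze(0)], dim=0)  # V_i: (19, d)
```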

Step 3.2: Construct the graph convolution module guided by facial key points:

Using the sampled and feature-enhanced heatmaps A, the input features V_i and a new learnable feature matrix W_l as inputs, the graph convolution module is built by applying matrix multiplication, an activation function and element-wise matrix multiplication in turn. The parameters of W_l are randomly initialized, and two consecutive matrix multiplication operations yield a learnable weight matrix W_a that participates in the graph convolution computation,

where ⊗ is the matrix multiplication operation, ⊙ is the element-wise multiplication of vectors, and f is a fully connected operation.
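
The exact graph-convolution equation is not reproduced in this text, so the sketch below only assembles the ingredients the description names (a randomly initialized learnable matrix W_l, a weight matrix W_a obtained from it by matrix multiplication and an activation, a fully connected map f, and element-wise fusion); the particular composition is an assumption, not the patent's formula.

```python
import torch
import torch.nn as nn

class KeypointGuidedGraphConv(nn.Module):
    """Illustrative composition only; the patent's exact formula may differ in detail."""
    def __init__(self, num_points=18, dim=512):
        super().__init__()
        self.W_l = nn.Parameter(torch.randn(num_points, dim) * 0.02)  # learnable feature matrix W_l
        self.fc = nn.Linear(dim, dim)                                  # fully connected map f

    def forward(self, V):
        # V: (19, dim) = 18 key-point vectors followed by the global vector
        v_kp, v_g = V[:-1], V[-1]
        W_a = torch.softmax(self.W_l @ self.W_l.t(), dim=-1)  # (18, 18) weights derived from W_l
        mixed = W_a @ v_kp                                     # graph-style mixing over the key points
        out_kp = torch.relu(self.fc(mixed)) * v_kp             # element-wise fusion with the originals
        return torch.cat([out_kp, v_g.unsqueeze(0)], dim=0)    # keep the global vector unchanged
```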

As a preferred embodiment of the present invention, step (4) specifically comprises the following steps:

(4.1) The face is divided into three regions according to the sampled facial key points, and a feature fusion operation is performed, region by region, on the output of the graph convolution module guided by facial key points constructed in step (3.2), constructing the spatial feature information V_spatial that fuses facial key point information; the fusion operation is constrained by a cross-entropy loss function, as in the following formula:

L_S = L_class(FC(V_spatial))

where FC is a fully connected operation, L_class(·) is the multi-class cross-entropy loss function, and L_S denotes the classification loss contributed by the spatial component.

In practical applications, the above step (4) is specifically as follows:

Step 4.1: Construct the feature fusion module:

The facial area is divided into three parts: left eye, right eye and lips, each corresponding to a subset of the feature vectors. After the features of these three parts are obtained, pooling is applied to the output of the graph convolution module to perform feature fusion, yielding, for each frame, the spatial feature information V_spatial that fuses facial key point information; the fusion operation is constrained by a cross-entropy loss function, as in the following formula:

L_S = L_class(FC(V_spatial))

where FC is a fully connected operation, L_class(·) is the multi-class cross-entropy loss function, and L_S denotes the classification loss contributed by the spatial component.
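
A sketch of the region-wise fusion and the auxiliary spatial loss L_S = L_class(FC(V_spatial)); which key-point indices belong to the left-eye, right-eye and lip regions is not listed in the text, so the index groups below are hypothetical, as is the choice of mean pooling within each region.

```python
import torch
import torch.nn as nn

# Hypothetical grouping of the 18 key-point feature vectors into three face regions.
REGIONS = {"left_eye": [0, 1, 2, 6, 7], "right_eye": [3, 4, 5, 8, 9], "lips": [13, 14, 15, 16, 17]}

class RegionFusion(nn.Module):
    def __init__(self, dim=512, num_classes=7):
        super().__init__()
        self.classifier = nn.Linear(3 * dim, num_classes)   # FC used only for the auxiliary loss L_S
        self.loss = nn.CrossEntropyLoss()                    # multi-class cross-entropy L_class

    def forward(self, V, labels=None):
        # V: (B, 19, dim) graph-convolution output for a frame (18 key points + global vector)
        pooled = [V[:, idx, :].mean(dim=1) for idx in REGIONS.values()]   # pool each region
        v_spatial = torch.cat(pooled, dim=-1)                 # (B, 3*dim) fused spatial feature
        loss_s = self.loss(self.classifier(v_spatial), labels) if labels is not None else None
        return v_spatial, loss_s
```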

As a preferred embodiment of the present invention, step (5) is specifically as follows:

(5.1) Constructing the input of the temporal module: the category feature, the positional encoding and the key-point-guided fusion feature V_spatial together form the input feature z_0 of the temporal module, where N is the number of concatenated features and d is the feature dimension, as in the following formula:

z_0 = [x_class; V_spatial] + E_pos = [x_class; v_1; v_2; …; v_N] + E_pos

where x_class is the category feature, E_pos is the positional encoding, and the symbol ';' denotes the concatenation operation.

(5.2) The input features first pass through a linear mapping layer to produce a Query matrix Q, a Key matrix K and a Value matrix V; the three matrices are then fed into the multi-head self-attention mechanism MHSA to compute the weight matrix W, as shown in the following formula:

W = softmax(QK^T / √d)

where T is the matrix transpose operation and d is a normalization constant;

(5.3) The weight matrix W is multiplied with the Value matrix V and, after a residual operation and multi-layer MLP processing, the output z_l usable for classification is obtained, with the following formulas:

s_i = W V_i;  i ∈ [1, N]

z_l = W_H [s_1; s_2; …; s_N]^T + z_{l-1}

where T is the matrix transpose operation and W_H is a learnable weight matrix;

(5.4) Multiple layers of the attention mechanism and residual operation described in (5.2) and (5.3) are stacked; at the last layer, a classifier module is applied to the output, mapping the features through a fully connected network to the number of classes c and yielding a vector that represents the probabilities of the c classes; a softmax operation on this vector gives R, which participates in the label-supervised classification loss function, with the following formula:

L_T = L_class(R)

where L_class(·) is the multi-class cross-entropy loss function and L_T denotes the classification loss contributed by the temporal component.

In practical applications, the above step (5) is specifically as follows:

Step 5.1: Construct the input of the temporal module:

The input of the temporal module consists of three parts: the category feature, the positional encoding and the key-point-guided fusion feature V_spatial. The category feature is a randomly initialized vector whose dimension must match the input dimension of the temporal module so that they can be added directly; the positional encoding uses an alternating sine-cosine positional encoding algorithm to determine the temporal order of the input. Together they form the temporal module input feature z_0, where N is the number of concatenated features and d is the feature dimension, as in the following formula:

z_0 = [x_class; V_spatial] + E_pos = [x_class; v_1; v_2; …; v_N] + E_pos
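
A sketch of assembling z_0 from a learnable class token, the per-frame fused features and alternating sine-cosine positional encodings; the feature dimension is an assumption consistent with the region-fusion sketch above (3 × 512).

```python
import math
import torch
import torch.nn as nn

def sincos_positional_encoding(n, dim):
    """Alternating sine/cosine positional encoding E_pos of shape (n, dim)."""
    pos = torch.arange(n).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(n, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class TemporalInput(nn.Module):
    def __init__(self, dim=1536):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # randomly initialised x_class token

    def forward(self, v_spatial):
        # v_spatial: (B, T, dim) fused per-frame features
        b, t, d = v_spatial.shape
        z0 = torch.cat([self.cls.expand(b, -1, -1), v_spatial], dim=1)  # [x_class; v_1; ...; v_N]
        return z0 + sincos_positional_encoding(t + 1, d)                # z_0 = [...] + E_pos
```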

Step 5.2: Construct the attention weight matrix:

The attention weight matrix is learned through a self-attention operation: the input features are passed through a linear mapping layer to produce a Query matrix Q, a Key matrix K and a Value matrix V whose mappings do not share parameters; the three matrices are then fed into the multi-head self-attention mechanism MHSA to compute the weight matrix W, which indicates that certain regions of the feature map have a higher influence, as shown in the following formula:

W = softmax(QK^T / √d)

where T is the matrix transpose operation and d is a normalization constant;
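
A single-head sketch of the attention-weight computation W = softmax(QK^T / √d) with non-shared Q/K/V projections; the multi-head variant splits the channels into several heads but follows the same formula.

```python
import torch
import torch.nn as nn

class AttentionWeights(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # the three linear maps do not share parameters
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, z):
        # z: (B, N, dim) temporal-module input
        Q, K, V = self.q(z), self.k(z), self.v(z)
        W = torch.softmax(Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5, dim=-1)  # (B, N, N)
        return W, V
```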

Step 5.3: Construct the temporal model:

The computed weight matrix W is multiplied with the Value matrix V and, after a residual operation and multi-layer MLP processing, a feature vector with the same dimensions as the input features is obtained. This feature vector better integrates the temporal features and can serve as the input of the next temporal layer:

s_i = W V_i;  i ∈ [1, N]

z_l = W_H [s_1; s_2; …; s_N]^T + z_{l-1}

where T is the matrix transpose operation and W_H is a learnable weight matrix;
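
A sketch of one temporal layer following the description above: self-attention re-weights the value vectors, a residual connection forms z_l, and a small MLP refines the result. PyTorch's nn.MultiheadAttention is used here for brevity, and its built-in output projection stands in for the learnable matrix W_H; the MLP width is an assumption.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    def __init__(self, dim=1536, heads=8):
        super().__init__()
        # multi-head self-attention computes softmax(QK^T/√d)·V; its output projection
        # plays the role of the learnable weight matrix W_H
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):                 # z: (B, N, dim)
        s, _ = self.mhsa(z, z, z)         # s_i = W V_i
        z = s + z                         # residual: z_l = W_H[s_1; ...; s_N]^T + z_{l-1}
        return z + self.mlp(z)            # multi-layer MLP keeps the input dimension
```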

Step 5.4: Stack the temporal network:

The attention mechanism and residual operations described above are stacked in multiple layers; the number of stacked layers affects how well the network extracts temporal features. At the last layer, the output has the same dimensions as the input features; a classifier module is applied to it, mapping the features through a fully connected network to the number of classes c and yielding a vector that represents the probabilities of the c classes; a softmax operation on this vector gives R, which participates in the label-supervised classification loss function, with the following formula:

L_T = L_class(R)

where L_class(·) is the multi-class cross-entropy loss function and L_T denotes the classification loss contributed by the temporal component. An importance weight λ is used to balance the classification loss L_S contributed by the spatial component and the classification loss L_T contributed by the temporal component, as follows:

L_total = λ × L_S + (1 - λ) × L_T.
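
A sketch of stacking the temporal layers, classifying from the class-token position and combining the two losses as L_total = λ × L_S + (1 - λ) × L_T; it reuses the TemporalBlock sketch above, and the stack depth, the class count (7 emotion categories) and the value of λ are assumptions.

```python
import torch
import torch.nn as nn

class ExpressionHead(nn.Module):
    def __init__(self, dim=1536, num_classes=7, depth=4, lam=0.5):
        super().__init__()
        self.blocks = nn.ModuleList([TemporalBlock(dim) for _ in range(depth)])  # stacked temporal layers
        self.fc = nn.Linear(dim, num_classes)    # maps the class-token feature to the c classes
        self.ce = nn.CrossEntropyLoss()           # L_class
        self.lam = lam                            # importance weight λ

    def forward(self, z0, labels, loss_s):
        z = z0
        for blk in self.blocks:
            z = blk(z)
        logits = self.fc(z[:, 0])                               # classify from the class-token position
        loss_t = self.ce(logits, labels)                        # L_T = L_class(R)
        total = self.lam * loss_s + (1 - self.lam) * loss_t     # L_total = λ*L_S + (1-λ)*L_T
        return logits.softmax(dim=-1), total                    # R and the overall loss
```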

The system for video facial expression recognition based on regional features optimized by facial key points that uses the above method comprises:

a spatial feature extraction module and a feature extraction module based on facial key points, which extract spatial feature data and facial key point feature data from the collected video facial expression recognition dataset;

a graph convolution module guided by facial key points, connected to the spatial feature extraction module and the feature extraction module based on facial key points, which reinforces the facial expression information within a frame with the help of the facial key points through graph convolution;

a feature fusion module, connected to the graph convolution module guided by facial key points, which compresses the high-dimensional features and determines, according to the division of face regions, the key feature vectors carrying emotional information;

a temporal module and a classification module, connected to the feature fusion module, which use the multi-head self-attention mechanism MHSA and a multi-layer MLP to build a temporal feature extraction network, and use the classification network and a cross-entropy loss function to supervise the classification result with labels, so as to obtain the final video facial expression recognition result.

In a specific embodiment of the present invention, the classification and recognition method of the present technical solution was tested as follows:

(1) Experimental datasets

The present invention was experimentally verified on the video facial expression recognition datasets AFEW and DFEW. The AFEW dataset served as the evaluation dataset for emotion recognition in the Emotion Recognition in the Wild challenge (EmotiW) from 2013 to 2019 and was updated during this period; it is a temporal multimodal database containing audio and video data. Samples are labeled with seven basic emotion classes: happiness, surprise, anger, disgust, fear, sadness and neutral. The AFEW dataset provides 1,809 videos, of which 773 are in the training set, 383 in the validation set and 653 in the test set. To ensure the rigor of the data, the three sets of videos do not overlap, and even the identities of the people in the videos differ. The DFEW dataset contains samples from more than 1,500 realistic high-definition movies covering a variety of subjects and realistically reflecting people's facial movements in various environments. It contains 16,372 video samples, also labeled with the 7 basic emotion classes; the samples are evenly divided into five non-overlapping parts, one of which is selected as the test set during testing while the others are used as the training set.

(2) Training process

The training images are scaled to 112×112 pixels, and data augmentation such as random rotation, grayscale transformation and random flipping is used. The Adam optimizer is selected, the initial learning rate is set to 1e-3, the learning rate is decayed at fixed epochs, the batch size is set to 64, and training runs for 80 epochs in total.
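
A sketch of this optimization setup (Adam, initial learning rate 1e-3, fixed-epoch learning-rate decay, batch size 64, 80 epochs); the decay step and factor are not given in the text, and the model here is only a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 7)   # placeholder for the full recognition network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                         # Adam, initial lr 1e-3
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)   # fixed-epoch decay (assumed values)

for epoch in range(80):    # 80 training epochs
    # iterate over a DataLoader with batch_size=64, compute L_total and back-propagate here
    optimizer.step()
    scheduler.step()
```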

(3) Test results

In this embodiment, training is performed on the AFEW and DFEW datasets respectively, the performance is then tested on the validation sets of the two datasets, and Accuracy (Acc.) is selected as the evaluation metric. The experimental results are shown in Table 1.

Table 1: Performance comparison of models trained on the AFEW and DFEW datasets (%)

As can be seen from Table 1, this embodiment outperforms previous methods on both the AFEW and DFEW datasets and achieves the best detection performance on the two datasets.

The device for realizing video facial expression recognition based on regional features optimized by facial key points comprises:

a processor configured to execute computer-executable instructions;

a memory storing one or more computer-executable instructions which, when executed by the processor, implement the steps of the above-described method for video facial expression recognition based on regional features optimized by facial key points.

The processor for realizing video facial expression recognition based on regional features optimized by facial key points is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the above-described method for video facial expression recognition based on regional features optimized by facial key points.

The computer-readable storage medium has a computer program stored thereon, and the computer program can be executed by a processor to implement the steps of the above-described method for video facial expression recognition based on regional features optimized by facial key points.

流程图中或在此以其他方式描述的任何过程或方法描述可以被理解为,表示包括一个或更多个用于实现特定逻辑功能或过程的步骤的可执行指令的代码的模块、片段或部分,并且本发明的优选实施方式的范围包括另外的实现,其中可以不按所示出或讨论的顺序,包括根据所涉及的功能按基本同时的方式或按相反的顺序,来执行功能,这应被本发明的实施例所属技术领域的技术人员所理解。Any process or method description in a flowchart or otherwise described herein may be understood to represent a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process, and the scope of the preferred embodiments of the present invention includes alternative implementations in which functions may not be performed in the order shown or discussed, including performing functions in a substantially simultaneous manner or in the reverse order depending on the functions involved, which should be understood by those skilled in the art to which the embodiments of the present invention belong.

应当理解,本发明的各部分可以用硬件、软件、固件或它们的组合来实现。在上述实施方式中,多个步骤或方法可以用存储在存储器中且由合适的指令执行装置执行的软件或固件来实现。It should be understood that each part of the present invention can be implemented by hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods can be implemented by software or firmware stored in a memory and executed by a suitable instruction execution device.

本技术领域的普通技术人员可以理解实现上述实施例方法携带的全部或部分步骤是可以通过程序来指令相关的硬件完成的,程序可以存储于一种计算机可读存储介质中,该程序在执行时,包括方法实施例的步骤之一或其组合。A person of ordinary skill in the art may understand that all or part of the steps of the method for implementing the above-mentioned embodiment may be completed by instructing the relevant hardware through a program, and the program may be stored in a computer-readable storage medium, which, when executed, includes one of the steps of the method embodiment or a combination thereof.

上述提到的存储介质可以是只读存储器,磁盘或光盘等。The storage medium mentioned above can be a read-only memory, a magnetic disk or an optical disk, etc.

在本说明书的描述中,参考术语“一实施例”、“一些实施例”、“示例”、“具体示例”、或“实施例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明的至少一个实施例或示例中。在本说明书中,对上述术语的示意性表述不一定指的是相同的实施例或示例。而且,描述的具体特征、结构、材料或者特点可以在任何的一个或多个实施例或示例中以合适的方式结合。In the description of this specification, the description with reference to the terms "an embodiment", "some embodiments", "example", "specific example", or "embodiment" means that the specific features, structures, materials or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, the schematic representation of the above terms does not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described can be combined in any one or more embodiments or examples in a suitable manner.

尽管上面已经示出和描述了本发明的实施例,可以理解的是,上述实施例是示例性的,不能理解为对本发明的限制,本领域的普通技术人员在本发明的范围内可以对上述实施例进行变化、修改、替换和变型。Although the embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limitations of the present invention. A person skilled in the art may change, modify, replace and vary the above embodiments within the scope of the present invention.

采用了本发明的该基于人脸关键点优化区域特征的视频面部表情识别的方法、系统、装置、处理器及其计算机可读存储介质,使用经典的CNN模型(例如ResNet18)和人脸关键点定位模型(例如Dlib算法库)作为空间特征提取模块和基于人脸关键点的特征提取模块。为了加强人脸关键点信息的潜在引导能力,本发明还创新性地引入了人脸关键点引导的图卷积模块和特征融合模块,借助人脸关键点强化该帧内的人脸表情信息。时序模块和分类模块则由多头自注意力机制MHSA、多层MLP、多分类全链接网络组成,通过堆叠多层时序网络提取视频中潜在的时间特征。分类器用重要性权重平衡来自空间分量贡献的分类损失和来自时间分量贡献的分类损失,以达到更好的效果。本技术方案在AFEW、DFEW数据集上进行实验验证,相较于基线模型,具有更为突出的分类识别效果。With the method, system, device, processor and computer-readable storage medium for video facial expression recognition based on facial key point optimized regional features of the present invention, a classic CNN model (such as ResNet18) and a facial key point localization model (such as the Dlib algorithm library) are used as the spatial feature extraction module and the feature extraction module based on facial key points. To strengthen the latent guiding ability of the facial key point information, the present invention also innovatively introduces a facial key point guided graph convolution module and a feature fusion module, which use the facial key points to enhance the facial expression information within each frame. The temporal module and the classification module consist of a multi-head self-attention mechanism MHSA, a multi-layer MLP and a multi-class fully connected network, and latent temporal features in the video are extracted by stacking multiple temporal layers. The classifier uses an importance weight to balance the classification loss contributed by the spatial component and the classification loss contributed by the temporal component to achieve better results. The technical solution is experimentally verified on the AFEW and DFEW datasets and shows a more prominent classification and recognition performance than the baseline models.
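下面给出上述总体损失平衡的示意性代码草图(非本发明的正式实现):用重要性权重λ平衡空间分量与时间分量贡献的交叉熵分类损失;其中的分类头结构、特征维度与λ取值均为示意性假设。A minimal, hypothetical sketch (not the official implementation) of the loss balancing described above: an importance weight λ trades off the cross-entropy losses contributed by the spatial and temporal branches; the classifier head, feature dimension and value of λ are assumptions.

```python
import torch
import torch.nn as nn

class DualBranchLoss(nn.Module):
    """Balance the spatial-branch and temporal-branch classification losses.

    L_total = lam * L_S + (1 - lam) * L_T, where L_S is the cross-entropy
    loss on the key-point-fused spatial feature V_spatial and L_T is the
    cross-entropy loss on the temporal (MHSA) branch output.
    The spatial classifier head and the value of `lam` are assumptions.
    """
    def __init__(self, feat_dim, num_classes, lam=0.5):
        super().__init__()
        self.spatial_head = nn.Linear(feat_dim, num_classes)  # FC(V_spatial)
        self.ce = nn.CrossEntropyLoss()
        self.lam = lam

    def forward(self, v_spatial, temporal_logits, labels):
        # v_spatial: [B, feat_dim] fused key-point spatial feature (assumed pooled)
        # temporal_logits: [B, num_classes] output of the temporal branch
        loss_s = self.ce(self.spatial_head(v_spatial), labels)   # L_S
        loss_t = self.ce(temporal_logits, labels)                # L_T
        return self.lam * loss_s + (1.0 - self.lam) * loss_t     # L_total
```

训练时,λ可作为超参数在验证集上调节。During training, λ can be treated as a hyperparameter tuned on the validation set.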

在此说明书中,本发明已参照其特定的实施例作了描述。但是,很显然仍可以作出各种修改和变换而不背离本发明的精神和范围。因此,说明书和附图应被认为是说明性的而非限制性的。In this specification, the present invention has been described with reference to specific embodiments thereof. However, it is apparent that various modifications and variations may be made without departing from the spirit and scope of the present invention. Therefore, the specification and drawings should be regarded as illustrative rather than restrictive.

Claims (11)

1.一种基于人脸关键点优化区域特征的视频面部表情识别方法,其特征在于,所述的方法包括以下步骤:1. A video facial expression recognition method based on facial key point optimized regional features, characterized in that the method comprises the following steps:
(1)采集视频人脸表情数据集,并对视频进行预处理操作;(1) collecting a video facial expression dataset and performing preprocessing operations on the videos;
(2)构建CNN结构的空间特征提取模块和基于人脸关键点的特征提取模块,分别对采集的数据进行初级空域特征和人脸关键点局部区域特征的提取;(2) constructing a CNN-based spatial feature extraction module and a feature extraction module based on facial key points, and extracting primary spatial-domain features and local regional features of facial key points from the collected data respectively;
(3)构建由人脸关键点引导的图卷积模块对提取到的相关信息进行信息强化;(3) constructing a graph convolution module guided by facial key points to strengthen the extracted relevant information;
(4)构建特征融合模块对一帧内的图像进行特征增强;(4) constructing a feature fusion module to enhance the features of the image within a frame;
(5)使用时序模块和分类模块对提取到的相关信息解码处理并作出决策,并构建总体损失函数。(5) using the temporal module and the classification module to decode the extracted relevant information and make a decision, and constructing an overall loss function.

2.根据权利要求1所述的基于人脸关键点优化区域特征的视频面部表情识别方法,其特征在于,所述的步骤(1)具体包括以下步骤:2. The video facial expression recognition method based on facial key point optimized regional features according to claim 1, characterized in that step (1) specifically comprises the following steps:
(1.1)对采集的视频经过视频分帧和人脸提取器得到裁剪后的人脸图像,得到尺寸为256×256的原始视频帧;(1.1) subjecting the captured video to video framing and a face extractor to obtain cropped face images, yielding original video frames of size 256×256;
(1.2)对所述的原始视频帧序列使用数据增强和降采样,构建得到最终112×112的训练以及测试图像;所述的数据增强方式包括:随机采样,对一组视频连续采样2帧,顺序向后采样8次得到当次训练或测试的图像组,将经过预处理后的图像组进行随机旋转[-45°,45°],随后针对图像组进行随机水平翻转、随机色彩抖动、随机高斯模糊和随机灰度化,最后通过尺寸缩放操作得到最终训练以及测试的图像。(1.2) applying data augmentation and downsampling to the original video frame sequence to construct the final 112×112 training and test images; the data augmentation includes: random sampling, in which 2 consecutive frames are sampled from a video and this sampling is repeated backwards 8 times to obtain the image group for the current training or test run; the preprocessed image group is randomly rotated within [-45°, 45°], then subjected to random horizontal flipping, random color jittering, random Gaussian blurring and random grayscale conversion, and finally the final training and test images are obtained through a resizing operation.

3.根据权利要求1所述的基于人脸关键点优化区域特征的视频面部表情识别方法,其特征在于,所述的步骤(2)具体包括以下步骤:3. The video facial expression recognition method based on facial key point optimized regional features according to claim 1, characterized in that step (2) specifically comprises the following steps:
(2.1)构建所述的空间特征提取模块,具体为:使用CNN模型作为人脸表情识别视频数据帧的特征提取网络,将作为样本的112×112×3×16维度输入提取转换成一组2D特征图T,其中h和w分别代表特征图的长和宽,d代表特征图的维度,t代表样本内的帧数;构建三元组注意力模块,将2D特征图T转换为2D特征图F,所述的特征图F与特征图T的维度保持一致;(2.1) constructing the spatial feature extraction module, specifically: using a CNN model as the feature extraction network for the facial expression recognition video frames, and converting the 112×112×3×16-dimensional input sample into a set of 2D feature maps T, where h and w denote the height and width of the feature maps, d denotes the feature dimension, and t denotes the number of frames in the sample; constructing a triplet attention module to convert the 2D feature maps T into 2D feature maps F, the dimensions of the feature maps F being consistent with those of the feature maps T;
(2.2)构建所述的人脸关键点特征提取模块,具体为:将作为样本的112×112×3×16维度输入提取转换成一组人脸关键点特征图A',其中N代表人脸关键点个数,h和w分别代表特征图的长和宽;从特征图A'采样固定的18个人脸关键点,将特征图A'转换为2D特征热图A,且所述的特征热图与(2.1)节的特征图T的长和宽保持一致;(2.2) constructing the facial key point feature extraction module, specifically: converting the 112×112×3×16-dimensional input sample into a set of facial key point feature maps A', where N denotes the number of facial key points, and h and w denote the height and width of the feature maps; sampling a fixed set of 18 facial key points from the feature maps A' and converting the feature maps A' into 2D feature heat maps A, the height and width of which are consistent with those of the feature maps T in Section (2.1);
(2.3)基于构建的所述的空间特征提取模块和人脸关键点特征提取模块,对采集到的人脸表情识别数据集进行初级空域特征和人脸关键点局部区域特征的提取。(2.3) based on the constructed spatial feature extraction module and facial key point feature extraction module, extracting primary spatial-domain features and local regional features of facial key points from the collected facial expression recognition dataset.

4.根据权利要求1所述的基于人脸关键点优化区域特征的视频面部表情识别方法,其特征在于,所述的步骤(3)人脸关键点引导的图卷积模块包括以下步骤:4. The video facial expression recognition method based on facial key point optimized regional features according to claim 1, characterized in that the facial key point guided graph convolution module of step (3) comprises the following steps:
(3.1)构建所述的人脸关键点引导的图卷积模块输入特征,具体为:使用(2.1)节构建的空间特征提取模块获取的2D特征图F和(2.2)节构建的前18个人脸关键点的2D特征热图A,使用池化计算和向量元素乘法,并在执行完每个操作之后,均使用层归一化来进一步调整输出,得到用于人脸关键点引导的图卷积模块的输入信息Vi,如以下公式:(3.1) constructing the input features of the facial key point guided graph convolution module, specifically: using the 2D feature maps F obtained by the spatial feature extraction module constructed in Section (2.1) and the 2D feature heat maps A of the first 18 facial key points constructed in Section (2.2), applying pooling and element-wise vector multiplication, and after each operation applying layer normalization to further adjust the output, so as to obtain the input information Vi of the facial key point guided graph convolution module, as in the following formula:
其中,g为全局池化操作,⊙为向量元素相乘操作,()为向量拼接操作,Ti为第i帧图像的2D空域特征图,vi,g为第i帧图像的全局人脸关键点特征,为第i帧图像第j个关键点的特征热图;where g denotes the global pooling operation, ⊙ denotes element-wise vector multiplication, () denotes the vector concatenation operation, Ti is the 2D spatial feature map of the i-th frame, vi,g is the global facial key point feature of the i-th frame, and is the feature heat map of the j-th key point of the i-th frame;
(3.2)构建所述的人脸关键点引导的图卷积模块,具体为:使用采样并特征增强的特征热图A,以及由(3.1)节构建得到的输入特征Vi和新建的可学习特征矩阵Wl,使用矩阵乘法、激活函数、矩阵元素乘法操作转换得到输出,如以下公式:(3.2) constructing the facial key point guided graph convolution module, specifically: using the sampled and feature-enhanced feature heat maps A, the input features Vi constructed in Section (3.1) and a newly created learnable feature matrix Wl, the output is obtained through matrix multiplication, activation function and element-wise matrix multiplication operations, as in the following formula:
其中,为矩阵相乘操作,⊙为向量元素相乘操作,T为矩阵转置操作,f为全链接操作,A为特征热图,Wl为可学习特征矩阵,Wa是由Wl经过矩阵乘法和激活计算得到的矩阵,Vil是(3.1)节中得到的Vi中前18个人脸关键点特征构成的特征向量,Vig为Vi中最后一个全局特征向量。where is the matrix multiplication operation, ⊙ denotes element-wise vector multiplication, T denotes the matrix transpose operation, f denotes the fully connected operation, A is the feature heat map, Wl is the learnable feature matrix, Wa is the matrix obtained from Wl through matrix multiplication and activation, Vil is the feature vector composed of the first 18 facial key point features in Vi obtained in Section (3.1), and Vig is the last global feature vector in Vi.

5.根据权利要求4所述的基于人脸关键点优化区域特征的视频面部表情识别方法,其特征在于,根据采样得到的人脸关键点将人脸划分为三个区域,依据划分的不同区域对(3.2)构建的人脸关键点引导的图卷积模块输出信息做特征融合操作,构建融合人脸关键点信息的空间特征信息Vspatial,并以一组交叉熵损失函数约束该融合操作,如以下公式:5. The video facial expression recognition method based on facial key point optimized regional features according to claim 4, characterized in that the face is divided into three regions according to the sampled facial key points; a feature fusion operation is performed on the output information of the facial key point guided graph convolution module constructed in (3.2) according to the divided regions, so as to construct the spatial feature information Vspatial that fuses the facial key point information, and the fusion operation is constrained by a set of cross-entropy loss functions, as in the following formula:
LS=Lclass(FC(Vspatial))
其中,FC为全链接操作,Lclass(·)为多类别交叉熵损失函数,LS代表空间分量贡献的分类损失。where FC is the fully connected operation, Lclass(·) is the multi-class cross-entropy loss function, and LS denotes the classification loss contributed by the spatial component.

6.根据权利要求5所述的基于人脸关键点优化区域特征的视频面部表情识别方法,其特征在于,所述的时序模块和分类模块包括以下步骤:6. The video facial expression recognition method based on facial key point optimized regional features according to claim 5, characterized in that the temporal module and the classification module comprise the following steps:
(5.1)构建时序模块的输入,具体为:构建类别特征、位置编码和(3.2)的人脸关键点引导的融合特征Vspatial,三者组成时序模块的输入特征z0,其中N为拼接特征的个数,d为特征的维度,公式如下:(5.1) constructing the input of the temporal module, specifically: constructing the class feature, the position encoding and the facial key point guided fusion feature Vspatial of (3.2), which together form the input feature z0 of the temporal module, where N is the number of concatenated features and d is the feature dimension, as follows:
z0=[xclass;Vspatial]+Epos=[xclass;v1;v2;…;vN]+Epos
其中,xclass为类别特征,Epos为位置编码,";"符号为拼接操作,Vspatial为融合人脸关键点信息的空间特征信息;where xclass is the class feature, Epos is the position encoding, the ";" symbol denotes the concatenation operation, and Vspatial is the spatial feature information fused with the facial key point information;
(5.2)输入特征首先经过一个线性映射层,产生一个query矩阵、一个Key矩阵以及一个Value矩阵,接着再将三个矩阵传入所述的多头自注意力机制MHSA中,计算得到权重矩阵W,如以下公式所示:(5.2) the input features first pass through a linear mapping layer to produce a query matrix, a Key matrix and a Value matrix; the three matrices are then fed into the multi-head self-attention mechanism MHSA to compute the weight matrix W, as shown in the following formula:
其中,T为矩阵转置操作,d为归一化常数;where T denotes the matrix transpose operation and d is a normalization constant;
(5.3)权重矩阵W与Value矩阵相乘,并经过一个残差操作和多层MLP处理,得到可用于分类的输出,公式如下:(5.3) the weight matrix W is multiplied by the Value matrix, and after a residual operation and multi-layer MLP processing, the output usable for classification is obtained, as follows:
si=WVi;i∈[1,N]
zl=WH[s1;s2;…;sN]T+zl-1
其中,T为矩阵转置操作,WH为可学习的权重矩阵;where T denotes the matrix transpose operation and WH is a learnable weight matrix;
(5.4)堆叠多层(5.2)和(5.3)节所述的注意力机制和残差操作,在最后一层处对于输出使用一个分类器模块,将特征通过全链接网络映射到类别数c,得到表示c个类别可能概率的向量,对该向量做一次softmax操作得到R,用于参与计算有标签监督的分类损失函数,公式如下:(5.4) the attention mechanisms and residual operations described in Sections (5.2) and (5.3) are stacked in multiple layers; at the last layer, a classifier module is applied to the output, the features are mapped to the number of categories c through a fully connected network to obtain a vector representing the probabilities of the c categories, and a softmax operation is applied to this vector to obtain R, which is used to compute the label-supervised classification loss, as follows:
LT=Lclass(R)
其中,Lclass(·)为多类别交叉熵损失函数,LT代表时间分量贡献的分类损失。where Lclass(·) is the multi-class cross-entropy loss function and LT denotes the classification loss contributed by the temporal component.

7.根据权利要求1所述的基于人脸关键点优化区域特征的视频面部表情识别方法,其特征在于,所述的步骤(5)的构建总体损失函数,具体为:用重要性权重λ平衡来自空间分量贡献的分类损失LS和来自时间分量贡献的分类损失LT,公式如下:7. The video facial expression recognition method based on facial key point optimized regional features according to claim 1, characterized in that the construction of the overall loss function in step (5) is specifically: an importance weight λ is used to balance the classification loss LS contributed by the spatial component and the classification loss LT contributed by the temporal component, as follows:
Ltotal=λ×LS+(1-λ)×LT。

8.一种实现权利要求1至7中任一项所述的基于人脸关键点优化区域特征的视频面部表情识别方法的系统,其特征在于,所述的系统包括:8. A system for implementing the video facial expression recognition method based on facial key point optimized regional features according to any one of claims 1 to 7, characterized in that the system comprises:
空间特征提取模块和基于人脸关键点的特征提取模块,用于对采集到的视频人脸表情识别数据集中的相关数据信息进行空间特征数据提取以及人脸关键点特征数据提取;a spatial feature extraction module and a feature extraction module based on facial key points, for extracting spatial feature data and facial key point feature data from the relevant data in the collected video facial expression recognition dataset;
人脸关键点引导的图卷积模块,与所述的空间特征提取模块、基于人脸关键点的特征提取模块相连接,用于通过图卷积形式借助人脸关键点强化该帧内的人脸表情信息;a facial key point guided graph convolution module, connected to the spatial feature extraction module and the facial key point based feature extraction module, for strengthening the facial expression information within the frame with the help of the facial key points in the form of graph convolution;
特征融合模块,与所述的人脸关键点引导的图卷积模块相连接,用于压缩高维特征并根据人脸区域划分决定情感信息的关键特征向量;a feature fusion module, connected to the facial key point guided graph convolution module, for compressing high-dimensional features and determining the key feature vectors of the emotional information according to the division of the facial regions;
时序模块和分类模块,与所述的特征融合模块相连接,使用多头自注意力机制MHSA,以及一个多层MLP,构建一个时序特征提取网络,并利用所述的分类网络和交叉熵损失函数对分类结果进行有标签监督,以获取最终的视频人脸表情识别结果。a temporal module and a classification module, connected to the feature fusion module, which use the multi-head self-attention mechanism MHSA and a multi-layer MLP to construct a temporal feature extraction network, and use the classification network and the cross-entropy loss function to supervise the classification results with labels, so as to obtain the final video facial expression recognition result.

9.一种用于实现基于人脸关键点优化区域特征的视频面部表情识别的装置,其特征在于,所述的装置包括:9. A device for implementing video facial expression recognition based on facial key point optimized regional features, characterized in that the device comprises:
处理器,被配置成执行计算机可执行指令;a processor configured to execute computer-executable instructions;
存储器,存储一个或多个计算机可执行指令,所述计算机可执行指令被所述处理器执行时,实现权利要求1至7中任一项所述的基于人脸关键点优化区域特征的视频面部表情识别方法的各个步骤。a memory storing one or more computer-executable instructions which, when executed by the processor, implement the steps of the video facial expression recognition method based on facial key point optimized regional features according to any one of claims 1 to 7.

10.一种用于实现基于人脸关键点优化区域特征的视频面部表情识别的处理器,其特征在于,所述的处理器被配置成执行计算机可执行指令,所述的计算机可执行指令被所述的处理器执行时,实现权利要求1至7中任一项所述的基于人脸关键点优化区域特征的视频面部表情识别方法的各个步骤。10. A processor for implementing video facial expression recognition based on facial key point optimized regional features, characterized in that the processor is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the video facial expression recognition method based on facial key point optimized regional features according to any one of claims 1 to 7.

11.一种计算机可读存储介质,其特征在于,其上存储有计算机程序,所述的计算机程序可被处理器执行以实现权利要求1至7中任一项所述的基于人脸关键点优化区域特征的视频面部表情识别方法的各个步骤。11. A computer-readable storage medium, characterized in that a computer program is stored thereon, and the computer program can be executed by a processor to implement the steps of the video facial expression recognition method based on facial key point optimized regional features according to any one of claims 1 to 7.
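作为补充说明,下面给出权利要求6所述时序模块的一个示意性代码草图(非本发明的正式实现):类别特征与位置编码拼接后,经多层MHSA与MLP残差堆叠,最后由全连接分类头输出类别logits;层数、维度等超参数以及nn.MultiheadAttention的使用均为示意性假设。As a supplementary illustration, the following is a minimal, hypothetical sketch (not the official implementation) of the temporal module described in claim 6: a class token and position embedding are added, multiple MHSA and MLP residual blocks are stacked, and a fully connected head outputs the class logits; the depth, dimensions and the use of nn.MultiheadAttention are assumptions.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Transformer-style temporal module sketch: class token + position
    embedding, stacked multi-head self-attention (MHSA) and MLP blocks
    with residual connections, and a fully connected classifier head.

    Layer sizes, depth and normalization placement are assumptions; the
    claims only specify MHSA, residual operations and a multi-layer MLP.
    """
    def __init__(self, dim=512, depth=4, heads=8, num_classes=7, num_tokens=16):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, dim))   # E_pos
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "norm1": nn.LayerNorm(dim),
                "attn": nn.MultiheadAttention(dim, heads, batch_first=True),
                "norm2": nn.LayerNorm(dim),
                "mlp": nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                     nn.Linear(dim * 4, dim)),
            }) for _ in range(depth)
        ])
        self.head = nn.Linear(dim, num_classes)                              # classifier

    def forward(self, v_spatial):
        # v_spatial: [B, N, dim] per-frame fused spatial features V_spatial
        b = v_spatial.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        z = torch.cat([cls, v_spatial], dim=1) + self.pos_embed              # z_0
        for blk in self.blocks:
            h = blk["norm1"](z)
            attn_out, _ = blk["attn"](h, h, h)                               # MHSA
            z = z + attn_out                                                 # residual
            z = z + blk["mlp"](blk["norm2"](z))                              # MLP + residual
        return self.head(z[:, 0])                                            # class logits R
```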
CN202311526318.9A 2023-11-16 2023-11-16 Video facial expression recognition method, system, device, processor and storage medium based on facial key point optimization region characteristics Pending CN117877081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311526318.9A CN117877081A (en) 2023-11-16 2023-11-16 Video facial expression recognition method, system, device, processor and storage medium based on facial key point optimization region characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311526318.9A CN117877081A (en) 2023-11-16 2023-11-16 Video facial expression recognition method, system, device, processor and storage medium based on facial key point optimization region characteristics

Publications (1)

Publication Number Publication Date
CN117877081A true CN117877081A (en) 2024-04-12

Family

ID=90593375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311526318.9A Pending CN117877081A (en) 2023-11-16 2023-11-16 Video facial expression recognition method, system, device, processor and storage medium based on facial key point optimization region characteristics

Country Status (1)

Country Link
CN (1) CN117877081A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118711221A (en) * 2024-05-31 2024-09-27 首都师范大学 A method and system for identifying depressive disorders based on facial video sequences
CN119360430A (en) * 2024-12-05 2025-01-24 湖北大学 Video facial expression recognition method combining sliding window and decision fusion strategy design
CN119360430B (en) * 2024-12-05 2025-08-01 湖北大学 Video facial expression recognition method designed by combining sliding window and decision fusion strategy
CN119334923A (en) * 2024-12-17 2025-01-21 陕西金翼服装有限责任公司 A high-precision detection method for key points of textiles based on ultraviolet light

Similar Documents

Publication Publication Date Title
US12079703B2 (en) Convolution-augmented transformer models
CN116311483B (en) Micro-expression Recognition Method Based on Partial Facial Region Reconstruction and Memory Contrastive Learning
CN117877081A (en) Video facial expression recognition method, system, device, processor and storage medium based on facial key point optimization region characteristics
CN111582397B (en) CNN-RNN image emotion analysis method based on attention mechanism
Dastbaravardeh et al. Channel attention‐based approach with autoencoder network for human action recognition in low‐resolution frames
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
CN113688799B (en) A Facial Expression Recognition Method Based on Improved Deep Convolutional Generative Adversarial Networks
US20250201019A1 (en) Systems, methods, and storage media for creating image data embeddings to be used for image recognition
CN114283352A (en) Video semantic segmentation device, training method and video semantic segmentation method
CN112560668B (en) A human behavior recognition method based on scene prior knowledge
CN117576279A (en) Digital human driving method and system based on multi-modal data
CN118334549A (en) Short video label prediction method and system for multi-mode collaborative interaction
CN111401105B (en) Video expression recognition method, device and equipment
Neelima et al. An Efficient Deep Learning framework with CNN and RBM for Native Speech to Text Translation
Dey et al. Recognition of wh-question sign gestures in video streams using an attention driven c3d-bilstm network
CN111339782A (en) Sign language translation system and method based on multilevel semantic analysis
CN118658189A (en) A global and local feature facial expression recognition method based on permuted attention
CN118839253A (en) Cross-modal interaction complement aviation security event identification method
Kang et al. Progressive Masking Oriented Self-Taught Learning for Occluded Facial Expression Recognition
CN116740795A (en) Expression recognition method, model and model training method based on attention mechanism
Federico et al. ViSketch-GPT: Collaborative Multi-Scale Feature Extraction for Sketch Recognition and Generation
Islam et al. New hybrid deep learning method to recognize human action from video
Shijin et al. Research on classroom expression recognition based on deep circular convolution self-encoding network
AlWardany et al. Emotion to Detect: Facial Expression Recognition by CNN
BENKADDOUR et al. Hand gesture and sign language recognition based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination