Disclosure of Invention
The invention aims to solve the technical problems of conventional video behavior identification methods, namely large computation, poor robustness, inability to extract the motion information of a video and low accuracy, and provides a video behavior identification method based on a compression reward and punishment mechanism.
In order to solve the technical problems, the invention adopts the technical scheme that:
the video behavior identification method based on the compression reward and punishment mechanism comprises the following steps:
step 1, dividing a video to be identified into a plurality of equal-length segments, and randomly extracting stacked optical flow images and RGB video frames from each segment;
step 2, respectively inputting the stacked optical flow images and the RGB video frames into a temporal and spatial two-stream convolutional neural network containing a compression reward and punishment mechanism, weighting the features extracted by the network through the compression and reward-and-punishment operations, and making preliminary predictions of the video behavior on the temporal and spatial streams respectively according to the weighted temporal and spatial features;
step 3, respectively fusing the temporal and spatial preliminary prediction results of each segment to obtain video-level temporal and spatial prediction results;
step 4, fusing the video-level temporal and spatial prediction results to obtain the final video behavior recognition result;
step 5, training the network by iteratively updating the model parameters and optimizing the loss value of the video-level prediction.
The invention provides a network structure based on a compression reward and punishment mechanism. The network is constructed on the basis of a temporal-segment two-stream network and can model long videos; temporal and spatial features are extracted separately by a temporal convolutional network and a spatial convolutional network, so the computation and parameter amounts are small. Through the compression reward and punishment mechanism, the features extracted by the two-stream network are given different weights on the feature channels according to their importance, and the features of the current task receive different excitations (rewards or punishments) according to these weights, so that effective features obtain larger weights while invalid or weakly useful features obtain smaller weights, and the recognition accuracy is higher.
Further, in step 1, the video to be identified is divided into a plurality of equal-length segments, which specifically includes:
dividing the video to be identified into K segments {S_1, S_2, …, S_K} at equal time intervals; experiments show that the accuracy is highest when K is set to 3. Segmenting the video in this way allows long videos to be modeled: each segment makes a preliminary prediction of the classification, and the results of all the segments are finally fused to obtain the prediction for the whole video, so that the information of the whole video can be fully utilized.
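As an illustration of this sampling step (not part of the original disclosure), the following minimal sketch splits a video, given only as a frame count, into K equal segments and draws one random snippet start index per segment; the function name and the snippet length parameter are assumptions made for the example.

```python
import random

def sample_snippets(num_frames, k=3, snippet_len=1, seed=None):
    """Split a video of `num_frames` frames into K equal-length segments
    and randomly pick the starting index of one snippet inside each segment.

    Returns a list of K starting frame indices (one per segment S_1..S_K).
    """
    rng = random.Random(seed)
    seg_len = num_frames // k                      # equal time intervals
    starts = []
    for i in range(k):
        seg_begin = i * seg_len
        seg_end = seg_begin + seg_len              # exclusive
        # the snippet must fit entirely inside its segment
        last_valid = max(seg_begin, seg_end - snippet_len)
        starts.append(rng.randint(seg_begin, last_valid))
    return starts

# example: a 300-frame video, K = 3 segments
print(sample_snippets(300, k=3, snippet_len=1, seed=0))
```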
Furthermore, in step 2, the stacked optical flow images and the RGB video frames are respectively input into a temporal convolutional neural network and a spatial convolutional neural network containing the compression reward and punishment mechanism, the features extracted by the networks are weighted through the compression and reward-and-punishment operations, and preliminary predictions of the video behavior are made on the temporal and spatial streams respectively according to the weighted temporal and spatial features, specifically: the stacked optical flow images are input into the temporal convolutional neural network of the compression reward and punishment mechanism, and the RGB video frames are input into the spatial convolutional neural network of the compression reward and punishment mechanism;
T_1, T_2, …, T_K denote the sampled video snippets (short segments), where each T_k is obtained by random sampling from the corresponding segment S_k. The function F(T_k; W) denotes the feature extraction performed on the short snippet T_k by the convolutional network with parameters W; its output is a feature vector whose dimension equals the number of categories, which is equivalent to the category score of each short snippet.
The temporal and spatial two-stream convolutional network extracts the temporal and spatial features of the video separately; compared with a three-dimensional convolutional network structure, which extracts temporal and spatial features simultaneously, the computation and parameter amounts are smaller and training is easier.
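To make the role of F(T_k; W) concrete, the sketch below is an illustrative stand-in rather than the claimed network: each stream wraps a backbone (here a tiny convolutional stack in place of the SE residual network described later) followed by a fully connected layer whose output dimension equals the number of behavior categories, so one snippet yields one category-score vector per stream. The class and variable names are assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 101            # e.g. UCF101
FLOW_CHANNELS = 20           # stacked optical flow images (see the description of step 2)

class SnippetScorer(nn.Module):
    """F(T_k; W): maps one snippet to a category-score vector for one stream."""
    def __init__(self, in_channels, num_classes=NUM_CLASSES):
        super().__init__()
        # `backbone` stands in for the SE residual network described later;
        # a tiny conv stack is used here only so the sketch runs end to end.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(64, num_classes)   # output dimension = number of categories

    def forward(self, x):
        return self.fc(self.backbone(x))       # unnormalised category scores

temporal_stream = SnippetScorer(in_channels=FLOW_CHANNELS)   # stacked optical flow input
spatial_stream = SnippetScorer(in_channels=3)                # RGB video frame input

flow_snippet = torch.randn(1, FLOW_CHANNELS, 224, 224)
rgb_snippet = torch.randn(1, 3, 224, 224)
print(temporal_stream(flow_snippet).shape, spatial_stream(rgb_snippet).shape)
# torch.Size([1, 101]) torch.Size([1, 101])
```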
Further, in step 3, the temporal and spatial preliminary prediction results of each segment are respectively fused to obtain video-level temporal and spatial prediction results, specifically:
the category scores of the multiple short snippets are combined to obtain their consensus on the category prediction. The segment consensus function G is shown in formula 1: an aggregation function g infers the score G_i of each category from the scores of that category over all the segments, and g adopts uniform averaging. The prediction function H predicts the probability that the whole video belongs to each behavior category; the Softmax function is selected as the prediction function,
G_i = g(F_i(T_1), F_i(T_2), …, F_i(T_K))    (1)
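Formula 1 and the prediction function H can be written directly as code. The sketch below assumes the K per-snippet category scores are stacked into a (K, C) tensor, and uses uniform averaging for g and Softmax for H, as stated above.

```python
import torch

def segment_consensus(snippet_scores):
    """g: uniform averaging of the per-snippet category scores (formula 1).
    snippet_scores: tensor of shape (K, C), K snippets and C categories."""
    return snippet_scores.mean(dim=0)          # consensus G, shape (C,)

def predict(consensus_scores):
    """H: probability that the whole video belongs to each behavior category."""
    return torch.softmax(consensus_scores, dim=-1)

scores = torch.randn(3, 101)                   # K = 3 snippets, C = 101 categories
probs = predict(segment_consensus(scores))
print(probs.shape, float(probs.sum()))         # (101,), sums to 1.0
```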
Furthermore, the temporal and spatial two-stream convolutional neural network containing the compression reward and punishment mechanism in step 2 adopts a residual network containing the compression reward and punishment mechanism. Using the residual network avoids gradient vanishing and gradient explosion as the number of network layers increases, so the network converges more easily.
Furthermore, the fusion in step 4 adopts average fusion; the predictions of the temporal and spatial convolutional networks for the whole video are fused to obtain the final prediction result.
Further, in step 5, the network is trained by iteratively updating the model parameters, and the loss value of the video-level prediction is optimized. The whole network models the sequence of segments according to formula 2: the video is divided into K segments {S_1, S_2, …, S_K} at equal time intervals, and experiments show that the accuracy is highest when K is set to 3;
Y(S_1, S_2, …, S_K) = H(G(F(T_1; W), …, F(T_K; W)))    (2)
wherein Y represents a video level prediction result;
the loss function adopts the cross-entropy loss; its form is shown in formula 3, where C is the total number of behavior categories and y_i is the ground-truth label of category i;

L(y, G) = -Σ_{i=1}^{C} y_i (G_i - log Σ_{j=1}^{C} exp(G_j))    (3)
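A minimal sketch of the video-level loss, assuming the consensus scores G are used as the logits of a standard softmax cross-entropy (which is formula 3 when y is a one-hot label):

```python
import torch
import torch.nn.functional as F

def video_level_loss(consensus_scores, label):
    """Cross-entropy between the video-level consensus scores G (logits of
    shape (C,)) and the integer ground-truth category `label`."""
    return F.cross_entropy(consensus_scores.unsqueeze(0),
                           torch.tensor([label]))

G = torch.randn(101)          # consensus scores for C = 101 categories
print(float(video_level_loss(G, label=7)))
```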
still further, the number of the stacked optical flow images of the time convolution neural network of the input compression reward and punishment mechanism is 20, and the images are obtained by preprocessing a data set; the RGB video frame of the spatial convolution neural network input to the compression reward and punishment mechanism is a standard three-channel video frame extracted from a data set, the size of the RGB video frame depends on the video frame of an original data set, the RGB video frame is 224 multiplied by 3, the features are extracted through a residual error network, the extracted features obtain the weight of each feature channel in the channel dimension through the compression reward and punishment mechanism, the original features are weighted channel by channel through multiplication, and the recalibration of the original features on the channels is completed.
The compression reward and punishment mechanism serves as a channel attention mechanism: it weights the extracted features in the channel dimension so that effective features are given large weights and invalid or weakly useful features are given small weights, which improves the classification accuracy of the whole network.
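As an illustration of the input format, the sketch below assumes the 20 stacked optical flow images are the horizontal and vertical flow fields of 10 consecutive frames concatenated along the channel dimension (a common two-stream convention; the exact composition produced by the data-set preprocessing is not specified above), together with one 224 × 224 × 3 RGB frame for the spatial stream.

```python
import torch

def build_inputs(flow_x, flow_y, rgb_frame):
    """flow_x, flow_y: lists of 10 single-channel flow images of shape (1, 224, 224);
    rgb_frame: one RGB frame of shape (3, 224, 224).
    Returns (flow_stack, rgb) ready for the temporal and spatial networks."""
    # interleave the x/y flow fields -> 20 channels for the temporal stream
    flow_stack = torch.cat([f for pair in zip(flow_x, flow_y) for f in pair], dim=0)
    return flow_stack.unsqueeze(0), rgb_frame.unsqueeze(0)

flow_x = [torch.randn(1, 224, 224) for _ in range(10)]
flow_y = [torch.randn(1, 224, 224) for _ in range(10)]
rgb = torch.randn(3, 224, 224)
flow_in, rgb_in = build_inputs(flow_x, flow_y, rgb)
print(flow_in.shape, rgb_in.shape)   # (1, 20, 224, 224) (1, 3, 224, 224)
```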
Furthermore, in the residual network containing the compression reward and punishment mechanism, the input image passes through the residual network to obtain a c × h × w feature map group. First, the compression operation F_sq is performed on the feature map group using global average pooling, converting the c × h × w feature map group into a c × 1 × 1 output; the compression operation compresses the global channel information into a channel descriptor, realizing the channel description. The output dimension after the compression operation matches the number of input feature channels;
the formula of the compression operation is shown in formula 4, where z_c represents the result of the compression operation for channel c, F_sq represents the compression operation, and u_c represents the c-th h × w feature map of the feature map group U;

z_c = F_sq(u_c) = (1 / (h × w)) Σ_{i=1}^{h} Σ_{j=1}^{w} u_c(i, j)    (4)

The reward-and-punishment operation needs to judge the importance of each channel. It is realized with a gate mechanism using a Sigmoid activation function; two fully connected layers are added before the gate mechanism to enhance the generalization ability and the nonlinear expression ability of the model. The first fully connected layer reduces the feature dimension to c/r of the input, where r is a scaling parameter; after activation by a ReLU function, one more fully connected layer raises the feature dimension back to the original dimension. Compared with directly using one fully connected layer, the two-layer fully connected network has more nonlinearity and reduces the parameter and computation amounts;
the reward-and-punishment operation F_ex uses the global information obtained by the compression operation to generate a weight for each feature channel, representing the importance of that feature channel. The reward-and-punishment operation is shown in formula 5, where S represents its result, i.e. the channel weights with dimension c × 1 × 1, z represents the result of the compression operation, W_1 z represents the first fully connected operation whose output dimension is c/r × 1 × 1, and the second fully connected layer W_2 restores the output dimension to c × 1 × 1; finally a gate mechanism with a Sigmoid activation function yields normalized weights between 0 and 1. After S is obtained, the Scale operation weights the original features of the previous feature map group U channel by channel, completing the recalibration of the original features in the channel dimension, and the whole is finally connected to form a complete residual network;
S = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))    (5)
the compression and reward punishment operation is a light attention mechanism, and compared with other methods, the compression reward punishment mechanism obtains a better effect on the premise of only increasing a little calculated amount and parameter amount.
Still further, the parameters of the residual network containing the compression reward and punishment mechanism are as follows; the structure parameters of the 50-layer residual network with the compression reward and punishment mechanism are shown in the following table:
The structure of SE-ResNet50
A 50-layer residual network containing the compression reward and punishment mechanism is used. The network is deep, so deeper features can be extracted; because the base network is a residual network, gradient vanishing and gradient dispersion are avoided even at large network depth, and training converges more easily.
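The table itself is not reproduced here. For orientation only, the sketch below lists the widely published layer configuration of a standard 50-layer SE residual network (conv1 plus four bottleneck stages with 3, 4, 6 and 3 blocks, reduction ratio r = 16); the table in the original filing may differ in detail.

```python
# Standard SE-ResNet50 stage layout (for orientation only; r = 16 in every SE block)
SE_RESNET50_CONFIG = [
    # (stage, number of bottleneck blocks, output channels, spatial size for a 224x224 input)
    ("conv1 7x7/2 + maxpool 3x3/2", 1,   64, "56 x 56"),
    ("stage2 (bottleneck + SE)",    3,  256, "56 x 56"),
    ("stage3 (bottleneck + SE)",    4,  512, "28 x 28"),
    ("stage4 (bottleneck + SE)",    6, 1024, "14 x 14"),
    ("stage5 (bottleneck + SE)",    3, 2048, "7 x 7"),
    ("global average pool + fc",    1, "num_classes", "1 x 1"),
]

for name, blocks, channels, size in SE_RESNET50_CONFIG:
    print(f"{name:32s} blocks={blocks!s:3s} channels={channels!s:12s} output={size}")
```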
In most video behavior recognition tasks the spatio-temporal information extracted by the network is treated equally. In order to ignore irrelevant information and focus on the key information, a convolutional neural network containing a compression reward and punishment mechanism is designed for video behavior recognition. The network is built on the basis of the temporal segment network. First, the video is divided into several equal-length segments, and a stacked optical flow image and an RGB video frame are randomly extracted from each segment; these are respectively input into the temporal and spatial two-stream convolutional neural network with the compression reward and punishment mechanism, the features extracted by the network are weighted through the compression and reward-and-punishment operations, and preliminary predictions of the behavior are made on the temporal and spatial streams according to the weighted temporal and spatial features. Then the temporal and spatial preliminary prediction results of each segment are respectively fused to obtain video-level prediction results. Finally, the video-level temporal and spatial prediction results are fused to obtain the final video behavior recognition result. Experiments on the data sets UCF101 and HMDB51 show that, compared with various other network models without the compression reward and punishment mechanism, this model achieves higher accuracy.
Detailed Description
Experiments are carried out on the mainstream video recognition data sets UCF101 and HMDB51. The video behavior identification method based on the compression reward and punishment mechanism comprises the following steps:
step 1, dividing the video to be identified into several equal-length segments, specifically: dividing the video to be identified into K segments {S_1, S_2, …, S_K} at equal time intervals, with K set to 3, and randomly extracting stacked optical flow images and RGB video frames from each segment;
step 2, respectively inputting the stacked optical flow images and the RGB video frames into the temporal and spatial two-stream convolutional neural network containing the compression reward and punishment mechanism, weighting the features extracted by the network through the compression and reward-and-punishment operations, and making preliminary predictions of the video behavior on the temporal and spatial streams according to the weighted temporal and spatial features, with the temporal and spatial two-stream convolutional neural networks set to 50 layers, specifically:
inputting the stacked optical flow images into the temporal convolutional neural network of the compression reward and punishment mechanism, and inputting the RGB video frames into the spatial convolutional neural network of the compression reward and punishment mechanism;
T_1, T_2, …, T_K denote the sampled video snippets (short segments), where each T_k is obtained by random sampling from the corresponding segment S_k. The function F(T_k; W) denotes the feature extraction performed on the short snippet T_k by the convolutional network with parameters W; its output is a feature vector whose dimension equals the number of categories, which is equivalent to the category score of each short snippet. As shown in fig. 1, fig. 1 is the temporal segment network structure based on the compression reward and punishment mechanism according to the present invention.
The temporal and spatial two-stream convolutional neural network containing the compression reward and punishment mechanism adopts a residual network containing the compression reward and punishment mechanism. As shown in fig. 2, fig. 2 is the temporal and spatial convolutional network structure of the compression reward and punishment mechanism of the present invention. The 20 stacked optical flow images input into the temporal convolutional neural network of the compression reward and punishment mechanism are obtained by preprocessing the data set; the RGB video frames input into the spatial convolutional neural network of the compression reward and punishment mechanism are standard three-channel video frames extracted from the data set, whose size depends on the video frames of the original data set and is 224 × 224 × 3. Features are extracted through the residual network, the extracted features obtain the weight of each feature channel in the channel dimension through the compression reward and punishment mechanism, the original features are weighted channel by channel through multiplication, and the recalibration of the original features on the channels is completed. The compression reward and punishment mechanism is shown in fig. 3; its core idea is to let the network learn the feature weights from the loss, so that effective features obtain larger weights while invalid or weakly useful features obtain smaller weights.
Fig. 4 shows the residual network with the compression reward and punishment mechanism. The input image is 224 × 224 × 3, and a 56 × 56 × 256 feature map group is obtained after it passes through the residual network. First, the compression operation F_sq is performed on the feature map group using global average pooling, converting the 56 × 56 × 256 feature map group into a 1 × 1 × 256 output; the compression operation compresses the global channel information into a channel descriptor, realizing the channel description. The output dimension after the compression operation matches the number of input feature channels;
the formula of the compression operation is shown in formula 4, where z_c represents the result of the compression operation for channel c, F_sq represents the compression operation, and u_c represents the c-th of the 256 h × w feature maps of the feature map group U. The reward-and-punishment operation needs to judge the importance of each channel. It is realized with a gate mechanism using a Sigmoid activation function; two fully connected layers are added before the gate mechanism to enhance the generalization ability and the nonlinear expression ability of the model. The first fully connected layer reduces the feature dimension to 256/16 of the input, where 16 is the scaling parameter; after activation by a ReLU function, one more fully connected layer raises the feature dimension back to the original dimension. Compared with directly using one fully connected layer, the two-layer fully connected network has more nonlinearity and reduces the parameter and computation amounts;
the reward-and-punishment operation F_ex uses the global information obtained by the compression operation to generate a weight for each feature channel, representing the importance of that feature channel. The reward-and-punishment operation is shown in formula 5, where S represents its result, i.e. the channel weights with dimension 1 × 1 × 256, z represents the result of the compression operation, W_1 z represents the first fully connected operation whose output dimension is 1 × 1 × 16 (256/16), and the second fully connected layer W_2 restores the output dimension to 1 × 1 × 256; finally a gate mechanism with a Sigmoid activation function yields normalized weights between 0 and 1. After S is obtained, the Scale operation weights the original features of the previous feature map group U channel by channel, completing the recalibration of the original features in the channel dimension, and the whole is finally connected to form a complete residual network;
S = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))    (5)
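To illustrate how the gate is finally connected to form a complete residual network, the sketch below (an assumption about the wiring, not the exact patented structure) inserts the compression and reward-and-punishment gate into one standard bottleneck residual block before the skip connection is added.

```python
import torch
import torch.nn as nn

class SEBottleneck(nn.Module):
    """One residual bottleneck block with a compression reward-and-punishment gate."""
    def __init__(self, in_ch, mid_ch, out_ch, r=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.se_gate = nn.Sequential(                    # compression + gate on the block output
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(out_ch, out_ch // r), nn.ReLU(inplace=True),
            nn.Linear(out_ch // r, out_ch), nn.Sigmoid(),
        )
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        u = self.body(x)
        s = self.se_gate(u).view(u.size(0), -1, 1, 1)    # per-channel weights S
        return self.relu(u * s + self.shortcut(x))       # re-scaled features + skip connection

x = torch.randn(1, 256, 56, 56)
print(SEBottleneck(256, 64, 256)(x).shape)               # torch.Size([1, 256, 56, 56])
```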
step 3, respectively fusing the temporal and spatial preliminary prediction results of each segment to obtain video-level temporal and spatial prediction results, with average fusion as the fusion mode. The category scores of the multiple short snippets are combined to obtain their consensus on the category prediction. The segment consensus function G is shown in formula 1: an aggregation function g infers the score G_i of each category from the scores of that category over all the segments, and g adopts uniform averaging. The prediction function H predicts the probability that the whole video belongs to each behavior category; the Softmax function is selected as the prediction function,
G_i = g(F_i(T_1), F_i(T_2), …, F_i(T_K))    (1)
step 4, fusing the predictions of the temporal and spatial convolutional networks for the whole video to obtain the final prediction result; the fusion mode is also average fusion.
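A minimal sketch of this final fusion, assuming equal weights for the temporal and spatial video-level predictions:

```python
import torch

def fuse_streams(temporal_probs, spatial_probs):
    """Average fusion of the temporal and spatial video-level predictions."""
    return (temporal_probs + spatial_probs) / 2.0

t = torch.softmax(torch.randn(101), dim=-1)   # temporal video-level prediction
s = torch.softmax(torch.randn(101), dim=-1)   # spatial video-level prediction
fused = fuse_streams(t, s)
print(int(fused.argmax()))                     # index of the recognised behavior category
```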
Step 5, training the network by iteratively updating the model parameters, and optimizing the loss value of video level prediction;
the loss function adopts the cross-entropy loss; its form is shown in formula 3, where C is the total number of behavior categories and y_i is the ground-truth label of category i;
Training and testing iterations are carried out under a Windows system based on the PyTorch deep learning framework. The batch size is set to 16, and each training epoch requires 600 iterations. The input of the temporal convolutional network of the compression reward and punishment mechanism is the optical flow images, with the initial learning rate set to 0.01; the input of the spatial convolutional network of the compression reward and punishment mechanism is the RGB video frames, with the initial learning rate set to 0.0005. During optimization the learning rate is updated adaptively according to the learning result, and the momentum is set to 0.9. The temporal and spatial convolutional networks of the compression reward and punishment mechanism adopt the cross-entropy loss as the optimization function, and the optimization method is the stochastic gradient descent algorithm.
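The training settings above can be collected as follows. Only the values stated in the text (SGD, momentum 0.9, the two initial learning rates, batch size 16, 600 iterations per epoch, cross-entropy loss) are taken from the disclosure; the scheduler shown is merely one possible realisation of the adaptive learning-rate update.

```python
import torch
import torch.nn as nn

def make_training_setup(model, stream):
    """stream: 'temporal' (optical-flow input) or 'spatial' (RGB input)."""
    lr = 0.01 if stream == "temporal" else 0.0005        # initial learning rates from the text
    criterion = nn.CrossEntropyLoss()                    # cross-entropy optimisation function
    optimizer = torch.optim.SGD(model.parameters(),      # stochastic gradient descent
                                lr=lr, momentum=0.9)
    # adaptive learning-rate update according to the learning result (one possible choice)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
    return criterion, optimizer, scheduler

BATCH_SIZE = 16
ITERATIONS_PER_EPOCH = 600

model = nn.Linear(2048, 101)                             # placeholder for the two-stream network
criterion, optimizer, scheduler = make_training_setup(model, stream="temporal")
```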
FIG. 5 shows the variation of the accuracy with the number of iterations, and FIG. 6 shows the variation of the loss function with the number of iterations; the abscissa is the number of iterations and the ordinate is the Top-1 accuracy. The network reaches its maximum recognition rate of 94.6% at 35,000 iterations and then stabilizes, and no overfitting occurs during the whole training process. Fig. 7 compares the recognition rates of the present network and the ResNet network as a function of the number of iterations; the abscissa is the number of iterations and the ordinate is the Top-1 accuracy. The figure shows that the present network varies similarly to the ResNet network, reaching its maximum recognition rate at 35,000 iterations and then stabilizing, with an accuracy 1.1% higher than that of the ResNet network.
The parameters of the residual network containing the compression reward and punishment mechanism are as follows; the structure parameters of the 50-layer residual network with the compression reward and punishment mechanism are shown in the following table:
The structure of SE-ResNet50