Disclosure of Invention
The invention aims to solve the technical problems of conventional video behavior identification methods, namely large computation, poor robustness, inability to extract the motion information of a video and low accuracy, and provides a video behavior identification method based on a compression reward and punishment mechanism.
In order to solve the technical problems, the invention adopts the technical scheme that:
the video behavior identification method based on the compression reward and punishment mechanism comprises the following steps:
step 1, dividing a video to be identified into a plurality of equal-length segments, and randomly extracting stacked optical flow images and RGB video frames from each segment;
step 2, respectively inputting the stacked optical flow images and the RGB video frames into a temporal and spatial two-stream convolutional neural network containing a compression reward and punishment mechanism, weighting the features extracted by the network through the compression and reward-and-punishment operations, and making preliminary predictions of the video behavior on the temporal and spatial streams respectively according to the weighted temporal and spatial features;
step 3, respectively fusing the temporal and spatial preliminary prediction results of each segment to obtain video-level temporal and spatial prediction results;
step 4, fusing the video-level temporal and spatial prediction results to obtain the final video behavior recognition result;
step 5, training the network by iteratively updating the model parameters and optimizing the loss value of the video-level prediction.
The invention provides a network structure based on a compression reward and punishment mechanism. The network is constructed on the basis of a temporal-segment two-stream network and can model long videos; temporal and spatial features are extracted separately by a temporal convolutional network and a spatial convolutional network, so the computation and parameter amounts are small. Through the compression reward and punishment mechanism, the features extracted by the two-stream network are given different weights on the feature channels according to their importance, and the features of the current task receive different excitations (rewards or punishments) according to these weights, so that effective features obtain larger weights while invalid or weakly useful features obtain smaller weights, and the recognition accuracy is higher.
Further, in step 1, the video to be identified is divided into a plurality of equal-length segments, which specifically includes:
dividing the video to be identified into K segments {S_1, S_2, …, S_K} at equal time intervals; experiments show that the accuracy is highest when K is set to 3. Segmenting the video in this way allows long videos to be modeled: each segment makes a preliminary prediction of the classification, and the results of all the segments are finally fused to obtain the prediction for the whole video, so that the information of the whole video can be fully utilized.
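As an illustration of this sampling step (not part of the original disclosure), the following minimal sketch splits a video, given only as a frame count, into K equal segments and draws one random snippet start index per segment; the function name and the snippet length parameter are assumptions made for the example.

```python
import random

def sample_snippets(num_frames, k=3, snippet_len=1, seed=None):
    """Split a video of `num_frames` frames into K equal-length segments
    and randomly pick the starting index of one snippet inside each segment.

    Returns a list of K starting frame indices (one per segment S_1..S_K).
    """
    rng = random.Random(seed)
    seg_len = num_frames // k                      # equal time intervals
    starts = []
    for i in range(k):
        seg_begin = i * seg_len
        seg_end = seg_begin + seg_len              # exclusive
        # the snippet must fit entirely inside its segment
        last_valid = max(seg_begin, seg_end - snippet_len)
        starts.append(rng.randint(seg_begin, last_valid))
    return starts

# example: a 300-frame video, K = 3 segments
print(sample_snippets(300, k=3, snippet_len=1, seed=0))
```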
Furthermore, in step 2, the stacked optical flow images and the RGB video frames are respectively input into a temporal convolutional neural network and a spatial convolutional neural network containing the compression reward and punishment mechanism, the features extracted by the networks are weighted through the compression and reward-and-punishment operations, and preliminary predictions of the video behavior are made on the temporal and spatial streams respectively according to the weighted temporal and spatial features, specifically: the stacked optical flow images are input into the temporal convolutional neural network of the compression reward and punishment mechanism, and the RGB video frames are input into the spatial convolutional neural network of the compression reward and punishment mechanism;
T_1, T_2, …, T_K denote the sampled video snippets (short segments), where each T_k is obtained by random sampling from the corresponding segment S_k. The function F(T_k; W) denotes the feature extraction performed on the short snippet T_k by the convolutional network with parameters W; its output is a feature vector whose dimension equals the number of categories, which is equivalent to the category score of each short snippet.
The temporal and spatial two-stream convolutional network extracts the temporal and spatial features of the video separately; compared with a three-dimensional convolutional network structure, which extracts temporal and spatial features simultaneously, the computation and parameter amounts are smaller and training is easier.
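To make the role of F(T_k; W) concrete, the sketch below is an illustrative stand-in rather than the claimed network: each stream wraps a backbone (here a tiny convolutional stack in place of the SE residual network described later) followed by a fully connected layer whose output dimension equals the number of behavior categories, so one snippet yields one category-score vector per stream. The class and variable names are assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 101            # e.g. UCF101
FLOW_CHANNELS = 20           # stacked optical flow images (see the description of step 2)

class SnippetScorer(nn.Module):
    """F(T_k; W): maps one snippet to a category-score vector for one stream."""
    def __init__(self, in_channels, num_classes=NUM_CLASSES):
        super().__init__()
        # `backbone` stands in for the SE residual network described later;
        # a tiny conv stack is used here only so the sketch runs end to end.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(64, num_classes)   # output dimension = number of categories

    def forward(self, x):
        return self.fc(self.backbone(x))       # unnormalised category scores

temporal_stream = SnippetScorer(in_channels=FLOW_CHANNELS)   # stacked optical flow input
spatial_stream = SnippetScorer(in_channels=3)                # RGB video frame input

flow_snippet = torch.randn(1, FLOW_CHANNELS, 224, 224)
rgb_snippet = torch.randn(1, 3, 224, 224)
print(temporal_stream(flow_snippet).shape, spatial_stream(rgb_snippet).shape)
# torch.Size([1, 101]) torch.Size([1, 101])
```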
Further, in step 3, the temporal and spatial preliminary prediction results of each segment are respectively fused to obtain video-level temporal and spatial prediction results, specifically:
the category scores of the multiple short snippets are combined to obtain their consensus on the category prediction. The segment consensus function G is shown in formula 1: an aggregation function g infers the score G_i of each category from the scores of that category over all the segments, and g adopts uniform averaging. The prediction function H predicts the probability that the whole video belongs to each behavior category; the Softmax function is selected as the prediction function,
G_i = g(F_i(T_1), F_i(T_2), …, F_i(T_K))    (1)
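Formula 1 and the prediction function H can be written directly as code. The sketch below assumes the K per-snippet category scores are stacked into a (K, C) tensor, and uses uniform averaging for g and Softmax for H, as stated above.

```python
import torch

def segment_consensus(snippet_scores):
    """g: uniform averaging of the per-snippet category scores (formula 1).
    snippet_scores: tensor of shape (K, C), K snippets and C categories."""
    return snippet_scores.mean(dim=0)          # consensus G, shape (C,)

def predict(consensus_scores):
    """H: probability that the whole video belongs to each behavior category."""
    return torch.softmax(consensus_scores, dim=-1)

scores = torch.randn(3, 101)                   # K = 3 snippets, C = 101 categories
probs = predict(segment_consensus(scores))
print(probs.shape, float(probs.sum()))         # (101,), sums to 1.0
```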
Furthermore, the temporal and spatial two-stream convolutional neural network containing the compression reward and punishment mechanism in step 2 adopts a residual network containing the compression reward and punishment mechanism. Using the residual network avoids gradient vanishing and gradient explosion as the number of network layers increases, so the network converges more easily.
Furthermore, the fusion in step 4 adopts average fusion; the predictions of the temporal and spatial convolutional networks for the whole video are fused to obtain the final prediction result.
Further, in step 5, the network is trained by iteratively updating the model parameters, and the loss value of the video-level prediction is optimized. The whole network models the sequence of segments according to formula 2: the video is divided into K segments {S_1, S_2, …, S_K} at equal time intervals, and experiments show that the accuracy is highest when K is set to 3;
Y(S_1, S_2, …, S_K) = H(G(F(T_1; W), …, F(T_K; W)))    (2)
wherein Y represents a video level prediction result;
the loss function adopts the cross-entropy loss; its form is shown in formula 3, where C is the total number of behavior categories and y_i is the ground-truth label of category i;

L(y, G) = -Σ_{i=1}^{C} y_i (G_i - log Σ_{j=1}^{C} exp(G_j))    (3)
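A minimal sketch of the video-level loss, assuming the consensus scores G are used as the logits of a standard softmax cross-entropy (which is formula 3 when y is a one-hot label):

```python
import torch
import torch.nn.functional as F

def video_level_loss(consensus_scores, label):
    """Cross-entropy between the video-level consensus scores G (logits of
    shape (C,)) and the integer ground-truth category `label`."""
    return F.cross_entropy(consensus_scores.unsqueeze(0),
                           torch.tensor([label]))

G = torch.randn(101)          # consensus scores for C = 101 categories
print(float(video_level_loss(G, label=7)))
```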
still further, the number of the stacked optical flow images of the time convolution neural network of the input compression reward and punishment mechanism is 20, and the images are obtained by preprocessing a data set; the RGB video frame of the spatial convolution neural network input to the compression reward and punishment mechanism is a standard three-channel video frame extracted from a data set, the size of the RGB video frame depends on the video frame of an original data set, the RGB video frame is 224 multiplied by 3, the features are extracted through a residual error network, the extracted features obtain the weight of each feature channel in the channel dimension through the compression reward and punishment mechanism, the original features are weighted channel by channel through multiplication, and the recalibration of the original features on the channels is completed.
The compression reward and punishment mechanism serves as a channel attention mechanism: it weights the extracted features in the channel dimension so that effective features are given large weights and invalid or weakly useful features are given small weights, which improves the classification accuracy of the whole network.
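As an illustration of the input format, the sketch below assumes the 20 stacked optical flow images are the horizontal and vertical flow fields of 10 consecutive frames concatenated along the channel dimension (a common two-stream convention; the exact composition produced by the data-set preprocessing is not specified above), together with one 224 × 224 × 3 RGB frame for the spatial stream.

```python
import torch

def build_inputs(flow_x, flow_y, rgb_frame):
    """flow_x, flow_y: lists of 10 single-channel flow images of shape (1, 224, 224);
    rgb_frame: one RGB frame of shape (3, 224, 224).
    Returns (flow_stack, rgb) ready for the temporal and spatial networks."""
    # interleave the x/y flow fields -> 20 channels for the temporal stream
    flow_stack = torch.cat([f for pair in zip(flow_x, flow_y) for f in pair], dim=0)
    return flow_stack.unsqueeze(0), rgb_frame.unsqueeze(0)

flow_x = [torch.randn(1, 224, 224) for _ in range(10)]
flow_y = [torch.randn(1, 224, 224) for _ in range(10)]
rgb = torch.randn(3, 224, 224)
flow_in, rgb_in = build_inputs(flow_x, flow_y, rgb)
print(flow_in.shape, rgb_in.shape)   # (1, 20, 224, 224) (1, 3, 224, 224)
```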
Furthermore, in the residual network containing the compression reward and punishment mechanism, the input image passes through the residual network to obtain a c × h × w feature map group. First, the compression operation F_sq is performed on the feature map group using global average pooling, converting the c × h × w feature map group into a c × 1 × 1 output; the compression operation compresses the global channel information into a channel descriptor, realizing the channel description. The output dimension after the compression operation matches the number of input feature channels;
the formula of the compression operation is shown in formula 4, where z_c represents the result of the compression operation for channel c, F_sq represents the compression operation, and u_c represents the c-th h × w feature map of the feature map group U;

z_c = F_sq(u_c) = (1 / (h × w)) Σ_{i=1}^{h} Σ_{j=1}^{w} u_c(i, j)    (4)

The reward-and-punishment operation needs to judge the importance of each channel. It is realized with a gate mechanism using a Sigmoid activation function; two fully connected layers are added before the gate mechanism to enhance the generalization ability and the nonlinear expression ability of the model. The first fully connected layer reduces the feature dimension to c/r of the input, where r is a scaling parameter; after activation by a ReLU function, one more fully connected layer raises the feature dimension back to the original dimension. Compared with directly using one fully connected layer, the two-layer fully connected network has more nonlinearity and reduces the parameter and computation amounts;
the reward-and-punishment operation F_ex uses the global information obtained by the compression operation to generate a weight for each feature channel, representing the importance of that feature channel. The reward-and-punishment operation is shown in formula 5, where S represents its result, i.e. the channel weights with dimension c × 1 × 1, z represents the result of the compression operation, W_1 z represents the first fully connected operation whose output dimension is c/r × 1 × 1, and the second fully connected layer W_2 restores the output dimension to c × 1 × 1; finally a gate mechanism with a Sigmoid activation function yields normalized weights between 0 and 1. After S is obtained, the Scale operation weights the original features of the previous feature map group U channel by channel, completing the recalibration of the original features in the channel dimension, and the whole is finally connected to form a complete residual network;
S = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))    (5)
the compression and reward punishment operation is a light attention mechanism, and compared with other methods, the compression reward punishment mechanism obtains a better effect on the premise of only increasing a little calculated amount and parameter amount.
Still further, the parameters of the residual network containing the compression reward and punishment mechanism are as follows; the structure parameters of the 50-layer residual network with the compression reward and punishment mechanism are shown in the following table:
The structure of SE-ResNet50
A 50-layer residual network containing the compression reward and punishment mechanism is used. The network is deep, so deeper features can be extracted; because the base network is a residual network, gradient vanishing and gradient dispersion are avoided even at large network depth, and training converges more easily.
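The table itself is not reproduced here. For orientation only, the sketch below lists the widely published layer configuration of a standard 50-layer SE residual network (conv1 plus four bottleneck stages with 3, 4, 6 and 3 blocks, reduction ratio r = 16); the table in the original filing may differ in detail.

```python
# Standard SE-ResNet50 stage layout (for orientation only; r = 16 in every SE block)
SE_RESNET50_CONFIG = [
    # (stage, number of bottleneck blocks, output channels, spatial size for a 224x224 input)
    ("conv1 7x7/2 + maxpool 3x3/2", 1,   64, "56 x 56"),
    ("stage2 (bottleneck + SE)",    3,  256, "56 x 56"),
    ("stage3 (bottleneck + SE)",    4,  512, "28 x 28"),
    ("stage4 (bottleneck + SE)",    6, 1024, "14 x 14"),
    ("stage5 (bottleneck + SE)",    3, 2048, "7 x 7"),
    ("global average pool + fc",    1, "num_classes", "1 x 1"),
]

for name, blocks, channels, size in SE_RESNET50_CONFIG:
    print(f"{name:32s} blocks={blocks!s:3s} channels={channels!s:12s} output={size}")
```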
In most video behavior recognition tasks the spatio-temporal information extracted by the network is treated equally. In order to ignore irrelevant information and focus on the key information, a convolutional neural network containing a compression reward and punishment mechanism is designed for video behavior recognition. The network is built on the basis of the temporal segment network. First, the video is divided into several equal-length segments, and a stacked optical flow image and an RGB video frame are randomly extracted from each segment; these are respectively input into the temporal and spatial two-stream convolutional neural network with the compression reward and punishment mechanism, the features extracted by the network are weighted through the compression and reward-and-punishment operations, and preliminary predictions of the behavior are made on the temporal and spatial streams according to the weighted temporal and spatial features. Then the temporal and spatial preliminary prediction results of each segment are respectively fused to obtain video-level prediction results. Finally, the video-level temporal and spatial prediction results are fused to obtain the final video behavior recognition result. Experiments on the data sets UCF101 and HMDB51 show that, compared with various other network models without the compression reward and punishment mechanism, this model achieves higher accuracy.
Detailed Description
Experiments are carried out on the mainstream video recognition data sets UCF101 and HMDB51. The video behavior identification method based on the compression reward and punishment mechanism comprises the following steps:
step 1, dividing the video to be identified into several equal-length segments, specifically: dividing the video to be identified into K segments {S_1, S_2, …, S_K} at equal time intervals, with K set to 3, and randomly extracting stacked optical flow images and RGB video frames from each segment;
step 2, respectively inputting the stacked optical flow images and the RGB video frames into the temporal and spatial two-stream convolutional neural network containing the compression reward and punishment mechanism, weighting the features extracted by the network through the compression and reward-and-punishment operations, and making preliminary predictions of the video behavior on the temporal and spatial streams according to the weighted temporal and spatial features, with the temporal and spatial two-stream convolutional neural networks set to 50 layers, specifically:
inputting the stacked optical flow images into the temporal convolutional neural network of the compression reward and punishment mechanism, and inputting the RGB video frames into the spatial convolutional neural network of the compression reward and punishment mechanism;
T_1, T_2, …, T_K denote the sampled video snippets (short segments), where each T_k is obtained by random sampling from the corresponding segment S_k. The function F(T_k; W) denotes the feature extraction performed on the short snippet T_k by the convolutional network with parameters W; its output is a feature vector whose dimension equals the number of categories, which is equivalent to the category score of each short snippet. As shown in fig. 1, fig. 1 is the temporal segment network structure based on the compression reward and punishment mechanism according to the present invention.
The temporal and spatial two-stream convolutional neural network containing the compression reward and punishment mechanism adopts a residual network containing the compression reward and punishment mechanism. As shown in fig. 2, fig. 2 is the temporal and spatial convolutional network structure of the compression reward and punishment mechanism of the present invention. The 20 stacked optical flow images input into the temporal convolutional neural network of the compression reward and punishment mechanism are obtained by preprocessing the data set; the RGB video frames input into the spatial convolutional neural network of the compression reward and punishment mechanism are standard three-channel video frames extracted from the data set, whose size depends on the video frames of the original data set and is 224 × 224 × 3. Features are extracted through the residual network, the extracted features obtain the weight of each feature channel in the channel dimension through the compression reward and punishment mechanism, the original features are weighted channel by channel through multiplication, and the recalibration of the original features on the channels is completed. The compression reward and punishment mechanism is shown in fig. 3; its core idea is to let the network learn the feature weights from the loss, so that effective features obtain larger weights while invalid or weakly useful features obtain smaller weights.
Fig. 4 shows the residual network with the compression reward and punishment mechanism. The input image is 224 × 224 × 3, and a 56 × 56 × 256 feature map group is obtained after it passes through the residual network. First, the compression operation F_sq is performed on the feature map group using global average pooling, converting the 56 × 56 × 256 feature map group into a 1 × 1 × 256 output; the compression operation compresses the global channel information into a channel descriptor, realizing the channel description. The output dimension after the compression operation matches the number of input feature channels;
the formula of the compression operation is shown in formula 4, where z_c represents the result of the compression operation for channel c, F_sq represents the compression operation, and u_c represents the c-th of the 256 h × w feature maps of the feature map group U. The reward-and-punishment operation needs to judge the importance of each channel. It is realized with a gate mechanism using a Sigmoid activation function; two fully connected layers are added before the gate mechanism to enhance the generalization ability and the nonlinear expression ability of the model. The first fully connected layer reduces the feature dimension to 256/16 of the input, where 16 is the scaling parameter; after activation by a ReLU function, one more fully connected layer raises the feature dimension back to the original dimension. Compared with directly using one fully connected layer, the two-layer fully connected network has more nonlinearity and reduces the parameter and computation amounts;
the reward-and-punishment operation F_ex uses the global information obtained by the compression operation to generate a weight for each feature channel, representing the importance of that feature channel. The reward-and-punishment operation is shown in formula 5, where S represents its result, i.e. the channel weights with dimension 1 × 1 × 256, z represents the result of the compression operation, W_1 z represents the first fully connected operation whose output dimension is 1 × 1 × 16 (256/16), and the second fully connected layer W_2 restores the output dimension to 1 × 1 × 256; finally a gate mechanism with a Sigmoid activation function yields normalized weights between 0 and 1. After S is obtained, the Scale operation weights the original features of the previous feature map group U channel by channel, completing the recalibration of the original features in the channel dimension, and the whole is finally connected to form a complete residual network;
S = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))    (5)
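To illustrate how the gate is finally connected to form a complete residual network, the sketch below (an assumption about the wiring, not the exact patented structure) inserts the compression and reward-and-punishment gate into one standard bottleneck residual block before the skip connection is added.

```python
import torch
import torch.nn as nn

class SEBottleneck(nn.Module):
    """One residual bottleneck block with a compression reward-and-punishment gate."""
    def __init__(self, in_ch, mid_ch, out_ch, r=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.se_gate = nn.Sequential(                    # compression + gate on the block output
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(out_ch, out_ch // r), nn.ReLU(inplace=True),
            nn.Linear(out_ch // r, out_ch), nn.Sigmoid(),
        )
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        u = self.body(x)
        s = self.se_gate(u).view(u.size(0), -1, 1, 1)    # per-channel weights S
        return self.relu(u * s + self.shortcut(x))       # re-scaled features + skip connection

x = torch.randn(1, 256, 56, 56)
print(SEBottleneck(256, 64, 256)(x).shape)               # torch.Size([1, 256, 56, 56])
```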
step 3, respectively fusing the temporal and spatial preliminary prediction results of each segment to obtain video-level temporal and spatial prediction results, with average fusion as the fusion mode. The category scores of the multiple short snippets are combined to obtain their consensus on the category prediction. The segment consensus function G is shown in formula 1: an aggregation function g infers the score G_i of each category from the scores of that category over all the segments, and g adopts uniform averaging. The prediction function H predicts the probability that the whole video belongs to each behavior category; the Softmax function is selected as the prediction function,
G_i = g(F_i(T_1), F_i(T_2), …, F_i(T_K))    (1)
step 4, fusing the predictions of the temporal and spatial convolutional networks for the whole video to obtain the final prediction result; the fusion mode is also average fusion.
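A minimal sketch of this final fusion, assuming equal weights for the temporal and spatial video-level predictions:

```python
import torch

def fuse_streams(temporal_probs, spatial_probs):
    """Average fusion of the temporal and spatial video-level predictions."""
    return (temporal_probs + spatial_probs) / 2.0

t = torch.softmax(torch.randn(101), dim=-1)   # temporal video-level prediction
s = torch.softmax(torch.randn(101), dim=-1)   # spatial video-level prediction
fused = fuse_streams(t, s)
print(int(fused.argmax()))                     # index of the recognised behavior category
```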
Step 5, training the network by iteratively updating the model parameters, and optimizing the loss value of video level prediction;
the loss function adopts the cross-entropy loss; its form is shown in formula 3, where C is the total number of behavior categories and y_i is the ground-truth label of category i;
Training and testing iterations are carried out under a Windows system based on the PyTorch deep learning framework. The batch size is set to 16, and each training epoch requires 600 iterations. The input of the temporal convolutional network of the compression reward and punishment mechanism is the optical flow images, with the initial learning rate set to 0.01; the input of the spatial convolutional network of the compression reward and punishment mechanism is the RGB video frames, with the initial learning rate set to 0.0005. During optimization the learning rate is updated adaptively according to the learning result, and the momentum is set to 0.9. The temporal and spatial convolutional networks of the compression reward and punishment mechanism adopt the cross-entropy loss as the optimization function, and the optimization method is the stochastic gradient descent algorithm.
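The training settings above can be collected as follows. Only the values stated in the text (SGD, momentum 0.9, the two initial learning rates, batch size 16, 600 iterations per epoch, cross-entropy loss) are taken from the disclosure; the scheduler shown is merely one possible realisation of the adaptive learning-rate update.

```python
import torch
import torch.nn as nn

def make_training_setup(model, stream):
    """stream: 'temporal' (optical-flow input) or 'spatial' (RGB input)."""
    lr = 0.01 if stream == "temporal" else 0.0005        # initial learning rates from the text
    criterion = nn.CrossEntropyLoss()                    # cross-entropy optimisation function
    optimizer = torch.optim.SGD(model.parameters(),      # stochastic gradient descent
                                lr=lr, momentum=0.9)
    # adaptive learning-rate update according to the learning result (one possible choice)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
    return criterion, optimizer, scheduler

BATCH_SIZE = 16
ITERATIONS_PER_EPOCH = 600

model = nn.Linear(2048, 101)                             # placeholder for the two-stream network
criterion, optimizer, scheduler = make_training_setup(model, stream="temporal")
```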
FIG. 5 shows the variation of the accuracy with the number of iterations, and FIG. 6 shows the variation of the loss function with the number of iterations; the abscissa is the number of iterations and the ordinate is the Top-1 accuracy. The network reaches its maximum recognition rate of 94.6% at 35,000 iterations and then stabilizes, and no overfitting occurs during the whole training process. Fig. 7 compares the recognition rates of the present network and the ResNet network as a function of the number of iterations; the abscissa is the number of iterations and the ordinate is the Top-1 accuracy. The figure shows that the present network varies similarly to the ResNet network, reaching its maximum recognition rate at 35,000 iterations and then stabilizing, with an accuracy 1.1% higher than that of the ResNet network.
The parameters of the residual network containing the compression reward and punishment mechanism are as follows; the structure parameters of the 50-layer residual network with the compression reward and punishment mechanism are shown in the following table:
The structure of SE-ResNet50