CN111178319A - Video behavior identification method based on compression reward and punishment mechanism - Google Patents

Info

Publication number
CN111178319A
Authority
CN
China
Prior art keywords
compression
reward
video
punishment
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010011032.7A
Other languages
Chinese (zh)
Inventor
张丽红
郭磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University
Priority to CN202010011032.7A
Publication of CN111178319A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and relates to a video behavior recognition method based on a compression reward and punishment mechanism. It mainly addresses the heavy computation, poor robustness, and low accuracy of current video behavior recognition methods. The invention designs a convolutional neural network with a compression reward and punishment mechanism for video behavior recognition. The network is built on the temporal segment network: the video is first divided into three segments, and optical flow images and RGB frames are randomly extracted from each segment and input to the temporal and spatial networks respectively. The extracted features are weighted by the compression and reward-punishment operations, and the weighted temporal and spatial features are used to make preliminary predictions of the behavior on the temporal and spatial channels respectively. The preliminary predictions of all segments are then fused into video-level predictions, and finally the video-level predictions are fused into the video behavior recognition result. Experiments on the UCF101 and HMDB51 datasets show that the model achieves higher accuracy than other models.

Figure 202010011032

Description

Video behavior identification method based on compression reward and punishment mechanism
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a video behavior identification method based on a compression reward punishment mechanism.
Background
Video behavior recognition is a hot spot in the field of computer vision today, aiming at automatically analyzing the ongoing behavior of a video or a sequence of images.
Video behavior recognition methods are divided into traditional methods and deep-learning-based methods. The traditional methods include the dense trajectory algorithm and the improved dense trajectory algorithm. The basic idea of the dense trajectory algorithm is to obtain trajectories in a video sequence from the optical flow field and then compute features such as the histogram of oriented gradients and the histogram of optical flow along the trajectories; the improved dense trajectory algorithm matches key points and optical flow between consecutive frames, thereby eliminating or weakening the influence of camera motion. The traditional methods suffer from heavy computation and poor robustness. Among the deep-learning-based methods, Karpathy et al. trained a deep network, DeepNet, to fuse the features of different image frames in a video, but the model cannot extract the motion information of the video; Varol et al. focused on temporal relation reasoning in the network and used it to classify actions that cannot be distinguished from key frames alone, such as falling, thereby improving accuracy; Simonyan et al. proposed the two-stream convolutional neural network; Ng et al. added a long short-term memory network to the two-stream model to strengthen the links between temporal information, but the two-stream convolutional network still cannot model long videos; Tran et al. proposed extracting features with a three-dimensional convolutional neural network, which allows end-to-end training of the network but has a very large amount of computation and number of parameters; Chollet et al. proposed the depthwise separable convolutional neural network, which performs convolution separately over the three-dimensional convolution region and over the channels, greatly reducing computation while keeping accuracy essentially unchanged. However, the three-dimensional convolutional network and its improved variants are still a few percentage points less accurate than the two-stream methods.
Disclosure of Invention
The invention aims to solve the technical problems of existing video behavior identification methods, namely heavy computation, poor robustness, inability to extract the motion information of a video, and low accuracy, and provides a video behavior identification method based on a compression reward and punishment mechanism.
In order to solve the technical problems, the invention adopts the technical scheme that:
The video behavior identification method based on the compression reward and punishment mechanism comprises the following steps:
step 1, dividing a video to be identified into a plurality of equal-length segments, and randomly extracting stacked optical flow images and RGB video frames from each segment;
step 2, inputting the stacked optical flow images and the RGB video frames respectively into the temporal and spatial two-stream convolutional neural networks containing a compression reward and punishment mechanism, weighting the features extracted by the networks through the compression and reward-punishment operations, and making preliminary predictions of the video behavior on the temporal and spatial channels according to the weighted temporal and spatial features;
step 3, fusing the temporal and spatial preliminary prediction results of the segments respectively to obtain video-level temporal and spatial prediction results;
step 4, fusing the video-level temporal and spatial prediction results to obtain the final video behavior recognition result;
step 5, training the network by iteratively updating the model parameters to optimize the loss of the video-level prediction.
The invention provides a network structure based on a compression reward and punishment mechanism. The network is built on the temporal-segment two-stream network and can therefore model long videos. Temporal and spatial features are extracted by a temporal convolutional network and a spatial convolutional network respectively, so the amount of computation and the number of parameters are small. Through the compression reward and punishment mechanism, the features extracted by the two-stream network are given different weights on the feature channels according to their importance, and the features are rewarded or penalized for the current task according to these weights, so that effective features receive larger weights and ineffective or weakly effective features receive smaller weights, which raises the recognition accuracy.
Further, in step 1, the video to be identified is divided into a plurality of equal-length segments, which specifically includes:
dividing the video to be identified into K segments $\{S_1, S_2, \ldots, S_K\}$ at equal time intervals. Experiments show that accuracy is highest when K = 3. Segmenting the video allows long videos to be modeled: each segment makes a preliminary classification prediction, and the results of all segments are finally fused into a prediction for the whole video, so that the information of the whole video is fully used.
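Step 1 can be illustrated with a minimal sketch, assuming the video is addressed by frame index and that one index is drawn at random from each segment. The function names `split_into_segments` and `sample_snippets` are hypothetical and do not come from the patent; extraction of the optical flow stack and the RGB frame at the sampled position is left out.

```python
import random

def split_into_segments(num_frames, k=3):
    """Split frame indices 0..num_frames-1 into k contiguous, near-equal segments."""
    if k <= 0 or num_frames < k:
        raise ValueError("need at least one frame per segment")
    # Integer boundaries keep the segments as equal in length as possible.
    bounds = [i * num_frames // k for i in range(k + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(k)]

def sample_snippets(num_frames, k=3, rng=None):
    """Draw one random frame index from each of the k segments."""
    rng = rng or random.Random()
    return [rng.choice(seg) for seg in split_into_segments(num_frames, k)]

# Example: a 90-frame video split into K = 3 segments of 30 frames each.
indices = sample_snippets(90, k=3, rng=random.Random(0))
```

Sampling one position per segment keeps the cost per video fixed at K snippets regardless of the video length, which is what lets the segment-based design cover long videos.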
Furthermore, in step 2, the stacked optical flow images and the RGB video frames are respectively input into a temporal convolutional neural network and a spatial convolutional neural network that contain the compression reward and punishment mechanism, the features extracted by the networks are weighted through the compression and reward-punishment operations, and preliminary predictions of the video behavior are made on the temporal and spatial channels according to the weighted temporal and spatial features, specifically: the stacked optical flow images are input into the temporal convolutional neural network with the compression reward and punishment mechanism, and the RGB video frames are input into the spatial convolutional neural network with the compression reward and punishment mechanism;
$T_1, T_2, \ldots, T_K$ denote the video snippets; each snippet $T_k$ is obtained by random sampling from its corresponding segment $S_k$. The function $F(T_k; W)$ denotes the convolutional network with parameters $W$ applied to the snippet $T_k$; its output is a feature vector whose dimension equals the number of categories, i.e. the class scores of that snippet.
The temporal and spatial two-stream convolutional networks extract the temporal and spatial features of the video separately. Compared with a three-dimensional convolutional network structure, which extracts temporal and spatial features simultaneously, they require less computation and fewer parameters and are easier to train.
Further, in the step 3, the temporal and spatial preliminary prediction results of each segment are respectively fused to obtain a video-level temporal and spatial prediction result, specifically:
combining the class scores of the snippets to obtain their consensus on the class prediction. The segment consensus function $G$ is given by formula 1: an aggregation function $g$ infers the score $G_i$ of a category from the scores of that same category over all snippets, and $g$ uses uniform averaging. The prediction function $H$ predicts the probability that the whole video belongs to each behavior category; the Softmax function is chosen as the prediction function.
$G_i = g(F_i(T_1), \ldots, F_i(T_K))$ (1)
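The consensus and prediction functions can be sketched in a few lines, assuming each snippet has already produced a list of per-class scores. The names `segment_consensus` and `softmax` are illustrative; only the uniform averaging and the Softmax choice come from the text.

```python
import math

def segment_consensus(segment_scores):
    """G: average the score of each class over all K snippets (uniform averaging)."""
    k = len(segment_scores)
    num_classes = len(segment_scores[0])
    return [sum(s[i] for s in segment_scores) / k for i in range(num_classes)]

def softmax(scores):
    """H: turn consensus scores into class probabilities."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three snippets (K = 3), three classes.
scores = [[1.0, 2.0, 0.0], [3.0, 0.0, 0.0], [2.0, 1.0, 0.0]]
consensus = segment_consensus(scores)   # [2.0, 1.0, 0.0]
probabilities = softmax(consensus)
```

In this toy input the second snippet alone would favor class 0 and the first class 1; averaging lets all snippets vote before a single video-level decision is made.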
Furthermore, the temporal and spatial two-stream convolutional neural networks containing the compression reward and punishment mechanism in step 2 adopt a residual network containing the compression reward and punishment mechanism. Using a residual network avoids vanishing and exploding gradients as the number of network layers grows, so the network converges more easily.
Furthermore, the fusion in step 4 uses average fusion: the predictions of the temporal and spatial convolutional networks for the whole video are fused to obtain the final prediction result.
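A minimal sketch of the average fusion in step 4, assuming each stream has already produced a video-level probability per class. The function names are hypothetical, and equal weighting of the two streams is assumed because the text only says "average fusion".

```python
def fuse_streams(temporal_probs, spatial_probs):
    """Average the video-level predictions of the temporal and spatial streams."""
    if len(temporal_probs) != len(spatial_probs):
        raise ValueError("both streams must score the same classes")
    return [(t + s) / 2.0 for t, s in zip(temporal_probs, spatial_probs)]

def predict_class(temporal_probs, spatial_probs):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_streams(temporal_probs, spatial_probs)
    return max(range(len(fused)), key=fused.__getitem__)

# The streams disagree; the fused score decides in favor of class 1.
label = predict_class([0.2, 0.8], [0.6, 0.4])
```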
Further, in step 5, the network is trained by iteratively updating the model parameters to optimize the loss of the video-level prediction. The whole network models the series of snippets according to formula 2, where the video is divided into K segments $\{S_1, S_2, \ldots, S_K\}$ at equal time intervals; experiments show that accuracy is highest when K = 3;
$Y(S_1, S_2, \ldots, S_K) = H(G(F(T_1; W), \ldots, F(T_K; W)))$ (2)
where $Y$ denotes the video-level prediction result;
the loss function is the cross-entropy loss, whose form is given by formula 3, where $C$ is the total number of behavior categories and $y_i$ is the ground-truth label for category $i$;
Figure BDA0002357165270000041
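The equation image for formula 3 is not reproduced in this text. As a hedged illustration only, one common way to write a cross-entropy loss over Softmax-normalized consensus scores, using the symbols defined above ($C$, $y_i$, $G_i$), is the following; the exact expression in the original figure may differ.

```latex
L(y, G) = -\sum_{i=1}^{C} y_i \left( G_i - \log \sum_{j=1}^{C} \exp G_j \right)
```

With a one-hot label $y$, this reduces to the negative log of the Softmax probability assigned to the true category.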
Still further, the temporal convolutional neural network with the compression reward and punishment mechanism takes 20 stacked optical flow images as input, obtained by preprocessing the dataset; the spatial convolutional neural network with the compression reward and punishment mechanism takes standard three-channel RGB video frames extracted from the dataset, whose original size depends on the video frames of the dataset and which are 224 × 224 × 3 at the input. Features are extracted by the residual network; the compression reward and punishment mechanism derives a weight for each feature channel along the channel dimension, and the original features are weighted channel by channel through multiplication, completing the recalibration of the original features on the channels.
The compression reward and punishment mechanism acts as a channel attention mechanism: the extracted features are weighted along the channel dimension, so that effective features are given large weights and ineffective or weakly effective features are given small weights, which improves the classification accuracy of the whole network.
Furthermore, in the residual network containing the compression reward and punishment mechanism, the input image passes through the residual network to produce a feature map group of size c × h × w. The compression operation $F_{sq}$ is first applied to the feature map group: global average pooling converts the c × h × w feature map group into a c × 1 × 1 output, so that the compression operation condenses the global information of each channel into a channel descriptor. The output dimension after the compression operation matches the number of input feature channels;
the compression operation is given by formula 4, where $z_c$ denotes the result of the compression operation, $F_{sq}$ denotes the compression operation, and $u_c$ denotes the c-th feature map, of size h × w, of the feature map group $U$. The reward and punishment operation must judge the importance of each channel; it is implemented by a gate mechanism with a sigmoid activation function. Two fully connected layers are placed before the gate to enhance the generalization and nonlinear expression capability of the model: the first fully connected layer reduces the feature dimension to c/r, where r is a scaling parameter, and after ReLU activation a second fully connected layer raises it back to the original dimension. Compared with using a single fully connected layer directly, the two-layer fully connected network has more nonlinearity and reduces the number of parameters and the amount of computation;
Figure BDA0002357165270000051
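The equation image for formula 4 is not reproduced in this text. Based on the description above (global average pooling of each h × w feature map $u_c$ into a single value $z_c$), the compression operation can be written as follows; this is a reconstruction from the surrounding definitions, not a copy of the original figure.

```latex
z_c = F_{sq}(u_c) = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} u_c(i, j)
```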
The reward and punishment operation $F_{ex}$ uses the global information obtained by the compression operation to generate a weight for each feature channel, which represents the importance of that channel. It is given by formula 5, where $S$ denotes the result of the reward and punishment operation, i.e. the channel weights, with dimension c × 1 × 1, and $z$ denotes the result of the compression operation. $W_1 z$ is the first fully connected operation, after which the dimension is c/r × 1 × 1; after the second fully connected layer $W_2$ the output dimension is restored to c × 1 × 1. Finally, normalized weights between 0 and 1 are obtained through the gate mechanism with the Sigmoid activation function ($\sigma$ in formula 5; $\delta$ is the ReLU activation). Once $S$ is obtained, the Scale operation applies the weights channel by channel to the preceding feature map group $U$, completing the recalibration of the original features along the channel dimension, and the result is finally connected to form the complete residual network;
$S = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \delta(W_1 z))$ (5)
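The compression, reward-punishment, and Scale operations can be traced on a toy feature map group with plain Python lists. This is a sketch under stated assumptions: the helper names are hypothetical, the weights `w1` and `w2` are hand-picked rather than learned, and bias terms are omitted because formula 5 shows none.

```python
import math

def squeeze(u):
    """Compression: global average pooling, c maps of size h x w -> c descriptors."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0])) for fmap in u]

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def excite(z, w1, w2):
    """Reward-punishment: s = sigmoid(W2 * relu(W1 * z)).

    w1 has shape (c/r) x c and w2 has shape c x (c/r).
    """
    hidden = [max(0.0, v) for v in matvec(w1, z)]                  # ReLU
    return [1.0 / (1.0 + math.exp(-v)) for v in matvec(w2, hidden)]  # sigmoid gate

def scale(u, s):
    """Scale: multiply every value of channel c by its weight s[c]."""
    return [[[s_c * v for v in row] for row in fmap] for fmap, s_c in zip(u, s)]

# c = 2 channels of size 2 x 2, reduction r = 2, so the hidden size is 1.
u = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 4.0], [6.0, 8.0]]]
z = squeeze(u)                          # [1.0, 5.0]
s = excite(z, [[0.1, 0.2]], [[1.0], [-1.0]])
recalibrated = scale(u, s)
```

With these example weights the gate raises the weight of channel 0 above 0.5 and pushes channel 1 below 0.5, which is the "reward" and "punishment" effect the text describes.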
The compression and reward-punishment operations form a lightweight attention mechanism: compared with other methods, the compression reward and punishment mechanism achieves a better result while adding only a small amount of computation and parameters.
Still further, the residual network with the compression reward and punishment mechanism is set to 50 layers, and its structural parameters are shown in the following table:
Structure of SE-ResNet50
Figure BDA0002357165270000061
The residual network containing the compression reward and punishment mechanism is set to 50 layers. Such a deep network can extract deeper features, and because the base network is a residual network, vanishing gradients are avoided even at this depth and training converges more easily.
In most video behavior recognition tasks, the spatiotemporal information extracted by the network is treated equally. To ignore irrelevant information and focus on key information, a convolutional neural network containing a compression reward and punishment mechanism is designed for video behavior recognition. The network is built on the temporal segment network. First, the video is divided into a plurality of equal-length segments, stacked optical flow images and RGB video frames are randomly extracted from each segment and input respectively into the temporal and spatial two-stream convolutional neural networks with the compression reward and punishment mechanism, the features extracted by the networks are weighted through the compression and reward-punishment operations, and preliminary predictions of the behavior are made on the temporal and spatial channels according to the weighted temporal and spatial features. Then the temporal and spatial preliminary predictions of the segments are fused respectively into video-level prediction results. Finally, the video-level temporal and spatial prediction results are fused into the final video behavior recognition result. Experiments on the UCF101 and HMDB51 datasets show that the model is more accurate than various other network models without the compression reward and punishment mechanism.
Drawings
FIG. 1 is the temporal segment network structure based on the compression reward and punishment mechanism of the present invention;
FIG. 2 is a schematic diagram of the convolutional network structure with the compression reward and punishment mechanism of the present invention;
FIG. 3 is a diagram of the compression reward and punishment mechanism of the present invention;
FIG. 4 is the residual network with the compression reward and punishment mechanism of the present invention;
FIG. 5 is the accuracy curve of the present invention;
FIG. 6 is the loss function curve of the present invention;
FIG. 7 is a comparison of how the accuracy changes with the number of iterations.
Detailed Description
Experiments were carried out on the mainstream video recognition datasets UCF101 and HMDB51. The video behavior identification method based on the compression reward and punishment mechanism comprises the following steps:
step 1, dividing the video to be identified into a plurality of equal-length segments, specifically: dividing the video into K segments $\{S_1, S_2, \ldots, S_K\}$ at equal time intervals, with K = 3; and randomly extracting stacked optical flow images and RGB video frames from each segment;
step 2, inputting the stacked optical flow images and the RGB video frames respectively into the temporal and spatial two-stream convolutional neural networks containing the compression reward and punishment mechanism, weighting the features extracted by the networks through the compression and reward-punishment operations, and making preliminary predictions of the video behavior on the temporal and spatial channels according to the weighted temporal and spatial features, the temporal and spatial two-stream convolutional neural networks being set to 50 layers, specifically:
the stacked optical flow images are input into the temporal convolutional neural network with the compression reward and punishment mechanism, and the RGB video frames are input into the spatial convolutional neural network with the compression reward and punishment mechanism;
$T_1, T_2, \ldots, T_K$ denote the video snippets; each snippet $T_k$ is obtained by random sampling from its corresponding segment $S_k$. The function $F(T_k; W)$ denotes the convolutional network with parameters $W$ applied to the snippet $T_k$; its output is a feature vector whose dimension equals the number of categories, i.e. the class scores of that snippet. FIG. 1 shows the temporal segment network structure based on the compression reward and punishment mechanism of the present invention.
The temporal and spatial two-stream convolutional neural networks containing the compression reward and punishment mechanism adopt a residual network containing the compression reward and punishment mechanism. FIG. 2 shows the temporal and spatial convolutional network structure with the compression reward and punishment mechanism of the present invention. The temporal convolutional neural network takes 20 stacked optical flow images as input, obtained by preprocessing the dataset; the spatial convolutional neural network takes standard three-channel RGB video frames extracted from the dataset, whose original size depends on the video frames of the dataset and which are 224 × 224 × 3 at the input. Features are extracted by the residual network; the compression reward and punishment mechanism derives a weight for each feature channel along the channel dimension, and the original features are weighted channel by channel through multiplication, completing the recalibration of the original features on the channels. The compression reward and punishment mechanism is shown in FIG. 3; its core idea is that the network learns the feature weights from the loss, so that effective features receive larger weights and ineffective or weakly effective features receive smaller weights.
FIG. 4 shows the residual network with the compression reward and punishment mechanism. The input image is 224 × 224 × 3, and a 56 × 56 × 256 feature map group is obtained after it passes through the residual network. The compression operation $F_{sq}$ is first applied to the feature map group: global average pooling converts the 56 × 56 × 256 feature map group into a 1 × 1 × 256 output, so that the compression operation condenses the global information of each channel into a channel descriptor. The output dimension after the compression operation matches the number of input feature channels;
the compression operation is given by formula 4, where $z_c$ denotes the result of the compression operation, $F_{sq}$ denotes the compression operation, and $u_c$ denotes the c-th of the 256 feature maps of the feature map group $U$, each of size h × w. The reward and punishment operation must judge the importance of each channel; it is implemented by a gate mechanism with a sigmoid activation function. Two fully connected layers are placed before the gate to enhance the generalization and nonlinear expression capability of the model: the first fully connected layer reduces the feature dimension to 256/16, where 16 is the scaling parameter, and after ReLU activation a second fully connected layer raises it back to the original dimension. Compared with using a single fully connected layer directly, the two-layer fully connected network has more nonlinearity and reduces the number of parameters and the amount of computation;
Figure BDA0002357165270000091
The reward and punishment operation $F_{ex}$ uses the global information obtained by the compression operation to generate a weight for each feature channel, which represents the importance of that channel. It is given by formula 5, where $S$ denotes the result of the reward and punishment operation, i.e. the channel weights, with dimension 1 × 1 × 256, and $z$ denotes the result of the compression operation. $W_1 z$ is the first fully connected operation, after which the dimension is 1 × 1 × 256/16; after the second fully connected layer $W_2$ the output dimension is restored to 1 × 1 × 256. Finally, normalized weights between 0 and 1 are obtained through the gate mechanism with the Sigmoid activation function. Once $S$ is obtained, the Scale operation applies the weights channel by channel to the preceding feature map group $U$, completing the recalibration of the original features along the channel dimension, and the result is finally connected to form the complete residual network;
$S = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \delta(W_1 z))$ (5)
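The claim that two fully connected layers with a reduction need fewer parameters than one full-width layer can be checked with the sizes used in this embodiment (256 channels, scaling parameter 16). The sketch below counts weights only and ignores biases, which is an assumption; the function names are illustrative.

```python
def bottleneck_fc_params(c, r):
    """Weights in two fully connected layers c -> c/r -> c (biases ignored)."""
    hidden = c // r
    return c * hidden + hidden * c

def single_fc_params(c):
    """Weights in one fully connected layer c -> c (biases ignored)."""
    return c * c

c, r = 256, 16
print(bottleneck_fc_params(c, r), single_fc_params(c))  # 8192 65536
```

For these sizes the two-layer form uses one eighth of the weights of a single 256-to-256 layer, while adding the ReLU nonlinearity between the layers.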
step 3, fusing the temporal and spatial preliminary prediction results of the segments respectively to obtain video-level temporal and spatial prediction results, using average fusion. The class scores of the snippets are combined to obtain their consensus on the class prediction. The segment consensus function $G$ is given by formula 1: an aggregation function $g$ infers the score $G_i$ of a category from the scores of that same category over all snippets, and $g$ uses uniform averaging. The prediction function $H$ predicts the probability that the whole video belongs to each behavior category; the Softmax function is chosen as the prediction function.
$G_i = g(F_i(T_1), \ldots, F_i(T_K))$ (1)
step 4, fusing the predictions of the temporal and spatial convolutional networks for the whole video to obtain the final prediction result, again using average fusion;
step 5, training the network by iteratively updating the model parameters to optimize the loss of the video-level prediction;
the loss function is the cross-entropy loss, whose form is given by formula 3, where $C$ is the total number of behavior categories and $y_i$ is the ground-truth label for category $i$;
Figure BDA0002357165270000101
Training and testing were carried out on a Windows system using the PyTorch deep learning framework. The batch size is set to 16, and each round of training requires 600 iterations. The input of the temporal convolutional network with the compression reward and punishment mechanism is optical flow images, and its initial learning rate is set to 0.01; the input of the spatial convolutional network with the compression reward and punishment mechanism is RGB video frames, and its initial learning rate is set to 0.0005. During optimization the learning rate is adapted according to the learning result, and the momentum is set to 0.9. Both the temporal and spatial convolutional networks use the cross-entropy loss function as the optimization objective, and the optimizer is stochastic gradient descent.
FIG. 5 shows the accuracy as a function of the number of iterations, and FIG. 6 shows the loss function as a function of the number of iterations. In FIG. 5 the abscissa is the number of iterations and the ordinate is the Top-1 accuracy; the network reaches its maximum recognition rate of 94.6% at 35000 iterations, after which the curve stabilizes, and no overfitting occurs during the whole training process. FIG. 7 compares how the recognition rates of the present network and the ResNet network change with the number of iterations. The abscissa is the number of iterations and the ordinate is the Top-1 accuracy. The figure shows that the present network follows a trend similar to that of the ResNet network, reaches its maximum recognition rate at 35000 iterations and then stabilizes, and is 1.1% more accurate than the ResNet network.
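The training figures quoted above can be related with a little arithmetic. This sketch assumes that every iteration processes exactly one batch of 16 samples and that a "round" is one pass of 600 iterations; neither assumption is stated explicitly in the text.

```python
batch_size = 16          # from the training setup
iters_per_round = 600    # iterations per training round
peak_iteration = 35000   # iteration at which the best accuracy is reported

# Samples processed in one round, and rounds needed to reach the peak.
samples_per_round = batch_size * iters_per_round    # 9600
rounds_to_peak = peak_iteration / iters_per_round   # about 58.3
```

Under these assumptions the reported maximum is reached after roughly 58 training rounds.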
The parameters of the residual network containing the compression reward and punishment mechanism are as follows: the network is set to 50 layers, and its structural parameters are shown in the following table:
The structure of SE-ResNet50
[Table image BDA0002357165270000111, not reproduced in the text]

Claims (10)

1. A video behavior recognition method based on a compression reward and punishment mechanism, characterized in that the method comprises the following steps:
step 1, dividing a video to be identified into a plurality of equal-length segments, and randomly extracting stacked optical flow images and RGB video frames from each segment;
step 2, inputting the stacked optical flow images and the RGB video frames into a temporal convolutional neural network and a spatial convolutional neural network, respectively, each containing the compression reward and punishment mechanism; weighting the features extracted by the networks through the compression and reward-and-punishment operations; and making preliminary predictions of the video behavior on the temporal and spatial channels according to the weighted temporal and spatial features;
step 3, fusing the temporal and spatial preliminary prediction results of the segments, respectively, to obtain video-level temporal and spatial prediction results;
step 4, fusing the video-level temporal and spatial prediction results to obtain the final video behavior recognition result;
step 5, training the network by iteratively updating the model parameters to optimize the loss value of the video-level prediction.
2. The video behavior recognition method based on the compression reward punishment mechanism of claim 1, wherein: in the step 1, the video to be identified is divided into a plurality of equal-length segments, which specifically includes:
dividing the video to be identified into K segments {S_1, S_2, ..., S_K} at equal time intervals, where K is set to 3.
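The equal-interval division and per-segment random sampling can be sketched as follows. This is a minimal illustration that works on frame indices only; the function names are hypothetical, and dropping any trailing frames that do not fill a whole segment is an assumption, since the claim does not say how a remainder is handled.

```python
import random


def split_into_segments(num_frames, k=3):
    """Return K equal-length (start, end) frame-index ranges, end exclusive.

    Assumption: frames beyond k * (num_frames // k) are ignored.
    """
    seg_len = num_frames // k
    return [(i * seg_len, (i + 1) * seg_len) for i in range(k)]


def sample_one_index_per_segment(num_frames, k=3, rng=random):
    """Pick one random frame index inside each of the K segments."""
    return [rng.randrange(start, end)
            for start, end in split_into_segments(num_frames, k)]


# Usage: a 90-frame video split into K = 3 segments.
print(split_into_segments(90))  # [(0, 30), (30, 60), (60, 90)]
print(sample_one_index_per_segment(90, rng=random.Random(0)))
```

In the method itself, each sampled position would yield a stack of optical flow images for the temporal stream and an RGB frame for the spatial stream; the sketch stops at choosing the position.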
3. The video behavior recognition method based on the compression reward and punishment mechanism of claim 1, wherein: in step 2, inputting the stacked optical flow images and the RGB video frames into the temporal and spatial convolutional neural networks containing the compression reward and punishment mechanism, weighting the features extracted by the networks through the compression and reward-and-punishment operations, and making preliminary predictions of the video behavior on the temporal and spatial channels according to the weighted temporal and spatial features specifically comprises: inputting the stacked optical flow images into the temporal convolutional neural network with the compression reward and punishment mechanism, and inputting the RGB video frames into the spatial convolutional neural network with the compression reward and punishment mechanism;
T_1, T_2, ..., T_K denote the sampled short segments, each short segment T_k being obtained by random sampling from the corresponding segment S_k; the function F(T_k; W) denotes the feature extraction performed on the short segment T_k by the convolutional network with parameters W, and its output is a feature vector whose dimension equals the number of categories, which is equivalent to the category scores of that short segment.
4. The video behavior recognition method based on the compression reward punishment mechanism of claim 1, wherein: in the step 3, the preliminary prediction results of time and space of each segment are respectively fused to obtain a video-level time and space prediction result, and the fusion mode adopts average fusion, specifically:
combining the category scores of the short segments to obtain the consensus of the short segments on the category prediction, wherein the segment consensus function G is given by formula 1: an aggregation function g infers the score G_i of a given category from the scores of that same category over all segments, and the aggregation function g uses uniform averaging;
G_i = g(F_i(T_1), ..., F_i(T_K)) (1).
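With uniform averaging as the aggregation function g, formula 1 amounts to a per-category mean over the K short segments. A minimal sketch, where the function name `segment_consensus` and the nested-list input layout (one list of category scores per segment) are assumptions made for illustration:

```python
def segment_consensus(segment_scores):
    """Average the score of each category over all segments (formula 1, g = mean).

    segment_scores: list of K lists, each holding C category scores.
    Returns a list of C consensus scores.
    """
    num_segments = len(segment_scores)
    num_classes = len(segment_scores[0])
    return [
        sum(scores[i] for scores in segment_scores) / num_segments
        for i in range(num_classes)
    ]


# Usage: K = 3 segments, C = 2 categories.
print(segment_consensus([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]))  # [2.0, 2.0]
```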
5. The video behavior recognition method based on the compression reward and punishment mechanism of claim 1, wherein: in step 2, the temporal and spatial two-stream convolutional neural networks containing the compression reward and punishment mechanism each adopt a residual network containing the compression reward and punishment mechanism.
6. The video behavior recognition method based on the compression reward and punishment mechanism of claim 1, wherein: in step 4, the video-level temporal and spatial prediction results are fused by average fusion to obtain the final video behavior recognition result.
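Average fusion of the two streams can be sketched as an element-wise mean followed by selection of the highest-scoring category. The function name `fuse_streams` is hypothetical, and treating the inputs as per-category probabilities with the final label chosen by arg-max is an assumption; the claim only states that average fusion is used.

```python
def fuse_streams(temporal_pred, spatial_pred):
    """Average the video-level temporal and spatial predictions per category.

    Returns the fused scores and the index of the highest-scoring category.
    """
    fused = [(t + s) / 2.0 for t, s in zip(temporal_pred, spatial_pred)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label


# Usage: three categories; both streams contribute equally.
fused, label = fuse_streams([0.2, 0.7, 0.1], [0.5, 0.3, 0.2])
print(label)  # 1
```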
7. The video behavior recognition method based on the compression reward and punishment mechanism of claim 1, wherein: in step 5, the network is trained by iteratively updating the model parameters to optimize the loss value of the video-level prediction;
the whole network models a series of segments according to formula 2: a video is divided into K segments {S_1, S_2, ..., S_K} at equal time intervals, and tests show that the accuracy is highest when K is 3; the segment consensus function G combines the category scores of the short segments to obtain their consensus on the category prediction; the prediction function H predicts the probability that the whole video belongs to each behavior category, and the Softmax function is selected as the prediction function; Y denotes the video-level prediction result;
Y(S_1, S_2, ..., S_K) = H(G(F(T_1; W), ..., F(T_K; W))) (2)
the overall model loss function is the cross-entropy loss, whose form is shown in formula 3, where C is the total number of behavior categories and y_i is the ground-truth label for category i.
[Formula 3: equation image FDA0002357165260000031, not reproduced in the text]
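Formula 2 and the video-level loss can be sketched end to end in plain Python. The per-segment scores F(T_k; W) are taken as given inputs rather than computed by a network, the function names are hypothetical, and the loss is written as the negative log-probability of the true category, which is the usual cross-entropy for a one-hot label and is assumed here because the formula image is not available.

```python
import math


def softmax(scores):
    """Prediction function H: turn consensus scores into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def video_level_prediction(segment_scores):
    """Y = H(G(F(T_1; W), ..., F(T_K; W))) with G = uniform average, H = Softmax."""
    k = len(segment_scores)
    consensus = [sum(column) / k for column in zip(*segment_scores)]
    return softmax(consensus)


def cross_entropy(probs, true_class):
    """Cross-entropy against a one-hot label: -log p(true_class)."""
    return -math.log(probs[true_class])


# Usage: K = 3 segments, C = 2 categories, true category 0.
probs = video_level_prediction([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
loss = cross_entropy(probs, true_class=0)
```

Because the loss is computed on the consensus of all segments, its gradient reaches the shared parameters W through every sampled segment, which is what makes the optimization video-level rather than segment-level.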
8. The video behavior recognition method based on the compression reward and punishment mechanism of claim 3, wherein: 20 stacked optical flow images, obtained by preprocessing the data set, are input to the temporal convolutional neural network with the compression reward and punishment mechanism; the RGB video frames input to the spatial convolutional neural network with the compression reward and punishment mechanism are standard three-channel video frames extracted from the data set, whose size depends on the video frames of the original data set and is 224 × 224 × 3; features are extracted by the residual network, the compression reward and punishment mechanism derives a weight for each feature channel in the channel dimension from the extracted features, and the original features are weighted channel by channel by multiplication, completing the recalibration of the original features in the channel dimension.
9. The video behavior recognition method based on the compression reward and punishment mechanism of claim 5, wherein: in the residual network containing the compression reward and punishment mechanism, an input image passes through the residual network to yield a feature map group of size c × h × w; the compression operation F_sq is first applied to the feature map group, using global average pooling to convert the c × h × w feature map group into a c × 1 × 1 output; the compression operation compresses the global information of each channel into a channel descriptor, and its output dimension matches the number of input feature channels;
the compression operation is given by formula 4, where z_c denotes the result of the compression operation, F_sq denotes the compression operation, and u_c denotes the c-th feature map, of size h × w, in the feature map group U; the reward and punishment operation must judge the importance of each channel and is implemented with a gate mechanism using a Sigmoid activation function; two fully connected layers are added before the gate mechanism to enhance the generalization ability and nonlinear expressive power of the model: the first fully connected layer reduces the feature dimension to c/r, where r is a scaling parameter, and after activation by a ReLU activation function a second fully connected layer raises the feature dimension back to the original dimension; compared with directly using a single fully connected layer, the two-layer fully connected network has more nonlinearity and reduces the number of parameters and the amount of computation;
z_c = F_sq(u_c) = (1 / (h × w)) Σ_{i=1}^{h} Σ_{j=1}^{w} u_c(i, j) (4)
the reward and punishment operation F_ex uses the global information obtained by the compression operation to generate a weight for each feature channel, the weight representing the importance of that feature channel; the reward and punishment operation is given by formula 5, where S denotes the result of the reward and punishment operation, i.e. the channel weights, with dimension c × 1 × 1, and z denotes the result of the compression operation; W_1 z denotes the first fully connected operation, with dimension c/r × 1 × 1; W_2 then denotes the second fully connected layer, which restores the output dimension to c × 1 × 1; δ denotes the ReLU activation and σ the Sigmoid activation; finally, the gate mechanism with the Sigmoid activation function yields normalized weights between 0 and 1; after S is obtained, the Scale operation applies the weights channel by channel to the preceding feature group U, completing the recalibration of the original features in the channel dimension, and the result is finally connected into the complete residual network;
S = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z)) (5).
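The compression, reward-and-punishment and Scale operations described in this claim can be traced on a tiny feature map group in plain Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the fully connected layers are written without bias terms, the weight matrices are toy values rather than trained parameters, and the final residual connection is omitted. The SE-ResNet50 naming in this document suggests that these operations correspond to a squeeze-and-excitation block.

```python
import math


def squeeze(feature_maps):
    """Compression: global average pooling, c x h x w -> c channel descriptors."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def excitation(z, w1, w2):
    """Reward and punishment: S = sigmoid(W2 * relu(W1 * z)), biases omitted."""
    hidden = [max(0.0, h) for h in matvec(w1, z)]  # c/r values after ReLU
    return [1.0 / (1.0 + math.exp(-x)) for x in matvec(w2, hidden)]


def scale(feature_maps, s):
    """Scale: multiply every value of channel c by its weight s[c]."""
    return [
        [[s_c * value for value in row] for row in fmap]
        for fmap, s_c in zip(feature_maps, s)
    ]


def se_recalibrate(feature_maps, w1, w2):
    """Run compression, reward and punishment, and Scale in sequence."""
    z = squeeze(feature_maps)
    s = excitation(z, w1, w2)
    return scale(feature_maps, s), s


# Usage: c = 2 channels of size 2 x 2, reduction r = 2, so W1 is 1 x 2 and W2 is 2 x 1.
U = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
W1 = [[1.0, 1.0]]
W2 = [[1.0], [-1.0]]
recalibrated, weights = se_recalibrate(U, W1, W2)
```

In this toy setting the first channel receives a weight above 0.5 and the second a weight below 0.5, which illustrates how one channel is emphasized while another is suppressed.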
10. The video behavior recognition method based on the compression reward and punishment mechanism of claim 8, wherein: the parameters of the residual network containing the compression reward and punishment mechanism are as follows: the network is set to 50 layers, and its structural parameters are shown in the following table:
The structure of SE-ResNet50
[Table images FDA0002357165260000042 and FDA0002357165260000051, not reproduced in the text]
CN202010011032.7A 2020-01-06 2020-01-06 Video behavior identification method based on compression reward and punishment mechanism Pending CN111178319A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010011032.7A CN111178319A (en) 2020-01-06 2020-01-06 Video behavior identification method based on compression reward and punishment mechanism

Publications (1)

Publication Number Publication Date
CN111178319A true CN111178319A (en) 2020-05-19

Family

ID=70657906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010011032.7A Pending CN111178319A (en) 2020-01-06 2020-01-06 Video behavior identification method based on compression reward and punishment mechanism

Country Status (1)

Country Link
CN (1) CN111178319A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200519
