Disclosure of Invention
The invention aims to solve the following problems: in video object segmentation, when the target object of the current frame is determined only from the segmentation result of the previous frame, the accurate position of the target object cannot be obtained, and the target object may even drift because of excessive dependence on the segmentation result of the previous frame; existing video object segmentation methods based on motion information mostly rely on the optical flow between the current frame and the previous frame to segment the object, so the amount of computation is large and the segmentation result is limited to specific motion patterns. Therefore, a new video object segmentation method based on motion information needs to be proposed to improve the segmentation effect.
In order to solve the above problems, the present invention provides a video object segmentation method based on motion attention, which fuses motion timing information in a video sequence and implements video object segmentation based on an attention mechanism, comprising the following steps:
1) Constructing a segmentation backbone network: the current frame I_t and the first frame I_0 are respectively input into the backbone network to obtain the corresponding feature maps F_t and F_0;
2) Constructing a motion attention network: the feature map F_t of the current frame, the feature map F_0 of the first frame and the hidden state H_t-1 of the memory module of the previous frame are taken as the input of the motion attention network, and the output F_out of the motion attention network is the segmentation result of the current frame;
3) A loss function is constructed, which consists of two parts. The first part is a pixel level loss function; the second part is a structural similarity loss function.
Further, constructing the segmentation backbone network in step 1) to obtain the feature maps F_t and F_0 specifically comprises the following steps:
1.1) Modify the Resnet-50 network and fuse dilated (hole) convolution: first, set the dilation factor dilated=2 of conv_1 in Resnet-50; secondly, delete the pooling layer in Resnet-50; then, set the stride stride=1 of the two layers conv_3 and conv_4 in Resnet-50; finally, take the modified Resnet-50 as the backbone network, where the feature map output by the backbone network is 1/8 of the original image size;
1.2) Input the current frame I_t into the backbone network to obtain the feature map F_t of the current frame;
1.3) Input the first frame I_0 into the backbone network to obtain the feature map F_0 of the first frame.
Further, the motion attention network is constructed in the step 2), and a segmentation result of the current frame is obtained.
The motion attention network consists of a channel attention module, a motion attention module and a memory module, which are constructed as follows:
2.1) Construct the channel attention module, with F_t and F_0 as its input. F_0 provides appearance information of the target object such as its color and pose. First, F_t and F_0 are combined by matrix multiplication and a softmax function to obtain the channel weight attention map X_c of the target object; X_c describes the correlation between the channels of the current frame and those of the first frame, and the higher the correlation, the higher the response value and the more similar the features. Then, X_c is multiplied with F_t to strengthen the features, and the result is added to F_t as a residual to obtain the channel feature map;
2.2) Construct the motion attention module, with F_t and H_t-1 as its input; H_t-1 provides the position information of the target object of the current frame, predicted from the segmentation result of the previous frame and the timing information. First, the feature map F_t is passed through two convolution layers with 1×1 convolution kernels to obtain two feature maps F_a and F_b; then F_a and H_t-1 are combined by matrix multiplication and a softmax function to obtain the position weight attention map X_s of the target object; finally, X_s is multiplied with F_b to strengthen the features, and the result is added to F_t as a residual to obtain the position feature map;
2.3) Add the channel feature map and the position feature map to obtain the final segmentation result F_out of the current frame.
2.4) Construct the memory module convLSTM, with the segmentation result F_out of the current frame, the memory cell C_t-1 output by the memory module of the previous frame and the hidden state H_t-1 of the memory module of the previous frame as its input; the output of the module is the memory cell C_t and the hidden state H_t;
The convLSTM consists of an input gate, a forget gate and an output gate.
Further, the step 2.4) constructs a memory module convLSTM, specifically:
2.4.1) First, the forget gate discards part of the information in the memory cell C_t-1 of the memory module of the previous frame; then, the input gate saves the useful information in the segmentation result F_out of the current frame into the memory cell C_t-1 of the previous frame; finally, the updated memory cell C_t of the current frame is output;
2.4.2) First, the output gate filters the segmentation result F_out of the current frame and the hidden state H_t-1 of the memory module of the previous frame with a Sigmoid function to determine the information to be output; then, the tanh activation function is applied to modify the memory cell C_t of the current frame; finally, the information to be output is multiplied element-wise with the modified memory cell C_t of the current frame to obtain and output the hidden state H_t of the current frame.
Advantageous effects
The invention provides a video object segmentation method based on motion attention. First, the feature maps of the first frame and the current frame are obtained; then the feature map F_t of the current frame, the appearance features F_0 of the target object provided by the first frame and the position information H_t-1 of the target object predicted by the memory module of the motion attention network of the previous frame are input into the motion attention network of the current frame to obtain the segmentation result of the current frame. The invention can handle the diversified motion patterns that other segmentation methods cannot handle. The method is suitable for video object segmentation and has good robustness and an accurate segmentation effect.
The invention is characterized in that: first, the invention does not rely only on the segmentation result of the previous frame, but segments the target object more accurately by means of the appearance information of the target object in the first frame and the timing information of the target object in the video sequence; second, the use of the motion attention network greatly reduces useless features and improves the robustness of the model.
Detailed Description
The invention provides a video object segmentation method based on motion attention: first, the feature map of the first frame and the feature map of the current frame are obtained; then the feature map of the first frame, the feature map of the current frame and the target object position information predicted by the memory module in the motion attention network of the previous frame are input into the motion attention network to obtain the segmentation result of the current frame. The method is suitable for video object segmentation and has good robustness and an accurate segmentation effect.
The invention will be described in more detail with reference to specific examples and figures.
The invention comprises the following steps:
1) Acquiring YouTube and Davis data sets which are respectively used as a training set and a testing set of a model;
2) Preprocessing the training data: crop each training sample (video frame) and the first-frame mask of its video sequence, adjust the images to 224×224 resolution, and perform data enhancement by rotation and other means;
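As an illustration of this preprocessing step, the following Python sketch resizes a frame and its first-frame mask to 224×224 and applies the same random rotation to both; the torchvision helpers, the PIL-image inputs and the rotation range are assumptions of the sketch, not part of the invention.

import random
import torchvision.transforms.functional as TF

def preprocess(frame, mask):
    # frame and mask are assumed to be PIL images of one training sample
    frame = TF.resize(frame, [224, 224])
    mask = TF.resize(mask, [224, 224], interpolation=TF.InterpolationMode.NEAREST)  # keep mask labels discrete
    angle = random.uniform(-15, 15)  # illustrative range; the text only says "rotation and other modes"
    frame = TF.rotate(frame, angle)  # apply the same rotation to frame and mask so they stay aligned
    mask = TF.rotate(mask, angle)
    return TF.to_tensor(frame), TF.to_tensor(mask)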
3) Constructing a segmentation backbone network, and inputting a first frame segmentation mask and a current frame of a video sequence to obtain segmentation feature maps of the first frame segmentation mask and the current frame;
3.1) First, set the dilation factor dilated=2 of conv_1 in Resnet-50; secondly, delete the pooling layer in Resnet-50; then, set the stride stride=1 of the two layers conv_3 and conv_4 in Resnet-50; finally, take the modified Resnet-50 as the backbone network, where the feature map output by the backbone network is 1/8 of the original image size, as shown in FIG. 4;
3.2) The resolution of the first frame segmentation mask is 224×224; input it into the backbone network to obtain the feature map F_0 of the first frame segmentation mask, whose size is 2048×28×28;
3.3) The resolution of the current frame is 224×224; input it into the backbone network to obtain the feature map F_t of the current frame, whose size is 2048×28×28.
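The backbone of steps 3.1)-3.3) can be sketched in PyTorch as follows. torchvision's replace_stride_with_dilation option is used as a stand-in for the conv_3/conv_4 stride and dilation changes described above, and the max-pooling layer is kept so that the sketch reproduces the stated 2048×28×28 output for a 224×224 input; both points are assumptions of the sketch rather than the exact modification of the invention.

import torch
import torch.nn as nn
import torchvision

class Backbone(nn.Module):
    # Approximate modified Resnet-50 with output stride 8 and 2048 output channels.
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(replace_stride_with_dilation=[False, True, True])
        self.features = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4,
        )

    def forward(self, x):
        return self.features(x)  # (B, 2048, H/8, W/8)

feat = Backbone()(torch.randn(1, 3, 224, 224))  # -> torch.Size([1, 2048, 28, 28])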
4) Constructing the motion attention network: input the current frame feature map F_t, the hidden state H_t-1 of the memory module of the previous frame and the feature map F_0 of the first frame to obtain the segmentation result F_out of the current frame. The motion attention network consists of a channel attention module, a motion attention module and a memory module.
4.1) Construct the channel attention module: input the current frame feature map F_t and the first frame feature map F_0 to obtain the channel feature map E_c, specifically as follows:
4.1.1) Call the reshape function in python to adjust the size of F_t and convert it into a feature map F'_t of size n×2048; call the reshape function in python to adjust the size of F_0 and convert it into a feature map F'_0 of size 2048×n, where n represents the total number of pixels of the current frame;
4.1.2) Multiply F'_0 with F'_t by matrix multiplication and apply the softmax function, whose mathematical expression is as follows:

softmax(z_j) = exp(z_j) / Σ_k exp(z_k)
matrix multiplication can achieve utilization and fusion of global information, which functions similarly to fully connected operations. The fully connected operation can consider the data relation of all positions, but destroys the space structure, so that matrix multiplication is used for replacing the fully connected operation, and the space information is reserved as much as possible on the basis of using global information;
4.1.3) Obtain the channel weight attention map X_c of size 2048×2048. The element x_ji in the j-th row and i-th column of the channel weight attention map X_c is expressed mathematically as follows:

x_ji = exp(F'_0j · F'_ti) / Σ_{i=1}^{C} exp(F'_0j · F'_ti)

where F'_ti denotes the i-th column of F'_t, i.e. the i-th channel of the current frame feature map F_t; F'_0j denotes the j-th row of F'_0, i.e. the j-th channel of the first frame feature map F_0; x_ji represents the effect of the i-th channel of the first frame feature map F_0 on the j-th channel of the current frame feature map F_t; and C represents the number of channels of the current frame feature map F_t.
4.1.4) Multiply the channel weight attention map X_c with the feature map F'_t to strengthen the features of the current frame feature map F_t, and add the result to the current frame feature map F_t as a residual to obtain the channel feature map E_c, expressed mathematically as follows:

E_cj = β Σ_{i=1}^{C} (x_ji F'_ti) + F_tj

where β represents the channel attention weight, whose initial value is set to zero and to which the model assigns a larger and more reasonable weight through learning; E_cj denotes the j-th channel of E_c and F_tj the j-th channel of F_t; F'_ti denotes the i-th column of F'_t, i.e. the i-th channel of the current frame feature map F_t; and C represents the number of channels of the current frame feature map F_t.
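A minimal PyTorch sketch of the channel attention computation of steps 4.1.1)-4.1.4) is given below; the batch dimension and the representation of β as a learnable scalar parameter are implementation assumptions, and the code is an illustration rather than the reference implementation of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # channel attention weight beta, initialised to zero

    def forward(self, f_t, f_0):
        # f_t, f_0: (B, C, H, W) feature maps of the current frame and the first frame
        b, c, h, w = f_t.shape
        n = h * w
        f_t_flat = f_t.view(b, c, n)  # current-frame features flattened to (B, C, N)
        f_0_flat = f_0.view(b, c, n)  # first-frame features flattened to (B, C, N)
        # channel weight attention map X_c: (B, C, C), softmax over the channel axis
        x_c = F.softmax(torch.bmm(f_0_flat, f_t_flat.transpose(1, 2)), dim=-1)
        # re-weight the current-frame channels and add F_t as a residual
        return self.beta * torch.bmm(x_c, f_t_flat).view(b, c, h, w) + f_t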
4.2) Construct the motion attention module: input the current frame feature map F_t and the hidden state H_t-1 of the memory module of the previous frame to obtain the position feature map E_s, specifically as follows:
4.2.1) F_t is passed through two convolution layers with 1×1 convolution kernels respectively to obtain two feature maps F_a and F_b, each of size 2048×28×28;
4.2.2) Call the reshape function in python to adjust the size of F_a and convert it into a feature map F'_a of size 2048×n; call the reshape function in python to adjust the size of F_b and convert it into a feature map F'_b of size 2048×n; call the reshape and transpose functions in python to adjust the size of H_t-1 and convert it into a feature map H'_t-1 of size n×2048, where n represents the total number of pixels of the current frame;
4.2.3) Multiply H'_t-1 with F'_a by matrix multiplication and apply the softmax function to obtain the position weight attention map X_s of size n×n, expressed mathematically as follows:

s_ji = exp(h_i · F'_aj) / Σ_{i=1}^{N} exp(h_i · F'_aj)

where N represents the total number of pixels of the current frame; F'_aj denotes the j-th column of F'_a, i.e. the j-th position of F_a; h_i denotes the i-th row of H'_t-1, i.e. the i-th position of H_t-1; s_ji is the element in the j-th row and i-th column of the position weight attention map X_s and represents the effect of the i-th position of the hidden state H_t-1 on the j-th position of the current frame feature map F_t.
4.2.4) Multiply the position weight attention map with the feature map F_b by matrix multiplication to strengthen the features of the current frame feature map F_t, and add the result to F_t as a residual to obtain the fused position feature map E_s, expressed mathematically as follows:

E_sj = α Σ_{i=1}^{N} (s_ji F'_bi) + F_tj

where α represents the position attention weight, whose initial value is set to zero and to which the model assigns a larger and more reasonable weight through learning; E_sj denotes the j-th position of E_s and F_tj the j-th position of F_t; F'_bi denotes the i-th column of F'_b, i.e. the i-th position of the current frame feature map F_t; and N represents the total number of pixels of the current frame.
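The position (motion) attention of steps 4.2.1)-4.2.4) can be sketched in the same style; the assumption that the hidden state H_t-1 has the same channel count and spatial size as F_t follows the description in step 4.2.2).

import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAttention(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=1)  # produces F_a
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)  # produces F_b
        self.alpha = nn.Parameter(torch.zeros(1))  # position attention weight alpha, initialised to zero

    def forward(self, f_t, h_prev):
        # f_t: (B, C, H, W) current-frame features, h_prev: (B, C, H, W) hidden state H_t-1
        b, c, h, w = f_t.shape
        n = h * w
        f_a = self.conv_a(f_t).view(b, c, n)
        f_b = self.conv_b(f_t).view(b, c, n)
        h_flat = h_prev.view(b, c, n).transpose(1, 2)  # (B, N, C)
        # position weight attention map X_s: (B, N, N)
        x_s = F.softmax(torch.bmm(h_flat, f_a), dim=-1)
        # re-weight F_b by the position attention and add F_t as a residual
        return self.alpha * torch.bmm(f_b, x_s.transpose(1, 2)).view(b, c, h, w) + f_t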
4.3) Add the position feature map E_s and the channel feature map E_c to obtain the final segmentation result F_out of the current frame.
4.4) Construct the memory module: input the current frame segmentation result F_out, the hidden state H_t-1 of the memory module of the previous frame and the memory cell C_t-1 of the memory module of the previous frame. The memory module convLSTM of the current frame consists of a forget gate f_t, an input gate i_t and an output gate o_t;
4.4.1) Each value in the output tensor of the forget gate is between 0 and 1, where 0 means complete forgetting and 1 means complete retention, so the forget gate can selectively discard part of the information in the memory cell C_t-1 of the previous frame, expressed mathematically as follows:
f_t = σ(W_xf * F_out + W_hf * H_t-1 + b_f)
where * denotes the convolution operation, σ denotes the sigmoid function, F_out denotes the segmentation result of the current frame, and H_t-1 denotes the hidden state of the memory module of the previous frame; W_xf and W_hf are weight parameters with values between 0 and 1, and the initial value of b_f is set to 0.1, with the model assigning a more reasonable value to b_f through learning;
4.4.2) The input gate determines which information in the current frame segmentation result F_out is to be saved, expressed mathematically as follows:
i_t = σ(W_xi * F_out + W_hi * H_t-1 + b_i)
where * denotes the convolution operation, σ denotes the sigmoid function, F_out denotes the segmentation result of the current frame, and H_t-1 denotes the hidden state of the memory module of the previous frame; W_xi and W_hi are weight parameters with values between 0 and 1, and the initial value of b_i is set to 0.1, with the model assigning a more reasonable value to b_i through learning;
4.4.3) The forget gate discards part of the information in the memory cell C_t-1 of the previous frame, the useful information is saved into the memory cell C_t-1 of the previous frame, and the updated memory cell C_t of the current frame is output, expressed mathematically as follows:

C_t = f_t ⊙ C_t-1 + i_t ⊙ tanh(W_xc * F_out + W_hc * H_t-1 + b_c)

where * denotes the convolution operation, ⊙ denotes the Hadamard product, F_out denotes the segmentation result of the current frame, and H_t-1 denotes the hidden state of the memory module of the previous frame; W_xc and W_hc are weight parameters with values between 0 and 1, and the initial value of b_c is set to 0.1, with the model assigning a more reasonable value to b_c through learning;
4.4.4) The hidden state H_t output by the memory module of the current frame is expressed mathematically as follows:

o_t = σ(W_xo * F_out + W_ho * H_t-1 + b_o)

H_t = o_t ⊙ tanh(C_t)

where * denotes the convolution operation, σ denotes the sigmoid function, tanh is the activation function, ⊙ denotes the Hadamard product, o_t represents the output gate in the memory module of the current frame, F_out denotes the segmentation result of the current frame, and H_t-1 denotes the hidden state of the memory module of the previous frame; W_ho and W_xo are weight parameters with values between 0 and 1, and the initial value of b_o is set to 0.1, with the model assigning a more reasonable value to b_o through learning.
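The memory module of step 4.4) is a standard convLSTM cell. The sketch below folds the separate W_x* and W_h* convolutions into one convolution over the concatenated inputs and leaves the biases at PyTorch's default initialisation instead of 0.1; the kernel size is also an assumption.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # one convolution produces the pre-activations of the input, forget and output gates
        # and of the candidate cell state (4 * channels output maps in total)
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size, padding=kernel_size // 2)

    def forward(self, f_out, h_prev, c_prev):
        # f_out: segmentation result of the current frame, h_prev / c_prev: H_t-1 and C_t-1
        i, f, o, g = torch.chunk(self.gates(torch.cat([f_out, h_prev], dim=1)), 4, dim=1)
        i = torch.sigmoid(i)       # input gate  i_t
        f = torch.sigmoid(f)       # forget gate f_t
        o = torch.sigmoid(o)       # output gate o_t
        g = torch.tanh(g)          # candidate cell state
        c_t = f * c_prev + i * g   # C_t = f_t (Hadamard) C_t-1 + i_t (Hadamard) tanh(...)
        h_t = o * torch.tanh(c_t)  # H_t = o_t (Hadamard) tanh(C_t)
        return h_t, c_t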
6) The loss function adopted by the segmentation model consists of two parts: the first part is a pixel-level loss function, and the second part is a structural similarity loss function. The specific design is as follows:
l = l_cross + l_ssim
6.1) l_cross represents the pixel-level cross entropy loss function, expressed mathematically as follows:

l_cross = -Σ_{(r,c)} [T(r,c) log S(r,c) + (1 - T(r,c)) log(1 - S(r,c))]

where T(r,c) represents the pixel value at the r-th row and c-th column of the target mask, and S(r,c) represents the pixel value at the r-th row and c-th column of the segmentation result;
6.2) l_ssim represents the structural similarity loss function, which compares the differences between the target mask and the segmentation result in terms of brightness, contrast and structure, expressed mathematically as follows:

l_ssim = 1 - [(2 μ_x μ_y + C_1)(2 σ_xy + C_2)] / [(μ_x^2 + μ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2)]

where A_x and A_y respectively represent regions of the same size cut from the segmentation map predicted by the model and from the target mask; x_i represents the pixel value of the i-th pixel in region A_x, y_i represents the pixel value of the i-th pixel in region A_y, and N represents the total number of pixels in the cut region; μ_x = (1/N) Σ_{i=1}^{N} x_i represents the average brightness of A_x and μ_y = (1/N) Σ_{i=1}^{N} y_i represents the average brightness of A_y; σ_x represents the degree of brightness variation in A_x, σ_y represents the degree of brightness variation in A_y, and σ_xy = (1/N) Σ_{i=1}^{N} (x_i - μ_x)(y_i - μ_y) represents the covariance, which is associated with the structure; C_1 and C_2 are constants used to prevent the denominator from being zero, with C_1 set to 6.5025 and C_2 set to 58.5225.
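A simplified PyTorch version of the loss of step 6) is sketched below; it computes the structural similarity over the whole mask rather than over the local regions A_x and A_y, which is a simplification of the sketch, and it assumes the prediction already holds probabilities in [0, 1].

import torch
import torch.nn.functional as F

C1, C2 = 6.5025, 58.5225  # stabilising constants from step 6.2)

def ssim_loss(pred, target):
    mu_x, mu_y = pred.mean(), target.mean()
    var_x, var_y = pred.var(), target.var()
    cov_xy = ((pred - mu_x) * (target - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return 1.0 - ssim

def segmentation_loss(pred, target):
    # l = l_cross + l_ssim
    return F.binary_cross_entropy(pred, target) + ssim_loss(pred, target)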
7) Training the model: select YouTube from step 1) as the training set, set the batch size of each training batch to 4 and the learning rate to 1e-4; after 300,000 iterations on YouTube, change the learning rate to 1e-5 and train the model with the loss function of step 6) for another 100,000 iterations on YouTube; the weight decay rate is set to 0.0005, and the model is trained until it converges.
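The training schedule of step 7) can be written as the following sketch; the optimiser choice (Adam) and the stand-in model are assumptions, while the batch size, learning rates, iteration counts and weight decay follow the text.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=1)  # stand-in for the full segmentation network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0005)

for iteration in range(400_000):  # 300k iterations at lr 1e-4, then 100k more at lr 1e-5
    if iteration == 300_000:
        for group in optimizer.param_groups:
            group["lr"] = 1e-5
    # frames, masks = next(train_iterator)            # batches of 4 samples from the YouTube set
    # loss = segmentation_loss(model(frames), masks)  # loss of step 6)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()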
The invention has wide application in the field of video object segmentation and computer vision, for example: target tracking, image recognition, etc. The present invention will be described in detail below with reference to the accompanying drawings.
(1) Constructing the segmentation backbone network: the current frame I_t and the first frame I_0 are respectively input into the backbone network to obtain the corresponding feature maps F_t and F_0;
(2) Constructing the motion attention network: the current frame feature map F_t and the first frame feature map F_0 are taken as the input of the channel attention module in the motion attention network to obtain the channel feature map; F_t and the hidden state H_t-1 of the memory module of the previous frame are taken as the input of the motion attention module in the motion attention network to obtain the position feature map; the position feature map and the channel feature map are added to obtain the output F_out of the motion attention network, namely the segmentation result of the current frame. The segmentation result F_out of the current frame, the memory cell C_t-1 output by the memory module of the previous frame and the hidden state H_t-1 of the memory module of the previous frame are taken as the input of the memory module in the motion attention network to obtain the memory cell C_t and the hidden state H_t. The memory cell C_t stores the timing information of the target object and is updated according to the segmentation result of the current frame, while H_t provides the position information of the target object of the next frame predicted from the current segmentation result and the timing information. The memory module convLSTM retains not only the spatial information of the target object of the current frame but also the timing information of the target object, so the long-range position dependency of the target object can be obtained. A sketch of how these modules could be composed is given below.
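The sketch below shows one way the module sketches given earlier (ChannelAttention, MotionAttention and ConvLSTMCell) could be composed into the per-frame forward pass just described; it is an illustration under those assumptions, not the exact network of the invention.

import torch.nn as nn

class MotionAttentionNetwork(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        self.channel_attention = ChannelAttention()        # sketch from step 4.1)
        self.motion_attention = MotionAttention(channels)  # sketch from step 4.2)
        self.memory = ConvLSTMCell(channels)               # sketch from step 4.4)

    def forward(self, f_t, f_0, h_prev, c_prev):
        e_c = self.channel_attention(f_t, f_0)    # channel feature map E_c
        e_s = self.motion_attention(f_t, h_prev)  # position feature map E_s
        f_out = e_c + e_s                         # segmentation result F_out of the current frame
        h_t, c_t = self.memory(f_out, h_prev, c_prev)  # updated memory cell C_t and hidden state H_t
        return f_out, h_t, c_t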
The method is implemented with the PyTorch framework and the Python language on a GTX 1080Ti GPU under the Ubuntu 14.04 operating system.
The invention provides a video target segmentation method based on motion attention, which is suitable for segmenting a moving object in a video, has good robustness and accurate segmentation result. Experiments show that the method can effectively divide the moving object.
The foregoing is merely illustrative of the present invention and is not intended to limit the present invention; any modifications and substitutions that are apparent to those skilled in the art within the scope of the present invention shall be included, and the scope of the present invention shall be defined by the appended claims.