Disclosure of Invention
The invention aims to solve the following problems: in video object segmentation, when the target object of the current frame is determined only from the segmentation result of the previous frame, the accurate position of the target object cannot be obtained, and the target object may even drift because of excessive dependence on the segmentation result of the previous frame; existing video object segmentation methods based on motion information mostly rely on the optical flow between the current frame and the previous frame to segment the object, so the amount of computation is large and the segmentation result is limited to specific motion patterns. Therefore, a new video object segmentation method based on motion information needs to be proposed to improve the segmentation effect.
In order to solve the above problems, the present invention provides a video object segmentation method based on motion attention, which fuses motion timing information in a video sequence and implements video object segmentation based on an attention mechanism, comprising the following steps:
1) Constructing a segmentation backbone network: the current frame I_t and the first frame I_0 are respectively input into the backbone network to obtain the corresponding feature maps F_t and F_0;
2) Constructing a motion attention network: the feature map F_t of the current frame, the feature map F_0 of the first frame and the hidden state H_t-1 of the memory module of the previous frame are taken as the input of the motion attention network, and the output F_out of the motion attention network is the segmentation result of the current frame;
3) A loss function is constructed, which consists of two parts. The first part is a pixel level loss function; the second part is a structural similarity loss function.
Further, constructing the segmentation backbone network in step 1) to obtain the feature maps F_t and F_0 specifically comprises the following steps:
1.1) Modify the Resnet-50 network and fuse dilated (hole) convolution: first, set the dilation factor dilated=2 of conv_1 in Resnet-50; secondly, delete the pooling layer in Resnet-50; then, set the stride stride=1 of the two layers conv_3 and conv_4 in Resnet-50; finally, take the modified Resnet-50 as the backbone network, where the feature map output by the backbone network is 1/8 of the original image size;
1.2) Input the current frame I_t into the backbone network to obtain the feature map F_t of the current frame;
1.3) Input the first frame I_0 into the backbone network to obtain the feature map F_0 of the first frame.
Further, the motion attention network is constructed in the step 2), and a segmentation result of the current frame is obtained.
The motion attention network consists of a channel attention module, a motion attention module and a memory module, which are constructed as follows:
2.1) Construct the channel attention module, with F_t and F_0 as its input. F_0 provides appearance information of the target object such as its color and pose. First, F_t and F_0 are combined by matrix multiplication and a softmax function to obtain the channel weight attention map X_c of the target object; X_c describes the correlation between the channels of the current frame and those of the first frame, and the higher the correlation, the higher the response value and the more similar the features. Then, X_c is multiplied with F_t to strengthen the features, and the result is added to F_t as a residual to obtain the channel feature map;
2.2) Construct the motion attention module, with F_t and H_t-1 as its input; H_t-1 provides the position information of the target object of the current frame, predicted from the segmentation result of the previous frame and the timing information. First, the feature map F_t is passed through two convolution layers with 1×1 convolution kernels to obtain two feature maps F_a and F_b; then F_a and H_t-1 are combined by matrix multiplication and a softmax function to obtain the position weight attention map X_s of the target object; finally, X_s is multiplied with F_b to strengthen the features, and the result is added to F_t as a residual to obtain the position feature map;
2.3) Add the channel feature map and the position feature map to obtain the final segmentation result F_out of the current frame.
2.4) Construct the memory module convLSTM, with the segmentation result F_out of the current frame, the memory cell C_t-1 output by the memory module of the previous frame and the hidden state H_t-1 of the memory module of the previous frame as its input; the output of the module is the memory cell C_t and the hidden state H_t;
The convLSTM consists of an input gate, a forget gate and an output gate.
Further, the step 2.4) constructs a memory module convLSTM, specifically:
2.4.1) First, the forget gate discards part of the information in the memory cell C_t-1 of the memory module of the previous frame; then, the input gate saves the useful information in the segmentation result F_out of the current frame into the memory cell C_t-1 of the previous frame; finally, the updated memory cell C_t of the current frame is output;
2.4.2) First, the output gate filters the segmentation result F_out of the current frame and the hidden state H_t-1 of the memory module of the previous frame with a Sigmoid function to determine the information to be output; then, the tanh activation function is applied to modify the memory cell C_t of the current frame; finally, the information to be output is multiplied element-wise with the modified memory cell C_t of the current frame to obtain and output the hidden state H_t of the current frame.
Advantageous effects
The invention provides a video object segmentation method based on motion attention. First, the feature maps of the first frame and the current frame are obtained; then the feature map F_t of the current frame, the appearance features F_0 of the target object provided by the first frame and the position information H_t-1 of the target object predicted by the memory module of the motion attention network of the previous frame are input into the motion attention network of the current frame to obtain the segmentation result of the current frame. The invention can handle the diversified motion patterns that other segmentation methods cannot handle. The method is suitable for video object segmentation and has good robustness and an accurate segmentation effect.
The invention is characterized in that: first, the invention does not rely only on the segmentation result of the previous frame, but segments the target object more accurately by means of the appearance information of the target object in the first frame and the timing information of the target object in the video sequence; second, the use of the motion attention network greatly reduces useless features and improves the robustness of the model.
Detailed Description
The invention provides a video object segmentation method based on motion attention: first, the feature map of the first frame and the feature map of the current frame are obtained; then the feature map of the first frame, the feature map of the current frame and the target object position information predicted by the memory module in the motion attention network of the previous frame are input into the motion attention network to obtain the segmentation result of the current frame. The method is suitable for video object segmentation and has good robustness and an accurate segmentation effect.
The invention will be described in more detail with reference to specific examples and figures.
The invention comprises the following steps:
1) Acquiring YouTube and Davis data sets which are respectively used as a training set and a testing set of a model;
2) Preprocessing the training data: crop each training sample (video frame) and the first-frame mask of its video sequence, adjust the images to 224×224 resolution, and perform data enhancement by rotation and other means;
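As an illustration of this preprocessing step, the following Python sketch resizes a frame and its first-frame mask to 224×224 and applies the same random rotation to both; the torchvision helpers, the PIL-image inputs and the rotation range are assumptions of the sketch, not part of the invention.

import random
import torchvision.transforms.functional as TF

def preprocess(frame, mask):
    # frame and mask are assumed to be PIL images of one training sample
    frame = TF.resize(frame, [224, 224])
    mask = TF.resize(mask, [224, 224], interpolation=TF.InterpolationMode.NEAREST)  # keep mask labels discrete
    angle = random.uniform(-15, 15)  # illustrative range; the text only says "rotation and other modes"
    frame = TF.rotate(frame, angle)  # apply the same rotation to frame and mask so they stay aligned
    mask = TF.rotate(mask, angle)
    return TF.to_tensor(frame), TF.to_tensor(mask)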
3) Constructing a segmentation backbone network, and inputting a first frame segmentation mask and a current frame of a video sequence to obtain segmentation feature maps of the first frame segmentation mask and the current frame;
3.1) First, set the dilation factor dilated=2 of conv_1 in Resnet-50; secondly, delete the pooling layer in Resnet-50; then, set the stride stride=1 of the two layers conv_3 and conv_4 in Resnet-50; finally, take the modified Resnet-50 as the backbone network, where the feature map output by the backbone network is 1/8 of the original image size, as shown in FIG. 4;
3.2) The resolution of the first frame segmentation mask is 224×224; input it into the backbone network to obtain the feature map F_0 of the first frame segmentation mask, whose size is 2048×28×28;
3.3) The resolution of the current frame is 224×224; input it into the backbone network to obtain the feature map F_t of the current frame, whose size is 2048×28×28.
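The backbone of steps 3.1)-3.3) can be sketched in PyTorch as follows. torchvision's replace_stride_with_dilation option is used as a stand-in for the conv_3/conv_4 stride and dilation changes described above, and the max-pooling layer is kept so that the sketch reproduces the stated 2048×28×28 output for a 224×224 input; both points are assumptions of the sketch rather than the exact modification of the invention.

import torch
import torch.nn as nn
import torchvision

class Backbone(nn.Module):
    # Approximate modified Resnet-50 with output stride 8 and 2048 output channels.
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(replace_stride_with_dilation=[False, True, True])
        self.features = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4,
        )

    def forward(self, x):
        return self.features(x)  # (B, 2048, H/8, W/8)

feat = Backbone()(torch.randn(1, 3, 224, 224))  # -> torch.Size([1, 2048, 28, 28])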
4) Constructing the motion attention network: input the current frame feature map F_t, the hidden state H_t-1 of the memory module of the previous frame and the feature map F_0 of the first frame to obtain the segmentation result F_out of the current frame. The motion attention network consists of a channel attention module, a motion attention module and a memory module.
4.1) Construct the channel attention module: input the current frame feature map F_t and the first frame feature map F_0 to obtain the channel feature map E_c, specifically as follows:
4.1.1) Call the reshape function in python to adjust the size of F_t and convert it into a feature map F'_t of size n×2048; call the reshape function in python to adjust the size of F_0 and convert it into a feature map F'_0 of size 2048×n, where n represents the total number of pixels of the current frame;
4.1.2) Multiply F'_0 with F'_t by matrix multiplication and apply the softmax function, whose mathematical expression is as follows:

softmax(z_j) = exp(z_j) / Σ_k exp(z_k)
matrix multiplication can achieve utilization and fusion of global information, which functions similarly to fully connected operations. The fully connected operation can consider the data relation of all positions, but destroys the space structure, so that matrix multiplication is used for replacing the fully connected operation, and the space information is reserved as much as possible on the basis of using global information;
4.1.3) Obtain the channel weight attention map X_c of size 2048×2048. The element x_ji in the j-th row and i-th column of the channel weight attention map X_c is expressed mathematically as follows:

x_ji = exp(F'_0j · F'_ti) / Σ_{i=1}^{C} exp(F'_0j · F'_ti)

where F'_ti denotes the i-th column of F'_t, i.e. the i-th channel of the current frame feature map F_t; F'_0j denotes the j-th row of F'_0, i.e. the j-th channel of the first frame feature map F_0; x_ji represents the effect of the i-th channel of the first frame feature map F_0 on the j-th channel of the current frame feature map F_t; and C represents the number of channels of the current frame feature map F_t.
4.1.4) Multiply the channel weight attention map X_c with the feature map F'_t to strengthen the features of the current frame feature map F_t, and add the result to the current frame feature map F_t as a residual to obtain the channel feature map E_c, expressed mathematically as follows:

E_cj = β Σ_{i=1}^{C} (x_ji F'_ti) + F_tj

where β represents the channel attention weight, whose initial value is set to zero and to which the model assigns a larger and more reasonable weight through learning; E_cj denotes the j-th channel of E_c and F_tj the j-th channel of F_t; F'_ti denotes the i-th column of F'_t, i.e. the i-th channel of the current frame feature map F_t; and C represents the number of channels of the current frame feature map F_t.
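A minimal PyTorch sketch of the channel attention computation of steps 4.1.1)-4.1.4) is given below; the batch dimension and the representation of β as a learnable scalar parameter are implementation assumptions, and the code is an illustration rather than the reference implementation of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # channel attention weight beta, initialised to zero

    def forward(self, f_t, f_0):
        # f_t, f_0: (B, C, H, W) feature maps of the current frame and the first frame
        b, c, h, w = f_t.shape
        n = h * w
        f_t_flat = f_t.view(b, c, n)  # current-frame features flattened to (B, C, N)
        f_0_flat = f_0.view(b, c, n)  # first-frame features flattened to (B, C, N)
        # channel weight attention map X_c: (B, C, C), softmax over the channel axis
        x_c = F.softmax(torch.bmm(f_0_flat, f_t_flat.transpose(1, 2)), dim=-1)
        # re-weight the current-frame channels and add F_t as a residual
        return self.beta * torch.bmm(x_c, f_t_flat).view(b, c, h, w) + f_t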
4.2) Construct the motion attention module: input the current frame feature map F_t and the hidden state H_t-1 of the memory module of the previous frame to obtain the position feature map E_s, specifically as follows:
4.2.1) F_t is passed through two convolution layers with 1×1 convolution kernels respectively to obtain two feature maps F_a and F_b, each of size 2048×28×28;
4.2.2) Call the reshape function in python to adjust the size of F_a and convert it into a feature map F'_a of size 2048×n; call the reshape function in python to adjust the size of F_b and convert it into a feature map F'_b of size 2048×n; call the reshape and transpose functions in python to adjust the size of H_t-1 and convert it into a feature map H'_t-1 of size n×2048, where n represents the total number of pixels of the current frame;
4.2.3) Multiply H'_t-1 with F'_a by matrix multiplication and apply the softmax function to obtain the position weight attention map X_s of size n×n, expressed mathematically as follows:

s_ji = exp(h_i · F'_aj) / Σ_{i=1}^{N} exp(h_i · F'_aj)

where N represents the total number of pixels of the current frame; F'_aj denotes the j-th column of F'_a, i.e. the j-th position of F_a; h_i denotes the i-th row of H'_t-1, i.e. the i-th position of H_t-1; s_ji is the element in the j-th row and i-th column of the position weight attention map X_s and represents the effect of the i-th position of the hidden state H_t-1 on the j-th position of the current frame feature map F_t.
4.2.4) Multiply the position weight attention map with the feature map F_b by matrix multiplication to strengthen the features of the current frame feature map F_t, and add the result to F_t as a residual to obtain the fused position feature map E_s, expressed mathematically as follows:

E_sj = α Σ_{i=1}^{N} (s_ji F'_bi) + F_tj

where α represents the position attention weight, whose initial value is set to zero and to which the model assigns a larger and more reasonable weight through learning; E_sj denotes the j-th position of E_s and F_tj the j-th position of F_t; F'_bi denotes the i-th column of F'_b, i.e. the i-th position of the current frame feature map F_t; and N represents the total number of pixels of the current frame.
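The position (motion) attention of steps 4.2.1)-4.2.4) can be sketched in the same style; the assumption that the hidden state H_t-1 has the same channel count and spatial size as F_t follows the description in step 4.2.2).

import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAttention(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=1)  # produces F_a
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)  # produces F_b
        self.alpha = nn.Parameter(torch.zeros(1))  # position attention weight alpha, initialised to zero

    def forward(self, f_t, h_prev):
        # f_t: (B, C, H, W) current-frame features, h_prev: (B, C, H, W) hidden state H_t-1
        b, c, h, w = f_t.shape
        n = h * w
        f_a = self.conv_a(f_t).view(b, c, n)
        f_b = self.conv_b(f_t).view(b, c, n)
        h_flat = h_prev.view(b, c, n).transpose(1, 2)  # (B, N, C)
        # position weight attention map X_s: (B, N, N)
        x_s = F.softmax(torch.bmm(h_flat, f_a), dim=-1)
        # re-weight F_b by the position attention and add F_t as a residual
        return self.alpha * torch.bmm(f_b, x_s.transpose(1, 2)).view(b, c, h, w) + f_t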
4.3) Add the position feature map E_s and the channel feature map E_c to obtain the final segmentation result F_out of the current frame.
4.4) Construct the memory module: input the current frame segmentation result F_out, the hidden state H_t-1 of the memory module of the previous frame and the memory cell C_t-1 of the memory module of the previous frame. The memory module convLSTM of the current frame consists of a forget gate f_t, an input gate i_t and an output gate o_t;
4.4.1) Each value in the output tensor of the forget gate is between 0 and 1, where 0 means complete forgetting and 1 means complete retention, so the forget gate can selectively discard part of the information in the memory cell C_t-1 of the previous frame, expressed mathematically as follows:
f_t = σ(W_xf * F_out + W_hf * H_t-1 + b_f)
where * denotes the convolution operation, σ denotes the sigmoid function, F_out denotes the segmentation result of the current frame, and H_t-1 denotes the hidden state of the memory module of the previous frame; W_xf and W_hf are weight parameters with values between 0 and 1, and the initial value of b_f is set to 0.1, with the model assigning a more reasonable value to b_f through learning;
4.4.2) The input gate determines which information in the current frame segmentation result F_out is to be saved, expressed mathematically as follows:
i_t = σ(W_xi * F_out + W_hi * H_t-1 + b_i)
where * denotes the convolution operation, σ denotes the sigmoid function, F_out denotes the segmentation result of the current frame, and H_t-1 denotes the hidden state of the memory module of the previous frame; W_xi and W_hi are weight parameters with values between 0 and 1, and the initial value of b_i is set to 0.1, with the model assigning a more reasonable value to b_i through learning;
4.4.3) The forget gate discards part of the information in the memory cell C_t-1 of the previous frame, the useful information is saved into the memory cell C_t-1 of the previous frame, and the updated memory cell C_t of the current frame is output, expressed mathematically as follows:

C_t = f_t ⊙ C_t-1 + i_t ⊙ tanh(W_xc * F_out + W_hc * H_t-1 + b_c)

where * denotes the convolution operation, ⊙ denotes the Hadamard product, F_out denotes the segmentation result of the current frame, and H_t-1 denotes the hidden state of the memory module of the previous frame; W_xc and W_hc are weight parameters with values between 0 and 1, and the initial value of b_c is set to 0.1, with the model assigning a more reasonable value to b_c through learning;
4.4.4) The hidden state H_t output by the memory module of the current frame is expressed mathematically as follows:

o_t = σ(W_xo * F_out + W_ho * H_t-1 + b_o)

H_t = o_t ⊙ tanh(C_t)

where * denotes the convolution operation, σ denotes the sigmoid function, tanh is the activation function, ⊙ denotes the Hadamard product, o_t represents the output gate in the memory module of the current frame, F_out denotes the segmentation result of the current frame, and H_t-1 denotes the hidden state of the memory module of the previous frame; W_ho and W_xo are weight parameters with values between 0 and 1, and the initial value of b_o is set to 0.1, with the model assigning a more reasonable value to b_o through learning.
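The memory module of step 4.4) is a standard convLSTM cell. The sketch below folds the separate W_x* and W_h* convolutions into one convolution over the concatenated inputs and leaves the biases at PyTorch's default initialisation instead of 0.1; the kernel size is also an assumption.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # one convolution produces the pre-activations of the input, forget and output gates
        # and of the candidate cell state (4 * channels output maps in total)
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size, padding=kernel_size // 2)

    def forward(self, f_out, h_prev, c_prev):
        # f_out: segmentation result of the current frame, h_prev / c_prev: H_t-1 and C_t-1
        i, f, o, g = torch.chunk(self.gates(torch.cat([f_out, h_prev], dim=1)), 4, dim=1)
        i = torch.sigmoid(i)       # input gate  i_t
        f = torch.sigmoid(f)       # forget gate f_t
        o = torch.sigmoid(o)       # output gate o_t
        g = torch.tanh(g)          # candidate cell state
        c_t = f * c_prev + i * g   # C_t = f_t (Hadamard) C_t-1 + i_t (Hadamard) tanh(...)
        h_t = o * torch.tanh(c_t)  # H_t = o_t (Hadamard) tanh(C_t)
        return h_t, c_t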
6) The loss function adopted by the segmentation model consists of two parts: the first part is a pixel-level loss function, and the second part is a structural similarity loss function. The specific design is as follows:
l = l_cross + l_ssim
6.1) l_cross represents the pixel-level cross entropy loss function, expressed mathematically as follows:

l_cross = -Σ_{(r,c)} [T(r,c) log S(r,c) + (1 - T(r,c)) log(1 - S(r,c))]

where T(r,c) represents the pixel value at the r-th row and c-th column of the target mask, and S(r,c) represents the pixel value at the r-th row and c-th column of the segmentation result;
6.2) l_ssim represents the structural similarity loss function, which compares the differences between the target mask and the segmentation result in terms of brightness, contrast and structure, expressed mathematically as follows:

l_ssim = 1 - [(2 μ_x μ_y + C_1)(2 σ_xy + C_2)] / [(μ_x^2 + μ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2)]

where A_x and A_y respectively represent regions of the same size cut from the segmentation map predicted by the model and from the target mask; x_i represents the pixel value of the i-th pixel in region A_x, y_i represents the pixel value of the i-th pixel in region A_y, and N represents the total number of pixels in the cut region; μ_x = (1/N) Σ_{i=1}^{N} x_i represents the average brightness of A_x and μ_y = (1/N) Σ_{i=1}^{N} y_i represents the average brightness of A_y; σ_x represents the degree of brightness variation in A_x, σ_y represents the degree of brightness variation in A_y, and σ_xy = (1/N) Σ_{i=1}^{N} (x_i - μ_x)(y_i - μ_y) represents the covariance, which is associated with the structure; C_1 and C_2 are constants used to prevent the denominator from being zero, with C_1 set to 6.5025 and C_2 set to 58.5225.
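A simplified PyTorch version of the loss of step 6) is sketched below; it computes the structural similarity over the whole mask rather than over the local regions A_x and A_y, which is a simplification of the sketch, and it assumes the prediction already holds probabilities in [0, 1].

import torch
import torch.nn.functional as F

C1, C2 = 6.5025, 58.5225  # stabilising constants from step 6.2)

def ssim_loss(pred, target):
    mu_x, mu_y = pred.mean(), target.mean()
    var_x, var_y = pred.var(), target.var()
    cov_xy = ((pred - mu_x) * (target - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return 1.0 - ssim

def segmentation_loss(pred, target):
    # l = l_cross + l_ssim
    return F.binary_cross_entropy(pred, target) + ssim_loss(pred, target)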
7) Training the model: select YouTube from step 1) as the training set, set the batch size of each training batch to 4 and the learning rate to 1e-4; after 300,000 iterations on YouTube, change the learning rate to 1e-5 and train the model with the loss function of step 6) for another 100,000 iterations on YouTube; the weight decay rate is set to 0.0005, and the model is trained until it converges.
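The training schedule of step 7) can be written as the following sketch; the optimiser choice (Adam) and the stand-in model are assumptions, while the batch size, learning rates, iteration counts and weight decay follow the text.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=1)  # stand-in for the full segmentation network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0005)

for iteration in range(400_000):  # 300k iterations at lr 1e-4, then 100k more at lr 1e-5
    if iteration == 300_000:
        for group in optimizer.param_groups:
            group["lr"] = 1e-5
    # frames, masks = next(train_iterator)            # batches of 4 samples from the YouTube set
    # loss = segmentation_loss(model(frames), masks)  # loss of step 6)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()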
The invention has wide application in the field of video object segmentation and computer vision, for example: target tracking, image recognition, etc. The present invention will be described in detail below with reference to the accompanying drawings.
(1) Constructing the segmentation backbone network: the current frame I_t and the first frame I_0 are respectively input into the backbone network to obtain the corresponding feature maps F_t and F_0;
(2) Constructing the motion attention network: the current frame feature map F_t and the first frame feature map F_0 are taken as the input of the channel attention module in the motion attention network to obtain the channel feature map; F_t and the hidden state H_t-1 of the memory module of the previous frame are taken as the input of the motion attention module in the motion attention network to obtain the position feature map; the position feature map and the channel feature map are added to obtain the output F_out of the motion attention network, namely the segmentation result of the current frame. The segmentation result F_out of the current frame, the memory cell C_t-1 output by the memory module of the previous frame and the hidden state H_t-1 of the memory module of the previous frame are taken as the input of the memory module in the motion attention network to obtain the memory cell C_t and the hidden state H_t. The memory cell C_t stores the timing information of the target object and is updated according to the segmentation result of the current frame, while H_t provides the position information of the target object of the next frame predicted from the current segmentation result and the timing information. The memory module convLSTM retains not only the spatial information of the target object of the current frame but also the timing information of the target object, so the long-range position dependency of the target object can be obtained. A sketch of how these modules could be composed is given below.
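The sketch below shows one way the module sketches given earlier (ChannelAttention, MotionAttention and ConvLSTMCell) could be composed into the per-frame forward pass just described; it is an illustration under those assumptions, not the exact network of the invention.

import torch.nn as nn

class MotionAttentionNetwork(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        self.channel_attention = ChannelAttention()        # sketch from step 4.1)
        self.motion_attention = MotionAttention(channels)  # sketch from step 4.2)
        self.memory = ConvLSTMCell(channels)               # sketch from step 4.4)

    def forward(self, f_t, f_0, h_prev, c_prev):
        e_c = self.channel_attention(f_t, f_0)    # channel feature map E_c
        e_s = self.motion_attention(f_t, h_prev)  # position feature map E_s
        f_out = e_c + e_s                         # segmentation result F_out of the current frame
        h_t, c_t = self.memory(f_out, h_prev, c_prev)  # updated memory cell C_t and hidden state H_t
        return f_out, h_t, c_t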
The method is implemented with the PyTorch framework and the Python language on a GTX 1080Ti GPU under the Ubuntu 14.04 operating system.
The invention provides a video target segmentation method based on motion attention, which is suitable for segmenting a moving object in a video, has good robustness and accurate segmentation result. Experiments show that the method can effectively divide the moving object.
The foregoing is merely illustrative of the present invention and is not intended to limit the present invention; any modifications and substitutions that are apparent to those skilled in the art within the scope of the present invention shall be included, and the scope of the present invention shall be defined by the appended claims.