
CN111161306B - A Video Object Segmentation Method Based on Motion Attention - Google Patents


Info

Publication number
CN111161306B
CN111161306B
Authority
CN
China
Prior art keywords
feature map
current frame
frame
segmentation
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911402450.2A
Other languages
Chinese (zh)
Other versions
CN111161306A (en)
Inventor
付利华
杨寒雪
杜宇斌
姜涵煦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201911402450.2A priority Critical patent/CN111161306B/en
Publication of CN111161306A publication Critical patent/CN111161306A/en
Application granted granted Critical
Publication of CN111161306B publication Critical patent/CN111161306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • G06T5/30Erosion or dilatation, e.g. thinning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract


The present invention provides a video object segmentation method based on motion attention. The method adds the channel feature map output by the channel attention module and the position feature map output by the motion attention module to obtain the segmentation result of the current frame. The input of the channel attention module is the current-frame feature map F_t and the appearance feature map F_0 of the target object provided by the first frame; by computing the correlation between the channels of F_t and F_0, the module outputs a channel feature map that reflects the object in the current frame whose appearance is closest to the target object. The input of the motion attention module is the current-frame feature map F_t and the position information H_{t-1} of the target object predicted by the memory module in the previous frame's motion attention network; by computing the correlation between the positions of F_t and H_{t-1}, the module outputs a position feature map that reflects the approximate position of the target object in the current frame. By combining the two factors of appearance and position, the invention achieves more accurate segmentation of video objects.


Description

Video target segmentation method based on motion attention
Technical Field
The invention belongs to the field of image processing and computer vision, relates to a video target segmentation method, and particularly relates to a video target segmentation method based on motion attention.
Background
Video object segmentation is a prerequisite for solving many video tasks and plays a role in object recognition, video compression, and other fields. It may be defined as tracking objects and segmenting target objects according to an object mask. According to the presence or absence of an initial mask, video object segmentation can be divided into semi-supervised and unsupervised methods: a semi-supervised method manually initializes the segmentation mask in the first frame of the video and then tracks and segments the target object, while an unsupervised method automatically segments the target object in a given video according to some mechanism, without any prior information.
In video scenes, background clutter, object deformation, and rapid object motion all affect the segmentation results. Traditional video object segmentation techniques adopt a rigid background motion model combined with scene priors to segment the target object; however, the assumptions they rely on are of limited validity in practice. Most existing video object segmentation techniques adopt convolutional neural networks, but they also have various defects: for example, methods that rely on inter-frame optical flow to segment moving objects are easily affected by optical flow estimation errors. In addition, these methods neither fully utilize the timing information in the video nor memorize the relevant features of the target object in the scene.
To solve these problems, the invention studies the segmentation of moving targets in a semi-supervised setting and provides a video object segmentation method with a memory module based on motion attention.
Disclosure of Invention
The invention aims to solve the following problems: in video object segmentation, determining the target object of the current frame only from the segmentation result of the previous frame cannot yield the accurate position of the target object, and excessive dependence on the previous frame's result may even cause the target to drift; most existing motion-based methods segment the object using the optical flow between the current and previous frames, which is computationally expensive and limits the segmentation result to specific motion patterns. Therefore, a new video object segmentation method based on motion information is needed to improve the segmentation effect.
In order to solve the above problems, the present invention provides a video object segmentation method based on motion attention, which fuses motion timing information in a video sequence and implements video object segmentation based on an attention mechanism, comprising the following steps:
1) Construct a segmentation backbone network: input the current frame I_t and the first frame I_0 respectively into the backbone network to obtain the corresponding feature maps F_t and F_0;
2) Construct a motion attention network: take the current-frame feature map F_t, the first-frame feature map F_0, and the hidden state H_{t-1} of the previous frame's memory module as the input of the motion attention network; the output F_out of the motion attention network is the segmentation result of the current frame;
3) Construct a loss function consisting of two parts: the first part is a pixel-level loss function and the second part is a structural similarity loss function.
Further, constructing the segmentation backbone network in step 1) to obtain the feature maps F_t and F_0 specifically comprises:
1.1) Modify the ResNet-50 network and fuse dilated (hole) convolution. First, set the dilation factor dilated=2 of conv_1 in ResNet-50; second, delete the pooling layer in ResNet-50; then set the stride stride=1 of the conv_3 and conv_4 layers in ResNet-50; finally, use the modified ResNet-50 as the backbone network, whose output feature map is 1/8 of the original image size;
1.2) Input the current frame I_t into the backbone network to obtain the current-frame feature map F_t;
1.3) Input the first frame I_0 into the backbone network to obtain the first-frame feature map F_0.
Further, the motion attention network is constructed in step 2) to obtain the segmentation result of the current frame.
The motion attention network consists of a channel attention module, a motion attention module, and a memory module, which are constructed as follows:
2.1) Construct the channel attention module, with F_t and F_0 as input. F_0 provides appearance information of the target object, such as color and posture. First, F_t and F_0 are matrix-multiplied and passed through the softmax function to obtain the channel weight attention map X_c of the target object; X_c describes the correlation between channels of the current frame and the first frame — the higher the correlation, the higher the response value and the more similar the features. Then X_c is multiplied with F_t to strengthen its features, and the result is added to F_t as a residual to obtain the channel feature map;
2.2) Construct the motion attention module, with F_t and H_{t-1} as input; H_{t-1} provides the position information of the current frame's target object predicted from the previous frame's segmentation result and timing information. First, the feature map F_t passes through two convolution layers with 1×1 kernels to obtain two feature maps F_a and F_b; then F_a and H_{t-1} are matrix-multiplied and passed through the softmax function to obtain the position weight attention map X_s of the target object; finally, X_s is multiplied with F_b to strengthen the features, and the result is added to F_t as a residual to obtain the position feature map;
2.3) Add the channel feature map and the position feature map to obtain the final segmentation result F_out of the current frame;
2.4) Construct the memory module convLSTM, taking the current-frame segmentation result F_out, the memory cell C_{t-1} output by the previous frame's memory module, and the hidden state H_{t-1} of the previous frame's memory module as input; the module outputs the memory cell C_t and the hidden state H_t.
The convLSTM consists of an input gate, a forget gate, and an output gate.
Further, step 2.4) constructs the memory module convLSTM, specifically:
2.4.1) First, the forget gate discards part of the information in the previous frame's memory cell C_{t-1}; then the input gate saves the useful information in the current-frame segmentation result F_out into the previous frame's memory cell C_{t-1}; finally, the updated current-frame memory cell C_t is output;
2.4.2) First, the output gate filters the current-frame segmentation result F_out and the hidden state H_{t-1} of the previous frame's memory module through the sigmoid function to determine the information to be output; then the tanh activation function is applied to modulate the current-frame memory cell C_t; finally, the information to be output is multiplied with the modulated current-frame memory cell C_t to obtain and output the current-frame hidden state H_t.
Advantageous effects
The invention provides a video object segmentation method based on motion attention: first obtain the feature maps of the first frame and the current frame, then input the current-frame feature map F_t, the appearance feature map F_0 of the target object provided by the first frame, and the position information H_{t-1} of the target object predicted by the memory module in the previous frame's motion attention network into the motion attention network of the current frame to obtain the segmentation result of the current frame. The invention can handle the diverse motion patterns that other segmentation methods cannot. The method is suitable for video object segmentation, with good robustness and accurate segmentation results.
The invention is characterized in that: first, instead of relying only on the segmentation result of the previous frame, it segments the target object more accurately by means of the appearance information of the target object in the first frame and the timing information of the target object in the video sequence; second, the motion attention network greatly reduces useless features and improves the robustness of the model.
Drawings
Fig. 1 is a flow chart of the video object segmentation method based on motion attention of the present invention.
FIG. 2 is a network structure diagram of the video object segmentation method based on motion attention of the present invention.
FIG. 3 is a structure diagram of ResNet-50.
FIG. 4 is a structure diagram of the modified ResNet-50 used by the video object segmentation method based on motion attention of the present invention.
Detailed Description
The invention provides a video target segmentation method based on motion attention, which comprises the steps of firstly obtaining a characteristic diagram of a first frame and a characteristic diagram of a current frame, then inputting the characteristic diagram of the first frame, the characteristic diagram of the current frame and target object position information predicted by a memory module in a motion attention network of a previous frame into the motion attention network, and obtaining a segmentation result of the current frame. The method is suitable for video target segmentation, has good robustness and accurate segmentation effect.
The invention will be described in more detail with reference to specific examples and figures.
The invention comprises the following steps:
1) Acquire the YouTube and Davis datasets, used respectively as the training set and test set of the model;
2) Preprocess the training data. Crop each training sample (video frame) and the first-frame mask of the video sequence, resize the images to 224×224 resolution, and apply data augmentation such as rotation;
3) Construct the segmentation backbone network, and input the first-frame segmentation mask and the current frame of the video sequence to obtain their segmentation feature maps;
3.1) First, set the dilation factor dilated=2 of conv_1 in ResNet-50; second, delete the pooling layer in ResNet-50; then set the stride stride=1 of the conv_3 and conv_4 layers in ResNet-50; finally, use the modified ResNet-50 as the backbone network, whose output feature map is 1/8 of the original image size, as shown in FIG. 4;
3.2) The first-frame segmentation mask has a resolution of 224×224; input it into the backbone network to obtain its feature map F_0 of size 2048×28×28;
3.3) The current frame has a resolution of 224×224; input it into the backbone network to obtain the current-frame feature map F_t of size 2048×28×28.
4) Construct the motion attention network; input the current-frame feature map F_t, the hidden state H_{t-1} of the previous frame's memory module, and the first-frame feature map F_0 to obtain the segmentation result F_out of the current frame. The motion attention network consists of a channel attention module, a motion attention module, and a memory module.
4.1) Construct the channel attention module. Input the current-frame feature map F_t and the first-frame feature map F_0 to obtain the channel feature map E_c, specifically:
4.1.1) Call the reshape function in Python to adjust F_t into a feature map F'_t of size n×2048; call the reshape function to adjust F_0 into a feature map F'_0 of size 2048×n, where n is the total number of pixels of the current frame;
4.1.2) F'_0 is matrix-multiplied with F'_t and the softmax function is applied; the softmax function is:
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
Matrix multiplication achieves the use and fusion of global information, functioning similarly to a fully connected operation. A fully connected operation considers the data relations of all positions but destroys the spatial structure; matrix multiplication is therefore used in its place, preserving spatial information as much as possible while still exploiting global information;
4.1.3) Obtain the channel weight attention map X_c of size 2048×2048; the element x_ji in row j, column i of X_c is:
x_ji = exp(F'_0j · F'_ti) / Σ_{i=1..C} exp(F'_0j · F'_ti)
where F'_ti is the i-th column of F'_t, representing the i-th channel of the current-frame feature map F_t; F'_0j is the j-th row of F'_0, representing the j-th channel of the first-frame feature map F_0; x_ji represents the influence of the j-th channel of the first-frame feature map F_0 on the i-th channel of the current-frame feature map F_t; and C is the number of channels of F_t.
4.1.4) Multiply the channel weight attention map X_c with the feature map F'_t to strengthen the features of the current-frame feature map F_t, and add the result to F_t as a residual to obtain the channel feature map E_c:
E_cj = β · Σ_{i=1..C} (x_ji · F'_ti) + F_tj
where β is the channel attention weight, initialized to zero; through learning, the model assigns β a larger and more reasonable weight. F'_ti is the i-th column of F'_t, representing the i-th channel of the current-frame feature map F_t, and C is the number of channels of F_t.
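Steps 4.1.1-4.1.4 can be sketched as a small PyTorch module. The tensor layout (batched `bmm` over flattened C×n views) is an implementation assumption; the learnable β initialized to zero follows the text, so at initialization the module is an identity mapping of F_t.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention: correlates the channels of the current-frame features
    F_t with the first-frame appearance features F_0 (sketch of steps 4.1.1-4.1.4)."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # channel attention weight, init 0

    def forward(self, f_t, f_0):
        b, c, h, w = f_t.shape
        ft_flat = f_t.view(b, c, h * w)                       # C x n
        f0_flat = f_0.view(b, c, h * w)                       # C x n
        # X_c[j, i]: influence of first-frame channel j on current-frame channel i
        energy = torch.bmm(f0_flat, ft_flat.transpose(1, 2))  # C x C
        x_c = torch.softmax(energy, dim=-1)                   # softmax over i
        out = torch.bmm(x_c, ft_flat).view(b, c, h, w)        # weighted channels
        return self.beta * out + f_t                          # residual addition
```

Because β starts at zero, the output initially equals F_t and the attention contribution is learned during training.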
4.2) Construct the motion attention module. Input the current-frame feature map F_t and the hidden state H_{t-1} of the previous frame's memory module to obtain the position feature map E_s, specifically:
4.2.1) F_t passes through two convolution layers with 1×1 kernels to obtain two feature maps F_a and F_b, both of size 2048×28×28;
4.2.2) Call the reshape function in Python to adjust F_a into a feature map F'_a of size 2048×n; call the reshape function to adjust F_b into a feature map F'_b of size 2048×n; call the reshape and transpose functions to adjust H_{t-1} into a feature map H'_{t-1} of size n×2048, where n is the total number of pixels of the current frame;
4.2.3) H'_{t-1} is matrix-multiplied with F'_a and the softmax function is applied to obtain the position weight attention map X_s of size N×N, where N = 28×28 is the number of pixel positions of the current frame:
s_ji = exp(F'_aj · h_i) / Σ_{i=1..N} exp(F'_aj · h_i)
where F'_aj is the j-th column of F'_a, representing the j-th position of F_a; h_i is the i-th row of H'_{t-1}, representing the i-th position of H_{t-1}; and s_ji, the element in row j, column i of the position weight attention map X_s, indicates the influence of the i-th position of the hidden state H_{t-1} on the j-th position of the current-frame feature map F_t.
4.2.4) Matrix-multiply the position weight attention map X_s with the feature map F'_b to strengthen the features of the current-frame feature map F_t, and add the result to F_t as a residual to obtain the fused position feature map E_s:
E_sj = α · Σ_{i=1..N} (s_ji · F'_bi) + F_tj
where α is the position attention weight, initialized to zero; through learning, the model assigns α a larger and more reasonable weight. F'_bi is the i-th column of F'_b, representing the i-th position of the current-frame feature map F_t, and N is the total number of pixels of the current frame.
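Steps 4.2.1-4.2.4 can likewise be sketched as a module. The two 1×1 convolutions producing F_a and F_b follow the text; the exact `bmm` layout and the assumption that H_{t-1} has the same channel count as F_t are implementation choices, and α initialized to zero again makes the module an identity mapping at the start of training.

```python
import torch
import torch.nn as nn


class MotionAttention(nn.Module):
    """Motion (position) attention: correlates current-frame features F_t with
    the previous frame's hidden state H_{t-1} (sketch of steps 4.2.1-4.2.4)."""

    def __init__(self, channels=2048):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))  # position attention weight, init 0

    def forward(self, f_t, h_prev):
        b, c, h, w = f_t.shape
        n = h * w
        f_a = self.conv_a(f_t).view(b, c, n)                 # C x N
        f_b = self.conv_b(f_t).view(b, c, n)                 # C x N
        h_flat = h_prev.view(b, c, n)                        # C x N
        # X_s[j, i]: influence of position i in H_{t-1} on position j of F_t
        energy = torch.bmm(f_a.transpose(1, 2), h_flat)      # N x N
        x_s = torch.softmax(energy, dim=-1)                  # softmax over i
        out = torch.bmm(f_b, x_s.transpose(1, 2)).view(b, c, h, w)
        return self.alpha * out + f_t                        # residual addition
```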
4.3) Add the position feature map E_s and the channel feature map E_c to obtain the final segmentation result F_out of the current frame.
4.4) Construct the memory module. Input the current-frame segmentation result F_out, the hidden state H_{t-1} of the previous frame's memory module, and the memory cell C_{t-1} of the previous frame's memory module. The convLSTM memory module of the current frame consists of a forget gate f_t, an input gate i_t, and an output gate o_t;
4.4.1) Each value in the output tensor of the forget gate lies between 0 and 1, where 0 means completely forget and 1 means completely retain, so the forget gate can selectively discard information in the previous frame's memory cell C_{t-1}:
f_t = σ(W_xf * F_out + W_hf * H_{t-1} + b_f)
where * denotes the convolution operation, σ is the sigmoid function, F_out is the segmentation result of the current frame, and H_{t-1} is the hidden state of the previous frame's memory module. W_xf and W_hf are weight parameters with values between 0 and 1; b_f is initialized to 0.1, and through learning the model assigns b_f a more reasonable value;
4.4.2) The input gate selects the information in the current-frame segmentation result F_out to be written into memory:
i_t = σ(W_xi * F_out + W_hi * H_{t-1} + b_i)
where * denotes the convolution operation, σ is the sigmoid function, F_out is the segmentation result of the current frame, and H_{t-1} is the hidden state of the previous frame's memory module. W_xi and W_hi are weight parameters with values between 0 and 1; b_i is initialized to 0.1, and through learning the model assigns b_i a more reasonable value;
4.4.3) Use the forget gate to discard part of the information in the previous frame's memory cell C_{t-1} and the input gate to save the useful information into it, outputting the updated current-frame memory cell C_t:
C_t = f_t ⊙ C_{t-1} + i_t ⊙ tanh(W_xc * F_out + W_hc * H_{t-1} + b_c)
where * denotes the convolution operation and ⊙ the Hadamard product; F_out is the segmentation result of the current frame and H_{t-1} the hidden state of the previous frame's memory module. W_xc and W_hc are weight parameters with values between 0 and 1; b_c is initialized to 0.1, and through learning the model assigns b_c a more reasonable value;
4.4.4) The current-frame memory module outputs the hidden state H_t:
o_t = σ(W_xo * F_out + W_ho * H_{t-1} + b_o)
H_t = o_t ⊙ tanh(C_t)
where tanh is the activation function, ⊙ is the Hadamard product, o_t is the output gate of the current-frame memory module, * denotes the convolution operation, F_out is the segmentation result of the current frame, and H_{t-1} is the hidden state of the previous frame's memory module. W_xo and W_ho are weight parameters with values between 0 and 1; b_o is initialized to 0.1, and through learning the model assigns b_o a more reasonable value.
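The gate equations of steps 4.4.1-4.4.4 can be sketched as a convLSTM cell. Computing all four gate pre-activations with one convolution is a standard simplification (not stated in the text), and the channel sizes, kernel size, and default bias initialization are assumptions; the text's b=0.1 initialization is omitted for brevity.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Sketch of the convLSTM memory module (steps 4.4.1-4.4.4). One conv
    produces the input, forget, and output gates plus the candidate memory."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, f_out, h_prev, c_prev):
        z = self.gates(torch.cat([f_out, h_prev], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        # forget part of C_{t-1}, write new information selected by the input gate
        c_t = f * c_prev + i * torch.tanh(g)
        # hidden state: the predicted position cue for the next frame
        h_t = o * torch.tanh(c_t)
        return h_t, c_t
```

Note that H_t = o_t ⊙ tanh(C_t) is bounded in (-1, 1), which matches the use of the sigmoid/tanh gating described above.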
6) The loss function adopted by the segmentation model consists of two parts: the first is a pixel-level loss function and the second a structural similarity loss function, designed as follows:
l = l_cross + l_ssim
6.1) l_cross denotes the pixel-level cross-entropy loss:
l_cross = − Σ_{(r,c)} [ T(r,c) · log S(r,c) + (1 − T(r,c)) · log(1 − S(r,c)) ]
where T(r,c) is the pixel value at row r, column c of the target mask, and S(r,c) is the pixel value at row r, column c of the segmentation result;
6.2) l_ssim denotes the structural similarity loss, which compares the differences between the target mask and the segmentation result in terms of luminance, contrast, and structure:
l_ssim = 1 − SSIM(A_x, A_y)
SSIM(A_x, A_y) = (2·μ_x·μ_y + C_1)(2·σ_xy + C_2) / ((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2))
μ_x = (1/N) Σ_i x_i,  σ_x² = (1/N) Σ_i (x_i − μ_x)²  (and likewise for μ_y, σ_y²)
σ_xy = (1/N) Σ_i (x_i − μ_x)(y_i − μ_y)
where A_x and A_y are same-size regions cropped from the segmentation map predicted by the model and from the target mask, respectively; x_i and y_i are the pixel values of the i-th pixel in A_x and A_y; N is the total number of pixels in the cropped region; C_1 and C_2 are constants preventing zero denominators, set to 6.5025 and 58.5225; μ_x and μ_y are the mean luminances of A_x and A_y; σ_x and σ_y describe the luminance variation in A_x and A_y; and σ_xy is the structure-related covariance.
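The two-part loss of step 6) can be sketched as follows. Computing the SSIM statistics once over the whole prediction (rather than over sliding windows, as the cropped-region description suggests) is a simplification, and the mean-reduced cross entropy is an assumption about the normalization.

```python
import torch
import torch.nn.functional as F


def segmentation_loss(pred, target, c1=6.5025, c2=58.5225):
    """Sketch of l = l_cross + l_ssim (step 6). pred and target are mask
    tensors with values in [0, 1]; C1, C2 prevent zero denominators."""
    # pixel-level cross entropy between segmentation result and target mask
    l_cross = F.binary_cross_entropy(pred, target)
    # SSIM over luminance (means), contrast (variances), and structure (covariance)
    mu_x, mu_y = pred.mean(), target.mean()
    var_x = pred.var(unbiased=False)
    var_y = target.var(unbiased=False)
    cov = ((pred - mu_x) * (target - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    l_ssim = 1 - ssim
    return l_cross + l_ssim
```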
7) Train the model. Select the YouTube dataset from step 1) as the training set, set the batch size of each training batch to 4 and the learning rate to 1e-4; after 300,000 training iterations, reduce the learning rate to 1e-5 and train for another 100,000 iterations with the loss function of step 6), with the weight decay rate set to 0.0005, until the model converges.
The invention has wide application in the field of video object segmentation and computer vision, for example: target tracking, image recognition, etc. The present invention will be described in detail below with reference to the accompanying drawings.
(1) Construct the segmentation backbone network: input the current frame I_t and the first frame I_0 respectively into the backbone network to obtain the corresponding feature maps F_t and F_0.
(2) Construct the motion attention network: take the current-frame feature map F_t and the first-frame feature map F_0 as the input of the channel attention module in the motion attention network to obtain the channel feature map; take F_t and the hidden state H_{t-1} of the previous frame's memory module as the input of the motion attention module in the motion attention network to obtain the position feature map; add the position feature map and the channel feature map to obtain the output F_out of the motion attention network, i.e., the segmentation result of the current frame. Take the current-frame segmentation result F_out, the memory cell C_{t-1} output by the previous frame's memory module, and the hidden state H_{t-1} of the previous frame's memory module as the input of the memory module in the motion attention network to obtain the memory cell C_t and the hidden state H_t. The memory cell C_t stores the timing information of the target object and is updated according to the current-frame segmentation result; H_t provides the position information of the target object in the next frame, predicted from the current segmentation result and the timing information. The convLSTM memory module preserves both the spatial information of the current frame's target object and its timing information, so long-range positional dependencies of the target object can be captured.
The method is implemented with the PyTorch framework and the Python language on a GTX 1080Ti GPU under the Ubuntu 14.04 operating system.
The invention provides a video object segmentation method based on motion attention, which is suitable for segmenting moving objects in video, has good robustness, and produces accurate segmentation results. Experiments show that the method can effectively segment moving objects.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any modifications and substitutions will be apparent to those skilled in the art within the scope of the present invention, and it is intended that the scope of the present invention shall be defined by the appended claims.

Claims (3)

1.一种基于运动注意力的视频目标分割方法,其特征在于包括以下步骤:1. A video target segmentation method based on motion attention, characterized by comprising the following steps: (1)构建分割主干网络,将当前帧
Figure QLYQS_1
以及第一帧
Figure QLYQS_2
分别输入主干网络,获得对应的特征图
Figure QLYQS_3
,
Figure QLYQS_4
(1) Build a segmentation backbone network to transform the current frame
Figure QLYQS_1
And the first frame
Figure QLYQS_2
Input the backbone network respectively to obtain the corresponding feature map
Figure QLYQS_3
,
Figure QLYQS_4
;
(2) Construct a motion attention network: take the current-frame feature map F_t and the first-frame feature map F_1 as the input of the channel attention module in the motion attention network to obtain a channel feature map; take F_t and the hidden state H_{t-1} of the previous frame's memory module as the input of the motion attention module to obtain a position feature map; add the position feature map and the channel feature map to obtain the output of the motion attention network, S_t, which is the segmentation result of the current frame; take the current-frame segmentation result S_t, the memory cell C_{t-1} output by the previous frame's memory module, and the hidden state H_{t-1} of the previous frame's memory module as the input of the memory module in the motion attention network to obtain the memory cell C_t and the hidden state H_t; the memory cell C_t stores the temporal information of the target object and updates it according to the segmentation result of the current frame, and H_t provides the position information of the target object in the next frame, predicted from the current segmentation result and the temporal information;
The specific working process of the channel attention module is as follows:

First, F_t and F_1 are matrix-multiplied and passed through a softmax function to obtain the channel weight attention map X of the target object;

Then, X and F_t are multiplied, and the result is residual-added to F_t to obtain the channel feature map;
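As a rough illustration of the channel attention step described above, here is a minimal NumPy sketch; the flattened (C, N) layout, the symbol names, and the softmax axis are my assumptions, not taken verbatim from the patent:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(F_t, F_1):
    """F_t, F_1: (C, N) feature maps, C channels flattened over N spatial positions."""
    # channel weight attention map of the target object, shape (C, C)
    X = softmax(F_t @ F_1.T, axis=-1)
    # reweight the current-frame channels, then residual-add F_t
    return X @ F_t + F_t
```

Each row of X sums to one, so the output mixes the current-frame channels according to their correlation with the first-frame (reference) features before the residual connection restores the original signal.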
The specific working process of the motion attention module is as follows: F_t and H_{t-1} are taken as the input of this module, where H_{t-1} provides the position information of the target object in the current frame, predicted from the segmentation result and temporal information of the previous frame;

First, the feature map H_{t-1} passes through two convolutional layers to obtain two feature maps, denoted B and C;

Then, B and C are matrix-multiplied and passed through a softmax function to obtain the position weight attention map A of the target object;

Finally, A is multiplied with F_t for feature enhancement, and the result is residual-added to F_t to obtain the position feature map;
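The motion (position) attention step above can be sketched in NumPy as follows; the two convolutional layers are modeled here as plain matrix multiplies W_b and W_c, and all names and shapes are illustrative assumptions rather than the patent's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(F_t, H_prev, W_b, W_c):
    """F_t: (C, N) current-frame features; H_prev: (C, N) previous hidden state.
    W_b, W_c: (C2, C) weights standing in for the two convolutional layers."""
    B = W_b @ H_prev                  # first projected feature map, (C2, N)
    Cm = W_c @ H_prev                 # second projected feature map, (C2, N)
    A = softmax(B.T @ Cm, axis=-1)    # position weight attention map, (N, N)
    return F_t @ A + F_t              # feature enhancement + residual add
```

The (N, N) attention map reweights spatial positions of the current-frame features using the predicted object location carried by H_prev, and the residual keeps the original features intact.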
The memory module comprises a forget gate, an input gate, and an output gate; its specific working process is as follows:

First, the forget gate discards part of the state information in the previous frame's memory cell C_{t-1};

Then, the input gate saves useful information from the current-frame segmentation result S_t into the previous frame's memory cell C_{t-1};

Finally, the current-frame memory cell C_t is updated and output.
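The gate sequence above follows the standard LSTM update; a minimal element-wise NumPy sketch is given below (vector shapes, concatenated gate input, and omitted biases are all simplifying assumptions, the patent's module likely operates on feature maps):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_step(S_t, C_prev, H_prev, Wf, Wi, Wg, Wo):
    """One LSTM-style update. S_t: (n,) current segmentation features;
    C_prev, H_prev: (n,) previous memory cell / hidden state;
    Wf, Wi, Wg, Wo: (n, 2n) gate weights (biases omitted for brevity)."""
    x = np.concatenate([S_t, H_prev])   # gate input
    f = sigmoid(Wf @ x)                 # forget gate: drop part of C_prev
    i = sigmoid(Wi @ x)                 # input gate: select useful info from S_t
    g = np.tanh(Wg @ x)                 # candidate state from current result
    C_t = f * C_prev + i * g            # updated memory cell (temporal info)
    o = sigmoid(Wo @ x)                 # output gate
    H_t = o * np.tanh(C_t)              # hidden state predicting the next frame
    return C_t, H_t
```

C_t accumulates temporal information across frames, while H_t is the gated read-out that the motion attention module consumes at the next time step.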
2. The motion-attention-based video object segmentation method according to claim 1, characterized in that the feature maps F_t and F_1 are obtained as follows:

1) Construct the segmentation backbone network, specifically by modifying the ResNet-50 network and incorporating dilated convolution:

1.1.1) Set the dilation factor of conv_1 in ResNet-50 to dilated = 2;

1.1.2) Delete the pooling layer in ResNet-50;

1.1.3) Set the stride of the conv_3 and conv_4 layers in ResNet-50 to stride = 1;

1.1.4) Use the modified ResNet-50 as the backbone network; the feature map output by the backbone network is then 1/8 of the original image size;

1.2) Input the current frame I_t into the segmentation backbone network to obtain the feature map F_t of the current frame;

1.3) Input the first frame I_1 into the segmentation backbone network to obtain the feature map F_1 of the first frame.
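The resolution-preserving effect of the backbone changes in claim 2 can be checked with the standard convolution output-size formula; this is a generic sketch (padding values are assumptions), not the patent's implementation:

```python
def conv_out_size(n, k, stride=1, pad=0, dilation=1):
    """Output length of a 1-D convolution over n inputs with kernel size k."""
    effective_k = dilation * (k - 1) + 1  # dilation enlarges the receptive field
    return (n + 2 * pad - effective_k) // stride + 1

# A 3-wide kernel with dilation=2 and padding=2 keeps the spatial size while
# widening the receptive field; a stride-2 layer halves it. Setting strides to 1
# and compensating with dilation is how the modified backbone retains a
# higher-resolution output (1/8 of the input size, per claim 2).
```

For example, conv_out_size(64, 3, stride=1, pad=2, dilation=2) returns 64, whereas conv_out_size(64, 3, stride=2, pad=1) returns 32.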
3. The motion-attention-based video object segmentation method according to claim 1, characterized in that the loss function consists of two parts: the first part is a pixel-level loss function, and the second part is a structural similarity loss function.
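The two-part loss of claim 3 can be sketched as a pixel-level binary cross-entropy plus a structural-similarity term; the global (non-windowed) SSIM, the constants, and the unweighted sum below are simplifying assumptions:

```python
import numpy as np

def pixel_bce(pred, gt, eps=1e-7):
    """Pixel-level loss: mean binary cross-entropy over the mask."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(gt * np.log(p) + (1 - gt) * np.log(1 - p)))

def ssim_loss(pred, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    """Structural similarity loss: 1 - SSIM, computed globally for brevity."""
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    return 1.0 - float(ssim)

def total_loss(pred, gt):
    # pixel-level term + structural term, as in claim 3
    return pixel_bce(pred, gt) + ssim_loss(pred, gt)
```

The pixel term penalizes per-pixel misclassification, while the SSIM term penalizes structural disagreement between the predicted mask and the ground truth.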
CN201911402450.2A 2019-12-31 2019-12-31 A Video Object Segmentation Method Based on Motion Attention Active CN111161306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911402450.2A CN111161306B (en) 2019-12-31 2019-12-31 A Video Object Segmentation Method Based on Motion Attention


Publications (2)

Publication Number Publication Date
CN111161306A CN111161306A (en) 2020-05-15
CN111161306B true CN111161306B (en) 2023-06-02

Family

ID=70559471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911402450.2A Active CN111161306B (en) 2019-12-31 2019-12-31 A Video Object Segmentation Method Based on Motion Attention

Country Status (1)

Country Link
CN (1) CN111161306B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021250811A1 (en) * 2020-06-10 2021-12-16 日本電気株式会社 Data processing device, data processing method, and recording medium
CN111968123B (en) * 2020-08-28 2024-02-02 北京交通大学 Semi-supervised video target segmentation method
CN112580473B (en) * 2020-12-11 2024-05-28 北京工业大学 Video super-resolution reconstruction method integrating motion characteristics
CN112669324B (en) * 2020-12-31 2022-09-09 中国科学技术大学 Rapid video target segmentation method based on time sequence feature aggregation and conditional convolution
CN112784750B (en) * 2021-01-22 2022-08-09 清华大学 Fast video object segmentation method and device based on pixel and region feature matching
US11501447B2 (en) * 2021-03-04 2022-11-15 Lemon Inc. Disentangled feature transforms for video object segmentation
CN113570607B (en) * 2021-06-30 2024-02-06 北京百度网讯科技有限公司 Target segmentation method, device and electronic equipment
CN113436199B (en) * 2021-07-23 2022-02-22 人民网股份有限公司 Semi-supervised video target segmentation method and device
CN115272394A (en) * 2022-08-16 2022-11-01 中国工商银行股份有限公司 Method, device and equipment for determining position of target object to be tracked and storage medium
CN115880767B (en) * 2022-10-11 2025-12-05 电子科技大学长三角研究院(湖州) Target motion tracking method based on brain-like memory network
CN116740423A (en) * 2023-05-26 2023-09-12 国网黑龙江省电力有限公司牡丹江供电公司 An automatic identification method and system for smoke anomalies in distribution rooms

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018128741A1 (en) * 2017-01-06 2018-07-12 Board Of Regents, The University Of Texas System Segmenting generic foreground objects in images and videos
CN109784261A (en) * 2019-01-09 2019-05-21 深圳市烨嘉为技术有限公司 Pedestrian's segmentation and recognition methods based on machine vision
CN109919044A (en) * 2019-02-18 2019-06-21 清华大学 Video Semantic Segmentation Method and Device for Feature Propagation Based on Prediction
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A deep video behavior recognition method and system
CN110321761A (en) * 2018-03-29 2019-10-11 中国科学院深圳先进技术研究院 A kind of Activity recognition method, terminal device and computer readable storage medium
CN110532955A (en) * 2019-08-30 2019-12-03 中国科学院宁波材料技术与工程研究所 Example dividing method and device based on feature attention and son up-sampling

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI381717B (en) * 2008-03-31 2013-01-01 Univ Nat Taiwan Method of processing partition of dynamic target object in digital video and system thereof

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018128741A1 (en) * 2017-01-06 2018-07-12 Board Of Regents, The University Of Texas System Segmenting generic foreground objects in images and videos
CN110321761A (en) * 2018-03-29 2019-10-11 中国科学院深圳先进技术研究院 A kind of Activity recognition method, terminal device and computer readable storage medium
CN109784261A (en) * 2019-01-09 2019-05-21 深圳市烨嘉为技术有限公司 Pedestrian's segmentation and recognition methods based on machine vision
CN109919044A (en) * 2019-02-18 2019-06-21 清华大学 Video Semantic Segmentation Method and Device for Feature Propagation Based on Prediction
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A deep video behavior recognition method and system
CN110532955A (en) * 2019-08-30 2019-12-03 中国科学院宁波材料技术与工程研究所 Example dividing method and device based on feature attention and son up-sampling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Attention-based image segmentation combining target color features; Zhang Jianxing et al.; Computer Engineering and Applications; full text *

Also Published As

Publication number Publication date
CN111161306A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
CN111161306B (en) A Video Object Segmentation Method Based on Motion Attention
CN112561027B (en) Neural network architecture search method, image processing method, device and storage medium
CN109949255B (en) Image reconstruction method and device
JP6504590B2 (en) System and computer implemented method for semantic segmentation of images and non-transitory computer readable medium
US10535141B2 (en) Differentiable jaccard loss approximation for training an artificial neural network
CN110378348B (en) Video instance segmentation method, apparatus and computer-readable storage medium
CN111507993A (en) Image segmentation method and device based on generation countermeasure network and storage medium
CN114882493B (en) A 3D hand pose estimation and recognition method based on image sequences
WO2020062299A1 (en) Neural network processor, data processing method and related device
CN109544559B (en) Image semantic segmentation method, device, computer equipment and storage medium
CN117058456B (en) A visual object tracking method based on multi-phase attention mechanism
CN114743069B (en) A method for adaptive dense matching calculation between two frames of images
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN115249313B (en) Image classification method based on meta-module fusion incremental learning
CN112883806A (en) Video style migration method and device based on neural network, computer equipment and storage medium
CN113313731A (en) Three-dimensional human body posture estimation method for monocular video
CN119131085B (en) A target tracking method based on gated attention mechanism and spatiotemporal memory network
WO2023040740A1 (en) Method for optimizing neural network model, and related device
CN116703707A (en) Method for training skin color migration model, method for generating skin care image and related device
CN114693545A (en) Low-illumination enhancement method and system based on curve family function
CN108537825A (en) A kind of method for tracking target based on transfer learning Recurrent networks
CN113869396B (en) PC screen semantic segmentation method based on efficient attention mechanism
CN119600446B (en) A hyperspectral image reconstruction method and system
CN119478469B (en) End-to-end anchor point multi-view image clustering method under guidance of global principal component
CN113642592B (en) Training method of training model, scene recognition method and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared