CN117011337A - A lightweight target tracking method - Google Patents
- Publication number
- CN117011337A (application CN202310968693.2A)
- Authority
- CN
- China
- Prior art keywords
- target
- template
- feature
- inputting
- search image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/248—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/48—Matching video sequences
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a lightweight target tracking method. First, a video for target tracking is obtained, an initial template and a dynamic template are defined, and a template set is constructed. Next, a search image is read, the template set and the search image are input into a preprocessing module to obtain an output result a, and a is input into a target prediction module to predict the center point and bounding box of the target. The search feature is then input into an online target predictor to obtain a prediction result b, which is linearly fused with the predicted center point to obtain the final center-point prediction, completing target tracking. Finally, the search image and the prediction result b are input into a template updater, which decides according to an update strategy whether to adopt the search image as the new dynamic template. The invention reduces the computational cost and parameter count of the model, strengthens the ability to handle scenes with complex environments and large target changes, and improves tracking success rate and accuracy.
Description
Technical Field
The invention belongs to the field of target tracking in computer vision, and particularly relates to a lightweight target tracking method.
Background
Target tracking is an important research direction in computer vision and is widely applied in video surveillance, autonomous driving, human-computer interaction and other fields. A target tracking algorithm aims to automatically track a specified target, typically an object or a pedestrian, in a video sequence. In recent years, the development of deep learning has significantly improved the accuracy of neural-network-based target tracking algorithms, which have become the mainstream approach.
The fully convolutional classification-regression network SiamCAR is a deep-learning-based target tracking algorithm. It first adopts a Siamese (twin) network structure: a weight-sharing feature extraction network processes the template image and the search image separately, multiple intermediate feature layers are fused into two feature maps, and a similarity feature map between the two is computed. Next, the similarity feature map is enhanced by a feature enhancement network stacked from several layers of ordinary convolution. Finally, the center point and the scale of the target are predicted from the enhanced similarity feature map to realize tracking. However, the SiamCAR model has a complex structure with a large computational cost and parameter count, which limits its use in scenarios with constrained computing resources, and it does not update the template image during tracking, so its accuracy is low in scenes with complex environments and large target changes.
Disclosure of Invention
Aiming at these problems, the invention provides a lightweight target tracking method based on a feature enhancement network that fuses depthwise separable convolution with CBAM attention, together with a run-time template updating method; it optimizes the SiamCAR algorithm and reduces the computational cost and parameter count of the model.
A lightweight target tracking method comprises the following steps:
Step 1: obtain the video on which target tracking is to be performed, define the first frame of the video as the initial template, copy the initial template as the initial dynamic template, and put the initial template and the dynamic template into a set to construct the template set. Each frame except the first is used in turn as the search image for target tracking.
Step 2: read the search image and input the template set and the search image into the preprocessing module to obtain an output result a. The preprocessing module comprises a feature extractor and a feature similarity calculator.
Step 2.1: in the preprocessing module, extract features from the template set with a feature extractor t1 to obtain the initial feature and the dynamic feature;
extract features from the search image with a feature extractor t2 identical to t1 to obtain the search feature.
The two feature extractors t1 and t2 are based on ResNet50 with the last network layer removed.
Step 2.2: input the three features extracted in step 2.1 into the feature similarity calculator, which computes the similarity between each template feature (initial and dynamic) and the search feature and selects the result with the highest similarity as the output result a.
Step 3: input the output result a into the target prediction module to predict the center point and bounding box of the target.
Step 3.1: in the target prediction module, input a into a feature enhancer b1, and then input the output feature of b1 into the target box predictor to obtain the bounding box of the target.
Step 3.2: input a into a feature enhancer b2 identical to b1, and then input the output feature of b2 into the target center predictor to obtain the center point of the target.
Step 4: input the search feature into the online target predictor to obtain a prediction result b.
Then linearly fuse b with the center-point prediction of step 3.2 to obtain the final center-point prediction, and complete target tracking using the bounding box from step 3.1 together with the final center-point prediction.
Step 5: input the search image and the prediction result b from step 4 into the template updater, which decides according to an update strategy whether to adopt the search image as the new dynamic template, thereby updating the dynamic template.
The invention makes the following contributions:
1. The feature enhancement network in the feature enhancer, built from depthwise separable convolution fused with CBAM attention, reduces the computational cost and parameter count of the model.
2. A template updating operation is added in the tracking stage, which strengthens the algorithm's ability to handle scenes with complex environments and large target changes and improves the success rate and accuracy of tracking.
3. The invention reduces the model's computational cost by 68.3% and its parameter count by 71%, while improving tracking success rate and precision on the public dataset OTB100.
Drawings
In order to more intuitively illustrate the details of the invention, a brief description of the drawings will follow.
FIG. 1 is an overall network architecture diagram of the present invention;
FIG. 2 is a flow chart of a specific operation of the present invention;
FIG. 3 is a graph of the success rate (intersection-over-union) curves of the present invention and SiamCAR on the OTB100 dataset;
FIG. 4 is a graph of the precision curves of the present invention and SiamCAR on the OTB100 dataset.
Detailed Description
The following describes the lightweight target tracking method with reference to the accompanying drawings. The overall network structure of the invention is shown in FIG. 1 and the operation flow chart in FIG. 2; the specific steps are as follows:
Step 1: obtain the video on which target tracking is to be performed, define the first frame of the video as the initial template, copy the initial template as the initial dynamic template, and place the two templates in a set defined as the template set. Each frame except the first is used in turn as the search image for target tracking.
Step 2: as shown in FIG. 1, read the search image and input the template set and the search image into the preprocessing module. The preprocessing module consists of a feature extractor and a feature similarity calculator, which respectively extract the image features and compute the similarity of the input features for subsequent prediction.
First, the template set and the search image are each passed through the same feature extractor: the template set yields the initial feature and the dynamic feature, and the search image yields the search feature. The feature extractor is based on ResNet50 with the last network layer removed, because the tracking task does not need a classification output during feature extraction and coarser-grained features better meet its requirements. Moreover, unlike SiamCAR, which fuses multiple intermediate features, the present invention uses only the final output of the truncated ResNet50 for subsequent operations. This reduces the parameter count and computation of the feature extractor.
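A minimal sketch of such a truncated backbone, assuming a PyTorch/torchvision environment (the patent does not name a framework, and the exact truncation point of ResNet50 is an assumption read from the text):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50  # assumes torchvision >= 0.13

class BackboneFeatureExtractor(nn.Module):
    """ResNet50 trunk with the final pooling/classification stage removed,
    so the output is a spatial feature map rather than a class vector."""
    def __init__(self):
        super().__init__()
        base = resnet50(weights=None)
        # keep conv1 ... layer4, drop the average pooling and fc layers
        self.trunk = nn.Sequential(*list(base.children())[:-2])

    def forward(self, x):
        return self.trunk(x)

# The same weight-shared extractor is applied to the initial template,
# the dynamic template and the search image.
extractor = BackboneFeatureExtractor().eval()
with torch.no_grad():
    search_feat = extractor(torch.randn(1, 3, 255, 255))
print(search_feat.shape)  # roughly (1, 2048, 8, 8) for a 255x255 crop
```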
Then the feature similarity calculator computes the similarity between the search feature and the features extracted from the initial template and the dynamic template, and selects the result with the highest similarity as the output result a. The calculator is a single convolution layer: the initial-template feature and the dynamic-template feature are each used as a convolution kernel and convolved over the search feature, which realizes the similarity computation and yields two similarity maps; the map with the larger similarity peak is taken as the final output result a.
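The similarity step can be sketched as follows, assuming the template feature acts directly as a convolution kernel over the search feature; the channel counts and spatial sizes are illustrative, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def cross_correlation(search_feat, template_feat):
    """Convolve the search feature map with the template feature map used
    as the kernel (the single-layer similarity computation described above).
    search_feat: (1, C, Hs, Ws); template_feat: (1, C, Ht, Wt)."""
    return F.conv2d(search_feat, template_feat)  # -> (1, 1, Hs-Ht+1, Ws-Wt+1)

def select_output_a(search_feat, init_feat, dyn_feat):
    """Compute both similarity maps and keep the one with the larger peak."""
    r_init = cross_correlation(search_feat, init_feat)
    r_dyn = cross_correlation(search_feat, dyn_feat)
    return r_init if r_init.max() >= r_dyn.max() else r_dyn

# Illustrative shapes only; a depthwise variant (groups=C) would keep the
# channel dimension if the later feature enhancer expects multi-channel input.
a = select_output_a(torch.randn(1, 256, 31, 31),
                    torch.randn(1, 256, 7, 7),
                    torch.randn(1, 256, 7, 7))
```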
Step 3: input a into the target prediction module to predict the center point and bounding box of the target. The target prediction module consists of a feature enhancer, a target box predictor and a target center predictor. The feature enhancer enhances the output result a from step 2 to improve the accuracy and precision of the subsequent predictions. Its network structure is built from depthwise separable convolution and CBAM attention; the complete structure is, in order, depthwise separable convolution, group normalization, ReLU activation, CBAM attention, group normalization and ReLU activation, where the depthwise separable convolution uses a 3×3 kernel. The computation of the depthwise separable convolution is reduced to 1/9 of that of the ordinary convolution used in SiamCAR. Group normalization reduces the influence of the batch-size parameter on the accuracy of the invention. CBAM attention improves accuracy at the cost of only a small additional computation; its spatial-attention convolution uses a 7×7 kernel.
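A sketch of such an enhancer is shown below; the CBAM block is a compact re-implementation of the cited attention module, and the channel count and number of normalization groups are assumptions not stated in the patent:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Compact CBAM-style attention: channel attention followed by a 7x7
    spatial attention, as assumed from the description above."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)       # channel attention
        sp = torch.cat([x.mean(dim=1, keepdim=True),
                        x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(sp))              # spatial attention

class FeatureEnhancer(nn.Module):
    """Depthwise-separable 3x3 conv -> GroupNorm -> ReLU -> CBAM ->
    GroupNorm -> ReLU (a sketch; channel count and group number are
    illustrative assumptions)."""
    def __init__(self, channels=256, groups=32):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.gn1 = nn.GroupNorm(groups, channels)
        self.cbam = CBAM(channels)
        self.gn2 = nn.GroupNorm(groups, channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.gn1(self.pointwise(self.depthwise(x))))
        x = self.cbam(x)
        return self.act(self.gn2(x))
```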
First, a is input into the feature enhancer b1, and the output feature of b1 is input into the target center predictor to obtain the center point of the target. The target center predictor is a single convolution layer with a 3×3 kernel; it takes the enhanced feature as input and outputs the predicted target center point, which determines the center position of the target.
In parallel, a is input into a feature enhancer b2 identical to b1, and the output feature of b2 is input into the target box predictor to obtain the bounding box of the target, which represents the scale of the target. The target box predictor is likewise a single convolution layer with a 3×3 kernel; it takes the enhanced feature as input and outputs the predicted bounding box, which frames the scale of the target.
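The two prediction heads could look like the following sketch; the output channel counts (1 for the center map, 4 box offsets per location, as in typical anchor-free trackers) are assumptions, since the patent only specifies a single 3×3 convolution for each head:

```python
import torch.nn as nn

class CenterHead(nn.Module):
    """Single 3x3 convolution producing a center-point response map."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

class BoxHead(nn.Module):
    """Single 3x3 convolution regressing four box offsets at each location."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Conv2d(channels, 4, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)
```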
Step 4: input the search image into the template updating module and output a prediction result b, which is linearly fused with the prediction result of step 3.2 to obtain the final center-point prediction. As shown in FIG. 1, the template updating module consists of a feature extractor, an online target predictor and a template updater. The feature extractor is the same as the one used in step 2. The online target predictor is the Classifier used in the DROL tracking algorithm, which consists of attention and ordinary convolutions in the order: ordinary convolution (1×1 kernel), CBAM attention module, ordinary convolution (4×4 kernel). The Classifier provides an additional prediction at run time to assist the target box predictor and target center predictor of step 3 and improve accuracy. The template updater combines the confidence score given by the online target predictor with an update strategy set inside it to decide whether to adopt the search image as the new dynamic template.
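A rough sketch of this 1×1 convolution - attention - 4×4 convolution layout follows; the channel counts are assumptions, and the DROL Classifier itself is not reproduced here:

```python
import torch.nn as nn

class OnlinePredictor(nn.Module):
    """Ordinary 1x1 convolution -> attention -> ordinary 4x4 convolution,
    producing a confidence map over candidate target-center positions.
    The attention stage should be the CBAM block sketched earlier; an
    nn.Identity placeholder keeps this snippet self-contained."""
    def __init__(self, in_channels=2048, mid_channels=256, attention=None):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.attn = attention if attention is not None else nn.Identity()
        self.head = nn.Conv2d(mid_channels, 1, kernel_size=4)

    def forward(self, x):
        return self.head(self.attn(self.reduce(x)))

# e.g. OnlinePredictor(attention=CBAM(256)) with the CBAM class sketched above
```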
Further, the template updating module takes the search image as input. The search feature is first input into the online target predictor to obtain a confidence score map for the target center, which has the same size as the output of the center predictor in step 3; the two are then linearly fused to obtain the final center prediction. The linear fusion formula is as follows:
O = (1 − λ)·O_cen + λ·O_C
where O is the final center prediction, O_cen is the output of the target center predictor, O_C is the output of the online target predictor, and λ ∈ [0, 1] is the fusion weight.
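In code the fusion is a single weighted sum; the value of λ below is only a placeholder, since the patent does not state the weight actually used:

```python
def fuse_center_predictions(o_cen, o_c, lam=0.5):
    """O = (1 - lam) * O_cen + lam * O_C, with lam in [0, 1]."""
    return (1 - lam) * o_cen + lam * o_c
```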
Step 5: the template updating module decides according to the update strategy whether to adopt the search image as the new dynamic template, and the procedure returns to step 2. The update strategy is as follows. First, counting from the start of the tracking method, for every T_u frames, select the search image with the highest online-target-predictor confidence score within those T_u frames, denote it c, and record the corresponding prediction result d from step 3. T_u is a positive integer denoting the dynamic-template update interval; the invention sets T_u = 5. Then compute the IOU score, i.e. the intersection-over-union between the rectangular prediction box assembled from the prediction results and the actual bounding box of the target. If the IOU is greater than the set threshold TAU_R (the invention sets TAU_R = 0.6), the search image c is adopted as the new dynamic template; otherwise the dynamic template is not updated. This update strategy distinguishes the present invention from SiamCAR, which uses only the initial template: a dynamic template is also incorporated and updated in real time during operation, and the IOU result is used as a gate to prevent samples that are harmful to tracking from polluting the subsequent tracking process. In this way the dynamic template provides the algorithm with the latest state (center position and size) of the tracked target, so that the similarity computation with the search image in step 2 can output a response with a higher peak, which strengthens the algorithm's ability to handle scenes with complex environments and large target changes and improves the success rate and accuracy of tracking.
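A sketch of this update rule is given below; how the "actual bounding box" of the target is obtained at run time is not spelled out in the text, so the reference box passed in is an assumption:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def maybe_update_template(candidates, current_template, tau_r=0.6):
    """`candidates` holds (confidence, search_image, predicted_box, reference_box)
    tuples collected over the last T_u (= 5) frames. The highest-confidence frame
    becomes the new dynamic template only if its predicted box overlaps the
    reference box with IoU greater than tau_r; otherwise the template is kept."""
    conf, image, pred_box, ref_box = max(candidates, key=lambda c: c[0])
    if iou(pred_box, ref_box) > tau_r:
        return image
    return current_template
```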
FIG. 3 compares the tracking success rate of the present invention with that of the SiamCAR method. The success-rate metric is the intersection-over-union (IOU) between the predicted bounding box and the ground-truth bounding box; the higher the IOU, the more accurate the predicted box. The horizontal axis is the IOU threshold above which tracking is considered successful, and the vertical axis is the tracking success rate over the whole video at that threshold. The value in the lower-left corner is the average tracking success rate of each method. Compared with SiamCAR, the present invention improves the average tracking success rate by 0.7%.
FIG. 4 compares the tracking precision of the present invention with that of the SiamCAR method. The precision metric is the error, in pixels, between the predicted center point and the ground-truth center point; the smaller the error, the more accurate the predicted center. The horizontal axis is the error threshold below which tracking is considered precise, and the vertical axis is the tracking precision over the whole video at that threshold. The value in the lower-left corner is the precision of each method at an error threshold of 20 pixels. Compared with SiamCAR, the present invention improves the average tracking precision by 3.1%.
TABLE 1
Table 1 compares the present invention with SiamCAR in terms of precision, success rate, computational cost and parameter count: the present invention improves tracking success rate and precision on the public dataset OTB100 while reducing the model's computation by 68.3% and its parameter count by 71%.
Claims (6)
1. A lightweight target tracking method, characterized by comprising the following steps:
step 1: acquiring a video for target tracking, defining the first frame image of the video as an initial template, copying the initial template as the initial dynamic template, and putting the initial template and the dynamic template into a set to construct a template set;
step 2: reading a search image, and inputting the template set and the search image into a preprocessing module to obtain an output result a;
step 3: inputting the output result a into a target prediction module to predict a center point and a bounding box of the target;
step 4: inputting the search feature into an online target predictor to obtain a prediction result b, and linearly fusing the prediction result b with the center-point prediction of step 3 to obtain a final center-point prediction, thereby completing target tracking;
step 5: inputting the search image and the prediction result b from step 4 into a template updater, the template updater determining according to an update strategy whether to adopt the search image as a new dynamic template, thereby updating the dynamic template.
2. The lightweight target tracking method according to claim 1, wherein in step 1, each frame image except the first frame is used in turn as a search image for target tracking.
3. The lightweight target tracking method according to claim 1, wherein the specific process by which the preprocessing module obtains the output result a in step 2 is as follows:
step 2.1: in the preprocessing module, extracting features from the template set with a feature extractor t1 to obtain an initial feature and a dynamic feature, and extracting features from the search image with a feature extractor t2 identical to t1 to obtain a search feature;
step 2.2: a feature similarity calculator computing the similarity between each of the initial feature and the dynamic feature and the search feature, and selecting the result with the highest similarity as the output result a.
4. The lightweight target tracking method according to claim 3, wherein in step 2.1, the two feature extractors t1 and t2 are based on ResNet50 with the last network layer of ResNet50 removed.
5. The lightweight target tracking method according to claim 4, wherein the specific process of step 3 is as follows:
step 3.1: in the target prediction module, inputting a into a feature enhancer b1, and then inputting the output feature of b1 into a target box predictor to obtain the bounding box of the target;
step 3.2: inputting a into a feature enhancer b2 identical to b1, and then inputting the output feature of b2 into a target center predictor to obtain the center point of the target.
6. The lightweight target tracking method according to any one of claims 1 to 5, wherein in step 5 the update strategy is specifically as follows: first, counting from the start of the target tracking method, for every T_u frames, the search image with the highest confidence score from the online target predictor within those T_u frames is selected and denoted c, and the prediction result d of the corresponding search image c in step 3 is recorded, T_u being a positive integer representing the dynamic-template update interval; secondly, the intersection-over-union between the rectangular prediction box assembled from the prediction results and the actual bounding box of the target is computed, and if the intersection-over-union is greater than a set threshold TAU_R, the search image c is adopted as the new dynamic template; otherwise the dynamic template is not updated.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310968693.2A CN117011337A (en) | 2023-08-03 | 2023-08-03 | A lightweight target tracking method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310968693.2A CN117011337A (en) | 2023-08-03 | 2023-08-03 | A lightweight target tracking method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117011337A true CN117011337A (en) | 2023-11-07 |
Family
ID=88563165
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310968693.2A Pending CN117011337A (en) | 2023-08-03 | 2023-08-03 | A lightweight target tracking method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117011337A (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190320205A1 (en) * | 2016-12-26 | 2019-10-17 | Huawei Technologies Co., Ltd. | Template matching-based prediction method and apparatus |
| CN113129335A (en) * | 2021-03-25 | 2021-07-16 | 西安电子科技大学 | Visual tracking algorithm and multi-template updating strategy based on twin network |
| CN114387459A (en) * | 2022-01-26 | 2022-04-22 | 桂林电子科技大学 | Single target tracking method for dynamic double-template updating and storage medium |
| US20220309686A1 (en) * | 2021-03-25 | 2022-09-29 | Samsung Electronics Co., Ltd. | Method and apparatus with object tracking |
| CN115908492A (en) * | 2022-11-16 | 2023-04-04 | 厦门美图之家科技有限公司 | Target tracking method, system, terminal and storage medium |
- 2023-08-03: application CN202310968693.2A filed in CN, published as CN117011337A; status: Pending
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190320205A1 (en) * | 2016-12-26 | 2019-10-17 | Huawei Technologies Co., Ltd. | Template matching-based prediction method and apparatus |
| CN113129335A (en) * | 2021-03-25 | 2021-07-16 | 西安电子科技大学 | Visual tracking algorithm and multi-template updating strategy based on twin network |
| US20220309686A1 (en) * | 2021-03-25 | 2022-09-29 | Samsung Electronics Co., Ltd. | Method and apparatus with object tracking |
| CN114387459A (en) * | 2022-01-26 | 2022-04-22 | 桂林电子科技大学 | Single target tracking method for dynamic double-template updating and storage medium |
| CN115908492A (en) * | 2022-11-16 | 2023-04-04 | 厦门美图之家科技有限公司 | Target tracking method, system, terminal and storage medium |
Non-Patent Citations (2)
| Title |
|---|
| 董吉富; 刘畅; 曹方伟; 凌源; 高翔: "Online adaptive Siamese network tracking algorithm based on attention mechanism", Laser & Optoelectronics Progress, no. 02, 25 January 2020 (2020-01-25), pages 320 - 328 * |
| 黄寅佐: "Optimization and research of a target tracking algorithm based on SiamCAR", Wanfang dissertation database, 28 February 2025 (2025-02-28), pages 12 - 64 * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111814889B (en) | A single-stage object detection method using anchor-free module and enhanced classifier | |
| CN113256677A (en) | Method for tracking visual target with attention | |
| CN112884742A (en) | Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method | |
| CN114926498B (en) | Rapid target tracking method based on space-time constraint and leachable feature matching | |
| CN114120202B (en) | Multi-scale target model and feature fusion-based semi-supervised video target segmentation method | |
| CN113902991A (en) | Twin network target tracking method based on cascade characteristic fusion | |
| CN116091979A (en) | An Object Tracking Method Based on Feature Fusion and Channel Attention | |
| CN113963204B (en) | A twin network target tracking system and method | |
| CN113642498A (en) | Video target detection system and method based on multilevel space-time feature fusion | |
| CN115439722B (en) | A 3D CAD Model Retrieval Method Based on Image and Attribute Graph Fusion Network | |
| CN116311522B (en) | A continuous sign language recognition method integrating cross-modal alignment auxiliary tasks | |
| Liu et al. | Traffic sign recognition algorithm based on improved YOLOv5s | |
| CN119152193B (en) | A YOLO target detection method and system based on differentiable architecture search | |
| CN114973202B (en) | A traffic scene obstacle detection method based on semantic segmentation | |
| CN113989655A (en) | Radar or sonar image target detection and classification method based on automatic deep learning | |
| CN113129332B (en) | Method and device for performing target object tracking | |
| CN118196484B (en) | Visual place recognition method and system based on image enhancement and scene semantic optimization | |
| CN117115474A (en) | An end-to-end single target tracking method based on multi-stage feature extraction | |
| CN117809306A (en) | Industrial character recognition method, device and storage medium based on small sample target detection | |
| CN119068016A (en) | A RGBT target tracking method based on modality-aware feature learning | |
| CN117011655A (en) | Feature fusion method, target tracking method and system based on adaptive region selection | |
| CN119169056B (en) | Satellite video target tracking method, device and equipment based on refined positioning | |
| CN113963021A (en) | A single target tracking method and system based on spatiotemporal features and position changes | |
| CN117011337A (en) | A lightweight target tracking method | |
| CN118332406A (en) | A high-accuracy single-stage object detection method that integrates sample localization quality scores and classification confidence scores |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination |