Background
Visual object tracking is a fundamental task in computer vision that aims to estimate the position of a specified object throughout a video sequence from an initial state. Existing tracking benchmarks such as LaSOT, LaSOT_ext, TrackingNet, and UAV123 generally provide an accurately labeled bounding box in the first frame for tracking initialization. However, accurately annotating the initial bounding box inevitably introduces a delay on the order of seconds in a real scene. In contrast, identifying the target with a single click requires only millisecond-level delay; it is far more efficient than bounding-box annotation and better suited to human-computer interaction, intelligent robot systems, and other practical applications.
The information gap introduced by a single click presents a significant challenge compared to a precisely labeled bounding box. First, conventional trackers such as SiamFC, ATOM, TransT, STARK, and SwinTrack are typically initialized by cropping the reference area at a fixed scale, whereas single-click initialization relies only on the coordinates of one representative location and lacks scale and boundary information. Scaling the entire frame to 256×256 or 384×384 may cause a loss of key appearance information. This lack of scale and boundary cues makes click-based initialization more challenging than bounding-box-based initialization. Second, the initial bounding box predicted by a click-based model may contain positioning errors that accumulate during subsequent tracking and cause severe tracking failures.
In practice, click-based tracking typically combines a tracking model with a detector such as YOLO or an interaction model such as SAM or SAM2. In the first frame, the detector or interaction model first generates an initial target bounding box from the single click location, after which the tracking model performs subsequent tracking. However, such a modular tracking framework has several limitations. First, the additional detection or interaction model increases the parameter count, raising memory and computational overhead. Second, target localization requires post-processing: both the detection model and the interaction model may generate multiple candidate boxes, from which the final bounding box is selected by predefined rules, which can increase the error of the initialization result. Compared with tracking initialized by an accurate bounding box, click-based tracking shows a notable performance gap and is difficult to deploy in real scenes. Nonetheless, little research has focused on building robust click-based tracking methods.
Parameter-Efficient Fine-Tuning (PEFT) significantly reduces the cost of adapting a large model to downstream tasks by freezing the pre-trained model parameters and optimizing only a small number of additional modules. Adapter-based approaches insert small bottleneck structures between Transformer layers to perform task-specific learning (Houlsby et al., "Parameter-Efficient Transfer Learning for NLP", 2019). LoRA introduces a low-rank matrix decomposition strategy that compresses the parameter updates to below 0.01% of the original model (Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", 2021). Prompt tuning guides the model by adding a learnable vector to the input sequence (Lester et al., "The Power of Scale for Parameter-Efficient Prompt Tuning", 2021). To strengthen PEFT, researchers have designed Mixture-of-Experts (MoE) architectures (Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity", 2022), in which a gating mechanism dynamically routes data from different tasks to multiple experts, yielding significantly better performance in multi-task learning scenarios than a single adapter. Building on these approaches, a click-guided expert selection mechanism is constructed to adaptively fuse the target initialization and tracking tasks without increasing the burden on the main model.
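To make the low-rank update idea concrete, the following is a minimal PyTorch sketch of a LoRA-style linear layer: the pre-trained weights stay frozen and only a small down/up projection pair is trained. The class name, rank, and scaling factor are illustrative assumptions, not part of the invention.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update (illustrative sketch)."""
    def __init__(self, in_features, out_features, rank=4, alpha=8.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # pre-trained weights stay frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(in_features, rank, bias=False)   # trainable down-projection
        self.lora_b = nn.Linear(rank, out_features, bias=False)  # trainable up-projection
        nn.init.zeros_(self.lora_b.weight)        # update starts at zero, so the base behavior is preserved
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```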
Disclosure of Invention
The invention aims to provide an interactive target recognition and tracking method which combines a Transformer structure, a gated Mixture-of-Experts (MoE) network, and a multi-stage bounding box optimization mechanism, completes high-quality target recognition and tracking with only a single click from the user, and offers both high efficiency and high precision.
The technical scheme of the invention is as follows:
an interactive target recognition and tracking algorithm for intelligent robots comprises the following steps:
step 1, manual click initialization and input preprocessing;
A single manual click in the initial frame of the video serves as the starting point: any point on the target in the initial frame is clicked and recorded, then normalized to the [0, 1] interval and expressed as the click coordinates. Simultaneously, the initial video frame is scaled to a fixed size of 320×320 and recorded as the input image. The image and the click coordinates are taken as joint input, providing the target prompt for the manual-click initialization;
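A minimal preprocessing sketch consistent with step 1, assuming OpenCV-style frames and the 320×320 input size stated above; the function and variable names are illustrative.

```python
import cv2
import numpy as np

def preprocess_click_input(frame, click_xy, size=320):
    """Scale the initial frame to size x size and normalize the click to [0, 1] (illustrative)."""
    h, w = frame.shape[:2]
    image = cv2.resize(frame, (size, size))   # fixed-size model input
    cx = click_xy[0] / w                      # normalize x by the original frame width
    cy = click_xy[1] / h                      # normalize y by the original frame height
    return image, (cx, cy)
```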
step 2, performing target coarse positioning by using a multitasking unified model;
The multitasking unified model mainly comprises an image encoder, a coordinate encoder, a feature encoder, a gating expert network and a prediction network;
The image encoder consists of a single convolutional layer with a stride of 16×16, which converts the image into a deep feature representation;
The coordinate encoder consists of a two-layer linear feed-forward network, which converts the click coordinates into a deep feature representation;
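A minimal sketch of the two encoders described above: a single convolution with stride 16 acting as a patch embedding, and a two-layer feed-forward network that embeds the normalized click coordinates. The embedding dimension of 768 is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Single convolutional layer with a 16x16 stride, producing image tokens (illustrative)."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

    def forward(self, image):
        feat = self.proj(image)                   # (B, C, H/16, W/16)
        return feat.flatten(2).transpose(1, 2)    # (B, num_patches, C) token sequence

class CoordinateEncoder(nn.Module):
    """Two-layer linear feed-forward network embedding the normalized click coordinates (illustrative)."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, click_xy):
        return self.mlp(click_xy).unsqueeze(1)    # (B, 1, C) coordinate token
```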
The feature encoder adopts the backbone network of the visual tracking network OSTrack and is formed by stacking N Transformer layers equipped with a gated expert network;
The Transformer structure consists of an attention mechanism and a linear feed-forward layer, where the attention mechanism is defined as follows:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where Q, K, and V respectively denote the query, key, and value of the input feature vectors, and $d_k$ denotes the dimension of the feature vectors;
the Transformer structure with the gated expert network consists, in order, of an attention mechanism, the gated expert network, and a linear feed-forward layer. When the deep feature representations pass through such a Transformer structure in the feature encoder, the attention mechanism first fuses the location information from the coordinate features with the image information from the image features; the fused features then pass through the gated expert network, which consists of 3 expert sub-networks and a gating module, where each expert sub-network is a two-layer linear feed-forward network and the gating module dynamically assigns features to experts according to their content so that only part of the experts are activated for inference; the features finally pass through a linear feed-forward layer;
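A minimal sketch of such a Transformer layer with a gated expert network, following the order stated above (attention, then gated MoE, then feed-forward). Three experts and top-1 routing are assumptions consistent with the description; dimensions, layer norms, and the dense expert loop are illustrative simplifications.

```python
import torch
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    """Attention -> gated mixture of 3 experts -> feed-forward layer (illustrative sketch)."""
    def __init__(self, dim=768, num_heads=12, num_experts=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim, num_experts)                 # gating module
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts))                        # two-layer FFN experts
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens):
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]  # fuse image and click tokens
        x = self.norm2(tokens)
        weights = torch.softmax(self.gate(x), dim=-1)                 # per-token expert weights
        top_w, top_idx = weights.max(dim=-1, keepdim=True)            # top-1 routing: one expert per token
        moe_out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):                     # dense loop kept simple for clarity
            mask = (top_idx == i).to(tokens.dtype)
            moe_out = moe_out + mask * top_w * expert(x)
        tokens = tokens + moe_out
        return tokens + self.ffn(self.norm3(tokens))
```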
the prediction network consists of three prediction heads, which respectively predict a probability distribution map of the click target's center point coordinates, the coordinate offset of the center point, and the size of the click target; the rough bounding box of the click target is obtained from the outputs of the three heads, i.e., the approximate spatial location in the image corresponding to the click coordinates, and is described by the center coordinates of the click target together with its width and height.
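A minimal sketch of turning the three head outputs into a rough bounding box by taking the most likely center cell and reading its offset and size, in the style of OSTrack-like center-based heads. Tensor shapes and names are assumptions for illustration.

```python
import torch

def decode_bbox(score_map, offset_map, size_map):
    """score_map: (B, 1, H, W) center probabilities; offset_map/size_map: (B, 2, H, W) (illustrative)."""
    b, _, h, w = score_map.shape
    idx = score_map.flatten(1).argmax(dim=1)              # most likely center cell per sample
    ys, xs = idx // w, idx % w
    batch = torch.arange(b, device=score_map.device)
    ox, oy = offset_map[batch, 0, ys, xs], offset_map[batch, 1, ys, xs]
    bw, bh = size_map[batch, 0, ys, xs], size_map[batch, 1, ys, xs]
    cx, cy = (xs + ox) / w, (ys + oy) / h                 # normalized center coordinates
    return torch.stack([cx, cy, bw, bh], dim=1)           # rough box as (cx, cy, w, h)
```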
Step 3, constructing the input information for the multi-modal prompt optimization stage, which specifically comprises the following steps:
After the rough bounding box is obtained, a fuzzy search area and reference prompts of two different modalities are further constructed as auxiliary inputs to refine the initial bounding box:
The fuzzy search area is centered on the rough bounding box and covers 4 times its width and height, constituting an image region that contains the contextual background;
The visual reference prompt is centered on the rough bounding box, which is expanded by a factor of 2, focusing on the click target from step 1 and its immediate vicinity;
The spatial position reference prompt is obtained by converting the click coordinates in the initial video frame from step 1 into relative position coordinates with respect to the center point of the fuzzy search area; these relative coordinates serve as the spatial location reference of the click target within the fuzzy search area.
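A minimal sketch of constructing the three prompts in step 3, assuming the rough box is given in normalized (cx, cy, w, h) form and using a simple crop helper; the 4x and 2x factors follow the text, while the helper names and coordinate conventions are illustrative assumptions.

```python
import numpy as np

def crop_region(frame, cx, cy, w, h, scale):
    """Crop a scale-times enlarged region around the (normalized) box center (illustrative)."""
    H, W = frame.shape[:2]
    rw, rh = w * scale * W, h * scale * H
    x0, y0 = int(max(cx * W - rw / 2, 0)), int(max(cy * H - rh / 2, 0))
    x1, y1 = int(min(cx * W + rw / 2, W)), int(min(cy * H + rh / 2, H))
    return frame[y0:y1, x0:x1]

def build_prompts(frame, rough_box, click_xy):
    cx, cy, w, h = rough_box
    search_area = crop_region(frame, cx, cy, w, h, scale=4.0)   # fuzzy search area (4x box)
    visual_ref = crop_region(frame, cx, cy, w, h, scale=2.0)    # visual reference prompt (2x box)
    rel_click = (click_xy[0] - cx, click_xy[1] - cy)            # click relative to the search-area center
    return search_area, visual_ref, rel_click
```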
Step 4, accurately positioning and optimizing the bounding box based on the multi-modal prompts;
The fuzzy search area, the visual reference prompt, and the spatial position reference prompt are jointly fed into the multitasking unified model from step 2 again, with the task objective changed from coarse positioning to multi-modal prompt optimization. At the input end, the multitasking unified model extracts three feature streams from these inputs, denoted respectively as the fuzzy search area features, the visual reference area features, and the spatial position reference features; the fuzzy search area features provide long-range contextual cues, the visual reference area features provide high-confidence target structure information, and the spatial position reference features guide attention to the core region of the target;
The prediction network of the multitasking unified model finally outputs the optimized click target bounding box, which serves as the final annotation of the initial video frame.
Step 5, video target tracking with the Transformer-based visual tracking network OSTrack;
(1) Cropping the search area of the current frame: for the t-th frame of the video, a fixed-size search area is constructed centered on the bounding box of the target tracking result in frame t-1; when t = 2, this bounding box is the initialization result obtained in step 4;
(2) Template and search feature extraction: the click target bounding box obtained in step 4 is used to crop the user-clicked target from the initial video frame as the tracking template T; the template T and the search area are fed together into the image encoder to obtain template features and search-region features;
(3) The template features and search-area features are matched and fused through the Transformer structure, capturing the temporal consistency of the target;
(4) Bounding box prediction: the OSTrack prediction network predicts the target bounding box in the current frame, which serves as the tracking result of the current frame and is also used in step 5 (1) to crop the search region of frame t+1.
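A minimal sketch of the frame-by-frame propagation in step 5, assuming a hypothetical `tracker` object that exposes `init(template)` and `track(search_region)` and returns boxes in normalized full-frame coordinates; this interface is an illustrative assumption, not the actual OSTrack API. The `crop_region` helper and the crop factors reuse the assumptions from the sketch after step 3.

```python
def track_video(frames, init_box, tracker, crop_region):
    """Propagate the optimized initial box through the video frame by frame (illustrative interface)."""
    template = crop_region(frames[0], *init_box, scale=1.0)   # template cropped from the initial frame box
    tracker.init(template)
    results, prev_box = [init_box], init_box
    for frame in frames[1:]:
        search = crop_region(frame, *prev_box, scale=4.0)     # search area centered on the previous result
        prev_box = tracker.track(search)                      # predicted box for the current frame
        results.append(prev_box)
    return results
```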
The invention has the beneficial effects that:
(1) The annotation method provided by the invention has the notable advantage of low-cost initialization. Compared with the traditional initialization mode of manually and finely annotating an initial bounding box, the method completes target localization with only a single click in the initial video frame, reducing manual annotation cost and operational complexity. At the same time, the simplified interaction significantly shortens the time overhead of the initialization phase.
(2) A staged reasoning mechanism is adopted to decouple coarse target positioning from the subsequent fine optimization based on multi-modal prompts. By introducing contextual information of different modalities at different stages, the target position is progressively refined, effectively improving the positioning accuracy and stability of the bounding box and providing a solid foundation for high-quality automatic annotation.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings and technical schemes.
Fig. 1 is a schematic diagram of the Transformer-based tracker OSTrack, which is composed of an image encoder, a coordinate encoder, a feature encoder, a gating expert network, and a prediction network. The feature encoder is a sequential stack of 12 standard Transformer layers. Compared with the commonly used correlation operation, the Transformer module improves the fusion of template features and search-area features. The prediction network extracts the center coordinates and size information from the fused features output by the feature encoder to obtain the coordinates of the target bounding box in the input picture.
Fig. 2 is a schematic diagram of the multitasking unified model structure, and Fig. 3 is a flowchart of the interactive target recognition and tracking method. The process takes a single click from a person in the initial frame as the starting point, normalizes the click point, and feeds it together with the image scaled to a uniform size into the network, providing a semantic prompt of the target. The multitasking unified model adopts OSTrack as the base model, and its feature encoder is formed by stacking Transformer layers with a gated Mixture-of-Experts (MoE) network, realizing joint modeling and efficient inference over image and position information. The gated expert network consists of three expert sub-networks and a gating module that dynamically assigns routing paths for the input; each expert sub-network is a two-layer linear feed-forward network, improving computational efficiency and expressive capacity.
After the bounding box is preliminarily predicted, the system builds multi-modal auxiliary prompts, including a fuzzy search area expanded by a factor of 4 and a visual reference area expanded by a factor of 2, and converts the click point into relative position coordinates to serve as the spatial position prompt. The multi-modal prompt information and the image are fed into the multitasking unified model again, the features from different modalities and scales are fused by means of the Transformer attention mechanism, multi-modal joint optimization of the bounding box is realized, and an initialization bounding box with higher accuracy is output. Finally, whole-video target propagation is achieved through the OSTrack tracker: a template is cropped from the optimized bounding box, fused and matched with the search-region features of each frame, and the target position is predicted frame by frame, completing full-video target tracking. The whole system is highly extensible and stable in tracking, and is suitable for real-scene applications.
OSTrack in the multitasking unified model can be replaced with any other Transformer-based tracker. The coarse prediction of the first stage and the multi-modal prompt optimization of the second stage are trained in stages to ensure higher performance gains. The training set uses all video sequences of the LaSOT and GOT-10k datasets. The optimizer is AdamW, the initial learning rate is set to 0.0001, the learning rate is reduced by a factor of 10 every 400 training epochs, and the total number of training epochs is 800.
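A minimal sketch of the optimizer and learning-rate schedule stated above (AdamW, initial learning rate 1e-4, decayed by a factor of 10 every 400 epochs, 800 epochs in total); the `model` and `train_one_epoch` names are placeholders, not part of the described implementation.

```python
import torch

def build_training(model):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # divide the learning rate by 10 every 400 epochs, for 800 epochs in total
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=400, gamma=0.1)
    return optimizer, scheduler

# usage sketch:
# optimizer, scheduler = build_training(model)
# for epoch in range(800):
#     train_one_epoch(model, optimizer)   # placeholder training step
#     scheduler.step()
```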
The feature extraction and fusion network in the Transformer-based tracker OSTrack processes template and search-region features in a single backbone: the two sets of tokens are concatenated and passed jointly through the stacked Transformer layers described above, so that feature extraction and matching are performed in one pass.
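A minimal sketch of this one-stream extraction and fusion pattern, in which template and search-region tokens share the Transformer stack; the layer count, dimensions, and class name are illustrative assumptions rather than the exact OSTrack configuration.

```python
import torch
import torch.nn as nn

class JointFeatureBackbone(nn.Module):
    """Template and search tokens are concatenated and processed by one Transformer stack (illustrative)."""
    def __init__(self, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, template_tokens, search_tokens):
        tokens = torch.cat([template_tokens, search_tokens], dim=1)   # joint token sequence
        fused = self.encoder(tokens)
        return fused[:, template_tokens.shape[1]:]                    # keep search-region tokens for prediction
```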