
CN120526427A - An interactive target recognition and tracking algorithm for intelligent robots - Google Patents

An interactive target recognition and tracking algorithm for intelligent robots

Info

Publication number
CN120526427A
CN120526427A (application CN202511013001.4A)
Authority
CN
China
Prior art keywords
target
click
network
bounding box
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202511013001.4A
Other languages
Chinese (zh)
Other versions
CN120526427B (en)
Inventor
王栋
袁永胜
赵洁
刘洋
卢湖川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202511013001.4A priority Critical patent/CN120526427B/en
Publication of CN120526427A publication Critical patent/CN120526427A/en
Application granted granted Critical
Publication of CN120526427B publication Critical patent/CN120526427B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/809 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/811 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract


The present invention belongs to the technical fields of machine learning, interactive video target recognition, and single-target visual tracking, and proposes an interactive target recognition and tracking algorithm for intelligent robots. The method takes a manual single-point click in the initial frame as the starting point, extracts deep features of the image and the spatial position prompt through a Transformer structure, uses a gated expert network to dynamically fuse the multimodal information, and predicts the initial bounding box of the target. On this basis, multimodal reference prompts, including visual prompts and spatial position prompts, are constructed to progressively refine the bounding box. The optimized bounding box can then serve as the initial condition for a tracker to propagate the target through the video sequence. The method offers low initialization cost, high annotation accuracy, a lightweight network structure, and strong generalization. It is suitable for scenarios such as intelligent robot interaction in real environments, and effectively improves the practicality and scalability of interactive tracking systems.

Description

Interactive target recognition and tracking algorithm for intelligent robot
Technical Field
The invention belongs to the field of machine learning, interactive video target initialization and single target visual tracking, and relates to an interactive target recognition and tracking algorithm for an intelligent robot.
Background
Visual object tracking is a fundamental task in computer vision that aims to estimate the position of a specified object in a video sequence from an initial state. Existing tracking benchmark datasets such as LaSOT, LaSOT_ext, TrackingNet, and UAV123 generally provide accurately labeled bounding boxes for the first frame to initialize tracking. However, accurately marking the initial bounding box inevitably introduces a delay on the order of seconds in a real scene. In contrast, identifying the target with a single click requires only millisecond-level delay; its efficiency is far higher than bounding-box labeling, making it better suited to practical applications such as human-computer interaction and intelligent robot systems.
The information gap introduced by a single click presents a significant challenge compared with a finely labeled bounding box. First, conventional trackers such as SiamFC, ATOM, TransT, STARK, and SwinTrack are typically initialized by cropping the reference area at a fixed scale, whereas single-click initialization relies on the coordinates of only a single location for targeting and lacks scale and boundary information. Scaling the entire frame to 256×256 or 384×384 may lose key appearance information. This lack of scale and boundary cues makes click-based initialization more challenging than bounding-box-based initialization. Second, the initial bounding box predicted from a click may contain positioning errors that accumulate during subsequent tracking and lead to serious tracking failures.
In practice, click-based tracking typically combines a tracking model with a detector (e.g., the YOLO series) or an interaction model (e.g., SAM, SAM2). In the first frame, the detector or interaction model generates an initial target bounding box from the single click location, and the tracking model then performs subsequent tracking. However, such a modular framework has several limitations. First, the additional detection or interaction model increases the parameter count, memory footprint, and computational overhead. Second, target localization requires post-processing: both detection and interaction models may generate multiple candidate boxes, from which the final bounding box is selected by predefined rules, possibly increasing the error of the initialization result. Compared with tracking initialized by an accurate bounding box, click-based tracking exhibits a notable performance gap and is difficult to apply in real scenes. Nonetheless, little research has focused on robust click-based tracking methods.
Parameter-Efficient Fine-Tuning (PEFT) significantly reduces the cost of adapting a large model to downstream tasks by freezing the pre-trained model parameters and optimizing only a small number of additional modules. Adapter-based approaches insert small bottleneck structures between Transformer layers to implement task-specific learning (Houlsby N. et al., "Parameter-Efficient Transfer Learning for NLP", 2019). LoRA introduces a low-rank matrix decomposition strategy that compresses parameter updates to below 0.01% of the original model (Hu E. J. et al., "LoRA: Low-Rank Adaptation of Large Language Models", 2021). Prompt tuning guides the model by adding learnable vectors to the input sequence (Lester et al., "The Power of Scale for Parameter-Efficient Prompt Tuning", 2021). To enhance the capabilities of PEFT, researchers have designed mixture-of-experts (MoE) architectures (Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity", 2022). MoE dynamically routes data from different tasks to multiple experts via a gating mechanism, significantly improving performance in multi-task learning compared with a single adapter. Building on these approaches, the present method constructs a click-guided expert selection mechanism that realizes adaptive fusion of the target initialization and tracking tasks without increasing the burden on the main model.
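For readers unfamiliar with the bottleneck structure shared by these PEFT methods (and reused later as the expert sub-networks), the following is a minimal PyTorch sketch of a Houlsby-style adapter; the dimensions (768 hidden, 64 bottleneck) are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class AdapterBottleneck(nn.Module):
    """Bottleneck adapter in the style of Houlsby et al. (2019): down-project,
    non-linearity, up-project, residual connection. Only these few parameters
    are trained while the backbone Transformer stays frozen."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Example: adapt a batch of 196 image tokens with 768 channels.
tokens = torch.randn(2, 196, 768)
adapter = AdapterBottleneck()
print(adapter(tokens).shape)  # torch.Size([2, 196, 768])
```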
Disclosure of Invention
The invention aims to provide an interactive target recognition and tracking method which combines a Transformer structure, a gated expert network (MoE), and a multi-stage bounding box optimization mechanism. It can complete high-quality target recognition and tracking with only a single user click, offering both high efficiency and high precision.
The technical scheme of the invention is as follows:
An interactive target recognition and tracking algorithm for intelligent robots comprises the following steps:
Step 1: manual click initialization and input preprocessing.
A manual single click in the initial video frame serves as the starting point: the user clicks an arbitrary point on the target in the initial frame, and the click is normalized and expressed as click coordinates. The initial video frame is simultaneously scaled to a fixed size of 320×320 and recorded as the input image. The image and the click coordinates are provided as joint input, supplying the target prompt for the manual click initialization.
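A minimal preprocessing sketch (NumPy + Pillow), assuming the click arrives in pixel coordinates of the original frame; the normalization convention (to the unit interval by dividing by the frame width and height) is an assumption not fixed by the text.

```python
import numpy as np
from PIL import Image

def preprocess(frame: np.ndarray, click_xy: tuple, size: int = 320):
    """Normalize a pixel-space click and resize the frame to a fixed
    size x size input image, forming the joint (image, click) input."""
    h, w = frame.shape[:2]
    click_norm = np.array([click_xy[0] / w, click_xy[1] / h], dtype=np.float32)
    image = np.asarray(Image.fromarray(frame).resize((size, size)))
    return image, click_norm

frame = np.zeros((480, 640, 3), dtype=np.uint8)        # stand-in initial frame
image, click = preprocess(frame, click_xy=(400, 120))  # user clicked pixel (400, 120)
print(image.shape, click)                              # (320, 320, 3) [0.625 0.25]
```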
Step 2: coarse target localization using a multi-task unified model.
The multi-task unified model mainly comprises an image encoder, a coordinate encoder, a feature encoder, a gated expert network, and a prediction network.
The image encoder consists of a single convolutional layer with a stride of 16×16 that converts the image into a depth feature representation.
The coordinate encoder consists of a two-layer linear feed-forward network that converts the click coordinates into a depth feature representation.
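A minimal sketch of these two encoders under assumed dimensions (320×320 input, 16×16 patch embedding, 768-dimensional features); the exact channel sizes are not specified in the text.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Single convolution with 16x16 kernel and stride (a patch embedding):
    a 320x320x3 image becomes a 20x20 grid of 768-dim tokens."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.proj(img)                   # (B, 768, 20, 20)
        return x.flatten(2).transpose(1, 2)  # (B, 400, 768) token sequence

class CoordinateEncoder(nn.Module):
    """Two-layer feed-forward network mapping normalized (x, y) click
    coordinates to a single 768-dim token."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, click: torch.Tensor) -> torch.Tensor:
        return self.mlp(click).unsqueeze(1)  # (B, 1, 768)

img = torch.randn(1, 3, 320, 320)
click = torch.tensor([[0.42, 0.58]])
tokens = torch.cat([CoordinateEncoder()(click), ImageEncoder()(img)], dim=1)
print(tokens.shape)  # torch.Size([1, 401, 768]): click token + image tokens
```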
The feature encoder adopts the backbone of the visual tracking network OSTrack and is formed by stacking N Transformer layers equipped with a gated expert network.
The Transformer structure consists of an attention mechanism and a linear feed-forward layer, where the attention mechanism is defined as

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where Q, K, and V respectively denote the query, key, and value of the input feature vectors, and $d_k$ denotes the dimension of the feature vectors.
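A minimal implementation of this scaled dot-product attention (single head, no masking; shapes are assumed) for reference.

```python
import torch

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    q, k, v have shape (batch, tokens, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, tokens, tokens)
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(1, 401, 768)  # e.g. the click token plus 400 image tokens
out = attention(x, x, x)      # self-attention over the joint sequence
print(out.shape)              # torch.Size([1, 401, 768])
```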
The Transformer structure with the gated expert network comprises, in order, an attention mechanism, the gated expert network, and a linear feed-forward layer. When the depth feature representations pass through such a layer in the feature encoder, the attention mechanism first mutually fuses the location information in the coordinate feature with the image information in the image feature; the fused representations then pass through the gated expert network, which consists of 3 expert sub-networks and a gating module, each expert sub-network being a two-layer linear feed-forward network; the gating module dynamically assigns computation paths according to the content of the depth feature representations, activating only a subset of the experts for inference; finally, the representations pass through a linear feed-forward layer.
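The sketch below illustrates such a gated expert network with 3 two-layer feed-forward experts; top-1 routing per token is an assumption made for illustration, since the text only states that a subset of experts is activated.

```python
import torch
import torch.nn as nn

class GatedExpertNetwork(nn.Module):
    """Mixture-of-experts layer: 3 two-layer feed-forward experts plus a gating
    module that routes each token to its top-scoring expert (sparse activation)."""
    def __init__(self, dim: int = 768, hidden: int = 3072, num_experts: int = 3):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        weights = torch.softmax(self.gate(x), dim=-1)  # (B, T, num_experts)
        top_w, top_idx = weights.max(dim=-1)           # top-1 routing per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                        # tokens routed to expert e
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return x + out                                 # residual connection (assumed)

tokens = torch.randn(1, 401, 768)
print(GatedExpertNetwork()(tokens).shape)  # torch.Size([1, 401, 768])
```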
The prediction network consists of three prediction heads that respectively predict a probability distribution map of the click target's center point, the coordinate offset of the center point, and the size of the click target. The outputs of the three heads jointly yield the rough bounding box of the click target, i.e., the approximate spatial location in the image of the target corresponding to the click coordinates; the box is parameterized by its center coordinates and by the width and height of the click target.
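A sketch of how such a three-head output can be decoded into a box. The feature-map size and the decoding convention (peak of the center map plus sub-cell offset, as in common center-based heads) are assumptions.

```python
import torch

def decode_box(score_map: torch.Tensor, offset_map: torch.Tensor,
               size_map: torch.Tensor) -> torch.Tensor:
    """Decode (cx, cy, w, h) in normalized coordinates from a center-probability
    map (B, 1, H, W), an offset map (B, 2, H, W) and a size map (B, 2, H, W)."""
    b, _, h, w = score_map.shape
    idx = score_map.flatten(2).argmax(dim=-1).squeeze(1)      # peak cell per sample
    ys, xs = idx // w, idx % w
    off = offset_map.flatten(2)[torch.arange(b), :, idx]      # (B, 2) sub-cell offset
    sz = size_map.flatten(2)[torch.arange(b), :, idx]         # (B, 2) width / height
    cx = (xs.float() + off[:, 0]) / w
    cy = (ys.float() + off[:, 1]) / h
    return torch.stack([cx, cy, sz[:, 0], sz[:, 1]], dim=1)   # (B, 4)

score = torch.rand(1, 1, 20, 20)
offset = torch.rand(1, 2, 20, 20)
size = torch.rand(1, 2, 20, 20) * 0.3
print(decode_box(score, offset, size))  # e.g. a (1, 4) tensor of box parameters
```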
Step 3: constructing the input of the multi-modal prompt optimization stage, specifically as follows.
After the rough bounding box is obtained, a fuzzy search area and reference prompts of two different modalities are further constructed as auxiliary input to refine the initial bounding box:
Fuzzy search area: centered on the rough bounding box and expanded to 4 times its width and height, forming an image region that contains contextual background.
Visual reference prompt: centered on the rough bounding box and expanded to 2 times its size, focusing on the target clicked in Step 1 and its immediate vicinity.
Spatial position reference prompt: the click coordinates in the initial frame from Step 1 are converted into relative position coordinates based on the center point of the fuzzy search area, serving as the spatial location reference of the clicked target within the fuzzy search area.
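A sketch of constructing the three prompts from a rough box given in pixels; the clipping behavior at the image border and the normalization of the relative click by the search-area size are assumptions.

```python
import numpy as np

def expand_crop(image: np.ndarray, box: tuple, factor: float) -> np.ndarray:
    """Crop a region centered on box = (cx, cy, w, h) in pixels, with width and
    height expanded by `factor`, clipped to the image bounds."""
    h_img, w_img = image.shape[:2]
    cx, cy, w, h = box
    half_w, half_h = factor * w / 2, factor * h / 2
    x0, x1 = int(max(cx - half_w, 0)), int(min(cx + half_w, w_img))
    y0, y1 = int(max(cy - half_h, 0)), int(min(cy + half_h, h_img))
    return image[y0:y1, x0:x1]

def relative_click(click_xy: tuple, box: tuple, factor: float = 4.0) -> tuple:
    """Express the original click relative to the center of the fuzzy search
    area (the box expanded by `factor`), normalized by the search-area size."""
    cx, cy, w, h = box
    return ((click_xy[0] - cx) / (factor * w), (click_xy[1] - cy) / (factor * h))

frame = np.zeros((480, 640, 3), dtype=np.uint8)
rough_box = (320, 240, 80, 60)                              # (cx, cy, w, h) from Step 2
search_area = expand_crop(frame, rough_box, factor=4.0)     # long-range context
visual_ref = expand_crop(frame, rough_box, factor=2.0)      # target-focused region
print(search_area.shape, visual_ref.shape, relative_click((330, 250), rough_box))
```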
Step 4: precise localization of the optimized bounding box based on the multi-modal prompts.
The fuzzy search area, the visual reference prompt, and the spatial position reference prompt are jointly fed into the multi-task unified model of Step 2 again, with the task switched from coarse localization to multi-modal prompt optimization. At the input end, the model extracts three feature streams from the three inputs, respectively expressed as the fuzzy search area features, the visual reference area features, and the spatial position reference features. The fuzzy search area features provide long-range contextual cues, the visual reference area features provide high-confidence target structure information, and the spatial position reference features guide attention to the core region of the target.
The optimized click-target bounding box output by the prediction network of the multi-task unified model finally serves as the annotation of the initial video frame.
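A schematic driver for this two-pass inference, under a hypothetical interface in which the unified model is a single callable returning a normalized (cx, cy, w, h) box; the patent does not define such an API, and `dummy_unified_model` is only a stand-in so the sketch runs.

```python
import torch

def dummy_unified_model(**inputs) -> torch.Tensor:
    """Stand-in for the trained multi-task unified model: returns a fixed
    normalized (cx, cy, w, h) box regardless of its inputs."""
    return torch.tensor([[0.5, 0.5, 0.2, 0.3]])

def crop_norm(img: torch.Tensor, cx, cy, w, h) -> torch.Tensor:
    """Crop a (C, H, W) tensor using a normalized center/size, clipped to bounds."""
    H, W = img.shape[-2:]
    x0, x1 = max(int((cx - w / 2) * W), 0), min(int((cx + w / 2) * W), W)
    y0, y1 = max(int((cy - h / 2) * H), 0), min(int((cy + h / 2) * H), H)
    return img[..., y0:y1, x0:x1]

def two_stage_localization(model, frame: torch.Tensor, click: torch.Tensor):
    """Pass 1: coarse box from the full image and the click.
    Pass 2: refined box from the 4x fuzzy search area, the 2x visual reference,
    and the click expressed relative to the search area."""
    rough = model(image=frame, click=click)[0]
    cx, cy, w, h = rough.tolist()
    search = crop_norm(frame, cx, cy, 4 * w, 4 * h)    # long-range context
    visual = crop_norm(frame, cx, cy, 2 * w, 2 * h)    # high-confidence target region
    rel_click = (click - rough[:2]) / rough[2:] / 4.0  # position relative to search area
    return model(image=search, visual_ref=visual, click=rel_click)

frame = torch.rand(3, 320, 320)
refined = two_stage_localization(dummy_unified_model, frame, torch.tensor([0.5, 0.5]))
print(refined)
```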
Step 5: video target tracking with the Transformer-based visual tracking network OSTrack.
(1) Crop the search area of the current frame: for the t-th frame of the video, a fixed-size search area is constructed centered on the target bounding box tracked in frame t-1; when t = 2, the initialization result obtained in Step 4 is used as that bounding box.
(2) Template and search feature extraction: using the click-target bounding box obtained in Step 4, the target image clicked by the user is cropped from the initial video frame as the tracking template T; the tracking template T and the search area are input into the image encoder together to extract the template features and the search-region features.
(3) The template features and the search-area features are matched and fused through the Transformer structure, capturing the temporal consistency of the target.
(4) Bounding box prediction: the OSTrack prediction network predicts the target bounding box in the current frame, which serves as the tracking result of the current frame and, at the same time, as the center for cropping the search area of the (t+1)-th frame in step 5 (1).
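A schematic tracking loop under an assumed tracker interface (the real OSTrack API differs); it shows how the refined box from Step 4 seeds the template and how each frame's prediction recenters the next search area. The 2x template and 4x search crop factors are assumptions typical of OSTrack-style trackers.

```python
import numpy as np

def crop(img: np.ndarray, box: tuple, factor: float) -> np.ndarray:
    """Crop around box = (cx, cy, w, h) in pixels, expanded by `factor`."""
    cx, cy, w, h = box
    H, W = img.shape[:2]
    x0, x1 = max(int(cx - factor * w / 2), 0), min(int(cx + factor * w / 2), W)
    y0, y1 = max(int(cy - factor * h / 2), 0), min(int(cy + factor * h / 2), H)
    return img[y0:y1, x0:x1]

def track_video(frames: list, init_box: tuple, predict_box, search_factor: float = 4.0):
    """Propagate a target through `frames` (HxWx3 arrays) starting from
    `init_box` = (cx, cy, w, h) of frame 0. `predict_box(template, search,
    prev_box)` stands in for the Transformer tracker's forward pass."""
    template = crop(frames[0], init_box, factor=2.0)            # template from initial frame
    results, prev_box = [init_box], init_box
    for frame in frames[1:]:
        search = crop(frame, prev_box, factor=search_factor)    # search area around last box
        prev_box = predict_box(template, search, prev_box)      # per-frame prediction
        results.append(prev_box)
    return results

# Toy run with a stand-in predictor that keeps the box fixed.
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(5)]
boxes = track_video(frames, (320, 240, 80, 60), lambda t, s, b: b)
print(len(boxes), boxes[-1])  # 5 (320, 240, 80, 60)
```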
The invention has the beneficial effects that:
(1) The proposed labeling method has the notable advantage of low-cost initialization. Compared with the traditional approach of manually and finely labeling an initial bounding box, the method locates the target with a single click in the initial video frame, reducing manual labeling cost and operational complexity. The simplified interaction also significantly shortens the time overhead of the initialization phase.
(2) A staged reasoning mechanism decouples coarse target localization from the subsequent fine optimization based on multi-modal prompts. By introducing contextual information of different modalities at different stages, the target position is progressively refined, effectively improving the accuracy and stability of the bounding box and providing a solid foundation for high-quality automatic annotation.
Drawings
Fig. 1 is a schematic diagram of the Transformer-based tracker OSTrack.
Fig. 2 is a schematic diagram of the multi-task unified network architecture.
FIG. 3 is a flow chart of a proposed interactive target recognition and tracking method.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings and technical schemes.
Fig. 1 is a schematic diagram of the Transformer-based tracker OSTrack, which is composed of an image encoder, a coordinate encoder, a feature encoder, a gated expert network, and a prediction network. The feature encoder is a sequential stack of 12 standard Transformer layers. Compared with conventional correlation operations, the Transformer module improves the fusion of template features and search-area features. The prediction network extracts center coordinates and size information from the fused features output by the feature encoder, yielding the coordinates of the target bounding box in the input image.
Fig. 2 is a schematic diagram of the multi-task unified model, and Fig. 3 is a flowchart of the interactive target recognition and tracking method. The process starts from a single click by a person in the initial frame; the click point is normalized and fed into the network together with the image scaled to a uniform size, providing a semantic prompt for the target. The multi-task unified model adopts OSTrack as the base model, and its feature encoder is formed by stacking Transformer layers equipped with a gated expert network (MoE), enabling joint modeling and efficient reasoning over image and position information. The gated expert network consists of three expert sub-networks and a gating module that dynamically assigns routing paths to the input; each expert sub-network is a two-layer linear feed-forward network, improving computational efficiency and expressive capacity.
After the bounding box is preliminarily predicted, the system builds multi-modal auxiliary prompts, including a fuzzy search area expanded 4 times and a visual reference area expanded 2 times, and converts the click point into relative position coordinates as a spatial position prompt. The multi-modal prompt information and the image are fed into the multi-task unified model again; features from different modalities and scales are fused by the Transformer's attention mechanism to realize multi-modal joint optimization of the bounding box and output an initialization bounding box of higher accuracy. Finally, target propagation through the whole video is realized by the OSTrack tracker: the optimized bounding box is cropped as a template, fused and matched with the search-area features of each frame, and the target position is predicted frame by frame to complete video target tracking. The overall system is highly extensible and stable in tracking, and is suitable for real-world applications.
OSTrack in the multi-task unified model can be freely replaced with other Transformer-based trackers. Training for the first-stage coarse prediction and the second-stage multi-modal prompt optimization is performed in separate stages to secure higher performance gains. The training set uses all video sequences of the LaSOT and GOT-10k datasets. The optimizer is AdamW with an initial learning rate of 0.0001; the learning rate is reduced by a factor of 10 every 400 training epochs, and the total number of epochs is 800.
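A minimal sketch of this optimization schedule in PyTorch; the model and the data loading are placeholders, and only the optimizer and learning-rate schedule follow the stated settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 4)  # placeholder for the multi-task unified model

# AdamW with initial learning rate 1e-4; decay by 10x every 400 epochs, 800 epochs total.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=400, gamma=0.1)

for epoch in range(800):
    # ... iterate over LaSOT / GOT-10k training sequences, compute the loss,
    # call loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # about 1e-6 after two decays
```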
The feature extraction and fusion network structure of the Transformer-based tracker OSTrack is as follows:

Claims (5)

1. An interactive target recognition and tracking algorithm for intelligent robots is characterized by comprising the following steps:
step 1, manual click initialization and input preprocessing;
a manual single click in the initial video frame is taken as the starting point: an arbitrary point on the target in the initial frame is clicked, and the click is normalized and expressed as click coordinates; simultaneously, the initial video frame is scaled to a fixed size of 320×320 and recorded as the input image; the image and the click coordinates serve as joint input, providing the target prompt for the manual click initialization;
step 2, performing coarse target localization by using a multi-task unified model;
step 3, constructing the input information of the multi-modal prompt optimization stage;
step 4, precisely localizing the optimized bounding box based on the multi-modal prompts;
and step 5, realizing video target tracking by using the Transformer-based visual tracking network OSTrack.
2. The intelligent robot-oriented interactive target recognition and tracking algorithm according to claim 1, wherein the specific implementation process of step 2 is as follows:
The multi-task unified model mainly comprises an image encoder, a coordinate encoder, a feature encoder, a gated expert network, and a prediction network;
the image encoder consists of a single convolutional layer with a stride of 16×16 that converts the image into a depth feature representation;
the coordinate encoder consists of a two-layer linear feed-forward network that converts the click coordinates into a depth feature representation;
the feature encoder adopts the backbone of the visual tracking network OSTrack and is formed by stacking N Transformer layers equipped with a gated expert network;
the Transformer structure consists of an attention mechanism and a linear feed-forward layer, wherein the attention mechanism is defined as

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

wherein Q, K, and V respectively represent the query, key, and value of the input feature vectors, and $d_k$ represents the dimension of the feature vectors;
the Transformer structure with the gated expert network comprises, in order, an attention mechanism, the gated expert network, and a linear feed-forward layer; when the depth feature representations pass through such a structure in the feature encoder, the attention mechanism first mutually fuses the location information in the coordinate feature with the image information in the image feature; the fused representations then pass through the gated expert network, which consists of 3 expert sub-networks and a gating module, each expert sub-network being a two-layer linear feed-forward network; the gating module dynamically assigns computation paths according to the content of the depth feature representations, activating only some of the experts for inference; finally, the representations pass through a linear feed-forward layer;
the prediction network consists of three prediction heads which respectively predict a probability distribution map of the click target's center point, the coordinate offset of the center point, and the size of the click target; the outputs of the three heads jointly yield the rough bounding box of the click target, i.e., the approximate spatial location in the image of the target corresponding to the click coordinates, parameterized by the center coordinates and by the width and height of the click target.
3. The intelligent robot-oriented interactive target recognition and tracking algorithm according to claim 2, wherein the specific implementation process of step 3 is as follows:
After the rough bounding box is acquired, a fuzzy search area and reference prompts of two different modalities are further constructed as auxiliary input to refine the initial bounding box:
the fuzzy search area is centered on the rough bounding box and expanded to 4 times its width and height, constituting an image area containing contextual background;
the visual reference prompt is centered on the rough bounding box and expanded to 2 times its size, focusing on the target clicked in step 1 and its vicinity;
the spatial position reference prompt converts the click coordinates of the initial video frame from step 1 into relative position coordinates based on the center point of the fuzzy search area, serving as the spatial location reference of the click target in the fuzzy search area.
4. The intelligent robot-oriented interactive target recognition and tracking algorithm of claim 3, wherein the specific implementation process of step 4 is as follows:
The fuzzy search area, the visual reference prompt, and the spatial position reference prompt are jointly fed into the multi-task unified model of step 2 again, with the task target changed from coarse localization to multi-modal prompt optimization; at the input end, the model extracts three feature streams from the three inputs, respectively expressed as the fuzzy search area features, the visual reference area features, and the spatial position reference features, wherein the fuzzy search area features provide long-range contextual cues, the visual reference area features provide high-confidence target structure information, and the spatial position reference features guide attention to the core region of the target;
the optimized click-target bounding box output by the prediction network of the multi-task unified model finally serves as the annotation of the initial video frame.
5. The intelligent robot-oriented interactive target recognition and tracking algorithm of claim 4, wherein the specific implementation process of step 5 is as follows:
(1) Cropping the search area of the current frame: for the t-th frame of the video, a fixed-size search area is constructed centered on the target bounding box tracked in frame t-1; when t=2, the initialization result obtained in step 4 is used as that bounding box;
(2) Template and search feature extraction: using the click-target bounding box obtained in step 4, the target image clicked by the user is cropped from the initial video frame as the tracking template T; the tracking template T and the search area are input into the image encoder together to extract the template features and the search-region features;
(3) The template features and the search-area features are matched and fused through the Transformer structure, capturing the temporal consistency of the target;
(4) Bounding box prediction: the OSTrack prediction network predicts the target bounding box in the current frame, which serves as the tracking result of the current frame and, at the same time, as the center for cropping the search area of the (t+1)-th frame in step 5 (1).
CN202511013001.4A 2025-07-23 2025-07-23 Interactive target recognition and tracking algorithm for intelligent robot Active CN120526427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511013001.4A CN120526427B (en) 2025-07-23 2025-07-23 Interactive target recognition and tracking algorithm for intelligent robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511013001.4A CN120526427B (en) 2025-07-23 2025-07-23 Interactive target recognition and tracking algorithm for intelligent robot

Publications (2)

Publication Number Publication Date
CN120526427A true CN120526427A (en) 2025-08-22
CN120526427B CN120526427B (en) 2025-09-19

Family

ID=96743816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511013001.4A Active CN120526427B (en) 2025-07-23 2025-07-23 Interactive target recognition and tracking algorithm for intelligent robot

Country Status (1)

Country Link
CN (1) CN120526427B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6757571B1 (en) * 2000-06-13 2004-06-29 Microsoft Corporation System and process for bootstrap initialization of vision-based tracking systems
CN115908496A (en) * 2022-12-01 2023-04-04 大连理工大学 A Transformer-based Lightweight Target Tracking Data Annotation Method
CN118155183A (en) * 2024-01-31 2024-06-07 广东工业大学 A network architecture method for autonomous driving in unstructured scenarios with deep multimodal perception

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6757571B1 (en) * 2000-06-13 2004-06-29 Microsoft Corporation System and process for bootstrap initialization of vision-based tracking systems
CN115908496A (en) * 2022-12-01 2023-04-04 大连理工大学 A Transformer-based Lightweight Target Tracking Data Annotation Method
CN118155183A (en) * 2024-01-31 2024-06-07 广东工业大学 A network architecture method for autonomous driving in unstructured scenarios with deep multimodal perception

Also Published As

Publication number Publication date
CN120526427B (en) 2025-09-19

Similar Documents

Publication Publication Date Title
CN115310560B (en) A multimodal sentiment classification method based on modal space assimilation and contrastive learning
CN117874258B (en) Task sequence intelligent planning method based on language visual large model and knowledge graph
CN110263912B (en) An Image Question Answering Method Based on Multi-object Association Deep Reasoning
Cardenas et al. Multimodal hand gesture recognition combining temporal and pose information based on CNN descriptors and histogram of cumulative magnitudes
CN119862861B (en) Visual-text collaborative abstract generation method and system based on multi-modal learning
Wang et al. EMAT: Efficient feature fusion network for visual tracking via optimized multi-head attention
CN110929092A (en) A multi-event video description method based on dynamic attention mechanism
CN116204694B (en) Multi-mode retrieval method based on deep learning and hash algorithm
CN117115474B (en) An end-to-end single target tracking method based on multi-stage feature extraction
Chen et al. Enhancing visual question answering through ranking-based hybrid training and multimodal fusion
CN118736625B (en) Image-text pedestrian retrieval method based on fusion key point attention guidance
CN116912804A (en) An efficient anchor-free 3-D target detection and tracking method and model
CN114168721A (en) A Construction Method of Knowledge Augmentation Model for Multi-sub-objective Dialogue Recommender System
CN117649582A (en) Single-flow single-stage network target tracking method and system based on cascade attention
CN119580269A (en) A cross-modal alignment method and device for a multimodal large language model
CN119005242A (en) Remote sensing interpretation intelligent body system based on large language model
CN116049650A (en) RFSFD-T network-based radio frequency signal fingerprint identification method and system
CN118133114A (en) Track prediction method, medium and system based on graph neural network
CN117218156B (en) Single-target tracking method based on fusion feature decoding structure
CN119206573A (en) A video-oriented event knowledge extraction method, system, device and medium
Guo et al. Distillation-based hashing transformer for cross-modal vessel image retrieval
CN120526427B (en) Interactive target recognition and tracking algorithm for intelligent robot
CN119648749B (en) Target tracking method and system based on space channel summation attention
Mendez et al. Reinforcement learning of multi-domain dialog policies via action embeddings
CN118673181B (en) A frequency-domain guided enhanced video time-of-view retrieval method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant