
CN116630392A - Visual SLAM method for coupling multi-target tracking - Google Patents

Visual SLAM method for coupling multi-target tracking

Info

Publication number
CN116630392A
Authority
CN
China
Prior art keywords
target
pose
camera
information
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310262036.6A
Other languages
Chinese (zh)
Inventor
陈光柱
苟荣松
蒲鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Technology
Original Assignee
Chengdu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Technology
Priority to CN202310262036.6A
Publication of CN116630392A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/579 Depth or shape recovery from multiple images from motion
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30241 Trajectory
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention proposes a visual SLAM method coupled with multi-target tracking. The method is divided into a coupled visual-odometry front end and a graph-optimization back end. First, an extended Kalman filter (EKF) tracks each target on the basis of the target detection information; then, the current features are classified according to the target motion information obtained by the EKF: static features are used to obtain the camera pose, and dynamic features are used to obtain the 6-DOF pose of each target; finally, a multivariate factor-graph optimization model is established to jointly optimize the camera poses, target poses, static 3D points, and dynamic 3D points on the moving targets. The invention addresses the low positioning accuracy and insufficient target information of SLAM systems in dynamic environments and studies the role that target information in the scene plays during the robot's localization and mapping. It enables a mobile device to obtain accurate camera pose information and target pose information in a dynamic environment and to track the targets in the scene.

Description

A visual SLAM method coupled with multi-target tracking

Technical Field

The invention belongs to the fields of computer vision and SLAM, and in particular relates to a visual SLAM method coupled with multi-target tracking.

Background Art

The development of artificial intelligence has greatly promoted the application of virtual-reality technologies in many fields, such as autonomous navigation of mobile robots, human-robot collaboration, and industrial scene reconstruction. Simultaneous localization and mapping (SLAM) enables a robot or sensor to perceive the scene while accurately localizing itself. Although SLAM has developed rapidly, it still has relatively few real-world applications. The main reason is that mainstream SLAM methods rely on the assumption that the rigid objects in the scene remain globally static, which is a very strict condition. Real scenes contain many moving objects, making them dynamic environments, so traditional SLAM methods are no longer applicable. In recent years, several excellent methods have focused on the problems SLAM faces in dynamic scenes. Most of these methods use prior information (such as semantic information or motion-structure information of the scene) to reject the moving objects as outliers during localization and mapping. The main goal of these methods is to greatly improve pose accuracy by discarding the dynamic objects; as a result, they lose the information carried by the dynamic objects in the scene.

In view of the shortcomings of the above research, a visual SLAM method coupled with multi-target tracking is proposed.

Summary of the Invention

In view of the above problems, the object of the present invention is to provide a visual SLAM method coupled with multi-target tracking.

A visual SLAM method coupled with multi-target tracking comprises the following steps:

Step 1: train the instance-segmentation network Yolact on the COCO dataset;

Step 2: design a two-dimensional target tracker to track the moving targets in the scene;

Step 3: extract features from the current image and classify them using the tracked target information into dynamic target features and static background features; the static features are used to obtain the camera pose;

Step 4: use the target features to obtain the pose of each target based on the minimum-reprojection-error principle;

Step 5: establish a multivariate factor-graph model to jointly optimize the camera poses, target poses, static 3D points, and dynamic 3D points on the targets.

Compared with the prior art, the present invention has the following beneficial effects:

1. Compared with existing SLAM methods, the camera pose estimation accuracy in dynamic environments is greatly improved;

2. The motion information of the targets in the scene can be obtained in real time.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the COCO dataset;

Figure 2 shows the instance-segmentation result and the feature classification;

Figure 3 shows the transformation relationships of the camera over consecutive frames;

Figure 4 shows the structure of the back-end multivariate factor-graph model;

Figure 5 shows the obtained target motion information and the camera trajectory.

Detailed Description of Embodiments

The technical solution of the present invention is described in detail below with reference to the accompanying drawings.

A visual SLAM method coupled with multi-target tracking specifically comprises the following steps:

Step 1: train the instance-segmentation network Yolact on the COCO dataset

Step 1.1: to obtain the instance-mask and detection-box information of the moving targets in the scene, Yolact is trained on the COCO instance-segmentation dataset; people and robots in the dataset are regarded as dynamic targets in the scene, and sample images from the dataset are shown in Figure 1;

Step 1.2: send the instance-mask and detection-box information output by Yolact to the SLAM side through ROS.

Step 2: design a two-dimensional target tracker to track the moving targets in the scene

Step 2.1: use the target detection information output by Yolact;

Step 2.2: estimate the target state. The state to be estimated is defined as $\hat{X}_i^j = [\hat{P}_i^j, \hat{V}_i^j]^T$, where $\hat{P}_i^j$ is the three-dimensional position of the center of the target mask and $\hat{V}_i^j$ is the velocity of the target along the three axes. The state estimation of the whole target can be expressed as

$\hat{X}_i^j = F X_{i-1}^j + w_{i-1}, \qquad \hat{\Sigma}_i^j = F\,\Sigma_{i-1}^j\,F^T + Q, \qquad Q = A\,\Sigma_v\,A^T, \qquad F = \begin{bmatrix} I_{3\times3} & \Delta t\, I_{3\times3} \\ 0 & I_{3\times3} \end{bmatrix},$

where $\hat{X}_i^j$ is the target state to be estimated, $F$ is the state-transition matrix, $w_{i-1}$ is the process noise of the state transition, $\hat{\Sigma}_i^j$ is the uncertainty of the state estimate, $Q$ is the covariance matrix of the process noise, $A$ is the distribution matrix of the process noise, $\Delta t$ is the observation interval, and $\Sigma_v$ is the covariance matrix of the target velocity model;

Step 2.3: update the target state. The update fuses the predicted state $\hat{X}_i^j$ with the measurement $Z_i^j$ through the Kalman gain coefficient $K_i$:

$X_i^j = \hat{X}_i^j + K_i\big(Z_i^j - h(\hat{X}_i^j)\big),$

where $H_i$ denotes the Jacobian matrix of the observation function; the observation function $h(\cdot) = [f_\alpha, f_\beta, f_\gamma, 1]$ realizes the transformation of the target from the estimated state to the measured state, and $[x_o, y_o, z_o]$ are the coordinates of the camera center;

Step 2.4: update the Kalman gain coefficient and the uncertainty:

$K_i = \hat{\Sigma}_i^j H_i^T \big( H_i \hat{\Sigma}_i^j H_i^T + R_i \big)^{-1}, \qquad \Sigma_i^j = (I - K_i H_i)\,\hat{\Sigma}_i^j,$

where $R_i$ is the covariance matrix of the observation noise.
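The following is a minimal sketch of the per-target EKF of Steps 2.2 to 2.4, written in Python with NumPy for illustration. It assumes a constant-velocity model and, for brevity, a simplified linear observation of the mask-center position only (the patent's observation function h(·) additionally involves the Euler angles of the target motion and the camera center); the class name, the noise values, and the frame rate are hypothetical.

```python
import numpy as np

class TargetEKF:
    """Sketch of the per-target extended Kalman filter (Steps 2.2-2.4).
    State x = [px, py, pz, vx, vy, vz]; the observation is the 3D mask center."""

    def __init__(self, p0, dt, sigma_v=0.5, sigma_z=0.05):
        self.x = np.hstack([p0, np.zeros(3)])                # initial state
        self.P = np.eye(6)                                   # state uncertainty
        self.F = np.eye(6)                                   # state-transition matrix
        self.F[:3, 3:] = dt * np.eye(3)
        A = np.vstack([0.5 * dt**2 * np.eye(3), dt * np.eye(3)])  # noise distribution matrix
        self.Q = A @ (sigma_v**2 * np.eye(3)) @ A.T          # process-noise covariance
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])    # Jacobian of the (simplified) observation
        self.R = sigma_z**2 * np.eye(3)                      # observation-noise covariance

    def predict(self):
        self.x = self.F @ self.x                             # state prediction
        self.P = self.F @ self.P @ self.F.T + self.Q         # uncertainty prediction
        return self.x

    def update(self, z):
        y = z - self.H @ self.x                              # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)             # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P           # uncertainty update
        return self.x

# Usage: one filter per tracked target, fed with the mask-center measurement
ekf = TargetEKF(p0=np.array([1.0, 0.2, 3.5]), dt=1 / 30)
ekf.predict()
state = ekf.update(np.array([1.02, 0.21, 3.48]))
```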

Step 3: extract features from the current image with the FAST method, classify them using the tracked target information, and use the static features to obtain the camera pose

Step 3.1: extract FAST features from the current image to obtain the feature set;

Step 3.2: using the target state obtained in Step 2, a first velocity estimate $V_{i,1}^j$ of the target is obtained from the velocity component of the EKF state. In addition, a second velocity estimate $V_{i,2}^j$ is obtained from the 6-DOF pose of the same target in the previous frame, as the displacement of the target pose over the observation interval. The final target velocity is obtained by the weighted fusion

$V_i^j = \alpha\, V_{i,1}^j + \beta\, V_{i,2}^j,$

where $\alpha$ and $\beta$ are set to 0.4 and 0.6 in the experiments. Finally, if $V_i^j \ge l$, the features on this target are regarded as dynamic features;

Step 3.3: remove the dynamic features from the feature set and solve the camera pose from the remaining static features with PnP.
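As an illustration of Steps 3.2 and 3.3, the sketch below fuses the two speed estimates, thresholds them to separate dynamic from static features, and recovers the camera pose from the static correspondences with OpenCV's PnP solver. The weights follow the values stated above (α = 0.4, β = 0.6), while the threshold value, the function names, and the data layout are assumptions.

```python
import numpy as np
import cv2

ALPHA, BETA, SPEED_THRESHOLD = 0.4, 0.6, 0.1   # threshold l: assumed value, in m/s

def fused_speed(ekf_velocity, prev_pose_t, curr_pose_t, dt):
    """Weighted fusion of the EKF speed and the pose-displacement speed (Step 3.2)."""
    v1 = np.linalg.norm(ekf_velocity)
    v2 = np.linalg.norm(curr_pose_t - prev_pose_t) / dt
    return ALPHA * v1 + BETA * v2

def camera_pose_from_static(points_3d, points_2d, dynamic_mask, K):
    """Solve the camera pose with PnP using only static correspondences (Step 3.3).
    dynamic_mask is a boolean array marking features that lie on fast-moving targets."""
    static = ~dynamic_mask
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d[static].astype(np.float64),
        points_2d[static].astype(np.float64),
        K, None)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)                  # rotation matrix of the camera pose
    return R, tvec
```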

Step 4: use the target features to obtain the pose of each target based on the minimum-reprojection-error principle

Step 4.1: establish the motion model of the camera and the targets as shown in Figure 3;

Step 4.2: establish the nonlinear transformation model of the 3D points and express the motion of the target between image frames:

${}^{w}P_i^{j,n} = {}_{i-1}^{\;i}H^{j}\,{}^{w}P_{i-1}^{j,n}, \qquad {}_{i-1}^{\;i}H^{j} = T_i^{j}\,\big(T_{i-1}^{j}\big)^{-1},$

where ${}^{w}P_i^{j,n}$ denotes a 3D point on the target, $T_{i-1}^{j}$ and $T_i^{j}$ denote the target pose at frames $i-1$ and $i$, and ${}_{i-1}^{\;i}H^{j}$ is the pose transformation between the two adjacent frames;

Step 4.3: according to the camera projection model, establish the reprojection error of the 3D points lying on the target:

$e_i^{j,n} = p_i^{j,n} - \pi\!\big(X_i,\; {}_{i-1}^{\;i}H^{j}\,{}^{w}P_{i-1}^{j,n}\big),$

where $p_i^{j,n}$ is the pixel corresponding to the matched feature point, $\pi(\cdot)$ is the projection function, and $X_i$ is the camera pose of frame $i$. For all 3D points on the target, the minimum reprojection error can be written as

${}_{i-1}^{\;i}H^{j*} = \arg\min_{H} \sum_{n=1}^{n_d} \big\| p_i^{j,n} - \pi\!\big(X_i,\; H\,{}^{w}P_{i-1}^{j,n}\big) \big\|^{2},$

where $n_d$ is the total number of features extracted from the target and ${}_{i-1}^{\;i}H^{j*}$ is the pose transformation to be obtained;

Step 4.4: map the above minimum-reprojection-error model into the Lie-algebra space:

$\xi^{*} = \arg\min_{\xi} \frac{1}{2} \sum_{n=1}^{n_d} \big\| p_i^{j,n} - \pi\!\big(X_i,\; \exp(\xi^{\wedge})\,{}^{w}P_{i-1}^{j,n}\big) \big\|^{2}, \qquad \xi = [\rho, \phi]^{T},$

where $\exp(\xi^{\wedge})$ is the exponential map of $\xi$ in the Lie-algebra space, $\rho$ is the translation vector of the target, and $\phi$ is the rotation vector of the target. Finally, the above expression is solved with the Gauss-Newton method.
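A sketch of the Gauss-Newton solution of Step 4.4 is given below. It parameterizes the object motion by ξ = [ρ, φ], uses Rodrigues' formula for the rotation and treats the translation part of the exponential map directly as ρ, and computes the Jacobian numerically for brevity; an analytic Jacobian and a proper SE(3) exponential would be used in a real implementation. All function and variable names are illustrative.

```python
import numpy as np
import cv2

def se3_exp(xi):
    """Approximate exponential map of xi = [rho, phi]: rotation via Rodrigues,
    translation taken directly as rho (a simplification for this sketch)."""
    rho, phi = xi[:3], xi[3:]
    R, _ = cv2.Rodrigues(phi)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, rho
    return T

def project(K, X_cam, P_w):
    """Project homogeneous world points P_w (Nx4) through the camera pose X_cam (4x4, world-to-camera)."""
    Pc = (X_cam @ P_w.T).T[:, :3]
    uv = (K @ Pc.T).T
    return uv[:, :2] / uv[:, 2:3]

def estimate_object_motion(K, X_cam, P_prev, p_curr, iters=10):
    """Gauss-Newton on the frame-to-frame object motion H (Step 4.4), numerical Jacobian."""
    xi = np.zeros(6)
    for _ in range(iters):
        H = se3_exp(xi)
        r = (p_curr - project(K, X_cam, (H @ P_prev.T).T)).ravel()   # stacked residuals
        J = np.zeros((r.size, 6))
        eps = 1e-6
        for k in range(6):                                            # numerical Jacobian column by column
            d = np.zeros(6); d[k] = eps
            rd = (p_curr - project(K, X_cam, (se3_exp(xi + d) @ P_prev.T).T)).ravel()
            J[:, k] = (rd - r) / eps
        dx = np.linalg.solve(J.T @ J + 1e-9 * np.eye(6), -J.T @ r)   # Gauss-Newton step
        xi = xi + dx
        if np.linalg.norm(dx) < 1e-8:
            break
    return se3_exp(xi)
```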

Step 5: establish the graph-optimization model and jointly optimize the camera poses, target poses, static 3D points, and dynamic 3D points on the targets

Step 5.1: establish the factor-graph model shown in Figure 4; the factor graph contains several kinds of variables to be optimized: target 3D points, static 3D points, camera pose information, and target pose information;

Step 5.2: establish the error terms corresponding to the factor-graph nodes, namely the projection error of the static 3D points, the projection error of the dynamic 3D points, and the motion error of the target, where $\omega$ denotes the parameter weighting the influence of the camera pose on the target pose; the measurement error of the camera pose follows the existing method;

Step 5.3: the above error models are combined into a joint cost over the set of variables to be optimized, and the Levenberg-Marquardt method is finally used to solve it.
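For illustration, the sketch below stacks the static and dynamic reprojection residuals of a single camera pose and a single target motion into one least-squares problem and hands it to SciPy's `least_squares` with a Huber loss, reusing the `se3_exp` and `project` helpers from the previous sketch. This is only a simplified stand-in for the multivariate factor graph: the target motion-constraint error, the camera measurement error taken from ORB_SLAM2, and the weight ω are omitted, and the variable-packing scheme is an assumption.

```python
import numpy as np
from scipy.optimize import least_squares

def pack(cam_xi, obj_xi, pts_s, pts_d):
    """Pack camera pose, object motion (each a 6-vector) and 3D points into one parameter vector chi."""
    return np.hstack([cam_xi, obj_xi, pts_s.ravel(), pts_d.ravel()])

def residuals(chi, K, obs_static, obs_dynamic, n_s, n_d):
    """Stacked reprojection residuals of static points and of dynamic points moved by the object motion."""
    cam_xi, obj_xi = chi[:6], chi[6:12]
    pts_s = chi[12:12 + 3 * n_s].reshape(n_s, 3)
    pts_d = chi[12 + 3 * n_s:].reshape(n_d, 3)
    X_cam, H_obj = se3_exp(cam_xi), se3_exp(obj_xi)
    hom = lambda P: np.hstack([P, np.ones((len(P), 1))])
    r_s = (obs_static - project(K, X_cam, hom(pts_s))).ravel()     # static reprojection error
    moved = (H_obj @ hom(pts_d).T).T                               # dynamic points after the object motion
    r_d = (obs_dynamic - project(K, X_cam, moved)).ravel()         # dynamic reprojection error
    return np.hstack([r_s, r_d])

# chi0 = pack(cam_xi0, obj_xi0, pts_s0, pts_d0)
# result = least_squares(residuals, chi0, loss="huber",
#                        args=(K, obs_s, obs_d, n_s, n_d))
# Note: method='lm' would mirror the patent's Levenberg-Marquardt choice,
# but SciPy's 'lm' does not support robust losses, so the default TRF solver is used here.
```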

The complete workflow of the entire system is described as follows:

Step 1: train the instance-segmentation network Yolact on the COCO dataset;

Step 2: design a two-dimensional target tracker to track the moving targets in the scene;

Step 3: extract features from the current image and classify them using the tracked target information into dynamic target features and static background features; the static features are used to obtain the camera pose;

Step 4: use the target features to obtain the pose of each target based on the minimum-reprojection-error principle;

Step 5: establish the graph-optimization model and jointly optimize the camera poses, target poses, static 3D points, and dynamic 3D points on the targets;

Step 6: obtain the target trajectories and the camera trajectory as shown in Figure 5.

Claims (1)

1. A visual SLAM method coupled with multi-target tracking, characterized by comprising the following steps:

Step 1: build the framework of the visual SLAM method coupled with multi-target tracking; the framework comprises a coupled-odometry front end and a graph-optimization back end, the coupled-odometry front end contains a two-dimensional target tracker, target pose estimation and camera pose estimation, and the graph-optimization back end designs a multivariate factor graph to optimize the target pose information and the camera pose information produced by the front end;

Step 2: design the two-dimensional target tracker; based on the detection information of the targets in the scene, an extended Kalman filter (EKF) is used to estimate the motion state of each target; at time $i$, the predicted motion state of target $j$ is expressed as

$\hat{X}_i^j = F X_{i-1}^j + w_{i-1},$

where $w_{i-1}$ is the process noise of the state transition, $X_i^j = [P_i^j, V_i^j]^T$ is the motion state of target $j$ to be tracked, $P_i^j$ is the three-dimensional position of the mask of target $j$, $V_i^j$ is the velocity of target $j$ along the $x$, $y$ and $z$ axes, $F$ is the state-transition matrix built from identity blocks and $\Delta t$, and $\Delta t$ is the interval at which the target motion state is estimated; the predicted uncertainty of the state estimate is expressed as

$\hat{\Sigma}_i^j = F\,\Sigma_{i-1}^j\,F^T + Q, \qquad Q = A\,\Sigma_v\,A^T,$

where $\Sigma_{i-1}^j$ is the uncertainty of target $j$ at time $i-1$, $Q$ is the noise term of the uncertainty estimation, $A$ is the distribution matrix of the process noise, and $\Sigma_v$ is the covariance matrix of the velocity model (the motion speeds of the different types of targets in the workshop); next, the motion state of target $j$ is updated by

$X_i^j = \hat{X}_i^j + K_i\big(Z_i^j - h(\hat{X}_i^j)\big), \qquad K_i = \hat{\Sigma}_i^j H_i^T\big(H_i \hat{\Sigma}_i^j H_i^T + R_i\big)^{-1}, \qquad \Sigma_i^j = (I - K_i H_i)\,\hat{\Sigma}_i^j,$

where $K_i$ is the Kalman gain coefficient at time $i$, $R_i$ is the covariance matrix of the observation noise whose diagonal entries are the variances of the three-dimensional coordinates of the mask center of target $j$, $Z_i^j$ is the observation of target $j$, i.e. the three-dimensional coordinates of its mask center, and $H_i$ is the Jacobian matrix of the observation function $h(\cdot) = [f_\alpha, f_\beta, f_\gamma, 1]$, in which $\alpha$, $\beta$, $\gamma$ are the Euler angles of the motion of target $j$; the observation function $h(\cdot)$ maps the observation of target $j$ to the motion-state estimate, and $[x_o, y_o, z_o]$ are the world coordinates of the camera center;

Step 3: extract features from the current image with the FAST method to obtain a feature set, which contains target features located on dynamic targets and static features belonging to the static background;

Step 4: target motion state estimation; the linear speed $V_i^j$ of target $j$ is obtained from the velocity component of its tracked motion state; setting $l$ as the speed threshold of moving targets in the scene, if $V_i^j > l$, target $j$ is regarded as a dynamic target in the scene and the features extracted from target $j$ are target features, and if $V_i^j < l$, target $j$ is regarded as static and the features extracted from target $j$ are static features;

Step 5: camera pose estimation; using the target mask information output by Yolact, the positions of all static features in the image at time $i$ are marked, and the static features are used with the PnP method to obtain the initial camera pose $X_i$;

Step 6: target pose estimation; for a 3D point on target $j$ at time $i$, its pixel value $p_i^{j,n}$ and its coordinates ${}^{o}P^{j,n}$ in the target's own reference frame can be obtained from the camera projection model, so that its reprojection error over consecutive frames is expressed as

$\zeta_i = p_i^{j,n} - \pi\!\big(X_i,\; {}_{i-1}^{\;i}H^{j}\,T_{i-1}^{j}\,{}^{o}P^{j,n}\big),$

where $X_i$ is the camera pose at frame $i$, $\pi(\cdot)$ is the projection function (camera intrinsic parameters), $T_{i-1}^{j}$ is the target pose information at time $i-1$, and ${}_{i-1}^{\;i}H^{j}$ is the SE(3) pose transformation of this 3D point over the two consecutive frames; then the reprojection error over all 3D points on target $j$ is expressed as

$e\big({}_{i-1}^{\;i}H^{j}\big) = \sum_{n=1}^{n_d} \big\| p_i^{j,n} - \pi\!\big(X_i,\; {}_{i-1}^{\;i}H^{j}\,T_{i-1}^{j}\,{}^{o}P^{j,n}\big) \big\|^{2},$

where ${}_{i-1}^{\;i}H^{j}$ is the pose transformation of the target between arbitrary images and $n_d$ is the total number of 3D points on target $j$; finally, the above expression is solved in the Lie-algebra space to obtain the pose of the target;

Step 7: establish the multivariate factor-graph model; the 3D projections belonging to the static features, the 3D projections belonging to the target features, the pose information of the dynamic targets and the pose information of the camera are used as the nodes of the multivariate factor graph;

Step 8: obtain the camera measurement error $e_i$; the camera measurement error of ORB_SLAM2 is adopted as the camera measurement error here;

Step 9: obtain the target-feature projection error; according to the pose information of target $j$ described in Step 6, the target-feature projection error of target $j$ is established;

Step 10: obtain the projection error of the static features; according to the camera pose information described in Step 5, the projection error of a static feature is expressed as

$e_i^{s,n} = p_i^{s,n} - \psi\!\big(X_i,\; P^{s,n}\big),$

where $\psi(\cdot)$ is the projection function, $X_i$ is the camera pose information, $P^{s,n}$ is the static 3D point and $p_i^{s,n}$ is its corresponding two-dimensional pixel coordinate;

Step 11: obtain the motion-constraint error of target $j$ between consecutive image frames; according to the target motion state described in Step 4, the motion error of the target between consecutive image frames is established;

Step 12: establish the joint error-constraint model; the above constraint models constitute a factor graph, with each constraint model as a node of the factor graph; the factor-graph optimization problem is the minimization, over the set of parameters to be optimized in the whole factor graph, of the sum of the above error terms weighted by Huber loss functions $l_i$ and the corresponding covariance matrices, where $M_c$ is the set of image frames with a covisibility relationship, $M_o$ is the total number of targets tracked in the current frame, $M_s$ is the static feature set extracted from the background, $M_d$ is the dynamic feature set extracted from the targets, and $\Sigma_{\Delta t}$ is the covariance matrix of the observation frequency; finally, the Levenberg-Marquardt method is used to solve it and obtain the optimal parameters.
CN202310262036.6A 2023-03-17 2023-03-17 Visual SLAM method for coupling multi-target tracking Pending CN116630392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310262036.6A CN116630392A (en) 2023-03-17 2023-03-17 Visual SLAM method for coupling multi-target tracking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310262036.6A CN116630392A (en) 2023-03-17 2023-03-17 Visual SLAM method for coupling multi-target tracking

Publications (1)

Publication Number Publication Date
CN116630392A true CN116630392A (en) 2023-08-22

Family

ID=87640559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310262036.6A Pending CN116630392A (en) 2023-03-17 2023-03-17 Visual SLAM method for coupling multi-target tracking

Country Status (1)

Country Link
CN (1) CN116630392A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180161986A1 (en) * 2016-12-12 2018-06-14 The Charles Stark Draper Laboratory, Inc. System and method for semantic simultaneous localization and mapping of static and dynamic objects
CN115619824A (en) * 2022-09-29 2023-01-17 哈尔滨工业大学 A visual-inertial dynamic target tracking SLAM device, method, computer and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RONGSONG GOU et al.: "A Visual SLAM With Tightly Coupled Integration of Multiobject Tracking for Production Workshop", IEEE Internet of Things Journal, vol. 11, no. 11, 22 February 2024 (2024-02-22), pages 19949-19962 *
朱文宇: "Design and Implementation of 3D Laser SLAM and Relocalization for an Autonomous Power-System Inspection Robot", CNKI China Master's Theses Full-text Database (Basic Sciences), no. 02, 15 February 2023 (2023-02-15), pages 005-652 *
苟荣松: "Visual SLAM Technology Coupled with Multi-Target Tracking for Production Workshops", CNKI China Master's Theses Full-text Database (Engineering Science and Technology II), no. 06, 15 June 2025 (2025-06-15), pages 028-613 *

Similar Documents

Publication Publication Date Title
CN111325843B (en) Real-time semantic map construction method based on semantic inverse depth filtering
CN110349250B (en) RGBD camera-based three-dimensional reconstruction method for indoor dynamic scene
Tian et al. Research on multi-sensor fusion SLAM algorithm based on improved gmapping
CN111288989B (en) Visual positioning method for small unmanned aerial vehicle
CN111899280B (en) Monocular Visual Odometry Method Using Deep Learning and Hybrid Pose Estimation
Jia et al. A Survey of simultaneous localization and mapping for robot
CN115355901B (en) Multi-machine joint mapping method integrating dynamic target perception
CN115420276B (en) A multi-robot collaborative localization and mapping method for outdoor scenes
CN107392964A (en) The indoor SLAM methods combined based on indoor characteristic point and structure lines
CN112525197B (en) Fusion Pose Estimation Method for Ultra-Broadband Inertial Navigation Based on Graph Optimization Algorithm
CN110533716A (en) A Semantic SLAM System and Method Based on 3D Constraints
CN106780484A (en) Robot interframe position and orientation estimation method based on convolutional neural networks Feature Descriptor
CN110032965A (en) Vision positioning method based on remote sensing images
CN116878501A (en) A high-precision positioning and mapping system and method based on multi-sensor fusion
CN115218889A (en) Multi-sensor indoor positioning method based on dotted line feature fusion
CN108053445A (en) The RGB-D camera motion methods of estimation of Fusion Features
CN112731503A (en) Pose estimation method and system based on front-end tight coupling
CN116045965A (en) Multi-sensor-integrated environment map construction method
CN115482282A (en) Dynamic SLAM method with multi-target tracking capability in autonomous driving scenarios
CN116310128A (en) Monocular Multi-Object SLAM Method Based on Instance Segmentation and 3D Reconstruction in Dynamic Environment
CN118225096A (en) Multi-sensor SLAM method based on dynamic feature point elimination and loop detection
CN118746293A (en) High-precision positioning method based on multi-sensor fusion SLAM
CN112432653B (en) Monocular visual inertial odometry method based on point and line features
CN117036408A (en) Object SLAM method combining multi-target tracking under dynamic environment
CN115930943B (en) SLAM method and system for fusing monocular vision and IMU based on graph optimization and EKF framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination