CN109035329A - Camera pose estimation optimization method based on deep features
- Publication number: CN109035329A
- Application number: CN201810878967.8A
- Authority
- CN
- China
- Prior art keywords
- feature
- point
- points
- pixel
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G — PHYSICS
- G06 — COMPUTING OR CALCULATING; COUNTING
- G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00 — Image analysis
- G06T7/80 — Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
- G06T7/70 — Determining position or orientation of objects or cameras
- G06T7/90 — Determination of colour characteristics
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
A camera pose estimation optimization method based on deep features, relating to optimization methods for supervised-learning SLAM systems. A random-forest-based matching algorithm rapidly computes 2D-3D point similarity to map 2D-3D point information; a constraint function combined with multi-feature fusion evaluates the camera pose; and, to address the instability of deep-learning-based SLAM algorithms, a multi-feature bundle optimization algorithm is proposed. Three-dimensional reconstruction data serve as the reference; the visible 3D points and the associated keypoint mappings from the random forest's offline dataset are then used, and multi-feature fusion with a constraint function measures the pose evaluation score. These methods optimize the performance of deep-learning-based SLAM. Experimental results demonstrate that the algorithm is robust.
Description
Technical Field
The present invention relates to optimization methods for supervised-learning SLAM systems, and in particular to a deep-feature-based camera pose estimation optimization method for supervised-learning SLAM algorithms.
Background Art
SLAM technology has promising applications in robotics, autonomous driving, and virtual and augmented reality, and among the many computer vision and artificial intelligence technologies, SLAM remains an active research topic. In recent years, more and more robots have entered daily life, bringing considerable convenience: using their own cameras, gyroscopes, laser sensors, and similar devices, they perceive the environment of a specific scene, localize themselves, and complete specific tasks under real-time constraints. Many companies at home and abroad have invested heavily in the research and development of autonomous vehicles, whose core technology is likewise SLAM; robust and fast environment recognition and semantic segmentation are key to autonomous driving. In the field of augmented reality, most AR applications currently deployed in commercial scenarios are based on specific templates, proceeding from template recognition to template tracking and matching, combined with 3D registration and model rendering for virtual-real interaction. True augmented reality, however, requires recognition and semantic understanding of the environment in which the application scene is located, and here SLAM remains the core technology.
Camera relocalization has long been an important task in SLAM, and image-based relocalization is a powerful and effective line of work within it. Common techniques for image-based camera pose estimation are image retrieval and methods based on 3D scene reconstruction; however, image retrieval errors are larger than those of GPS position sensors, and estimation accuracy also depends on the dataset. Zamir et al. ([1] Zamir, A.R., Shah, M.: Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs. IEEE Trans. Pattern Anal. Mach. Intell. (2014)) proposed a geolocation feature-matching framework using multiple nearest neighbors, but the method is severely limited if the query image and the database do not match. To achieve better localization results, such approaches require 3D prior information to provide valuable spatial relationships. Existing image localization methods based on 3D priors, however, focus only on local localization accuracy and neglect subsequent optimization. For local optimization, FastSLAM ([2] Parsons, S.: Probabilistic robotics by Sebastian Thrun, Wolfram Burgard and Dieter Fox. Knowledge Eng. Review (2006). https://doi.org/10.1017/S0269888906210993) measures the estimated pose error via sensor noise estimates and optimizes accordingly. ORB-SLAM ([3] Mur-Artal, R., Montiel, J.M.M., Tardós, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Robot. (2015)) is an excellent visual SLAM system that uses ORB features with local and global optimization, but it uses neither deep features nor prior knowledge. Kendall and Cipolla ([4] Kendall, A., Cipolla, R.: Modelling uncertainty in deep learning for camera relocalization. In: IEEE International Conference on Robotics and Automation, ICRA 2016, Stockholm, Sweden, 16–21 May 2016. https://doi.org/10.1109/ICRA.2016.7487679) proposed a Bayesian visual relocalization system that regresses the camera pose with a convolutional neural network, reusing 3D information and approaching real-time performance. The method proposed here does not rely on any sensors but instead localizes using a similarity model over prior 3D points; at present, however, no well-suited optimization method exists for supervised-learning-based approaches.
Although some visual localization algorithms based on deep convolutional neural networks are regarded as end-to-end localization methods tolerant of large baselines, they often exhibit large average errors when estimating poses. This comes down to the absence of back-end optimization (for example, local bundle adjustment). Addressing this problem, the present invention shows how to use random forests (Cutler, A., Cutler, D.R., Stevens, J.R.: Random forests. In: Machine Learning (2012)) for 2D-3D point matching, and how to improve and exploit constraint functions to optimize deep-learning-based SLAM systems.
Summary of the Invention
The object of the present invention is to provide a camera pose estimation optimization method based on deep features.
The present invention comprises the following steps:
1) Use a random-forest-based matching algorithm to rapidly compute the similarity of 2D-3D points and map 2D-3D point information;
2) Evaluate the camera pose using a constraint function and multi-feature fusion;
3) To address the instability of deep-learning-based SLAM algorithms, propose a multi-feature bundle optimization algorithm.
In step 1), the specific method for rapidly computing the similarity of 2D-3D points with a random-forest-based matching algorithm to map 2D-3D point information may be as follows:
Each decision tree consists of internal nodes and leaf nodes. The tree's predictions compute the similarity between 2D pixel points, from which the similarity in 3D space is inferred. From the root node down to the leaves, training converges by repeatedly refining the decision function, which is expressed as follows:
split(p; δ_n) = [f_n(p) > δ_n]
where n is the index of a node in the decision tree, p is a non-leaf node representing a 2D keypoint, [.] is a 0-1 indicator function, δ_n is the decision threshold, and f() is the decision function:
f(p) = a_1·D_shape(p_1, p_2) + a_2·D_texture(p_1, p_2) + a_3·D_color(p_1, p_2)
The weights a and the distances D() are defined in the feature fusion step. If split(p; δ_n) evaluates to 0, the training path branches to the left subtree; otherwise it branches to the right. p_1 and p_2 are pixel points paired around the keypoint. During training, the 3D reconstruction data contain the mappings between the corresponding 3D points and 2D points. The prior is divided into training data and verification data; the training framework is shown in Algorithm 1. The objective function Q is used to make the training data and the verification data follow the same learning tendency, where Θ is the number of verification samples that take a path different from the related training data, P_verification is the verification dataset, and λ is a parameter trading off the similarity within the same branch against the diversity across different branches. The objective function measures point-to-point similarity by the proportion of verification data and training data that fall into the same branch.
Algorithm 1, the random-forest-based 2D-3D point-mapping training procedure, is shown in Table 1:
Table 1
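To make the split mechanics concrete, the following is a minimal illustrative sketch in Python, not the patented implementation: decision_value, split, and choose_threshold are hypothetical names, the stand-in distances are toy values, and since the exact formula of the objective Q appears only as an image in the source, the score below is an assumed proxy that penalizes Θ (verification samples branching differently from their related training samples) and rewards branch diversity via λ.

```python
import numpy as np

def decision_value(pair, a=(0.4, 0.3, 0.3)):
    """f(p): weighted fusion of stand-in D_shape, D_texture, D_color distances
    between the two pixels p1, p2 paired around a keypoint."""
    p1, p2 = pair
    d_shape, d_texture, d_color = np.abs(np.asarray(p1, float) - np.asarray(p2, float))
    return a[0] * d_shape + a[1] * d_texture + a[2] * d_color

def split(pair, delta):
    """split(p; delta_n) = [f_n(p) > delta_n]: 0 -> left subtree, 1 -> right."""
    return int(decision_value(pair) > delta)

def choose_threshold(train_pairs, verif_pairs, lam=0.5):
    """Assumed proxy for Q: verification pairs (index-aligned with their related
    training pairs) should take the same path as the training pairs; lam trades
    off that agreement against the diversity of the two branches."""
    best_delta, best_score = 0.0, -np.inf
    for delta in np.linspace(0.0, 1.0, 101):
        train_paths = [split(p, delta) for p in train_pairs]
        verif_paths = [split(p, delta) for p in verif_pairs]
        theta = sum(t != v for t, v in zip(train_paths, verif_paths))
        diversity = min(train_paths.count(0), train_paths.count(1))
        score = -theta + lam * diversity
        if score > best_score:
            best_delta, best_score = delta, score
    return best_delta
```

Here each pair's descriptors are 3-vectors (one toy distance per feature channel); in the forest, this threshold search would be repeated at every internal node from root to leaf.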
In step 2), the specific method for evaluating the camera pose using a constraint function and multi-feature fusion may be as follows:
Suppose the pose estimation result is the camera extrinsic matrix, obtained from x, y, z (the position) and w, p, q, r (the quaternion); suppose the intrinsic matrix K is obtained from the EXIF tags (assuming no radial distortion). Then, combining the intrinsics:
the transformation matrix can be obtained from the image coordinates and world coordinates:
The 2D points and the 3D point cloud are associated through this transformation matrix. For example, the FALoG feature points of an evaluation image whose camera matrix has already been estimated are extracted as the query dataset. For a query feature point m(x, y), the mapping algorithm above finds the closest-matching feature in the established dataset: the random forest tests the query feature at each decision node until the query reaches a leaf node. The forest's final mapped feature m' is the corresponding point with the highest node probability at m, and m' has an associated 3D point. For each 3D point the database holds at least one corresponding image, so every point can be projected into the relevant image. The pixel differences are combined into an error function, which is used to iteratively optimize the pose matrix. The errors of the different mapped points are represented by color-based, shape-based, and texture-based pixel-level errors; for each feature, a score evaluates the pose according to the deviation rate of these features.
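As a sketch of the geometry only (standard pinhole-camera formulas, not the patent's code), the following assumes the quaternion is given in (w, p, q, r) order and that (x, y, z) is the camera center, so t = -R·C; both orderings are conventions the source leaves implicit.

```python
import numpy as np

def quat_to_rotation(w, p, q, r):
    """Rotation matrix from a quaternion (w, p, q, r), normalized first."""
    n = np.sqrt(w*w + p*p + q*q + r*r)
    w, p, q, r = w/n, p/n, q/n, r/n
    return np.array([
        [1 - 2*(q*q + r*r), 2*(p*q - w*r),     2*(p*r + w*q)],
        [2*(p*q + w*r),     1 - 2*(p*p + r*r), 2*(q*r - w*p)],
        [2*(p*r - w*q),     2*(q*r + w*p),     1 - 2*(p*p + q*q)],
    ])

def projection_matrix(K, position, quaternion):
    """P = K [R | t]: maps world coordinates to homogeneous image coordinates."""
    R = quat_to_rotation(*quaternion)
    t = -R @ np.asarray(position, float)   # assumes x, y, z is the camera center
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(P, X_world):
    """Project an (N, 3) array of 3D points to (N, 2) pixel coordinates."""
    Xh = np.hstack([X_world, np.ones((len(X_world), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]
```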
The relevant features are as follows:
(1) Color-based pixel features.
Color features describe the surface properties of objects in an image. A fused pixel-feature method that combines the single pixel with its image region expresses the color feature of an individual pixel most fully. A color distance function distinguishes the color variation of two individual pixels:
where p(x, y) and p'(x, y) are the two target points, and R(), G(), and B() are their corresponding R, G, and B values. For the region color feature, suppose the target point lies within a region of 5×5 pixels; for each RGB channel, the center of the region is the target point. The gradient value G(x, y) of the target point (x, y) is extracted in the horizontal and vertical directions:
where d_h is the difference between the average of the two pixels to the left of the target point and the average of the two pixels to its right; d_v, for the vertical direction, is computed in the same way as d_h. The remaining four 2×2 pixel blocks are average-downsampled. Finally, the color-based pixel dissimilarity is:
where δ ∈ {0, 1}: if the difference between the downsampled values of two blocks is below d_point(p, p'), δ is set to 0, and otherwise to 1; d_region(p, p') is computed in the same way as d_point(p, p'); d_G(p, p') is the difference of the gradient values; and η is a regularization factor.
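A minimal sketch of these color computations, under stated assumptions: the exact single-pixel distance formula appears only as an image in the source, so d_point below uses a plain L2 distance over RGB as a stand-in, and gradient_at and downsample_blocks assume the target point lies at least two pixels from the image border.

```python
import numpy as np

def d_point(p, p_prime):
    """Single-pixel color distance: L2 over the (R, G, B) difference
    (assumed form; the source gives the exact formula only as an image)."""
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(p_prime, float)))

def gradient_at(img, x, y):
    """d_h: mean of the two pixels left of (x, y) minus mean of the two to the
    right; d_v analogously in the vertical direction, per channel."""
    d_h = img[y, x-2:x].mean(axis=0) - img[y, x+1:x+3].mean(axis=0)
    d_v = img[y-2:y, x].mean(axis=0) - img[y+1:y+3, x].mean(axis=0)
    return d_h, d_v

def downsample_blocks(img, x, y):
    """Average-downsample the four corner 2x2 blocks of the 5x5 region
    around the target point (x, y)."""
    region = img[y-2:y+3, x-2:x+3].astype(float)
    corners = [region[:2, :2], region[:2, 3:], region[3:, :2], region[3:, 3:]]
    return [c.mean(axis=(0, 1)) for c in corners]
```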
(2) Shape-based and texture-based pixel features.
Shape-context features are extracted with the keypoint as the center of a 32×32 block, and the shape context of 25 sample points is computed; the log-polar coordinates of the block are divided into 12×3 = 36 bins. The difference between shape features is expressed by binarizing the diversity of the two points' shape histograms, where n = 36 and item ∈ {0, 1}; item = 1 means the column value of bin i is greater than the average. For texture, the neighborhood of a circle of radius R around the center point is divided into eight equal angular regions; the average image intensity of each region is computed, and if the average is greater than the center pixel value the region's value is set to 1, otherwise to 0. The eight-bit binary sequence is converted to a decimal number representing the texture feature. Finally, D_texture is determined by the Hamming distance.
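For the texture half, a sketch of the eight-sector binary code (an LBP-like scheme); the sector sampling density and the handling of the radius are simplifying assumptions, and the center point is assumed to be at least radius pixels from the border.

```python
import numpy as np

def texture_code(img, x, y, radius=3):
    """Eight-sector texture code around (x, y): each sector's mean intensity is
    compared with the center pixel, giving an 8-bit code converted to decimal."""
    center = float(img[y, x])
    bits = []
    for k in range(8):
        ang0 = k * np.pi / 4
        angles = np.linspace(ang0, ang0 + np.pi / 4, 8, endpoint=False)
        samples = [img[int(round(y + radius * np.sin(a))),
                       int(round(x + radius * np.cos(a)))] for a in angles]
        bits.append(1 if np.mean(samples) > center else 0)
    return int("".join(map(str, bits)), 2)   # decimal texture feature

def d_texture(code_a, code_b):
    """D_texture as the Hamming distance between the two 8-bit codes."""
    return bin(code_a ^ code_b).count("1")
```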
Finally, the evaluation function E is a combination of the above features:
E(p_i, p_j) = a_1·D_shape(p_i, p_j) + a_2·D_texture(p_i, p_j) + a_3·||D_color(p_i, p_j)||^2
where a_i ∈ (0, 1) are the weights of the different terms; D_color acts as a regularization term in this fused diversity function, while D_shape and D_texture are concatenated. The parameters of the function are tuned over k training iterations so that it evaluates best. The evaluation function expresses that the pose evaluation error can be measured by the reprojection error, at the feature points, between the query 2D features and the prior-dataset features.
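Assuming the per-pair distances have already been computed as above, a short sketch of the fused score; the weights a_i are placeholder values to be tuned over the k training iterations.

```python
def evaluate_pose_score(pairs, a=(0.4, 0.3, 0.3)):
    """E over all matched point pairs: weighted fusion of shape, texture, and
    squared color distances, averaged into one reprojection-style score."""
    total = 0.0
    for d_shape_ij, d_texture_ij, d_color_ij in pairs:
        total += a[0]*d_shape_ij + a[1]*d_texture_ij + a[2]*d_color_ij**2
    return total / max(len(pairs), 1)
```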
In step 3), the specific method of the proposed multi-feature bundle optimization algorithm, addressing the instability of deep-learning-based SLAM algorithms, is as follows:
The prior-knowledge-based deep-learning SLAM system opens a new thread to store the 2D-3D mappings and keyframes, and the keyframes that satisfy the selection criterion are taken as the keyframe set for bundle adjustment (BA). Local BA optimizes all points visible from the keyframes; the observed points help optimize the constraint function and the final pose. Global BA is similar to local BA, except that the system performs local BA every 30 frames and global BA every 300 frames. Table 2 compares the SLAM system using the optimization algorithm with other state-of-the-art SLAM systems.
Table 2
The present invention combines deep-learning-based SLAM with random-forest-based mapping between 2D image points and 3D point clouds to design an optimization algorithm for deep-learning SLAM systems, filling the gap in optimization algorithms for such systems. The system integrates low-computation SLAM environment construction with matching between image 2D points and 3D point clouds; it can reconstruct and optimize scenes in real time on both PC and mobile platforms while maintaining relatively high reconstruction accuracy, and it has significant practical value in robotics, autonomous driving, augmented reality, and related fields.
Most existing localization methods approximate pose confidence based only on the distance to reference points. Unlike these methods, the present invention uses 3D reconstruction data as the reference, then uses the visible 3D points and the associated keypoint mappings from the random forest's offline dataset, and measures the pose evaluation score with multi-feature fusion and a constraint function. These methods optimize the performance of deep-learning-based SLAM. Experimental results demonstrate that the algorithm of the present invention is robust.
Brief Description of the Drawings
Figure 1 is a system overview of 2D-3D point matching.
Figure 2 illustrates the 5×5 pixel region used for the color-based features.
Figure 3 illustrates the shape-based features.
Detailed Description of the Embodiments
The following embodiments further illustrate the present invention in conjunction with the accompanying drawings.
1. Basic concepts
1) Random-forest-based 2D-3D point mapping
The purpose of constructing the random forest is to find the relationship between the query image and the 3D-reconstruction prior data. The poses produced by different localization algorithms can be evaluated through this mapping architecture. Because the random forest algorithm offers low complexity and high robustness, the mapping algorithm supports real-time, effective evaluation. In essence, a decision tree is simply a mapping tool for retrieving the triangular geometric relationship from 2D to 3D, and it could be replaced by other methods.
Training the decision trees is the key to the mapping algorithm. The performance of the random forest is determined by the combined performance of its individual decision trees. Each decision tree consists of internal nodes and leaf nodes. The tree's predictions compute the similarity between 2D pixel points, from which the similarity in 3D space is inferred. From the root node down to the leaves, training converges by repeatedly refining the decision function, expressed as follows:
split(p; δ_n) = [f_n(p) > δ_n]
where n is the index of a node in the decision tree, p is a non-leaf node representing a 2D keypoint, [.] is a 0-1 indicator function, δ_n is the decision threshold, and f() is the decision function:
f(p) = a_1·D_shape(p_1, p_2) + a_2·D_texture(p_1, p_2) + a_3·D_color(p_1, p_2)
The weights a and the distances D() are defined in the feature fusion step. If split(p; δ_n) evaluates to 0, the training path branches to the left subtree; otherwise it branches to the right. p_1 and p_2 are pixel points paired around the keypoint. During training, the 3D reconstruction data (the prior data) contain the mappings between the corresponding 3D points and 2D points. The prior is divided into training data and verification data. The training framework is shown in Algorithm 1. Note that fast FALoG feature points are used (Wang, Z., Fan, B., Wu, F.: FRIF: fast robust invariant feature. In: British Machine Vision Conference, BMVC 2013, Bristol, UK, 9–13 September 2013. https://doi.org/10.5244/C.27.16), which rapidly detect the corresponding binary feature points. The objective function Q makes the training data and the verification data follow the same learning tendency as much as possible, where Θ is the number of verification samples that take a path different from the related training data, and P_verification is the verification dataset. λ is a parameter trading off the similarity within the same branch against the diversity across different branches. The objective function measures point-to-point similarity by the proportion of verification data and training data that fall into the same branch.
Algorithm 1, the random-forest-based 2D-3D point-mapping training procedure, is shown in Table 1.
2) Feature fusion algorithm
Although the 2D-3D relationship is easily obtained through the mapping above, judging the quality of a camera pose is a difficult measurement in the absence of image pose labels. A camera pose evaluation algorithm (scoring the estimated pose) is therefore designed; it requires no manual annotation, only the 3D reconstruction data obtained through transfer learning. The camera pose to be scored can be computed by any camera pose prediction algorithm. Suppose the pose estimation result is the camera extrinsic matrix, obtained from x, y, z (the position) and w, p, q, r (the quaternion). Suppose the intrinsic matrix K is obtained from the EXIF tags (assuming no radial distortion). Then, combining the intrinsics:
the transformation matrix can be obtained from the image coordinates and world coordinates:
The 2D points and the 3D point cloud are associated through this transformation matrix. For example, the FALoG feature points of an evaluation image whose camera matrix has already been estimated are extracted as the query dataset. For a query feature point m(x, y), the mapping algorithm above finds the closest-matching feature. As shown in Figure 1, the nearest established dataset feature is searched: the random forest tests the query feature at each decision node until the query reaches a leaf node. The forest's final mapped feature m' is the corresponding point with the highest node probability at m, and m' has an associated 3D point. For each 3D point the database holds at least one corresponding image, and every point can be projected into the associated image. The pixel differences are combined into an error function, which is used to iteratively optimize the pose matrix. The errors of the different mapped points are represented by color-based, shape-based, and texture-based pixel-level errors; each feature is described below. Finally, a score evaluates the pose according to the deviation rate of these features.
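A sketch of this query-and-score loop under assumed interfaces: forest is taken to be a list of tree roots whose internal nodes expose split()/left/right and whose leaves carry (prob, m', 3D point), and P is the 3×4 projection matrix from the previous step; none of these names come from the source.

```python
import numpy as np

def route(node, feature):
    """Descend one tree to a leaf by testing the query feature at each node."""
    while hasattr(node, "left"):
        node = node.right if node.split(feature) else node.left
    return node.prob, node.m_prime, node.point3d

def query_forest(forest, feature):
    """Return the mapped dataset feature m' (and its 3D point) with the
    highest leaf probability across all trees."""
    best = max((route(tree, feature) for tree in forest), key=lambda leaf: leaf[0])
    return best[1], best[2]

def reprojection_error(P, matches):
    """Mean pixel error between query keypoints m and the projections of their
    mapped 3D points under P; minimized iteratively to refine the pose."""
    err = 0.0
    for m, X in matches:                     # m: (x, y), X: (X, Y, Z)
        x = P @ np.append(np.asarray(X, float), 1.0)
        err += np.linalg.norm(np.asarray(m, float) - x[:2] / x[2])
    return err / max(len(matches), 1)
```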
1. Color-based pixel features.
Color features describe the surface properties of objects in an image. In the present invention, a fused pixel-feature method combining the single pixel with its image region expresses the color feature of an individual pixel most fully. A color distance function distinguishes the color variation of two individual pixels:
where p(x, y) and p'(x, y) are the two target points, and R(), G(), and B() are their corresponding R, G, and B values. For the region color feature, suppose the target point lies within a region of 5×5 pixels, as shown in Figure 2; for each RGB channel, the center of the region is the target point. The gradient value G(x, y) of the target point (x, y) is extracted in the horizontal and vertical directions:
where d_h is the difference between the average of the two pixels to the left of the target point and the average of the two pixels to its right; the other values in the horizontal and vertical regions around the center value 48 in Figure 2 give d_v, computed in the same way as d_h. For the remaining four 2×2 pixel blocks (the points in Figure 2 outside the center point's vertical and horizontal regions), their downsampled averages are taken. Finally, the color-based pixel dissimilarity is:
where δ ∈ {0, 1}: if the difference between the downsampled values of two blocks is below d_point(p, p'), δ is set to 0, and otherwise to 1. d_region(p, p') is computed in the same way as d_point(p, p'). d_G(p, p') is the difference of the gradient values, and η is a regularization factor.
2. Shape-based and texture-based pixel features.
To extract the shape feature of a pixel quickly and efficiently, shape-context features are extracted with the pixel as the center of a 32×32 block, and the shape context of 25 sample points is computed. As shown in Figure 3, the log-polar coordinates of the block are divided into 12×3 = 36 bins, and a histogram represents the shape feature vector. To express the difference between shape features, binarization represents the diversity of the two points' shape histograms, where n = 36 and item ∈ {0, 1}; item = 1 means the column value of bin i is greater than the histogram's average. For texture, the neighborhood of a circle of radius R around the center point is divided into eight equal angular regions. The average image intensity of each region is computed; if the average is greater than the center pixel value, the region's value is set to 1, otherwise to 0. The eight-bit binary sequence is converted to a decimal number representing the texture feature. Finally, D_texture is determined by the Hamming distance.
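A sketch of the shape half under stated assumptions: the 36-bin log-polar histogram and its binarization follow the description above, but the radial binning and the final combination into D_shape (given only as an image in the source) are assumed forms, here a Hamming distance over the binarized histograms to mirror D_texture.

```python
import numpy as np

def shape_context_histogram(points, center, r_bins=3, theta_bins=12):
    """36-bin log-polar histogram of the 25 sample points around `center`
    (12 angular x 3 radial bins, as in Figure 3)."""
    pts = np.asarray(points, float) - np.asarray(center, float)
    r = np.log1p(np.hypot(pts[:, 0], pts[:, 1]))
    theta = np.arctan2(pts[:, 1], pts[:, 0]) % (2 * np.pi)
    r_idx = np.minimum((r / (r.max() + 1e-9) * r_bins).astype(int), r_bins - 1)
    t_idx = (theta / (2 * np.pi) * theta_bins).astype(int)
    hist = np.zeros(r_bins * theta_bins)
    np.add.at(hist, r_idx * theta_bins + t_idx, 1)
    return hist

def binarize(hist):
    """item_i = 1 iff bin i exceeds the histogram average (n = 36 bits)."""
    return (hist > hist.mean()).astype(int)

def d_shape(hist_a, hist_b):
    """Assumed D_shape: Hamming distance of the binarized histograms."""
    return int(np.sum(binarize(hist_a) != binarize(hist_b)))
```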
Finally, the evaluation function E is a combination of the above features:
E(p_i, p_j) = a_1·D_shape(p_i, p_j) + a_2·D_texture(p_i, p_j) + a_3·||D_color(p_i, p_j)||^2
where a_i ∈ (0, 1) are the weights of the different terms. D_color acts as a regularization term in this fused diversity function, and D_shape and D_texture are concatenated. Training over k iterations tunes the parameters of this function so that it evaluates best. The evaluation function expresses that the pose evaluation error can be measured by the reprojection error, at the feature points, between the query 2D features and the prior-dataset features.
3) Back-end optimization for deep-learning SLAM
Local localization errors are reflected in the pose prediction and 3D reconstruction between different keyframes. The accumulation of keyframe pose-estimation errors leads to global error growth, limiting the accuracy of the whole system. The prior-knowledge-based deep-learning SLAM system opens a new thread to store the 2D-3D mappings and keyframes. The keyframes that satisfy the selection criterion are chosen as the keyframe set for bundle adjustment (BA). Local BA optimizes all points visible from the keyframes; the observed points help optimize the constraint function and the final pose. Global BA is similar to local BA, except that the system performs local BA every 30 frames and global BA every 300 frames.
Table 2 compares the SLAM system using the optimization algorithm of the present invention with other state-of-the-art SLAM systems.
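A schematic of the keyframe bookkeeping and BA scheduling described above; is_good_keyframe, local_ba, and global_ba are assumed callables standing in for the selection criterion and the optimizers, and the 10-keyframe local window is an illustrative choice, not a value from the source.

```python
def run_backend(frames, is_good_keyframe, local_ba, global_ba):
    """Store qualifying keyframes (on a separate thread in the real system),
    run local BA every 30 frames and global BA every 300 frames."""
    keyframes = []
    for i, frame in enumerate(frames, start=1):
        if is_good_keyframe(frame):
            keyframes.append(frame)          # store 2D-3D mappings + keyframe
        if i % 300 == 0:
            global_ba(keyframes)             # optimize all keyframes and points
        elif i % 30 == 0:
            local_ba(keyframes[-10:])        # optimize a recent local window
    return keyframes
```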
Claims (4)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810878967.8A CN109035329A (en) | 2018-08-03 | 2018-08-03 | Camera pose estimation optimization method based on deep features |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN109035329A (en) | 2018-12-18 |
Family
ID=64648411
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810878967.8A Pending CN109035329A (en) | 2018-08-03 | 2018-08-03 | Camera pose estimation optimization method based on deep features |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109035329A (en) |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170249491A1 (en) * | 2011-08-30 | 2017-08-31 | Digimarc Corporation | Methods and arrangements for identifying objects |
| CN107833253A (en) * | 2017-09-22 | 2018-03-23 | 北京航空航天大学青岛研究院 | A kind of camera pose refinement method towards the generation of RGBD three-dimensional reconstructions texture |
| CN108230337A (en) * | 2017-12-31 | 2018-06-29 | 厦门大学 | A method for implementing a semantic SLAM system based on a mobile terminal |
Non-Patent Citations (1)
| Title |
|---|
| HAN CHEN et al.: "Optimization Algorithm Toward Deep Features Based Camera Pose Estimation", ICIG 2017: Image and Graphics * |
Cited By (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110009732B (en) * | 2019-04-11 | 2023-10-03 | 司岚光电科技(苏州)有限公司 | GMS feature matching-based three-dimensional reconstruction method for complex large-scale scene |
| CN110009732A (en) * | 2019-04-11 | 2019-07-12 | 司岚光电科技(苏州)有限公司 | A 3D reconstruction method for complex large-scale scenes based on GMS feature matching |
| CN110033007A (en) * | 2019-04-19 | 2019-07-19 | 福州大学 | Attribute recognition approach is worn clothes based on the pedestrian of depth attitude prediction and multiple features fusion |
| CN110033007B (en) * | 2019-04-19 | 2022-08-09 | 福州大学 | Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion |
| WO2021017314A1 (en) * | 2019-07-29 | 2021-02-04 | 浙江商汤科技开发有限公司 | Information processing method, information positioning method and apparatus, electronic device and storage medium |
| US11983820B2 (en) | 2019-07-29 | 2024-05-14 | Zhejiang Sensetime Technology Development Co., Ltd | Information processing method and device, positioning method and device, electronic device and storage medium |
| CN112101802A (en) * | 2020-09-21 | 2020-12-18 | 广东电网有限责任公司电力科学研究院 | Attitude load data evaluation method and device, electronic equipment and storage medium |
| CN112733761A (en) * | 2021-01-15 | 2021-04-30 | 浙江工业大学 | Human body state matching method based on machine learning |
| CN112733761B (en) * | 2021-01-15 | 2024-03-19 | 浙江工业大学 | Human body state matching method based on machine learning |
| CN112967296A (en) * | 2021-03-10 | 2021-06-15 | 重庆理工大学 | Point cloud dynamic region graph convolution method, classification method and segmentation method |
| CN113299073A (en) * | 2021-04-28 | 2021-08-24 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for identifying illegal parking of vehicle |
| US12039864B2 (en) | 2021-04-28 | 2024-07-16 | Beijing Baidu Netcom Science Technology Co., Ltd. | Method of recognizing illegal parking of vehicle, device and storage medium |
| WO2022252118A1 (en) * | 2021-06-01 | 2022-12-08 | 华为技术有限公司 | Head posture measurement method and apparatus |
| CN116051723B (en) * | 2022-08-03 | 2023-10-20 | 荣耀终端有限公司 | Bundling adjustment method and electronic equipment |
| CN116051723A (en) * | 2022-08-03 | 2023-05-02 | 荣耀终端有限公司 | Bundling adjustment method and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181218 |