
CN118799398A - Visual localization method based on alignment of 3D LoD map and neural wireframe - Google Patents

Visual localization method based on alignment of 3D LoD map and neural wireframe

Info

Publication number
CN118799398A
CN118799398A
Authority
CN
China
Prior art keywords
pose
three-dimensional
wireframe
Prior art date
Legal status
Granted
Application number
CN202410953506.8A
Other languages
Chinese (zh)
Other versions
CN118799398B (en)
Inventor
颜深
朱珏霖
张茂军
张升岳
肖华欣
彭杨
刘煜
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202410953506.8A
Publication of CN118799398A
Application granted
Publication of CN118799398B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30244 Camera pose
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a visual localization method based on aligning a three-dimensional LoD map with a neural wireframe. The method comprises: uniformly sampling four degrees of freedom centered on an initial pose to generate pose hypotheses along the four sampled dimensions; computing, by a neural wireframe alignment method, the line alignment cost between each pose hypothesis and pre-constructed three-dimensional wireframe points, combining the line alignment costs over the four dimensions in a grid to form a pose cost volume, and computing a probability distribution volume from it; determining the pose sampling range of the next level from the probability distribution volume to generate the next level's pose hypotheses, and taking the pose selected at the last feature level as the candidate selected pose; and mapping the multi-level features, constructing an optimization objective function for the candidate selected pose from the mapped features and the three-dimensional wireframe points, and solving it to obtain the final pose. The method improves visual localization accuracy while keeping memory consumption low.

Description

Visual localization method based on alignment of a 3D LoD map with a neural wireframe

Technical Field

The present application relates to the field of visual localization, and in particular to a visual localization method based on aligning a three-dimensional LoD map with a neural wireframe.

Background Art

Visual localization is the process of determining the position and orientation (that is, the camera pose) of a given image relative to a known map. It is a fundamental problem in many 3D computer vision applications, from autonomous driving and unmanned aerial vehicle (UAV) navigation to augmented reality. Current state-of-the-art visual localization methods typically match pixels in a query image against points in a pre-built, high-quality 3D map, usually obtained from structure from motion (SfM) or a 3D textured model; the camera pose is then computed with a Perspective-n-Point (PnP) solver inside a RANSAC loop.
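
For orientation only, this conventional baseline (2D-3D matching followed by PnP inside RANSAC) can be sketched in a few lines with OpenCV. The correspondences and intrinsics are assumed to come from an upstream matcher and calibration, and the thresholds are illustrative, not taken from the application:

```python
import numpy as np
import cv2

def pnp_ransac_pose(points_3d, points_2d, K):
    """Conventional baseline: recover a camera pose from 2D-3D matches
    with Perspective-n-Point inside a RANSAC loop.

    points_3d: (N, 3) map points; points_2d: (N, 2) pixel locations;
    K: (3, 3) camera intrinsics. Returns a rotation matrix and translation.
    """
    dist = np.zeros(4)  # assume an undistorted query image
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64), points_2d.astype(np.float64),
        K, dist, iterationsCount=1000, reprojectionError=3.0)
    if not ok:
        raise RuntimeError("PnP-RANSAC found no pose")
    R, _ = cv2.Rodrigues(rvec)  # axis-angle vector -> 3x3 rotation matrix
    return R, tvec
```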

However, building high-quality 3D maps with photogrammetry is expensive at world scale and requires frequent data updates to capture temporal changes in visual appearance. The storage cost of such maps is also high, which poses a major challenge for on-device deployment on mobile platforms such as phones and drones. Moreover, high-resolution 3D maps disclose detailed information about the mapped area, raising serious concerns about security and privacy protection.

Summary of the Invention

In view of the above technical problems, it is necessary to provide a visual localization method based on aligning a three-dimensional LoD map with a neural wireframe that improves localization accuracy while requiring little memory.

A visual localization method based on aligning a three-dimensional LoD map with a neural wireframe, the method comprising:

acquiring a query image captured by a UAV over a three-dimensional city LoD map and a prior pose from the UAV's sensors, and constructing a visual localization model comprising a feature extraction module, a pose selection module, and a pose optimization module;

in the feature extraction module, extracting features from the query image at multiple levels with a convolutional neural network to obtain multi-level features;

in the pose selection module, defining an initial pose at the current feature level, uniformly sampling four degrees of freedom centered on the initial pose, and generating pose hypotheses along the four sampled dimensions; computing the line alignment cost between each pose hypothesis and the pre-constructed three-dimensional wireframe points by the neural wireframe alignment method, and combining the line alignment costs over the four dimensions in a grid to obtain a pose cost volume; applying a softmax function to the pose cost volume to obtain a probability distribution volume; applying an argmax operation to the probability distribution volume to obtain the selected pose; using the variance of the current level's probability distribution volume to determine the pose sampling range of the next level and generate the next level's pose hypotheses; and taking the pose selected at the last feature level as the candidate selected pose, wherein the initial pose at the first feature level is the prior pose;

in the pose optimization module, mapping the multi-level features, constructing an optimization objective function for the candidate selected pose from the mapped features and the three-dimensional wireframe points, and solving the objective function with the Gauss-Newton method to obtain the final pose;

training the pose selection module with a preset pose selection loss function to obtain a trained pose selection module, and optimizing the pose optimization module with a preset pose optimization loss function to obtain a trained pose optimization module; and

performing visual localization on an input image with the trained visual localization model.

In one embodiment, defining the initial pose at the current feature level includes:

defining the initial pose at the current feature level as $\xi_l = (x_l, y_l, z_l, \theta_l, \phi_l, \psi_l)$, where $(x_l, y_l, z_l)$ denotes the translation in three-dimensional space, $\theta_l$, $\phi_l$, and $\psi_l$ denote the yaw, pitch, and roll angles, respectively, and $l$ denotes the feature level index.

In one embodiment, uniformly sampling the four degrees of freedom centered on the initial pose and generating pose hypotheses along the four sampled dimensions includes:

uniformly sampling the four degrees of freedom centered on the initial pose, generating the pose hypotheses as

$$\hat{\xi}_l^{k}(d) = \xi_l(d) + \left(\frac{2k}{m_l(d)-1} - 1\right) r_l(d), \qquad k = 0, 1, \ldots, m_l(d)-1,$$

where $d \in \{x, y, z, \theta\}$, $(x_l, y_l, z_l)$ denotes the translation in three-dimensional space, $\theta_l$ denotes the yaw angle, $r_l$ denotes the sampling range, $m_l$ denotes the number of samples, and $l$ denotes the feature level index.
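
A minimal sketch of this sampling step, assuming the reconstructed formula above: hypotheses form a uniform 4D grid over (x, y, z, θ) centered on the initial pose. The function and argument names are illustrative:

```python
import numpy as np

def pose_hypothesis_grid(center, ranges, counts):
    """Uniformly sample 4-DoF pose hypotheses around an initial pose.

    center, ranges, counts: dicts keyed by 'x', 'y', 'z', 'theta' holding
    the initial value, sampling half-range r_l, and sample count m_l per
    dimension. Returns an (m_x, m_y, m_z, m_theta, 4) grid of hypotheses.
    """
    axes = [np.linspace(center[d] - ranges[d], center[d] + ranges[d], counts[d])
            for d in ('x', 'y', 'z', 'theta')]
    return np.stack(np.meshgrid(*axes, indexing='ij'), axis=-1)
```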

In one embodiment, computing the line alignment cost between a pose hypothesis and the pre-constructed three-dimensional wireframe points by the neural wireframe alignment method includes:

computing the line alignment cost between the pose hypothesis and the pre-constructed three-dimensional wireframe points as

$$C_l(\hat{\xi}_l) = \sum_{i} F_l\left[\Pi(\hat{\xi}_l, P_i)\right],$$

where $\hat{\xi}_l$ denotes the pose hypothesis, $F_l$ denotes the level-$l$ feature map, $P_i$ denotes the three-dimensional wireframe points, and $\Pi$ denotes the projection operation.
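
The cost can be read off by projecting the wireframe points under a hypothesis and sampling the level-$l$ likelihood map at the projections. The sketch below assumes a pinhole camera model and bilinear interpolation for the sub-pixel lookup; the mean normalization is a convenience, since it does not change the softmax or argmax over hypotheses:

```python
import numpy as np

def bilinear_sample(feat, uv):
    """Sub-pixel lookup of a single-channel feature map at (u, v) pixels."""
    h, w = feat.shape
    u = np.clip(uv[:, 0], 0, w - 2)
    v = np.clip(uv[:, 1], 0, h - 2)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = u - u0, v - v0
    return (feat[v0, u0] * (1 - du) * (1 - dv) + feat[v0, u0 + 1] * du * (1 - dv)
            + feat[v0 + 1, u0] * (1 - du) * dv + feat[v0 + 1, u0 + 1] * du * dv)

def alignment_cost(feat, K, R, t, points_3d):
    """Score one pose hypothesis: mean wireframe likelihood at the projections
    of the 3D wireframe points under pose (R, t); points are assumed to lie
    in front of the camera."""
    cam = (R @ points_3d.T + t.reshape(3, 1)).T   # world -> camera frame
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                   # perspective division
    return bilinear_sample(feat, uv).mean()
```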

In one embodiment, computing the variance of the current feature level's probability distribution volume includes:

since the pose hypotheses $\hat{\xi}_l$, the pose cost volume $C_l$, and the probability distribution volume $P_l$ share the same data structure, flattening them and indexing them by $t$, the variance $v_l$ at level $l$ is computed as

$$v_l = \sum_{t} P_{l-1}(t)\left(\hat{\xi}_{l-1}(t) \ominus \xi^{sel}_{l-1}\right)^2,$$

where $l$ denotes the current feature level index, $P_{l-1}$ denotes the probability distribution volume at level $l-1$, $\xi^{sel}_{l-1}$ denotes the pose selected at level $l-1$, and $\ominus$ denotes subtraction applied separately along the $(x, y, z, \theta)$ dimensions.

In one embodiment, determining the pose sampling range of the next level from the variance of the current feature level's probability distribution volume includes:

computing the standard deviation from the variance as $\sigma_l = \sqrt{v_l}$, and determining the pose sampling range as $r_l = 2\lambda \cdot \sigma_l$, where $\lambda$ is a hyperparameter adjusting the length of the sampling range and $l$ denotes the feature level index.

In one embodiment, designing the optimization objective function for the candidate selected pose from the mapped features and the three-dimensional wireframe points includes:

designing the optimization objective function for the candidate selected pose from the mapped features and the three-dimensional wireframe points as

$$\xi^* = \arg\min_{\xi} \sum_i \left(1 - F_{rf}\left[\Pi(\xi, P_i)\right]\right)^2,$$

where $F_{rf}$ denotes the mapped features, $P_i$ denotes the three-dimensional wireframe points, $\xi^* = (R^*, t^*)$ denotes the final pose, and $\Pi$ denotes the projection operation.

In one embodiment, the preset pose selection loss function is

$$\mathcal{L}_{select} = -\sum_{l} \log P_l\left(\xi^{gt}\right),$$

where $l \in \{1, 2, 3\}$, $P_l$ denotes the probability distribution volume, and $\xi^{gt}$ denotes the ground-truth pose; the loss is evaluated at the pose hypothesis closest to $\xi^{gt}$.

In one embodiment, the preset pose optimization loss function is

$$\mathcal{L}_{opt} = \sum_i \rho\left(\left\|\Pi(\xi^*, P_i) - \Pi(\xi^{gt}, P_i)\right\|\right),$$

where $\xi^{gt}$ denotes the ground-truth pose, $\rho$ denotes the Huber robust kernel, $\xi^* = (R^*, t^*)$ denotes the final pose, and $\Pi$ denotes the projection operation.

With the above visual localization method based on aligning a three-dimensional LoD map with a neural wireframe, the present application builds a visual localization model over a three-dimensional city LoD map and performs localization from a UAV query image and its sensor prior pose. Compared with existing complex 3D representations, the application estimates the UAV's pose against a level-of-detail (LoD) map, providing a simple, accessible, and privacy-friendly scene representation. The model first extracts multi-level features from the query image and then applies a hierarchical pose estimation scheme that uses several small pose volumes, rather than one large cost volume, to compute a high-quality pose progressively from coarse to fine. During this hierarchy, an adaptive sampling strategy is adopted in which the variance-based uncertainty of the previous stage determines the sampling range used to build the next stage's pose cost volume; this adaptive process induces a reasonable, fine-grained partition of the pose space, significantly improving the final pose output, raising localization accuracy, and keeping memory consumption low. Finally, after the coarse-to-fine stage, the multi-level features are mapped, an optimization objective function for the candidate selected pose is constructed from the mapped features and the three-dimensional wireframe points, and the objective is solved with the Gauss-Newton method to obtain the final pose, correcting small errors in the initial gravity prior and thereby improving overall pose accuracy.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram of a visual localization method based on aligning a three-dimensional LoD map with a neural wireframe in one embodiment;

FIG. 2 is a schematic diagram of the framework of the visual localization method in one embodiment;

FIG. 3 is a schematic diagram of the uncertainty-based sampling range estimation process in one embodiment.

DETAILED DESCRIPTION

To make the purpose, technical solution, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and do not limit it.

In one embodiment, as shown in FIG. 1 and FIG. 2, a visual localization method based on aligning a three-dimensional LoD map with a neural wireframe is provided, comprising the following steps:

Step 102: acquire a query image captured by the UAV over the three-dimensional city LoD map and a prior pose from the UAV's sensors, and construct a visual localization model comprising a feature extraction module, a pose selection module, and a pose optimization module.
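
The three-module structure can be summarized in a skeleton like the following; the class and method names are hypothetical, and each module is detailed in the steps below:

```python
class LoDLocalizer:
    """Sketch of the three-module pipeline described above; module
    internals follow in later steps, names are illustrative only."""

    def __init__(self, extractor, selector, refiner):
        self.extractor = extractor   # CNN multi-level feature extraction
        self.selector = selector     # coarse-to-fine pose selection
        self.refiner = refiner       # Gauss-Newton pose optimization

    def localize(self, query_image, prior_pose, wireframe_points):
        feats = self.extractor(query_image)               # levels l = 1, 2, 3
        candidate = self.selector(feats, prior_pose, wireframe_points)
        return self.refiner(feats, candidate, wireframe_points)
```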

Step 104: in the feature extraction module, extract features from the query image at multiple levels with a convolutional neural network to obtain multi-level features.

A convolutional neural network with a UNet architecture extracts multi-level features from the query image. High-dimensional feature maps are maintained to encapsulate the rich visual information at each level, and each map is then reduced to a single channel in which every pixel value represents the likelihood of belonging to a wireframe. The resulting multi-level features are denoted $F_l$, where $l = \{1, 2, 3\}$ is the level index.
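
A minimal stand-in for this extractor is sketched below in PyTorch. It keeps only the encoder half (the UNet decoder and skip connections are omitted for brevity), and ends each level in a 1x1 head plus sigmoid so every pixel reads as a wireframe likelihood; the channel widths are assumptions:

```python
import torch
import torch.nn as nn

class WireframeFeatureNet(nn.Module):
    """Three feature levels, each collapsed to one wireframe-likelihood
    channel; a stand-in for the UNet-based extractor described above."""

    def __init__(self, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU())
        # 1x1 heads collapse each level to a single likelihood channel
        self.heads = nn.ModuleList([nn.Conv2d(c, 1, 1) for c in (base, base * 2, base * 4)])

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        return [torch.sigmoid(h(f)) for h, f in zip(self.heads, (f1, f2, f3))]
```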

Step 106: in the pose selection module, define the initial pose at the current feature level; uniformly sample the four degrees of freedom centered on the initial pose to generate pose hypotheses along the four sampled dimensions; compute the line alignment cost between each pose hypothesis and the pre-constructed three-dimensional wireframe points by the neural wireframe alignment method, and combine the line alignment costs over the four dimensions in a grid to obtain a pose cost volume; apply a softmax function to the pose cost volume to obtain a probability distribution volume; apply an argmax operation to the probability distribution volume to obtain the selected pose; use the variance of the current level's probability distribution volume to determine the pose sampling range of the next level and generate the next level's pose hypotheses; and take the pose selected at the last feature level as the candidate selected pose. The initial pose at the first feature level is the prior pose.

After feature extraction, a cost volume is built for the pose hypotheses sampled around the current pose, and the pose with the highest probability is then selected at each level. To keep the sampling effective, the uncertainty of the pose selected at the current level determines the pose sampling range of the next level. The initial pose at the current feature level is defined first. Because the pitch and roll given by the prior's gravity direction are already accurate, only four degrees of freedom (position and yaw) are uniformly sampled around the initial pose, assuming that the gravity direction provided by the inertial unit carries only a small error. For each generated pose hypothesis, the LoD building wireframe is projected onto the query image plane, and the hypothesis is scored by the alignment between the projected wireframe and the predicted wireframe. Combining these alignment costs in a grid over the four dimensions $(x, y, z, \theta)$ yields a 4D pose cost volume $C_l$ of size $[m_l(x) \times m_l(y) \times m_l(z) \times m_l(\theta)]$. Applying a softmax to $C_l$ gives the probability distribution volume $P_l$, a probability density over poses that can be used for pose estimation either by classification (argmax) or by regression (expectation); applying argmax to $P_l$ selects the maximum-probability pose $\xi^{sel}_l$. The whole procedure interprets the pose estimate as a probability distribution parameterized by a learnable network that predicts the two-dimensional wireframe of the query image.

In the uncertainty-based sampling range estimation, the pose selection uncertainty of the previous level decides the sampling range of the current level during the coarse-to-fine process. This strategy progressively subdivides the pose sampling space and thereby improves the precision of pose selection. Specifically, for $l = 1$ the pose sampling range $[r_1(x), r_1(y), r_1(z), r_1(\theta)] = [r_p(x), r_p(y), r_p(z), r_p(\theta)]$ is set by inspecting the maximum divergence between the prior and ground-truth poses in the UAVD4L-LoD dataset. For $l = \{2, 3\}$, the variance of the probability distribution volume $P_{l-1}$ at level $l-1$ determines the pose sampling range $r_l$: since the pose hypotheses, the pose cost volume $C_l$, and the probability distribution volume $P_l$ share the same data structure, they are flattened and indexed by $t$, and the variance at level $l$ is computed as

$$v_l = \sum_{t} P_{l-1}(t)\left(\hat{\xi}_{l-1}(t) \ominus \xi^{sel}_{l-1}\right)^2,$$

where $\ominus$ denotes subtraction applied separately along the $(x, y, z, \theta)$ dimensions. The corresponding standard deviation is $\sigma_l = \sqrt{v_l}$, and the pose sampling range is $r_l = 2\lambda \cdot \sigma_l$, where $\lambda$ is a hyperparameter adjusting the length of the sampling range. This uncertainty-based sampling range estimation process is visualized in FIG. 3.
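
The selection-and-resampling logic of one level can be condensed as follows, assuming the flattened-volume formulation above; the shapes and names are illustrative:

```python
import numpy as np

def select_and_resample(cost_volume, hypotheses, lam=1.5):
    """Turn a 4D pose cost volume into (selected pose, next sampling range).

    cost_volume: (mx, my, mz, mt) alignment scores; hypotheses: matching
    (mx, my, mz, mt, 4) grid. Implements softmax -> argmax -> per-axis
    variance -> r = 2 * lambda * sigma, as described above."""
    flat_cost = cost_volume.reshape(-1)
    probs = np.exp(flat_cost - flat_cost.max())
    probs /= probs.sum()                             # softmax over all hypotheses
    flat_hyp = hypotheses.reshape(-1, 4)
    best = flat_hyp[np.argmax(probs)]                # maximum-probability pose
    diff = flat_hyp - best                           # per-axis deviation (x, y, z, theta)
    var = (probs[:, None] * diff ** 2).sum(axis=0)   # probability-weighted variance
    next_range = 2.0 * lam * np.sqrt(var)            # r_l = 2 * lambda * sigma_l
    return best, next_range
```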

Through the hierarchical pose estimation scheme of the present application, several small pose volumes, rather than one large cost volume, are used to compute a high-quality pose progressively from coarse to fine. During the hierarchy, an adaptive sampling strategy is adopted in which the variance-based uncertainty of the previous stage determines the sampling range used to build the next stage's pose cost volume. This adaptive process induces a reasonable, fine-grained partition of the pose space, significantly improving the final pose output, raising visual localization accuracy, and keeping memory consumption low.

Step 108: in the pose optimization module, map the multi-level features, construct an optimization objective function for the candidate selected pose from the mapped features and the three-dimensional wireframe points, and solve the objective function with the Gauss-Newton method to obtain the final pose.

Starting from the pose selected in the previous stage, a refined wireframe probability map $F_{rf}$, further extracted from the level-3 feature map $F_3$ by a post-processing convolutional network, is used to optimize the pose $\xi^* = (R^*, t^*)$ so that the 3D wireframe aligns with the 2D predicted wireframe. The optimization objective function for the candidate selected pose is constructed from the mapped features and the three-dimensional wireframe points as

$$E(\xi) = \sum_i f_i(\xi)^2, \qquad f_i(\xi) = 1 - F_{rf}\left[\Pi(\xi, P_i)\right],$$

where $P_i$ denotes the three-dimensional wireframe points and $\Pi$ denotes the projection operation. Minimizing this function moves the projected 3D wireframe points toward 2D positions with higher predicted probability.

The pose update for $\xi^*$ derived from the Gauss-Newton method is

$$\Delta\xi = -\left(\sum_i J_i^{\top} J_i\right)^{-1} \sum_i J_i^{\top} f_i, \qquad R^* \leftarrow R^* \cdot \exp(\Delta\xi_r), \qquad t^* \leftarrow t^* + \Delta\xi_t,$$

where $\Delta\xi = (\Delta\xi_r, \Delta\xi_t) \in \mathbb{R}^6$ is the six-dimensional update vector, $\Delta\xi_r$ is the rotation component, and $\Delta\xi_t$ is the translation component. Using the exponential map of the Lie algebra $\mathfrak{so}(3)$, the rotation component $\Delta\xi_r$ is converted into a 3×3 rotation matrix. $J_i$ denotes the Jacobian of the residual function $f_i$ with respect to the pose parameters.
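
A minimal sketch of this refinement follows, with a finite-difference Jacobian standing in for the analytic $J_i$ and a small damping term added for numerical stability; both are assumptions for illustration, not part of the described method:

```python
import numpy as np

def so3_exp(w):
    """Rodrigues formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def gauss_newton_refine(residual_fn, R, t, n_iters=10, eps=1e-5):
    """Refine (R, t) by Gauss-Newton on a stacked residual vector.

    residual_fn(R, t) -> (N,) residuals, e.g. 1 - F_rf at the projected
    wireframe points. The 6-DoF update (rotation | translation) uses
    finite-difference Jacobians for brevity."""
    for _ in range(n_iters):
        f = residual_fn(R, t)
        J = np.zeros((f.size, 6))
        for j in range(6):
            d = np.zeros(6)
            d[j] = eps
            f_p = residual_fn(R @ so3_exp(d[:3]), t + d[3:])
            J[:, j] = (f_p - f) / eps
        # normal equations: (J^T J) dxi = -J^T f, lightly damped
        dxi = np.linalg.solve(J.T @ J + 1e-8 * np.eye(6), -J.T @ f)
        R = R @ so3_exp(dxi[:3])     # R* <- R* . exp(dxi_r)
        t = t + dxi[3:]              # t* <- t* + dxi_t
        if np.linalg.norm(dxi) < 1e-8:
            break
    return R, t
```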

Step 110: train the pose selection module with the preset pose selection loss function to obtain the trained pose selection module; train the pose optimization module with the preset pose optimization loss function to obtain the trained pose optimization module; and perform visual localization on the input image with the trained visual localization model.

The preset pose selection loss function is a negative log-likelihood loss that constrains the pose selection module: the closer a selected pose is to the ground truth, the higher the probability assigned to it, which improves the accuracy of pose selection. The preset pose optimization loss function is the loss of the pose optimization module (the Gauss-Newton iterative refinement) and constrains the pose adjustment through the reprojection error. Combining the two loss functions optimizes the final pose result from coarse to fine.
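
Under these descriptions, the two losses can be sketched in PyTorch as follows; the ground-truth hypothesis index and the Huber delta are assumed inputs, not values stated in the application:

```python
import torch
import torch.nn.functional as F

def pose_selection_loss(cost_volumes, gt_indices):
    """Negative log-likelihood over each level's flattened pose volume,
    pushing probability mass toward the hypothesis nearest the ground truth.

    cost_volumes: list of flattened (T_l,) score tensors for l = 1, 2, 3;
    gt_indices: per-level index of the hypothesis closest to the ground truth."""
    loss = 0.0
    for c, idx in zip(cost_volumes, gt_indices):
        log_p = F.log_softmax(c, dim=0)   # cost volume -> log-probabilities
        loss = loss - log_p[idx]
    return loss

def pose_refinement_loss(proj_pred, proj_gt, delta=1.0):
    """Huber-robust reprojection loss between wireframe points projected
    under the refined pose and under the ground-truth pose."""
    return F.huber_loss(proj_pred, proj_gt, delta=delta)
```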

In the above visual localization method based on aligning a three-dimensional LoD map with a neural wireframe, the present application builds a visual localization model over a three-dimensional city LoD map and performs localization from a UAV query image and its sensor prior pose. Compared with existing complex 3D representations, the application estimates the UAV's pose against a level-of-detail (LoD) map, providing a simple, accessible, and privacy-friendly scene representation. The model first extracts multi-level features from the query image and then applies a hierarchical pose estimation scheme that uses several small pose volumes, rather than one large cost volume, to compute a high-quality pose progressively from coarse to fine. During this hierarchy, an adaptive sampling strategy is adopted in which the variance-based uncertainty of the previous stage determines the sampling range used to build the next stage's pose cost volume; this adaptive process induces a reasonable, fine-grained partition of the pose space, significantly improving the final pose output, raising localization accuracy, and keeping memory consumption low. Finally, after the coarse-to-fine stage, the multi-level features are mapped, an optimization objective function for the candidate selected pose is constructed from the mapped features and the three-dimensional wireframe points, and the objective is solved with the Gauss-Newton method to obtain the final pose, correcting small errors in the initial gravity prior and thereby improving overall pose accuracy.

In one embodiment, defining the initial pose at the current feature level includes:

defining the initial pose at the current feature level as $\xi_l = (x_l, y_l, z_l, \theta_l, \phi_l, \psi_l)$, where $(x_l, y_l, z_l)$ denotes the translation in three-dimensional space, $\theta_l$, $\phi_l$, and $\psi_l$ denote the yaw, pitch, and roll angles, respectively, and $l$ denotes the feature level index.

In one embodiment, uniformly sampling the four degrees of freedom centered on the initial pose and generating pose hypotheses along the four sampled dimensions includes:

uniformly sampling the four degrees of freedom centered on the initial pose, generating the pose hypotheses as

$$\hat{\xi}_l^{k}(d) = \xi_l(d) + \left(\frac{2k}{m_l(d)-1} - 1\right) r_l(d), \qquad k = 0, 1, \ldots, m_l(d)-1,$$

where $d \in \{x, y, z, \theta\}$, $(x_l, y_l, z_l)$ denotes the translation in three-dimensional space, $\theta_l$ denotes the yaw angle, $r_l$ denotes the sampling range, $m_l$ denotes the number of samples, and $l$ denotes the feature level index.

In one embodiment, computing the line alignment cost between a pose hypothesis and the pre-constructed three-dimensional wireframe points by the neural wireframe alignment method includes:

computing the line alignment cost between the pose hypothesis and the pre-constructed three-dimensional wireframe points as

$$C_l(\hat{\xi}_l) = \sum_{i} F_l\left[\Pi(\hat{\xi}_l, P_i)\right],$$

where $\hat{\xi}_l$ denotes the pose hypothesis, $F_l$ denotes the level-$l$ feature map, $P_i$ denotes the three-dimensional wireframe points, $\Pi$ denotes the projection operation, and $[\cdot]$ denotes a sub-pixel interpolation lookup.

In a specific embodiment, the construction of the three-dimensional wireframe points follows the prior art and is not described in further detail in this application.

In one embodiment, computing the variance of the current feature level's probability distribution volume includes:

since the pose hypotheses $\hat{\xi}_l$, the pose cost volume $C_l$, and the probability distribution volume $P_l$ share the same data structure, flattening them and indexing them by $t$, the variance $v_l$ at level $l$ is computed as

$$v_l = \sum_{t} P_{l-1}(t)\left(\hat{\xi}_{l-1}(t) \ominus \xi^{sel}_{l-1}\right)^2,$$

where $l$ denotes the current feature level index, $P_{l-1}$ denotes the probability distribution volume at level $l-1$, $\xi^{sel}_{l-1}$ denotes the pose selected at level $l-1$, and $\ominus$ denotes subtraction applied separately along the $(x, y, z, \theta)$ dimensions.

In one embodiment, determining the pose sampling range of the next level from the variance of the current feature level's probability distribution volume includes:

computing the standard deviation from the variance as $\sigma_l = \sqrt{v_l}$, and determining the pose sampling range as $r_l = 2\lambda \cdot \sigma_l$, where $\lambda$ is a hyperparameter adjusting the length of the sampling range and $l$ denotes the feature level index.

In one embodiment, designing the optimization objective function for the candidate selected pose from the mapped features and the three-dimensional wireframe points includes:

designing the optimization objective function for the candidate selected pose from the mapped features and the three-dimensional wireframe points as

$$\xi^* = \arg\min_{\xi} \sum_i \left(1 - F_{rf}\left[\Pi(\xi, P_i)\right]\right)^2,$$

where $F_{rf}$ denotes the mapped features, $P_i$ denotes the three-dimensional wireframe points, $\xi^* = (R^*, t^*)$ denotes the final pose, and $\Pi$ denotes the projection operation.

In one embodiment, the preset pose selection loss function is

$$\mathcal{L}_{select} = -\sum_{l} \log P_l\left(\xi^{gt}\right),$$

where $l \in \{1, 2, 3\}$, $P_l$ denotes the probability distribution volume, and $\xi^{gt}$ denotes the ground-truth pose; the loss is evaluated at the pose hypothesis closest to $\xi^{gt}$.

In one embodiment, the preset pose optimization loss function is

$$\mathcal{L}_{opt} = \sum_i \rho\left(\left\|\Pi(\xi^*, P_i) - \Pi(\xi^{gt}, P_i)\right\|\right),$$

where $\xi^{gt}$ denotes the ground-truth pose, $\rho$ denotes the Huber robust kernel, $\xi^* = (R^*, t^*)$ denotes the final pose, and $\Pi$ denotes the projection operation.

In a specific embodiment, during model training a fixed random seed is used to subsample the 3D wireframe points $\{P_i\}$ to 2000 points, and owing to CUDA memory constraints the pose sampling counts $m_l(x), m_l(y), m_l(z), m_l(\theta)$ for levels $l = 1, 2, 3$ are uniformly assigned as [13, 7, 3]. The image size is (512, 480) for the UAVD4L dataset and (720, 480) for the Swiss-EPFL dataset. The level-1 pose sampling range $[r_p(x), r_p(y), r_p(z), r_p(\theta)]$ is set to [10, 10, 30, 7.5]. The hyperparameter $\lambda$ is fixed at 1.5. For the UAVD4L-LoD dataset, the subset of synthetic images in UAVD4L [64] that contains buildings is used as training data. For Swiss-EPFL, the model is trained on a combination of the synthetic LHS images from the CrossLoc project and real query images. At inference time the following changes are made: the discrete points retrieved from the 3D wireframe are sampled at 1-meter intervals; the pose sampling counts are increased to $[m_l(x), m_l(y), m_l(z), m_l(\theta)] = [10, 10, 30, 8]$; and $\lambda$ is set to 0.8. Training and inference of the whole network run on 2 NVIDIA RTX 4090 GPUs.

It should be understood that although the steps in the flowchart of FIG. 1 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 1 may comprise multiple sub-steps or stages that are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

A person of ordinary skill in the art can understand that all or part of the processes of the above-described embodiment methods can be completed by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium; when executed, the program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments may be combined arbitrarily. For conciseness of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The above-described embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be defined by the appended claims.

Claims (9)

1. A visual localization method based on aligning a three-dimensional LoD map with a neural wireframe, the method comprising:
acquiring a query image captured by an unmanned aerial vehicle over a three-dimensional city LoD map and a prior pose from a sensor of the unmanned aerial vehicle, and constructing a visual localization model, the visual localization model comprising a feature extraction module, a pose selection module, and a pose optimization module;
extracting, in the feature extraction module, features from the query image at a plurality of levels with a convolutional neural network to obtain multi-level features;
defining, in the pose selection module, an initial pose at the current feature level, uniformly sampling four degrees of freedom centered on the initial pose, and generating pose hypotheses along the four sampled dimensions; computing a line alignment cost between each pose hypothesis and pre-constructed three-dimensional wireframe points by a neural wireframe alignment method, and combining the line alignment costs over the four dimensions in a grid to obtain a pose cost volume; applying a softmax function to the pose cost volume to obtain a probability distribution volume; applying an argmax operation to the probability distribution volume to obtain a selected pose; determining a pose sampling range of the next level from the variance of the current feature level's probability distribution volume to generate the next level's pose hypotheses, and taking the pose selected at the last feature level as a candidate selected pose, wherein the initial pose at the first feature level is the prior pose;
mapping, in the pose optimization module, the multi-level features, constructing an optimization objective function for the candidate selected pose from the mapped features and the three-dimensional wireframe points, and solving the optimization objective function by a Gauss-Newton method to obtain a final pose;
training the pose selection module with a preset pose selection loss function to obtain a trained pose selection module, and optimizing the pose optimization module with a preset pose optimization loss function to obtain a trained pose optimization module; and
performing visual localization on an input image with the trained visual localization model.
2. The method of claim 1, wherein defining the initial pose at the current feature level comprises:
defining the initial pose at the current feature level as $\xi_l = (x_l, y_l, z_l, \theta_l, \phi_l, \psi_l)$, where $(x_l, y_l, z_l)$ denotes the translation in three-dimensional space, $\theta_l$, $\phi_l$, and $\psi_l$ denote the yaw, pitch, and roll angles, respectively, and $l$ denotes the feature level index.
3. The method of claim 1, wherein uniformly sampling the four degrees of freedom centered on the initial pose and generating the pose hypotheses along the four sampled dimensions comprises:
uniformly sampling the four degrees of freedom centered on the initial pose, generating the pose hypotheses as
$$\hat{\xi}_l^{k}(d) = \xi_l(d) + \left(\frac{2k}{m_l(d)-1} - 1\right) r_l(d), \qquad k = 0, 1, \ldots, m_l(d)-1,$$
where $d \in \{x, y, z, \theta\}$, $(x_l, y_l, z_l)$ denotes the translation in three-dimensional space, $\theta_l$ denotes the yaw angle, $r_l$ denotes the sampling range, $m_l$ denotes the number of samples, and $l$ denotes the feature level index.
4. The method of claim 1, wherein computing the line alignment cost between a pose hypothesis and the pre-constructed three-dimensional wireframe points by the neural wireframe alignment method comprises:
computing the line alignment cost between the pose hypothesis and the pre-constructed three-dimensional wireframe points as
$$C_l(\hat{\xi}_l) = \sum_{i} F_l\left[\Pi(\hat{\xi}_l, P_i)\right],$$
where $\hat{\xi}_l$ denotes the pose hypothesis, $F_l$ denotes the level-$l$ feature map, $P_i$ denotes the three-dimensional wireframe points, and $\Pi$ denotes the projection operation.
5. The method of claim 1, wherein the computing of the variance of the probability distribution volume at the current level comprises:
since the pose hypotheses $\hat{\xi}_l$, the pose cost volume $C_l$, and the probability distribution volume $P_l$ share the same data structure, flattening them and indexing them by $t$, the variance $v_l$ at level $l$ being computed as
$$v_l = \sum_{t} P_{l-1}(t)\left(\hat{\xi}_{l-1}(t) \ominus \xi^{sel}_{l-1}\right)^2,$$
where $l$ denotes the current feature level index, $P_{l-1}$ denotes the probability distribution volume at level $l-1$, $\xi^{sel}_{l-1}$ denotes the pose selected at level $l-1$, and $\ominus$ denotes subtraction applied separately along the $(x, y, z, \theta)$ dimensions.
6. The method of claim 1, wherein determining the pose sampling range of the next level from the variance of the current feature level's probability distribution volume comprises:
computing the standard deviation from the variance as $\sigma_l = \sqrt{v_l}$, and determining the pose sampling range as $r_l = 2\lambda \cdot \sigma_l$, where $\lambda$ is a hyperparameter adjusting the length of the sampling range and $l$ denotes the feature level index.
7. The method of claim 1, wherein designing the optimization objective function for the candidate selected pose from the mapped features and the three-dimensional wireframe points comprises:
designing the optimization objective function for the candidate selected pose from the mapped features and the three-dimensional wireframe points as
$$\xi^* = \arg\min_{\xi} \sum_i \left(1 - F_{rf}\left[\Pi(\xi, P_i)\right]\right)^2,$$
where $F_{rf}$ denotes the mapped features, $P_i$ denotes the three-dimensional wireframe points, $\xi^* = (R^*, t^*)$ denotes the final pose, and $\Pi$ denotes the projection operation.
8. The method according to claim 1, wherein the preset pose selection loss function is
$$\mathcal{L}_{select} = -\sum_{l} \log P_l\left(\xi^{gt}\right),$$
where $l \in \{1, 2, 3\}$, $P_l$ denotes the probability distribution volume, and $\xi^{gt}$ denotes the ground-truth pose, the loss being evaluated at the pose hypothesis closest to $\xi^{gt}$.
9. The method according to claim 1, wherein the preset pose optimization loss function is
$$\mathcal{L}_{opt} = \sum_i \rho\left(\left\|\Pi(\xi^*, P_i) - \Pi(\xi^{gt}, P_i)\right\|\right),$$
where $\xi^{gt}$ denotes the ground-truth pose, $\rho$ denotes the Huber robust kernel, $\xi^* = (R^*, t^*)$ denotes the final pose, and $\Pi$ denotes the projection operation.
CN202410953506.8A 2024-07-16 2024-07-16 Visual localization method based on 3D LoD map and neural wireframe alignment Active CN118799398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410953506.8A CN118799398B (en) 2024-07-16 2024-07-16 Visual localization method based on 3D LoD map and neural wireframe alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410953506.8A CN118799398B (en) 2024-07-16 2024-07-16 Visual localization method based on 3D LoD map and neural wireframe alignment

Publications (2)

Publication Number Publication Date
CN118799398A 2024-10-18
CN118799398B (en) 2025-08-29

Family

ID=93034772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410953506.8A Active CN118799398B (en) 2024-07-16 2024-07-16 Visual localization method based on 3D LoD map and neural wireframe alignment

Country Status (1)

Country Link
CN (1) CN118799398B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10861184B1 (en) * 2017-01-19 2020-12-08 X Development Llc Object pose neural network system
CN116662600A (en) * 2023-06-08 2023-08-29 北京科技大学 Visual positioning method based on lightweight structured line map
CN117036876A (en) * 2023-07-16 2023-11-10 西北工业大学 A generalizable target re-identification model construction method based on three-dimensional perspective alignment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yves Roggo et al., "Deep learning for continuous manufacturing of pharmaceutical solid dosage form", European Journal of Pharmaceutics and Biopharmaceutics, vol. 153, 11 June 2020, XP086209608, DOI: 10.1016/j.ejpb.2020.06.002 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119887924A (en) * 2025-01-10 2025-04-25 中国人民解放军国防科技大学 Aerial visual positioning method and device on low-detail 3D map

Also Published As

Publication number Publication date
CN118799398B (en) 2025-08-29

Similar Documents

Publication Publication Date Title
CN114424250B (en) Structural modeling
US11145078B2 (en) Depth information determining method and related apparatus
JP7723159B2 (en) Image Processing Using Self-Attention Based Neural Networks
US11403812B2 (en) 3D object reconstruction method, computer apparatus and storage medium
WO2022052782A1 (en) Image processing method and related device
CN113192142A (en) High-precision map construction method and device in complex environment and computer equipment
CN114820935B (en) Three-dimensional reconstruction method, device, equipment and storage medium
US12086965B2 (en) Image reprojection and multi-image inpainting based on geometric depth parameters
US20230019499A1 (en) Image processing system and method
US12260582B2 (en) Image processing system and method
CN116518981A (en) Aircraft visual navigation method based on deep learning matching and Kalman filtering
US20230360327A1 (en) Generating three-dimensional representations for digital objects utilizing mesh-based thin volumes
US11625846B2 (en) Systems and methods for training a machine-learning-based monocular depth estimator
EP4455875A1 (en) Feature map generation method and apparatus, storage medium, and computer device
CN111444923A (en) Method and device for image semantic segmentation in natural scenes
CN118799398B (en) Visual localization method based on 3D LoD map and neural wireframe alignment
CN109102524A (en) Tracking method and tracking device for image feature points
CN109543634B (en) Data processing method, device, electronic device and storage medium in positioning process
CN118470250A (en) Building three-dimensional reconstruction method, device and equipment based on single remote sensing image
CN109784353B (en) Processor-implemented method, device, and storage medium
CN113240802B (en) Three-dimensional reconstruction whole-house virtual dimension installing method, device, equipment and storage medium
CN116071396B (en) A synchronous positioning method
CN115170826B (en) Fast optical flow estimation method and storage medium for small moving targets based on local search
Ebrahimikia et al. Orthophoto improvement using urban-SnowflakeNet
CN116385532A (en) UAV positioning method, device, UAV and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant