
CN111340850A - UAV ground target tracking method based on a Siamese network and central logistic loss - Google Patents

UAV ground target tracking method based on a Siamese network and central logistic loss (CN111340850A)

Info

Publication number: CN111340850A
Application number: CN202010198544.9A
Authority: CN (China)
Prior art keywords: network, frame image, target, feature extraction, search
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 林白, 耿洋洋, 李冬冬, 蒯杨柳
Assignees (current and original): System General Research Institute, Academy of Systems Engineering, Academy of Military Sciences; National University of Defense Technology
Priority/filing date: 2020-03-20
Publication date: 2020-06-26

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis → G06T 7/20 Analysis of motion → G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement → G06T 2207/10 Image acquisition modality → G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details → G06T 2207/20081 Training; Learning
    • G06T 2207/20 Special algorithmic details → G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a UAV ground target tracking method based on a Siamese network and a central logistic loss, which belongs to the technical field of computer image processing and aims to solve problems in existing target tracking techniques such as large numbers of network parameters, heavy computation, and imbalance between positive and negative training samples. The method comprises the following steps: extracting a target feature map from a first frame image with a known target position using a first feature extraction network, and extracting a search feature map from a second frame image using a second feature extraction network; computing the cross-correlation between the search region of the second frame image and the target region of the first frame image from the target feature map and the search feature map to obtain a score response map for the second frame image, and then obtaining the target position in the second frame image from that score response map. The first feature extraction network and the second feature extraction network are the two branches of a Siamese convolutional network, each composed of a lightweight convolutional neural network.

Description

A UAV ground target tracking method based on a Siamese network and central logistic loss

Technical Field

The invention belongs to the technical field of computer image processing, and in particular relates to a method for tracking ground targets from unmanned aerial vehicles (UAVs) based on a Siamese network and a central logistic loss.

Background

Visual tracking algorithms based on deep learning, such as the fully convolutional Siamese network tracker shown in Figure 1, have been widely adopted by developers and users.

In this method, the template image extracted from the first frame and the search image extracted from each subsequent frame are fed into two subnetworks that extract high-level semantic features; the two feature maps are then cross-correlated to obtain the similarity of the template image at every position of the search image. The parameters of the two subnetworks are typically shared and can be learned offline from training data. High-level semantic features carry rich category-related information and are therefore strongly robust to changes in target appearance caused by occlusion, distortion, and the like. Moreover, the network does not need to be updated during tracking, which greatly reduces computation and keeps the algorithm real-time. The network has two inputs, a target template image and a search region image, and both pass through the parameter-sharing branches of a Siamese neural network for feature extraction.
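As a rough sketch of this pipeline (illustrative only, not the patent's implementation; the backbone is passed in as a placeholder and the tensor shapes are assumptions), the dense cross-correlation can be written in a few lines of PyTorch, since convolving the search features with the template features as a kernel scores every position at once:

```python
import torch
import torch.nn.functional as F

def siamese_score_map(backbone, template, search):
    """Cross-correlate template features against search features.

    template: (1, 3, Ht, Wt) exemplar crop with the known target;
    search:   (1, 3, Hs, Ws) larger search-region crop.
    """
    z = backbone(template)   # (1, C, h, w) target feature map
    x = backbone(search)     # (1, C, H, W) search feature map, H >= h
    # Using z as the convolution kernel computes the similarity of the
    # template at every position of the search region in one call.
    return F.conv2d(x, z)    # (1, 1, H - h + 1, W - w + 1) score map
```

Because the two branches share parameters, the same `backbone` module serves both inputs.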

However, such methods mainly target general-purpose visual tracking tasks and cannot satisfy UAV hardware platforms with limited computing and storage resources. First, convolutional neural networks contain large numbers of weight parameters, and storing them places heavy demands on device memory. Second, the computing resources of embedded hardware platforms such as UAVs are limited, making efficient, real-time convolution computation difficult to achieve.

Moreover, during actual tracking, such implementations must densely detect the target at every position of the search region. In UAV aerial imagery, the search region generally contains many negative samples from simple background (region 3 in Figure 2), a few hard negative samples (region 1 in Figure 2), and positive samples containing the foreground target (region 2 in Figure 2). The large number of easy background negatives makes the training samples imbalanced and dominates network training, degrading the model.

Summary of the Invention

Aiming at the characteristics of UAV hardware platforms, the present invention provides a UAV ground target tracking method based on a Siamese network and a central logistic loss, so as to solve problems in existing target tracking techniques such as large numbers of network parameters, heavy computation, and imbalance between positive and negative training samples.

According to a first aspect of the present invention, a training method for a UAV ground-target visual tracking model is provided. The tracking model is a Siamese convolutional network whose two branches are a first feature extraction network and a second feature extraction network. The training method comprises:

obtaining a video sequence dataset comprising pairs of template images and search images;

extracting a target feature map from the template image with the first feature extraction network, and a search feature map from the search image with the second feature extraction network;

computing the cross-correlation between the search region of the search image and the target region of the template image from the target feature map and the search feature map, yielding a score response map for the search image;

computing the difference between the score response map and the ground truth according to the central logistic loss function, yielding a difference result; and

back-propagating the difference result to adjust the weights of each layer of the Siamese network;

wherein the first feature extraction network and the second feature extraction network are each composed of a lightweight convolutional neural network.

Optionally, the lightweight convolutional neural network is a MobileNet V2 model.

Optionally, the central logistic loss function is:

(The formula appears only as an image in the original; it is reconstructed here from the surrounding definitions as a standard logistic loss weighted by the modulation factor and averaged over the score map.)

$$L(y, v) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{a}{1+\exp\left(b\,y_{ij}v_{ij}\right)}\,\log\left(1+\exp\left(-y_{ij}v_{ij}\right)\right)$$

where v ∈ R^{m×n} is the score map output by the network and y ∈ {+1, -1} is the manually annotated ground truth. The term a/(1+exp(b·yv)) is a modulation factor on the logistic loss, which adaptively adjusts each training sample's contribution to the training loss according to the input yv.

Further, when yv > 0, the modulation factor assigns the logistic loss a first weight; when yv < 0, it assigns a second weight, the first weight being smaller than the second weight.

According to a second aspect of the present invention, a UAV ground-target visual tracking method comprises:

extracting a target feature map from a first frame image with a known target position using the first feature extraction network, and a search feature map from a second frame image using the second feature extraction network;

computing the cross-correlation between the search region of the second frame image and the target region of the first frame image from the target feature map and the search feature map to obtain a score response map for the second frame image, and then obtaining the target position in the second frame image from that score response map;

wherein the first and second feature extraction networks are the two branches of a Siamese convolutional network and are each composed of a lightweight convolutional neural network, the lightweight convolutional neural network being a MobileNet V2 model.

According to a third aspect of the present invention, a UAV ground-target visual tracking apparatus comprises:

an identification unit for extracting the target feature map of a first frame image with a known target position using the first feature extraction network, and the search feature map of a second frame image using the second feature extraction network;

a computation unit for computing the cross-correlation between the search region of the second frame image and the target region of the first frame image from the target feature map and the search feature map, yielding the score response map of the second frame image;

a determination unit for obtaining the target position in the second frame image from the score response map of the second frame image;

wherein the first and second feature extraction networks in the identification unit are the two branches of a Siamese convolutional network and are each composed of a lightweight convolutional neural network.

According to a fourth aspect of the present invention, an electronic device comprises:

at least one processor; and

a memory communicatively connected to the processor and storing instructions executable by the processor; when the instructions are executed by the processor, the processor performs the UAV ground-target visual tracking method or the training method.

According to a fifth aspect of the present invention, a readable storage medium stores a computer program which, when executed by a processor, implements the UAV ground-target visual tracking method or the training method.

The present invention adopts the MobileNet V2 model, a lightweight network, as the feature extraction subnetwork at the front end of the deep framework, thereby reducing the computational complexity and the number of parameters of the convolutional neural network. At the same time, it maintains a good balance between processing speed and accuracy, so it can adapt to the limited storage and computing resources of UAV hardware platforms.

In addition, the present invention applies different weights to different training samples in the search region through the central logistic loss function, which resolves the imbalance between positive and negative training samples, avoids network degradation during offline training, and makes the learned convolutional features more discriminative.

Brief Description of the Drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required in their description are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic diagram of a conventional Siamese network structure;

Figure 2 shows an actual target tracking scene containing easy negative samples (region 3), hard negative samples (region 1), and positive samples containing the foreground target (region 2);

Figure 3 is a schematic structural diagram of the visual tracking network model according to an embodiment of the present invention;

Figure 4 is a schematic diagram of the MobileNet V2 network structure;

Figure 5 is a schematic flowchart of the visual tracking network training and tracking method according to an embodiment of the present invention;

Figure 6 is a schematic flowchart of the UAV ground-target visual tracking method according to an embodiment of the present invention;

Figure 7 is a schematic structural diagram of the UAV ground-target visual tracking apparatus according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the invention. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.

In the following description, a UAV (Unmanned Aerial Vehicle) mainly refers to an unmanned aircraft operated by radio remote control equipment and an on-board program control device, or operated fully or intermittently autonomously by an onboard computer.

Figure 3 shows a Siamese-network-based visual tracking network constructed according to an embodiment of the present invention.

As shown in Figure 3, the visual tracking network comprises a first branch and a second branch with identical structure, and each branch contains a feature extraction network for extracting an image's feature map. In the present invention, the feature extraction network of each branch is composed of a lightweight convolutional neural network. According to one embodiment of the invention, this lightweight convolutional neural network is the MobileNet V2 model.

Figure 4 shows the network structure of the MobileNet V2 model. MobileNet V2 uses depthwise separable convolutions as efficient building blocks. In addition, MobileNet V2 introduces two new architectural features: 1) linear bottleneck layers between layers; and 2) shortcut connections between bottleneck layers. This network structure gives MobileNet V2 additional advantages in security, privacy, and energy consumption.

In the MobileNet V2 network, the features extracted by the second convolutional layer are limited by the number of input channels. For this reason, before using the depthwise (DW) convolutional layer to extract features from the image, MobileNet V2 places a 1×1 first convolutional layer to expand the channels, and after feature extraction uses a third convolutional layer to compress them. The MobileNet V2 network thus implements an expand → convolve → compress process, avoiding channel information loss.

Meanwhile, ReLU functions are placed at the outputs of the first and second convolutional layers. Under the above expand → convolve → compress process, a problem arises after compression: the ReLU function destroys features, since it outputs zero for all negative inputs. To avoid further feature loss, the activation function of the third convolutional layer is therefore a linear function.
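A minimal PyTorch sketch of one such inverted residual block is given below; it follows the publicly documented MobileNet V2 design (the expansion factor of 6 and the ReLU6 activations are assumptions taken from that design, not from the patent text):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNet V2 bottleneck: 1x1 expand -> 3x3 depthwise -> 1x1 linear compress."""

    def __init__(self, c_in, c_out, stride=1, expand=6):
        super().__init__()
        c_mid = c_in * expand
        self.use_shortcut = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),        # 1x1 expansion layer
            nn.BatchNorm2d(c_mid),
            nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride, 1,
                      groups=c_mid, bias=False),           # 3x3 depthwise convolution
            nn.BatchNorm2d(c_mid),
            nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False),        # 1x1 compression layer
            nn.BatchNorm2d(c_out),                         # no ReLU: linear bottleneck
        )

    def forward(self, x):
        out = self.block(x)
        # Shortcut connection between bottlenecks when shapes allow it.
        return x + out if self.use_shortcut else out
```

The absence of a nonlinearity after the compression layer is exactly the "linear bottleneck" the text describes: it prevents ReLU from zeroing out the compressed features.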

The Siamese-network-based visual tracking network proposed by the present invention uses a central logistic loss function, defined as follows:

$$L(y, v) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{a}{1+\exp\left(b\,y_{ij}v_{ij}\right)}\,\log\left(1+\exp\left(-y_{ij}v_{ij}\right)\right)$$

where:

v ∈ R^{m×n} is the score map output by the network, and m×n is the size of the score map;

y ∈ {+1, -1} is the manually annotated ground truth;

a/(1+exp(b·yv)) is a modulation factor on the logistic loss, and a and b are its parameters; for example, according to one embodiment of the present invention, a = 2 and b = 1.

The modulation factor adaptively adjusts each training sample's contribution to the training loss according to the input yv. When yv > 0, the sample is an easy sample, and the modulation factor assigns a smaller weight to its logistic loss; conversely, when yv < 0, the sample is a hard sample, and the modulation factor assigns a larger weight.
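Under these definitions, a sketch of the central logistic loss in PyTorch could look as follows (an illustration, not the patent's code; the reduction to a mean over all positions of the score map is an assumption):

```python
import torch
import torch.nn.functional as F

def central_logistic_loss(v, y, a=2.0, b=1.0):
    """Central logistic loss over a score map.

    v: raw score map (any shape); y: ground-truth labels in {+1, -1}, same shape.
    Easy samples (y*v > 0) receive a small modulation weight, hard samples a large one.
    """
    yv = y * v
    modulation = a / (1.0 + torch.exp(b * yv))   # adaptive per-sample weight
    logistic = F.softplus(-yv)                   # log(1 + exp(-yv)), numerically stable
    return (modulation * logistic).mean()
```

For a strongly positive score on a positive label (yv large), the modulation term approaches zero, so well-classified easy samples contribute almost nothing to the gradient, leaving the hard negatives to drive training.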

Most convolutional neural networks use logistic loss or cross-entropy loss as the supervision signal when training deep models. Although models trained with such loss functions have good separability, their discriminative ability is poor.

Unlike closed-set problems such as object classification and recognition, target tracking is an open-set problem: the features output by the deep model must not only be separable but also strongly discriminative. When handling long-tailed datasets (where most samples belong to a few classes while many other classes have very few samples), weighting the losses of different classes can be tricky. For the target tracking problem, foreground targets are relatively easy to collect as positive samples, while the hard negative samples in the background that are useful for training are few.

Therefore, with the above central logistic loss function, the present invention can adaptively adjust the ratio of positive and negative samples in an end-to-end manner, so that the learned network is protected from the effects of imbalanced training samples. Specifically, by applying different weights to different samples through the central logistic loss function, the invention handles the imbalance between foreground and background training samples.

Figure 5 shows a schematic flow of training the visual tracking network and using the trained network model for UAV ground-target visual tracking according to an embodiment of the present invention.

As shown in Figure 5, the training process of the visual tracking network in the present invention comprises a pre-training stage and a fine-tuning stage.

In the pre-training stage, for example, the video database of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is used as the sample video sequences, and the labeled sample video sequences are used to train the visual tracking network. During training, the maximum number of iterations and the learning rate of the visual tracking network are set, and the network parameter initialization method and back-propagation method are selected to optimize the network parameters.

According to one embodiment of the present invention, the maximum number of iterations, the learning rate, and the back-propagation method are set as follows (a code sketch of this configuration appears after the list):

Maximum number of iterations: 50 epochs;

Initial learning rate: 0.001;

Network initialization method: Xavier method;

Back-propagation method: stochastic gradient descent.
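A hedged sketch of this training configuration, assuming a PyTorch implementation in which `model` is the Siamese tracking network and `loader` yields labeled template/search pairs (both names are placeholders, and `central_logistic_loss` is the sketch given earlier):

```python
import torch
import torch.nn as nn

def init_xavier(m):
    """Xavier initialization for convolutional and linear layers."""
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def pretrain(model, loader, epochs=50, lr=0.001):
    """Pre-training loop with the hyperparameters listed above."""
    model.apply(init_xavier)                                  # Xavier method
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)    # stochastic gradient descent
    for epoch in range(epochs):                               # at most 50 epochs
        for template, search, labels in loader:
            scores = model(template, search)                  # Siamese forward pass
            loss = central_logistic_loss(scores, labels)      # loss defined earlier
            optimizer.zero_grad()
            loss.backward()                                   # back-propagate the difference
            optimizer.step()
```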

After the pre-training stage is completed, the network parameters are further optimized in a fine-tuning stage.

In the fine-tuning stage, the present invention uses a UAV to capture ground targets and form a video dataset, labels each frame of the video dataset by category, divides the labeled video dataset into a training set, a validation set, and a test set, and finally processes them into a data type that the visual tracking network model can accept.

The pre-trained visual tracking network is trained with the training set and validation set so as to fine-tune it, and the structure and parameters of the fine-tuned visual tracking network model are retained.

The fine-tuned visual tracking network is then tested with the test set to obtain the tracking accuracy, which is evaluated: if the tracking accuracy meets the actual engineering requirements, the visual tracking network model can be applied to the actual task of UAV ground-specific-target recognition. Otherwise, the training set does not meet the actual engineering requirements; it must be enlarged and the pre-training and fine-tuning steps restarted until the requirements are met.

During training, the present invention uses the central logistic loss function:

$$L(y, v) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{a}{1+\exp\left(b\,y_{ij}v_{ij}\right)}\,\log\left(1+\exp\left(-y_{ij}v_{ij}\right)\right)$$

where:

v ∈ R^{m×n} is the score map output by the network;

y ∈ {+1, -1} is the manually annotated ground truth;

a/(1+exp(b·yv)) is a modulation factor on the logistic loss, which adaptively adjusts each training sample's contribution to the training loss according to the input yv.

When yv > 0, the sample is an easy sample, and the modulation factor assigns a smaller weight to its logistic loss; conversely, when yv < 0, the sample is a hard sample, and the modulation factor assigns a larger weight.

Thus, by applying different weights to different samples through the central logistic loss function, the present invention handles the imbalance between foreground and background training samples.

After the network model has been trained, the visual tracking network is applied in actual UAV ground-target tracking scenarios to track targets in the video captured by the UAV.

Figures 5 and 6 show a schematic flow of the Siamese-network-based UAV ground-target tracking method according to an embodiment of the present invention.

As shown in Figure 6, the method comprises the following steps:

extracting, with the first feature extraction network, the target feature map of the first frame image with a known target position, i.e. the template image; and extracting, with the second feature extraction network, the search feature map of the second frame image, i.e. the search image;

computing the cross-correlation between the search region of the second frame image and the target region of the first frame image from the target feature map and the search feature map to obtain the score response map of the second frame image, and then obtaining the target position in the second frame image from that score response map;

wherein the first and second feature extraction networks are the two branches of a Siamese convolutional network and are each composed of a lightweight convolutional neural network.

For example, the target is annotated in the first frame of the video sequence, and the visual tracking network yields the target position in the second frame; then, starting from the annotation of frame 1 or the tracking result of frame 2, the visual tracking network yields the target position in frame 3. By analogy, the target position of every frame in the video sequence can be obtained, realizing target tracking over the video sequence.
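A minimal sketch of this frame-by-frame loop (illustrative only; `crop` and `peak_to_box` are hypothetical helpers for patch extraction and for mapping the response peak back to image coordinates, and scale handling is omitted):

```python
import torch

def track_sequence(model, frames, init_box):
    """Track a target through a video from an annotated first frame.

    model(template, search) is assumed to return a score response map.
    """
    template = crop(frames[0], init_box)            # exemplar from frame 1
    box, boxes = init_box, [init_box]
    for frame in frames[1:]:
        search = crop(frame, box, context=2.0)      # search region around last box
        response = model(template, search)          # cross-correlation score map
        peak = torch.argmax(response.flatten())     # highest-scoring position
        box = peak_to_box(peak, response, box)      # hypothetical coordinate mapping
        boxes.append(box)
    return boxes
```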

As further shown in Figure 5, the method also includes updating the parameters of the feature extraction networks according to the score response map obtained from the cross-correlation computation, so as to further improve the accuracy and reliability of the feature extraction networks.

Figure 7 shows the UAV ground-target visual tracking apparatus provided by an embodiment of the present invention, comprising:

an identification unit 701 for extracting the target feature map of a first frame image with a known target position using the first feature extraction network, and the search feature map of a second frame image using the second feature extraction network;

a computation unit 702 for computing the cross-correlation between the search region of the second frame image and the target region of the first frame image from the target feature map and the search feature map, yielding the score response map of the second frame image;

a determination unit 703 for obtaining the target position in the second frame image from the score response map of the second frame image;

wherein the first and second feature extraction networks in the identification unit are the two branches of a Siamese convolutional network and are each composed of a lightweight convolutional neural network.

An embodiment of the present invention also provides an electronic device, comprising:

at least one processor; and

a memory communicatively connected to the processor and storing instructions executable by the processor; when the instructions are executed by the processor, the processor performs the UAV ground-target visual tracking method or the training method.

Optionally, the memory may be independent or integrated with the processor.

When the memory is provided independently, the electronic device further comprises a bus for connecting the memory and the processor.

Further, an embodiment of the present invention also provides a readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the UAV ground-target visual tracking method or the training method.

In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules is only a logical functional division, and other divisions are possible in actual implementation; for example, multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.

The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, may each exist physically alone, or two or more modules may be integrated into one unit. The units formed from the above modules may be implemented in the form of hardware, or in the form of hardware plus software functional units.

The integrated modules implemented in the form of software functional modules may be stored in a computer-readable storage medium. The software functional modules are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of this application.

It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the invention may be embodied directly as being executed by a hardware processor, or by a combination of hardware and software modules within the processor.

The memory may include high-speed RAM, and may also include non-volatile memory (NVM), such as at least one disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disc, or the like.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For convenience of representation, the buses in the drawings of this application are not limited to a single bus or a single type of bus.

The storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. A storage medium may be any available medium accessible by a general-purpose or special-purpose computer.

An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). Of course, the processor and the storage medium may also exist as discrete components in an electronic device or a master control device.

Those of ordinary skill in the art will understand that all or part of the steps implementing the above method embodiments may be completed by hardware under program instructions. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments, and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, a magnetic disk, or an optical disc. The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or substitute equivalents for some or all of the technical features; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A training method for a UAV ground-target visual tracking model, the tracking model being a Siamese convolutional network whose two branches are a first feature extraction network and a second feature extraction network, characterized in that the training method comprises: obtaining a video sequence dataset comprising pairs of template images and search images; extracting a target feature map from the template image with the first feature extraction network, and a search feature map from the search image with the second feature extraction network; computing the cross-correlation between the search region of the search image and the target region of the template image from the target feature map and the search feature map, yielding a score response map for the search image; computing the difference between the score response map and the ground truth according to the central logistic loss function, yielding a difference result; and back-propagating the difference result to adjust the weights of each layer of the Siamese network; wherein the first feature extraction network and the second feature extraction network are each composed of a lightweight convolutional neural network.

2. The training method according to claim 1, characterized in that the lightweight convolutional neural network is a MobileNet V2 model.

3. The training method according to claim 1 or 2, characterized in that the central logistic loss function is:

$$L(y, v) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{a}{1+\exp\left(b\,y_{ij}v_{ij}\right)}\,\log\left(1+\exp\left(-y_{ij}v_{ij}\right)\right)$$

where v ∈ R^{m×n} is the score map output by the network, y ∈ {+1, -1} is the manually annotated ground truth, and a/(1+exp(b·yv)) is a modulation factor on the logistic loss that adaptively adjusts each training sample's contribution to the training loss according to the input yv, a and b being the parameters of the modulation factor.

4. The training method according to claim 3, characterized in that when yv > 0 the modulation factor assigns the logistic loss a first weight, and when yv < 0 the modulation factor assigns a second weight, the first weight being smaller than the second weight.

5. The training method according to claim 3, characterized in that the parameters of the modulation factor of the central logistic loss function are a = 2 and b = 1.

6. A UAV ground-target visual tracking method, characterized by comprising: extracting the target feature map of a first frame image with a known target position using the first feature extraction network, and the search feature map of a second frame image using the second feature extraction network; computing the cross-correlation between the search region of the second frame image and the target region of the first frame image from the target feature map and the search feature map to obtain the score response map of the second frame image, and then obtaining the target position in the second frame image from that score response map; wherein the first and second feature extraction networks are the two branches of a Siamese convolutional network and are each composed of a lightweight convolutional neural network, the lightweight convolutional neural network being a MobileNet V2 model.

7. A UAV ground-target visual tracking apparatus, characterized by comprising: an identification unit for extracting the target feature map of a first frame image with a known target position using the first feature extraction network, and the search feature map of a second frame image using the second feature extraction network; a computation unit for computing the cross-correlation between the search region of the second frame image and the target region of the first frame image from the target feature map and the search feature map, yielding the score response map of the second frame image; a determination unit for obtaining the target position in the second frame image from the score response map of the second frame image; wherein the first and second feature extraction networks in the identification unit are the two branches of a Siamese convolutional network and are each composed of a lightweight convolutional neural network, the lightweight convolutional neural network being a MobileNet V2 model.

8. An electronic device, characterized by comprising: at least one processor; and a memory communicatively connected to the processor and storing instructions executable by the processor; when the instructions are executed by the processor, the processor performs the UAV ground-target visual tracking method of claim 6, or the training method of any one of claims 1 to 5.

9. A readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, it implements the UAV ground-target visual tracking method of claim 6, or the training method of any one of claims 1 to 5.



Legal Events

• PB01: Publication
• SE01: Entry into force of request for substantive examination
• RJ01: Rejection of invention patent application after publication (application publication date: 20200626)