
CN118038379A - Vehicle small target detection method and device based on lightweight network design - Google Patents


Info

Publication number
CN118038379A
CN118038379A
Authority
CN
China
Prior art keywords
module
features
feature
network
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410131971.3A
Other languages
Chinese (zh)
Inventor
赵小明
尹旭坤
林旭茂
高苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202410131971.3A
Publication of CN118038379A
Status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0495Quantised networks; Sparse networks; Compressed networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a vehicle small target detection method and device with a lightweight network design, relating to the technical field of target detection. The method comprises: acquiring an image of the vehicle to be detected; and classifying the image with a trained network model and outputting the classification result. The trained network model comprises a backbone network module, a neck network module and a detection module: the backbone network module extracts hierarchical global and local features, the neck network module fuses the hierarchical global and local features, and the detection module processes the fused global and local features to obtain the classification result. The present invention can meet the real-time and accuracy requirements of vehicle target detection on edge devices where both storage and computation are limited.

Description

A vehicle small target detection method and device with a lightweight network design

Technical Field

The present invention belongs to the technical field of target detection, and in particular relates to a vehicle small target detection method and device with a lightweight network design.

Background Art

Vehicle retrieval technology uses computer vision algorithms to determine whether a specific vehicle target is present in an image under test. It has important application value in fields such as traffic planning and environmental monitoring. In addition, drones, being small and highly maneuverable, are well suited to collecting information in areas with complex environments and are applicable to various civil and military fields. Combining drones with vehicle retrieval technology to study vehicle retrieval from drone-mounted cameras under different geographical environments and weather conditions is therefore of great research significance. Research on small target retrieval of aerial vehicles in complex scenes covers three tasks: accurate and fast retrieval in complex scenes, construction of an efficient target retrieval database, and determination of target candidate boxes. In practical applications, drone flight altitude, shooting angle and background noise make vehicle target retrieval in complex scenes perform poorly, with a high missed-detection rate and inaccurate localization, failing to meet actual needs; moreover, existing detection networks have high computational cost and are unfriendly to edge devices where both storage and computation are limited.

Current research on small target detection of drone-borne vehicles improves along several directions. First, datasets: since target detection datasets are not abundant, most work begins by expanding the dataset to better train recognition accuracy, for example by extending a self-made dataset with Poisson blending. Second, model structure: for example, modifying the YOLOv5 algorithm by adding a feature map attention (FMA) layer at the end of the backbone to improve feature extraction, or, considering network generality and scene complexity, replacing the original DarkNet-53 backbone with the simpler and more practical ResNet-50 to facilitate subsequent joint algorithms. To deploy neural network models better on edge devices, improvements can also be made from a lightweight perspective: for example, improving the YOLOv3 backbone by replacing some convolution operations with cheaper linear operations without reducing the accuracy of the original model, so that the model has fewer parameters and consumes less computing power; or compressing a deep-learning surface-target detection model by replacing the DarkNet feature extraction network with an improved network using depthwise separable convolution (DSC) and a lightweight attention model, with multi-scale feature fusion. Generally, to speed up the final prediction, images in convolutional neural networks (CNNs) almost always undergo a similar transformation in the backbone: information in the spatial dimensions is gradually transferred to the channels, and each spatial (width and height) compression of the feature map and expansion of the channel dimension loses some semantics. Simply using DSC operations can speed up the detector, but in concrete applications such models expose the real problem of low accuracy.
MobileNets use many 1×1 dense convolutions to fuse independently computed channel information; ShuffleNets use channel shuffle to achieve channel information interaction; GhostNet uses a "halved" standard convolution (SC) operation to retain the interaction information between channels. However, 1×1 dense convolutions in fact occupy more computing resources, channel shuffle still does not reach the result of standard convolution, and many lightweight models use depthwise separable convolution from beginning to end, so that, whether for image classification or target detection, the defects of depthwise separable convolution are directly amplified in the backbone.

Therefore, there is an urgent need for a vehicle small target detection method with a lightweight network design to reduce costs.

Summary of the Invention

In order to solve the above problems in the prior art, the present invention provides a vehicle small target detection method and device with a lightweight network design. The technical problem to be solved by the present invention is achieved by the following technical solutions:

In a first aspect, the present invention provides a vehicle small target detection method with a lightweight network design, comprising:

acquiring an image of the vehicle to be detected;

classifying the image of the vehicle to be detected with a trained network model, and outputting the classification result; wherein the trained network model comprises a backbone network module, a neck network module and a detection module, the backbone network module being used to extract hierarchical global and local features, the neck network module being used to fuse the hierarchical global and local features, and the detection module being used to process the fused global and local features to obtain the classification result.

In a second aspect, the present invention further provides a vehicle small target detection device with a lightweight network design, comprising:

an image acquisition module, used to acquire an image of the vehicle to be detected;

an image processing module, used to classify the image of the vehicle to be detected with a trained network model and output the classification result; wherein the trained network model comprises a backbone network module, a neck network module and a detection module, the backbone network module being used to extract hierarchical global and local features, the neck network module being used to fuse the hierarchical global and local features, and the detection module being used to process the fused global and local features to obtain the classification result.

Beneficial effects of the present invention:

The vehicle small target detection method and device with a lightweight network design provided by the present invention address the problems that existing deep convolutional neural networks have many parameters and high computational cost, while shallow lightweight networks extract target features weakly and detect poorly. The proposed method reduces computational cost while making target feature extraction computationally efficient, semantically adaptive and globally effective. An attention-based bounding-box loss function is constructed at the output, making localization more accurate and increasing the generalization ability of the model. While reducing model complexity, the network as a whole can meet the real-time and accuracy requirements of vehicle target detection on edge devices where both storage and computation are limited.

The present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of the vehicle small target detection method with a lightweight network design provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the backbone network module provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of the adaptive frequency-band filtering operator provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of the neck network module provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of training the preset network model provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of example images of the dataset provided by an embodiment of the present invention;

FIG. 7 is a schematic diagram of test results provided by an embodiment of the present invention;

FIG. 8 is a schematic diagram of the vehicle small target detection device with a lightweight network design provided by an embodiment of the present invention.

DETAILED DESCRIPTION

The present invention is further described in detail below with reference to specific embodiments, but the embodiments of the present invention are not limited thereto.

In the prior art, early computer resources were limited and software and hardware resources were mismatched, so target detection could only be achieved by designing hand-crafted features and using various acceleration techniques. With the introduction of convolutional networks in 2012, famous convolutional neural networks such as AlexNet, VGG, GoogLeNet and ResNet gradually came into view. Since convolutional neural networks can extract richer information and features, many researchers turned to them to extract the features of target objects. In 2014, R. Girshick et al. proposed the R-CNN algorithm, the first to break the deadlock by using regions of convolutional neural network features for target detection. Since then, target detection algorithms have developed rapidly, and more and more algorithms have emerged.

Target detection algorithms can generally be divided into two categories: two-stage detection methods represented by R-CNN, and one-stage detection methods represented by the YOLO series. The two categories follow different detection ideas and therefore have different characteristics, advantages and disadvantages. Two-stage methods generate candidate boxes with some algorithm, extract features from those candidate boxes, and finally produce bounding boxes, predict the corresponding classes and regress position information; such algorithms are accurate but slow. In 2016, Tao Kong et al. proposed HyperNet, whose core idea is to fuse information from different feature levels through a multi-branch network to improve the accuracy of small target detection. Specifically, HyperNet first extracts shallow and deep features through a base network, then fuses feature information from different levels to obtain a more accurate fused feature representation; the fusion network adopts a multi-branch design in which each branch fuses a different feature level; finally, targets are detected and localized by classification and regression networks. In 2017, Lin et al. proposed the feature pyramid network structure: through a top-down feature extraction network and a bottom-up feature fusion network, feature pyramids of different sizes are obtained. In this process, a high-level feature map is upsampled to the scale of the low-level feature map and fused with it, producing a feature map with richer semantic information; targets are then detected and localized by the detection network. Law H. et al. proposed the CornerNet algorithm in 2018; based on the keypoint idea, it determines the position and size of a target by detecting its corner points, enabling small target detection. The advantage of CornerNet is that corner points represent the position and size of a target accurately, allowing precise detection of small targets. The Vision Transformer (ViT) model proposed by the Google team in 2020 is a classic pure-transformer solution for visual tasks, but ViT has high computational complexity and overhead. EfficientNetV2 (Tan M., Le Q., "EfficientNetV2: Smaller Models and Faster Training"), published in April 2021, is a lightweight network model that proposes an improved progressive learning method that dynamically adjusts regularization according to the training image size; it also verifies that depthwise separable convolution is slow in shallow networks, and therefore proposes a new module, Fused-MBConv, for shallow networks to increase training speed and accuracy. However, because its architecture was obtained by neural architecture search, it is not fully suitable for small target recognition of ground vehicles.

In summary, the disadvantages of the prior art include:

1. On the one hand, although convolutional neural networks have relatively few parameters to train, their spatial perception is local, whereas global spatial perception is crucial for tasks such as image classification and semantic segmentation. On the other hand, the Transformer structure has high computational complexity and overhead and is more costly.

2. Deep networks bring high computational complexity, while shallow networks extract target features poorly, leading to poor detection performance; moreover, shallow networks are slow when using depthwise separable convolution.

3. Existing algorithms retrieve drone vehicle targets poorly, with a high missed-detection rate and insufficiently accurate localization, and cannot meet actual needs.

4. Most neural network algorithms are difficult to deploy, and the training data of anchor-based detection algorithms inevitably contains low-quality examples, so geometric metrics such as distance and aspect ratio aggravate the penalty on low-quality examples and degrade the generalization performance of the model.

In view of this, and addressing the problems that existing deep convolutional neural networks have many parameters and high computational cost while shallow lightweight networks extract target features weakly and detect poorly, the present invention proposes a vehicle small target detection method with a lightweight network design that reduces computational cost while making target feature extraction computationally efficient, semantically adaptive and globally effective. An attention-based bounding-box loss function is constructed at the output, making localization more accurate and increasing the generalization ability of the model; while reducing model complexity, the method can meet the real-time and accuracy requirements of vehicle target detection on edge devices where both storage and computation are limited.

Please refer to FIG. 1, a flow chart of the vehicle small target detection method with a lightweight network design provided by an embodiment of the present invention. The present invention provides a vehicle small target detection method with a lightweight network design, comprising:

S101: Acquire an image of the vehicle to be detected.

Specifically, in this embodiment, the image of the vehicle to be detected is a small target image.

S102: Classify the image of the vehicle to be detected with a trained network model and output the classification result; wherein the trained network model comprises a backbone network module, a neck network module and a detection module, the backbone network module being used to extract hierarchical global and local features, the neck network module being used to fuse the hierarchical global and local features, and the detection module being used to process the fused global and local features to obtain the classification result.

In an optional embodiment of the present invention, please refer to FIG. 2, a schematic diagram of the backbone network module provided by an embodiment of the present invention. The backbone network module comprises a first module, a second module and a third module: the image of the vehicle to be detected input to the backbone network module is processed by the first module to obtain first features, the first features are input to the second module to obtain second features, and the second features are input to the third module to obtain third features; wherein,

the first module, the second module and the third module each comprise a downsampling module, AFF modules and a fusion module; the numbers of AFF modules in the first, second and third modules all differ; and the downsampling module comprises an MBConv layer.

Specifically, in this embodiment, the backbone network module first performs tokenization with a convolutional stem (the Conv Stem in FIG. 2), which consists of one 3×3 convolutional layer with stride 2 followed by four MBConv layers as the downsampling module; MBConv is short for the mobile inverted bottleneck convolution module, with a kernel size of 3. After the input features are tokenized, they are processed by the main body of the lightweight network, a cascade of three stages, each consisting of one MBConv layer with stride 2 for spatial downsampling and N_i AFF modules; the number of AFF modules is N1=2 in the first module, N2=4 in the second module and N3=3 in the third module. Two groups of linear layers (also called 1×1 convolutional layers) and ReLU functions are used to learn the mask of the proposed adaptive frequency filtering; using linear layer groups is crucial because it prevents the nonlinearity from destroying too much information, and the linear layer group uses 1×1 convolution kernels to further compress the number of channels, reducing computation and parameters. Layer normalization (LN) is used for channel mixing, and the result is fed to the adaptive frequency-band filtering operator for global sequence mixing to obtain the output of the l-th AFF module; skip connections are used for both channel mixing and sequence mixing to speed up training convergence. A sketch of this layout follows.
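
The following is a minimal PyTorch sketch of the backbone layout described above, for illustration only: a 3×3 stride-2 convolutional stem, a simplified MBConv block (1×1 expansion, 3×3 depthwise convolution, 1×1 projection, with an inverted residual), and a helper that assembles one stage from a stride-2 MBConv and N_i AFF blocks. The channel widths, the expansion ratio and the AFFBlock class (sketched after the next paragraphs) are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Simplified mobile inverted bottleneck convolution (kernel size 3)."""
    def __init__(self, c_in, c_out, stride=1, expand=4):
        super().__init__()
        c_mid = c_in * expand
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),   # 1x1 expansion
            nn.BatchNorm2d(c_mid), nn.SiLU(),
            nn.Conv2d(c_mid, c_mid, 3, stride, 1, groups=c_mid, bias=False),  # 3x3 depthwise
            nn.BatchNorm2d(c_mid), nn.SiLU(),
            nn.Conv2d(c_mid, c_out, 1, bias=False),  # 1x1 projection
            nn.BatchNorm2d(c_out),
        )
        self.use_skip = stride == 1 and c_in == c_out  # inverted residual

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y

def conv_stem(c_out=32):
    # 3x3 stride-2 convolution followed by four MBConv layers, as described above
    return nn.Sequential(
        nn.Conv2d(3, c_out, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(c_out), nn.SiLU(),
        *[MBConv(c_out, c_out) for _ in range(4)],
    )

def make_stage(c_in, c_out, n_aff):
    # one stride-2 MBConv for spatial downsampling, then N_i AFF blocks
    # (AFFBlock is defined in the AFF block sketch below)
    return nn.Sequential(MBConv(c_in, c_out, stride=2),
                         *[AFFBlock(c_out) for _ in range(n_aff)])
```

For example, the three stages with N1=2, N2=4 and N3=3 could be assembled as make_stage(32, 64, 2), make_stage(64, 128, 4) and make_stage(128, 256, 3); the widths are again assumptions.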

In an optional embodiment of the present invention, the AFF module comprises a first normalization layer, a second normalization layer and an adaptive frequency-band filtering operator; the first and second normalization layers are used for channel mixing, and the adaptive frequency-band filtering operator is used for global sequence mixing. The features output by the AFF module are expressed as:

$$\hat{X}_l = \mathrm{LN}\big(\mathrm{MBConv}_l(X_{l-1})\big)$$

$$X_l = \mathrm{LN}\big(\mathrm{AFF}_l(\hat{X}_l)\big) + \hat{X}_l$$

where $\hat{X}_l$ denotes the output features after the $l$-th downsampling layer and the first normalization layer, $\mathrm{MBConv}_l(\cdot)$ the output features after processing by the $l$-th MBConv layer, $X_{l-1}$ the features input to the downsampling layer, $\mathrm{AFF}_l(\cdot)$ the output features after processing by the $l$-th AFF module, $\mathrm{LN}(\cdot)$ the output features after processing by the second normalization layer, and $X_l$ the final output features after processing by the $l$-th AFF module.

Specifically, in this embodiment, the backbone composed of MBConv and the adaptive frequency-band filtering operator can extract global and local features hierarchically, effectively reducing computational complexity, overhead and cost; it reaches the effect of a deep network while keeping the computational cost and the number of model parameters small. A sketch of one AFF block follows.
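
Below is a hedged sketch of a single AFF block under the formulation above, assuming PyTorch; GroupNorm with one group stands in for layer normalization over the channels of a 2D feature map, and the exact placement of the two normalization layers is one plausible reading of the description, not a confirmed detail of the patent. AFFOperator is sketched after the operator description below.

```python
class AFFBlock(nn.Module):
    """One AFF block: normalization for channel mixing, the adaptive
    frequency-band filtering operator for global sequence mixing, plus a skip."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)  # LayerNorm over channels of a 2D map
        self.aff = AFFOperator(channels)       # sketched below

    def forward(self, x_hat):
        # X_l = LN(AFF_l(X_hat_l)) + X_hat_l
        return x_hat + self.norm(self.aff(x_hat))
```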

In an optional embodiment of the present invention, please refer to FIG. 3, a schematic diagram of the adaptive frequency-band filtering operator provided by an embodiment of the present invention. The adaptive frequency-band filtering operator comprises a Fourier transform module, a frequency-domain filter and an inverse Fourier transform module; its processing of the input features comprises:

using the Fourier transform module, the input features are transformed from the spatial domain to the frequency domain to obtain the frequency-domain features:

$$X_F = \mathcal{F}(X)$$

where $X_F$ denotes the frequency-domain features, $X$ the input features, and $\mathcal{F}(\cdot)$ the Fourier transform;

using the frequency-domain filter, the frequency-domain features are processed to obtain a mask tensor $\mathcal{M}(X_F)$ learned from the frequency-domain features $X_F$;

the mask tensor $\mathcal{M}(X_F)$ is multiplied element-wise with the frequency-domain features to obtain the filtered features $\hat{X}_F$:

$$\hat{X}_F = \mathcal{M}(X_F) \odot X_F$$

where $\odot$ denotes the element-wise (Hadamard) product;

using the inverse Fourier transform module, the filtered features are transformed from the frequency domain back to the spatial domain to obtain the spatial-domain features $\tilde{X}$:

$$\tilde{X} = \mathcal{F}^{-1}(\hat{X}_F)$$

where $\mathcal{F}^{-1}(\cdot)$ denotes the inverse Fourier transform; the spatial-domain features $\tilde{X}$ are the features output after processing by the adaptive frequency-band filtering operator.
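
The following is a minimal PyTorch sketch of this operator, assuming torch.fft for the transforms; the mask network follows the description above (1×1 convolution, ReLU, 1×1 convolution over the stacked real and imaginary parts), while the hidden width and the orthonormal FFT normalization are assumptions.

```python
import torch
import torch.nn as nn

class AFFOperator(nn.Module):
    """Adaptive frequency-band filtering: FFT -> learned instance-adaptive mask
    applied by element-wise multiplication in the frequency domain -> inverse FFT."""
    def __init__(self, channels, hidden=None):
        super().__init__()
        hidden = hidden or max(channels // 2, 1)   # assumed channel compression
        self.mask_net = nn.Sequential(             # 1x1 conv -> ReLU -> 1x1 conv
            nn.Conv2d(2 * channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2 * channels, 1),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        xf = torch.fft.fft2(x, norm="ortho")        # X_F = F(X)
        z = torch.cat([xf.real, xf.imag], dim=1)    # view complex X_F as 2C real channels
        mask = torch.complex(*self.mask_net(z).chunk(2, dim=1))  # M(X_F), same shape as X_F
        yf = mask * xf                              # element-wise frequency filtering
        return torch.fft.ifft2(yf, norm="ortho").real  # back to the spatial domain
```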

Specifically, referring again to FIG. 3, the processing of the input features by the adaptive frequency-band filtering operator in this embodiment comprises:

S111: Perform a fast Fourier transform on the input features.

First, the input features are converted to the frequency domain by a fast Fourier transform (FFT). For a given feature $X \in \mathbb{R}^{H\times W\times C}$, where $H$ and $W$ denote the height and width of the feature and $C$ the number of channels, the frequency-domain feature after the fast Fourier transform is:

$$X_F(u,v) = \sum_{h=0}^{H-1}\sum_{w=0}^{W-1} X(h,w)\, e^{-2\pi i\left(\frac{uh}{H}+\frac{vw}{W}\right)}$$

where $u$ and $v$ index the frequency dimensions along the height and width; features at different spatial positions correspond to different frequency components of $X$, and the transform fuses the global information of $X$ with complexity $O(N\log N)$. In this way the complexity of sequence mixing is reduced from $O(N^2)$ to $O(N\log N)$.

S112: Multiply the input frequency-domain features obtained in S111 element-wise by the learnable frequency-domain filter, giving:

$$\hat{X}_F = \mathcal{M}(X_F) \odot X_F$$

where $\mathcal{M}(X_F)$ denotes the mask tensor learned from $X_F$, with the same shape as $X_F$. As shown in FIG. 3, to keep the network as lightweight as possible, $\mathcal{M}(\cdot)$ is implemented efficiently by one group of 1×1 convolutional (linear) layers, followed by a group of ReLU functions and another group of linear layers. Using linear layer groups is crucial because it prevents the nonlinearity from destroying too much information; the linear layer group uses 1×1 convolution kernels to further compress the number of channels, reducing computation and the number of parameters.

S113: Convert back to the spatial domain by the inverse fast Fourier transform (IFFT).

Applying the convolution theorem to S112, filtering the frequency representation $X_F$ of $X$ with the learnable instance-adaptive mask realizes effective global sequence mixing of $X$; an inverse Fourier transform of the filtered $X_F$ then yields the updated feature representation in the original latent space. The process can be expressed as:

$$\tilde{X} = \mathcal{F}^{-1}\big(\mathcal{M}(X_F)\odot X_F\big)$$

where $\odot$ denotes the element-wise product and $\mathcal{F}^{-1}$ the inverse Fourier transform; $\tilde{X}$ can be viewed as the result of globally adaptive sequence mixing of $X$.

S114: Complete the construction of the adaptive frequency-domain filtering operator.

The $\tilde{X}$ obtained through S111, S112 and S113 is mathematically equivalent to the output of a dynamic convolution with a large kernel, since the multiplication of two signals in the Fourier domain equals the Fourier transform of the convolution of the two signals in their original domain. Applied to the frequency-domain multiplication here, this gives:

$$\mathcal{F}^{-1}\big(\mathcal{M}(X_F)\odot \mathcal{F}(X)\big) = \mathcal{F}^{-1}\big(\mathcal{M}(X_F)\big) * X$$

from which:

$$\tilde{X} = K * X, \qquad K = \mathcal{F}^{-1}\big(\mathcal{M}(X_F)\big)$$

where $K$ is a tensor with the same shape as $X$ and can be regarded as a dynamic depthwise convolution kernel spatially as large as $X$, adaptive to the information in $X$. Owing to the properties of the Fourier transform, circular padding is applied to $X$ here, and the inverse Fourier transform gives the module a global-range sequence mixing operation with semantically adaptive weights.
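
As a small self-contained check of this equivalence (in 1D for clarity; the 2D case is analogous), multiplying by a mask M in the Fourier domain matches circular convolution with the kernel K = F^{-1}(M):

```python
import torch

N = 8
x = torch.randn(N, dtype=torch.cfloat)
M = torch.randn(N, dtype=torch.cfloat)             # an arbitrary frequency-domain mask

y_freq = torch.fft.ifft(M * torch.fft.fft(x))      # filtering in the frequency domain

k = torch.fft.ifft(M)                              # equivalent spatial kernel, as large as x
y_conv = torch.stack([sum(k[m] * x[(n - m) % N] for m in range(N))
                      for n in range(N)])          # circular convolution (k * x)[n]

print(torch.allclose(y_freq, y_conv, atol=1e-5))   # True
```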

In this embodiment, the adaptive frequency-band filtering operator is equivalent to a large spatial convolution kernel, whose computational complexity and overhead are higher; using the adaptive filtering operator instead reduces the computational cost and overhead of the network while achieving the same effect as a dynamic large spatial convolution kernel. While effectively reducing computational cost and resource waste, target feature extraction remains efficient, global and adaptive.

In an optional embodiment of the present invention, please refer to FIG. 4, a schematic diagram of the neck network module provided by an embodiment of the present invention. The neck network module comprises three ASFF modules, namely a first ASFF module, a second ASFF module and a third ASFF module; the processing of the input features by the neck network module comprises:

the first ASFF module fuses the first, second and third features to obtain first fused features; the second ASFF module fuses the first, second and third features to obtain second fused features; and the third ASFF module fuses the first, second and third features to obtain third fused features.

Specifically, referring again to FIG. 4, in this embodiment the semantic information of features shifts from low-level to high-level as the network deepens. High-level features provide rich semantic information for object classification, while low-level features provide rich fine-grained information for object localization. Feature fusion is therefore particularly important, so an efficient feature fusion structure is constructed by introducing the adaptively spatial feature fusion (ASFF) module: after the fusion module of each stage of the backbone, the features are fed into an ASFF module, and self-learned weights are set for each fused feature map for weighted fusion, so that semantic information is integrated more deeply and detection performance is improved.

Referring again to FIG. 4, ASFF1, ASFF2 and ASFF3 correspond to the outputs X1, X2 and X3 of stage 1, stage 2 and stage 3 respectively. Taking ASFF3 as an example: to obtain the output of ASFF3, the numbers of channels of X1 and X2 are first adjusted by 1×1 convolutions, the widths and heights of the feature maps are then adjusted by upsampling to match X3, the adjusted feature maps are multiplied by the corresponding weight coefficients, and finally the feature maps carrying position information are added to those carrying semantic information to obtain the fused feature map. The ASFF output layer is computed as:

$$Y^l = \alpha^l \cdot X^{1\to l} + \beta^l \cdot X^{2\to l} + \gamma^l \cdot X^{3\to l}$$

where $Y^l$ denotes the new layer-$l$ feature map obtained after fusion, $X^{n\to l}$ denotes the feature vector obtained from the layer-$n$ feature map after its size and dimensions are adjusted to match the layer-$l$ feature map, and $\alpha^l$, $\beta^l$ and $\gamma^l$ denote the adaptively learned weight coefficients of layer $l$, expressed as:

$$\alpha^l = \frac{e^{\lambda_\alpha^l}}{e^{\lambda_\alpha^l}+e^{\lambda_\beta^l}+e^{\lambda_\gamma^l}}$$

(and analogously for $\beta^l$ and $\gamma^l$), where $\lambda_\alpha^l$, $\lambda_\beta^l$ and $\lambda_\gamma^l$ denote the self-learned weight control parameters of layer $l$, obtained from the adjusted feature vectors by 1×1 convolution.

In this embodiment, this fusion method not only prevents different network layers from learning duplicate gradient information, but also eliminates the computational bottleneck, reduces memory consumption, promotes the fusion of low-level information and yields stronger feature extraction capability; it alleviates, to a certain extent, the loss of target position and contour information. A sketch of the ASFF3 branch follows.
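
Below is a minimal PyTorch sketch of one such branch (ASFF3) under the description above: 1×1 convolutions align channels, interpolation aligns spatial size, and per-position weights obtained by 1×1 convolution and softmax sum to one. The channel counts and the way the lambda parameters are produced from the concatenated maps are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFF3(nn.Module):
    """Fuses stage outputs X1, X2, X3 at the resolution of X3."""
    def __init__(self, c1, c2, c3):
        super().__init__()
        self.align1 = nn.Conv2d(c1, c3, 1)      # match X1 channels to X3
        self.align2 = nn.Conv2d(c2, c3, 1)      # match X2 channels to X3
        self.weight = nn.Conv2d(3 * c3, 3, 1)   # lambda_alpha, lambda_beta, lambda_gamma

    def forward(self, x1, x2, x3):
        h, w = x3.shape[-2:]
        x1 = F.interpolate(self.align1(x1), size=(h, w), mode="nearest")
        x2 = F.interpolate(self.align2(x2), size=(h, w), mode="nearest")
        lam = self.weight(torch.cat([x1, x2, x3], dim=1))
        a, b, g = torch.softmax(lam, dim=1).chunk(3, dim=1)  # weights sum to 1 per position
        return a * x1 + b * x2 + g * x3          # Y^l = a*X1 + b*X2 + g*X3
```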

In an optional embodiment of the present invention, please refer to FIG. 5, a schematic diagram of training the preset network model provided by an embodiment of the present invention. The trained network model is obtained by training a preset network model, which comprises:

obtaining a dataset comprising a training sample set and a test sample set;

training the preset network model with the training sample set;

during training, iteratively optimizing with a loss function to obtain the optimal weights of the preset network model, so as to build the trained network model;

testing the performance of the trained network model with the test sample set.

In an optional embodiment of the present invention, the loss function comprises a bounding-box loss function, a confidence loss function and a classification loss function; the confidence loss and the classification loss are both computed with the cross-entropy loss function. The loss function is expressed as:

$$Loss = L_{WIoUv} + Loss_{cls} + Loss_{obj}$$

Data inevitably contains low-quality samples, and geometric factors such as distance and aspect ratio aggravate the penalty on them, reducing the generalization performance of the model. When the anchor box overlaps the target box well, a good loss function should weaken the penalty of geometric factors, and less intervention during training gives the model better generalization. Distance attention $R_{WIoU}$ is therefore introduced, yielding WIoUv with a two-layer attention mechanism, which significantly amplifies $L_{IoU}$ for ordinary-quality anchor boxes. The bounding-box loss function $L_{WIoUv}$ is expressed as:

$$L_{WIoUv} = R_{WIoU}\, L_{IoU}$$

$$R_{WIoU} = \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\big(W_g^2 + H_g^2\big)^*}\right), \qquad L_{IoU} = 1 - IoU$$

where $R_{WIoU}$ denotes the distance-based coefficient, $L_{IoU}$ the initial intersection-over-union loss, $x$ and $y$ the horizontal and vertical coordinates of the center of the ground-truth box, $x_{gt}$ and $y_{gt}$ the horizontal and vertical coordinates of the predicted center-point region, and $W_g$ and $H_g$ the width and height of the minimum enclosing anchor box. To prevent $R_{WIoU}$ from producing gradients that hinder convergence, $W_g$ and $H_g$ are detached from the computation graph (the superscript * denotes this operation). Because it effectively removes the factors that hinder convergence, this operation introduces no new metric such as aspect ratio. This loss function makes the regression of the predicted box more stable, effectively avoids divergence during training, and accelerates the convergence of the predicted-box regression.
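
The following is a hedged PyTorch sketch of this bounding-box loss under the formulation above, assuming boxes in (x1, y1, x2, y2) format and taking $W_g$ and $H_g$ as the width and height of the smallest box enclosing the two boxes; these conventions are assumptions, not taken from the patent.

```python
import torch

def wiou_loss(pred, target, eps=1e-7):
    # pred, target: (..., 4) boxes as (x1, y1, x2, y2)
    px1, py1, px2, py2 = pred.unbind(-1)
    gx1, gy1, gx2, gy2 = target.unbind(-1)
    inter = ((torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(min=0) *
             (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(min=0))
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    l_iou = 1.0 - inter / (union + eps)                  # L_IoU = 1 - IoU
    px, py = (px1 + px2) / 2, (py1 + py2) / 2            # predicted box center
    gx, gy = (gx1 + gx2) / 2, (gy1 + gy2) / 2            # ground-truth box center
    wg = torch.max(px2, gx2) - torch.min(px1, gx1)       # enclosing-box width  W_g
    hg = torch.max(py2, gy2) - torch.min(py1, gy1)       # enclosing-box height H_g
    # distance attention; W_g, H_g detached (the * operation) so R_WIoU
    # does not produce gradients that hinder convergence
    r_wiou = torch.exp(((px - gx) ** 2 + (py - gy) ** 2) /
                       (wg ** 2 + hg ** 2).detach().clamp(min=eps))
    return (r_wiou * l_iou).mean()                       # L_WIoUv = R_WIoU * L_IoU
```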

The confidence loss function $Loss_{obj}$ is expressed as:

$$Loss_{obj} = \sum_{z}\sum_{x}\sum_{y} b \cdot loss_{BCE}(z,x,y)$$

$$loss_{BCE}(z,x,y) = -\Big[Lable_{(z,x,y)}\log P_{(z,x,y)} + \big(1 - Lable_{(z,x,y)}\big)\log\big(1 - P_{(z,x,y)}\big)\Big]$$

where $b$ denotes the confidence loss weight at positions where the mask matrix element is true, $loss_{BCE}(z,x,y)$ the cross-entropy loss, $Lable_{(z,x,y)}$ the confidence label matrix, and $P_{(z,x,y)}$ the predicted confidence matrix;

For the classification loss, taking the VisDrone2019 dataset as an example with 10 categories: the input image is divided into an 80×80 grid, and each grid cell has 3 preset anchor boxes, so the prediction probability matrix P has dimensions 3×80×80×10; the label probability matrix Label has the same dimensions as the prediction probability matrix. Before the loss is computed, the label values are converted to one-hot encoding and then smoothed on that basis, where label is the one-hot encoded value, a denotes the smoothing coefficient, set to 0.1, and n denotes the number of categories. The smoothing formula is as follows:

$$Lable_{smooth} = label \cdot (1 - a) + \frac{a}{n}$$

where $Lable_{smooth}$ denotes the label probability matrix after label smoothing, $P$ the prediction probability matrix, $n$ the number of categories (10 for the VisDrone2019 dataset), and $a$ the smoothing coefficient; $x$, $y$ and $z$ are matrix indices that locate an element, with $z$ indexing the 3 anchors, $x$ and $y$ indexing the 80×80 grid, and the class index $i$ running over the $n$ categories. The loss at each position in the matrix is computed as:

$$loss_{(z,x,y)} = -\sum_{i=1}^{n} Lable_{smooth,(z,x,y,i)} \log P_{(z,x,y,i)}$$

When the grid size is 80×80 and the number of categories is 6, the classification loss function is:

$$Loss_{cls} = \sum_{z=1}^{3}\sum_{x=1}^{80}\sum_{y=1}^{80} mask_{(z,x,y)}\cdot loss_{(z,x,y)}$$

A larger confidence value indicates a higher probability that a target object is present in the corresponding rectangular box.
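
Below is a minimal sketch of the label smoothing and classification loss described above, assuming PyTorch; the per-class binary cross-entropy form, the tensor layout (3 anchors × S × S grid × n classes) and the mask semantics are assumptions consistent with the text, not confirmed details of the patent.

```python
import torch
import torch.nn.functional as F

def classification_loss(pred_logits, labels, mask, a=0.1):
    # pred_logits: (3, S, S, n) raw class scores; labels: (3, S, S) integer class ids
    # mask: (3, S, S) boolean, true where an anchor is responsible for a target
    n = pred_logits.shape[-1]
    one_hot = F.one_hot(labels, num_classes=n).float()
    smooth = one_hot * (1 - a) + a / n              # Lable_smooth = label*(1-a) + a/n
    bce = F.binary_cross_entropy_with_logits(pred_logits, smooth, reduction="none")
    return (bce.sum(-1) * mask).sum()               # sum the masked per-position losses
```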

In an optional embodiment of the present invention, the detection module comprises a convolutional layer, a pooling layer and a fully connected layer; the processing of the input features by the detection module comprises:

the convolutional layer processes the input first, second and third fused features, the pooling layer processes the features output by the convolutional layer, and the fully connected layer processes the features output by the pooling layer to obtain the classification result.

Specifically, in this embodiment, the features are finally fed into a 1×1 convolution, followed by a pooling operation and a fully connected operation, to obtain a classification result and complete the construction of the network model.
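
A minimal sketch of this detection-module pipeline, assuming PyTorch; the channel count and the use of global average pooling are assumptions.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """1x1 convolution -> pooling -> fully connected layer -> classification result."""
    def __init__(self, in_channels, num_classes=6):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):
        x = self.pool(self.conv(x)).flatten(1)  # (B, C, H, W) -> (B, C)
        return self.fc(x)                       # class scores
```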

In an optional embodiment of the present invention, specifically, referring again to FIG. 5, 10,000 high-quality images are selected from the VisDrone2019 dataset combined with the AU-AIR dataset, simple data augmentations such as random rotation and added noise are applied, and the images are divided into training, validation and test sets at a ratio of 6:2:2. The label content of the dataset is readjusted so that the self-made dataset contains six categories of vehicle targets in different scenes. The training set consists of 6,000 images, and the validation and test sets each consist of 2,000 images. Example images of the dataset are shown in FIG. 6, a schematic diagram of example images of the dataset provided by an embodiment of the present invention.

Combining the advantages of current lightweight networks while maintaining detection performance, the mobile inverted bottleneck convolution (MBConv) module and the designed adaptive frequency-band filtering operator are used as the main framework of the network. The backbone is shown in FIG. 2. Specifically, tokenization is performed with a convolutional stem consisting of one 3×3 convolutional layer with stride 2, followed by four MBConv layers with kernel size 3. After tokenization, three cascaded stages form the main body of the lightweight network, each consisting of one MBConv layer with stride 2 for spatial downsampling and N_i AFF modules, with N1=2 AFF modules in the first module, N2=4 in the second and N3=3 in the third. Two groups of linear layers (also called 1×1 convolutional layers) and ReLU functions learn the proposed adaptive frequency filtering mask; using linear layer groups is crucial because it prevents the nonlinearity from destroying too much information, and the linear layer group uses 1×1 convolution kernels to further compress the number of channels, reducing computation and parameters. Layer normalization (LN) performs channel mixing, whose output is fed to the proposed adaptive frequency-band filtering operator for global sequence mixing to obtain the output of the l-th AFF module; skip connections for channel mixing and sequence mixing speed up training convergence. The backbone is responsible for feature extraction and the neck network for feature fusion; the improved adaptively spatial feature fusion (ASFF) is introduced as the neck fusion module. The features are finally fed into a 1×1 convolution, pooling and a fully connected operation to obtain the classification result, completing the network model construction.

The network is trained iteratively and the best weights are saved.

End-to-end training is established at the output with the designed loss function, so that the network converges; the best training weights are saved, completing the training of the model.

The trained network model is used for prediction, and the mean average precision (mAP) is used as the evaluation index of network performance. Please refer to FIG. 7, a schematic diagram of the test results provided by an embodiment of the present invention: (a) shows the original real-scene images, and (b) shows the test results of the present invention.

In the object detection task, TP (true positives) denotes positive samples correctly identified, TN (true negatives) denotes negative samples correctly identified, FP (false positives) denotes negative samples incorrectly identified as positive, and FN (false negatives) denotes positive samples incorrectly identified as negative. Precision (P) is the proportion of correct detections among all detected targets, calculated as:

$P = \dfrac{TP}{TP + FP}$

Mean Average Precision (mAP) measures the average precision over all categories; a higher mAP indicates better network model performance. mAP is obtained by summing and averaging the Average Precision (AP) of the individual categories: if an object detection task has N categories and AP_i denotes the AP of the i-th category, then:

$\mathrm{mAP} = \dfrac{1}{N} \sum_{i=1}^{N} AP_i$
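As a small sketch of the averaging step only (computing each per-class AP as the area under its precision-recall curve is assumed to be done elsewhere; the AP values below are made up):

    # mAP as the mean of per-class AP values.
    def mean_average_precision(ap_per_class):
        return sum(ap_per_class) / len(ap_per_class)

    ap = [0.52, 0.47, 0.61, 0.44, 0.50, 0.45]  # six vehicle classes (made-up)
    print(f"mAP = {mean_average_precision(ap):.3f}")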

A tensor is randomly generated with a deep learning framework and fed into the trained network model obtained above to calculate the parameter count (Params) and FLOPs, which measure model complexity. To evaluate the advantages of the proposed method more objectively and fully, two state-of-the-art lightweight network models, MobileNetV3 and ShuffleNetV2, a spatial-domain convolution variant, and the method of the present invention are compared on the dataset obtained in the above steps. The comparison results are shown in Table 1.

Table 1. Evaluation metrics on the self-built dataset

Model                        Params   FLOPs   mAP
ShuffleNetV2                 5.5      0.6     44.5
MobileNetV3                  5.4      0.2     45.2
Spatial-domain convolution   10.7     2.7     48.6
Present invention            5.5      1.5     49.8

Compared with the most advanced lightweight network models, the present invention sacrifices a small amount of computation while keeping a parameter count similar to the advanced lightweight models ShuffleNetV2 and MobileNetV3, and the accuracy of the network on small vehicle targets improves. When the adaptive frequency band filtering operator of the present invention replaces an equivalent spatial-domain convolution, the parameter count decreases, the computational cost decreases, and the detection accuracy increases. The method can meet the real-time and accuracy requirements of small vehicle target detection on edge devices where both storage and computation are limited, achieving a better trade-off between accuracy and efficiency.
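As an aside, a minimal sketch of how Params and FLOPs figures like those in Table 1 might be measured follows; the use of fvcore and the 640×640 input resolution are assumptions, not necessarily the tooling used in this embodiment.

    # Count parameters directly; estimate FLOPs with fvcore's FlopCountAnalysis
    # (an assumed choice of tool).
    import torch
    from fvcore.nn import FlopCountAnalysis

    def profile(model, input_size=(1, 3, 640, 640)):
        params_m = sum(p.numel() for p in model.parameters()) / 1e6
        x = torch.randn(*input_size)           # random input tensor
        flops_g = FlopCountAnalysis(model, x).total() / 1e9
        return params_m, flops_g

    # e.g.: p, f = profile(my_model); print(f"Params: {p:.1f}M, FLOPs: {f:.1f}G")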

In summary, to address the problems that existing deep convolutional neural networks have large parameter counts and high computational costs while shallow lightweight networks have weak feature extraction ability and poor detection performance, a vehicle small target detection method with a lightweight network design is proposed. While reducing computational cost, its feature extraction is computationally efficient, semantically adaptive, and globally effective. The output end constructs an attention-based bounding-box loss function, making localization more accurate and improving the generalization ability of the model. While reducing model complexity, the entire network meets the real-time and accuracy requirements of vehicle target detection on storage- and computation-limited edge devices.

Based on the same inventive concept, Figure 8 is a schematic diagram of a vehicle small target detection device with a lightweight network design provided by an embodiment of the present invention. The present invention further provides a vehicle small target detection device with a lightweight network design, applied to the vehicle small target detection method provided by the above embodiments of the present invention; for the method embodiments, refer to the description above, which is not repeated here. The device includes:

an image acquisition module 101, configured to acquire an image of a vehicle to be tested;

an image processing module 201, configured to classify the image of the vehicle to be tested using a trained network model and output a classification result, wherein the trained network model includes a backbone network module, a neck network module, and a detection module; the backbone network module extracts hierarchical global features and local features, the neck network module fuses the hierarchical global features and local features, and the detection module processes the fused global features and local features to obtain the classification result.

It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant are intended to cover non-exclusive inclusion, so that an article or device including a series of elements includes not only those elements but also other elements not explicitly listed. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the article or device that includes the element. Words such as "connect" or "connected" are not limited to physical or mechanical connections but may include electrical connections, whether direct or indirect. Orientations or positional relationships indicated by "upper", "lower", "left", "right", and the like are based on the orientations or positional relationships shown in the drawings, are only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be understood as limiting the present invention.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature or characteristic described in conjunction with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may join and combine the different embodiments or examples described in this specification.

The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention is not limited to these descriptions. For a person of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions may be made without departing from the concept of the present invention, all of which shall be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A vehicle small target detection method with a lightweight network design, characterized by comprising:
acquiring an image of a vehicle to be tested; and
classifying the image of the vehicle to be tested using a trained network model and outputting a classification result, wherein the trained network model comprises a backbone network module, a neck network module, and a detection module; the backbone network module is configured to extract hierarchical global features and local features, the neck network module is configured to fuse the hierarchical global features and local features, and the detection module is configured to process the fused global features and local features to obtain the classification result.

2. The vehicle small target detection method with a lightweight network design according to claim 1, characterized in that the backbone network module comprises a first module, a second module, and a third module; the image of the vehicle to be tested input to the backbone network module is processed by the first module to obtain first features, the first features are input to the second module for processing to obtain second features, and the second features are input to the third module for processing to obtain third features; wherein the first module, the second module, and the third module each comprise a downsampling module, AFF modules, and a fusion module; the number of AFF modules set in the first module, the number set in the second module, and the number set in the third module are all different; and the downsampling module comprises an MBConv layer.

3. The method according to claim 2, characterized in that the AFF module comprises a first normalization layer, a second normalization layer, and an adaptive frequency band filtering operator, wherein the first normalization layer and the second normalization layer are used for channel mixing and the adaptive frequency band filtering operator is used for global sequence mixing; the features output by the AFF module are expressed as:

$\hat{X}^{l} = \mathrm{LN}\big(\mathrm{MBConv}^{l}(X^{l-1})\big), \qquad X^{l} = \mathrm{AFF}^{l}\big(\mathrm{LN}(\hat{X}^{l})\big) + \hat{X}^{l}$

where $\hat{X}^{l}$ denotes the output features after the $l$-th downsampling layer and the first normalization layer, $\mathrm{MBConv}^{l}(\cdot)$ denotes the output features after processing by the $l$-th MBConv layer, $X^{l-1}$ denotes the features input to the downsampling layer, $\mathrm{AFF}^{l}(\cdot)$ denotes the output features after processing by the $l$-th AFF module, $\mathrm{LN}(\cdot)$ denotes the output features after processing by the second normalization layer, and $X^{l}$ denotes the final output features after processing by the $l$-th AFF module.

4. The method according to claim 3, characterized in that the adaptive frequency band filtering operator comprises a Fourier transform module, a frequency-domain filter, and an inverse Fourier transform module; the processing of input features by the adaptive frequency band filtering operator comprises:

using the Fourier transform module to transform the input features from the spatial domain to the frequency domain to obtain frequency-domain features:

$X_{F} = \mathcal{F}(X)$

where $X_{F}$ denotes the frequency-domain features, $X$ denotes the input features, and $\mathcal{F}$ denotes the Fourier transform;

using the frequency-domain filter to process the frequency-domain features to obtain a mask tensor $\hat{M}$ learned from the frequency-domain features $X_{F}$;

performing an element-wise multiplication of the mask tensor $\hat{M}$ with the frequency-domain features $X_{F}$ to obtain the filtered features $\tilde{X}_{F}$:

$\tilde{X}_{F} = \hat{M} \odot X_{F}$

where $\odot$ denotes the element-wise (Hadamard) product;

using the inverse Fourier transform module to transform the filtered features from the frequency domain back to the spatial domain to obtain the spatial-domain features $X^{\prime}$:

$X^{\prime} = \mathcal{F}^{-1}(\tilde{X}_{F})$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform; the spatial-domain features $X^{\prime}$ are the features output after processing by the adaptive frequency band filtering operator.

5. The method according to claim 2, characterized in that the neck network module comprises three ASFF modules, namely a first ASFF module, a second ASFF module, and a third ASFF module; the processing of input features by the neck network module comprises:
the first ASFF module fusing the first features, the second features, and the third features to obtain first fused features; the second ASFF module fusing the first features, the second features, and the third features to obtain second fused features; and the third ASFF module fusing the first features, the second features, and the third features to obtain third fused features.

6. The method according to claim 1, characterized in that the features output by the ASFF module are expressed as:

$y^{l} = \alpha^{l} \cdot x^{1 \to l} + \beta^{l} \cdot x^{2 \to l} + \gamma^{l} \cdot x^{3 \to l}$

where $y^{l}$ denotes the new $l$-th layer feature map obtained after fusion, $x^{n \to l}$ denotes the feature vector obtained by adjusting the $n$-th layer feature map to the same size and dimension as the $l$-th layer feature map, and $\alpha^{l}$, $\beta^{l}$, and $\gamma^{l}$ denote the adaptively learned weight coefficients of the $l$-th layer, with $\alpha^{l} + \beta^{l} + \gamma^{l} = 1$; the weight coefficients are expressed as:

$\alpha^{l} = \dfrac{e^{\lambda_{\alpha}^{l}}}{e^{\lambda_{\alpha}^{l}} + e^{\lambda_{\beta}^{l}} + e^{\lambda_{\gamma}^{l}}}$

where $\lambda_{\alpha}^{l}$, $\lambda_{\beta}^{l}$, and $\lambda_{\gamma}^{l}$ denote the self-learning weight control parameters defined at the $l$-th layer.

7. The method according to claim 1, characterized in that the trained network model is obtained by training a preset network model, and training the preset network model comprises:
acquiring a dataset comprising a training sample set and a test sample set;
training the preset network model using the training sample set;
during training, performing iterative optimization using a loss function to obtain the optimal weights of the preset network model so as to construct the trained network model; and
testing the performance of the trained network model using the test sample set.

8. The method according to claim 7, characterized in that the loss function comprises a boundary loss function, a confidence loss function, and a classification loss function, expressed as:

$\mathrm{Loss} = L_{\mathrm{WIoUv}} + \mathrm{Loss}_{\mathrm{cls}} + \mathrm{Loss}_{\mathrm{obj}}$

the boundary loss function $L_{\mathrm{WIoUv}}$ is expressed as:

$L_{\mathrm{WIoUv}} = R_{\mathrm{WIoU}} \cdot L_{\mathrm{IoU}}, \qquad R_{\mathrm{WIoU}} = \exp\!\left(\dfrac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{W_{g}^{2} + H_{g}^{2}}\right)$

where $R_{\mathrm{WIoU}}$ denotes a distance-based coefficient, $L_{\mathrm{IoU}}$ denotes the initial intersection-over-union loss function, $x$ and $y$ denote the horizontal and vertical coordinates of the center of the ground-truth box, $x_{gt}$ and $y_{gt}$ denote the horizontal and vertical coordinates of the predicted center point region, $W_{g}$ denotes the width of the minimum anchor box, and $H_{g}$ denotes the height of the minimum anchor box;

the confidence loss function $\mathrm{Loss}_{\mathrm{obj}}$ is expressed as:

$\mathrm{Loss}_{\mathrm{obj}} = \sum_{(z,x,y)} b \cdot \mathrm{loss}_{\mathrm{BCE}}\big(\mathrm{Lable}(z,x,y),\ P(z,x,y)\big)$

where $b$ denotes the confidence loss weight when the element value in the mask matrix is true, $\mathrm{loss}_{\mathrm{BCE}}(z,x,y)$ denotes the cross-entropy loss function, $\mathrm{Lable}(z,x,y)$ denotes the confidence label matrix, and $P(z,x,y)$ denotes the predicted confidence matrix;

the classification loss function $\mathrm{loss}_{\mathrm{cls}}$ is expressed as:

$\mathrm{Lable}_{\mathrm{smooth}} = (1 - a) \cdot \mathrm{Lable} + \dfrac{a}{n}, \qquad \mathrm{loss}_{\mathrm{cls}} = \mathrm{loss}_{\mathrm{BCE}}\big(\mathrm{Lable}_{\mathrm{smooth}},\ P\big)$

where $\mathrm{Lable}_{\mathrm{smooth}}$ denotes the label probability matrix after label smoothing, $P$ denotes the predicted probability matrix, $n$ denotes the number of categories, and $a$ denotes the smoothing coefficient.

9. The method according to claim 5, characterized in that the detection module comprises a convolution layer, a pooling layer, and a fully connected layer; the processing of input features by the detection module comprises:
the convolution layer processing the input first fused features, second fused features, and third fused features; the pooling layer processing the features processed by the convolution layer; and the fully connected layer processing the features processed by the pooling layer to obtain the classification result.

10. A vehicle small target detection device with a lightweight network design, characterized by comprising:
an image acquisition module, configured to acquire an image of a vehicle to be tested; and
an image processing module, configured to classify the image of the vehicle to be tested using a trained network model and output a classification result, wherein the trained network model comprises a backbone network module, a neck network module, and a detection module; the backbone network module is configured to extract hierarchical global features and local features, the neck network module is configured to fuse the hierarchical global features and local features, and the detection module is configured to process the fused global features and local features to obtain the classification result.
CN202410131971.3A 2024-01-30 2024-01-30 Vehicle small target detection method and device based on lightweight network design Pending CN118038379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410131971.3A CN118038379A (en) 2024-01-30 2024-01-30 Vehicle small target detection method and device based on lightweight network design

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410131971.3A CN118038379A (en) 2024-01-30 2024-01-30 Vehicle small target detection method and device based on lightweight network design

Publications (1)

Publication Number Publication Date
CN118038379A true CN118038379A (en) 2024-05-14

Family

ID=90998258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410131971.3A Pending CN118038379A (en) 2024-01-30 2024-01-30 Vehicle small target detection method and device based on lightweight network design

Country Status (1)

Country Link
CN (1) CN118038379A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118379696A (en) * 2024-06-21 2024-07-23 哈尔滨工程大学三亚南海创新发展基地 Ship target detection method and device and readable storage medium
CN118397403A (en) * 2024-07-01 2024-07-26 合肥市正茂科技有限公司 Training method, device, equipment and medium for low-illumination vehicle image detection model
CN119579851A (en) * 2024-10-09 2025-03-07 天津津航技术物理研究所 A lightweight deep learning target detection and recognition model and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination