
CN116310605A - Image detection method, device, equipment and computer storage medium - Google Patents


Info

Publication number
CN116310605A
CN116310605A (application CN202211510080.6A)
Authority
CN
China
Prior art keywords
convolution
image
module
layer
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211510080.6A
Other languages
Chinese (zh)
Other versions
CN116310605B (en)
Inventor
李磊
马超凡
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN202211510080.6A
Publication of CN116310605A
Application granted
Publication of CN116310605B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

Embodiments of the invention relate to the technical field of computer data processing and disclose an image detection method comprising: inputting an image to be detected into a target detection model to obtain a target detection result corresponding to the image. The target detection model includes an improved YOLOv3 model that contains a high-dimensional convolution module; the high-dimensional convolution module performs depthwise convolution on the image to be detected in a high-dimensional space to obtain the image features of the image. In this way, the embodiments achieve a balance between accuracy and detection speed when performing target detection on high-resolution images such as remote sensing images.

Description

Image detection method, apparatus, device, and computer storage medium

Technical Field

Embodiments of the present invention relate to the technical field of computer data processing, and in particular to an image detection method, apparatus, device, and computer storage medium.

Background

Remote sensing images offer very long range, all-day operation, and strong interference resistance, and their detection and recognition is a research hotspot in computer vision. In the prior art, detection tasks on high-resolution remote sensing images mainly rely on the YOLO (You Only Look Once) family. The YOLO algorithm first divides an image into an SxS grid; the grid cell that contains an object's center is responsible for predicting that object. YOLOv3 builds on v1 by analyzing dataset labels with K-means clustering and upgrading the feature extraction network from Darknet19 to Darknet53.

In implementing embodiments of the present invention, the inventors found that the YOLOv3 model extracts features mainly through the RES and DBL structures. A RES operation consists mainly of a 3x3 convolution with stride 2 followed by n ResUnits, where each ResUnit is composed of a 1x1 convolution and a 3x3 convolution. The large number of 3x3 convolution operations sharply inflates the model's parameter count; given that remote sensing images are already high-resolution, YOLOv3 cannot balance detection speed and accuracy.

Summary

In view of the above problems, an embodiment of the present invention provides an image detection method to solve the problem that existing YOLOv3 cannot balance detection speed and accuracy when performing target detection on high-resolution images.

According to one aspect of an embodiment of the present invention, an image detection method is provided, the method comprising:

inputting an image to be detected into a target detection model to obtain a target detection result corresponding to the image to be detected, wherein the target detection model includes an improved YOLOv3 model, the improved YOLOv3 model includes a high-dimensional convolution module, and the high-dimensional convolution module is configured to perform depthwise convolution on the image to be detected in a high-dimensional space to obtain image features of the image to be detected.

In an optional manner, the high-dimensional convolution module includes a first depthwise convolution layer, a dimension-adjusting convolution layer, and a second depthwise convolution layer connected in sequence. The first and second depthwise convolution layers perform depthwise convolution on their input data; the dimension-adjusting convolution layer adjusts the dimensionality of its input so that the data fed to the second depthwise convolution layer is high-dimensional.
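As a rough illustration of why such a stack is cheap, the parameter cost of a depthwise / 1x1-expand / depthwise sequence can be compared with a plain 3x3 convolution. The channel widths and the expansion factor below are hypothetical examples, not values taken from this patent:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Parameter count of a k x k convolution (bias ignored)."""
    return (c_in // groups) * k * k * c_out

def high_dim_module_params(c_in, expand=6, k=3):
    """Depthwise k x k -> 1x1 expansion to a higher dimension -> depthwise k x k.
    `expand` is a hypothetical expansion factor, not specified by the patent."""
    c_high = c_in * expand
    dw1 = conv_params(c_in, c_in, k, groups=c_in)        # first depthwise layer
    pw = conv_params(c_in, c_high, 1)                    # dimension-adjusting 1x1
    dw2 = conv_params(c_high, c_high, k, groups=c_high)  # second depthwise, in high-dim space
    return dw1 + pw + dw2

standard = conv_params(64, 64, 3)    # plain 3x3 conv: 36864 parameters
module = high_dim_module_params(64)  # 576 + 24576 + 3456 = 28608 parameters
assert module < standard
```

Even though the second depthwise layer operates on six times as many channels, the whole module stays cheaper than one ordinary 3x3 convolution, because depthwise layers cost only one filter plane per channel.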

In an optional manner, the improved YOLOv3 model includes a multi-scale convolution module, which convolves the input data with convolution kernels at multiple scales to obtain the image features at multiple scales; the convolution kernels include the high-dimensional convolution module.

In an optional manner, the multi-scale convolution module includes multiple layers of sequentially connected convolution kernels, whose sizes decrease in turn while their depths increase in turn. The kernels use grouped convolution so that each kernel outputs image features with the same number of channels.

In an optional manner, the improved YOLOv3 model further includes a feature enhancement module, which enhances the image features according to a channel attention mechanism.

In an optional manner, the feature enhancement module includes a pooling layer, multiple fully connected layers, and a weighting layer connected in sequence. The pooling layer extracts the image features of multiple channels; the fully connected layers determine inter-channel dependencies from the per-channel features; and the weighting layer weights the multi-channel image features by those dependencies to obtain the enhanced image features.
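A minimal, dependency-free sketch of this squeeze-and-excitation-style reweighting is shown below. The plain sigmoid gate standing in for the fully connected layers is a simplifying assumption for illustration, not the patent's exact structure:

```python
import math

def channel_attention(feature_maps):
    """feature_maps: list of channels, each a 2D list (H x W).
    Returns the channel-wise reweighted feature maps."""
    # 1. Squeeze: global average pooling per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature_maps]
    # 2. Excitation: map the pooled statistics to weights in (0, 1).
    #    (A real SE block uses two FC layers; a sigmoid stands in here.)
    weights = [1.0 / (1.0 + math.exp(-p)) for p in pooled]
    # 3. Reweight each channel by its attention weight.
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(feature_maps, weights)]

x = [[[1.0, 1.0], [1.0, 1.0]],  # channel 0: mean activation 1.0
     [[0.0, 0.0], [0.0, 0.0]]]  # channel 1: mean activation 0.0
y = channel_attention(x)
```

The channel with the stronger pooled response receives a weight closer to 1, so the network "pays more attention" to it, which is the effect the feature enhancement module aims for.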

In an optional manner, the image to be detected includes a high-resolution image, and the improved YOLOv3 model is obtained by modifying the original YOLOv3 model. The original YOLOv3 model includes an original DBL module and an original RES module: the original DBL module comprises an original convolution layer, a normalization layer, and an activation function layer connected in sequence, and the original RES module comprises a zero-padding layer, the original DBL module, and an original residual structure connected in sequence. In the improved YOLOv3 model, the original convolution layer is replaced by the high-dimensional convolution module, a feature enhancement module is added after the activation function layer, and a multi-scale convolution module is added after the original residual structure. The high-dimensional convolution module performs depthwise convolution on the input data in a high-dimensional space; the feature enhancement module enhances the image features according to a channel attention mechanism; and the multi-scale convolution module convolves the input data with convolution kernels at multiple scales, the kernels including the high-dimensional convolution module.

According to another aspect of the embodiments of the present invention, an image detection apparatus is provided, including:

a detection module configured to input an image to be detected into a target detection model to obtain the target detection result corresponding to the image, wherein the target detection model includes an improved YOLOv3 model containing a high-dimensional convolution module that performs depthwise convolution on the image to be detected in a high-dimensional space to obtain its image features.

According to another aspect of the embodiments of the present invention, an image detection device is provided, including:

a processor, a memory, a communication interface, and a communication bus, the processor, the memory, and the communication interface communicating with one another via the communication bus;

the memory storing at least one executable instruction that causes the processor to perform the operations of any of the foregoing image detection method embodiments.

According to yet another aspect of the embodiments of the present invention, a computer-readable storage medium is provided, the storage medium storing at least one executable instruction that causes an image detection device to perform the operations of any of the foregoing image detection method embodiments.

In embodiments of the present invention, an image to be detected is input into a target detection model to obtain the corresponding target detection result; the target detection model includes an improved YOLOv3 model containing a high-dimensional convolution module that performs depthwise convolution on the image in a high-dimensional space to obtain its image features. Unlike the existing YOLOv3 model, which extracts features with a large number of 3x3 convolutions and therefore suffers from an excessive parameter count and poor detection speed, the embodiments place depthwise convolutions at the beginning and end of the convolution structure so that both depthwise operations act in a high-dimensional space. This extracts information-rich feature representations, performs identity mapping and spatial transformation in a higher dimension, and reduces the risk of information loss and gradient confusion introduced by the inverted residual structure of traditional mobile networks, thereby effectively improving the model's processing speed and achieving a balance between detection accuracy and detection speed.

The above description is only an overview of the technical solutions of the embodiments of the present invention. To make the technical means of the embodiments clearer so that they can be implemented according to the contents of the specification, and to make the above and other objects, features, and advantages of the embodiments more apparent, specific implementations of the present invention are set forth below.

Brief Description of the Drawings

The drawings are only for illustrating the embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same components. In the drawings:

Fig. 1 shows a schematic structural diagram of the existing YOLOv3 model;

Fig. 2 shows a schematic flowchart of an image detection method provided by an embodiment of the present invention;

Fig. 3 shows a schematic structural diagram of the high-dimensional convolution module in the image detection method provided by an embodiment of the present invention;

Fig. 4 shows a schematic structural diagram of the multi-scale convolution module in the image detection method provided by an embodiment of the present invention;

Fig. 5 shows a schematic structural diagram of the feature enhancement module in the image detection method provided by an embodiment of the present invention;

Fig. 6 shows a schematic structural diagram of the improved DBL module in the image detection method provided by an embodiment of the present invention;

Fig. 7 shows a schematic structural diagram of the improved NRES module in the image detection method provided by an embodiment of the present invention;

Fig. 8 shows a comparison of detection results between the improved YOLOv3 of the image detection method provided by an embodiment of the present invention and the prior art;

Fig. 9 shows a schematic structural diagram of an image detection apparatus provided by an embodiment of the present invention;

Fig. 10 shows a schematic structural diagram of an image detection device provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited by the embodiments set forth herein.

Related terms are explained below:

IoU (Intersection over Union): commonly used in the evaluation of current target detection algorithms; the higher the IoU value, the more accurately the algorithm predicts the target. IoU takes values in [0, 1], and the larger the value, the better two boxes coincide.
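The IoU described above can be computed directly from box corner coordinates; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 0 if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

assert iou((0, 0, 2, 2), (0, 0, 2, 2)) == 1.0    # identical boxes -> IoU 1
assert iou((0, 0, 2, 2), (1, 1, 3, 3)) == 1 / 7  # intersection 1, union 7
```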

Pooling: a matrix window scans the input tensor, and the values in each window are reduced (by taking the maximum, the average, etc.) to decrease the number of elements, which amounts to extracting the important features. Since the purpose of convolution is to extract features, the key features must be further filtered out once all features have been extracted; pooling the features therefore shrinks the matrix very effectively, e.g. reducing its height and width.
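The window-scanning reduction above can be sketched for the common 2x2 max-pooling case:

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D list; halves height and width."""
    h, w = len(x), len(x[0])
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

x = [[1, 3, 2, 0],
     [4, 2, 1, 1],
     [0, 0, 5, 6],
     [1, 2, 7, 8]]
assert max_pool_2x2(x) == [[4, 2], [2, 8]]  # each output is the max of one 2x2 window
```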

Grouped convolution: for an ordinary convolution with an input feature map of size CxHxW and N convolution kernels each of size CxKxK, the output feature map has N channels and the total parameter count is NxCxKxK. With grouped convolution into G groups, each group takes C/G input feature maps and produces N/G output feature maps, each kernel has size (C/G)xKxK, each group has N/G kernels, and a kernel convolves only with the inputs of its own group, so the total parameter count is Nx(C/G)xKxK, i.e. 1/G of the original. When the number of groups equals the number of input channels and the number of output channels also equals the number of input channels (G = N = C), so that each kernel has size 1xKxK, this becomes depthwise convolution.
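The parameter counts in the paragraph above can be checked numerically; the 1/G reduction and the depthwise special case (G = N = C) fall out directly:

```python
def grouped_conv_params(c_in, c_out, k, groups=1):
    """Parameter count of a grouped k x k convolution (bias ignored).
    Each kernel sees only c_in // groups input channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k * k

ordinary = grouped_conv_params(64, 64, 3)              # N x C x K x K
grouped = grouped_conv_params(64, 64, 3, groups=4)     # 1/G of the above
depthwise = grouped_conv_params(64, 64, 3, groups=64)  # G = N = C

assert ordinary == 36864
assert grouped == ordinary // 4    # grouping into G = 4 groups
assert depthwise == ordinary // 64  # depthwise: 1/C of an ordinary conv
```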

Channel attention mechanism: explicitly models the interdependence between feature channels by adopting a "feature recalibration" strategy that adaptively recalibrates each channel's feature response, so that the network pays more attention to the targets to be detected and the detection effect improves.

Pyramid Convolution (PyConv): processes the input at multiple filter scales. PyConv contains a kernel pyramid whose layers hold different types of filters (the size and depth of the filters are variable, so details can be extracted at different scales). Besides extracting multi-scale information, PyConv is efficient compared with standard convolution, adding no extra computation or parameters. It is also more flexible and scalable, opening a larger architecture design space for different applications.

NWPU VHR-10 dataset: a publicly available ten-class geospatial object detection dataset for research. The ten classes are airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle. The dataset contains 800 very-high-resolution (VHR) remote sensing images cropped from Google Earth and the Vaihingen dataset and manually annotated by experts.

Before describing the embodiments of the present invention, the prior art and its problems are further explained:

Remote sensing images offer very long range, all-day operation, and strong interference resistance; target detection and recognition in remote sensing images is a research hotspot in computer vision. Target detection in optical remote sensing images currently attracts much attention and has been widely applied in fields such as urban security and land planning.

For detection tasks on high-resolution remote sensing images, the YOLO series dominates. The YOLO algorithm first divides an image into an SxS grid; the grid cell that contains an object's center is responsible for predicting that object. In YOLO, each grid cell predicts two kinds of information for a bounding box: position and confidence. The position information contains four parameters: the coordinates of the bounding box's center point and its width and height. Confidence has two meanings: whether the predicted bounding box contains an object, and how accurate the predicted box is. It is defined as follows:

Confidence = Pr(object) x IoU(pred, truth)

where Pr(object) indicates whether an object exists in the grid cell, taking 1 if it does and 0 otherwise, and IoU(pred, truth) denotes the IoU value between the predicted bounding box and the ground truth. Each bounding box predicts five values: (x, y, w, h) and the confidence. Each grid cell additionally predicts class information for C classes. With SxS grid cells, each predicting B bounding boxes and C class probabilities, the network finally outputs a tensor of shape SxSx(5xB+C).
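The SxSx(5xB+C) output shape can be computed directly. The concrete numbers below (a 7x7 grid, 2 boxes per cell, 20 classes) are the classic YOLOv1-style setting this formula describes, used here only as an illustration; YOLOv3's per-scale layout of 3x(5+C) is covered later in the text:

```python
def yolo_output_shape(s, b, c):
    """Shape of the YOLO output tensor: an S x S grid of cells, each predicting
    B boxes (x, y, w, h, confidence) plus C class probabilities."""
    return (s, s, 5 * b + c)

# Hypothetical illustration: 7x7 grid, 2 boxes per cell, 20 classes.
assert yolo_output_shape(7, 2, 20) == (7, 7, 30)
```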

Building on v1, YOLOv3 analyzes dataset labels with K-means clustering and upgrades the feature extraction network from Darknet19 to Darknet53. The Darknet53 network is YOLOv3's backbone and extracts features at 8x, 16x, and 32x downsampling respectively. The network makes heavy use of 1x1 and 3x3 convolutions, with 1x1 mainly used to expand and shrink channels. The overall convolutional network adopts the Conv+BN+LeakyReLU structure; its residual blocks first shrink channels with a 1x1 kernel and then restore them with a 3x3 kernel, which is essentially a matrix factorization idea used to reduce the parameter count.

Specifically, the structure of the existing YOLOv3 model is shown in Fig. 1. The three basic components of YOLOv3 are explained first:

DBL: the smallest component in the YOLOv3 network structure, consisting of Conv + BN + Leaky_relu, where Conv is a convolution layer, BN is a Batch Normalization layer, and Leaky_relu is the activation function layer. Along the batch-size dimension, BN directly normalizes the same channel across different samples, yielding C means and variances along with C pairs of γ and β (the learnable parameters of the BN layer). The BN layer speeds up training and convergence, controls gradient explosion and prevents vanishing gradients, and helps prevent overfitting.

Res unit: borrows the residual structure from the ResNet network so that the network can be built deeper.

ResX: composed of one DBL and X residual components; it is the large component in YOLOv3. The DBL before each Res module performs downsampling, so after five Res modules the feature map shrinks through the sizes 608 -> 304 -> 152 -> 76 -> 38 -> 19.
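The size chain above is just repeated halving by the stride-2 DBL before each Res block, which can be checked directly:

```python
def backbone_sizes(input_size, num_res_blocks=5):
    """Feature-map side lengths after each stride-2 DBL preceding a Res block."""
    sizes = [input_size]
    for _ in range(num_res_blocks):
        sizes.append(sizes[-1] // 2)  # each stride-2 conv halves the side length
    return sizes

assert backbone_sizes(608) == [608, 304, 152, 76, 38, 19]
```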

Other basic operations in YOLOv3:

Concat: tensor concatenation, which concatenates an intermediate darknet layer with the upsampled output of a later layer. Concatenation differs from the residual layer's add operation: concatenation expands the tensor's dimensions, while add sums tensors directly without changing their dimensions.

add: tensor addition; tensors are added element-wise without expanding dimensions. For example, adding 104x104x128 and 104x104x128 still yields 104x104x128. add performs the same function as shortcut in the cfg file.

Number of convolutional layers in the backbone: each ResX contains 1 + 2xX convolutional layers, so the whole backbone contains 1 + (1+2x1) + (1+2x2) + (1+2x8) + (1+2x8) + (1+2x4) = 52 layers; adding one FC (fully connected) layer forms a Darknet53 classification network. In YOLOv3 for image detection the FC layer is removed, but for convenience the YOLOv3 backbone is still called the Darknet53 structure.
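The layer-count arithmetic above can be verified in a couple of lines, using the residual-unit counts (1, 2, 8, 8, 4) of the five ResX blocks:

```python
def backbone_conv_layers(res_block_units=(1, 2, 8, 8, 4)):
    """Darknet53 conv-layer count: one stem conv, then for each ResX block
    one stride-2 downsampling conv plus 2 convs per residual unit."""
    return 1 + sum(1 + 2 * x for x in res_block_units)

conv_layers = backbone_conv_layers()
assert conv_layers == 52       # 52 conv layers in the backbone
assert conv_layers + 1 == 53   # plus one FC layer -> the name "Darknet53"
```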

As shown in Fig. 1, YOLOv3 uses the first 52 layers of darknet-53 (without the fully connected layer). The YOLOv3 network is a fully convolutional network that makes heavy use of residual skip connections. To avoid the negative gradient effects introduced by pooling, it abandons pooling entirely and implements downsampling through the stride of conv layers (the number of cells the feature map slides after one convolution; the default is 1, i.e. sliding one cell). In this network structure, convolutions with stride 2 are used for downsampling.

To improve the algorithm's accuracy on small objects, YOLOv3 adopts FPN-like upsampling and fusion, finally fusing three scales (the other two scales being 26x26 and 52x52) and performing detection on feature maps at multiple scales.

The three prediction branches are also fully convolutional. The number of kernels in the last convolution layer is 255, which targets the 80 classes of the COCO dataset: 3x(80+4+1) = 255, where 3 means each grid cell contains 3 bounding boxes, 4 is the box's four coordinate values, and 1 is the objectness score. The multi-scale image feature detection results come from these three prediction paths; the depths of y1, y2, and y3 are all 255 (i.e. 3x(5+80)), and their side lengths follow the ratio 13:26:52. YOLOv3 has each grid cell predict 3 bounding boxes, so each box needs five basic parameters (x, y, w, h, confidence) plus the probabilities of the 80 classes.
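The 255-channel head depth follows directly from the box layout described above:

```python
def yolov3_head_depth(boxes_per_cell=3, num_classes=80):
    """Depth of each YOLOv3 prediction head: per box, 4 coordinates,
    1 objectness score, and one probability per class."""
    return boxes_per_cell * (4 + 1 + num_classes)

assert yolov3_head_depth() == 255               # COCO: 3 x (80 + 4 + 1)
assert yolov3_head_depth(num_classes=20) == 75  # e.g. a hypothetical 20-class dataset
```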

The network performs detection three times, at 32x, 16x, and 8x downsampling respectively, so detecting on multi-scale feature maps is somewhat similar to SSD. The reason for using upsampling in the network is that deeper layers yield better feature representations. For example, for detection at 16x downsampling, directly using the features from the fourth downsampling step would mean detecting on shallow features, which generally performs poorly. The 32x-downsampled features would be preferable, but their spatial size is too small, so YOLOv3 applies 2x upsampling to double the size of the 32x-downsampled feature map, bringing it to the 16x-downsampled resolution. Likewise, for the 8x scale, the 16x-downsampled features are upsampled by a factor of 2, so that deep features can be used for detection.

Finally, YOLOv3 brings the deep features up via upsampling so that their spatial dimensions match those of the feature layer to be fused (though the channel counts differ). As shown in Figure 1, layer 85 upsamples the 13x13x256 features to 26x26x256 and concatenates them with the features from layer 61 to obtain 26x26x768. To reach 255 channels, a further series of 3x3 and 1x1 convolutions is applied, which increases nonlinearity and generalization ability (improving accuracy) while reducing parameters (improving real-time performance). The 52x52x255 features are produced by a similar process.
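The upsample-and-concatenate fusion described above can be sketched in PyTorch; the 256- and 512-channel tensors stand in for the layer-85 and layer-61 features of Figure 1 (the 512-channel count for layer 61 is inferred from 768 = 256 + 512):

```python
import torch
import torch.nn.functional as F

deep = torch.randn(1, 256, 13, 13)  # deep features at 32x downsampling (layer 85)
skip = torch.randn(1, 512, 26, 26)  # shallower features from layer 61

up = F.interpolate(deep, scale_factor=2, mode="nearest")  # 2x upsampling -> 26x26x256
fused = torch.cat([up, skip], dim=1)                      # channel concatenation
print(fused.shape)  # torch.Size([1, 768, 26, 26])
```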

In summary, for an input image, YOLOv3 maps it to output tensors at three scales, representing the probability that various objects are present at each position in the image. Starting from the feature map produced by Darknet-53, the first feature map is obtained through six DBL blocks and a final convolutional layer, and the first prediction is made on it. On the Y1 branch, the output of the third-from-last convolutional layer passes through one DBL block and one upsampling step; the upsampled features are concatenated with the convolutional feature tensor output by the second Res8 block, and the second feature map is obtained through six DBL blocks and a final convolutional layer, on which the second prediction is made. On the Y2 branch, the output of the third-from-last convolutional layer likewise passes through one DBL block and one upsampling step; the upsampled features are concatenated with the convolutional feature tensor output by the first Res8 block, and the third feature map is obtained through six DBL blocks and a final convolutional layer, on which the third prediction is made. The feature maps at these three scales thus serve as the image feature detection results for the input image.

The inventors observed that the existing YOLOv3 model extracts features mainly through the RES and DBL structures in Figure 1. A RES operation consists chiefly of a 3x3 convolution with stride 2 followed by n ResUnits, and each ResUnit in turn consists of a 1x1 convolution and a 3x3 convolution. The large number of 3x3 convolutions causes the model's parameter count to grow sharply, and given that remote sensing images themselves have very high resolution, the existing YOLOv3 model cannot balance detection speed and accuracy. Moreover, when detection targets of different scales appear in the same task, the network struggles to judge size variations of larger targets accurately.

Therefore, a multi-scale object detection method for high-resolution images such as remote sensing imagery is needed that balances detection speed and accuracy.

Fig. 2 shows a flowchart of the image detection method provided by an embodiment of the present invention; the method is executed by a computer processing device. The computer processing device may include a mobile phone, a laptop computer, and the like. As shown in Figure 2, the method includes the following steps:

Step 10: Input the image to be detected into an object detection model to obtain the object detection result corresponding to the image to be detected. The object detection model includes an improved YOLOv3 model, and the improved YOLOv3 model includes a high-dimensional convolution module; the high-dimensional convolution module is used to perform depthwise convolution on the image to be detected in a high-dimensional space to obtain the image features of the image to be detected.

In one embodiment of the present invention, note that the purpose of convolution is to extract features, and the depth of the output feature map equals the number of convolution kernels, which equals the number of channels. Four stacked 3x3 convolutions are needed to cover the receptive field of a single 9x9 convolution, yet they require fewer parameters and produce richer features. For an image of fixed size, a 9x9 kernel extracts features over a 9x9 window; assuming a stride of 9 fits the image exactly, the number of extracted features is fixed. Hence, the smaller the kernel, the more numerous and finer the extracted features. As noted above, the existing YOLOv3 model uses a large number of 3x3 convolutions for feature extraction, which inflates the model's parameter count and slows detection on high-resolution images such as remote sensing imagery. The embodiment of the present invention therefore replaces the 3x3 convolutions in the YOLOv3 model with a high-dimensional convolution module that performs depthwise convolution in a high-dimensional space. This extracts information-rich feature representations while reducing the model's parameter count; by performing identity mapping and spatial transformation in a higher dimension, the high-dimensional convolution module lowers the risk of information loss and gradient confusion caused by the inverted residual structure of conventional object detection networks.

Specifically, the high-dimensional convolution module may be sandglass-shaped: depthwise convolution structures are placed at the beginning and end of the module, so that convolution is applied to the high-dimensional input and output data and image features are extracted in a high-dimensional space. Correspondingly, one or more dimension-adjustment structures may be placed between the beginning and the end, so that the dimensionality of the data fed into the depthwise convolution structure at the end is adjusted up to the high-dimensional space. The depthwise convolution structure may be a convolution kernel of a preset size, such as a 3x3 kernel.

Therefore, in yet another embodiment of the present invention, the structure of the high-dimensional convolution module may refer to Figure 3.

As shown in Figure 3, the high-dimensional convolution module includes a first depthwise convolution layer, a dimension-adjustment convolution layer, and a second depthwise convolution layer connected in sequence. The first and second depthwise convolution layers perform depthwise convolution on their input data; the dimension-adjustment convolution layer adjusts the dimensionality of its input data so that the data fed into the second depthwise convolution layer is high-dimensional.

Here, the first depthwise convolution layer may include a convolution kernel of a preset size, such as a 3x3 kernel, i.e., the Dwise 3x3 (depthwise-separable 3x3 convolution) in Figure 3. The second depthwise convolution layer may have the same structure as the first. The dimension-adjustment convolution layer may include two 1x1 convolution layers connected in sequence, i.e., Conv 1x1 in Figure 3. A 1x1 convolution adjusts the number of channels by linearly combining the pixels across channels and then applying a nonlinearity, and can therefore perform both dimension expansion and dimension reduction. The 1x1 convolution layer connected to the first depthwise convolution layer performs a reduction on the data, and the 1x1 convolution layer connected to the second depthwise convolution layer performs an expansion. Optionally, the input of the high-dimensional convolution module may be connected to the activation function in the existing YOLOv3 model.
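A minimal PyTorch sketch of the sandglass-shaped module described above (depthwise 3x3 → 1x1 reduction → 1x1 expansion → depthwise 3x3, with the identity mapping kept in the high-dimensional space). The channel count and reduction ratio are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class SandglassBlock(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        # depthwise 3x3 on the high-dimensional input
        self.dw1 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # 1x1 reduction to the low-dimensional bottleneck
        self.reduce = nn.Conv2d(channels, mid, 1)
        # 1x1 expansion back to the high-dimensional space
        self.expand = nn.Conv2d(mid, channels, 1)
        # depthwise 3x3 on the high-dimensional output
        self.dw2 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x):
        y = self.dw2(self.expand(self.reduce(self.dw1(x))))
        return x + y  # identity mapping performed in high dimension

x = torch.randn(1, 64, 52, 52)
y = SandglassBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 52, 52])
```

Because both depthwise convolutions act on the full channel count while only the 1x1 layers cross channels, the parameter count stays far below that of a dense 3x3 convolution at the same width.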

Further, the YOLOv3 model is used to extract multi-scale object detection results from the input image, and the existing YOLOv3 model performs multi-scale feature extraction through the aforementioned RES module, which in turn includes the DBL module. The DBL module uses multiple 3x3 convolution kernels for image feature extraction and thus suffers from excessive parameters and poor processing speed. Therefore, in yet another embodiment of the present invention, the feature extraction in the RES module of the existing YOLOv3 model may be implemented through the high-dimensional convolution module of the foregoing embodiments, so that identity mapping and spatial transformation are performed in a higher dimension, lowering the risk of information loss and gradient confusion caused by the inverted residual structure of conventional object detection networks and achieving a balance between the model's processing speed and accuracy.

Therefore, in yet another embodiment of the present invention, the improved YOLOv3 model includes a multi-scale convolution module, which convolves the input data with convolution kernels at multiple scales to obtain the image features at multiple scales; the convolution kernels include the high-dimensional convolution module.

Specifically, the multi-scale convolution module may include sequentially connected convolution kernels of multiple sizes and depths, arranged in a pyramid structure: from the top of the pyramid downward the kernel size decreases, while along the channel dimension the number of channels per kernel increases. The feature maps produced by the individual kernels of the multi-scale convolution module are finally concatenated. When concatenating the feature maps output by kernels of different sizes, grouped convolution can be used so that all branches produce the same number of output channels. Further, to improve the network's processing and detection speed while obtaining multi-scale feature maps, the high-dimensional convolution module of the foregoing method embodiments can be used to extract features in every convolution layer and grouped convolution within the multi-scale convolution module.

Therefore, in yet another embodiment of the present invention, referring to Figure 4, the multi-scale convolution module includes multiple layers of sequentially connected convolution kernels; the kernel sizes decrease successively while the kernel depths increase successively, and the multiple kernels use grouped convolution so that each kernel outputs image features with the same number of channels.

As shown in Figure 4, the feature maps input to the multi-scale convolution module (the Input Feature Maps in Figure 4) pass through convolution kernels of different sizes (the Pyramidal Convolution Kernels in Figure 4, comprising Level 1-n PyConv), and the resulting feature maps are then concatenated along the channel dimension to form the output (the Output Feature Maps in Figure 4). As the kernel size increases, the kernel depth decreases. To allow kernels of different depths, grouped convolution is used, yielding feature maps with the same number of channels. To further improve feature extraction performance, the aforementioned high-dimensional convolution module (the sandglass block) can be used for feature extraction when performing the convolutions within the pyramid structure.
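A minimal PyTorch sketch of the pyramidal convolution described above: several kernel sizes applied in parallel, with larger kernels using more groups (i.e., shallower per-group depth), and every branch contributing the same number of output channels before concatenation. The specific kernel sizes, group counts, and channel split are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PyConv(nn.Module):
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        # each level outputs out_ch // 4 channels; larger kernels use more groups
        self.levels = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch // 4, 3, padding=1, groups=1),
            nn.Conv2d(in_ch, out_ch // 4, 5, padding=2, groups=4),
            nn.Conv2d(in_ch, out_ch // 4, 7, padding=3, groups=8),
            nn.Conv2d(in_ch, out_ch // 4, 9, padding=4, groups=16),
        ])

    def forward(self, x):
        # concatenate the equal-width branch outputs along the channel dimension
        return torch.cat([conv(x) for conv in self.levels], dim=1)

x = torch.randn(1, 64, 26, 26)
out = PyConv()(x)
print(out.shape)  # torch.Size([1, 64, 26, 26])
```

Grouped convolution is what keeps the large-kernel branches cheap: with `groups=g`, each branch's parameter cost drops by a factor of g relative to a dense convolution of the same kernel size.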

Further, high-resolution images such as remote sensing imagery have high image precision, so relatively many features can be extracted. To filter the large number of extracted features and strengthen the target features that best characterize the input image, thereby improving the accuracy of feature extraction in the existing YOLOv3 model, in yet another embodiment of the present invention the improved YOLOv3 model also includes a feature enhancement module, which enhances the image features according to a channel attention mechanism.

Specifically, the feature enhancement module is connected after the aforementioned high-dimensional convolution module and enhances the image features that module extracts, suppressing unimportant redundant features and thereby strengthening the target features used in image detection. The feature enhancement module may use a channel attention mechanism to enhance the image features extracted by the high-dimensional convolution module: it first applies average pooling to obtain a feature vector along the channel dimension, then captures the dependencies between different channels through fully connected layers, and finally maps the resulting inter-channel dependencies to weights, which are multiplied with the original features to yield the final enhanced features.

In one embodiment of the present invention, the structure of the feature enhancement module is shown schematically in Figure 5. Referring to Figure 5, the feature enhancement module includes a sequentially connected pooling layer (Global Pooling), multiple fully connected (FC) layers, and a weighted calculation layer. The pooling layer extracts the image features of multiple channels; the fully connected layers determine the inter-channel dependencies from the image features of each channel; the weighted calculation layer performs a weighted calculation on the image features of the multiple channels according to the inter-channel dependencies to obtain the enhanced image features.

The number of fully connected layers may be two; the two connected FC layers determine the inter-channel dependencies from the image features of each channel. The weighted calculation layer may map the weights corresponding to the inter-channel dependencies into [0, 1] through a Sigmoid function and multiply these weights with the originally input image features to obtain the enhanced features.
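The pooling → two FC layers → Sigmoid reweighting pipeline described above can be sketched in PyTorch as a standard squeeze-and-excitation block; the channel count and reduction ratio are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling per channel
        self.fc = nn.Sequential(             # two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # map dependency weights into [0, 1]
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # multiply weights with the original features

x = torch.randn(2, 64, 13, 13)
out = SEBlock(64)(x)
print(out.shape)  # torch.Size([2, 64, 13, 13])
```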

In yet another embodiment of the present invention, for object detection scenarios on high-resolution images such as remote sensing imagery, when applying the YOLOv3 model, the high-dimensional convolution module, feature enhancement module, and multi-scale convolution module of the foregoing embodiments can be integrated into the existing YOLOv3 model to obtain the improved YOLOv3 model, further improving the model's processing performance and the accuracy of its multi-scale feature extraction.

Specifically, the image to be detected includes a high-resolution image, and the improved YOLOv3 model is obtained by improving the original YOLOv3 model. The original YOLOv3 model includes an original DBL module and an original RES module: the original DBL module includes a sequentially connected original convolution layer, normalization layer, and activation function layer, and the original RES module includes a sequentially connected zero-padding layer, the original DBL module, and an original residual structure. In the improved YOLOv3 model, the original convolution layer is replaced with the high-dimensional convolution module, a feature enhancement module is added after the activation function layer, and a multi-scale convolution module is added after the original residual structure. The high-dimensional convolution module performs depthwise convolution on the input data in a high-dimensional space; the feature enhancement module enhances the image features according to a channel attention mechanism; the multi-scale convolution module convolves the input data with convolution kernels at multiple scales, the kernels including the high-dimensional convolution module.

As shown in Figure 1, in the original YOLOv3 model the original convolution layer includes the conv layer, the normalization layer includes the BN layer, the activation function layer includes the Leaky ReLU layer, the zero-padding layer includes the zero padding layer, and the original residual structure includes ResUnit*n.

For the original YOLOv3 model shown in Figure 1, in one embodiment of the present invention, as shown in Figure 6, the original DBL module is improved first: the conv layer in the DBL module is replaced with the high-dimensional convolution module, i.e., the sandglass block in the figure, and at the same time a feature enhancement module, i.e., the SE (Squeeze-and-Excitation) module in the figure, is added after the activation function layer, yielding the improved DBL module (denoted NDBL).

Furthermore, for the improvement of the original RES module, as shown in Figure 7, the original DBL module in the original RES module is first replaced with the improved DBL module, and then the multi-scale convolution module, i.e., the hourglass-pyconv in the figure, is added after the original residual structure, yielding the improved RES module (denoted NRES).

Through the above improvements to the original DBL module and original RES module of the existing YOLOv3, the improved YOLOv3 model is obtained. Its overall network structure is the same as that of YOLOv3, but features are extracted using NDBL and NRES.

The embodiment of the present invention tested the improved YOLOv3 model on the NWPU VHR-10 dataset. A comparison of the detection results of the improved YOLOv3 model of the embodiment of the present invention with those of the existing YOLOv3 model can be found in Figure 8.

In one embodiment of the present invention, the training process of the improved YOLOv3 model may be as follows:

Step 1: Perform data augmentation on the remote sensing images in the dataset.

The data augmentation process includes cropping a series of 1024x1024 patches from the remote sensing images in the dataset with a stride of 512.
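The overlapping-tile cropping can be sketched as follows; the treatment of image borders (here, only full tiles whose origin fits inside the image) is an assumption, since the patent does not specify how partial edge tiles are handled:

```python
def tile_origins(width, height, tile=1024, stride=512):
    """Top-left corners of the 1024x1024 patches cropped with stride 512."""
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    return [(x, y) for y in ys for x in xs]

# a 2048x2048 remote sensing image yields a 3x3 grid of overlapping patches
print(len(tile_origins(2048, 2048)))  # 9
```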

In addition, for target classes with few samples, the image size is adjusted at a ratio of 1:0.5, and random flipping and rotation are applied, to address the class imbalance in the dataset.

Step 2: Label the augmented data to obtain training samples.

Specifically, this includes determining the prior boxes through the K-means clustering algorithm, setting 9 clusters distributed evenly over the 3 scales, and obtaining the prior boxes on NWPU VHR-10.
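A minimal sketch of clustering labeled box sizes into 9 anchor priors with plain K-means on (width, height) pairs. The synthetic box list, deterministic initialization, and Euclidean distance are illustrative assumptions (anchor clustering for YOLO is often done with an IoU-based distance instead):

```python
def kmeans(points, k, iters=20):
    """Plain K-means on (w, h) pairs; returns k cluster centers."""
    centers = points[:k]  # deterministic init for the sketch
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:  # assign each box to its nearest center
            i = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2
                                + (p[1] - centers[j][1]) ** 2)
            groups[i].append(p)
        centers = [  # recompute each center as the mean of its group
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

# synthetic (width, height) pairs standing in for labeled boxes
boxes = [(w, h) for w in (10, 30, 90) for h in (12, 34, 96)]
anchors = sorted(kmeans(boxes, 9), key=lambda c: c[0] * c[1])
print(len(anchors))  # 9 anchors, assigned 3 per detection scale by area
```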

Step 3: Train the improved YOLOv3 model on the training samples.

The processed images are input into the improved YOLOv3 model; after the third and fourth NRES blocks, two branches split off and extract features to output two additional feature maps, finally yielding three target feature maps.

The image detection method provided by the embodiment of the present invention obtains the object detection result corresponding to the image to be detected by inputting the image to be detected into the object detection model. The object detection model includes an improved YOLOv3 model, which includes a high-dimensional convolution module used to perform depthwise convolution on the image to be detected in a high-dimensional space to obtain its image features. This differs from the existing YOLOv3 model, whose large number of 3x3 convolution operations for feature extraction makes its parameter count excessive and its detection speed poor. The embodiment of the present invention places the depthwise convolutions at the beginning and end of the whole convolution structure, so that the depthwise convolution operations all act in a high-dimensional space. Information-rich feature representations can thus be extracted, and identity mapping and spatial transformation are performed in a higher dimension, lowering the risk of information loss and gradient confusion caused by the inverted residual structure of conventional mobile networks. This effectively improves the model's processing speed and achieves a balance between detection accuracy and detection speed.

Fig. 9 shows a schematic structural diagram of the image detection apparatus provided by an embodiment of the present invention. As shown in Figure 9, the apparatus 20 includes a detection module 201. The detection module 201 is used to input the image to be detected into the object detection model to obtain the object detection result corresponding to the image to be detected; the object detection model includes an improved YOLOv3 model; the improved YOLOv3 model includes a high-dimensional convolution module, used to perform depthwise convolution on the image to be detected in a high-dimensional space to obtain the image features of the image to be detected.

The operation of the image detection apparatus provided by the embodiment of the present invention is substantially the same as that of the foregoing method embodiment and will not be repeated here.

The image detection apparatus provided by the embodiment of the present invention obtains the object detection result corresponding to the image to be detected by inputting the image to be detected into the object detection model. The object detection model includes an improved YOLOv3 model, which includes a high-dimensional convolution module used to perform depthwise convolution on the image to be detected in a high-dimensional space to obtain its image features. This differs from the existing YOLOv3 model, whose large number of 3x3 convolution operations for feature extraction makes its parameter count excessive and its detection speed poor. The embodiment of the present invention places the depthwise convolutions at the beginning and end of the whole convolution structure, so that the depthwise convolution operations all act in a high-dimensional space. Information-rich feature representations can thus be extracted, and identity mapping and spatial transformation are performed in a higher dimension, lowering the risk of information loss and gradient confusion caused by the inverted residual structure of conventional mobile networks. This effectively improves the model's processing speed and achieves a balance between detection accuracy and detection speed.

Fig. 10 shows a schematic structural diagram of the image detection device provided by an embodiment of the present invention; the specific embodiments of the present invention do not limit the specific implementation of the image detection device.

As shown in Figure 10, the image detection device may include a processor 302, a communications interface 304, a memory 306, and a communication bus 308.

The processor 302, the communication interface 304, and the memory 306 communicate with one another through the communication bus 308. The communication interface 304 is used to communicate with network elements of other devices such as clients or other servers. The processor 302 is configured to execute the program 310 and may specifically perform the relevant steps of the above image detection method embodiments.

Specifically, the program 310 may include program code comprising computer-executable instructions.

The processor 302 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the image detection device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs together with one or more ASICs.

The memory 306 is used to store the program 310. The memory 306 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.

Specifically, the program 310 may be invoked by the processor 302 to cause the image detection device to perform the following operations:

inputting an image to be detected into a target detection model to obtain a target detection result corresponding to the image to be detected, where the target detection model includes an improved YOLOv3 model, the improved YOLOv3 model includes a high-dimensional convolution module, and the high-dimensional convolution module is used to perform depthwise convolution on the image to be detected in a high-dimensional space to obtain image features of the image to be detected.

The operation process of the image detection device provided by the embodiment of the present invention is substantially the same as that of the foregoing method embodiments and will not be repeated here.

The image detection device provided by the embodiment of the present invention obtains the target detection result corresponding to an image to be detected by inputting the image into a target detection model. The target detection model includes an improved YOLOv3 model, which contains a high-dimensional convolution module used to perform depthwise convolution on the image to be detected in a high-dimensional space to obtain the image features of the image. This differs from the existing YOLOv3 model, in which a large number of 3x3 convolution operations are used for feature extraction, making the model's parameter count excessive and its detection speed poor. The embodiment of the present invention places depthwise convolutions at the beginning and end of the whole convolution structure, so that all depthwise convolution operations act in a high-dimensional space. Information-rich feature representations can thereby be extracted, and identity mapping and spatial transformation are performed in higher dimensions, reducing the risk of information loss and gradient confusion caused by the inverted residual structure of traditional mobile networks. The processing speed of the model can thus be effectively improved, achieving a balance between detection accuracy and detection speed.

An embodiment of the present invention provides a computer-readable storage medium storing at least one executable instruction which, when run on an image detection device, causes the image detection device to execute the image detection method in any of the above method embodiments.

Specifically, the executable instruction may be used to cause the image detection device to perform the following operations:

inputting an image to be detected into a target detection model to obtain a target detection result corresponding to the image to be detected, where the target detection model includes an improved YOLOv3 model, the improved YOLOv3 model includes a high-dimensional convolution module, and the high-dimensional convolution module is used to perform depthwise convolution on the image to be detected in a high-dimensional space to obtain image features of the image to be detected.

The operation process of the executable instructions stored in the computer-readable storage medium provided by the embodiment of the present invention is substantially the same as that of the foregoing method embodiments and will not be repeated here.

Through the executable instructions stored in the computer-readable storage medium provided by the embodiment of the present invention, an image to be detected is input into a target detection model to obtain the corresponding target detection result. The target detection model includes an improved YOLOv3 model, which contains a high-dimensional convolution module used to perform depthwise convolution on the image to be detected in a high-dimensional space to obtain the image features of the image. This differs from the existing YOLOv3 model, in which a large number of 3x3 convolution operations are used for feature extraction, making the model's parameter count excessive and its detection speed poor. The embodiment of the present invention places depthwise convolutions at the beginning and end of the whole convolution structure, so that all depthwise convolution operations act in a high-dimensional space. Information-rich feature representations can thereby be extracted, and identity mapping and spatial transformation are performed in higher dimensions, reducing the risk of information loss and gradient confusion caused by the inverted residual structure of traditional mobile networks. The processing speed of the model can thus be effectively improved, achieving a balance between detection accuracy and detection speed.

An embodiment of the present invention provides an image detection apparatus configured to execute the above image detection method.

An embodiment of the present invention provides a computer program which can be invoked by a processor to cause an image detection device to execute the image detection method in any of the above method embodiments.

An embodiment of the present invention provides a computer program product. The computer program product includes a computer program stored on a computer-readable storage medium, and the computer program includes program instructions which, when run on a computer, cause the computer to execute the image detection method in any of the above method embodiments.

The algorithms or displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used together with the teachings herein, and the structure required to construct such systems is apparent from the above description. Furthermore, the embodiments of the present invention are not directed to any particular programming language. It should be understood that the content of the present invention described herein can be implemented in various programming languages, and the above description of a specific language is provided to disclose the best mode of the present invention.

In the description provided herein, numerous specific details are set forth. However, it is understood that the embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments are sometimes grouped together in a single embodiment, figure, or description thereof in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim.

Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in the embodiments can be combined into one module, unit, or component, and can also be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any order; these words may be interpreted as names. Unless otherwise specified, the steps in the above embodiments should not be construed as limiting the execution order.

Claims (10)

1. An image detection method, the method comprising:
inputting an image to be detected into a target detection model to obtain a target detection result corresponding to the image to be detected; wherein the target detection model includes an improved YOLOv3 model; the improved YOLOv3 model comprises a high-dimensional convolution module; and the high-dimensional convolution module is used for performing depthwise convolution on the image to be detected in a high-dimensional space to obtain image features of the image to be detected.
2. The method of claim 1, wherein the high-dimensional convolution module comprises a first depthwise convolution layer, a dimension-adjustment convolution layer, and a second depthwise convolution layer connected in sequence; the first depthwise convolution layer and the second depthwise convolution layer are used for performing depthwise convolution on input data; and the dimension-adjustment convolution layer is used for adjusting the dimension of input data so that the data input into the second depthwise convolution layer are high-dimensional data.
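The three-layer structure of claim 2 (depthwise convolution, a dimension-adjusting 1x1 convolution, then a second depthwise convolution) can be sketched in NumPy as follows. This is a minimal structural sketch under assumed shapes: normalization, activations, strides, and the exact layer widths of the claimed module are not specified here and are omitted.

```python
import numpy as np

def depthwise_conv(x, w):
    # x: (C, H, W); w: (C, k, k) -- one filter per channel ("depthwise").
    C, H, W = x.shape
    k = w.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))  # "same" zero padding
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * w[c])
    return out

def pointwise_conv(x, w):
    # w: (C_out, C_in) -- a 1x1 convolution that adjusts the channel dimension.
    return np.tensordot(w, x, axes=([1], [0]))

def sandglass_block(x, w_dw1, w_pw, w_dw2):
    # Depthwise convolutions sit at both ends of the structure; the 1x1
    # convolution in the middle raises the channel dimension so that the
    # second depthwise convolution acts on high-dimensional data.
    h = depthwise_conv(x, w_dw1)
    h = pointwise_conv(h, w_pw)
    return depthwise_conv(h, w_dw2)
```

For example, a 4-channel 5x5 input passed through a 1x1 convolution widening it to 8 channels yields an 8-channel 5x5 output from the final depthwise layer.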
3. The method of claim 1, wherein the improved YOLOv3 model comprises a multi-scale convolution module; the multi-scale convolution module is used for convolving input data under multi-scale convolution kernels to obtain multi-scale image features; and the convolution kernels comprise the high-dimensional convolution module.
4. The method according to claim 3, wherein the multi-scale convolution module comprises a plurality of sequentially connected convolution kernels; the sizes of the convolution kernels decrease in sequence while the depths of the convolution kernels increase in sequence; and the plurality of convolution kernels perform grouped convolution such that each convolution kernel outputs image features with the same number of channels.
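The grouping in claim 4 — several kernels of different sizes whose outputs each carry the same number of channels — can be illustrated with a small helper. The number of branches and the kernel sizes below are illustrative assumptions, not values stated in the claim:

```python
# Evenly split the total output channels across multi-scale branches so
# that every kernel size contributes the same number of channels
# (branch count and kernel sizes are illustrative assumptions).

def split_channels(total_out, kernel_sizes):
    assert total_out % len(kernel_sizes) == 0, "channels must divide evenly"
    per_branch = total_out // len(kernel_sizes)
    return {k: per_branch for k in kernel_sizes}

branches = split_channels(256, kernel_sizes=[7, 5, 3, 1])
print(branches)  # each branch outputs 64 channels
```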
5. The method of claim 1, wherein the improved YOLOv3 model further comprises a feature enhancement module; the feature enhancement module is used for performing feature enhancement on the image features according to a channel attention mechanism.
6. The method of claim 5, wherein the feature enhancement module comprises a pooling layer, a plurality of fully connected layers, and a weight calculation layer connected in sequence; the pooling layer is used for extracting image features of a plurality of channels; the fully connected layers are used for determining inter-channel dependence according to the image features of each channel; and the weight calculation layer is used for performing weighted calculation on the image features of the plurality of channels according to the inter-channel dependence to obtain enhanced image features.
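The pooling / fully-connected / weighting pipeline in claim 6 resembles the well-known squeeze-and-excitation pattern of channel attention. A minimal sketch, assuming global average pooling, a ReLU between the two fully connected layers, and a sigmoid producing the channel weights — these activation choices are assumptions, as the claim does not specify them:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    # x: (C, H, W) feature map; w1: (C//r, C); w2: (C, C//r).
    # Squeeze: global average pooling per channel.
    s = x.mean(axis=(1, 2))                # (C,)
    # Excitation: two fully connected layers model inter-channel dependence.
    h = np.maximum(w1 @ s, 0.0)            # ReLU (assumed), (C//r,)
    weights = sigmoid(w2 @ h)              # channel weights in (0, 1), (C,)
    # Reweight: scale each channel by its computed importance.
    return x * weights[:, None, None]
```

With zero weight matrices the sigmoid yields 0.5 for every channel, so each channel of the input is simply halved — a convenient sanity check for the wiring.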
7. The method of claim 1, wherein the image to be detected comprises a high-resolution image; the improved YOLOv3 model is obtained by improving an original YOLOv3 model; the original YOLOv3 model comprises an original DBL module and an original RES module, the original DBL module comprises an original convolution layer, a normalization layer, and an activation function layer connected in sequence, and the original RES module comprises a zero-padding layer, the original DBL module, and an original residual structure connected in sequence; in the improved YOLOv3 model, the original convolution layer is replaced by the high-dimensional convolution module, a feature enhancement module is added after the activation function layer, and a multi-scale convolution module is added after the original residual structure; the high-dimensional convolution module is used for performing depthwise convolution on input data in a high-dimensional space; the feature enhancement module is used for performing feature enhancement on the image features according to a channel attention mechanism; the multi-scale convolution module is used for convolving input data under multi-scale convolution kernels; and the convolution kernels comprise the high-dimensional convolution module.
8. An image detection apparatus, the apparatus comprising:
the detection module is used for inputting an image to be detected into a target detection model to obtain a target detection result corresponding to the image to be detected; the target detection model includes an improved YOLOv3 model; the improved YOLOv3 model comprises a high-dimensional convolution module; and the high-dimensional convolution module is used for performing depthwise convolution on the image to be detected in a high-dimensional space to obtain image features of the image to be detected.
9. An image detection device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the image detection method according to any one of claims 1-7.
10. A computer readable storage medium having stored therein at least one executable instruction that, when executed on an image detection device, causes the image detection device to perform the operations of the image detection method of any one of claims 1-7.
CN202211510080.6A 2022-11-29 2022-11-29 Image detection method, device, equipment and computer storage medium Active CN116310605B (en)


Publications (2)

Publication Number Publication Date
CN116310605A true CN116310605A (en) 2023-06-23
CN116310605B CN116310605B (en) 2025-11-07




