WO2025050369A1 - Method for detecting object in image and related device
- Publication number: WO2025050369A1 (PCT/CN2023/117608)
- Authority: WIPO (PCT)
- Legal status: Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
Definitions
- the present disclosure relates to the technical field of image processing, and more particularly, to a method for detecting an object in an image, a computer-readable storage medium, and an electronic device.
- Image processing technology uses a camera to obtain images of industrial products to be detected and transmits the images to an image processing system, which detects features of the target, such as defects in industrial products (scratches, holes, dents), based on the distribution, brightness, color, and other pixel information in the image.
- Embodiments of the present disclosure provide a method, a computer-readable storage medium, and an electronic device for detecting an object in an image.
- a method for detecting an object in an image comprises: acquiring an image to be detected; based on the image, generating a plurality of feature maps for detecting objects of different sizes; fusing the plurality of feature maps to obtain a fused feature map; and detecting the object based on the fused feature map.
- generating the first feature map based on the image includes: downsampling the image to obtain first image data; and performing a first convolution process on the first image data to generate the first feature map.
- the downsampling factor is 6, and in the first convolution process, the size of the convolution kernel is 3 and the step size is 2.
- in the second convolution process, the size of the convolution kernel is 3 and the step size is 3, and in the third convolution process, the size of the convolution kernel is 4 and the step size is 4.
- the first pooling process includes a first maximum pooling process
- the second pooling process includes a second maximum pooling process
- in the fourth convolution process, the size of the convolution kernel is 3 and the step size is 2; in the fifth convolution process, the size of the convolution kernel is 3 and the step size is 1; in the first pooling process, the size of the pooling kernel is 3 and the step size is 2; in the second pooling process, the size of the pooling kernel is 5 and the step size is 2; and in the sixth convolution process, the size of the convolution kernel is 3 and the step size is 3.
- a trained neural network is used to detect the object based on the fused feature map.
- the trained neural network is a Faster-RCNN network, wherein the Faster-RCNN network includes a backbone network, a region proposal network, and a regression classification network.
- the SimOTA algorithm is used to train the region proposal network.
- using the SimOTA algorithm to train the region proposal network includes: taking feature points falling in or near the true box as candidate positive samples; calculating the cost matrix of the predicted boxes of the candidate positive samples and the true boxes; for each true box, ranking the Intersection over Union (IoU) values of the true box and the predicted boxes from large to small and selecting the first m predicted boxes; summing the IoU values corresponding to the first m predicted boxes to obtain n (rounded down to an integer); and ranking the costs from small to large, selecting the first n predicted boxes as positive samples, with the others as negative samples.
- when the SimOTA algorithm is used, if the size of the true box is smaller than a predetermined threshold, the size of the true box is adjusted to the predetermined threshold.
- the size of the cost matrix is M×N, and each element in the cost matrix is the value of the loss function of the true box and the predicted box.
- the loss function includes a cross-entropy loss for classification and an IoU loss for regression.
- the cross-entropy loss for classification is expressed as: $L_{cls}(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)$
- $N_{cls}$ is the number of selected prediction boxes
- $p_i$ is the probability that the prediction box is the true box, with $p_i^* = 1$ for a positive sample and $p_i^* = 0$ for a negative sample
- $L_{cls}(p_i, p_i^*)$ is expressed as: $L_{cls}(p_i, p_i^*) = -\left[p_i^* \log p_i + (1 - p_i^*)\log(1 - p_i)\right]$
- the IoU loss for regression is expressed as: $L_{reg}(\{t_i\}) = \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$
- $t_i$ is the offset predicted by the predicted box
- $t_i^*$ is the offset of the predicted box relative to the true box
- $L_{reg}(t_i, t_i^*)$ is expressed as: $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$
- $R$ is the smooth L1 function, which is expressed as: $R(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$
- a device for detecting an object in an image comprises: an acquisition module configured to acquire an image to be detected; a feature map generation module configured to generate multiple feature maps for detecting objects of different sizes based on the image; a fusion module configured to fuse the multiple feature maps to obtain a fused feature map; and an object detection module configured to detect the object based on the fused feature map.
- a computer-readable storage medium is provided.
- Computer program instructions are stored on the computer-readable storage medium, wherein when the computer program instructions are executed by a processor, the processor is caused to perform the method according to the first aspect of the present disclosure.
- an electronic device comprising: a processor; and a memory storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method according to the first aspect of the present disclosure.
- FIG1 shows a schematic flow chart of a method for detecting an object in an image according to an embodiment of the present disclosure
- FIG2 shows an exemplary stem network according to an embodiment of the present disclosure
- FIG3 shows a network architecture in which the method shown in FIG1 is implemented according to an embodiment of the present disclosure, the network architecture including the stem network shown in FIG2 and the Faster-RCNN network;
- FIG4 shows an example of a result detected by using a method according to an embodiment of the present disclosure
- FIG5 shows an example of a result detected by using a method according to an embodiment of the present disclosure
- FIG6 is a block diagram showing a schematic structure of an electronic device according to an embodiment of the present disclosure.
- FIG7 shows an exemplary structural block diagram of a device for detecting an object in an image according to an embodiment of the present disclosure.
- Faster-RCNN can perform target (also referred to herein as object) detection.
- a Faster-RCNN-based method can obtain a bounding box (bbox) that delimits the target or object.
- the image of the product to be detected may contain extremely small objects (for example, objects of 2 to 3 pixels) and larger objects (for example, objects of 300 to 8000 pixels).
- if Faster-RCNN is used directly for target detection, a large amount of GPU memory is required due to the high resolution of the image; therefore, target detection may not be feasible under existing hardware conditions.
- if Faster-RCNN is used for detection after downsampling the image, small objects may not be detected because information about them is lost.
- conventionally, an image pyramid structure is used to extract multi-scale features, and the image is processed using a sliding-window method; however, this method is complex and time-consuming.
- FIG1 shows a schematic flow chart of a method for detecting an object in an image according to an embodiment of the present disclosure.
- an image to be detected is acquired.
- the image may be an image of an industrial product to be detected, and the object may refer to defects present in the industrial product, such as burns, punctures, creases, etc.
- the image may be acquired in real time by an image acquisition device, or may be pre-stored in a storage device, which is not limited.
- the plurality of feature maps may include a first feature map, a second feature map, and a third feature map.
- the first feature map may be used to detect an object of a first size
- the second feature map may be used to detect an object of a second size
- the third feature map may be used to detect an object of a third size, wherein the first size is greater than the second size, and the second size is greater than the third size.
- an object of the first size may be a larger defective object in an industrial product
- an object of the second size may be a medium defective object
- an object of the third size may be a smaller defective object.
- for an image with a resolution of about 8000, the third size may be in the range of 1 to 32 pixels
- the second size may be in the range of 32 to 256 pixels
- the first size may be more than 256 pixels.
- for an image with a resolution of about 15000, the third size may be in the range of 1 to 64 pixels
- the second size may be in the range of 64 to 512 pixels
- the first size may be more than 512 pixels.
- the embodiment of the present disclosure divides the objects in the image into three sizes, and those skilled in the art can divide the objects into more or fewer sizes according to actual needs.
- the following examples take the acquired image to be a grayscale image; it should be understood that the image can also be a color image.
- FIG2 shows an exemplary stem network according to an embodiment of the present disclosure.
- the following will describe in detail how to generate the first, second, and third feature maps in conjunction with the three branches of the stem network of FIG2.
- the image data and feature maps mentioned below are represented in the form of tensor shapes [B, C, H, W], where B represents batch, C represents channel (for example, the channel of a grayscale image is 1, the channel of an RGB image is 3), H represents the height of the image, and W represents the width of the image.
- the first branch (labeled as "branch 1") is used to generate a first feature map.
- the acquired image is downsampled to obtain first image data.
- the downsampling process can use a bilinear interpolation function, the downsampling factor can be 6, and the tensor shape of the obtained first image data is [B, 1, H/6, W/6].
- the first convolution process is performed on the first image data [B, 1, H/6, W/6] to generate a first feature map.
- the size of the convolution kernel used in the first convolution process is 3 and the step size is 2.
- the tensor shape of the generated first feature map is [B, 16, H/12, W/12].
- the generated first feature map retains the image data of large-sized objects, but loses the image data of small-sized objects, and is mainly used to detect large-sized objects.
- the second branch (labeled as "branch 2") is used to generate a second feature map.
- a second convolution process is performed on the acquired image to obtain second image data.
- the size of the convolution kernel used in the second convolution process is 3, and the step size is 3.
- the tensor shape of the obtained second image data is [B, 8, H/3, W/3].
- a third convolution process is performed on the second image data [B, 8, H/3, W/3] to generate a second feature map.
- the size of the convolution kernel used in the third convolution process is 4, and the step size is 4.
- the tensor shape of the generated second feature map is [B, 16, H/12, W/12].
- the second feature map generated by the processing of the second branch retains the image data of medium-sized objects and is mainly used to detect medium-sized objects.
- the third branch (labeled as “branch 3”) is used to generate a third feature map.
- a fourth convolution process is performed on the acquired image to obtain third image data.
- the size of the convolution kernel used in the fourth convolution process is 3, and the step size is 2.
- the tensor shape of the obtained third image data is [B, 8, H/2, W/2].
- a fifth convolution process is performed on the third image data [B, 8, H/2, W/2] to obtain fourth image data.
- the size of the convolution kernel used in the fifth convolution process is 3, and the step size is 1.
- the tensor shape of the obtained fourth image data is [B, 16, H/2, W/2].
- a first pooling process is performed on the fourth image data [B, 16, H/2, W/2], for example, a first maximum pooling process, to obtain fifth image data.
- the size of the pooling kernel used in the first pooling process is 3, and the step size is 2.
- the tensor shape of the obtained fifth image data is [B, 16, H/4, W/4].
- a second pooling process is performed on the fourth image data [B, 16, H/2, W/2] to obtain the sixth image data.
- the size of the pooling kernel used in the second pooling process is 5, and the step size is 2.
- the tensor shape of the obtained sixth image data is [B, 16, H/4, W/4].
- the fifth image data [B, 16, H/4, W/4] and the sixth image data [B, 16, H/4, W/4] are spliced along the channel dimension to obtain the spliced image data [B, 32, H/4, W/4].
- a fully connected layer can be used to splice the fifth image data and the sixth image data.
- a sixth convolution process is performed on the spliced image data [B, 32, H/4, W/4] to generate a third feature map.
- the size of the convolution kernel used in the sixth convolution process is 3, and the step size is 3.
- the tensor shape of the generated third feature map is [B, 32, H/12, W/12].
- the third feature map generated by the processing of the third branch retains image data of small-sized objects and is mainly used for detecting small-sized objects.
- the above three branches may be executed in parallel, and the first pooling process and the second pooling process may be executed in parallel.
- each of the above-mentioned convolution processes may sequentially use a two-dimensional convolution layer, a normalization layer (for example, a batch normalization layer or a group normalization layer), and an activation function.
- the first, second, and third feature maps are fused.
- a fully connected layer may be used to fuse the first, second, and third feature maps to obtain a fused feature map having a tensor shape of [B, 64, H/12, W/12].
- the fusion process may fuse the image data on a channel basis.
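- to make the three-branch structure concrete, the following is a minimal PyTorch sketch of the stem network and fusion step described above; it is an illustration under stated assumptions (batch normalization with ReLU for each convolution process, zero padding chosen so the shapes work out when H and W are divisible by 12, and channel concatenation for both the splice and the fusion), not the authoritative implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """One 'convolution process': Conv2d -> normalization -> activation."""
    def __init__(self, in_ch, out_ch, k, s, p=None):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s,
                              padding=k // 2 if p is None else p)
        self.norm = nn.BatchNorm2d(out_ch)  # a group-norm layer would also fit the text
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class Stem(nn.Module):
    """Three-branch stem for a grayscale image [B, 1, H, W] (H, W divisible by 12)."""
    def __init__(self):
        super().__init__()
        self.b1_conv = ConvBlock(1, 16, k=3, s=2)         # first convolution
        self.b2_conv1 = ConvBlock(1, 8, k=3, s=3)         # second convolution
        self.b2_conv2 = ConvBlock(8, 16, k=4, s=4, p=0)   # third convolution
        self.b3_conv1 = ConvBlock(1, 8, k=3, s=2)         # fourth convolution
        self.b3_conv2 = ConvBlock(8, 16, k=3, s=1)        # fifth convolution
        self.b3_pool1 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)  # first pooling
        self.b3_pool2 = nn.MaxPool2d(kernel_size=5, stride=2, padding=2)  # second pooling
        self.b3_conv3 = ConvBlock(32, 32, k=3, s=3)       # sixth convolution

    def forward(self, x):
        # branch 1: 6x bilinear downsampling, then the first convolution -> [B, 16, H/12, W/12]
        x1 = F.interpolate(x, scale_factor=1 / 6, mode="bilinear", align_corners=False)
        f1 = self.b1_conv(x1)
        # branch 2: second then third convolution -> [B, 16, H/12, W/12]
        f2 = self.b2_conv2(self.b2_conv1(x))
        # branch 3: two convolutions, two parallel max poolings, splice, sixth convolution
        x3 = self.b3_conv2(self.b3_conv1(x))              # [B, 16, H/2, W/2]
        spliced = torch.cat([self.b3_pool1(x3), self.b3_pool2(x3)], dim=1)  # [B, 32, H/4, W/4]
        f3 = self.b3_conv3(spliced)                       # [B, 32, H/12, W/12]
        # fuse the three feature maps on the channel dimension -> [B, 64, H/12, W/12]
        return torch.cat([f1, f2, f3], dim=1)
```

- for example, under these assumptions, `Stem()(torch.randn(1, 1, 7200, 4800))` yields a fused feature map of shape [1, 64, 600, 400].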
- a trained neural network may be used to detect objects.
- the trained neural network may be a Faster-RCNN network, which will be described in detail below.
- FIG3 shows a network architecture in which the method shown in FIG1 is implemented according to an embodiment of the present disclosure.
- the network architecture includes the stem network shown in FIG2 and the Faster-RCNN network.
- the Faster-RCNN network may include a backbone network, a region proposal network, and a regression classification network. It should be noted that although FIG3 shows the stem network and the Faster-RCNN network separately, it can be understood that the stem network can also be embedded between appropriate layers of the backbone network of the Faster-RCNN network.
- the backbone network may be a residual network ResNet50.
- ResNet50 receives the fused feature map output by the stem network.
- the fused feature map is processed via the five layers of ResNet50.
- the tensor shape of the image data output by the 0th layer is [B, 256, H/24, W/24]
- the tensor shape of the image data output by the 1st layer is [B, 512, H/48, W/48]
- the tensor shape of the image data output by the 2nd layer is [B, 1024, H/96, W/96]
- the tensor shape of the image data output by the 3rd layer is [B, 2048, H/192, W/192]
- the tensor shape of the image data output by the 4th layer is [B, 2048, H/384, W/384].
- the FPN process is as follows: the layer-4 features pass through a convolution layer with kernel size 1 and 256 output channels, giving a tensor of shape [B, 256, H/384, W/384]; the layer-3 features pass through a convolution layer with kernel size 1 and 256 output channels, giving [B, 256, H/192, W/192]; the layer-4 features are upsampled by a factor of 2 and added to the layer-3 features to form the layer-3 candidate features, which pass through a 3x3 convolution layer to become the new layer-3 features; the layer-2 features pass through a convolution layer with kernel size 1 and 256 output channels, giving [B, 256, H/96, W/96]; the layer-3 candidate features are upsampled by a factor of 2 and added to the layer-2 features to form the layer-2 candidate features, which pass through a 3x3 convolution layer to become the new layer-2 features; the layer-1 features pass through a convolution layer with kernel size 1 and 256 output channels, giving [B, 256, H/48, W/48]; the layer-2 candidate features are upsampled by a factor of 2 and added to the layer-1 features to form the layer-1 candidate features, which pass through a 3x3 convolution layer to become the new layer-1 features; the layer-0 features pass through a convolution layer with kernel size 1 and 256 output channels, giving [B, 256, H/24, W/24]; the layer-1 candidate features are upsampled by a factor of 2 and added to the layer-0 features to form the layer-0 candidate features, which pass through a 3x3 convolution layer to become the new layer-0 features; finally, the new multi-layer feature tensor data is provided to the region proposal network (RPN) and the regression classification network (RCNN) for subsequent processing.
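- the top-down process above can be sketched as follows; this is a hedged PyTorch illustration in which the module and variable names are invented for clarity and nearest-neighbor upsampling is assumed (the text only says the features are upsampled by a factor of 2):

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Sketch of the FPN step described above; channel counts follow the text."""
    def __init__(self, in_channels=(256, 512, 1024, 2048, 2048), out_ch=256):
        super().__init__()
        # kernel-size-1 convolutions that bring every layer to 256 channels
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, kernel_size=1) for c in in_channels)
        # 3x3 convolutions that turn candidate features into the new layer features
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        # feats: [layer0, layer1, layer2, layer3, layer4], highest resolution first
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]
        outs = [None] * len(laterals)
        candidate = laterals[-1]          # layer 4 is used as-is at the top
        outs[-1] = candidate
        for i in range(len(laterals) - 2, -1, -1):
            # upsample by 2 and add to the lower layer to form the candidate features
            up = F.interpolate(candidate, size=laterals[i].shape[-2:], mode="nearest")
            candidate = laterals[i] + up
            outs[i] = self.smooth[i](candidate)  # 3x3 convolution -> new layer-i features
        return outs
```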
- multiple layers of feature tensor data are input into a region proposal network.
- in the region proposal network, each layer of feature tensor data is first processed by its own multi-layer convolutional network to generate a proposal feature tensor, which is then processed using two independent branches.
- the first branch uses convolution to generate a proposal box;
- the other branch uses a convolutional network for processing and uses sigmoid for activation to classify the corresponding proposal box as foreground or background.
- the corresponding foreground proposal boxes (also referred to as preliminary bounding boxes in this application) are used as the input of the next stage (that is, the RCNN stage).
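- a minimal sketch of this two-branch proposal head is shown below; the shared 3x3 convolution, the 256-channel input, and the anchor count are illustrative assumptions (the text states that each feature level may be processed by its own multi-layer convolutional network):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Two-branch proposal head: box regression plus foreground/background score."""
    def __init__(self, in_ch=256, num_anchors=3):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.bbox = nn.Conv2d(in_ch, num_anchors * 4, kernel_size=1)  # proposal-box branch
        self.score = nn.Conv2d(in_ch, num_anchors, kernel_size=1)     # objectness branch

    def forward(self, feature_maps):
        outputs = []
        for f in feature_maps:            # one feature map per FPN level
            h = self.shared(f)
            # sigmoid activation classifies each proposal box as foreground or background
            outputs.append((self.bbox(h), self.score(h).sigmoid()))
        return outputs
```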
- the preliminary bounding box and the multi-layer feature tensor data are provided to the regression classification network.
- the region of interest pooling layer cuts the multi-layer feature tensor data based on the preliminary bounding box, and the cut features are the corresponding candidate target features.
- the candidate target features are then processed using two independent branches.
- the first branch is processed using a 4-layer convolutional network to ultimately generate the corresponding regression parameters from the preliminary bounding box to the final target;
- the other branch is processed using a two-layer fully connected network to generate the final classification result;
- the final bounding box is obtained through the cooperation of the classification branch and the regression branch, that is, the category of each object and the position of the final bounding box of the object are obtained.
- non-maximum suppression (NMS) can also be used to process the preliminary bounding boxes output by the region proposal network, and the top 512 preliminary bounding boxes with the highest confidence are selected as candidate bounding boxes; ROIAlignPool is then used to clip the multi-layer feature tensors based on the candidate boxes, a 4-layer 3x3 convolutional network and a fully connected layer process the clipped data to obtain the final regression parameters, and two fully connected layers produce the final classification result.
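- the NMS, top-512, and ROIAlign step might look like the following sketch built on torchvision operators; the function name, the 7x7 output size, and the single-image batch index are assumptions for illustration:

```python
import torch
from torchvision.ops import nms, roi_align

def select_and_pool(boxes, scores, feature_map, stride, top_k=512, iou_thr=0.7):
    """Keep the top-k preliminary boxes after NMS, then crop features with ROIAlign."""
    keep = nms(boxes, scores, iou_thr)   # indices, already sorted by descending score
    keep = keep[:top_k]                  # top 512 candidate bounding boxes
    rois = boxes[keep]
    batch_idx = torch.zeros(rois.size(0), 1, device=rois.device)  # single image assumed
    rois = torch.cat([batch_idx, rois], dim=1)   # [K, 5]: (batch, x1, y1, x2, y2)
    return roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / stride)
```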
- Classification refers to the process of assigning a label to the object (i.e., determining whether the object belongs to the background or a defect target such as a burn, puncture, or tear).
- FIG4 and FIG5 show examples of the results detected by the method according to an embodiment of the present disclosure.
- the circular boxes in FIG4 mark the detected small-sized objects, such as small defects in industrial products.
- the square boxes in FIG5 mark the detected medium-sized objects, such as medium-sized defects in industrial products. It can be seen that the method of the present disclosure can accurately detect objects of various sizes in an image, especially small-sized objects.
- the region proposal network in Faster-RCNN needs to be trained.
- during training, a binary classification label (e.g., 0 or 1) is assigned to each prediction box based on its Intersection over Union (IoU) with the true box, where IoU is a concept used in object detection.
- if the IoU of the prediction box and the true box is greater than 0.7, the prediction box is called a positive sample; if the IoU is less than 0.3, it is called a negative sample.
- other prediction boxes are neither positive samples nor negative samples and are not used for final training.
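- a small sketch of this IoU-threshold labeling is shown below, assuming an IoU matrix of shape [M, N] between M true boxes and N prediction boxes; the function name and the ignore label -1 are illustrative:

```python
import torch

def assign_labels(iou, hi=0.7, lo=0.3):
    """Label each prediction box: 1 = positive, 0 = negative, -1 = unused for training."""
    labels = torch.full((iou.size(1),), -1, dtype=torch.long)
    best_iou, _ = iou.max(dim=0)   # best IoU of each prediction box over all true boxes
    labels[best_iou > hi] = 1      # positive samples
    labels[best_iou < lo] = 0      # negative samples
    return labels
```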
- after the region proposal network is trained in this way, objects in industrial products are detected by Faster-RCNN. However, the size range of these objects varies greatly, for example from 2 pixels to 5000 pixels, a difference of more than 2000 times, which is far greater than the size range of objects in natural scenes. In practice, it is therefore difficult to define appropriate prediction boxes such that each target has enough prediction boxes corresponding to it.
- the SimOTA algorithm can be used to train the region proposal network in Faster-RCNN.
- the feature points that fall in the true box or near the true box are taken as candidate positive samples.
- the cost matrix of the predicted boxes of the candidate positive samples and the true boxes is calculated. If there are M true boxes and N predicted boxes, the size of the cost matrix is M×N, and each element in the cost matrix is the value of the loss function of the true box and the predicted box, where the loss function includes the cross-entropy loss for classification and the IoU loss for regression.
- for each true box, the IoU values are ranked from large to small, the first m predicted boxes are selected, and the IoU values corresponding to the m predicted boxes are summed (the sum, rounded down to an integer, is n).
- the cost corresponding to each true box in the cost matrix is ranked from small to large, and the first n predicted boxes are selected as positive samples, and the others are used as negative samples.
- a small-object threshold θ (typically 64) is set; if the scale of the true box is less than θ, its size is expanded to θ for the above processing.
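- the dynamic selection above can be sketched as follows; m = 10 is an assumed default, n is obtained by truncating the IoU sum to an integer, and the sketch omits the conflict handling of full SimOTA (where a prediction box matched by several true boxes keeps only the lowest-cost match):

```python
import torch

def simota_assign(cost, iou, m=10):
    """Pick positive samples per true box from an [M, N] cost matrix and IoU matrix."""
    M, N = cost.shape
    positive = torch.zeros(M, N, dtype=torch.bool)
    topk_iou, _ = iou.topk(min(m, N), dim=1)            # top-m IoUs for each true box
    dynamic_k = topk_iou.sum(dim=1).int().clamp(min=1)  # n = truncated sum of top-m IoUs
    for gt in range(M):
        _, idx = cost[gt].topk(int(dynamic_k[gt]), largest=False)  # n lowest-cost boxes
        positive[gt, idx] = True                        # the rest stay negative
    return positive
```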
- the cross-entropy loss for classification can be expressed as: $L_{cls}(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)$
- $N_{cls}$ is the number of selected prediction boxes
- $p_i$ is the probability that the predicted box is the true box, with $p_i^* = 1$ for a positive sample and $p_i^* = 0$ for a negative sample
- $L_{cls}(p_i, p_i^*)$ is the logarithmic loss over the two categories: $L_{cls}(p_i, p_i^*) = -\left[p_i^* \log p_i + (1 - p_i^*)\log(1 - p_i)\right]$
- the IoU loss for regression can be expressed as: $L_{reg}(\{t_i\}) = \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$
- $t_i$ is the offset predicted by the predicted box, and $t_i^*$ is the offset of the predicted box relative to the true box
- $L_{reg}(t_i, t_i^*)$ is expressed as: $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$
- $R$ is the smooth L1 function, which is expressed as: $R(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$
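- putting the two terms together, a sketch of the loss computation is given below; the function name, the λ weighting, and normalizing the regression term by the number of positive samples are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def rpn_loss(p, p_star, t, t_star, lam=1.0):
    """Classification (binary cross entropy) plus regression (smooth L1) loss.

    p:      [N] predicted probability that each selected box is the true box
    p_star: [N] labels, 1 for positive samples and 0 for negative samples
    t:      [N, 4] offsets predicted by the prediction boxes
    t_star: [N, 4] target offsets relative to the true boxes
    """
    n_cls = max(p.numel(), 1)
    l_cls = F.binary_cross_entropy(p, p_star.float(), reduction="sum") / n_cls
    pos = p_star.bool()                  # the regression term only counts positives
    n_reg = max(int(pos.sum()), 1)
    l_reg = F.smooth_l1_loss(t[pos], t_star[pos], reduction="sum") / n_reg
    return l_cls + lam * l_reg
```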
- if the size of the true box is smaller than the predetermined threshold, the size of the true box is adjusted to the predetermined threshold.
- Table 1 shows the comparison of detecting objects in an image using the method according to an embodiment of the present disclosure and using the sliding window method based on Faster-RCNN.
- the detected objects include burns, punctures, and creases in the product; a larger value indicates that the corresponding method performs better.
- FIG6 shows a schematic block diagram of an electronic device 60 for detecting an object in an image according to an embodiment of the present disclosure.
- the electronic device 60 includes one or more processors 602 and a memory 604.
- the memory 604 is coupled to the processor 602 via a bus and an I/O interface 606, and stores instructions that can be executed by the processor 602. When the instructions are executed by the processor 602, the electronic device 60 can perform the steps of the method for detecting an object in an image in any of the above embodiments.
- the memory 604 may include readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory, and may further include read-only memory (ROM).
- Memory 604 may also include a program/utility having a set (at least one) of program modules, including but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
- the bus may represent one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
- the electronic device 60 may also communicate with one or more external devices (e.g., keyboards, pointing devices, Bluetooth devices, etc.), may communicate with one or more devices that enable a user to interact with the electronic device 60, and/or may communicate with any device that enables the electronic device 60 to communicate with one or more other computing devices (e.g., routers, modems, etc.). Such communication may be performed via an input/output (I/O) interface 606.
- the electronic device 60 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) via a network adapter.
- the network adapter may communicate with other modules of the electronic device 60 via a bus.
- as shown in FIG7, the device 700 may include an acquisition module 710, a feature map generation module 720, a fusion module 730, and an object detection module 740.
- the acquisition module 710 is configured to acquire an image to be inspected, which may be an image of an industrial product to be inspected, and the object may refer to defects existing in the industrial product, such as burns, punctures, creases, etc.
- the feature map generation module 720 is configured to generate a plurality of feature maps for detecting objects of different sizes based on the acquired image.
- the plurality of feature maps may include a first feature map, a second feature map, and a third feature map.
- the first feature map may be used to detect an object of a first size
- the second feature map may be used to detect an object of a second size
- the third feature map may be used to detect an object of a third size, wherein the first size is larger than the second size, and the second size is larger than the third size.
- the fusion module 730 is configured to fuse multiple feature maps to obtain a fused feature map.
- the following explains in detail how the feature map generation module 720 generates the first, second, and third feature maps, in conjunction with the three branches of the stem network introduced in FIG2.
- the image is downsampled to obtain first image data; and the first convolution processing is performed on the first image data to generate the first feature map.
- the image is subjected to a second convolution processing to obtain second image data; and the second image data is subjected to a third convolution processing to generate the second feature map.
- the image is subjected to a fourth convolution processing to obtain third image data; the third image data is subjected to a fifth convolution processing to obtain fourth image data; the fourth image data is subjected to a first pooling processing and a second pooling processing respectively to obtain fifth image data and sixth image data; the fifth image data and the sixth image data are spliced to obtain spliced image data; and the spliced image data is subjected to a sixth convolution processing to generate the third feature map.
- the object detection module 740 is configured to detect an object based on the fused feature map.
- a trained neural network may be used to detect an object.
- Each unit of the device 700 can further implement the functions and methods described in FIGS. 1-3 , which will not be repeated here.
- a computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by, for example, a processor, the steps of the method for detecting an object in an image described in any of the above embodiments can be implemented.
- various aspects of the present application can also be implemented in the form of a program product, which includes a program code, and when the program product is run on a terminal device, the program code is used to cause the terminal device to execute the steps of various exemplary embodiments of the present application described in the method for detecting an object in an image in this specification.
- the program product for implementing the above method according to the embodiment of the present application can adopt a portable compact disk read-only memory (CD-ROM) and include program code, and can be run on a terminal device, such as a personal computer.
- a readable storage medium can be any tangible medium containing or storing a program, which can be used by or in combination with an instruction execution system, an apparatus or a device.
- the program product may use any combination of one or more readable media.
- the readable medium may be a readable signal medium or a readable storage medium.
- the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- a readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried.
- This propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
- a readable signal medium may also be any readable medium other than a readable storage medium, which may send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
- the program code contained on the readable medium may be transmitted by any appropriate medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
- Program code for performing the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++, etc., and conventional procedural programming languages such as "C" or similar programming languages.
- the program code may be executed entirely on the user computing device, partially on the user device, as a separate software package, partially on the user computing device and partially on a remote computing device, or entirely on a remote computing device or server.
- the remote computing device may be connected to the user computing device through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
Description
The present disclosure relates to the technical field of image processing, and more particularly, to a method for detecting an object in an image, a computer-readable storage medium, and an electronic device.

In recent years, target detection technology, as one of the research tasks in the field of image processing technology, has been widely used in industrial production and intelligent manufacturing. Image processing technology uses a camera to obtain images of industrial products to be detected and transmits the images to an image processing system, which detects features of the target, such as defects in industrial products (scratches, holes, dents), based on the distribution, brightness, color, and other pixel information in the image.
Summary of the invention

Embodiments of the present disclosure provide a method, a computer-readable storage medium, and an electronic device for detecting an object in an image.

In a first aspect of the present disclosure, a method for detecting an object in an image is provided. The method comprises: acquiring an image to be detected; based on the image, generating a plurality of feature maps for detecting objects of different sizes; fusing the plurality of feature maps to obtain a fused feature map; and detecting the object based on the fused feature map.

In an embodiment of the present disclosure, the multiple feature maps include: a first feature map for detecting an object having a first size, a second feature map for detecting an object having a second size, and a third feature map for detecting an object having a third size, wherein the first size is larger than the second size, and the second size is larger than the third size.

In an embodiment of the present disclosure, generating the first feature map based on the image includes: downsampling the image to obtain first image data; and performing a first convolution process on the first image data to generate the first feature map.

In an embodiment of the present disclosure, the downsampling factor is 6, and in the first convolution process, the size of the convolution kernel is 3 and the step size is 2.

In an embodiment of the present disclosure, generating the second feature map based on the image includes: performing a second convolution process on the image to obtain second image data; and performing a third convolution process on the second image data to generate the second feature map.

In an embodiment of the present disclosure, in the second convolution process, the size of the convolution kernel is 3 and the step size is 3, and in the third convolution process, the size of the convolution kernel is 4 and the step size is 4.

In an embodiment of the present disclosure, generating the third feature map based on the image includes: performing a fourth convolution process on the image to obtain third image data; performing a fifth convolution process on the third image data to obtain fourth image data; performing a first pooling process and a second pooling process on the fourth image data, respectively, to obtain fifth image data and sixth image data; splicing the fifth image data and the sixth image data to obtain spliced image data; and performing a sixth convolution process on the spliced image data to generate the third feature map.

In an embodiment of the present disclosure, the first pooling process includes a first maximum pooling process, and the second pooling process includes a second maximum pooling process.

In an embodiment of the present disclosure, in the fourth convolution process, the size of the convolution kernel is 3 and the step size is 2; in the fifth convolution process, the size of the convolution kernel is 3 and the step size is 1; in the first pooling process, the size of the pooling kernel is 3 and the step size is 2; in the second pooling process, the size of the pooling kernel is 5 and the step size is 2; and in the sixth convolution process, the size of the convolution kernel is 3 and the step size is 3.

In an embodiment of the present disclosure, a trained neural network is used to detect the object based on the fused feature map.

In an embodiment of the present disclosure, the trained neural network is a Faster-RCNN network, wherein the Faster-RCNN network includes a backbone network, a region proposal network, and a regression classification network.

In an embodiment of the present disclosure, the SimOTA algorithm is used to train the region proposal network.

In an embodiment of the present disclosure, using the SimOTA algorithm to train the region proposal network includes: taking feature points falling in or near the true box as candidate positive samples; calculating the cost matrix of the predicted boxes of the candidate positive samples and the true boxes; for each true box, ranking the Intersection over Union (IoU) values of the true box and the predicted boxes from large to small and selecting the first m predicted boxes; summing the IoU values corresponding to the first m predicted boxes to obtain n (rounded down to an integer); and ranking the costs from small to large, selecting the first n predicted boxes as positive samples, with the others as negative samples.

In an embodiment of the present disclosure, when the SimOTA algorithm is used, if the size of the true box is smaller than a predetermined threshold, the size of the true box is adjusted to the predetermined threshold.

In an embodiment of the present disclosure, if there are M true boxes and N predicted boxes, the size of the cost matrix is M×N, and each element in the cost matrix is the value of the loss function of the true box and the predicted box.

In an embodiment of the present disclosure, the loss function includes a cross-entropy loss for classification and an IoU loss for regression.
In an embodiment of the present disclosure, the cross-entropy loss for classification is expressed as:

$$L_{cls}(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)$$

where $N_{cls}$ is the number of selected prediction boxes, $p_i$ is the probability that the prediction box is the true box, $p_i^* = 1$ for a positive sample and $p_i^* = 0$ for a negative sample, and $L_{cls}(p_i, p_i^*)$ is expressed as:

$$L_{cls}(p_i, p_i^*) = -\left[p_i^* \log p_i + (1 - p_i^*)\log(1 - p_i)\right]$$

In an embodiment of the present disclosure, the IoU loss for regression is expressed as:

$$L_{reg}(\{t_i\}) = \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $t_i$ is the offset predicted by the prediction box, $t_i^*$ is the offset of the prediction box relative to the true box, and $L_{reg}(t_i, t_i^*)$ is expressed as:

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$$

where $R$ is the smooth L1 function:

$$R(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
According to a second aspect of the present disclosure, a device for detecting an object in an image is provided. The device comprises: an acquisition module configured to acquire an image to be detected; a feature map generation module configured to generate multiple feature maps for detecting objects of different sizes based on the image; a fusion module configured to fuse the multiple feature maps to obtain a fused feature map; and an object detection module configured to detect the object based on the fused feature map.

According to a third aspect of the present disclosure, a computer-readable storage medium is provided. Computer program instructions are stored on the computer-readable storage medium, wherein when the computer program instructions are executed by a processor, the processor is caused to perform the method according to the first aspect of the present disclosure.

According to a fourth aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method according to the first aspect of the present disclosure.

Further aspects and scope of applicability become apparent from the description provided herein. It should be understood that the various aspects of the present application can be implemented individually or in combination with one or more other aspects. It should also be understood that the description and specific embodiments herein are intended for illustrative purposes only and are not intended to limit the scope of the present application.

The drawings described herein are for illustrative purposes only of selected embodiments, not all possible implementations, and are not intended to limit the scope of the present application, wherein:
FIG1 shows a schematic flow chart of a method for detecting an object in an image according to an embodiment of the present disclosure;

FIG2 shows an exemplary stem network according to an embodiment of the present disclosure;

FIG3 shows a network architecture in which the method shown in FIG1 is implemented according to an embodiment of the present disclosure, the network architecture including the stem network shown in FIG2 and the Faster-RCNN network;

FIG4 shows an example of a result detected by using a method according to an embodiment of the present disclosure;

FIG5 shows an example of a result detected by using a method according to an embodiment of the present disclosure;

FIG6 is a block diagram showing a schematic structure of an electronic device according to an embodiment of the present disclosure; and

FIG7 shows an exemplary structural block diagram of a device for detecting an object in an image according to an embodiment of the present disclosure.
In order to make the purpose, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below in conjunction with the drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. Based on the described embodiments, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the scope of protection of the present disclosure. The embodiments of the present disclosure will be described in detail below with reference to the drawings. It should be noted that, where there is no conflict, the features in the embodiments of the present disclosure can be combined with each other.

Faster-RCNN can perform target (also referred to herein as object) detection; a Faster-RCNN-based method obtains a bounding box (bbox) that delimits the target or object. In the process of performing target detection on industrial products, the image of the product to be detected may contain extremely small objects (for example, objects of 2 to 3 pixels) and larger objects (for example, objects of 300 to 8000 pixels). If Faster-RCNN is used directly for target detection, a large amount of GPU memory is required due to the high resolution of the image; therefore, target detection may not be feasible under existing hardware conditions. If Faster-RCNN is used for detection after downsampling the image, small objects may not be detected because information about them is lost. Usually, if all these objects need to be detected at the same time, an image pyramid structure is used to extract multi-scale features, and the image is processed using a sliding-window method. However, this method is complex and time-consuming.

Traditional Faster-RCNN mainly processes natural scenes, and in order to handle complex natural scenes, each layer of the Faster-RCNN backbone network is relatively heavyweight. Since natural scenes are complex and the objects to be detected are not too small, the resolution of natural-scene images is usually within 1024×1024. However, images of industrial products to be detected have a higher resolution, usually around 8000×5000. Therefore, directly using a heavyweight backbone network to process such images wastes substantial resources and takes a long time.
FIG1 shows a schematic flow chart of a method for detecting an object in an image according to an embodiment of the present disclosure. As shown in FIG1, in block 102, an image to be detected is acquired. In an embodiment of the present disclosure, the image may be an image of an industrial product to be detected, and the object may refer to defects present in the industrial product, such as burns, punctures, creases, etc. The image may be acquired in real time by an image acquisition device, or may be pre-stored in a storage device; this is not limited here.

Continuing to refer to FIG1, in block 104, based on the acquired image, a plurality of feature maps for detecting objects of different sizes are generated. In an embodiment of the present disclosure, the plurality of feature maps may include a first feature map, a second feature map, and a third feature map. The first feature map may be used to detect an object of a first size, the second feature map may be used to detect an object of a second size, and the third feature map may be used to detect an object of a third size, wherein the first size is greater than the second size, and the second size is greater than the third size. In this embodiment, an object of the first size may be a larger defective object in an industrial product, an object of the second size may be a medium defective object, and an object of the third size may be a smaller defective object. In one embodiment, for an image with a resolution of 8000, the third size may be in the range of 1 to 32 pixels, the second size may be in the range of 32 to 256 pixels, and the first size may be more than 256 pixels. In another embodiment, for an image with a resolution of 15000, the third size may be in the range of 1 to 64 pixels, the second size may be in the range of 64 to 512 pixels, and the first size may be more than 512 pixels.

It should be noted that the embodiments of the present disclosure divide the objects in the image into three sizes; those skilled in the art can divide the objects into more or fewer sizes according to actual needs. In addition, the following examples take the acquired image to be a grayscale image; it should be understood that the image can also be a color image.
FIG2 shows an exemplary stem network according to an embodiment of the present disclosure. The following describes in detail how to generate the first, second, and third feature maps in conjunction with the three branches of the stem network of FIG2. It should be noted that the image data and feature maps mentioned below are represented in the form of tensor shapes [B, C, H, W], where B represents the batch, C represents the channel (for example, the channel of a grayscale image is 1 and the channel of an RGB image is 3), H represents the height of the image, and W represents the width of the image.

As shown in FIG2, the first branch (labeled "branch 1") is used to generate the first feature map. In this branch, first, the acquired image is downsampled to obtain first image data. In one embodiment, the downsampling process can use a bilinear interpolation function, the downsampling factor can be 6, and the tensor shape of the obtained first image data is [B, 1, H/6, W/6]. Then, the first convolution process is performed on the first image data [B, 1, H/6, W/6] to generate the first feature map. In one embodiment, the size of the convolution kernel used in the first convolution process is 3 and the step size is 2. The tensor shape of the generated first feature map is [B, 16, H/12, W/12]. In the processing of the first branch, a large amount of computing resources can be saved by downsampling. The generated first feature map retains the image data of large-sized objects but loses the image data of small-sized objects, and is mainly used to detect large-sized objects.

As shown in FIG2, the second branch (labeled "branch 2") is used to generate the second feature map. In this branch, first, a second convolution process is performed on the acquired image to obtain second image data. In one embodiment, the size of the convolution kernel used in the second convolution process is 3, and the step size is 3. The tensor shape of the obtained second image data is [B, 8, H/3, W/3]. Then, a third convolution process is performed on the second image data [B, 8, H/3, W/3] to generate the second feature map. In one embodiment, the size of the convolution kernel used in the third convolution process is 4, and the step size is 4. The tensor shape of the generated second feature map is [B, 16, H/12, W/12]. The second feature map generated by the processing of the second branch retains the image data of medium-sized objects and is mainly used to detect medium-sized objects.

Continuing to refer to FIG2, the third branch (labeled "branch 3") is used to generate the third feature map. In this branch, a fourth convolution process is performed on the acquired image to obtain third image data. In one embodiment, the size of the convolution kernel used in the fourth convolution process is 3, and the step size is 2. The tensor shape of the obtained third image data is [B, 8, H/2, W/2]. Next, a fifth convolution process is performed on the third image data [B, 8, H/2, W/2] to obtain fourth image data. In one embodiment, the size of the convolution kernel used in the fifth convolution process is 3, and the step size is 1. The tensor shape of the obtained fourth image data is [B, 16, H/2, W/2]. Then, a first pooling process, for example a first maximum pooling process, is performed on the fourth image data [B, 16, H/2, W/2] to obtain fifth image data. In one embodiment, the size of the pooling kernel used in the first pooling process is 3, and the step size is 2. The tensor shape of the obtained fifth image data is [B, 16, H/4, W/4]. A second pooling process, for example a second maximum pooling process, is performed on the fourth image data [B, 16, H/2, W/2] to obtain sixth image data. In one embodiment, the size of the pooling kernel used in the second pooling process is 5, and the step size is 2. The tensor shape of the obtained sixth image data is [B, 16, H/4, W/4]. Next, the fifth image data [B, 16, H/4, W/4] and the sixth image data [B, 16, H/4, W/4] are spliced to obtain the spliced image data [B, 32, H/4, W/4]. In one embodiment, a fully connected layer can be used to splice the fifth image data and the sixth image data. Finally, a sixth convolution process is performed on the spliced image data [B, 32, H/4, W/4] to generate the third feature map. In one embodiment, the size of the convolution kernel used in the sixth convolution process is 3, and the step size is 3. The tensor shape of the generated third feature map is [B, 32, H/12, W/12]. The third feature map generated by the processing of the third branch retains image data of small-sized objects and is mainly used for detecting small-sized objects.
需要说明的是,在本公开的实施例中,可以并行地执行上述三个分支,以及并行地执行第一池化处理和第二池化处理。It should be noted that, in the embodiments of the present disclosure, the above three branches may be executed in parallel, and the first pooling process and the second pooling process may be executed in parallel.
In embodiments of the present disclosure, each of the convolution processes described above may apply, in sequence, a two-dimensional convolution layer, a normalization layer (for example, a batch normalization layer or a group normalization layer), and an activation function.
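For example, such a convolution process could be sketched as the following Conv-Norm-Activation unit; the choice of ReLU is an assumption, since the paragraph leaves the activation function open.

```python
import torch.nn as nn

def conv_block(c_in: int, c_out: int, k: int, s: int, p: int = 0) -> nn.Sequential:
    """One convolution process: 2D convolution, then normalization, then activation."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, stride=s, padding=p, bias=False),
        nn.BatchNorm2d(c_out),   # or nn.GroupNorm(num_groups, c_out)
        nn.ReLU(inplace=True),   # assumed activation function
    )
```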
Continuing to refer to FIG. 1, in block 106, the first, second, and third feature maps are fused. In embodiments of the present disclosure, a fully connected layer may be used to fuse the first, second, and third feature maps into a fused feature map with a tensor shape of [B, 64, H/12, W/12]. In one embodiment, the fusion process may fuse the image data along the channel dimension.
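A minimal sketch of such channel-wise fusion under the shape assumptions above (simple concatenation along the channel dimension, which yields 16 + 16 + 32 = 64 channels):

```python
import torch

# Assumed branch outputs: all share the spatial size [H/12, W/12].
f1 = torch.randn(2, 16, 128, 128)  # first feature map
f2 = torch.randn(2, 16, 128, 128)  # second feature map
f3 = torch.randn(2, 32, 128, 128)  # third feature map

fused = torch.cat([f1, f2, f3], dim=1)  # [B, 64, H/12, W/12]
print(fused.shape)  # torch.Size([2, 64, 128, 128])
```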
In block 108, objects are detected based on the fused feature map. In embodiments of the present disclosure, a trained neural network may be used to detect the objects. In one embodiment, the trained neural network may be a Faster-RCNN network, which is described in detail below.
It should be noted that the flowchart shown in FIG. 1 is merely an example; those skilled in the art will appreciate that various modifications may be made to the flowchart or to the steps described therein.
FIG. 3 shows a network architecture in which the method of FIG. 1 is implemented according to an embodiment of the present disclosure. As shown in FIG. 3, the network architecture includes the stem network of FIG. 2 and a Faster-RCNN network. In embodiments of the present disclosure, the Faster-RCNN network may include a backbone network, a region proposal network, and a regression classification network. It should be noted that although FIG. 3 shows the stem network and the Faster-RCNN network separately, the stem network may instead be embedded between suitable layers of the backbone network of the Faster-RCNN network.
In embodiments of the present disclosure, the backbone network may be the residual network ResNet50. ResNet50 receives the fused feature map output by the stem network and processes it through five layers. In embodiments of the present disclosure, the tensor shape of the image data output by layer 0 is [B, 256, H/24, W/24], by layer 1 is [B, 512, H/48, W/48], by layer 2 is [B, 1024, H/96, W/96], by layer 3 is [B, 2048, H/192, W/192], and by layer 4 is [B, 2048, H/384, W/384]. The image data processed by the five layers is then provided to a feature pyramid network (FPN) to generate unified image data.
In embodiments of the present disclosure, the FPN processing is as follows. The layer-4 features are processed by a convolution layer with kernel size 1 and 256 output channels, producing a tensor of shape [B, 256, H/384, W/384]. The layer-3 features are processed by a convolution layer with kernel size 1 and 256 output channels, producing a tensor of shape [B, 256, H/192, W/192]; the layer-4 features are upsampled by a factor of 2 and added to these layer-3 features to form the layer-3 candidate features, which are then processed by a 3x3 convolution layer to become the new layer-3 features. The layer-2 features are processed by a convolution layer with kernel size 1 and 256 output channels, producing a tensor of shape [B, 256, H/96, W/96]; the layer-3 candidate features are upsampled by a factor of 2 and added to these layer-2 features to form the layer-2 candidate features, which are then processed by a 3x3 convolution layer to become the new layer-2 features. The layer-1 features are processed by a convolution layer with kernel size 1 and 256 output channels, producing a tensor of shape [B, 256, H/48, W/48]; the layer-2 candidate features are upsampled by a factor of 2 and added to these layer-1 features to form the layer-1 candidate features, which are then processed by a 3x3 convolution layer to become the new layer-1 features. The layer-0 features are processed by a convolution layer with kernel size 1 and 256 output channels, producing a tensor of shape [B, 256, H/24, W/24]; the layer-1 candidate features are upsampled by a factor of 2 and added to these layer-0 features to form the layer-0 candidate features, which are then processed by a 3x3 convolution layer to become the new layer-0 features. Finally, the new multi-layer feature tensor data is provided to the region proposal network (RPN) and the regression classification network (RCNN) for subsequent processing; a sketch of this top-down pathway is given below.
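The following is a hedged sketch of one step of this top-down pathway between two adjacent levels. The lower-level channel count (2048, for layer 3 here) and the nearest-neighbor upsampling mode are assumptions consistent with the shapes above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNStep(nn.Module):
    """One top-down FPN step: lateral 1x1 conv, 2x upsample + add, 3x3 smoothing."""
    def __init__(self, c_lower: int, c_out: int = 256):
        super().__init__()
        self.lateral = nn.Conv2d(c_lower, c_out, kernel_size=1)        # 1x1, 256 out
        self.smooth = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1)  # 3x3 conv

    def forward(self, upper: torch.Tensor, lower: torch.Tensor) -> torch.Tensor:
        lat = self.lateral(lower)                                  # e.g. [B,256,H/192,W/192]
        up = F.interpolate(upper, scale_factor=2, mode="nearest")  # 2x upsampling
        return self.smooth(lat + up)                               # new lower-level feature

# e.g. merging the layer-4 output (already reduced to 256 channels) into layer 3:
step = FPNStep(c_lower=2048)
p4 = torch.randn(1, 256, 4, 4)    # stand-in for [B, 256, H/384, W/384]
c3 = torch.randn(1, 2048, 8, 8)   # stand-in for [B, 2048, H/192, W/192]
print(step(p4, c3).shape)         # torch.Size([1, 256, 8, 8])
```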
In embodiments of the present disclosure, the multi-layer feature tensor data is input to the region proposal network. In the region proposal network, each layer of feature tensor data is first processed by its own multi-layer convolutional network to generate a proposal feature tensor, which is then processed by two independent branches. In embodiments of the present disclosure, the first branch uses a convolution to generate proposal boxes; the other branch is processed by a convolutional network with a sigmoid activation to classify each corresponding proposal box as foreground or background. The corresponding foreground proposal boxes (also referred to in this application as preliminary bounding boxes) serve as the input to the next stage, that is, the RCNN stage.
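A minimal sketch of the two RPN branches on one feature level follows; the number of anchors per location (3) and the shared intermediate 3x3 convolution are assumptions, since the paragraph does not fix them.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the two RPN branches: box regression and objectness."""
    def __init__(self, c_in: int = 256, num_anchors: int = 3):  # num_anchors assumed
        super().__init__()
        self.shared = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1)
        self.box = nn.Conv2d(c_in, num_anchors * 4, kernel_size=1)  # proposal boxes
        self.obj = nn.Conv2d(c_in, num_anchors, kernel_size=1)      # fg/bg logits

    def forward(self, feat: torch.Tensor):
        t = torch.relu(self.shared(feat))
        boxes = self.box(t)                  # branch 1: box offsets
        scores = torch.sigmoid(self.obj(t))  # branch 2: foreground probability
        return boxes, scores
```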
Next, the preliminary bounding boxes and the multi-layer feature tensor data are provided to the regression classification network. In the regression classification network, a region-of-interest pooling layer crops the multi-layer feature tensor data according to the preliminary bounding boxes, and the cropped features are the corresponding candidate target features. The candidate target features are then processed by two independent branches. In embodiments of the present disclosure, the first branch is processed by a four-layer convolutional network and ultimately generates the regression parameters from the corresponding preliminary bounding box to the final target; the other branch is processed by a two-layer fully connected network and generates the final classification result. Through the cooperation of the classification branch and the regression branch, the final bounding box is obtained; that is, the category of each object and the position of its final bounding box are obtained.
In embodiments of the present disclosure, non-maximum suppression (NMS) may also be used to process the preliminary bounding boxes output by the region proposal network, selecting the 512 preliminary bounding boxes with the highest confidence as candidate bounding boxes. ROIAlignPool is then used to crop the multi-layer feature tensors according to the candidate boxes to obtain cropped data; a four-layer 3x3 convolutional network followed by a fully connected layer processes the cropped data to obtain the final regression parameters, and two fully connected layers produce the final classification result.
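Assuming torchvision's standard operators stand in for the NMS and ROIAlignPool steps described here, the proposal filtering could be sketched as follows; the IoU threshold of 0.5 and the 7x7 output size are assumptions.

```python
import torch
from torchvision.ops import nms, roi_align

def filter_and_crop(boxes, scores, feat, spatial_scale, k=512):
    """NMS, keep top-k proposals by confidence, then RoIAlign-crop the features."""
    keep = nms(boxes, scores, iou_threshold=0.5)  # returns indices sorted by score
    keep = keep[:k]                               # top-512 candidate boxes
    # prepend batch index 0 (single image assumed) to form [K, 5] RoIs
    rois = torch.cat([torch.zeros(len(keep), 1), boxes[keep]], dim=1)
    return roi_align(feat, rois, output_size=(7, 7),  # 7x7 crop assumed
                     spatial_scale=spatial_scale, sampling_ratio=2)
```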
It should be noted that, in the field of object recognition, the location of an object is determined by drawing a bounding box around one or more objects, and classification refers to the process of assigning a label to the object (that is, determining whether the object belongs to the background or to a defect target such as a burn, a puncture, or a crease).
FIG. 4 and FIG. 5 show examples of detection results obtained by the method according to embodiments of the present disclosure. The circular boxes in FIG. 4 mark detected small-sized objects, for example, small defects in an industrial product. The square boxes in FIG. 5 mark detected medium-sized objects, for example, medium-sized defects in an industrial product. It can be seen that the method of the present disclosure can accurately detect objects of various sizes in an image, especially small-sized objects.
In addition, before the network architecture shown in FIG. 3 is used to detect objects, the region proposal network in Faster-RCNN needs to be trained. Traditionally, during training of the region proposal network, a binary classification label (for example, 0 or 1) is assigned to each prediction box, where 0 indicates a negative sample and 1 indicates a positive sample. IoU (Intersection over Union) is a concept used in object detection: in a traditional region proposal network, prediction boxes of various sizes and aspect ratios must be predefined, and the overlap between a prediction box and a ground-truth box, that is, the ratio of their intersection to their union, determines whether the prediction box is treated as a positive or negative sample during training. In embodiments of the present disclosure, a prediction box whose IoU with the ground-truth box is greater than 0.7 is called a positive sample, and a prediction box whose IoU with the ground-truth box is less than 0.3 is called a negative sample; the remaining prediction boxes are neither positive nor negative samples and are not used for final training. After the region proposal network is trained in this way, Faster-RCNN is used to detect objects in industrial products. However, the size range of such objects varies enormously, for example, from 2 pixels to 5,000 pixels, a difference of more than 2,000 times, far exceeding the size range of objects in natural scenes. In practice, it is difficult to define prediction boxes such that every target has enough corresponding prediction boxes, and the sensitivity of IoU differs greatly across object sizes; in particular, IoU is insensitive to small objects, so most prediction boxes become negative samples because of small position deviations, resulting in inaccurate label assignment. Therefore, the traditional Faster-RCNN neural network cannot meet the requirements for defect detection in industrial products.
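For illustration, a small sketch of this traditional IoU-based label assignment, with the 0.7 and 0.3 thresholds above (the value -1 marks boxes ignored during training):

```python
import torch
from torchvision.ops import box_iou

def assign_labels(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """1 = positive, 0 = negative, -1 = ignored (not used for training)."""
    iou = box_iou(pred_boxes, gt_boxes).max(dim=1).values  # best IoU per prediction
    labels = torch.full((len(pred_boxes),), -1, dtype=torch.long)
    labels[iou > 0.7] = 1   # positive sample
    labels[iou < 0.3] = 0   # negative sample
    return labels
```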
To this end, in embodiments of the present disclosure, the SimOTA algorithm may be used to train the region proposal network in Faster-RCNN. First, the feature points falling inside or near a ground-truth box are taken as candidate positive samples. Next, the cost matrix between the prediction boxes of the candidate positive samples and the ground-truth boxes is computed: given M ground-truth boxes and N prediction boxes, the cost matrix has size M×N, and each of its elements is the value of the loss function between the corresponding ground-truth box and prediction box, where the loss function includes a cross entropy loss for classification and an IoU loss for regression. Then, for each ground-truth box, the IoU values are ranked from large to small, the top m prediction boxes are selected, and the IoU values of those m prediction boxes are summed to obtain n. Finally, the costs corresponding to each ground-truth box in the cost matrix are ranked from small to large, and the top n prediction boxes are selected as positive samples, with the rest as negative samples. In this process, to better handle small targets, a small-target threshold τ (typically 64) is set: if the scale of a ground-truth box is smaller than τ, its size is expanded to τ before the above processing.
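A hedged sketch of this dynamic-k selection follows. The variable names are assumptions, and the per-box loop keeps the logic close to the paragraph: for each ground-truth box, sum the top-m IoUs to get n, then take the n lowest-cost predictions as positives. The small-target expansion to τ would be applied to the ground-truth boxes before computing `iou` and `cost`.

```python
import torch

def simota_assign(cost: torch.Tensor, iou: torch.Tensor, m: int = 10):
    """cost, iou: [M, N] matrices (M ground-truth boxes, N candidate predictions).
    Returns a bool mask [M, N] marking positive samples; the rest are negatives."""
    M, N = cost.shape
    positive = torch.zeros((M, N), dtype=torch.bool)
    for j in range(M):
        top_iou, _ = iou[j].topk(min(m, N))              # top-m IoU values for this gt box
        n = max(int(top_iou.sum().item()), 1)            # dynamic k: n = sum of top-m IoUs
        _, idx = cost[j].topk(min(n, N), largest=False)  # n lowest-cost predictions
        positive[j, idx] = True
    return positive
```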
In embodiments of the present disclosure, the cross entropy loss for classification can be expressed as:

$$L_{cls}=\frac{1}{N_{cls}}\sum_{i}L_{cls}(p_i,p_i^*)$$

where $N_{cls}$ is the number of selected prediction boxes, $p_i$ is the predicted probability that prediction box $i$ is a true box, $p_i^*=1$ for a positive sample and $p_i^*=0$ for a negative sample, and $L_{cls}(p_i,p_i^*)$ is the log loss over the two classes:

$$L_{cls}(p_i,p_i^*)=-\left[p_i^*\log p_i+(1-p_i^*)\log(1-p_i)\right]$$

In embodiments of the present disclosure, the IoU loss for regression can be expressed as:

$$L_{reg}=\frac{1}{N_{reg}}\sum_{i}p_i^*L_{reg}(t_i,t_i^*)$$

where $t_i$ is the offset predicted for prediction box $i$, $t_i^*$ is the offset of the ground-truth box relative to the prediction box, and $L_{reg}(t_i,t_i^*)$ is expressed as:

$$L_{reg}(t_i,t_i^*)=R(t_i-t_i^*)$$

where $R$ is the smooth $L_1$ function:

$$\mathrm{smooth}_{L_1}(x)=\begin{cases}0.5x^2,&|x|<1\\|x|-0.5,&\text{otherwise}\end{cases}$$
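Assuming PyTorch's built-in binary cross entropy and smooth L1 operators as stand-ins for these two terms, a minimal sketch could read:

```python
import torch
import torch.nn.functional as F

def rpn_loss(p, p_star, t, t_star):
    """p: predicted foreground probability in [0, 1]; p_star: 0/1 labels;
    t / t_star: predicted and target offsets for the same prediction boxes."""
    l_cls = F.binary_cross_entropy(p, p_star.float())        # averaged over N_cls
    pos = p_star.bool()                                      # regression only for positives
    l_reg = F.smooth_l1_loss(t[pos], t_star[pos], beta=1.0)  # R = smooth L1
    return l_cls, l_reg
```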
In embodiments of the present disclosure, when the SimOTA algorithm is used to train the region proposal network, if the size of a ground-truth box is smaller than a predetermined threshold, the size of the ground-truth box is expanded to that predetermined threshold.
Table 1 compares the detection of objects in an image using the method according to embodiments of the present disclosure against a sliding-window method based on Faster-RCNN. Referring to Table 1, the detected objects include burn, puncture, and crease defects in products; larger values indicate a better method.
Table 1
As can be seen from Table 1, the method of the present disclosure takes only about 1/100 of the time of the conventional sliding-window method, while also exhibiting good performance.
As can be seen from the above description, the method according to embodiments of the present disclosure generates multiple feature maps for detecting objects of different sizes, and can therefore detect objects of various sizes in an image, especially small-sized objects, while improving detection speed and saving computing resources.
FIG. 6 shows a schematic block diagram of an electronic device 60 for detecting objects in an image according to an embodiment of the present disclosure. The electronic device 60 includes one or more processors 602 and a memory 604. The memory 604 is coupled to the processor 602 via a bus and an I/O interface 606 and stores instructions executable by the processor 602. When the instructions are executed by the processor 602, the electronic device 60 can perform the steps of the method for detecting objects in an image according to any of the embodiments described above.
The memory 604 may include readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory, and may further include read-only memory (ROM).
The memory 604 may also include a program/utility having a set of (at least one) program modules, including but not limited to an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The bus may represent one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 60 may also communicate with one or more external devices (for example, a keyboard, a pointing device, or a Bluetooth device), with one or more devices that enable a user to interact with the electronic device 60, and/or with any device (for example, a router or a modem) that enables the electronic device 60 to communicate with one or more other computing devices. Such communication may be performed via the input/output (I/O) interface 606. Furthermore, the electronic device 60 may communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter, which may communicate with the other modules of the electronic device 60 via the bus. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 60, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
FIG. 7 shows an exemplary structural block diagram of a device 700 for detecting objects in an image according to an embodiment of the present disclosure. The device 700 may include an acquisition module 710, a feature map generation module 720, a fusion module 730, and an object detection module 740.
In embodiments of the present disclosure, the acquisition module 710 is configured to acquire an image to be inspected. The image may be an image of an industrial product to be inspected, and the objects may be defects present in the industrial product, such as burns, punctures, and creases.
In embodiments of the present disclosure, the feature map generation module 720 is configured to generate, based on the acquired image, multiple feature maps for detecting objects of different sizes. The multiple feature maps may include a first feature map, a second feature map, and a third feature map. The first feature map may be used to detect objects of a first size, the second feature map may be used to detect objects of a second size, and the third feature map may be used to detect objects of a third size, where the first size is larger than the second size and the second size is larger than the third size.
In embodiments of the present disclosure, the fusion module 730 is configured to fuse the multiple feature maps to obtain a fused feature map. The generation of the first, second, and third feature maps is further described with reference to the three branches of the stem network introduced in FIG. 2. In the processing of the first branch shown in FIG. 2, the image is downsampled to obtain first image data, and a first convolution process is performed on the first image data to generate the first feature map. In the processing of the second branch shown in FIG. 2, a second convolution process is performed on the image to obtain second image data, and a third convolution process is performed on the second image data to generate the second feature map. In the processing of the third branch shown in FIG. 2, a fourth convolution process is performed on the image to obtain third image data; a fifth convolution process is performed on the third image data to obtain fourth image data; a first pooling process and a second pooling process are performed on the fourth image data to obtain fifth image data and sixth image data, respectively; the fifth image data and the sixth image data are concatenated to obtain concatenated image data; and a sixth convolution process is performed on the concatenated image data to generate the third feature map.
In embodiments of the present disclosure, the object detection module 740 is configured to detect objects based on the fused feature map. In embodiments of the present disclosure, a trained neural network may be used to detect the objects.
The units of the device 700 may further implement the functions and methods described with reference to FIGS. 1-3, which are not repeated here.
Embodiments of the present disclosure also provide a computer-readable storage medium on which computer program instructions are stored; when executed by, for example, a processor, the computer program instructions can implement the steps of the method for detecting objects in an image described in any of the above embodiments. In some possible implementations, the various aspects of the present application may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps, according to the various exemplary embodiments of the present application, described in this specification for the method for detecting objects in an image.
A program product for implementing the above method according to embodiments of the present application may take the form of a portable compact disc read-only memory (CD-ROM) including program code, and may run on a terminal device such as a personal computer. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium containing or storing a program that can be used by, or in combination with, an instruction execution system, apparatus, or device.
The program product may use any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium, and may send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. The program code contained on a readable medium may be transmitted by any appropriate medium, including but not limited to wireless, wired, optical cable, or RF, or any suitable combination thereof.
Program code for performing the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
Those skilled in the art will appreciate that the various aspects of the present application may be implemented as a system, a method, or a program product. Accordingly, the various aspects of the present application may take the form of an entirely hardware implementation, an entirely software implementation (including firmware, microcode, and the like), or an implementation combining hardware and software, which may be collectively referred to herein as a "circuit", "module", or "system".
The above description covers only preferred embodiments of the present disclosure and is not intended to limit the present disclosure; those skilled in the art will recognize that the present disclosure may be modified and varied in many ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.
Claims (21)
The method according to claim 16, wherein the cross entropy loss for classification is expressed as:

$$L_{cls}=\frac{1}{N_{cls}}\sum_{i}L_{cls}(p_i,p_i^*)$$

where $N_{cls}$ is the number of selected prediction boxes, $p_i$ is the predicted probability that prediction box $i$ is a true box, $p_i^*=1$ for a positive sample and $p_i^*=0$ for a negative sample, and $L_{cls}(p_i,p_i^*)$ is expressed as:

$$L_{cls}(p_i,p_i^*)=-\left[p_i^*\log p_i+(1-p_i^*)\log(1-p_i)\right]$$

The method of claim 16, wherein the IoU loss for regression is expressed as:

$$L_{reg}=\frac{1}{N_{reg}}\sum_{i}p_i^*L_{reg}(t_i,t_i^*)$$

where $t_i$ is the offset predicted for the prediction box, $t_i^*$ is the offset of the ground-truth box relative to the prediction box, and $L_{reg}(t_i,t_i^*)$ is expressed as:

$$L_{reg}(t_i,t_i^*)=R(t_i-t_i^*)$$

where $R$ is the smooth $L_1$ function:

$$\mathrm{smooth}_{L_1}(x)=\begin{cases}0.5x^2,&|x|<1\\|x|-0.5,&\text{otherwise}\end{cases}$$