CN117830703A

CN117830703A - Image recognition method based on multi-scale feature fusion, computer device and computer-readable storage medium

Info

Publication number: CN117830703A
Application number: CN202311750768.6A
Authority: CN
Inventors: 余正泓; 叶健雄; 黎红源
Original assignee: Guangdong Institute of Science and Technology
Current assignee: Guangdong Institute of Science and Technology
Priority date: 2023-12-18
Filing date: 2023-12-18
Publication date: 2024-04-05

Abstract

The invention provides an image recognition method based on multi-scale feature fusion, a computer device and a computer-readable storage medium. The method comprises the following steps: preparing a training sample set; constructing a bidirectional cascade neural network model, wherein the model comprises a first recognition neural network having a first encoder and a first decoder, and the first decoder acquires a first current feature map and a second current feature map output by the first encoder, convolves each of them to obtain a first current convolution feature map and a second current convolution feature map, cascades the first current convolution feature map, the second current convolution feature map and the upsampled previous feature map to form a first cascade map, and performs feature remapping on the first cascade map to obtain a first feature map; training the bidirectional cascade neural network model with the training sample set; and inputting the image to be recognized into the trained model to obtain a recognition result. The method improves recognition performance.

Description

Image recognition method based on multi-scale feature fusion, computer device and computer-readable storage medium

Technical Field

The present invention relates to the technical field of image processing, and in particular to an image recognition method based on multi-scale feature fusion, a computer device and a computer-readable storage medium.

Background Art

Deep learning methods applied to image processing can now recognize a wide variety of objects with broad applicability and good accuracy, and they have become a new means of addressing the challenges of agricultural vision applications. However, crops are bound by their natural growth cycles, so collecting agricultural image data is expensive and time-consuming. An experimental field often yields only one or two image sequences per year. The training data available for a neural network model is therefore very limited, which seriously degrades the model's performance.

One existing approach, the Faster R-CNN network, outputs a single, small feature layer. This design tends to lose the pixel information of small targets and cannot recognize small objects in the image, which limits the network's performance.

Summary of the Invention

The first object of the present invention is to provide an image recognition method based on multi-scale feature fusion that improves the performance of a neural network.

The second object of the present invention is to provide a computer device that implements the above image recognition method based on multi-scale feature fusion.

The third object of the present invention is to provide a computer-readable storage medium that applies the above image recognition method based on multi-scale feature fusion.

To achieve the first object, the present invention provides an image recognition method based on multi-scale feature fusion. The method comprises the following steps. First, a training sample set is prepared. Second, a bidirectional cascade neural network model is constructed. This model includes a first recognition neural network comprising a first encoder and a first decoder. The first encoder downsamples the image to extract multi-scale feature maps. The first decoder takes a first current feature map output by the first encoder and a second current feature map from the level preceding the first current feature map, and convolves each of them to obtain a first current convolution feature map and a second current convolution feature map. It then upsamples the previous feature map output by the decoder and cascades the first current convolution feature map, the second current convolution feature map and the upsampled previous feature map to form a first cascade map. Finally, it remaps the features of the first cascade map to obtain a first feature map. Third, the bidirectional cascade neural network model is trained with the training sample set. Fourth, the image to be recognized is input into the trained model to obtain a recognition result.
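The following minimal sketch uses only the Python standard library to show how the first decoder could assemble the first cascade map at the shape level. It rests on several assumptions. Feature maps are nested lists indexed [channel][row][column], and SAConv is replaced by an identity map. The 1/8-scale second current feature map is brought down to 1/16 with 2×2 average pooling, standing in for a stride-2 convolution that the text does not specify. The 1/32-scale previous feature map is upsampled by nearest-neighbour interpolation. All function names and channel counts are hypothetical.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # indexed [channel][row][column]


def shape(fm: FeatureMap) -> Tuple[int, int, int]:
    return len(fm), len(fm[0]), len(fm[0][0])


def zeros(c: int, h: int, w: int) -> FeatureMap:
    return [[[0.0] * w for _ in range(h)] for _ in range(c)]


def downsample2(fm: FeatureMap) -> FeatureMap:
    """2x2 average pooling with stride 2 (stand-in for a stride-2 convolution)."""
    return [[[(ch[2 * i][2 * j] + ch[2 * i][2 * j + 1]
               + ch[2 * i + 1][2 * j] + ch[2 * i + 1][2 * j + 1]) / 4.0
              for j in range(len(ch[0]) // 2)]
             for i in range(len(ch) // 2)]
            for ch in fm]


def upsample2(fm: FeatureMap) -> FeatureMap:
    """Nearest-neighbour 2x upsampling."""
    return [[[ch[i // 2][j // 2] for j in range(2 * len(ch[0]))]
             for i in range(2 * len(ch))]
            for ch in fm]


def cascade(*maps: FeatureMap) -> FeatureMap:
    """Concatenate along the channel axis; spatial sizes must agree."""
    sizes = {shape(m)[1:] for m in maps}
    if len(sizes) != 1:
        raise ValueError(f"spatial sizes differ: {sizes}")
    return [ch for m in maps for ch in m]


def first_cascade_map(first_current: FeatureMap,
                      second_current: FeatureMap,
                      previous: FeatureMap) -> FeatureMap:
    a = first_current                # 1/16 scale; SAConv omitted (identity)
    b = downsample2(second_current)  # 1/8  -> 1/16
    c = upsample2(previous)          # 1/32 -> 1/16
    return cascade(a, b, c)


# Hypothetical 608 x 608 input: 1/8 -> 76, 1/16 -> 38, 1/32 -> 19 pixels per side.
fused = first_cascade_map(zeros(4, 38, 38), zeros(2, 76, 76), zeros(8, 19, 19))
print(shape(fused))  # (14, 38, 38)
```

The `cascade` check makes the scale bookkeeping explicit: all three inputs must be resampled to the same 1/16 grid before their channels can be stacked.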

As the above scheme shows, when the first decoder extracts a feature map, it fuses the first current feature map output by the first encoder with the second current feature map from the preceding level. This retains a more accurate localization signal, so the network better recognizes discriminative features of the target region. The first decoder also fuses the previous feature map, which increases the spatial dimension across cascades and improves the recognition of small targets. In the first recognition neural network, the first current feature map and the second current feature map are each convolved with SAConv. SAConv adaptively captures relationships between different positions in the input feature map, adjusts the convolution kernel weights according to the input, and fuses and enhances its contextual information. The network therefore learns more feature information and maintains good performance even when trained on a small training set.
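The switch idea behind SAConv can be seen in a deliberately simplified one-dimensional sketch. It assumes the formulation of the original SAConv design (DetectoRS), where a switch s blends two branches: a convolution with dilation 1 and weights w, and a convolution with dilation 3 and weights w + Δw. In that design the switch is computed per position from the input, and global context modules are added around the convolution. Here a single scalar switch s = sigmoid(a·mean(x) + b) replaces all of that, with hypothetical parameters `a` and `b`.

```python
import math
from typing import List


def dilated_conv1d(x: List[float], w: List[float], dilation: int) -> List[float]:
    """'Same'-size 1D cross-correlation with zero padding and the given dilation."""
    half = (len(w) // 2) * dilation
    out = []
    for i in range(len(x)):
        acc = 0.0
        for t, wt in enumerate(w):
            idx = i - half + t * dilation
            if 0 <= idx < len(x):
                acc += wt * x[idx]
        out.append(acc)
    return out


def saconv1d(x: List[float], w: List[float], delta_w: List[float],
             a: float = 1.0, b: float = 0.0) -> List[float]:
    """Blend a dilation-1 branch (weights w) and a dilation-3 branch (weights w + delta_w)."""
    s = 1.0 / (1.0 + math.exp(-(a * sum(x) / len(x) + b)))  # scalar switch in (0, 1)
    small = dilated_conv1d(x, w, dilation=1)
    large = dilated_conv1d(x, [wi + di for wi, di in zip(w, delta_w)], dilation=3)
    return [s * p + (1.0 - s) * q for p, q in zip(small, large)]


impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(saconv1d(impulse, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], a=0.0))
# [0.5, 0.0, 0.5, 1.0, 0.5, 0.0, 0.5]
```

The impulse response shows both receptive fields at once: the dilation-1 branch spreads to neighbours 2 to 4, and the dilation-3 branch reaches positions 0 and 6. The switch decides how much each branch contributes.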

In a further aspect, the feature remapping that the first decoder performs on the first cascade map comprises four steps: convolving the first cascade map; splitting the convolved first cascade map into a first split cascade map and a second split cascade map; applying deformable convolution to the first split cascade map to output a first split convolution cascade map; and cascading the first split convolution cascade map with the second split cascade map and convolving the result to obtain the first feature map.
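The split-transform-merge steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions. Feature maps are flattened to [channel][position], both convolutions are 1×1, and the deformable branch is abstracted as an arbitrary callable. The names `pointwise_conv`, `csp_remap` and `double_branch` are hypothetical.

```python
from typing import Callable, List

Channels = List[List[float]]  # indexed [channel][flattened spatial position]


def pointwise_conv(x: Channels, weight: List[List[float]]) -> Channels:
    """1x1 convolution: out[o][p] = sum_c weight[o][c] * x[c][p]."""
    positions = len(x[0])
    return [[sum(w_oc * x[c][p] for c, w_oc in enumerate(row)) for p in range(positions)]
            for row in weight]


def csp_remap(x: Channels, w_in: List[List[float]], w_out: List[List[float]],
              branch: Callable[[Channels], Channels]) -> Channels:
    y = pointwise_conv(x, w_in)                   # convolve the cascade map
    half = len(y) // 2
    part1, part2 = y[:half], y[half:]             # split along channels
    part1 = branch(part1)                         # e.g. two 3x3 deformable convolutions
    return pointwise_conv(part1 + part2, w_out)   # cascade, then convolve


def identity(n: int) -> List[List[float]]:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def double_branch(part: Channels) -> Channels:
    """Placeholder for the deformable-convolution branch."""
    return [[2.0 * v for v in ch] for ch in part]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(csp_remap(x, identity(4), identity(4), double_branch))
# [[2.0, 4.0], [6.0, 8.0], [5.0, 6.0], [7.0, 8.0]]
```

Only the first half of the channels passes through the expensive branch. The second half is carried across unchanged, which is where the CSP design saves computation.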

Feature remapping is thus implemented by a feature remapping module. The feature remapping module in the first decoder incorporates deformable convolution, which introduces offsets that dynamically adjust the sampling positions. This lets it adapt better to the deformation of plant morphological features.
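The following sketch computes one output value of a single-channel 3×3 deformable convolution in the DCNv1 style. Each regular grid tap is shifted by a fractional offset, and the input is read at the shifted position by bilinear interpolation. The offsets are passed in by hand for illustration. In a real deformable convolution layer they are predicted from the input by an additional convolution. Function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Image = List[List[float]]

# Regular 3x3 sampling grid, row-major; index 4 is the centre tap.
GRID_3X3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear_sample(img: Image, y: float, x: float) -> float:
    """Sample img at a fractional position; out-of-range neighbours count as zero."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                val += (1.0 - abs(y - yy)) * (1.0 - abs(x - xx)) * img[yy][xx]
    return val


def deform_conv_at(img: Image, py: int, px: int, weights: Sequence[float],
                   offsets: Sequence[Tuple[float, float]]) -> float:
    """One output of a 3x3 deformable convolution; offsets[k] = (dy, dx) shifts tap k."""
    return sum(wk * bilinear_sample(img, py + gy + oy, px + gx + ox)
               for (gy, gx), wk, (oy, ox) in zip(GRID_3X3, weights, offsets))


img = [[0.0] * 5 for _ in range(5)]
img[2][3] = 1.0                              # a single bright pixel right of the centre
centre_only = [0.0] * 4 + [1.0] + [0.0] * 4
print(deform_conv_at(img, 2, 2, centre_only, [(0.0, 0.0)] * 9))  # 0.0
print(deform_conv_at(img, 2, 2, centre_only, [(0.0, 0.5)] * 9))  # 0.5
```

With zero offsets the centre tap misses the bright pixel. A learned half-pixel shift lets the same kernel pick up half of it, which is the mechanism that allows the kernel to follow deformed plant shapes.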

In a further aspect, the bidirectional cascade neural network model also includes a second recognition neural network comprising a second encoder and a second decoder. The second encoder downsamples the image to extract multi-scale feature maps. The second decoder first passes the fourth target feature map output by the second encoder through the attention mechanism module, upsamples the result, and fuses it with the first target feature map by weighted fusion to obtain a first fused feature map. It then upsamples the first fused feature map and fuses it with the second target feature map by weighted fusion to form a second fused feature map, whose features it remaps to obtain a second feature map. Next, it fuses the second feature map, the first target feature map and the first fused feature map by weighted fusion to obtain a third fused feature map. Finally, it remaps the features of the third fused feature map through the second feature remapping module to obtain a third feature map.

Thus, in the second recognition neural network, the fourth target feature map is upsampled and fused by weighting with the first target feature map from the preceding level. This continues until all feature maps of the second encoder have been fused, and feature remapping then yields the second feature map. The first target feature map output by the second encoder, the first fused feature map and the second feature map are then used as inputs to feature remapping to obtain the third feature map. The decoding process of the second recognition neural network preserves gradient diversity and integrates the feature maps from every stage of the network. Its decoding modules use weighted fusion, and through repeated bottom-up and top-down processing the second recognition neural network maintains high-quality feature information.

In a further aspect, inputting the image to be recognized into the trained bidirectional cascade neural network model comprises three rules. First, determine whether the processor's workload in the current unit of time is greater than a preset value. If it is greater, input the image to be recognized into the second recognition neural network for processing. If it is less, input the image to be recognized into the first recognition neural network for processing.
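A minimal Python sketch of this routing rule follows. The text does not say how "workload per unit time" is measured, so `current_load` and `preset` are hypothetical numeric inputs, for example a utilisation fraction. The text also leaves the case of equality open. Sending it to the first network is an assumption of this sketch.

```python
from typing import Callable, TypeVar

Img = TypeVar("Img")
Result = TypeVar("Result")


def recognize(image: Img, current_load: float, preset: float,
              first_net: Callable[[Img], Result],
              second_net: Callable[[Img], Result]) -> Result:
    """Route the image by processor load.

    load > preset -> second (lighter, DWConv-based) network;
    otherwise     -> first (heavier, deformable-convolution) network.
    The equality case going to the first network is an assumption."""
    net = second_net if current_load > preset else first_net
    return net(image)


def first(img: str) -> str:
    return "first network: " + img


def second(img: str) -> str:
    return "second network: " + img


print(recognize("field_001.jpg", 0.9, 0.7, first, second))  # second network: field_001.jpg
```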

Choosing between the first and the second recognition neural network according to the current workload thus reduces the processor's computational load.

In a further aspect, before the previous feature map output by the decoder is upsampled, the method further comprises the following. The third current feature map output by the first encoder is input into the attention mechanism module to obtain a third current attention feature map. The third current attention feature map, the first current feature map and the third current feature map are each convolved. The convolved third current attention feature map, the resampled first current feature map and the resampled third current feature map are cascaded to obtain a second cascade map. Finally, the features of the second cascade map are remapped to form the previous feature map.

Thus, the previous feature map is obtained by cascading and remapping two encoder feature maps together with the attention feature map output by the attention mechanism module. Here the third current feature map is one level higher (deeper) than the first current feature map. Fusing an additional lower-level feature from the first encoder retains a more accurate localization signal. This connection more effectively strengthens the network's ability to learn the most discriminative features of the target region, and it is also important for localizing small targets.

In a further aspect, the first recognition neural network also includes a first loss module that processes the feature map output by the first decoder.

The first loss module comprises a classification loss and a regression loss. The classification loss penalizes the difference between the predicted and true categories. The regression loss constrains the model's learning of the position and shape of the predicted box. Combining the two gives the total loss of the first recognition neural network. Iteratively minimizing this total loss drives the model to improve its classification and localization accuracy, ultimately achieving accurate detection and localization of crops in agricultural production.

In a further aspect, the second recognition neural network also includes a second loss module that processes the feature map output by the second decoder.

The second loss module likewise comprises a classification loss and a regression loss. The classification loss penalizes the difference between the predicted and true categories. The regression loss constrains the model's learning of the position and shape of the predicted box. Combining the two gives the total loss of the second recognition neural network. Iteratively minimizing this total loss drives the model to improve its classification and localization accuracy, ultimately achieving accurate detection and localization of crops in agricultural production.

To achieve the second object, the present invention provides a computer device comprising a processor and a memory. The memory stores a computer program that, when executed by the processor, implements the above image recognition method based on multi-scale feature fusion.

To achieve the third object, the present invention provides a computer-readable storage medium storing a computer program that, when executed, implements the above image recognition method based on multi-scale feature fusion.

Brief Description of the Drawings

FIG. 1 is a flow chart of an embodiment of the image recognition method based on multi-scale feature fusion according to the present invention.

FIG. 2 is a structural block diagram of the bidirectional cascade neural network model in an embodiment of the image recognition method based on multi-scale feature fusion according to the present invention.

FIG. 3 is a schematic diagram of the feature remapping module in the first decoder in an embodiment of the image recognition method based on multi-scale feature fusion according to the present invention.

The present invention is further described below with reference to the accompanying drawings and embodiments.

Detailed Description of the Embodiments

Embodiment of the image recognition method based on multi-scale feature fusion:

The image recognition method based on multi-scale feature fusion of the present invention uses the encoder's multi-level feature maps during decoding to strengthen the localization signal.

Referring to FIG. 1, which is a flow chart of an embodiment of the image recognition method based on multi-scale feature fusion of the present invention, step S1 is executed first to prepare a training sample set. The training samples are obtained by photographing crops in farmland, and the crops in the training images are labeled. To reduce the processor's computational load, the longest side of each input image is set to 608 pixels and the other side is scaled proportionally. This keeps computation efficient while retaining as much of the original image information as possible.
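The resizing rule above reduces to a single scale factor. The Python sketch below computes the target size. Rounding to the nearest integer is an assumption, because the text does not specify how fractional side lengths are handled.

```python
from typing import Tuple


def scaled_size(width: int, height: int, longest: int = 608) -> Tuple[int, int]:
    """Resize so the longer side equals `longest`, keeping the aspect ratio."""
    scale = longest / max(width, height)
    return round(width * scale), round(height * scale)


print(scaled_size(4000, 3000))  # (608, 456)
print(scaled_size(1080, 1920))  # (342, 608)
```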

After the training sample set is prepared, step S2 is executed to construct the bidirectional cascade neural network model PlantBiCNet. Referring to FIG. 2, the model comprises a first recognition neural network model and a second recognition neural network model. The first recognition neural network model comprises a first encoder and a first decoder. The first encoder uses the CSPDarknet structure, which repeatedly downsamples the image with stride-2 convolution layers to produce multi-scale feature maps. In this embodiment, the first encoder produces three feature maps: the second current feature map at 1/8 of the original resolution, the first current feature map at 1/16, and the third current feature map at 1/32.
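The spatial sizes of the three encoder outputs follow from the standard convolution output-length formula. The Python sketch below assumes each stride-2 layer uses a 3×3 kernel with padding 1, which the text does not state. Under that assumption, a 608-pixel side becomes 76, 38 and 19 pixels at the 1/8, 1/16 and 1/32 scales.

```python
from typing import List


def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pyramid_sizes(size: int, num_downsamples: int = 5) -> List[int]:
    sizes = []
    for _ in range(num_downsamples):
        size = conv_out(size)
        sizes.append(size)
    return sizes


sizes = pyramid_sizes(608)  # [304, 152, 76, 38, 19]
second_current, first_current, third_current = sizes[2], sizes[3], sizes[4]
print(second_current, first_current, third_current)  # 76 38 19
```

The 456-pixel short side of a 4:3 image would give 57, 29 and 15 at these scales. Since 2 × 15 ≠ 29, upsampling and concatenation would not line up, which is why detectors commonly pad inputs to a multiple of 32. The text does not say whether that is done here.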

In the first decoder, the third current feature map output by the first encoder is input into the attention mechanism module. This is an Mlt-ECA (Multi-Efficient Channel Attention) module that adaptively reweights features, and it outputs a third current attention feature map. The third current attention feature map, the first current feature map and the third current feature map are then each convolved. Convolution in the first decoder uses the SAConv (Switchable Atrous Convolution) module, which is mainly responsible for dimension transformation and connection operations. Unlike ordinary convolution, SAConv adaptively captures relationships between different positions in the input feature map. Specifically, it introduces a switch mechanism that adjusts the convolution kernel weights according to the input feature map, and it includes context modules that fuse and enhance the contextual information of the input feature map.
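The Python sketch below implements plain efficient channel attention (ECA) as described in the ECA-Net paper. It applies global average pooling per channel, a 1D convolution across neighbouring channels, a sigmoid, and a per-channel rescale. The text does not describe how the Mlt-ECA variant differs from this, so treat the sketch as the baseline mechanism only. The kernel weights would be learned in practice; here they are passed in. Function names are hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed [channel][row][column]


def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive 1D kernel size from the ECA paper, forced to be odd."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca(fm: FeatureMap, kernel: List[float]) -> FeatureMap:
    """GAP -> 1D conv across channels -> sigmoid -> rescale each channel."""
    c = len(fm)
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fm]
    half = len(kernel) // 2
    weights = []
    for i in range(c):
        z = sum(k * pooled[i - half + t]
                for t, k in enumerate(kernel) if 0 <= i - half + t < c)
        weights.append(1.0 / (1.0 + math.exp(-z)))
    return [[[weights[i] * v for v in row] for row in fm[i]] for i in range(c)]


print(eca_kernel_size(256))                           # 5
print(eca([[[2.0]], [[0.0]]], [0.0, 0.0, 0.0]))       # [[[1.0]], [[0.0]]]
```

Because the 1D kernel only spans a few neighbouring channels, ECA adds very few parameters compared with a fully connected squeeze-and-excitation block.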

The convolved third current attention feature map, the resampled first current feature map and the resampled third current feature map are cascaded to obtain a second cascade map, whose features are remapped to form the previous feature map. Referring to FIG. 3, feature remapping in the first decoder is performed by the first feature remapping module (CSPDCN). This module combines the CSP bottleneck layer (CSPBottleneck) with deformable convolution (DCN, Deformable Convolutional Networks), replacing the two convolution layers in the CSPLayer with two 3×3 deformable convolution layers. Conventional convolution has a fixed geometric structure and supports only limited geometric transformations. Deformable convolution instead introduces offsets that dynamically adjust the sampling positions, and thus adapts better to the deformation of plant morphological features.

After the previous feature map is formed, the first current feature map output by the first encoder and the second current feature map from the preceding level are each convolved with the SAConv module. This yields the first current convolution feature map and the second current convolution feature map. The previous feature map output by the decoder is upsampled. The first current convolution feature map, the second current convolution feature map and the upsampled previous feature map are then cascaded to form the first cascade map, which the first feature remapping module remaps to obtain the first feature map. This remapping comprises four steps: convolving the first cascade map; splitting the convolved map into a first split cascade map and a second split cascade map; applying two deformable convolutions to the first split cascade map to output the first split convolution cascade map; and cascading the first split convolution cascade map with the second split cascade map and convolving the result to obtain the first feature map. Splitting the first cascade map into two parts and merging them through a cross-stage hierarchy yields richer gradient combinations. This design lets gradients propagate along different network paths, which helps improve the network's learning ability while reducing computation.

After the first decoder outputs its feature map, a Dropout strategy is applied to it. This means Dropout is applied at randomly selected positions of the feature map.

Both the previous feature map and the first feature map fuse two current feature maps. This retains a more accurate localization signal and improves the network's ability to recognize discriminative features of the target region. The first feature map additionally fuses the previous feature map, which increases the spatial dimension across cascades and improves recognition of small objects.

The first recognition neural network model also includes a first loss module that processes the feature map output by the first decoder. The first loss module comprises a classification loss and a regression loss. The classification loss is computed with the BCE (Binary Cross-Entropy) function and penalizes the difference between the predicted and true categories. The regression loss constrains the model's learning of the position and shape of the predicted box and is guided by the SCYLLA-IoU (SIoU) and Distribution Focal Loss (DFL) functions. It can be regarded as the sum of the box-matching loss computed by SIoU and the distance-matching loss computed by DFL. SIoU guides the model to learn how well bounding boxes match by taking their scale and angle into account, and DFL optimizes the bounding box position through a smooth L1 loss. Combining the classification loss and the regression loss gives the total loss of the first recognition neural network. Iteratively minimizing this total loss drives the model to improve its classification and localization accuracy, ultimately achieving accurate detection and localization of crops in agricultural production.
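The loss composition can be sketched in Python as follows. BCE is written out explicitly. The SIoU and DFL terms are taken as already-computed numbers, because their full definitions go beyond this passage. The per-term weights `w_cls`, `w_siou` and `w_dfl` are hypothetical knobs. With their default value of 1.0 the total reduces to the plain sum described in the text.

```python
import math
from typing import Sequence


def bce(prob: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one predicted probability (clamped for stability)."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def classification_loss(probs: Sequence[float], targets: Sequence[float]) -> float:
    """Mean BCE over all predicted class probabilities."""
    return sum(bce(p, t) for p, t in zip(probs, targets)) / len(probs)


def total_loss(cls_loss: float, siou_loss: float, dfl_loss: float,
               w_cls: float = 1.0, w_siou: float = 1.0, w_dfl: float = 1.0) -> float:
    """Classification loss plus regression loss (SIoU term + DFL term)."""
    regression = w_siou * siou_loss + w_dfl * dfl_loss
    return w_cls * cls_loss + regression


cls = classification_loss([0.9, 0.2], [1.0, 0.0])
print(total_loss(cls, siou_loss=0.3, dfl_loss=0.1))
```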

The first recognition neural network model also includes an anchor-free algorithm that performs prediction on the feature map output by the first decoder. The algorithm generates prediction boxes from the information on the feature map.
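The text does not specify the anchor-free box parameterization. The Python sketch below assumes one common choice. Each grid cell predicts its distances to the four box edges (left, top, right, bottom) in units of the feature-map stride, measured from the cell centre, and those distances are converted back to input-image pixels. Non-maximum suppression is omitted. Function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in input-image pixels


def decode_cell(row: int, col: int, stride: int, ltrb: Sequence[float]) -> Box:
    """Turn distances (left, top, right, bottom), in stride units, into a box."""
    cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
    l, t, r, b = (d * stride for d in ltrb)
    return cx - l, cy - t, cx + r, cy + b


def decode_map(scores: List[List[float]], dists: List[List[Sequence[float]]],
               stride: int, threshold: float = 0.5) -> List[Tuple[float, Box]]:
    """Keep every grid cell whose score passes the threshold."""
    return [(s, decode_cell(r, c, stride, dists[r][c]))
            for r, row in enumerate(scores)
            for c, s in enumerate(row) if s >= threshold]


print(decode_cell(2, 3, 16, (1, 1, 2, 2)))  # (40.0, 24.0, 88.0, 72.0)
```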

The second recognition neural network comprises a second encoder and a second decoder. The second encoder downsamples the image to extract multi-scale feature maps. Like the first encoder, it uses the CSPDarknet structure, which repeatedly downsamples the image with stride-2 convolution layers. In this embodiment, the second encoder produces three feature maps: the second target feature map at 1/8 of the original resolution, the first target feature map at 1/16, and the fourth target feature map at 1/32. Unlike the first decoder, the second decoder uses DWConv (Depthwise Convolution) for convolution.

The second decoder first inputs the fourth target feature map into the attention mechanism module to reweight its features, then applies DWConv to adjust the number of channels and dimensions. It upsamples the result by bilinear interpolation to form a feature map at 1/16 of the original resolution and again applies DWConv to adjust channels and dimensions. The weighted fusion module then fuses this map with the first target feature map to obtain the first fused feature map. The first fused feature map is upsampled to 1/8 of the original resolution and fused with the second target feature map to form the second fused feature map. Finally, the second feature remapping module (CSPLayer) remaps the second fused feature map to obtain the second feature map.
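The Python sketch below shows 2× bilinear upsampling of one channel. It uses the half-pixel convention, in which output index i maps to input coordinate (i + 0.5) / 2 - 0.5, clamped to the borders. This is the convention of PyTorch's `interpolate` with `align_corners=False`. Which convention the described network uses is not stated, so this choice is an assumption. The function name is hypothetical.

```python
import math
from typing import List


def upsample_bilinear2x(ch: List[List[float]]) -> List[List[float]]:
    """2x bilinear upsampling of one channel, half-pixel (align_corners=False) convention."""
    h, w = len(ch), len(ch[0])

    def source(coord: int, size: int) -> float:
        # Map an output index back to a fractional input index, clamped to the edges.
        return min(max((coord + 0.5) / 2.0 - 0.5, 0.0), size - 1.0)

    out = []
    for i in range(2 * h):
        y = source(i, h)
        y0 = math.floor(y)
        y1, fy = min(y0 + 1, h - 1), y - y0
        row = []
        for j in range(2 * w):
            x = source(j, w)
            x0 = math.floor(x)
            x1, fx = min(x0 + 1, w - 1), x - x0
            top = ch[y0][x0] * (1 - fx) + ch[y0][x1] * fx
            bottom = ch[y1][x0] * (1 - fx) + ch[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


print(upsample_bilinear2x([[0.0, 1.0]]))
# [[0.0, 0.25, 0.75, 1.0], [0.0, 0.25, 0.75, 1.0]]
```

Applied to a 19 × 19 map at the 1/32 scale, this produces the 38 × 38 grid of the 1/16 scale, so the result can be fused element-wise with the first target feature map.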

The second decoder then fuses the second feature map, the first target feature map and the first fused feature map by weighting to obtain the third fused feature map. The second feature remapping module (CSPLayer) remaps it to obtain the third feature map. The second feature map is convolved with DWConv before this weighted fusion. The decoding process of the second decoder preserves gradient diversity and integrates the feature maps from every stage of the network. Its decoding modules use weighted fusion, and through repeated bottom-up and top-down processing the second recognition neural network maintains high-quality feature information. Specifically, the weighted fusion module uses learnable weight coefficients that control how strongly the feature maps from different levels influence target detection.
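The text says only that the fusion weights are learnable. The Python sketch below assumes the "fast normalized fusion" form popularized by BiFPN: O = Σ_i (w_i / (ε + Σ_j w_j)) · I_i, with each w_i kept non-negative by a ReLU. Inputs are flattened, same-shape maps. The 1/8-scale second feature map would therefore have to be brought to the 1/16 grid first, presumably by the DWConv mentioned above, although this is also an assumption. The function name is hypothetical.

```python
from typing import List, Sequence


def weighted_fusion(maps: Sequence[List[float]], raw_weights: Sequence[float],
                    eps: float = 1e-4) -> List[float]:
    """Fuse same-shape (flattened) feature maps with learnable, normalized weights."""
    if len(maps) != len(raw_weights):
        raise ValueError("one weight per input map is required")
    if len({len(m) for m in maps}) != 1:
        raise ValueError("all maps must have the same shape")
    w = [max(0.0, r) for r in raw_weights]   # ReLU keeps weights non-negative
    total = sum(w) + eps
    norm = [wi / total for wi in w]          # sums to just under 1
    return [sum(n * m[k] for n, m in zip(norm, maps)) for k in range(len(maps[0]))]


print(weighted_fusion([[1.0, 1.0], [3.0, 3.0]], [1.0, 1.0], eps=0.0))  # [2.0, 2.0]
```

Normalizing by the sum, rather than using a softmax, keeps each weight interpretable as a share of the output while avoiding exponentials.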

Fusing the second feature map into the third feature map yields clear advantages in the granularity of the feature representation and the diversity of feature fusion. This strengthens the model's ability to perceive and capture deep information.

After the second decoder outputs its feature map, a Dropout strategy is applied to it. This means Dropout is applied at randomly selected positions of the feature map. Specifically, Dropout can be applied independently to each channel of the feature map, or certain positions of the feature map can be set to 0 with a given probability. This strategy helps the model learn the spatial information in the feature map and reduces the risk of overfitting. By randomly discarding some features, the model is forced to learn more robust feature representations instead of relying on a few specific features.
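The Python sketch below shows the two variants named above. Element-wise dropout zeroes individual positions. Channel-wise dropout zeroes whole channels, similar in spirit to PyTorch's `Dropout2d`. Both use "inverted" scaling: surviving values are divided by 1 - p, so the expected activation is unchanged and nothing needs rescaling at inference. The text does not specify the scaling rule, so that is an assumption. Function names are hypothetical.

```python
import random
from typing import List

FeatureMap = List[List[List[float]]]  # indexed [channel][row][column]


def dropout_elementwise(fm: FeatureMap, p: float, rng: random.Random) -> FeatureMap:
    """Zero each value independently with probability p (inverted dropout scaling)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [[[v / keep if rng.random() >= p else 0.0 for v in row] for row in ch]
            for ch in fm]


def dropout_channelwise(fm: FeatureMap, p: float, rng: random.Random) -> FeatureMap:
    """Zero whole channels with probability p, dropping their spatial layout together."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [[[0.0 for _ in row] for row in ch] if rng.random() < p
            else [[v / keep for v in row] for row in ch]
            for ch in fm]


rng = random.Random(0)
ones = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]
print(dropout_channelwise(ones, 0.5, rng))
```

Dropping whole channels matters for convolutional features. Neighbouring positions are strongly correlated, so zeroing isolated positions removes less information than the nominal rate suggests.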

The second recognition neural network also includes a second loss module that processes the feature map output by the second decoder. The second loss module comprises a classification loss and a regression loss. The classification loss is computed with the BCE (Binary Cross-Entropy) function and penalizes the difference between the predicted and true categories. The regression loss constrains the model's learning of the position and shape of the predicted box and is guided by the SCYLLA-IoU (SIoU) and Distribution Focal Loss (DFL) functions. It can be regarded as the sum of the box-matching loss computed by SIoU and the distance-field matching loss computed by DFL. SIoU guides the model to learn how well bounding boxes match by taking their scale and angle into account, and DFL optimizes the bounding box position through a smooth L1 loss. Combining the classification loss and the regression loss gives the total loss of the second recognition neural network. Iteratively minimizing this total loss drives the model to improve its classification and localization accuracy, ultimately achieving accurate detection and localization of crops in agricultural production.

The second recognition neural network also uses an anchor-free algorithm to make predictions on the feature map output by the second decoder: instead of refining preset anchor boxes, the anchor-free algorithm generates prediction boxes directly from the information at each position of the feature map.
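A common anchor-free decoding scheme (used, for example, in FCOS-style and YOLOv8-style heads) treats the centre of each feature-map cell as an anchor point and predicts the distances from that point to the four box edges. The sketch below assumes that scheme, with distances in units of the feature-map stride, and also shows how a DFL-style distribution over distance bins collapses to a single distance by taking its expected value. The text does not specify the decoding, so function names and conventions here are hypothetical.

```python
import math


def expected_distance(bin_logits):
    """Collapse a distribution over integer distance bins 0..n-1 into its expected value."""
    m = max(bin_logits)
    exps = [math.exp(z - m) for z in bin_logits]
    total = sum(exps)
    return sum(i * e / total for i, e in enumerate(exps))


def decode_anchor_free(distances, stride):
    """Turn per-cell (left, top, right, bottom) distances into boxes.

    distances: dict mapping (row, col) grid cell -> (l, t, r, b) in units of stride.
    Returns a dict mapping (row, col) -> (x1, y1, x2, y2) in input-image pixels.
    """
    boxes = {}
    for (row, col), (l, t, r, b) in distances.items():
        cx = (col + 0.5) * stride  # cell centre acts as the anchor point
        cy = (row + 0.5) * stride
        boxes[(row, col)] = (cx - l * stride, cy - t * stride,
                             cx + r * stride, cy + b * stride)
    return boxes
```

Because no anchor shapes are preset, the only hyperparameter left in decoding is the stride of each output level.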

In the second recognition neural network, DWConv (depthwise convolution) is used to reduce computational complexity, whereas the first feature remapping module incorporates deformable convolution; consequently, the first feature remapping module requires more computation than the second feature remapping module.
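The cost difference can be made concrete by counting weights. The helpers below (hypothetical names, bias terms ignored) compare a standard k x k convolution, a depthwise-separable convolution (depthwise k x k followed by pointwise 1 x 1), and a DCNv1-style deformable convolution, whose extra offset branch is assumed here to be a k x k convolution predicting 2·k·k offset channels. Multiply-accumulate counts scale the weight count by the number of output positions.

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dw_separable_params(k, c_in, c_out):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


def deform_conv_params(k, c_in, c_out):
    """Main k x k kernel plus an assumed k x k offset branch with 2*k*k output channels."""
    return conv_params(k, c_in, c_out) + conv_params(k, c_in, 2 * k * k)


def macs(params, h_out, w_out):
    """Multiply-accumulates: each weight is used once per output position."""
    return params * h_out * w_out
```

For a 3 x 3 layer with 64 input and 64 output channels this gives 36,864 weights for a standard convolution, 4,672 for the depthwise-separable version, and 47,232 for the deformable version (before counting its bilinear sampling), which is consistent with the first feature remapping module being the heavier of the two.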

After the bidirectional cascade neural network model is constructed, step S3 is executed: the model is trained with the training sample set. Training uses SGD (Stochastic Gradient Descent) as the optimizer, with a momentum factor of 0.937 and an initial learning rate of 0.01. An early-stopping strategy is also applied: if the loss value does not decrease for 50 consecutive iterations, training stops, which avoids overfitting.
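The two training settings can be sketched in plain Python: one SGD step with classical momentum using the stated hyperparameters (the PyTorch-style update v = m·v + g, w = w - lr·v is assumed), and an early-stopping counter with a patience of 50. Names are hypothetical, and "not improving" is assumed to mean "not strictly lower than the best loss so far".

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.937):
    """One SGD step with momentum: v = momentum * v + g; w = w - lr * v."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


class EarlyStopping:
    """Signal a stop once the loss has not improved for `patience` consecutive checks."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, loss):
        """Record one loss value; return True when training should stop."""
        if loss < self.best:
            self.best = loss
            self.bad_checks = 0  # improvement resets the counter
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience
```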

After the model has been trained with the training sample set, step S4 is executed: the image to be identified is input into the trained bidirectional cascade neural network model to obtain the recognition result. Inputting the image into the trained model further includes: judging whether the processing load of the processor in the current unit of time is greater than a preset value; if it is greater than the preset value, inputting the image to be identified into the second recognition neural network for processing; and if it is less than the preset value, inputting the image to be identified into the first recognition neural network for processing. Choosing between the first and the second recognition neural network according to the current processing load reduces the computational load on the processor.
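The routing rule can be sketched as a small dispatcher. How the processor load is measured over the unit of time, and the threshold value, are not specified, so both are inputs here; the case where the load exactly equals the threshold is also unspecified, and the sketch assumes it goes to the first network. All names are hypothetical.

```python
def route(image, load, threshold, first_net, second_net):
    """Pick a recognition network from the processor load in the current time window.

    load > threshold  -> lighter second network (DWConv-based)
    otherwise         -> heavier first network (deformable convolution); the
                         equal-to-threshold case is an assumption
    """
    net = second_net if load > threshold else first_net
    return net(image)
```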

By making full use of the feature maps output by both the encoder and the decoder, the invention strengthens the localization signal and the spatial dimension. This better use of spatial information improves the recognition of small target objects with higher efficiency, and small target objects can still be recognized when only a few training images are available.

Computer device embodiment:

The computer device of this embodiment includes a processor and a memory. The memory stores a computer program, and when the processor executes the computer program, the above image recognition method based on multi-scale feature fusion is implemented.

The computer device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the computer device may include more or fewer components, combine certain components, or include different components; for example, it may also include input/output devices, network access devices, buses, and so on.

Computer-readable storage medium embodiment:

The image recognition method based on multi-scale feature fusion described in the above embodiments may be stored, in the form of a computer program, in a computer-readable storage medium; when the computer program is executed by a processor, it carries out the steps of the above method embodiments. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of these. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

The above are only preferred embodiments of the present invention, but the inventive concept is not limited to them. Other equivalent embodiments may be included without departing from the concept of the invention, and those skilled in the art may make various obvious changes, readjustments, and substitutions without departing from the protection scope of the present invention.

Claims

1. An image identification method based on multi-scale feature fusion, characterized by comprising the following steps:

preparing a training sample set;

constructing a bidirectional cascade neural network model, wherein the bidirectional cascade neural network model comprises a first identification neural network, the first identification neural network comprises a first encoder and a first decoder, and the first encoder is used for downsampling an image to extract a multi-scale feature map;

the first decoder is configured to: obtain a first current feature map output by the first encoder and a second current feature map from the stage preceding the first current feature map; convolve the first current feature map and the second current feature map respectively to obtain a first current convolution feature map and a second current convolution feature map; upsample the previous feature map output by the decoder; cascade the first current convolution feature map, the second current convolution feature map, and the upsampled previous feature map to form a first cascade map; and perform feature remapping on the first cascade map to obtain a first feature map;

training the bidirectional cascade neural network model by using the training sample set;

and inputting the image to be identified into the trained bidirectional cascade neural network model to obtain an identification result.
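The decoder stage of claim 1 can be sketched in plain Python on [C][H][W] nested lists. Several details are assumptions not fixed by the claim: the convolutions are 1 x 1, the two encoder maps already share the spatial size of the 2x-upsampled previous decoder output, upsampling is nearest-neighbour, and a single 1 x 1 convolution stands in for the feature-remapping module. Function names are hypothetical.

```python
def conv1x1(fm, weights):
    """1 x 1 convolution: weights[o][i] scales input channel i into output channel o."""
    h, w = len(fm[0]), len(fm[0][0])
    return [[[sum(weights[o][i] * fm[i][y][x] for i in range(len(fm)))
              for x in range(w)]
             for y in range(h)]
            for o in range(len(weights))]


def upsample2x(fm):
    """Nearest-neighbour 2x upsampling: each value is repeated along both axes."""
    return [[[v for v in row for _ in range(2)] for row in ch for _ in range(2)]
            for ch in fm]


def cascade(*fms):
    """Channel-wise concatenation ("cascade") of maps with equal height and width."""
    if len({(len(f[0]), len(f[0][0])) for f in fms}) != 1:
        raise ValueError("spatial sizes must match")
    return [ch for f in fms for ch in f]


def decoder_stage(cur1, cur2, prev_out, w1, w2, w_remap):
    a = conv1x1(cur1, w1)          # first current convolution feature map
    b = conv1x1(cur2, w2)          # second current convolution feature map
    up = upsample2x(prev_out)      # previous decoder output, upsampled
    cat = cascade(a, b, up)        # first cascade map
    return conv1x1(cat, w_remap)   # stand-in for the feature-remapping module
```

Concatenating instead of adding keeps the encoder detail and the upsampled decoder context in separate channels, leaving it to the remapping step to learn how to mix them.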

2. The method for identifying images based on multi-scale feature fusion according to claim 1, wherein:

the first decoder performing feature remapping on the first cascade map to obtain the first feature map comprises: convolving the first cascade map; dividing the convolved first cascade map into a first split cascade map and a second split cascade map; performing deformable convolution on the first split cascade map to output a first split convolution cascade map; and cascading and convolving the first split convolution cascade map and the second split cascade map to obtain the first feature map.
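The remapping of claim 2 can be sketched as follows. The deformable convolution is written out for a single channel with stride 1 and zero padding, sampling each of the nine kernel taps at a fractional offset via bilinear interpolation, as in DCNv1. For brevity the sketch assumes a per-channel 3 x 3 kernel (a depthwise simplification) and offsets shared across channels and supplied by the caller rather than predicted by an offset convolution; it also omits the convolutions before the split and after the cascade. With all offsets zero the operation reduces to an ordinary 3 x 3 convolution. Names are hypothetical.

```python
import math


def bilinear(img, y, x):
    """Sample a 2-D list at fractional (y, x); positions outside the map read as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    total = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                total += (1 - abs(y - yy)) * (1 - abs(x - xx)) * img[yy][xx]
    return total


def deform_conv3x3(img, kernel, offsets):
    """Single-channel 3 x 3 deformable convolution (stride 1, zero padding).

    offsets[y][x] holds 9 (dy, dx) pairs, one per kernel tap.
    """
    h, w = len(img), len(img[0])
    taps = [(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k, (ky, kx) in enumerate(taps):
                dy, dx = offsets[y][x][k]
                acc += kernel[ky + 1][kx + 1] * bilinear(img, y + ky + dy, x + kx + dx)
            out[y][x] = acc
    return out


def split_remap(fm, kernels, offsets):
    """Split channels in half, deform-convolve the first half, cascade with the second half."""
    half = len(fm) // 2
    first, second = fm[:half], fm[half:]
    first_out = [deform_conv3x3(ch, kernels[i], offsets) for i, ch in enumerate(first)]
    return first_out + second
```

Passing only half of the channels through the expensive deformable branch is what keeps this split design cheaper than deform-convolving the whole map.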

3. The method for identifying images based on multi-scale feature fusion according to claim 1 or 2, characterized in that:

the bidirectional cascade neural network model further comprises a second identification neural network, wherein the second identification neural network comprises a second encoder and a second decoder, and the second encoder is used for downsampling an image to extract a multi-scale feature map;

the second decoder is configured to: input a fourth target feature map output by the second encoder into the attention mechanism module, upsample the result, and perform weighted fusion with the first target feature map to obtain a first fused feature map; perform weighted fusion of the first fused feature map and a second target feature map to form a second fused feature map, and perform feature remapping on the second fused feature map to obtain a second feature map; and perform weighted fusion of the first target feature map and the first fused feature map to obtain a third fused feature map, and perform feature remapping on the third fused feature map through the second feature remapping module to obtain a third feature map.
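The claim does not say how the weighted fusion is computed. One common choice, assumed here, is the fast normalised fusion introduced with BiFPN in EfficientDet: each input gets a learnable scalar weight clamped to be non-negative, and the weighted sum is divided by the sum of the weights plus a small epsilon. The function name is hypothetical, and inputs must already share a shape (for example after the upsampling step in the claim).

```python
def weighted_fusion(maps, raw_weights, eps=1e-4):
    """Fast normalised fusion of same-shaped [C][H][W] maps.

    out = sum_i w_i * x_i / (eps + sum_j w_j), with w_i = max(0, raw_weight_i).
    """
    ws = [max(0.0, w) for w in raw_weights]  # ReLU keeps weights non-negative
    norm = eps + sum(ws)
    c, h, wd = len(maps[0]), len(maps[0][0]), len(maps[0][0][0])
    return [[[sum(wi * m[ci][y][x] for wi, m in zip(ws, maps)) / norm
              for x in range(wd)]
             for y in range(h)]
            for ci in range(c)]
```

Normalising by the weight sum keeps the fused map on the same scale as its inputs, while still letting training decide how much each input contributes.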

4. A method of identifying images based on multi-scale feature fusion according to claim 3, characterized in that:

inputting the image to be identified into a trained bidirectional cascade neural network model comprises the following steps:

judging whether the processing load of the processor in the current unit of time is greater than a preset value;

if the processing load of the processor is greater than the preset value, inputting the image to be identified into the second identification neural network for processing;

and if the processing load of the processor is less than the preset value, inputting the image to be identified into the first identification neural network for processing.

5. The method for identifying images based on multi-scale feature fusion according to claim 1 or 2, characterized in that:

before upsampling the previous feature map output by the decoder, the method further comprises: inputting a third current feature map output by the first encoder into an attention mechanism module to obtain a third current attention feature map; convolving the third current attention feature map, the first current feature map, and the third current feature map respectively; cascading the convolved third current attention feature map, the sampled first current feature map, and the sampled third current feature map to obtain a second cascade map; and performing feature remapping on the second cascade map to form the previous feature map.
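The attention mechanism module is not specified in the claims. As one illustrative assumption, a Squeeze-and-Excitation (SE) channel-attention block could fill that role: it pools each channel to a scalar, passes the descriptors through two small fully connected layers, and rescales each channel by a sigmoid gate. The sketch below is hypothetical and omits biases.

```python
import math


def se_attention(fm, w1, w2):
    """Squeeze-and-Excitation channel attention on a [C][H][W] map.

    w1: (C // r) x C reduction weights; w2: C x (C // r) expansion weights.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fm]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate in (0, 1) per channel.
    hidden = [max(0.0, sum(wij * zj for wij, zj in zip(wi, z))) for wi in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(wij * hj for wij, hj in zip(wi, hidden))))
             for wi in w2]
    # Scale: reweight every channel by its gate.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, fm)]
```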

6. The method for identifying images based on multi-scale feature fusion of claim 5, wherein:

the first recognition neural network further comprises a first loss module, and the first loss module processes the feature map output by the first decoder.

7. The method for identifying images based on multi-scale feature fusion of claim 4, wherein:

the second recognition neural network further comprises a second loss module, and the second loss module processes the feature map output by the second decoder.

8. A computer device, characterized by comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the method for recognizing images based on multi-scale feature fusion according to any one of claims 1 to 7.

9. A computer-readable storage medium having stored thereon a computer program, characterized in that:

the computer program when executed implements a method of identifying images based on multi-scale feature fusion as claimed in any one of claims 1 to 7.