
WO2025060272A1 - Weakly supervised semantic segmentation method and apparatus based on attention mask - Google Patents


Info

Publication number
WO2025060272A1
Authority
WO
WIPO (PCT)
Prior art keywords
attention
matrix
token
encoder
semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2023/137612
Other languages
French (fr)
Chinese (zh)
Inventor
吴方闻
叶玥
王瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Publication of WO2025060272A1 publication Critical patent/WO2025060272A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • the present disclosure relates to the technical field of semantic segmentation, and in particular to a weakly supervised semantic segmentation method and device based on attention mask.
  • the current image semantic segmentation method uses a fully supervised training method with pixel-level annotations, which requires densely annotating the label information of all pixels in the image, thus greatly increasing the manpower and time costs.
  • the weakly supervised semantic segmentation method only requires training data with the object category labels that appear in the image. Therefore, the advantages in manpower and time costs have made weakly supervised semantic segmentation gain widespread attention.
  • Weakly supervised semantic segmentation refers to predicting the category label of each pixel in an image using bounding box annotations, scribble annotations, point annotations or image-level category annotations. The present disclosure mainly studies a method of training a neural network based on image-level labels (such as the categories of objects appearing in the image) to obtain a model for semantic segmentation of images.
  • Class Activation Map is a technique based on deep classification networks that is used to generate feature maps with the same number of channels as the total number of categories, showing the approximate location of objects in each category.
  • the present invention provides a method and device for weakly supervised semantic segmentation based on attention mask.
  • a weakly supervised semantic segmentation method based on attention mask comprising:
  • the attention encoder generates an attention matrix, a target mask matrix is randomly generated, and the attention matrix is compensated with the target mask matrix to obtain a compensated attention matrix; local category token features are generated according to the compensated attention matrix;
  • according to the auxiliary class activation map and the target mask matrix, determining the average activation value of the activation values in the class activation map that are not affected by the mask in the target mask matrix; if the average activation value is greater than a preset threshold, using the local class token feature as a positive local class token feature; if the average activation value is less than or equal to the preset threshold, using the local class token feature as a negative local class token feature;
  • the attention encoder is trained with minimizing the overall loss as the optimization goal, and the overall loss includes: the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local category token feature and the global category token feature, and the difference between the negative local category token feature and the global category token feature.
  • inputting the sample image into an attention encoder specifically includes:
  • the multiple initial patch token features are concatenated with the initial category token features and input into the attention encoder.
  • the randomly generating a target mask matrix specifically includes:
  • the initial mask matrix is upsampled by the preset multiple to obtain the target mask matrix.
  • the attention matrix includes: a query matrix, a key matrix and a value matrix;
  • the compensating the attention matrix by the target mask matrix to obtain a compensated attention matrix specifically includes:
  • A' is the compensated attention matrix
  • M is the target mask matrix
  • Q is the query matrix
  • K is the key matrix
  • V is the value matrix
  • Z is the output matrix
  • l is the lth self-attention encoding layer of the attention encoder.
  • the attention encoder includes a plurality of self-attention encoding layers; and the method further includes:
  • the output result of the target self-attention encoding layer includes intermediate patch token features, and the method further includes:
  • the attention encoder is trained with minimizing the overall loss as an optimization goal, wherein the overall loss also includes: the difference between the auxiliary classification result and the image classification label.
  • the output result of the target self-attention encoding layer includes the auxiliary class activation map, and the method further includes:
  • the attention encoder is trained with minimizing the overall loss as an optimization goal, wherein the overall loss also includes: a difference between the first affinity matrix and the second affinity matrix.
  • generating an attention matrix through the attention encoder, randomly generating a target mask matrix, compensating the attention matrix with the target mask matrix to obtain a compensated attention matrix, and generating a local category token feature according to the compensated attention matrix specifically includes:
  • a plurality of target mask matrices are randomly generated. For each target mask matrix, the sample image is input into the attention encoder to generate an attention matrix corresponding to the sample image. The attention matrix generated this time is compensated by the target mask matrix to obtain a corresponding compensated attention matrix. According to the corresponding compensated attention matrix, a local category token feature corresponding to the target mask matrix is generated; and a plurality of local category token features are used as the local category token features.
  • the method further comprises:
  • a weakly supervised semantic segmentation device based on attention mask comprising:
  • An acquisition module used to acquire a sample image and an image classification label corresponding to the sample image
  • An input module is used to input the sample image into an attention encoder, obtain patch token features and global category token features through the attention encoder, generate an image classification result and a class activation map for the sample image through the patch token features, obtain a first semantic segmentation result for the sample image through the class activation map, and decode the patch token features to obtain a second semantic segmentation result;
  • the mask compensation module is used to generate an attention matrix through the attention encoder, randomly generate a target mask matrix, compensate the attention matrix with the target mask matrix to obtain a compensated attention matrix, and generate local category token features according to the compensated attention matrix;
  • a judgment module used for determining, according to the auxiliary class activation map and the target mask matrix, an average activation value of some activation values in the class activation map that are not affected by the mask in the target mask matrix, if the average activation value is greater than a preset threshold, using the local class token feature as a positive local class token feature, if the average activation value is less than or equal to the preset threshold, using the local class token feature as a negative local class token feature;
  • a training module is used to train the attention encoder with minimizing the overall loss as the optimization goal, wherein the overall loss includes: the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local category token feature and the global category token feature, and the difference between the negative local category token feature and the global category token feature.
  • a computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the above-mentioned weakly supervised semantic segmentation method based on attention mask is implemented.
  • an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above-mentioned weakly supervised semantic segmentation method based on attention mask when executing the computer program.
  • the method can obtain a sample image and an image classification label corresponding to the sample image, input the sample image into the attention encoder, obtain the patch token feature and the global category token feature through the attention encoder, and generate the image classification result and the class activation map of the sample image through the patch token feature, obtain the first semantic segmentation result of the sample image through the class activation map, and decode the patch token feature to obtain the second semantic segmentation result.
  • An attention matrix is generated through the attention encoder, a target mask matrix is randomly generated, the attention matrix is compensated with the target mask matrix to obtain a compensated attention matrix, and a local category token feature is generated based on the compensated attention matrix.
  • According to the auxiliary class activation map and the target mask matrix, the average activation value of the activation values in the class activation map that are not affected by the mask in the target mask matrix is determined; if the average activation value is greater than the preset threshold, the local category token feature is used as a positive local category token feature; if it is less than or equal to the preset threshold, the local category token feature is used as a negative local category token feature.
  • the attention encoder is trained with the optimization goal of minimizing the overall loss, which includes the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local category token features and the global category token features, and the difference between the negative local category token features and the global category token features. Image semantic segmentation is then performed on the image to be recognized through the trained attention encoder.
  • this method can be divided into two extraction methods when extracting features through the attention encoder.
  • the first method does not use a mask mechanism and directly obtains the image classification result, as well as two semantic segmentation results obtained through the class activation map and patch token features respectively.
  • the second method uses a mask mechanism to supplement the attention mechanism through a mask matrix to obtain local category token features.
  • the idea of contrastive learning can be combined to distinguish the positive and negative local category token features, so as to participate in model training. Therefore, model training not only includes classification loss and segmentation loss, but also includes contrastive learning loss related to local category token features. By introducing multiple losses, the loss of image semantic segmentation can be supervised more accurately, thereby improving the accuracy of semantic segmentation.
  • FIG1 is a flow chart of a weakly supervised semantic segmentation method based on attention mask provided by an embodiment of the present disclosure
  • FIG2 is a schematic diagram of a masking strategy provided by an embodiment of the present disclosure.
  • FIG3 is a schematic diagram of generating a target mask matrix provided by an embodiment of the present disclosure.
  • FIG4 is a schematic diagram of a complete process of model training provided by an embodiment of the present disclosure.
  • FIG5 is a schematic diagram of a weakly supervised semantic segmentation device based on attention mask provided by an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram of an electronic device corresponding to FIG. 1 provided in an embodiment of the present disclosure.
  • the present disclosure provides a training method for a weakly supervised semantic segmentation model based on an attention mask, and the execution process of the training method of the model can be executed by an electronic device such as a server.
  • the present disclosure takes the server executing the training method of the model as an example for explanation.
  • FIG1 is a flowchart of a weakly supervised semantic segmentation method based on attention mask provided by an embodiment of the present disclosure. The method specifically includes the following steps.
  • S100 Obtain a sample image and an image classification label corresponding to the sample image.
  • S102 Input the sample image into an attention encoder.
  • S104 Obtain patch token features and global category token features through the attention encoder, generate an image classification result and a class activation map for the sample image through the patch token features, obtain a first semantic segmentation result for the sample image through the class activation map, and decode the patch token features to obtain a second semantic segmentation result.
  • S106 Generate an attention matrix through the attention encoder, randomly generate a target mask matrix, compensate the attention matrix with the target mask matrix to obtain a compensated attention matrix, and generate local category token features according to the compensated attention matrix.
  • the label corresponding to the training sample refers to the category of the target object contained in the image, rather than marking the target object category of each pixel in the image as in conventional semantic segmentation.
  • the server can obtain a sample image and an image classification label corresponding to the sample image, where the image classification label is used to indicate a category corresponding to a target object in the image.
  • the sample image can be input into the attention encoder, through which the patch token feature and the global class token feature are obtained, and the image classification result and the class activation map of the sample image are generated through the patch token feature, and the first semantic segmentation result of the sample image is obtained through the class activation map, and the patch token feature is decoded to obtain the second semantic segmentation result.
  • the attention encoder can be an attention encoder in a Transformer model, and the attention encoder in the Transformer model uses a self-attention mechanism, including multiple self-attention layers.
  • the sample image can be input into the attention encoder to generate an attention matrix; a target mask matrix is randomly generated, the attention matrix is compensated with the target mask matrix to obtain a compensated attention matrix, and local category token features are generated according to the compensated attention matrix.
  • the difference between the above two processes is that one is to input the sample image into the attention encoder without using the target mask matrix (which can be called the first process), and the other is to input the sample image into the attention encoder and use the target mask matrix (which can be called the second process).
  • the role of the target mask matrix is to use the mask in the attention mechanism to focus on extracting local information in the image, because the attention mechanism in the Transformer model is inherently sparse.
  • the attention encoder will extract the associations between each feature, but this extraction may extract unnecessary associations. Therefore, this method can cover some of the associations through the mask (target mask matrix) and obtain the final category token features without considering these associations.
  • the sample image can be input into the attention encoder in batches.
  • the sample image is first input into the attention encoder that does not use the target mask matrix, and then the sample image is input into the attention encoder that uses the target mask matrix.
  • different branches of the attention encoder can be set, one branch is the first sub-encoder that does not include the target mask matrix, and the other branch is the second sub-encoder that includes the target mask matrix, and the sample image can be input into the first sub-encoder and the second sub-encoder at the same time.
  • a target mask matrix can be generated once, and the sample image can be input into the attention encoder.
  • the attention matrix is compensated by the target mask matrix generated this time, thereby obtaining a local category token feature.
  • the target mask matrix can be randomly generated multiple times, each time the generated target mask matrix is different, the sample image can be input into the attention encoder multiple times, each time the attention matrix is compensated by a target mask matrix, and multiple local category token features can be obtained.
  • multiple target mask matrices can be randomly generated, the sample image can be input into the attention encoder to obtain the attention matrix, and for each target mask matrix, the attention matrix is compensated by the target mask matrix to obtain the corresponding compensated attention matrix, and the local category token feature corresponding to the target mask matrix is generated according to the compensated attention matrix.
  • the finally extracted local category token features therefore comprise multiple local category tokens.
  • the sample image can be divided into several sub-images to determine multiple initial patch token features corresponding to the several sub-images, and to determine the initial category token features corresponding to the sample image (the initial category token features can be randomly initialized), and the multiple initial patch token features and the initial category token features are input into the attention encoder (the above two processes are the same when the sample image is input into the attention encoder).
  • the multiple initial patch token features and the initial category token features are input into the attention encoder, which can be a concatenation of multiple initial patch token features of dimension N*D and initial category token features of dimension 1*D along the first dimension, thereby obtaining a token of dimension (N+1)*D, and inputting it into the attention encoder, where D represents the dimension of token embedding.
  • the attention encoder can perform attention weighting on the initial patch token features and the initial category token features through the attention mechanism, so as to obtain weighted patch token features and category token features.
  • the patch token features and global category token features obtained in the first process of the above two processes are weighted.
  • in the second process, there is no need to finally obtain the patch token features (they exist only in the intermediate process); therefore, only the local category token features are obtained, and these local category token features are likewise attention-weighted.
  • Token embeddings are concatenated with the associated learnable class token (i.e., Class Token) and input into the standard Transformer encoder.
  • multiple patch tokens are concatenated with one associated learnable class token.
  • positional embeddings (i.e., Positional Embedding) are added to the patch tokens to retain spatial information.
  • the patch tokens mentioned here are the initial patch token features, and the learnable class tokens are the initial class token features.
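As a toy illustration of the patch splitting, class-token concatenation and positional embedding described above (the image size, patch size and embedding dimension below are arbitrary assumptions, and the learnable projections are replaced by random matrices), the construction of the token sequence can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration): a 64x64 image split into
# 8x8 patches gives N = 64 patch tokens, each embedded into D = 32 dims.
H = W = 64
P = 8                    # patch side length
D = 32                   # token embedding dimension
N = (H // P) * (W // P)  # number of patch tokens

image = rng.standard_normal((H, W, 3))

# Split the image into non-overlapping P x P patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * 3)

# Linear patch embedding (a learnable projection in the real model).
W_embed = rng.standard_normal((P * P * 3, D))
patch_tokens = patches @ W_embed                      # N x D initial patch token features

# Randomly initialised learnable class token, concatenated along the first dim.
class_token = rng.standard_normal((1, D))
tokens = np.concatenate([class_token, patch_tokens])  # (N + 1) x D

# Positional embeddings are added to retain spatial information.
pos_embed = rng.standard_normal((N + 1, D))
tokens = tokens + pos_embed
```

The resulting (N+1) x D sequence is what the self-attention encoding layers consume.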
  • the attention encoder can contain multiple self-attention encoding layers, i.e., Transformer layers.
  • a learnable linear transformation is first applied to map the token sequence to the query matrix Q(l), the key matrix K(l) and the value matrix V(l).
  • the token sequence refers to the sequence mentioned above after adding the position embedding on top of the patch token and concatenating the patch token with the category token. In some examples, this token sequence can be mapped through a simple linear layer.
  • Dk represents the dimension of Q and K, and Dv represents the dimension of V.
  • The self-attention mechanism is implemented by computing the product of the query matrix and the key matrix and then dividing by √Dk. This process yields a continuous attention matrix A(l) that contains the global pairwise relationships in each layer.
  • the output matrix Z (l) is generated as a weighted mixture of V (l) , with the weights coming from the attention matrix after the softmax operation (normalization operation)
  • the above relationship can be expressed by the following formulas (1) and (2):
  • A(l) = Q(l)K(l)ᵀ / √Dk  (1)
  • Z(l) = A′(l)V(l), where A′(l) = softmax(A(l))  (2)
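Formulas (1) and (2) can be sketched numerically as follows; the dimensions are toy values and the learned projections are replaced by random matrices, purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax (normalization operation).
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N1, D, Dk, Dv = 65, 32, 32, 32       # toy sizes: N + 1 tokens of dimension D
tokens = rng.standard_normal((N1, D))

# Learnable linear projections to the query, key and value matrices.
Wq = rng.standard_normal((D, Dk))
Wk = rng.standard_normal((D, Dk))
Wv = rng.standard_normal((D, Dv))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

# Formula (1): scaled dot-product attention scores.
A = Q @ K.T / np.sqrt(Dk)            # (N+1) x (N+1) pairwise relationships

# Formula (2): softmax-normalised weights mix the value vectors.
A_prime = softmax(A, axis=-1)
Z = A_prime @ V                      # output token sequence Z(l)
```

Each row of `A_prime` sums to one, so `Z` is a weighted mixture of the value vectors, as described above.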
  • the class activation map obtained from the last layer of the Transformer encoder can be passed through the pixel refinement module to generate a pseudo segmentation label (the first semantic segmentation result).
  • the class activation map obtained here is the class activation map output by the last self-attention encoding layer of the Transformer encoder.
  • the pixel refinement module can improve the class activation map based on the pixel information of the input image and the spatial information of the neighborhood, thereby ensuring that adjacent pixels with similar image appearance have the same semantics.
  • the patch tokens output by the last layer in the Transformer encoder are reshaped into feature maps, which can be passed into the decoder to generate predicted segmentation labels (second semantic segmentation results).
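The reshaping of the final patch tokens into a feature map, followed by a 1×1 classification head standing in for the decoder (the 3×3 convolutional layers are omitted for brevity; all sizes and the number of classes are illustrative assumptions), can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 64, 32                       # toy sizes: 64 patch tokens of dim 32
Hp = Wp = int(np.sqrt(N))           # 8 x 8 spatial grid of patches

# Drop the class token and reshape the patch tokens into a feature map
# of shape D x H' x W', as expected by a convolutional decoder.
patch_tokens = rng.standard_normal((N, D))
feature_map = patch_tokens.T.reshape(D, Hp, Wp)

# A 1x1 linear classification head (stand-in for the decoder's final layer)
# maps the D channels to C class channels, giving per-patch class scores.
C = 21                              # e.g. 20 object classes + background (assumption)
W_head = rng.standard_normal((C, D))
seg_logits = np.einsum('cd,dhw->chw', W_head, feature_map)
```

Upsampling `seg_logits` back to the input resolution would give the predicted segmentation labels (the second semantic segmentation result).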
  • the decoder can be composed of two 3×3 convolutional layers (with a dilation rate of 5) and a 1×1 linear layer.
  • FIG. 2 is a schematic diagram of a mask strategy provided by an embodiment of the present disclosure.
  • Figure 3 is a schematic diagram of generating a target mask matrix provided by an embodiment of the present disclosure.
  • an attention mask M(l) ∈ {0,1}N×N (the target mask matrix for layer l) is introduced as a switch on the attention matrix, thus obtaining a regularized attention matrix A′(l).
  • the output matrix Z (l) is generated as a weighted mixture of V (l) , with the weights coming from the regularized attention matrix after the softmax operation (normalization operation).
  • the output matrix Z (l) now includes the local category tokens.
  • the above association can be expressed by the following formulas (4) and (5), where entries of the attention matrix whose mask value is 0 are set to −∞ before the softmax so that the corresponding keys receive zero weight:
  • A′(l) = softmax(A(l) + (M(l) − 1) · ∞)  (4)
  • Z(l) = A′(l)V(l)  (5)
  • masking is achieved by randomly dropping columns in the attention matrix. Dropping columns in the attention matrix can also be said to be dropping keys, where the key refers to the key matrix.
  • the keys to be masked are sampled independently from a Bernoulli distribution with mask ratio p ∈ (0,1); that is, each element of the key mask follows a Bernoulli distribution.
  • the Bernoulli distribution here is the special case of the binomial distribution with a single trial.
  • the mask can be obtained by independently extracting the keys to be masked at different downsampling resolutions and then upsampling by the same multiple. Downsampling is also called decimation.
  • the size of the attention matrix can be determined first, and the preset multiples of downsampling are performed according to the size to obtain the sampling size, and the initial mask matrix of the sampling size is randomly generated, and the initial mask matrix is upsampled by a preset multiple to obtain the target mask matrix.
  • This strategy can form a continuous area for the keys not covered by the mask at different resolutions.
  • the mask generated is in the token dimension, and then expanded to the shape of the attention matrix to form M ⁇ 0,1 ⁇ N ⁇ N .
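The masking strategy above (Bernoulli sampling at a downsampled resolution, upsampling by the same multiple so unmasked keys form contiguous areas, then expansion to the attention-matrix shape by dropping key columns) can be sketched as follows; the grid size, downsampling multiple and mask ratio are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64                 # number of patch tokens (toy value)
Hp = Wp = 8            # token grid
s = 2                  # preset downsampling multiple (assumption)
p = 0.5                # mask ratio: probability that a key is masked

# Sample the mask at the lower resolution: 1 keeps a key, 0 masks it.
low = (rng.random((Hp // s, Wp // s)) >= p).astype(int)

# Nearest-neighbour upsampling by the same multiple, so the unmasked keys
# form contiguous s x s areas in the token grid.
token_mask = np.kron(low, np.ones((s, s), dtype=int)).reshape(-1)  # length N

# Expand the token-level mask to the N x N attention-matrix shape:
# masking a key drops an entire column of the attention matrix.
M = np.tile(token_mask, (N, 1))
```

Every column of `M` is constant, which is exactly the "dropping keys" behaviour described above.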
  • S108 According to the auxiliary class activation map and the target mask matrix, determine the average activation value of the activation values in the class activation map that are not affected by the mask in the target mask matrix; if the average activation value is greater than a preset threshold, use the local class token feature as a positive local class token feature; if the average activation value is less than or equal to the preset threshold, use the local class token feature as a negative local class token feature.
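The decision rule in S108 can be sketched as follows; the threshold, grid size and mask layout are illustrative assumptions, and the auxiliary class activation map is replaced by random values:

```python
import numpy as np

rng = np.random.default_rng(0)
Hp = Wp = 8
threshold = 0.5                           # preset threshold (assumption)

cam = rng.random((Hp, Wp))                # auxiliary class activation map (toy)
token_mask = np.zeros(Hp * Wp, dtype=int)
token_mask[: Hp * Wp // 2] = 1            # toy mask: first half of the grid kept

# Average only the activations whose tokens are NOT affected by the mask.
kept = token_mask.reshape(Hp, Wp).astype(bool)
avg_activation = cam[kept].mean()

# The local class token produced under this mask is labelled positive when
# the unmasked region is sufficiently activated, negative otherwise.
is_positive = avg_activation > threshold
```

A positive label means the unmasked region likely covers the object, so its local token should be pulled towards the global class token; a negative label pushes it away.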
  • the attention encoder is trained with minimizing the overall loss as the optimization goal, and the overall loss includes: the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local category token feature and the global category token feature, and the difference between the negative local category token feature and the global category token feature.
  • the overall loss is the sum of the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local class token feature and the global class token feature, and the difference between the negative local class token feature and the global class token feature.
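A minimal sketch of the overall loss as the plain sum of the four differences (the numeric values are placeholders, not results from the patent):

```python
# Toy per-term losses standing in for: classification loss, segmentation
# consistency loss, and the contrastive terms for the positive / negative
# local category tokens (all values are placeholders).
loss_cls = 0.7   # image classification result vs. image classification label
loss_seg = 0.4   # first vs. second semantic segmentation result
loss_pos = 0.2   # positive local category token vs. global category token
loss_neg = 0.1   # negative local category token vs. global category token

# The overall training objective is the sum of the four terms.
overall_loss = loss_cls + loss_seg + loss_pos + loss_neg
```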
  • the trained attention encoder can further perform image semantic segmentation on the image to be recognized.
  • the first affinity matrix is obtained through the middle self-attention encoding layer, and the auxiliary class activation map can be determined through the middle self-attention encoding layer, and the foreground background segmentation result can be obtained through the auxiliary class activation map, and then the first affinity matrix is determined.
  • the second affinity matrix is obtained directly through the attention encoder: it can be derived from the patch token features output by the last self-attention encoding layer.
  • the auxiliary class activation map can be processed into a pseudo affinity label as supervision for affinity learning.
  • the first affinity matrix determined above is then used as supervision to facilitate the representation of patch tokens from the last layer of the Transformer encoder.
  • the cosine similarity can be used to measure the predicted affinity between two final patch tokens. Therefore, the affinity loss L aff can be calculated by the following formula (8):
  • Let Mt be the key mask derived from the corresponding attention mask, obtained by reshaping an arbitrary row Mi: of the mask into the H′×W′ spatial layout of the token grid, where i is an arbitrary row index.
  • the InfoNCE loss function can be used for contrastive learning (i.e., the InfoNCE loss function can be used to calculate the contrastive loss between the global category token feature and the local category token features).
  • the global and local category token features are linearly projected into a representation space suitable for contrastive learning by global and local projectors, respectively.
  • the projector consists of three linear layers and an L2 normalization layer. Let q be the projected global category token feature, and k + /k - be the projected positive/negative local category token feature. The training goal is to minimize/maximize the distance between the global category token feature and the positive/negative local category token feature.
  • the contrast loss L mcc can be expressed by the following formula (11):
  • L mcc = −(1/N+) Σk+ log[ exp(q·k+/τ) / (exp(q·k+/τ) + Σk− exp(q·k−/τ) + ε) ]  (11)
  • N+ is the number of k+ samples, τ is the temperature factor, and ε is a small value to ensure numerical stability.
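An InfoNCE-style contrastive loss between the projected global token q and the positive/negative local tokens can be sketched as follows; the temperature and epsilon values are conventional choices, not taken from the patent, and all features are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_pos, n_neg = 16, 4, 8
tau = 0.07        # temperature factor (typical value; an assumption here)
eps = 1e-8        # small value for numerical stability

def l2norm(x):
    # L2 normalization, as produced by the projector's final layer.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

q     = l2norm(rng.standard_normal(D))           # projected global category token
k_pos = l2norm(rng.standard_normal((n_pos, D)))  # projected positive local tokens
k_neg = l2norm(rng.standard_normal((n_neg, D)))  # projected negative local tokens

# InfoNCE: pull q towards each positive, push it away from all negatives.
neg_term = np.exp(k_neg @ q / tau).sum()
pos_logits = np.exp(k_pos @ q / tau)
L_mcc = -np.mean(np.log(pos_logits / (pos_logits + neg_term) + eps))
```

Minimizing `L_mcc` shrinks the distance between q and the positives while enlarging the distance to the negatives, matching the stated training goal.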
  • the parameters of the global projector are updated using a moving average strategy, i.e., ⁇ g ⁇ m ⁇ g +(1-m) ⁇ l , where m is the momentum factor, ⁇ g and ⁇ l are the parameters of the global projector and the local projector, respectively. This slowly evolving update of the parameters of the two projectors ensures training stability and enforces representation consistency.
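The momentum (moving-average) update of the global projector parameters described above can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 0.999                            # momentum factor (typical value; assumption)

theta_g = rng.standard_normal(10)    # global projector parameters
theta_l = rng.standard_normal(10)    # local projector parameters

# Momentum update: the global projector evolves slowly towards the local
# one, which stabilises training and enforces representation consistency.
theta_g_new = m * theta_g + (1 - m) * theta_l
```

Because the update is a convex combination, each new parameter lies between the corresponding global and local parameter values.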
  • the global projector and the local projector can be two fully connected layers.
  • the trained attention encoder can be used to perform semantic segmentation on images whose semantic segmentation results are unknown.
  • the image to be identified can be input into the trained attention encoder to obtain the patch token features of the image to be identified, and then the semantic segmentation result can be obtained through the decoder.
  • the overall loss can be the sum of classification loss, segmentation loss, loss associated with local category token features, auxiliary classification loss and affinity loss.
  • the attention encoder can be trained with the minimum sum of classification loss, segmentation loss, loss associated with local category token features, auxiliary classification loss and affinity loss.
  • this method can extract features through the attention encoder in two ways.
  • the first method does not use a mask mechanism and directly obtains the image classification result, as well as two semantic segmentation results obtained through the class activation map and the patch token feature respectively.
  • the second method uses a mask mechanism to supplement the attention mechanism through a mask matrix to obtain local category token features.
  • based on the idea of contrastive learning, the local category token features can be distinguished as positive or negative, so as to participate in model training. Therefore, model training not only includes classification loss and segmentation loss, but also includes a contrastive learning loss related to the local category token features.
  • model training can also include auxiliary classification loss and affinity loss.
  • the entity executing this method is described as a server.
  • the executing entity can be a desktop computer, a server, a large-scale service platform, etc., which is not limited here.
  • the above provides a weakly supervised semantic segmentation method based on attention mask for one or more embodiments of the present disclosure. Based on the same idea, the present disclosure also provides a weakly supervised semantic segmentation device based on attention mask, as shown in FIG5 .
  • FIG5 is a schematic diagram of a weakly supervised semantic segmentation device based on attention mask provided by an embodiment of the present disclosure, including:
  • An acquisition module 501 is used to acquire a sample image and an image classification label corresponding to the sample image;
  • An input module 502 is used to input the sample image into an attention encoder, obtain patch token features and global category token features through the attention encoder, generate an image classification result and a class activation map for the sample image through the patch token features, obtain a first semantic segmentation result for the sample image through the class activation map, and decode the patch token features to obtain a second semantic segmentation result;
  • a mask compensation module 503 is used to generate an attention matrix through the attention encoder, randomly generate a target mask matrix, and compensate the attention matrix through the target mask matrix to obtain a compensated attention matrix, and generate each local category token feature according to the compensated attention matrix;
  • a judgment module 504 is used to determine, for each local class token feature, an average activation value of some activation values in the class activation map that are not affected by the mask in the target mask matrix according to the auxiliary class activation map and the target mask matrix, and if the average activation value is greater than a preset threshold, the local class token feature is used as a positive local class token feature; if the average activation value is less than or equal to the preset threshold, the local class token feature is used as a negative local class token feature;
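The positive/negative decision for a local class token can be sketched as follows (the array shapes, the mask convention, and the threshold default are assumptions):

```python
import numpy as np

def label_local_token(cam: np.ndarray, mask: np.ndarray, threshold: float = 0.5) -> bool:
    """Decide whether a local category token is a positive sample.

    cam:  (H, W) auxiliary class activation map, values assumed in [0, 1].
    mask: (H, W) binary target mask; 1 = position masked out, 0 = kept.
    Returns True (positive) if the mean activation over the kept
    (unmasked) positions exceeds the threshold, else False (negative).
    """
    kept = mask == 0
    mean_act = cam[kept].mean() if kept.any() else 0.0
    return bool(mean_act > threshold)
```

A token whose visible region sits on strongly activated (foreground) pixels is labeled positive; one sitting on background is labeled negative.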
  • a training module 505 is used to train the attention encoder with minimizing the overall loss as the optimization goal, and the overall loss includes: the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local category token feature and the global category token feature, and the difference between the negative local category token feature and the global category token feature.
  • the input module 502 is specifically used to divide the sample image into several sub-images to determine the initial patch token features corresponding to the several sub-images and multiple initial category token features corresponding to the sample image; splice the multiple initial patch token features with the initial category token features, and input them into the attention encoder.
  • the mask compensation module 503 is specifically used to determine the size of the attention matrix; downsample the size by a preset multiple to obtain a sampling size, and randomly generate an initial mask matrix of the sampling size; upsample the initial mask matrix by the preset multiple to obtain the target mask matrix.
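The downsample-then-upsample mask generation can be sketched as sampling a coarse random binary matrix and upsampling it by repetition, so masked positions form contiguous blocks (function and parameter names are assumptions; size is assumed divisible by the factor):

```python
import numpy as np

def random_target_mask(size: int, factor: int, p: float = 0.5, rng=None) -> np.ndarray:
    """Randomly generate a (size, size) target mask matrix.

    A coarse binary mask of shape (size // factor, size // factor) is
    sampled (1 = masked with probability p), then nearest-neighbor
    upsampled back by the preset multiple `factor`.
    """
    rng = np.random.default_rng(rng)
    coarse = (rng.random((size // factor, size // factor)) < p).astype(np.uint8)
    # upsample by repeating each coarse cell `factor` times along both axes
    return np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
```

Sampling at the coarse resolution before upsampling is what makes the mask block-structured rather than pixel-wise noise.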
  • the attention matrix includes: a query matrix, a key matrix, and a value matrix;
  • A' is the compensated attention matrix
  • M is the target mask matrix
  • Q is the query matrix
  • K is the key matrix
  • V is the value matrix
  • Z is the output matrix
  • l is the lth self-attention encoding layer of the attention encoder.
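A minimal sketch of one self-attention layer with mask compensation follows. Since the exact compensation formula for A′ is not reproduced above, the additive-mask-before-softmax form used here is an assumption:

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """One self-attention layer with mask compensation (a sketch).

    Q, K, V: (N, d) query/key/value matrices.
    M: (N, N) additive mask, 0 where attention is kept and a large
       negative value (e.g. -1e9) where it is suppressed.
    Returns (Z, A_comp): the output matrix Z = A' V and the compensated
    attention matrix A' = softmax(Q K^T / sqrt(d) + M), row-wise.
    """
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d) + M            # compensate via the mask
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A_comp = A / A.sum(axis=1, keepdims=True)    # row-wise softmax
    return A_comp @ V, A_comp
```

With the mask at zero everywhere this reduces to ordinary scaled dot-product attention; large negative entries drive the corresponding attention weights to zero.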
  • the attention encoder comprises a plurality of self-attention encoding layers
  • the training module 505 is also used to obtain the output result of the target self-attention coding layer among the multiple self-attention coding layers.
  • the output result of the target self-attention encoding layer includes intermediate patch token features
  • the training module 505 is also used to determine the auxiliary classification result through the intermediate patch token features
  • the attention encoder is trained with minimizing the overall loss as the optimization goal, wherein the overall loss also includes: the difference between the auxiliary classification result and the image classification label.
  • the attention encoder comprises a plurality of self-attention encoding layers
  • the output result of the target self-attention encoding layer includes the auxiliary class activation map
  • the training module 505 is also used to determine a first affinity matrix based on the foreground-background segmentation result determined by the auxiliary class activation map; determine a second affinity matrix based on the patch token features; and train the attention encoder with minimizing the overall loss as the optimization goal, wherein the overall loss also includes: the difference between the first affinity matrix and the second affinity matrix.
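One common way to realize the first affinity matrix from a foreground-background segmentation is pairwise label agreement; this sketch is illustrative, not necessarily the claimed construction:

```python
import numpy as np

def label_affinity(labels: np.ndarray) -> np.ndarray:
    """Affinity matrix from foreground/background pseudo labels.

    labels: per-patch labels (any shape, flattened internally).
    Entry (i, j) is 1.0 when patches i and j share the same label,
    else 0.0; this can be compared against the second affinity matrix
    predicted from patch token features.
    """
    flat = labels.reshape(-1)
    return (flat[:, None] == flat[None, :]).astype(np.float32)
```

Penalizing the gap between this label-derived affinity and the feature-derived affinity encourages patch tokens of the same region to cluster.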
  • the mask compensation module 503 is specifically used to randomly generate multiple target mask matrices; for each target mask matrix, the sample image is input into the attention encoder to generate an attention matrix corresponding to the sample image, the attention matrix generated this time is compensated by the target mask matrix to obtain a corresponding compensated attention matrix, and local category token features corresponding to the target mask matrix are generated according to the corresponding compensated attention matrix; and multiple local category token features are used as the local category token features.
  • the training module 505 is further used to perform image semantic segmentation on the image to be recognized using the trained attention encoder.
  • the present disclosure also provides a computer-readable storage medium storing a computer program, which can be used to execute the above-mentioned weakly supervised semantic segmentation method based on attention mask.
  • the present disclosure also provides a schematic structural diagram of an electronic device as shown in FIG6.
  • the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may also include hardware required for other services.
  • the processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to implement a weakly supervised semantic segmentation method based on an attention mask.
  • a programmable logic device such as a field programmable gate array (FPGA), programmed using a hardware description language (HDL)
  • the controller may be implemented in any suitable manner, for example, the controller may take the form of a microprocessor or processor and a computer-readable medium storing a computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (ASIC), a programmable logic controller, and an embedded microcontroller.
  • controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; the memory controller can also be implemented as part of the control logic of the memory.
  • this controller can be considered as a hardware component, and the devices for implementing various functions included therein can also be regarded as structures within the hardware component. Or even, the devices for implementing various functions can be regarded as both software modules for implementing the method and structures within the hardware component.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
  • These computer program instructions can also be loaded into a computer or other programmable data processing device so that A series of operational steps are executed on a computer or other programmable device to produce a computer-implemented process, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer readable media include permanent and non-permanent, removable and non-removable media that can be implemented by any method or technology to store information.
  • Information can be computer readable instructions, data structures, program modules or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • computer readable media does not include temporary computer readable media (transitory media), such as modulated data signals and carrier waves.
  • the embodiments of the present disclosure may be provided as methods, systems or computer program products. Therefore, the present disclosure may take the form of a complete hardware embodiment, a complete software embodiment or an embodiment combining software and hardware. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • the present disclosure may be described in the general context of computer-executable instructions, such as program modules, executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • the present disclosure may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communications network.
  • program modules may be located in local and remote computer storage media including memory storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a weakly supervised semantic segmentation method and apparatus based on an attention mask. The method comprises: inputting a sample image into an attention encoder, so as to obtain a global class token feature, an image classification result and two semantic segmentation results; then inputting the sample image into the attention encoder again, so as to generate an attention matrix, and randomly generating a target mask matrix; compensating for the attention matrix by means of the target mask matrix, so as to obtain a compensated attention matrix; on the basis of the compensated attention matrix, generating each local class token feature; and distinguishing the positive and negative characteristics of the local class token features. Model losses not only include losses in terms of image classification and image semantic segmentation, but also comprise losses arising from distinguishing the positive and negative characteristics of the local class token features and performing contrastive learning together with the global class token feature. Semantic segmentation by the model is supervised by introducing a plurality of losses, thereby improving the accuracy of semantic segmentation.

Description

一种基于注意力掩码的弱监督语义分割方法及装置A weakly supervised semantic segmentation method and device based on attention mask 技术领域Technical Field

本公开涉及语义分割技术领域,尤其涉及一种基于注意力掩码的弱监督语义分割方法及装置。The present disclosure relates to the technical field of semantic segmentation, and in particular to a weakly supervised semantic segmentation method and device based on attention mask.

背景技术Background Art

目前图像语义分割方法采用的是像素级注释的全监督训练方法,其需要密集标注图像中所有像素的标签信息,因而极大的增加人力和时间成本。而弱监督语义分割方法仅仅只需要标注图像中出现物体类别标签的训练数据,因此人力和时间成本上的优势让弱监督语义分割获得了广泛的关注。The current image semantic segmentation method uses a fully supervised training method with pixel-level annotations, which requires densely annotating the label information of all pixels in the image, thus greatly increasing the manpower and time costs. The weakly supervised semantic segmentation method only requires training data with the object category labels that appear in the image. Therefore, the advantages in manpower and time costs have made weakly supervised semantic segmentation gain widespread attention.

弱监督语义分割(Weakly Supervised Semantic Segmentation,WSSS)是指利用边界框标注、潦草(scribble)标注、点标注或图像级类别标注来预测图像中每个像素的类别标签。本公开主要研究基于图像级别的标签(如图像中出现物体的类别)对神经网络进行分类训练,实现模型对图像进行语义分割的一类方法。Weakly supervised semantic segmentation (WSSS) refers to predicting the category label of each pixel in an image using bounding box annotations, scribble annotations, point annotations, or image-level category annotations. The present disclosure mainly studies a class of methods that train a neural network for classification based on image-level labels (such as the categories of objects appearing in an image), enabling the model to perform semantic segmentation of images.

基于图像级分类标注弱监督语义分割的方法大多是基于类激活图(Class Activation Map,CAM)的方法展开。类激活图是一种基于深度分类网络的技术,用来生成通道数与总类别数相同的特征图,显示每个类别物体的近似位置。Most of the weakly supervised semantic segmentation methods based on image-level classification annotation are based on the Class Activation Map (CAM) method. Class Activation Map is a technique based on deep classification networks that is used to generate feature maps with the same number of channels as the total number of categories, showing the approximate location of objects in each category.

当前,如何提高弱监督语义分割的准确性,则是一个亟待解决的问题。Currently, how to improve the accuracy of weakly supervised semantic segmentation is an urgent problem to be solved.

发明内容Summary of the invention

本公开提供一种基于注意力掩码的弱监督语义分割方法及装置。The present invention provides a method and device for weakly supervised semantic segmentation based on attention mask.

根据本公开实施例的第一方面,提供了一种基于注意力掩码的弱监督语义分割方法,包括:According to a first aspect of an embodiment of the present disclosure, a weakly supervised semantic segmentation method based on attention mask is provided, comprising:

获取样本图像以及所述样本图像对应的图像分类标签;Obtaining a sample image and an image classification label corresponding to the sample image;

将所述样本图像输入到注意力编码器中;Input the sample image into the attention encoder;

通过所述注意力编码器得到补丁令牌特征与全局类别令牌特征,并通过所述补丁令牌特征生成对所述样本图像的图像分类结果以及类激活图,通过所述类激活图得到对所述样本图像的第一语义分割结果,以及对所述补丁令牌特征进行解码,得到第二语义分割结果;Obtaining patch token features and global category token features through the attention encoder, generating an image classification result and a class activation map for the sample image through the patch token features, obtaining a first semantic segmentation result for the sample image through the class activation map, and decoding the patch token features to obtain a second semantic segmentation result;

通过所述注意力编码器生成注意力矩阵,随机生成目标掩码矩阵,并通过所述目标掩码矩阵对所述注意力矩阵进行补偿,得到补偿后的注意力矩阵,根据所述补偿后的注意力矩阵,生成局部类别令牌特征;An attention matrix is generated through the attention encoder, a target mask matrix is randomly generated, and the attention matrix is compensated by the target mask matrix to obtain a compensated attention matrix; local category token features are generated according to the compensated attention matrix;

根据辅助类激活图以及所述目标掩码矩阵,确定未被所述目标掩码矩阵中的掩码影响的类激活图中部分激活值的平均激活值,若所述平均激活值大于预设阈值,将所述局部类别令牌特征作为正局部类别令牌特征,若所述平均激活值小于等于预设阈值,将所述局部类别令牌特征作为负局部类别令牌特征;According to the auxiliary class activation map and the target mask matrix, determine the average activation value of some activation values in the class activation map that are not affected by the mask in the target mask matrix, if the average activation value is greater than a preset threshold, use the local class token feature as a positive local class token feature, if the average activation value is less than or equal to the preset threshold, use the local class token feature as a negative local class token feature;

以最小化总体损失为优化目标,对所述注意力编码器进行训练,所述总体损失包括:所述图像分类结果与所述图像分类标签之间的差异、所述第一语义分割结果与所述第二语义分割结果之间的差异、所述正局部类别令牌特征与所述全局类别令牌特征之间的差异,以及所述负局部类别令牌特征与所述全局类别令牌特征之间的差异。The attention encoder is trained with minimizing the overall loss as the optimization goal, and the overall loss includes: the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local category token feature and the global category token feature, and the difference between the negative local category token feature and the global category token feature.

可选地,所述将所述样本图像输入到注意力编码器中,具体包括:Optionally, inputting the sample image into an attention encoder specifically includes:

将所述样本图像分割为若干子图像,以确定出所述若干子图像对应的多个初始补丁令牌特征,以及确定所述样本图像对应的初始类别令牌特征;Segmenting the sample image into a plurality of sub-images to determine a plurality of initial patch token features corresponding to the plurality of sub-images, and determining an initial category token feature corresponding to the sample image;

将所述多个初始补丁令牌特征与所述初始类别令牌特征拼接,输入到所述注意力编码器中。The multiple initial patch token features are concatenated with the initial category token features and input into the attention encoder.
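The splitting and concatenation steps above can be sketched as follows (the patch layout, shapes, and function names are illustrative assumptions):

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches and flatten
    each into an initial patch token vector of length patch*patch*C.
    H and W are assumed divisible by the patch size."""
    H, W, C = image.shape
    return (image.reshape(H // patch, patch, W // patch, patch, C)
                 .transpose(0, 2, 1, 3, 4)          # group by patch position
                 .reshape(-1, patch * patch * C))   # one row per patch token

def prepend_class_tokens(patch_tokens: np.ndarray, class_tokens: np.ndarray) -> np.ndarray:
    """Concatenate the initial category token(s) in front of the patch
    tokens, forming the sequence fed into the attention encoder."""
    return np.concatenate([class_tokens, patch_tokens], axis=0)
```

The resulting sequence of category tokens followed by patch tokens is what the attention encoder consumes.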

可选地,所述随机生成目标掩码矩阵,具体包括:Optionally, the randomly generating a target mask matrix specifically includes:

确定所述注意力矩阵的尺寸;determining a size of the attention matrix;

按照所述尺寸进行预设倍数的下采样,得到采样尺寸,并随机生成大小为所述采样尺寸的初始掩码矩阵;Downsampling the size by a preset multiple to obtain a sampling size, and randomly generating an initial mask matrix having a size equal to the sampling size;

将所述初始掩码矩阵进行所述预设倍数的上采样,得到所述目标掩码矩阵。The initial mask matrix is upsampled by the preset multiple to obtain the target mask matrix.

可选地,所述注意力矩阵包括:查询矩阵、键矩阵以及值矩阵;Optionally, the attention matrix includes: a query matrix, a key matrix and a value matrix;

所述通过所述目标掩码矩阵对所述注意力矩阵进行补偿,得到补偿后的注意力矩阵,具体包括:The compensating the attention matrix by the target mask matrix to obtain a compensated attention matrix specifically includes:

通过以下公式确定所述补偿后的注意力矩阵:


Z(l)=A′(l)V(l)
The compensated attention matrix is determined by the following formula:


Z (l) = A′ (l) V (l)

其中,A'为所述补偿后的注意力矩阵,M为所述目标掩码矩阵,Q为所述查询矩阵,K为所述键矩阵,V为所述值矩阵,Z为输出矩阵,l为所述注意力编码器的第l自注意力编码层。 Among them, A' is the compensated attention matrix, M is the target mask matrix, Q is the query matrix, K is the key matrix, V is the value matrix, Z is the output matrix, and l is the lth self-attention encoding layer of the attention encoder.

可选地,所述注意力编码器包括多个自注意力编码层;所述方法还包括:Optionally, the attention encoder includes a plurality of self-attention encoding layers; and the method further includes:

获取所述多个自注意力编码层中的目标自注意力编码层的输出结果。Obtain an output result of a target self-attention coding layer among the multiple self-attention coding layers.

可选地,所述目标自注意力编码层的输出结果包括中间补丁令牌特征,所述方法还包括:Optionally, the output result of the target self-attention encoding layer includes intermediate patch token features, and the method further includes:

通过所述中间补丁令牌特征,确定辅助分类结果;Determining auxiliary classification results through the intermediate patch token features;

以最小化所述总体损失为优化目标,对所述注意力编码器进行训练,其中,所述总体损失还包括:所述辅助分类结果与所述图像分类标签之间的差异。The attention encoder is trained with minimizing the overall loss as an optimization goal, wherein the overall loss also includes: the difference between the auxiliary classification result and the image classification label.
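A minimal sketch of an auxiliary classification head on the intermediate patch tokens follows; the mean-pooling-plus-linear form used here is an assumption, not the disclosure's exact head:

```python
import numpy as np

def auxiliary_classification(patch_tokens: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Auxiliary classification from intermediate patch token features.

    patch_tokens: (N, D) intermediate patch tokens from the target layer.
    W: (D, C) classifier weights; b: (C,) bias (hypothetical parameters).
    Returns (C,) class logits, to be compared against the image label.
    """
    pooled = patch_tokens.mean(axis=0)  # (D,) global average over tokens
    return pooled @ W + b               # (C,) auxiliary class logits
```

The cross-entropy between these logits and the image classification label gives the auxiliary classification term added to the overall loss.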

可选地,所述目标自注意力编码层的输出结果包括所述辅助类激活图,所述方法还包括:Optionally, the output result of the target self-attention encoding layer includes the auxiliary class activation map, and the method further includes:

通过所述辅助类激活图确定出的前景背景分割结果,确定第一亲和力矩阵;Determine a first affinity matrix based on the foreground-background segmentation result determined by the auxiliary class activation map;

根据所述补丁令牌特征,确定第二亲和力矩阵;Determining a second affinity matrix according to the patch token characteristics;

以最小化所述总体损失为优化目标,对所述注意力编码器进行训练,其中,所述总体损失还包括:所述第一亲和力矩阵与所述第二亲和力矩阵之间的差异。The attention encoder is trained with minimizing the overall loss as an optimization goal, wherein the overall loss also includes: a difference between the first affinity matrix and the second affinity matrix.

可选地,所述通过所述注意力编码器生成注意力矩阵,随机生成目标掩码矩阵,并通过所述目标掩码矩阵对所述注意力矩阵进行补偿,得到补偿后的注意力矩阵,根据所述补偿后的注意力矩阵,生成局部类别令牌特征,具体包括:Optionally, generating an attention matrix by the attention encoder, randomly generating a target mask matrix, and compensating the attention matrix by the target mask matrix to obtain a compensated attention matrix, and generating a local category token feature according to the compensated attention matrix, specifically includes:

随机生成多个目标掩码矩阵,针对每个目标掩码矩阵,将所述样本图像输入到所述注意力编码器中,生成所述样本图像对应的注意力矩阵,通过该目标掩码矩阵对本次生成的注意力矩阵进行补偿,得到对应的补偿后的注意力矩阵,根据所述对应的补偿后的注意力矩阵,生成该目标掩码矩阵对应的局部类别令牌特征;将多个局部类别令牌特征作为所述局部类别令牌特征。A plurality of target mask matrices are randomly generated. For each target mask matrix, the sample image is input into the attention encoder to generate an attention matrix corresponding to the sample image. The attention matrix generated this time is compensated by the target mask matrix to obtain a corresponding compensated attention matrix. According to the corresponding compensated attention matrix, a local category token feature corresponding to the target mask matrix is generated; and a plurality of local category token features are used as the local category token features.

可选地,所述方法还包括:Optionally, the method further comprises:

使用训练好的注意力编码器对待识别图像进行图像语义分割。Use the trained attention encoder to perform image semantic segmentation on the image to be recognized.

根据本公开实施例的第二方面,提供了一种基于注意力掩码的弱监督语义分割装置,包括:According to a second aspect of an embodiment of the present disclosure, a weakly supervised semantic segmentation device based on attention mask is provided, comprising:

获取模块,用于获取样本图像以及所述样本图像对应的图像分类标签;An acquisition module, used to acquire a sample image and an image classification label corresponding to the sample image;

输入模块,用于将所述样本图像输入到注意力编码器中,通过所述注意力编码器得到补丁令牌特征与全局类别令牌特征,并通过所述补丁令牌特征,生成对所述样本图像的图像分类结果以及类激活图,通过所述类激活图得到对所述样本图像的第一语义分割结果,以及对所述补丁令牌特征进行解码,得到第二语义分割结果;An input module is used to input the sample image into an attention encoder, obtain patch token features and global category token features through the attention encoder, generate an image classification result and a class activation map for the sample image through the patch token features, obtain a first semantic segmentation result for the sample image through the class activation map, and decode the patch token features to obtain a second semantic segmentation result;

掩码补偿模块,用于通过所述注意力编码器生成注意力矩阵,随机生成目标掩码矩阵,并通过所述目标掩码矩阵对所述注意力矩阵进行补偿,得到补偿后的注意力矩阵,根据所述补偿后的注意力矩阵,生成局部类别令牌特征;A mask compensation module is used to generate an attention matrix through the attention encoder, randomly generate a target mask matrix, compensate the attention matrix through the target mask matrix to obtain a compensated attention matrix, and generate local category token features according to the compensated attention matrix;

判断模块,用于根据辅助类激活图以及所述目标掩码矩阵,确定未被所述目标掩码矩阵中的掩码影响的类激活图中部分激活值的平均激活值,若所述平均激活值大于预设阈值,将所述局部类别令牌特征作为正局部类别令牌特征,若所述平均激活值小于或等于预设阈值,将所述局部类别令牌特征作为负局部类别令牌特征;A judgment module, used for determining, according to the auxiliary class activation map and the target mask matrix, an average activation value of some activation values in the class activation map that are not affected by the mask in the target mask matrix, if the average activation value is greater than a preset threshold, using the local class token feature as a positive local class token feature, if the average activation value is less than or equal to the preset threshold, using the local class token feature as a negative local class token feature;

训练模块,用于以最小化总体损失为优化目标,对所述注意力编码器进行训练,所述总体损失包括:所述图像分类结果与所述图像分类标签之间的差异、所述第一语义分割结果与所述第二语义分割结果之间的差异、所述正局部类别令牌特征与所述全局类别令牌特征之间的差异,以及所述负局部类别令牌特征与所述全局类别令牌特征之间的差异。A training module is used to train the attention encoder with minimizing the overall loss as the optimization goal, wherein the overall loss includes: the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local category token feature and the global category token feature, and the difference between the negative local category token feature and the global category token feature.

根据本公开实施例的第三方面,提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现上述基于注意力掩码的弱监督语义分割方法。According to a third aspect of an embodiment of the present disclosure, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the above-mentioned weakly supervised semantic segmentation method based on attention mask is implemented.

根据本公开实施例的第四方面,提供了一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现上述基于注意力掩码的弱监督语义分割方法。According to a fourth aspect of an embodiment of the present disclosure, an electronic device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above-mentioned weakly supervised semantic segmentation method based on attention mask when executing the computer program.

本公开采用的上述至少一个技术方案能够达到以下有益效果:At least one of the above technical solutions adopted in the present disclosure can achieve the following beneficial effects:

从上述基于注意力掩码的弱监督语义分割方法中可以看出,该方法可以获取样本图像以及该样本图像对应的图像分类标签,将该样本图像输入到注意力编码器中,通过该注意力编码器得到补丁令牌特征与全局类别令牌特征,并通过补丁令牌特征,生成对该样本图像的图像分类结果以及类激活图,通过该类激活图得到对该样本图像的第一语义分割结果,以及对该补丁令牌特征进行解码,得到第二语义分割结果。通过该注意力编码器生成注意力矩阵,随机生成目标掩码矩阵,并通过目标掩码矩阵对注意力矩阵进行补偿,得到补偿后的注意力矩阵,根据补偿后的注意力矩阵,生成局部类别令牌特征。根据辅助类激活图以及目标掩码矩阵,确定未被目标掩码矩阵中的掩码影响的类激活图中部分激活值的平均激活值,若平均激活值大于预设阈值,将局部类别令牌特征作为正局部类别令牌特征,若平均激活值小于等于预设阈值,将该局部类别令牌特征作为负局部类别令牌特征。以最小化总体损失为优化目标,对该注意力编码器进行训练,总体损失包括:图像分类结果与图像分类标签之间的差异、第一语义分割结果与第二语义分割结果之间的差异、正局部类别令牌特征与全局类别令牌特征之间的差异,以及负局部类别令牌特征与全局类别令牌特征之间的差异,以通过训练后的注意力编码器对待识别图像进行图像语义分割。It can be seen from the above weakly supervised semantic segmentation method based on attention mask that the method can obtain a sample image and the image classification label corresponding to the sample image, input the sample image into an attention encoder, obtain patch token features and a global category token feature through the attention encoder, generate an image classification result and a class activation map for the sample image through the patch token features, obtain a first semantic segmentation result for the sample image through the class activation map, and decode the patch token features to obtain a second semantic segmentation result. An attention matrix is generated through the attention encoder, a target mask matrix is randomly generated, and the attention matrix is compensated by the target mask matrix to obtain a compensated attention matrix; local category token features are generated according to the compensated attention matrix. According to the auxiliary class activation map and the target mask matrix, the average activation value of the part of the activation values in the class activation map that are not affected by the mask in the target mask matrix is determined; if the average activation value is greater than a preset threshold, the local category token feature is taken as a positive local category token feature, and if the average activation value is less than or equal to the preset threshold, the local category token feature is taken as a negative local category token feature. The attention encoder is trained with minimizing the overall loss as the optimization goal, where the overall loss includes: the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local category token feature and the global category token feature, and the difference between the negative local category token feature and the global category token feature, so that the trained attention encoder can perform image semantic segmentation on the image to be recognized.

从上述内容中可以看出,本方法在通过注意力编码器进行特征提取时,可以分为两种提取方式,第一种不采用掩码机制,直接得到图像分类结果,以及通过类激活图和补丁令牌特征分别得到的两种语义分割结果,第二种采用掩码机制,通过掩码矩阵对注意力机制进行补充,得到局部类别令牌特征,而后,可以结合对比学习的思想,将局部类别令牌特征区分正负性,从而参与模型训练,因此,模型训练不仅包含有分类损失、分割损失,还包含有与局部类别令牌特征相关的对比学习的损失。通过引入多种损失,能够更加准确地对图像语义分割的损失进行监督,从而提高语义分割的准确性。From the above content, it can be seen that this method can be divided into two extraction methods when extracting features through the attention encoder. The first method does not use a mask mechanism and directly obtains the image classification result, as well as two semantic segmentation results obtained through the class activation map and patch token features respectively. The second method uses a mask mechanism to supplement the attention mechanism through a mask matrix to obtain local category token features. Then, the idea of contrastive learning can be combined to distinguish the positive and negative local category token features, so as to participate in model training. Therefore, model training not only includes classification loss and segmentation loss, but also includes contrastive learning loss related to local category token features. By introducing multiple losses, the loss of image semantic segmentation can be supervised more accurately, thereby improving the accuracy of semantic segmentation.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

此处所说明的附图用来提供对本公开的进一步理解,构成本公开的一部分,本公开的示意性实施例及其说明用于解释本公开,并不构成对本公开的不当限定。在附图中:The drawings described herein are used to provide a further understanding of the present disclosure and constitute a part of the present disclosure. The illustrative embodiments of the present disclosure and their descriptions are used to explain the present disclosure and do not constitute an improper limitation on the present disclosure. In the drawings:

图1为本公开实施例提供的一种基于注意力掩码的弱监督语义分割方法的流程示意图;FIG1 is a flow chart of a weakly supervised semantic segmentation method based on attention mask provided by an embodiment of the present disclosure;

图2为本公开实施例提供的一种掩码策略的示意图;FIG2 is a schematic diagram of a masking strategy provided by an embodiment of the present disclosure;

图3为本公开实施例提供的一种生成目标掩码矩阵的示意图;FIG3 is a schematic diagram of generating a target mask matrix provided by an embodiment of the present disclosure;

图4为本公开实施例提供的一种模型训练的完整流程示意图;FIG4 is a schematic diagram of a complete process of model training provided by an embodiment of the present disclosure;

图5为本公开实施例提供的一种基于注意力掩码的弱监督语义分割装置示意图;FIG5 is a schematic diagram of a weakly supervised semantic segmentation device based on attention mask provided by an embodiment of the present disclosure;

图6为本公开实施例提供的对应于图1的电子设备示意图。FIG. 6 is a schematic diagram of an electronic device corresponding to FIG. 1 provided in an embodiment of the present disclosure.

具体实施方式DETAILED DESCRIPTION

为使本公开的目的、技术方案和优点更加清楚,下面将结合本公开具体实施例及相应的附图对本公开技术方案进行清楚、完整地描述。显然,所描述的实施例仅是本公开一部分实施例,而不是全部的实施例。基于本公开中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本公开保护的范围。In order to make the purpose, technical solutions and advantages of the present disclosure clearer, the technical solutions of the present disclosure will be clearly and completely described below in combination with the specific embodiments of the present disclosure and the corresponding drawings. Obviously, the described embodiments are only part of the embodiments of the present disclosure, not all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by ordinary technicians in this field without making creative work are within the scope of protection of the present disclosure.

以下结合附图,详细说明本公开各实施例提供的技术方案。The technical solutions provided by various embodiments of the present disclosure are described in detail below in conjunction with the accompanying drawings.

本公开提供一种基于注意力掩码的弱监督语义分割模型的训练方法,该模型的训练方法的执行过程可由服务器等电子设备执行。本公开以服务器执行该模型的训练方法为例进行说明。The present disclosure provides a training method for a weakly supervised semantic segmentation model based on an attention mask, and the execution process of the training method of the model can be executed by an electronic device such as a server. The present disclosure takes the server executing the training method of the model as an example for explanation.

图1为本公开实施例提供的一种基于注意力掩码的弱监督语义分割方法的流程示意图,具体包括以下步骤。FIG1 is a flowchart of a weakly supervised semantic segmentation method based on an attention mask provided by an embodiment of the present disclosure, which specifically includes the following steps.

S100:获取样本图像以及所述样本图像对应的图像分类标签。S100: Obtain a sample image and an image classification label corresponding to the sample image.

S102:将所述样本图像输入到注意力编码器中。S102: Input the sample image into the attention encoder.

S104:通过所述注意力编码器得到补丁令牌特征与全局类别令牌特征,并通过所述补丁令牌特征生成对所述样本图像的图像分类结果以及类激活图,通过所述类激活图得到对所述样本图像的第一语义分割结果,以及对所述补丁令牌特征进行解码,得到第二语义分割结果。S104: obtaining patch token features and global category token features through the attention encoder, and generating an image classification result and a class activation map for the sample image through the patch token features, obtaining a first semantic segmentation result for the sample image through the class activation map, and decoding the patch token features to obtain a second semantic segmentation result.

S106:通过所述注意力编码器生成注意力矩阵,随机生成目标掩码矩阵,并通过所述目标掩码矩阵对所述注意力矩阵进行补偿,得到补偿后的注意力矩阵,根据所述补偿后的注意力矩阵,生成各局部类别令牌特征。S106: Generate an attention matrix through the attention encoder, randomly generate a target mask matrix, and compensate the attention matrix through the target mask matrix to obtain a compensated attention matrix, and generate each local category token feature according to the compensated attention matrix.

在弱监督语义分割中,训练样本对应的标签是指图像中所包含的目标物的类别,而不是像常规的语义分割中将图像中每个像素所属的目标物类别均标注出。In weakly supervised semantic segmentation, the label corresponding to the training sample refers to the category of the target object contained in the image, rather than marking the target object category of each pixel in the image as in conventional semantic segmentation.

基于此,服务器可以获取样本图像以及所述样本图像对应的图像分类标签,该图像分类标签用于表示图像中目标物对应的类别。Based on this, the server can obtain a sample image and an image classification label corresponding to the sample image, where the image classification label is used to indicate a category corresponding to a target object in the image.

可以将样本图像输入到注意力编码器中,通过该注意力编码器得到补丁令牌(Patch Token)特征与全局类别令牌(Class Token)特征,并通过补丁令牌特征,生成对样本图像的图像分类结果以及类激活图,通过类激活图得到对样本图像的第一语义分割结果,以及对补丁令牌特征进行解码,得到第二语义分割结果。该注意力编码器可以是Transformer模型中的注意力编码器,Transformer模型中的注意力编码器使用自注意力机制,包括多个自注意力层。The sample image can be input into the attention encoder, through which the patch token feature and the global class token feature are obtained, and the image classification result and the class activation map of the sample image are generated through the patch token feature, and the first semantic segmentation result of the sample image is obtained through the class activation map, and the patch token feature is decoded to obtain the second semantic segmentation result. The attention encoder can be an attention encoder in a Transformer model, and the attention encoder in the Transformer model uses a self-attention mechanism, including multiple self-attention layers.

并且,可以将样本图像输入到注意力编码器中,生成注意力矩阵以及随机生成目标掩码矩阵,并通过目标掩码矩阵对注意力矩阵进行补偿,得到补偿后的注意力矩阵,根据补偿后的注意力矩阵,生成局部类别令牌特征。In addition, the sample image can be input into the attention encoder to generate an attention matrix and a randomly generated target mask matrix, and the attention matrix is compensated by the target mask matrix to obtain a compensated attention matrix, and local category token features are generated according to the compensated attention matrix.

上述两个过程的区别在于,一个是将样本图像输入到注意力编码器中,没有使用到目标掩码矩阵(可以被称为第一过程),一个是将样本图像输入到注意力编码器中,使用到了目标掩码矩阵(可以被称为第二过程)。目标掩码矩阵的作用在于在注意力机制中使用掩码,以重点提取图像中的局部信息,这是由于Transformer模型中的注意力机制具有稀疏的特性。而注意力编码器会提取出各个特征之间的关联,但这样提取可能会提取出不必要的关联,因此,本方法通过掩码(目标掩码矩阵),可以覆盖住部分关联,不考虑这些关联来得到最终的类别令牌特征。The difference between the above two processes is that one is to input the sample image into the attention encoder without using the target mask matrix (which can be called the first process), and the other is to input the sample image into the attention encoder and use the target mask matrix (which can be called the second process). The role of the target mask matrix is to use the mask in the attention mechanism to focus on extracting local information in the image. This is because the attention mechanism in the Transformer model has the sparse characteristics. The attention encoder will extract the associations between each feature, but this extraction may extract unnecessary associations. Therefore, this method can cover some of the associations through the mask (target mask matrix) and obtain the final category token features without considering these associations.

上述将样本图像输入到注意力编码器中,可以有不同的方法。例如,可以是分次输入,如先将样本图像输入到没有使用到目标掩码矩阵的注意力编码器中,然后再将样本图像输入到使用到了目标掩码矩阵的注意力编码器中。又例如,可以设置注意力编码器不同分支,一个分支是不包括目标掩码矩阵的第一子编码器,一个分支是包括目标掩码矩阵的第二子编码器,可以将样本图像同时输入到第一子编码器和第二子编码器中。There are different ways to input the sample image into the attention encoder. For example, it can be input in stages: the sample image is first input into the attention encoder that does not use the target mask matrix, and then input into the attention encoder that uses the target mask matrix. For another example, different branches of the attention encoder can be set up, one branch being a first sub-encoder that does not include the target mask matrix and the other branch being a second sub-encoder that includes the target mask matrix, and the sample image can be input into the first sub-encoder and the second sub-encoder simultaneously.

需要说明的是,在一些例子中,可以生成一次目标掩码矩阵,并将样本图像输入到注意力编码器中,通过此次生成的目标掩码矩阵对注意力矩阵进行补偿,从而得到一种局部类别令牌特征。It should be noted that in some examples, a target mask matrix can be generated once, and the sample image can be input into the attention encoder. The attention matrix is compensated by the target mask matrix generated this time, thereby obtaining a local category token feature.

在一些例子中,可以多次随机生成目标掩码矩阵,每次生成的目标掩码矩阵不同,可以多次将样本图像输入到注意力编码器中,每一次都通过一种目标掩码矩阵对注意力矩阵进行补偿,可以得到多种局部类别令牌特征。例如,可以随机生成多个目标掩码矩阵,将该样本图像输入到注意力编码器中,得到注意力矩阵,并针对每个目标掩码矩阵,通过该目标掩码矩阵对注意力矩阵进行补偿,得到对应的补偿后的注意力矩阵,根据补偿后的注意力矩阵,生成该目标掩码矩阵对应的局部类别令牌特征。最终提取出的局部类别令牌特征存在有多种局部类别令牌。In some examples, the target mask matrix can be randomly generated multiple times, each time the generated target mask matrix is different, the sample image can be input into the attention encoder multiple times, each time the attention matrix is compensated by a target mask matrix, and multiple local category token features can be obtained. For example, multiple target mask matrices can be randomly generated, the sample image can be input into the attention encoder to obtain the attention matrix, and for each target mask matrix, the attention matrix is compensated by the target mask matrix to obtain the corresponding compensated attention matrix, and the local category token feature corresponding to the target mask matrix is generated according to the compensated attention matrix. The local category token features finally extracted have multiple local category tokens.

下面对上述两个过程分别进行详细的说明。The above two processes are described in detail below.

需要说明的是,可以将样本图像分割为若干子图像,以确定出若干子图像对应的多个初始补丁令牌特征,以及确定出该样本图像对应的初始类别令牌特征(初始类别令牌特征可以是随机初始化的),并将多个初始补丁令牌特征与初始类别令牌特征输入到注意力编码器中(上述两个过程在将样本图像输入到注意力编码器中时是相同的)。在一些例子中,将多个初始补丁令牌特征与初始类别令牌特征输入到注意力编码器中,可以是将维度为N*D的多个初始补丁令牌特征与维度为1*D的初始类别令牌特征,沿第一个维度拼接,从而得到(N+1)*D维度的令牌,并输入到注意力编码器中,其中D表示令牌嵌入的维度。It should be noted that the sample image can be divided into several sub-images to determine multiple initial patch token features corresponding to the several sub-images, and to determine the initial category token features corresponding to the sample image (the initial category token features can be randomly initialized), and the multiple initial patch token features and the initial category token features are input into the attention encoder (the above two processes are the same when the sample image is input into the attention encoder). In some examples, the multiple initial patch token features and the initial category token features are input into the attention encoder, which can be a concatenation of multiple initial patch token features of dimension N*D and initial category token features of dimension 1*D along the first dimension, thereby obtaining a token of dimension (N+1)*D, and inputting it into the attention encoder, where D represents the dimension of token embedding.
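The concatenation described above can be sketched as follows (a minimal NumPy illustration; the sizes N and D are toy values chosen for this sketch, and the actual embodiment operates on learned tensors):

```python
import numpy as np

# Toy dimensions for illustration (N and D are far larger in practice).
N, D = 4, 8
patch_tokens = np.zeros((N, D))   # N x D initial patch token features
class_token = np.zeros((1, D))    # 1 x D initial class token feature

# Concatenate along the first dimension to obtain the (N+1) x D token sequence
# that is fed into the attention encoder.
tokens = np.concatenate([class_token, patch_tokens], axis=0)
assert tokens.shape == (N + 1, D)
```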

注意力编码器可以针对初始补丁令牌特征和初始类别令牌特征,通过注意力机制进行注意力加权,从而得到加权后的补丁令牌特征和类别令牌特征,在上述两个过程中的第一个过程得到的补丁令牌特征和全局类别令牌特征即是加权后的,在第二个过程中不需要最终获取到补丁令牌特征(中间过程存在有补丁令牌特征),因此,只获取得到了各局部类别令牌特征,各局部类别令牌特征是加权后的。The attention encoder can perform attention weighting on the initial patch token features and the initial category token features through the attention mechanism, so as to obtain weighted patch token features and category token features. The patch token features and global category token features obtained in the first process of the above two processes are weighted. In the second process, there is no need to finally obtain the patch token features (patch token features exist in the intermediate process). Therefore, only the features of each local category token are obtained, and the features of each local category token are weighted.

示例性的,设输入图像为$I\in\mathbb{R}^{3\times H\times W}$,视觉Transformer编码器可以将$I$分割为$N=W'\times H'$个不重叠的补丁(即,子图像),补丁可表示为$I'\in\mathbb{R}^{N\times(3P^2)}$,其中,$P$是补丁大小,$H$是输入图像的高度,$W$是输入图像的宽度。然后将补丁$I'$平铺并投影得到$N$个补丁令牌(即,Patch Token),补丁令牌可表示为$T\in\mathbb{R}^{N\times D}$,其中$D$表示令牌嵌入的维度。令牌嵌入与相关的可学习类别令牌(即,Class Token)连接并输入标准Transformer编码器。在一些例子中,多个补丁令牌与一个相关的可学习类别令牌进行连接。此外,需要在补丁令牌顶部添加位置嵌入(即,Positional Embedding)。这里提到的补丁令牌为初始补丁令牌特征,可学习类别令牌为初始类别令牌特征。Exemplarily, let the input image be $I\in\mathbb{R}^{3\times H\times W}$. The vision Transformer encoder can split $I$ into $N=W'\times H'$ non-overlapping patches (i.e., sub-images), which can be represented as $I'\in\mathbb{R}^{N\times(3P^2)}$, where $P$ is the patch size, $H$ is the height of the input image, and $W$ is the width of the input image. The patches $I'$ are then flattened and projected to obtain $N$ patch tokens, which can be expressed as $T\in\mathbb{R}^{N\times D}$, where $D$ denotes the dimension of the token embedding. The token embeddings are concatenated with an associated learnable class token and input into a standard Transformer encoder. In some examples, multiple patch tokens are concatenated with one associated learnable class token. In addition, positional embeddings need to be added on top of the patch tokens. The patch tokens mentioned here are the initial patch token features, and the learnable class token is the initial class token feature.
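The patch-splitting step can be sketched as follows (a NumPy illustration under the shapes stated above; the values of C, H, W and P here are toy choices):

```python
import numpy as np

# Toy sizes: a 3 x H x W image split into non-overlapping P x P patches.
C, H, W, P = 3, 8, 8, 4
image = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

Hp, Wp = H // P, W // P          # H', W'
N = Hp * Wp                      # number of patches N = W' x H'
# Rearrange the image into N patches, each flattened to 3 * P^2 values.
patches = (image.reshape(C, Hp, P, Wp, P)
                .transpose(1, 3, 0, 2, 4)   # -> H', W', C, P, P
                .reshape(N, C * P * P))     # I' in R^{N x (3 P^2)}
assert patches.shape == (N, 3 * P * P)
# The first patch is exactly the top-left P x P block of the image.
assert np.array_equal(patches[0], image[:, :P, :P].reshape(-1))
```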

注意力编码器中可以包含有多个自注意力编码层,即,Transformer层。The attention encoder can contain multiple self-attention encoding layers, i.e., Transformer layers.

具体来说,在Transformer层$l$(在Transformer模型中,层$l$可以是自注意力编码层)中,首先应用线性可学习变换将令牌序列映射到查询矩阵$Q^{(l)}\in\mathbb{R}^{(N+1)\times D_k}$、键矩阵$K^{(l)}\in\mathbb{R}^{(N+1)\times D_k}$和值矩阵$V^{(l)}\in\mathbb{R}^{(N+1)\times D_v}$。这里,令牌序列是指上文提及的在补丁令牌顶部添加位置嵌入并将补丁令牌与类别令牌拼接后的序列,在一些例子中,可以通过一个简单的线性层对此令牌序列进行映射操作。Specifically, in Transformer layer $l$ (in the Transformer model, layer $l$ can be a self-attention encoding layer), a learnable linear transformation is first applied to map the token sequence to the query matrix $Q^{(l)}\in\mathbb{R}^{(N+1)\times D_k}$, the key matrix $K^{(l)}\in\mathbb{R}^{(N+1)\times D_k}$ and the value matrix $V^{(l)}\in\mathbb{R}^{(N+1)\times D_v}$. Here, the token sequence refers to the sequence mentioned above, obtained by adding positional embeddings on top of the patch tokens and concatenating the patch tokens with the class token; in some examples, this token sequence can be mapped through a simple linear layer.

这里,$D_k$表示$Q$和$K$的维度,而$D_v$是$V$的维度。通过计算查询矩阵与键矩阵之间的乘积并随后除以$\sqrt{D_k}$来实现自注意力机制。这个过程得到了连续的注意力矩阵$A^{(l)}$,包含每一层中成对的全局关系。输出矩阵$Z^{(l)}$作为$V^{(l)}$的加权混合产生,权重来自经过softmax操作(归一化操作)的注意力矩阵$A'^{(l)}=\mathrm{softmax}\big(A^{(l)}\big)$。上述关联可以通过如下公式(1)和(2)表达:Here, $D_k$ denotes the dimension of $Q$ and $K$, while $D_v$ is the dimension of $V$. The self-attention mechanism is implemented by computing the product of the query matrix and the key matrix and then dividing by $\sqrt{D_k}$. This process yields a continuous attention matrix $A^{(l)}$ that contains the pairwise global relationships in each layer. The output matrix $Z^{(l)}$ is produced as a weighted mixture of $V^{(l)}$, with the weights coming from the attention matrix $A'^{(l)}=\mathrm{softmax}\big(A^{(l)}\big)$ after the softmax (normalization) operation. The above relationships can be expressed by the following formulas (1) and (2):

$A^{(l)}=\dfrac{Q^{(l)}\left(K^{(l)}\right)^{\top}}{\sqrt{D_k}}$   (1)

$Z^{(l)}=A'^{(l)}V^{(l)}$   (2)
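Formulas (1) and (2) can be sketched numerically as follows (a single-head NumPy illustration; the sequence length and dimensions are hypothetical toy values):

```python
import numpy as np

def self_attention(Q, K, V):
    """Single-head self-attention following formulas (1)-(2):
    A = Q K^T / sqrt(D_k), A' = row-wise softmax(A), Z = A' V."""
    Dk = Q.shape[-1]
    A = Q @ K.T / np.sqrt(Dk)                  # attention matrix, formula (1)
    A = A - A.max(axis=-1, keepdims=True)      # numerical stability
    Ap = np.exp(A)
    Ap = Ap / Ap.sum(axis=-1, keepdims=True)   # softmax normalization
    return Ap @ V, Ap                          # formula (2)

# Toy token sequence of length N+1 = 5 with D_k = D_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
Z, Ap = self_attention(Q, K, V)
assert Z.shape == (5, 8)
assert np.allclose(Ap.sum(axis=-1), 1.0)   # each row of A' is a distribution
```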

假设每个Transformer层的补丁令牌输出序列被重塑为特征图$Z^{(l)}\in\mathbb{R}^{D\times H'\times W'}$,其中$D$是特征维度,$H'\times W'$是空间维度。然后应用池化和卷积层可以生成图像分类结果(分类logits)。同时,通过特征图和相应分类器的参数$W^{(l)}$之间的矩阵乘法可以生成第$l$层的类激活图$F^{(l)}$,可以通过如下公式(3)表达:Assume that the patch token output sequence of each Transformer layer is reshaped into a feature map $Z^{(l)}\in\mathbb{R}^{D\times H'\times W'}$, where $D$ is the feature dimension and $H'\times W'$ is the spatial dimension. Pooling and convolution layers are then applied to generate the image classification result (classification logits). Meanwhile, the class activation map $F^{(l)}$ of the $l$-th layer can be generated through matrix multiplication between the feature map and the parameters $W^{(l)}$ of the corresponding classifier, which can be expressed by the following formula (3):

$F^{(l)}=W^{(l)}Z^{(l)}$   (3)
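Formula (3) can be illustrated as follows (a NumPy sketch with toy sizes; `W_cls` here stands for the classifier parameters $W^{(l)}$, assumed to be of shape C x D so that the product yields one activation map per class):

```python
import numpy as np

# Toy sizes: D-dimensional features on an H' x W' grid, C classes.
D, Hp, Wp, C = 8, 2, 2, 3
rng = np.random.default_rng(0)
Z = rng.standard_normal((D, Hp * Wp))   # reshaped patch-token feature map
W_cls = rng.standard_normal((C, D))     # classifier parameters (hypothetical shape)

# Matrix multiplication of formula (3), reshaped back onto the spatial grid.
F = (W_cls @ Z).reshape(C, Hp, Wp)      # class activation maps
assert F.shape == (C, Hp, Wp)
```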

需要说明的是,上述过程是注意力编码器中一层的注意力加权过程。It should be noted that the above process is the attention weighting process of one layer in the attention encoder.

将Transformer编码器中最后一层得到的类激活图通过像素精炼模块可以生成伪分割标签(第一语义分割结果)。在一些例子中,如果这个Transformer编码器总共l层,那么这里得到的类激活图就是Transformer编码器的第l层输出的类激活图。像素精炼模块可以根据输入图片中的像素信息以及邻域的空间信息对类激活图进行改进,从而确保相似图像外观的相邻像素具有相同语义。将Transformer编码器中最后一层输出的补丁令牌重塑而成特征图,传入解码器中可以生成预测分割标签(第二语义分割结果)。该解码器可以由两个3×3卷积层(扩张率为5)和一个1×1线性层组成。The class activation map obtained from the last layer of the Transformer encoder can be passed through the pixel refinement module to generate a pseudo segmentation label (the first semantic segmentation result). In some examples, if the Transformer encoder has a total of l layers, the class activation map obtained here is the one output by the l-th layer of the Transformer encoder. The pixel refinement module can improve the class activation map based on the pixel information in the input image and the spatial information of the neighborhood, thereby ensuring that adjacent pixels with similar image appearance share the same semantics. The patch tokens output by the last layer of the Transformer encoder are reshaped into a feature map and passed into a decoder to generate the predicted segmentation label (the second semantic segmentation result). The decoder can be composed of two 3×3 convolutional layers (with a dilation rate of 5) and a 1×1 linear layer.

上述过程是在注意力机制中不添加掩码的过程,在下面说明在注意力机制中添加掩码的过程,如图2、3所示。图2为本公开实施例提供的一种掩码策略的示意图。图3为本公开实施例提供的一种生成目标掩码矩阵的示意图。The above process is a process in which a mask is not added in the attention mechanism. The process of adding a mask in the attention mechanism is described below, as shown in Figures 2 and 3. Figure 2 is a schematic diagram of a mask strategy provided by an embodiment of the present disclosure. Figure 3 is a schematic diagram of generating a target mask matrix provided by an embodiment of the present disclosure.

在Transformer层$l$中,引入了一个注意力掩码$M^{(l)}\in\{0,1\}^{N\times N}$(针对第$l$层的目标掩码矩阵)作为注意力矩阵的开关,从而得到一个正则化的注意力矩阵$\tilde{A}^{(l)}$。输出矩阵$Z^{(l)}$作为$V^{(l)}$的加权混合产生,权重来自经过softmax操作(归一化操作)的正则化的注意力矩阵。此时的输出矩阵$Z^{(l)}$包括了局部类别令牌。上述关联可以通过如下公式(4)和(5)表达(其中掩码取值1表示被屏蔽的位置):In Transformer layer $l$, an attention mask $M^{(l)}\in\{0,1\}^{N\times N}$ (the target mask matrix for layer $l$) is introduced as a switch on the attention matrix, yielding a regularized attention matrix $\tilde{A}^{(l)}$. The output matrix $Z^{(l)}$ is produced as a weighted mixture of $V^{(l)}$, with the weights coming from the regularized attention matrix after the softmax (normalization) operation. The output matrix $Z^{(l)}$ now includes the local class token. The above relationships can be expressed by the following formulas (4) and (5), where a mask value of 1 denotes a masked position:

$\tilde{A}^{(l)}_{ij}=\begin{cases}A^{(l)}_{ij}, & M^{(l)}_{ij}=0\\ -\infty, & M^{(l)}_{ij}=1\end{cases}$   (4)

$Z^{(l)}=\mathrm{softmax}\big(\tilde{A}^{(l)}\big)V^{(l)}$   (5)
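The masked attention of formulas (4) and (5) can be sketched as follows (a NumPy illustration using the 1 = masked convention described in this disclosure; the toy matrices are hypothetical):

```python
import numpy as np

def masked_attention(A, M, V):
    """Masked self-attention following formulas (4)-(5): entries of the
    attention matrix A whose mask bit is 1 are set to -inf before the
    softmax, so masked keys receive exactly zero attention weight."""
    A_reg = np.where(M == 1, -np.inf, A)            # formula (4)
    A_reg = A_reg - A_reg.max(axis=-1, keepdims=True)
    Ap = np.exp(A_reg)
    Ap = Ap / Ap.sum(axis=-1, keepdims=True)        # softmax normalization
    return Ap @ V                                   # formula (5)

# Toy example: mask out the last key for every query.
A = np.zeros((3, 3))
M = np.zeros((3, 3), dtype=int)
M[:, 2] = 1                     # drop the third "column" (key)
V = np.eye(3)
Z = masked_attention(A, M, V)
assert np.allclose(Z[:, 2], 0.0)        # the masked key contributes nothing
assert np.allclose(Z.sum(axis=-1), 1.0)
```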

掩码策略的细节如图2和图3所示。具体来说,掩码是通过在注意力矩阵中随机丢弃"列"来实现的。对注意力矩阵中的"列"进行丢弃,也可以说是对键进行丢弃,此处的键是指键矩阵$K^{(l)}$中的元素。注意,要屏蔽的键是从伯努利分布中独立抽取的,掩码比率$p\in(0,1)$,也就是说对键中元素的抽样服从伯努利分布。此处的伯努利分布即为数学中试验次数为1的二项分布(0-1分布)。可以通过在不同下采样分辨率上独立抽取要屏蔽的键,再进行同样倍数的上采样来获得掩码。下采样也称抽取,对于一个样值序列间隔几个样值取样一次,这样得到新序列就是原序列的下采样。上采样是下采样的逆过程,也称增取样或内插。在本实施例中,可以先确定注意力矩阵的尺寸,按照尺寸进行预设倍数的下采样,得到采样尺寸,并随机生成大小为采样尺寸的初始掩码矩阵,再将初始掩码矩阵进行预设倍数的上采样,得到目标掩码矩阵。该策略可以在不同分辨率下使未被掩码覆盖的键形成连续区域。此时生成的掩码是在令牌维度上,随后扩展到注意力矩阵的形状以形成$M\in\{0,1\}^{N\times N}$。The details of the masking strategy are shown in FIG. 2 and FIG. 3. Specifically, masking is achieved by randomly dropping "columns" in the attention matrix. Dropping columns in the attention matrix can also be described as dropping keys, where a key refers to an element of the key matrix $K^{(l)}$. Note that the keys to be masked are drawn independently from a Bernoulli distribution with mask ratio $p\in(0,1)$; that is, the sampling of the key elements follows a Bernoulli distribution. The Bernoulli distribution here is the binomial distribution with a single trial (i.e., the 0-1 distribution). The mask can be obtained by independently drawing the keys to be masked at a lower, downsampled resolution and then upsampling by the same factor. Downsampling, also called decimation, samples a value sequence once every few values, so the new sequence obtained is a downsampled version of the original sequence. Upsampling is the inverse process of downsampling, also known as interpolation. In this embodiment, the size of the attention matrix can first be determined, downsampling by a preset factor is performed according to this size to obtain a sampling size, an initial mask matrix of the sampling size is randomly generated, and the initial mask matrix is then upsampled by the preset factor to obtain the target mask matrix. This strategy allows the keys not covered by the mask to form contiguous regions at different resolutions. The mask generated at this point is at the token level, and is subsequently expanded to the shape of the attention matrix to form $M\in\{0,1\}^{N\times N}$.
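The down-sample/up-sample masking strategy can be sketched as follows (a NumPy illustration; the grid size, down-sampling factor and mask ratio are hypothetical choices, and nearest-neighbour repetition stands in for the upsampling step):

```python
import numpy as np

def generate_key_mask(n_tokens, grid, down, p, rng):
    """Sketch of the masking strategy: draw a Bernoulli(p) mask at a
    `down`-times lower resolution, then upsample it by nearest-neighbour
    repetition so that unmasked keys form contiguous regions (1 = masked).
    The token-level mask is finally broadcast to the attention-matrix shape."""
    h, w = grid
    coarse = (rng.random((h // down, w // down)) < p).astype(int)
    key_mask = np.kron(coarse, np.ones((down, down), dtype=int))  # upsample
    key_mask = key_mask[:h, :w].reshape(-1)                       # token level
    # Every query row drops the same keys, i.e. whole "columns" are dropped.
    return np.tile(key_mask, (n_tokens, 1))

rng = np.random.default_rng(0)
M = generate_key_mask(16, (4, 4), down=2, p=0.5, rng=rng)
assert M.shape == (16, 16)
assert set(np.unique(M)) <= {0, 1}
assert (M == M[0]).all()   # identical rows: columns of the attention matrix are dropped
```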

S108:根据辅助类激活图以及所述目标掩码矩阵,确定未被所述目标掩码矩阵中的掩码影响的类激活图中部分激活值的平均激活值,若所述平均激活值大于预设阈值,将所述局部类别令牌特征作为正局部类别令牌特征,若所述平均激活值小于或等于预设阈值,将该局部类别令牌特征作为负局部类别令牌特征。S108: According to the auxiliary class activation map and the target mask matrix, determine the average activation value of the activation values in the class activation map that are not affected by the mask in the target mask matrix; if the average activation value is greater than a preset threshold, the local category token feature is used as a positive local category token feature, and if the average activation value is less than or equal to the preset threshold, the local category token feature is used as a negative local category token feature.

S110:以最小化总体损失为优化目标,对所述注意力编码器进行训练,所述总体损失包括:所述图像分类结果与所述图像分类标签之间的差异、所述第一语义分割结果与所述第二语义分割结果之间的差异、所述正局部类别令牌特征与所述全局类别令牌特征之间的差异,以及所述负局部类别令牌特征与所述全局类别令牌特征之间的差异。S110: The attention encoder is trained with minimizing the overall loss as the optimization goal, and the overall loss includes: the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local category token feature and the global category token feature, and the difference between the negative local category token feature and the global category token feature.

在一些例子中,总体损失为所述图像分类结果与所述图像分类标签之间的差异、所述第一语义分割结果与所述第二语义分割结果之间的差异、所述正局部类别令牌特征与所述全局类别令牌特征之间的差异,以及所述负局部类别令牌特征与所述全局类别令牌特征之间的差异之和。In some examples, the overall loss is the sum of the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local class token feature and the global class token feature, and the difference between the negative local class token feature and the global class token feature.

在一些例子中,进一步的,训练好的注意力编码器可以对待识别图像进行图像语义分割。In some examples, the trained attention encoder can further perform image semantic segmentation on the image to be recognized.

确定出局部类别令牌特征后,可以按照类似对比学习的方式,通过局部类别令牌特征和全局类别令牌特征进行训练,即,对注意力编码器进行训练的损失包含与局部类别令牌特征相关的损失。After the local category token features are determined, they can be trained with the local category token features and the global category token features in a manner similar to contrastive learning, i.e., the loss for training the attention encoder includes the loss associated with the local category token features.

具体的,可以根据辅助类激活图以及目标掩码矩阵,确定未被目标掩码矩阵中的掩码影响的类激活图中部分激活值的平均激活值,若该平均激活值大于预设阈值,可以将局部类别令牌特征作为正局部类别令牌特征,否则,将该局部类别令牌特征作为负局部类别令牌特征。在一些例子中,通过辅助类激活图和掩码生成策略生成的掩码,以对得到的类激活图没被掩码遮蔽的区域计算平均激活值,并决定正/负标签。注意力掩码的目的是切断被掩码遮蔽的令牌从而产生的区域之间在做注意力操作时的联系。Specifically, the average activation value of some activation values in the class activation map that are not affected by the mask in the target mask matrix can be determined based on the auxiliary class activation map and the target mask matrix. If the average activation value is greater than a preset threshold, the local class token feature can be used as a positive local class token feature, otherwise, the local class token feature is used as a negative local class token feature. In some examples, the mask generated by the auxiliary class activation map and the mask generation strategy is used to calculate the average activation value of the area of the obtained class activation map that is not masked by the mask, and determine the positive/negative label. The purpose of the attention mask is to cut off the connection between the areas generated by the tokens masked by the mask when performing the attention operation.

基于此,与局部类别令牌特征相关的损失可以为:正局部类别令牌特征与全局类别令牌特征之间的差异,以及负局部类别令牌特征与全局类别令牌特征之间的差异。其中,正局部类别令牌特征与全局类别令牌特征之间的差异越小越好,而负局部类别令牌特征与全局类别令牌特征之间的差异越大越好。Based on this, the loss associated with the local category token feature can be: the difference between the positive local category token feature and the global category token feature, and the difference between the negative local category token feature and the global category token feature. Among them, the smaller the difference between the positive local category token feature and the global category token feature, the better, and the larger the difference between the negative local category token feature and the global category token feature, the better.

下面先对模型训练的其他损失进行介绍,再对具体如何确定正局部类别令牌特征和负局部类别令牌特征进行说明。本方法提供的模型训练的完整流程可以如图4所示。图4为本公开实施例提供的一种模型训练的完整流程示意图。The following first introduces other losses of model training, and then explains how to determine positive local category token features and negative local category token features. The complete process of model training provided by this method can be shown in Figure 4. Figure 4 is a schematic diagram of a complete process of model training provided by an embodiment of the present disclosure.

从图4中可以看出,模型训练包含的损失还存在有分类损失和分割损失。进一步的,还可以包括辅助分类损失和亲和力损失。As can be seen from Figure 4, the losses included in model training also include classification loss and segmentation loss. Furthermore, it can also include auxiliary classification loss and affinity loss.

其中,分类损失为图像分类结果与图像分类标签之间的差异,具体可通过以下公式(6)进行计算。可以将补丁令牌特征输入池化和分类层以计算类概率向量$p^{cls}$,然后使用多标签软边际损失作为分类函数。分类损失$L_{cls}$的计算方式表达如下:The classification loss is the difference between the image classification result and the image classification label, which can be calculated by the following formula (6). The patch token features can be input into the pooling and classification layers to compute the class probability vector $p^{cls}$, and the multi-label soft margin loss is then used as the classification function. The classification loss $L_{cls}$ is expressed as follows:

$L_{cls}=-\dfrac{1}{C}\sum\limits_{c=1}^{C}\left[y_{c}\log p^{cls}_{c}+\left(1-y_{c}\right)\log\left(1-p^{cls}_{c}\right)\right]$   (6)
其中,C是类别总数,y是图像级别的真实标签。类似的,可以通过上述公式计算辅助分类损失。Where C is the total number of categories and y is the true label at the image level. Similarly, the auxiliary classification loss can be calculated using the above formula.
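Formula (6) can be illustrated as follows (a NumPy sketch that applies a sigmoid to raw class logits to obtain the probability vector $p^{cls}$; the logits and labels are toy values):

```python
import numpy as np

def multilabel_soft_margin_loss(logits, y):
    """Multi-label soft margin loss of formula (6): a sigmoid binary
    cross-entropy averaged over the C classes."""
    p = 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> class probability vector
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical 4-class example: classes 0 and 2 are present in the image.
logits = np.array([3.0, -2.0, 1.5, -4.0])
y = np.array([1.0, 0.0, 1.0, 0.0])
loss = multilabel_soft_margin_loss(logits, y)
assert loss > 0
# Confident, correct logits give a smaller loss than the flipped labels.
assert loss < multilabel_soft_margin_loss(logits, 1 - y)
```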

辅助分类损失可以通过模型中间层输出的结果确定,具体的,可以获取注意力编码器的自注意力编码层中目标自注意力编码层的输出结果,示例性的,可以通过中间的某一自注意力编码层(如,注意力编码器共有12层自注意力编码层时,目标自注意力编码层可以是第10层)输出的中间补丁令牌特征,而后,通过该输出结果,确定辅助分类结果,进而以最小化辅助分类结果与图像分类标签之间的差异为优化目标,对该注意力编码器进行训练。即,辅助分类损失为辅助分类结果与图像分类标签之间的差异。The auxiliary classification loss can be determined by the output result of the intermediate layer of the model. Specifically, the output result of the target self-attention coding layer in the self-attention coding layer of the attention encoder can be obtained. For example, the intermediate patch token features output by a certain intermediate self-attention coding layer (e.g., when the attention encoder has 12 self-attention coding layers, the target self-attention coding layer can be the 10th layer) can be obtained. Then, the auxiliary classification result is determined by the output result, and then the attention encoder is trained with the optimization goal of minimizing the difference between the auxiliary classification result and the image classification label. That is, the auxiliary classification loss is the difference between the auxiliary classification result and the image classification label.

分割损失为第一语义分割结果与第二语义分割结果之间的差异,可通过交叉熵进行计算。The segmentation loss is the difference between the first semantic segmentation result and the second semantic segmentation result, which can be calculated by cross entropy.

在确定亲和力损失时,可以获取目标自注意力编码层的输出结果,并根据通过该输出结果确定出的前景背景分割结果,确定第一亲和力矩阵,以及根据补丁令牌特征,确定第二亲和力矩阵,最后以最小化第一亲和力矩阵与第二亲和力矩阵之间的差异为优化目标,对注意力编码器进行训练。When determining the affinity loss, the output result of the target self-attention encoding layer can be obtained, and the first affinity matrix can be determined according to the foreground background segmentation result determined by the output result, and the second affinity matrix can be determined according to the patch token features. Finally, the attention encoder is trained with the optimization goal of minimizing the difference between the first affinity matrix and the second affinity matrix.

即,第一亲和力矩阵是通过中间的自注意力编码层得到的,可以通过中间的自注意力编码层确定出辅助类激活图,并通过该辅助类激活图得到前景背景分割结果,进而确定第一亲和力矩阵。而第二亲和力矩阵则是通过注意力编码器直接得到的,可以将最后一层的自注意力编码层得到的补丁令牌特征,作为第二亲和力矩阵。可以将辅助类激活图处理为伪亲和标签,作为亲和学习的监督。That is, the first affinity matrix is obtained through the middle self-attention encoding layer, and the auxiliary class activation map can be determined through the middle self-attention encoding layer, and the foreground background segmentation result can be obtained through the auxiliary class activation map, and then the first affinity matrix is determined. The second affinity matrix is obtained directly through the attention encoder, and the patch token features obtained by the last layer of the self-attention encoding layer can be used as the second affinity matrix. The auxiliary class activation map can be processed into a pseudo affinity label as supervision for affinity learning.

从类激活图的激活值准确区分前景和背景是困难的,因为类激活图中具有中等置信度的像素不适合标记为注释对象或背景。因此,为了生成可靠的亲和标签,引入两个阈值$\beta_{fg}$和$\beta_{bg}$,满足$0<\beta_{bg}<\beta_{fg}<1$,将辅助类激活图划分为前景、背景和不确定区域。It is difficult to accurately distinguish the foreground from the background using the activation values of the class activation map, because pixels with medium confidence in the class activation map are not suitable for labeling as either annotated objects or background. Therefore, in order to generate reliable affinity labels, two thresholds $\beta_{fg}$ and $\beta_{bg}$ satisfying $0<\beta_{bg}<\beta_{fg}<1$ are introduced to divide the auxiliary class activation map into foreground, background and uncertain regions.

可靠的分割标签$Y'$如下,可以通过公式(7)表达:The reliable segmentation labels $Y'$ are as follows, which can be expressed by formula (7):

$Y'_{ij}=\begin{cases}\arg\max_{c}F^{aux}_{c,ij}, & \max_{c}F^{aux}_{c,ij}\ge\beta_{fg}\\ 0, & \max_{c}F^{aux}_{c,ij}\le\beta_{bg}\\ 255, & \text{otherwise}\end{cases}$   (7)

其中$F^{aux}$为辅助类激活图。where $F^{aux}$ denotes the auxiliary class activation map.
将背景类标记为0,不确定区域标记为255。然后根据这个分割标签构建第一亲和力矩阵。具体来说,在可靠的分割标签Y’中选取任意两个像素点可构成像素对,如果从分割标签中采样的像素对具有相同的语义(例如,像素对中的像素都来自前景或都来自背景),则确定亲和性为正,否则,它们的亲和性被视为负。当像素从不确定区域采样时,亲和性将被忽略。The background class is marked as 0 and the uncertain region is marked as 255. Then the first affinity matrix is constructed based on this segmentation label. Specifically, any two pixels in the reliable segmentation label Y’ are selected to form a pixel pair. If the pixel pairs sampled from the segmentation label have the same semantics (for example, the pixels in the pixel pair are all from the foreground or all from the background), the affinity is determined to be positive, otherwise, their affinity is considered negative. When the pixel is sampled from the uncertain region, the affinity will be ignored.
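The labeling of formula (7) can be sketched as follows (a NumPy illustration; this sketch assumes foreground classes are indexed from 1, since label 0 is reserved for the background class as stated above, and uses toy activation values):

```python
import numpy as np

def reliable_labels(cam, beta_bg, beta_fg):
    """Sketch of formula (7): split an auxiliary CAM (C x H' x W', values
    in [0, 1]) into foreground (class index), background (0) and uncertain
    (255) regions. Assumption of this sketch: foreground classes are
    indexed from 1, since 0 is reserved for the background class."""
    peak = cam.max(axis=0)                               # per-pixel max activation
    labels = np.full(peak.shape, 255, dtype=np.int64)    # uncertain region
    labels[peak <= beta_bg] = 0                          # reliable background
    fg = peak >= beta_fg
    labels[fg] = cam.argmax(axis=0)[fg] + 1              # reliable foreground
    return labels

# Toy single-class CAM over a 1 x 3 grid: confident object, intermediate
# activation, confident background.
cam = np.array([[[0.9, 0.5, 0.1]]])
Y_rel = reliable_labels(cam, beta_bg=0.2, beta_fg=0.7)
assert Y_rel.tolist() == [[1, 255, 0]]
```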

然后,上述确定出的第一亲和力矩阵被用作监督,以促进来自Transformer编码器最后一层的补丁令牌的表示。此外,可以使用余弦相似度来衡量两个最终补丁令牌之间的预测亲和性。因此,可以通过如下公式(8)计算亲和力损失Laff
The first affinity matrix determined above is then used as supervision to facilitate the representation of patch tokens from the last layer of the Transformer encoder. In addition, the cosine similarity can be used to measure the predicted affinity between two final patch tokens. Therefore, the affinity loss L aff can be calculated by the following formula (8):

其中T^(L)=Γ_{D×H′×W′→D×N}(Z^(L)),Y=Γ_{H′×W′→N}(Y′)。Γ(·)是重塑运算符,cos(·,·)表示余弦函数,N+/N-计算正/负样本的数量。这个目标函数直接鼓励具有正关系的最终补丁令牌特征更相似,否则更具区别性。根据链式法则,它也有助于学习来自早期Transformer层的令牌表示。where T^(L) = Γ_{D×H′×W′→D×N}(Z^(L)) and Y = Γ_{H′×W′→N}(Y′). Γ(·) is the reshape operator, cos(·,·) represents the cosine function, and N+/N- count the numbers of positive/negative samples. This objective function directly encourages final patch token features with positive relationships to be more similar, and otherwise to be more discriminative. According to the chain rule, it also helps to learn token representations from early Transformer layers.

在引入了上述分割标签Y′之后,说明如何确定正局部类别令牌特征和负局部类别令牌特征,确定Y′t的方式与分割标签Y′类似,但不完全相同。After introducing the above segmentation label Y′, how to determine the positive and negative local category token features is explained. The way Y′t is determined is similar to that of the segmentation label Y′, but not exactly the same.

具体来说,设Mt=ΓN→H′×W′(Mi:)为从相应的注意力掩码派生的键掩码,其中i是任意行索引,Γ(·)表示重塑运算符。这里我们用1表示屏蔽的令牌,0表示未屏蔽的令牌。直观地说,如果剩余令牌的平均激活值较高,则它们可能属于语义对象。因此,我们通过以下公式(9)区分正性和负性。
Specifically, let M t = Γ N → H′×W′ (M i: ) be the key mask derived from the corresponding attention mask, where i is an arbitrary row index and Γ(·) represents the reshape operator. Here we use 1 to represent the masked tokens and 0 to represent the unmasked tokens. Intuitively, if the average activation value of the remaining tokens is high, they may belong to the semantic object. Therefore, we distinguish between positivity and negativity by the following formula (9).

其中𝕀(·)是指示函数,⊙表示哈达玛积,Y′t是从辅助类激活图的激活值中派生的离散化令牌级标签,μ与判断正负性的阈值相关联,以下公式(10)给出Y′t的计算方式,给定两个阈值βbg和βfg
where 𝕀(·) is the indicator function, ⊙ denotes the Hadamard product, Y′t is the discretized token-level label derived from the activation values of the auxiliary class activation map, and μ is associated with the threshold used to judge positivity or negativity. Given the two thresholds βbg and βfg, formula (10) below gives the computation of Y′t.
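Formulas (9) and (10) are likewise not reproduced here; the sketch below illustrates the described idea of averaging the discretized activations over the unmasked tokens and comparing against a threshold μ. The discretization rule (confident foreground = 1, confident background = 0, in-between ignored) and the default values are assumptions:

```python
import numpy as np

def is_positive_token(aux_cam, key_mask, beta_bg=0.25, beta_fg=0.55, mu=0.5):
    """Decide whether a masked local class token is positive.
    aux_cam: (H', W') auxiliary CAM activations; key_mask: (H', W') with
    1 = masked token, 0 = unmasked token (convention from the text).
    The exact discretization of formulas (9)/(10) is an assumption."""
    fg = aux_cam >= beta_fg                 # Y'_t = 1: confident foreground
    bg = aux_cam <= beta_bg                 # Y'_t = 0: confident background
    keep = (fg | bg) & (key_mask == 0)      # unmasked, confidently labeled tokens
    if keep.sum() == 0:
        return False
    mean_act = fg[keep].mean()              # average discretized activation
    return bool(mean_act > mu)              # positive if remaining tokens are
                                            # mostly on the semantic object
```

Intuitively, a token is positive when the tokens left unmasked mostly fall on the foreground object, matching the description above.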

可以采用InfoNCE损失函数进行对比学习(即,可以采用InfoNCE损失函数计算与局部类别令牌特征相关的损失)。全局和局部类别令牌特征分别通过全局和局部投影器线性投影到适合对比学习的表示空间。投影器由3个线性层和一个L2归一化层组成。设q为投影的全局类别令牌特征,k+/k-为投影的正/负局部类别令牌特征。训练目标是最小化/最大化全局类别令牌特征与正/负局部类别令牌特征之间的距离,对比损失Lmcc可通过如下公式(11)表示:
The InfoNCE loss function can be used for contrastive learning (i.e., the InfoNCE loss function can be used to calculate the loss associated with the local category token features). The global and local category token features are linearly projected into a representation space suitable for contrastive learning by the global and local projectors, respectively. Each projector consists of three linear layers and an L2 normalization layer. Let q be the projected global category token feature, and k+/k- be the projected positive/negative local category token features. The training goal is to minimize/maximize the distance between the global category token feature and the positive/negative local category token features. The contrastive loss Lmcc can be expressed by the following formula (11):

其中,N+为k+样本的数量,τ是温度因子,ε是一个很小的值,用于确保数值稳定性。全局投影器的参数使用移动平均策略进行更新,即θg←mθg+(1-m)θl,其中,m是动量因子,θg和θl分别是全局投影器和局部投影器的参数。这种对两个投影器参数的缓慢演化更新确保了训练稳定性并强制执行表示一致性。其中全局投影器和局部投影器可以为两个全连接层。where N+ is the number of k+ samples, τ is the temperature factor, and ε is a small value used to ensure numerical stability. The parameters of the global projector are updated using a moving-average strategy, i.e., θg←mθg+(1-m)θl, where m is the momentum factor, and θg and θl are the parameters of the global projector and the local projector, respectively. This slowly evolving update of the two projectors' parameters ensures training stability and enforces representation consistency. The global projector and the local projector can be two fully connected layers.
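The InfoNCE-style contrast and the moving-average projector update described above can be sketched roughly as follows. Since formula (11) is not reproduced in the text, the exact normalization and averaging over the N+ positives are assumptions; the projected features are assumed to be L2-normalized, as implied by the projector's L2 normalization layer:

```python
import numpy as np

def contrastive_loss(q, k_pos, k_neg, tau=0.07, eps=1e-8):
    """InfoNCE-style loss between the projected global token q (D,) and the
    projected positive/negative local tokens (N+, D)/(N-, D), all assumed
    L2-normalized. The exact form of formula (11) is an assumption."""
    pos = np.exp(q @ k_pos.T / tau)          # similarity to each positive
    neg = np.exp(q @ k_neg.T / tau).sum()    # summed similarity to negatives
    # average -log probability over the N+ positive samples;
    # eps keeps the log numerically stable
    return -np.log(pos / (pos + neg + eps)).mean()

def momentum_update(theta_g, theta_l, m=0.999):
    """Moving-average update of the global projector parameters:
    theta_g <- m * theta_g + (1 - m) * theta_l."""
    return m * theta_g + (1 - m) * theta_l
```

Minimizing this loss pulls q toward the positive local tokens and pushes it away from the negative ones, while the slow momentum update keeps the two projectors' representations consistent.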

需要说明的是,上述阐述基本是针对模型训练过程进行阐述的,在训练完成后,可以通过注意力编码器对未知语义分割结果的图像进行语义分割,具体的,可以将待识别图像输入到训练后的注意力编码器中,得到待识别图像的补丁令牌特征,而后通过解码器得到语义分割结果。It should be noted that the above explanation is basically for the model training process. After the training is completed, the attention encoder can be used to perform semantic segmentation on images with unknown semantic segmentation results. Specifically, the image to be identified can be input into the trained attention encoder to obtain the patch token features of the image to be identified, and then the semantic segmentation result can be obtained through the decoder.

从上述内容中可以看出,总体损失可以是分类损失、分割损失、与局部类别令牌特征相关的损失、辅助分类损失和亲和力损失之和。以最小化总体损失为优化目标,可以是以分类损失、分割损失、与局部类别令牌特征相关的损失、辅助分类损失和亲和力损失之和最小,来对注意力编码器进行训练。From the above, it can be seen that the overall loss can be the sum of classification loss, segmentation loss, loss associated with local category token features, auxiliary classification loss and affinity loss. Taking minimizing the overall loss as the optimization goal, the attention encoder can be trained with the minimum sum of classification loss, segmentation loss, loss associated with local category token features, auxiliary classification loss and affinity loss.

从上述内容中可以看出,本方法在通过注意力编码器进行特征提取时,可以分为两种提取方式,第一种不采用掩码机制,直接得到图像分类结果,以及通过类激活图和补丁令牌特征分别得到的两种语义分割结果,第二种采用掩码机制,通过掩码矩阵对注意力机制进行补充,得到局部类别令牌特征,而后,可以结合对比学习的思想,将局部类别令牌特征区分正负性,从而参与模型训练,因此,模型训练不仅包含有分类损失、分割损失,还包含有与局部类别令牌特征相关的对比学习的损失。进一步的,模型训练还可以包括辅助分类损失和亲和力损失。通过引入多种损失(引入通过多种注意力编码器的中间输出或者各种类型的输出得到的损失),能够更加准确地对图像语义分割的损失进行监督,从而提高语义分割的准确性。From the above content, it can be seen that this method can be divided into two extraction methods when extracting features through the attention encoder. The first method does not use a mask mechanism and directly obtains the image classification result, as well as two semantic segmentation results obtained through the class activation map and the patch token feature respectively. The second method uses a mask mechanism to supplement the attention mechanism through a mask matrix to obtain local category token features. Then, the local category token features can be distinguished between positive and negative based on the idea of contrastive learning, so as to participate in model training. Therefore, model training not only includes classification loss and segmentation loss, but also includes contrastive learning loss related to local category token features. Furthermore, model training can also include auxiliary classification loss and affinity loss. By introducing multiple losses (introducing losses obtained through intermediate outputs of multiple attention encoders or various types of outputs), the loss of image semantic segmentation can be supervised more accurately, thereby improving the accuracy of semantic segmentation.

需要说明的是,为了便于描述,将执行本方法的执行主体作为服务器进行描述,执行主体可以是台式电脑、服务器、大型的服务平台等,在此不进行限定。 It should be noted that, for the sake of ease of description, the execution subject of this method is described as a server. The execution subject can be a desktop computer, a server, a large service platform, etc., which is not limited here.

以上为本公开的一个或多个实施例提供基于注意力掩码的弱监督语义分割方法,基于同样的思路,本公开还提供了基于注意力掩码的弱监督语义分割装置,如图5所示。The above provides a weakly supervised semantic segmentation method based on attention mask for one or more embodiments of the present disclosure. Based on the same idea, the present disclosure also provides a weakly supervised semantic segmentation device based on attention mask, as shown in FIG5 .

图5为本公开实施例提供的一种基于注意力掩码的弱监督语义分割装置示意图,包括:FIG5 is a schematic diagram of a weakly supervised semantic segmentation device based on attention mask provided by an embodiment of the present disclosure, including:

获取模块501,用于获取样本图像以及所述样本图像对应的图像分类标签;An acquisition module 501 is used to acquire a sample image and an image classification label corresponding to the sample image;

输入模块502,用于将所述样本图像输入到注意力编码器中,通过所述注意力编码器得到补丁令牌特征与全局类别令牌特征,并通过所述补丁令牌特征,生成对所述样本图像的图像分类结果以及类激活图,通过所述类激活图得到对所述样本图像的第一语义分割结果,以及对所述补丁令牌特征进行解码,得到第二语义分割结果;An input module 502 is used to input the sample image into an attention encoder, obtain patch token features and global category token features through the attention encoder, generate an image classification result and a class activation map for the sample image through the patch token features, obtain a first semantic segmentation result for the sample image through the class activation map, and decode the patch token features to obtain a second semantic segmentation result;

掩码补偿模块503,用于通过所述注意力编码器生成注意力矩阵,随机生成目标掩码矩阵,并通过所述目标掩码矩阵对所述注意力矩阵进行补偿,得到补偿后的注意力矩阵,根据所述补偿后的注意力矩阵,生成各局部类别令牌特征;A mask compensation module 503 is used to generate an attention matrix through the attention encoder, randomly generate a target mask matrix, and compensate the attention matrix through the target mask matrix to obtain a compensated attention matrix, and generate each local category token feature according to the compensated attention matrix;

判断模块504,用于针对每个局部类别令牌特征,根据辅助类激活图以及所述目标掩码矩阵,确定未被所述目标掩码矩阵中的掩码影响的类激活图中部分激活值的平均激活值,若所述平均激活值大于预设阈值,将所述局部类别令牌特征作为正局部类别令牌特征,若所述平均激活值小于或等于预设阈值,将所述局部类别令牌特征作为负局部类别令牌特征;A judgment module 504 is used to, for each local category token feature, determine, according to the auxiliary class activation map and the target mask matrix, the average activation value of the partial activation values in the class activation map that are not affected by the masks in the target mask matrix; if the average activation value is greater than a preset threshold, the local category token feature is taken as a positive local category token feature; if the average activation value is less than or equal to the preset threshold, the local category token feature is taken as a negative local category token feature;

训练模块505,用于以最小化总体损失为优化目标,对所述注意力编码器进行训练,所述总体损失包括:所述图像分类结果与所述图像分类标签之间的差异、所述第一语义分割结果与所述第二语义分割结果之间的差异、所述正局部类别令牌特征与所述全局类别令牌特征之间的差异,以及所述负局部类别令牌特征与所述全局类别令牌特征之间的差异。A training module 505 is used to train the attention encoder with minimizing the overall loss as the optimization goal, and the overall loss includes: the difference between the image classification result and the image classification label, the difference between the first semantic segmentation result and the second semantic segmentation result, the difference between the positive local category token feature and the global category token feature, and the difference between the negative local category token feature and the global category token feature.

可选地,所述输入模块502具体用于,将所述样本图像分割为若干子图像,以确定出所述若干子图像对应的多个初始补丁令牌特征以及所述样本图像对应的初始类别令牌特征;将所述多个初始补丁令牌特征与所述初始类别令牌特征拼接,输入到所述注意力编码器中。Optionally, the input module 502 is specifically configured to: divide the sample image into several sub-images to determine multiple initial patch token features corresponding to the several sub-images and the initial category token feature corresponding to the sample image; and concatenate the multiple initial patch token features with the initial category token feature and input them into the attention encoder.

可选地,所述掩码补偿模块503具体用于,确定注意力矩阵的尺寸;按照所述尺寸进行预设倍数的下采样,得到采样尺寸,并随机生成大小为所述采样尺寸的初始掩码矩阵;将所述初始掩码矩阵进行所述预设倍数的上采样,得到所述目标掩码矩阵。Optionally, the mask compensation module 503 is specifically used to determine the size of the attention matrix; downsample the size by a preset multiple to obtain a sampling size, and randomly generate an initial mask matrix of the sampling size; upsample the initial mask matrix by the preset multiple to obtain the target mask matrix.
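The downsample-then-upsample mask generation performed by this module could be sketched as follows; nearest-neighbour (block) upsampling, the masking ratio, and the function name are assumptions for illustration:

```python
import numpy as np

def random_target_mask(n, factor=4, mask_ratio=0.5, rng=None):
    """Generate an n x n target mask by sampling a coarse random binary mask
    of size (n/factor) x (n/factor) and upsampling it back by `factor`.
    1 = masked, 0 = unmasked (convention from the text); the block-wise
    upsampling and mask_ratio are illustrative assumptions."""
    if rng is None:
        rng = np.random.default_rng(0)
    s = n // factor                          # downsampled (sampling) size
    coarse = (rng.random((s, s)) < mask_ratio).astype(np.int32)
    # upsample: each coarse cell expands into a factor x factor block
    return np.kron(coarse, np.ones((factor, factor), dtype=np.int32))
```

Sampling at the coarse resolution and then upsampling makes the masked/unmasked regions contiguous blocks rather than isolated pixels, which is the point of the preset-multiple down/upsampling described above.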

可选地,注意力矩阵包括:查询矩阵、键矩阵以及值矩阵; Optionally, the attention matrix includes: a query matrix, a key matrix, and a value matrix;

所述掩码补偿模块503具体用于,通过以下公式确定补偿后的注意力矩阵:


Z^(l) = A′^(l) V^(l)
The mask compensation module 503 is specifically used to determine the compensated attention matrix by the following formula:


Z^(l) = A′^(l) V^(l)

其中,A'为补偿后的注意力矩阵,M为所述目标掩码矩阵,Q为查询矩阵,K为键矩阵,V为值矩阵,Z为输出矩阵,l为所述注意力编码器的第l自注意力编码层。Among them, A' is the compensated attention matrix, M is the target mask matrix, Q is the query matrix, K is the key matrix, V is the value matrix, Z is the output matrix, and l is the lth self-attention encoding layer of the attention encoder.
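Because the compensation formula itself is not reproduced here (only the output relation Z^(l) = A′^(l) V^(l) survives), the sketch below shows one common way such a target mask could modulate attention: adding a large negative value at masked positions (M == 1) before the softmax, so masked keys receive near-zero attention. This is an assumption, not necessarily the patented formula:

```python
import numpy as np

def masked_attention(Q, K, V, M, neg=-1e9):
    """Compute Z = A' V where A' is a row-wise softmax of the scaled
    query-key scores with masked positions (M == 1) suppressed.
    The additive large-negative-value masking is an assumed realization."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + neg * M   # suppress masked keys
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=-1, keepdims=True)        # A': compensated attention matrix
    return A @ V                              # Z = A' V
```

With key 1 masked for every query, all attention mass collapses onto key 0, so each output row reproduces the first value vector.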

可选地,所述注意力编码器包括多个自注意力编码层;Optionally, the attention encoder comprises a plurality of self-attention encoding layers;

所述训练模块505还用于,获取所述多个自注意力编码层中的目标自注意力编码层的输出结果。The training module 505 is also used to obtain the output result of the target self-attention coding layer among the multiple self-attention coding layers.

可选地,所述目标自注意力编码层的输出结果包括中间补丁令牌特征,所述训练模块505还用于,通过所述中间补丁令牌特征,确定辅助分类结果;以最小化所述总体损失为优化目标,对所述注意力编码器进行训练,其中,所述总体损失还包括:所述辅助分类结果与所述图像分类标签之间的差异。Optionally, the output result of the target self-attention encoding layer includes intermediate patch token features, and the training module 505 is also used to determine the auxiliary classification result through the intermediate patch token features; the attention encoder is trained with minimizing the overall loss as the optimization goal, wherein the overall loss also includes: the difference between the auxiliary classification result and the image classification label.

可选地,所述注意力编码器包括若干自注意力编码层;Optionally, the attention encoder comprises a plurality of self-attention encoding layers;

可选地,所述目标自注意力编码层的输出结果包括所述辅助类激活图,所述训练模块505还用于,通过所述辅助类激活图确定出的前景背景分割结果,确定第一亲和力矩阵;根据所述补丁令牌特征,确定第二亲和力矩阵;以最小化所述总体损失为优化目标,对所述注意力编码器进行训练,其中,所述总体损失还包括:所述第一亲和力矩阵与所述第二亲和力矩阵之间的差异。Optionally, the output result of the target self-attention encoding layer includes the auxiliary class activation map, and the training module 505 is also used to determine a first affinity matrix based on the foreground-background segmentation result determined by the auxiliary class activation map; determine a second affinity matrix based on the patch token features; and train the attention encoder with minimizing the overall loss as the optimization goal, wherein the overall loss also includes: the difference between the first affinity matrix and the second affinity matrix.

可选地,掩码补偿模块503具体用于,随机生成多个目标掩码矩阵;针对每个目标掩码矩阵,将所述样本图像输入到所述注意力编码器中,生成所述样本图像对应的注意力矩阵,通过该目标掩码矩阵对本次生成的注意力矩阵进行补偿,得到对应的补偿后的注意力矩阵,根据所述对应的补偿后的注意力矩阵,生成该目标掩码矩阵对应的局部类别令牌特征;将多个局部类别令牌特征作为所述局部类别令牌特征。Optionally, the mask compensation module 503 is specifically used to randomly generate multiple target mask matrices; for each target mask matrix, the sample image is input into the attention encoder to generate an attention matrix corresponding to the sample image, the attention matrix generated this time is compensated by the target mask matrix to obtain a corresponding compensated attention matrix, and local category token features corresponding to the target mask matrix are generated according to the corresponding compensated attention matrix; and multiple local category token features are used as the local category token features.

可选地,训练模块505还用于,使用训练好的注意力编码器对待识别图像进行图像语义分割。Optionally, the training module 505 is further used to perform image semantic segmentation on the image to be recognized using the trained attention encoder.

本公开还提供了一种计算机可读存储介质,该存储介质存储有计算机程序,计算机程序可用于执行上述基于注意力掩码的弱监督语义分割方法。The present disclosure also provides a computer-readable storage medium storing a computer program, which can be used to execute the above-mentioned weakly supervised semantic segmentation method based on attention mask.

本公开还提供了图6所示的电子设备的示意结构图。如图6所示,在硬件层面,该电子设备包括处理器、内部总线、网络接口、内存以及非易失性存储器,当然还可能包括其他业务所需要的硬件。处理器从非易失性存储器中读取对应的计算机程序到内存中然后运行,以实现基于注意力掩码的弱监督语义分割方法。The present disclosure also provides a schematic structural diagram of an electronic device as shown in FIG6. As shown in FIG6, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to implement the weakly supervised semantic segmentation method based on an attention mask.

当然,除了软件实现方式之外,本公开并不排除其他实现方式,比如逻辑器件抑或软硬件结合的方式等等,也就是说以下处理流程的执行主体并不限定于各个逻辑单元,也可以是硬件或逻辑器件。Of course, in addition to software implementation, the present disclosure does not exclude other implementation methods, such as logic devices or a combination of software and hardware, etc., that is, the execution subject of the following processing flow is not limited to each logic unit, but can also be hardware or logic devices.

在20世纪90年代,对于一个技术的改进可以很明显地区分是硬件上的改进(例如,对二极管、晶体管、开关等电路结构的改进)还是软件上的改进(对于方法流程的改进)。然而,随着技术的发展,当今的很多方法流程的改进已经可以视为硬件电路结构的直接改进。设计人员几乎都通过将改进的方法流程编程到硬件电路中来得到相应的硬件电路结构。因此,不能说一个方法流程的改进就不能用硬件实体模块来实现。例如,可编程逻辑器件(Programmable Logic Device,PLD)(例如现场可编程门阵列(Field Programmable Gate Array,FPGA))就是这样一种集成电路,其逻辑功能由用户对器件编程来确定。由设计人员自行编程来把一个数字系统“集成”在一片PLD上,而不需要请芯片制造厂商来设计和制作专用的集成电路芯片。而且,如今,取代手工地制作集成电路芯片,这种编程也多半改用“逻辑编译器(logic compiler)”软件来实现,它与程序开发撰写时所用的软件编译器相类似,而要编译之前的原始代码也得用特定的编程语言来撰写,此称之为硬件描述语言(Hardware Description Language,HDL),而HDL也并非仅有一种,而是有许多种,如ABEL(Advanced Boolean Expression Language)、AHDL(Altera Hardware Description Language)、Confluence、CUPL(Cornell University Programming Language)、HDCal、JHDL(Java Hardware Description Language)、Lava、Lola、MyHDL、PALASM、RHDL(Ruby Hardware Description Language)等,目前最普遍使用的是VHDL(Very-High-Speed Integrated Circuit Hardware Description Language)与Verilog。本领域技术人员也应该清楚,只需要将方法流程用上述几种硬件描述语言稍作逻辑编程并编程到集成电路中,就可以很容易得到实现该逻辑方法流程的硬件电路。In the 1990s, it was very clear whether the improvement of a technology was hardware improvement (for example, improvement of the circuit structure of diodes, transistors, switches, etc.) or software improvement (improvement of the method flow). However, with the development of technology, many of the improvements of the method flow today can be regarded as direct improvements of the hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into the hardware circuit. Therefore, it cannot be said that the improvement of a method flow cannot be implemented with hardware entity modules. For example, a programmable logic device (PLD) (such as a field programmable gate array (FPGA)) is such an integrated circuit whose logical function is determined by the user's programming of the device. 
Designers can "integrate" a digital system on a PLD by programming themselves, without having to ask chip manufacturers to design and produce dedicated integrated circuit chips. Moreover, nowadays, instead of manually making integrated circuit chips, this kind of programming is mostly implemented by "logic compiler" software, which is similar to the software compiler used when developing programs. The original code before compilation must also be written in a specific programming language, which is called Hardware Description Language (HDL). There is not only one kind of HDL, but many kinds, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, RHDL (Ruby Hardware Description Language), etc., and the most commonly used ones are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also know that it is only necessary to program the method flow slightly in the above-mentioned hardware description languages and program it into the integrated circuit, and then it is easy to obtain the hardware circuit that realizes the logic method flow.

控制器可以按任何适当的方式实现,例如,控制器可以采取例如微处理器或处理器以及存储可由该(微)处理器执行的计算机可读程序代码(例如软件或固件)的计算机可读介质、逻辑门、开关、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程逻辑控制器和嵌入微控制器的形式,控制器的例子包括但不限于以下微控制器:ARC 625D、Atmel AT91SAM、Microchip PIC18F26K20以及Silicone Labs  C8051F320,存储器控制器还可以被实现为存储器的控制逻辑的一部分。本领域技术人员也知道,除了以纯计算机可读程序代码方式实现控制器以外,完全可以通过将方法步骤进行逻辑编程来使得控制器以逻辑门、开关、专用集成电路、可编程逻辑控制器和嵌入微控制器等的形式来实现相同功能。因此这种控制器可以被认为是一种硬件部件,而对其内包括的用于实现各种功能的装置也可以视为硬件部件内的结构。或者甚至,可以将用于实现各种功能的装置视为既可以是实现方法的软件模块又可以是硬件部件内的结构。The controller may be implemented in any suitable manner, for example, the controller may take the form of a microprocessor or processor and a computer-readable medium storing a computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (ASIC), a programmable logic controller, and an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320, the memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art also know that in addition to implementing the controller in a purely computer-readable program code, it is entirely possible to implement the same function in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers by logically programming the method steps. Therefore, this controller can be considered as a hardware component, and the devices for implementing various functions included therein can also be regarded as structures within the hardware component. Or even, the devices for implementing various functions can be regarded as both software modules for implementing the method and structures within the hardware component.

上述实施例阐明的系统、装置、模块或单元,具体可以由计算机芯片或实体实现,或者由具有某种功能的产品来实现。一种典型的实现设备为计算机。具体的,计算机例如可以为个人计算机、膝上型计算机、蜂窝电话、相机电话、智能电话、个人数字助理、媒体播放器、导航设备、电子邮件设备、游戏控制台、平板计算机、可穿戴设备或者这些设备中的任何设备的组合。The systems, devices, modules or units described in the above embodiments may be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

为了描述的方便,描述以上装置时以功能分为各种单元分别描述。当然,在实施本公开时可以把各单元的功能在同一个或多个软件和/或硬件中实现。For the convenience of description, the above device is described in various units according to their functions. Of course, when implementing the present disclosure, the functions of each unit can be implemented in the same or multiple software and/or hardware.

本领域内的技术人员应明白,本公开的实施例可提供为方法、系统、或计算机程序产品。因此,本公开可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本公开可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that the embodiments of the present disclosure may be provided as methods, systems, or computer program products. Therefore, the present disclosure may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

本公开是参照根据本公开实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present disclosure is described with reference to the flowchart and/or block diagram of the method, device (system), and computer program product according to the embodiment of the present disclosure. It should be understood that each process and/or box in the flowchart and/or block diagram, as well as the combination of the process and/or box in the flowchart and/or block diagram can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one process or multiple processes in the flowchart and/or one box or multiple boxes in the block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.

内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。Memory may include non-permanent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。Computer readable media include permanent and non-permanent, removable and non-removable media that can be implemented by any method or technology to store information. Information can be computer readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by a computing device. As defined in this article, computer readable media does not include temporary computer readable media (transitory media), such as modulated data signals and carrier waves.

还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "include", "comprises" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, commodity or device. In the absence of more restrictions, the elements defined by the sentence "comprises a ..." do not exclude the existence of other identical elements in the process, method, commodity or device including the elements.

本领域技术人员应明白,本公开的实施例可提供为方法、系统或计算机程序产品。因此,本公开可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且,本公开可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。It will be appreciated by those skilled in the art that the embodiments of the present disclosure may be provided as methods, systems or computer program products. Therefore, the present disclosure may take the form of a complete hardware embodiment, a complete software embodiment or an embodiment combining software and hardware. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

本公开可以在由计算机执行的计算机可执行指令的一般上下文中描述,例如程序模块。一般地,程序模块包括执行特定任务或实现特定抽象数据类型的例程、程序、对象、组件、数据结构等等。也可以在分布式计算环境中实践本公开,在这些分布式计算环境中,由通过通信网络而被连接的远程处理设备来执行任务。在分布式计算环境中,程序模块可以位于包括存储设备在内的本地和远程计算机存储介质中。The present disclosure may be described in the general context of computer-executable instructions, such as program modules, executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types. The present disclosure may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.

本公开中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于系统实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。Each embodiment in the present disclosure is described in a progressive manner, and the same or similar parts between the embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the partial description of the method embodiment.

以上所述仅为本公开的实施例而已,并不用于限制本公开。对于本领域技术人员来说,本公开可以有各种更改和变化。凡在本公开的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本公开的权利要求范围之内。 The above description is only an embodiment of the present disclosure and is not intended to limit the present disclosure. For those skilled in the art, the present disclosure may have various modifications and variations. Any modification, equivalent substitution, improvement, etc. made within the spirit and principle of the present disclosure shall be included in the scope of the claims of the present disclosure.

Claims (12)

1. A weakly supervised semantic segmentation method based on an attention mask, comprising:
obtaining a sample image and an image classification label corresponding to the sample image;
inputting the sample image into an attention encoder;
obtaining patch token features and a global class token feature through the attention encoder, generating an image classification result and a class activation map for the sample image from the patch token features, obtaining a first semantic segmentation result for the sample image from the class activation map, and decoding the patch token features to obtain a second semantic segmentation result;
generating an attention matrix through the attention encoder, randomly generating a target mask matrix, compensating the attention matrix with the target mask matrix to obtain a compensated attention matrix, and generating a local class token feature according to the compensated attention matrix;
determining, according to an auxiliary class activation map and the target mask matrix, an average activation value of those activation values in the class activation map that are not affected by the mask in the target mask matrix; if the average activation value is greater than a preset threshold, taking the local class token feature as a positive local class token feature, and if the average activation value is less than or equal to the preset threshold, taking the local class token feature as a negative local class token feature; and
training the attention encoder with minimization of an overall loss as an optimization objective, wherein the overall loss comprises: a difference between the image classification result and the image classification label, a difference between the first semantic segmentation result and the second semantic segmentation result, a difference between the positive local class token feature and the global class token feature, and a difference between the negative local class token feature and the global class token feature.

2. The method according to claim 1, wherein inputting the sample image into the attention encoder comprises:
segmenting the sample image into several sub-images to determine a plurality of initial patch token features corresponding to the sub-images, and determining an initial class token feature corresponding to the sample image; and
concatenating the plurality of initial patch token features with the initial class token feature and inputting the result into the attention encoder.

3. The method according to claim 1 or 2, wherein randomly generating the target mask matrix comprises:
determining a size of the attention matrix;
downsampling the size by a preset multiple to obtain a sampling size, and randomly generating an initial mask matrix of the sampling size; and
upsampling the initial mask matrix by the preset multiple to obtain the target mask matrix.
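The mask-generation procedure of claim 3 (downsample the attention-matrix size by a preset multiple, draw a random binary mask at the reduced size, then upsample it back by the same multiple) can be sketched as follows. The keep-probability `p_keep`, nearest-neighbour upsampling, and concrete shapes are illustrative assumptions, not details fixed by the claim:

```python
import numpy as np

def random_target_mask(n_tokens, factor=4, p_keep=0.7, rng=None):
    """Randomly generate a target mask matrix as in claim 3.

    A binary initial mask is drawn at a size downsampled by `factor`,
    then upsampled back by the same factor, yielding a block-structured
    mask of the same size as the attention matrix.
    """
    rng = rng or np.random.default_rng()
    small = max(1, n_tokens // factor)                 # sampling size
    init_mask = (rng.random((small, small)) < p_keep).astype(np.float32)
    # nearest-neighbour upsampling by the same preset multiple
    mask = np.repeat(np.repeat(init_mask, factor, axis=0), factor, axis=1)
    return mask[:n_tokens, :n_tokens]                  # crop to attention size

m = random_target_mask(16, factor=4, rng=np.random.default_rng(0))
```

Because the mask is drawn at the reduced resolution, the upsampled mask masks out contiguous blocks of patch tokens rather than isolated entries.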
4. The method according to any one of claims 1 to 3, wherein the attention matrix comprises a query matrix, a key matrix and a value matrix; and
compensating the attention matrix with the target mask matrix to obtain the compensated attention matrix comprises determining the compensated attention matrix by the following formula:

Z(l) = A′(l) V(l)

where A′ is the compensated attention matrix, M is the target mask matrix, Q is the query matrix, K is the key matrix, V is the value matrix, Z is an output matrix, and l denotes the l-th self-attention encoding layer of the attention encoder.
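Claim 4 recites only the final step Z(l) = A′(l)V(l). One common way to realize a mask-compensated attention matrix A′ from Q, K, and a binary mask M — an assumption for illustration, not a form recited in the claim — is to suppress the masked positions before the softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, M):
    """Compute Z = A' V for one self-attention layer, where the binary
    mask M (1 = keep) modulates the attention scores (assumed form)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(M > 0, scores, -1e9)   # masked positions -> ~0 weight
    A_comp = softmax(scores, axis=-1)        # compensated attention matrix A'
    return A_comp @ V                        # Z = A' V, as in claim 4

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 16)); K = rng.standard_normal((8, 16))
V = rng.standard_normal((8, 16)); M = np.ones((8, 8))
Z = masked_attention(Q, K, V, M)
```

With an all-ones mask the computation reduces to ordinary scaled dot-product attention; zero entries in M drive the corresponding attention weights toward zero.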
5. The method according to any one of claims 1 to 4, wherein the attention encoder comprises a plurality of self-attention encoding layers, and the method further comprises:
obtaining an output result of a target self-attention encoding layer among the plurality of self-attention encoding layers.

6. The method according to claim 5, wherein the output result of the target self-attention encoding layer comprises intermediate patch token features, and the method further comprises:
determining an auxiliary classification result from the intermediate patch token features; and
training the attention encoder with minimization of the overall loss as the optimization objective, wherein the overall loss further comprises: a difference between the auxiliary classification result and the image classification label.

7. The method according to claim 5 or 6, wherein the output result of the target self-attention encoding layer comprises the auxiliary class activation map, and the method further comprises:
determining a first affinity matrix from a foreground-background segmentation result determined by the auxiliary class activation map;
determining a second affinity matrix from the patch token features; and
training the attention encoder with minimization of the overall loss as the optimization objective, wherein the overall loss further comprises: a difference between the first affinity matrix and the second affinity matrix.
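The two affinity matrices of claim 7 both measure pairwise patch relatedness. A minimal sketch follows, assuming cosine similarity for the feature-based (second) affinity matrix and label agreement for the segmentation-based (first) one; the claim does not fix either measure:

```python
import numpy as np

def feature_affinity(feats):
    """Second affinity matrix of claim 7: pairwise affinity between
    patch-token features (cosine similarity is an assumed choice)."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return f @ f.T

def mask_affinity(fg_labels):
    """First affinity matrix: two patches are affine (1) when the
    foreground/background segmentation assigns them the same label."""
    v = fg_labels.reshape(-1, 1)
    return (v == v.T).astype(np.float32)

feats = np.eye(3)                        # three orthogonal token features
A2 = feature_affinity(feats)             # affinity only on the diagonal
A1 = mask_affinity(np.array([1, 1, 0]))  # patches 0 and 1 share a label
```

The training term in claim 7 then penalizes the discrepancy between A1 and A2, pushing feature similarity to agree with the foreground-background grouping.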
8. The method according to any one of claims 1 to 7, wherein generating the attention matrix through the attention encoder, randomly generating the target mask matrix, compensating the attention matrix with the target mask matrix to obtain the compensated attention matrix, and generating the local class token feature according to the compensated attention matrix comprises:
randomly generating a plurality of target mask matrices;
for each target mask matrix: inputting the sample image into the attention encoder to generate an attention matrix corresponding to the sample image, compensating the attention matrix generated this time with the target mask matrix to obtain a corresponding compensated attention matrix, and generating a local class token feature corresponding to the target mask matrix according to the corresponding compensated attention matrix; and
taking the plurality of local class token features as the local class token features.

9. The method according to any one of claims 1 to 8, further comprising:
performing image semantic segmentation on an image to be recognized using the trained attention encoder.
10. A weakly supervised semantic segmentation apparatus based on an attention mask, comprising:
an acquisition module, configured to obtain a sample image and an image classification label corresponding to the sample image;
an input module, configured to input the sample image into an attention encoder, obtain patch token features and a global class token feature through the attention encoder, generate an image classification result and a class activation map for the sample image from the patch token features, obtain a first semantic segmentation result for the sample image from the class activation map, and decode the patch token features to obtain a second semantic segmentation result;
a mask compensation module, configured to generate an attention matrix through the attention encoder, randomly generate a target mask matrix, compensate the attention matrix with the target mask matrix to obtain a compensated attention matrix, and generate each local class token feature according to the compensated attention matrix;
a judgment module, configured to determine, according to an auxiliary class activation map and the target mask matrix, an average activation value of those activation values in the class activation map that are not affected by the mask in the target mask matrix, take the local class token feature as a positive local class token feature if the average activation value is greater than a preset threshold, and take the local class token feature as a negative local class token feature if the average activation value is less than or equal to the preset threshold; and
a training module, configured to train the attention encoder with minimization of an overall loss as an optimization objective, wherein the overall loss comprises: a difference between the image classification result and the image classification label, a difference between the first semantic segmentation result and the second semantic segmentation result, a difference between the positive local class token feature and the global class token feature, and a difference between the negative local class token feature and the global class token feature.

11. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 9.

12. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 9 when executing the computer program.
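The judgment step shared by claims 1 and 10 — average the class-activation values left untouched by the mask and compare against a preset threshold — can be sketched as follows; the threshold value and array shapes are illustrative assumptions:

```python
import numpy as np

def classify_local_token(aux_cam, mask, threshold=0.5):
    """Decide whether a local class token is a positive or negative sample.

    `aux_cam` is the auxiliary class activation map and `mask` the target
    mask matrix (1 = unaffected); only activations NOT affected by the
    mask contribute to the mean, as in the judgment step.
    """
    visible = aux_cam[mask > 0]
    mean_act = visible.mean() if visible.size else 0.0
    return "positive" if mean_act > threshold else "negative"

cam = np.array([[0.9, 0.8], [0.1, 0.2]])
m = np.array([[1, 1], [0, 0]])    # only the top row is unmasked
label = classify_local_token(cam, m)   # mean of [0.9, 0.8] is 0.85 > 0.5
```

Positive local class tokens are then pulled toward the global class token in the overall loss, and negative ones pushed away, which is what the positive/negative distinction is used for during training.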
PCT/CN2023/137612 2023-09-18 2023-12-08 Weakly supervised semantic segmentation method and apparatus based on attention mask Pending WO2025060272A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202311202665.6 2023-09-18
CN202311202665.6A CN116935055B (en) 2023-09-18 2023-09-18 Attention mask-based weak supervision semantic segmentation method and device

Publications (1)

Publication Number Publication Date
WO2025060272A1 true WO2025060272A1 (en) 2025-03-27

Family

ID=88381134

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/137612 Pending WO2025060272A1 (en) 2023-09-18 2023-12-08 Weakly supervised semantic segmentation method and apparatus based on attention mask

Country Status (2)

Country Link
CN (1) CN116935055B (en)
WO (1) WO2025060272A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120045950A (en) * 2025-04-24 2025-05-27 杭州回水科技股份有限公司 Heat energy regulation and control system and method for co-production of machine-made carbon and activated carbon acid washing
CN120673060A (en) * 2025-06-04 2025-09-19 山东师范大学 Food image semantic segmentation method based on mask classification
CN120689356A (en) * 2025-06-25 2025-09-23 青岛理工大学 Turtle instance segmentation method and system based on weighted low-rank decomposition and feature enhancement
CN120822079A (en) * 2025-09-15 2025-10-21 大连理工大学 An open vocabulary audiovisual segmentation method based on semantic consistency
CN120976550A (en) * 2025-10-15 2025-11-18 安徽大学 A method for denotation image segmentation based on space-frequency duality tuning

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935055B (en) * 2023-09-18 2024-01-09 之江实验室 Attention mask-based weak supervision semantic segmentation method and device
CN119206231B (en) * 2024-10-16 2025-05-09 北京邮电大学 Image-level weakly supervised semantic segmentation method and system based on multi-dimensional information comparison
CN119027674B (en) * 2024-10-29 2025-04-22 南京信息工程大学 CAM optimization method based on patch semantic affinity enhancement
CN119168923B (en) * 2024-11-20 2025-03-14 杭州灵西机器人智能科技有限公司 Image complement method, device, system and computer readable storage medium
CN119180962B (en) * 2024-11-26 2025-03-18 西北工业大学 An end-to-end weakly supervised instance segmentation method and system
CN120672750B (en) * 2025-08-20 2025-10-17 脉得智能科技(无锡)有限公司 Class activation diagram generation model training method, generation method and related equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220084173A1 (en) * 2020-09-17 2022-03-17 Arizona Board of Regents on behalf on Arizona State University Systems, methods, and apparatuses for implementing fixed-point image-to-image translation using improved generative adversarial networks (gans)
CN114399638A (en) * 2021-12-13 2022-04-26 深圳大学 Semantic segmentation network training method, equipment and medium based on dicing patch learning
CN114943840A (en) * 2022-06-16 2022-08-26 京东科技信息技术有限公司 Training method of machine learning model, image processing method and electronic equipment
CN116935055A (en) * 2023-09-18 2023-10-24 之江实验室 Attention mask-based weak supervision semantic segmentation method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019089192A1 (en) * 2017-11-03 2019-05-09 Siemens Aktiengesellschaft Weakly-supervised semantic segmentation with self-guidance
CN114882212B (en) * 2022-03-23 2024-06-04 上海人工智能创新中心 Semantic segmentation method and device based on priori structure
CN115512169B (en) * 2022-11-09 2023-07-25 之江实验室 Method and device for weakly supervised semantic segmentation based on gradient and region affinity optimization

Also Published As

Publication number Publication date
CN116935055A (en) 2023-10-24
CN116935055B (en) 2024-01-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23952916

Country of ref document: EP

Kind code of ref document: A1