
CN119904638A - Weakly supervised semantic segmentation method and system based on component-aware learning segmentation network - Google Patents

Weakly supervised semantic segmentation method and system based on component-aware learning segmentation network Download PDF

Info

Publication number
CN119904638A
Authority
CN
China
Prior art keywords
mask
component
image
segmentation
whole
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411985702.XA
Other languages
Chinese (zh)
Other versions
CN119904638B (en)
Inventor
凌志刚
张傲然
谭浩然
王耀南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202411985702.XA priority Critical patent/CN119904638B/en
Publication of CN119904638A publication Critical patent/CN119904638A/en
Application granted granted Critical
Publication of CN119904638B publication Critical patent/CN119904638B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract


The invention discloses a weakly supervised semantic segmentation method and system based on a component-aware learning segmentation network. The method comprises: step 1: constructing a component-aware learning segmentation network; step 2: obtaining a SAM mask set of a training set image; step 3: obtaining a mask pseudo-label set of the training set image and a relationship matrix between a target component and a whole in the mask; step 4: obtaining an initial component-aware learning segmentation network model; step 5: obtaining and optimizing a CAM set of a target mask in the training set image to obtain a final pseudo-label set, and using a segmentation decoder to obtain a semantic segmentation result in an image to be segmented; removing redundant masks through a component-whole refinement module in the network, and capturing the relationship between the target component and the whole in the mask; and encouraging the network to learn the relationship between each component and the whole of the target in the pseudo-label set through a component-aware learning module in the network, so as to achieve accurate and complete segmentation of the target under weak supervision.

Description

Weakly supervised semantic segmentation method and system based on a component-aware learning segmentation network
Technical Field
The invention belongs to the field of semantic segmentation, and in particular relates to a weakly supervised semantic segmentation method and system based on a component-aware learning segmentation network.
Background
Semantic segmentation is an important task in computer vision and has made significant progress in recent decades. Typically, semantic segmentation methods adopt a fully supervised learning paradigm, training a segmentation model on large-scale datasets with pixel-level annotations. However, obtaining accurate annotations is time-consuming and costly. In addition, annotations often contain errors caused by annotator misjudgment. To address these challenges, weakly supervised semantic segmentation (Weakly Supervised Semantic Segmentation, WSSS) has become an important research hotspot. It uses only weak, low-cost label forms, such as image-level labels, points, lines and bounding boxes, as the primary supervisory information for training the segmentation model.
In pursuit of accurate weakly supervised semantic segmentation, previous studies typically generate class activation maps (Class Activation Maps, CAM) through classification supervision (reference: B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2921–2929.), or variants thereof, and treat them as pseudo labels. However, CAM-based approaches rely primarily on classification loss and often activate only the most discriminative regions of the target.
To pursue accurate semantic segmentation, Wei et al. (reference: Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and Shuicheng Yan, "Object region mining with adversarial erasing: A simple classification to semantic segmentation approach," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1568–1576.) mine more target regions beyond the original CAM with an adversarial erasing strategy. Wang et al. (reference: Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan, and Xilin Chen, "Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12275–12284.) regularize training targets by introducing an auxiliary scale-invariance regularization task. Yao et al. (reference: Yazhou Yao, Tao Chen, Guo-Sen Xie, Chuanyi Zhang, Fumin Shen, Qi Wu, Zhenmin Tang, and Jian Zhang, "Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 2623–2632.) use an additional saliency map as supervisory information to suppress background regions and mine non-salient targets. Zhou et al. (reference: Tianfei Zhou, Meijie Zhang, Fang Zhao, and Jianwu Li, "Regional semantic contrast and aggregation for weakly supervised semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 4299–4309.) promote complete activation of target regions by contrasting pixel and prototype representations.
However, these convolutional neural network (Convolutional Neural Network, CNN) based methods mainly identify the most prominent feature regions, so the resulting pseudo labels lack detailed boundary information. In contrast to CNNs, Vision Transformers (Vision Transformer, ViT) (reference: Alexey Dosovitskiy, "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.) have achieved significant success in WSSS, because the self-attention mechanism is able to capture global information. These ViT-based approaches enhance long-range feature dependencies and thus better learn semantic or positional affinities and differences, making CAM boundaries more accurate. Nevertheless, the pseudo labels produced from ViT still suffer from inaccurate boundaries.
Recently, the Segment Anything Model (SAM) (reference: A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," arXiv preprint arXiv:2304.02643, 2023.) has performed excellently in unsupervised semantic segmentation; the present invention therefore aims to generate more accurate pseudo labels by means of SAM and reduce the reliance on CAM. However, two problems remain: (1) SAM typically generates a large number of low-quality masks which, without supervision, are difficult to distinguish between objects and background; this lack of distinguishability leads to inaccurate segmentation results and difficulty in separating overlapping objects; (2) SAM often segments a whole object into multiple fragments, resulting in complex and inaccurate pseudo labels. Moreover, even with CAM-guided training, SAM cannot generate a complete and accurate mask for complex, textured objects. These limitations show that SAM-based approaches still face challenges in providing complete and accurate pseudo labels when dealing with challenging segmentation tasks.
Disclosure of Invention
Aiming at the above problems, the technical scheme of the invention provides a weakly supervised semantic segmentation method and system based on a component-aware learning segmentation network, which uses the masks predicted by SAM to guide the generation of more accurate CAMs, so as to produce more accurate pseudo labels with clearer boundaries. The technical scheme overcomes the limitations of traditional CAM-based methods, addresses the problems of applying SAM in weakly supervised semantic segmentation, and improves the precision of weakly supervised semantic segmentation.
In one aspect, a method for weak supervision semantic segmentation based on a component aware learning segmentation network includes:
Step 1, constructing a part perception learning segmentation network;
the component perception learning segmentation network comprises a component-whole refining module and a main network which are connected in series, and a classification branch network and a component branch network which are connected in parallel and then connected in series at the rear end of the main network;
the component-whole refining module sequentially performs small target mask removal, high similarity mask removal and relation matrix recording processing between a target component and a whole in the mask;
the component branch network comprises a component perception learning module, a Transformer Encoder module, a full-connection layer and a classifier which are sequentially connected, wherein the component perception learning module sequentially performs embedded feature extraction, embedded feature scale remodeling, mask feature fusion and image feature output processing;
Step 2, inputting the training set image into a SAM segmentation model to obtain a SAM mask set of the training set image;
Step 3, using the SAM mask set as a component-whole refining module input to a component perception learning segmentation network to obtain a mask pseudo-tag set of the training set image and a relation matrix between a target component and a whole in the mask;
Step 4, taking the known training set image and image-level label, together with the mask pseudo-label set of the training set image and the relation matrix between the target component and the whole in the mask, as inputs of the component perception learning segmentation network, and training the component perception learning segmentation network to obtain an initial component perception learning segmentation network model;
And step 5, acquiring a CAM set of a target mask in the training set image by using the initial component perception learning segmentation network model, obtaining a final pseudo tag set through threshold segmentation, taking the final pseudo tag set as supervision information, simultaneously training the initial component perception learning segmentation network model and a segmentation decoder connected in series at the rear end of the component branch network to obtain a complete component perception learning segmentation network model, and finally inputting the image to be segmented into the complete component perception learning segmentation network model to obtain a semantic segmentation result in the image to be segmented.
In the weakly supervised semantic segmentation task, a pseudo label is an approximate pixel-level label, because it is not the true classification information of each pixel. The target regions in an image are usually predicted by the model obtained from the current training, and these prediction results are taken as pseudo pixel-level labels of the targets. Thus, pseudo labels have a higher degree of granularity than image-level labels but slightly lower accuracy, and are typically generated by a Class Activation Map (CAM) or another feature-based strategy. The method uses an unsupervised SAM segmentation model for prediction and preprocesses its output to generate pseudo labels, which are used to guide the model to learn the region boundaries of different targets in an image.
The process of training the segmentation decoder is that a training set image is input into a part perception learning segmentation network, the output of a part branch network which is connected in series with a main network is used as input, a final pseudo tag is used as supervision information, a conventional cross entropy loss function is adopted, and the segmentation decoder is trained while the initial part perception learning segmentation network is trained, so that a complete part perception learning segmentation network model is obtained.
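For illustration only, a minimal sketch of this decoder-training step; the module and variable names are placeholders rather than part of the invention, and the loss is the conventional cross entropy named above:

    import torch.nn.functional as F

    def decoder_training_step(part_branch, seg_decoder, optimizer, images, pseudo_labels):
        # Features output by the component branch network feed the segmentation decoder.
        feats = part_branch(images)
        logits = seg_decoder(feats)                    # (B, num_classes, H, W)
        loss = F.cross_entropy(logits, pseudo_labels)  # final pseudo labels act as supervision
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()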
Further, the backbone network is composed of ViT (Vision Transformer Encoder), and the classification branch network comprises a full-connection layer and a classifier which are sequentially connected.
Further, the component-whole refining module works as follows:
The small target mask removal and the high similarity mask removal refine the SAM mask set and screen out effective masks to improve data quality; recording the relation matrix between the target component and the whole in the mask is the core of the component-whole refining module: the relation matrix between the target component and the whole is constructed and extracted, and this matrix plays a key role in subsequent model training, helping the model better learn the relevance between the target component and the whole;
Step A1, small target mask removal;
Small target masks that do not meet the size condition are removed from the SAM mask set of the training set image;
For an input SAM mask set M = {M_i}_{i=1}^{K} of a training set image, the foreground pixel proportion U_i of each mask M_i is calculated, and masks whose proportion is lower than a set threshold τ = 0.05 are removed, giving an output mask set M' = {M'_i}_{i=1}^{K'};
wherein M_i represents the i-th mask in the SAM mask set M of the training set image, i is the index of the masks in the set and K is the total number of masks; M'_i represents the i-th mask in the output mask set M', and K' is the number of masks in the output mask set;
A2, removing a high similarity mask;
Removing masks with too high similarity from the output mask set to reduce redundancy;
The input is the output mask set M' = {M'_i}_{i=1}^{K'}. The pairwise overlap between masks is calculated, masks whose overlap exceeds a set threshold β are judged to be high-similarity masks and are eliminated, and the retained masks form the pseudo-label set of the training set image;
The pseudo-label set of the training set image consists of relatively independent effective masks and is used for subsequent model training;
The overlap is computed as overlap(i, j) = |M'_i ∩ M'_j| / |M'_i ∪ M'_j|, where |M'_i ∩ M'_j| and |M'_i ∪ M'_j| respectively denote the number of pixels in the overlapping area of the two masks M'_i and M'_j and in the total area they cover;
wherein i and j are different mask indexes, representing the i-th mask and the j-th mask; the pseudo-label set of the training set image is recorded as Ṁ = {Ṁ_i}_{i=1}^{K̇}, where Ṁ_i is its i-th mask and K̇ is the total number of masks in the pseudo-label set; f(i, j) is used to determine whether to retain a mask: if there is no other mask M'_j whose overlap with M'_i exceeds the threshold β, then f(i, j) = 1 and the mask M'_i is retained; otherwise f(i, j) = 0 and the mask M'_i is removed; the threshold β is set to 0.95;
a3, recording a relation matrix between the target part and the whole in the mask;
The pseudo-label set Ṁ of the training set image is taken as input, and a relation matrix between the target component and the whole in the masks of the pseudo-label set is constructed;
Which masks represent a component-whole relationship is determined by calculating the overlap between masks and setting a threshold: when, for masks Ṁ_i and Ṁ_j in the pseudo-label set of the training set image, the overlapping portion Ṁ_i ∩ Ṁ_j occupies a proportion of the area of Ṁ_i greater than or equal to the threshold λ, Ṁ_j is taken as the whole mask and Ṁ_i is recognized as a component of Ṁ_j; an indicator function h(i, j) is used to build the relationship matrix R between the target and its components: h(i, j) = 1 if |Ṁ_i ∩ Ṁ_j| / |Ṁ_i| ≥ λ, and h(i, j) = 0 otherwise;
where λ is a threshold set to 0.9; if Ṁ_i is a component mask of Ṁ_j then h(i, j) = 1, otherwise h(i, j) = 0, and the relation matrix R between the target component and the whole in the masks is obtained through h(i, j);
Through the three closely-coordinated parts, the part-whole refining module not only effectively optimizes the SAM mask set in the training set image to obtain a pseudo tag set of a training set image with higher quality, but also accurately constructs the relation between a target part and the whole, and provides more reliable additional information for model training, and enhances the capability of the model in the aspect of associating the whole characteristics of the target with the characteristics of the part by utilizing the information.
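For illustration, a minimal NumPy sketch of the three steps A1–A3 described above; the function and variable names are assumptions, masks are taken to be binary arrays of identical spatial size, and the thresholds follow the values given in the text (τ = 0.05, β = 0.95, λ = 0.9):

    import numpy as np

    def refine_sam_masks(masks, tau=0.05, beta=0.95, lam=0.9):
        # masks: list of binary H x W arrays produced by SAM.
        # A1: drop masks whose foreground pixel proportion is below tau.
        kept = [m for m in masks if m.sum() / m.size >= tau]

        # A2: greedy removal of near-duplicate masks (pairwise overlap above beta).
        pseudo = []
        for mi in kept:
            duplicate = False
            for mj in pseudo:
                inter = np.logical_and(mi, mj).sum()
                union = np.logical_or(mi, mj).sum()
                if union > 0 and inter / union > beta:
                    duplicate = True
                    break
            if not duplicate:
                pseudo.append(mi)

        # A3: relation matrix, h[i, j] = 1 if mask i is a component of mask j.
        k = len(pseudo)
        h = np.zeros((k, k), dtype=np.int64)
        for i in range(k):
            for j in range(k):
                if i == j:
                    continue
                inter = np.logical_and(pseudo[i], pseudo[j]).sum()
                if pseudo[i].sum() > 0 and inter / pseudo[i].sum() >= lam:
                    h[i, j] = 1
        return pseudo, h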
Further, the working process of the component perception learning module is as follows:
step B1, extracting embedded features;
Training set image embedded features F are extracted from the penultimate layer of the Transformer encoder; by removing the classification-token dimension of the embedded features F, the remaining embedded features F_p ∈ R^{B×N×E} are obtained;
wherein B, N and E respectively denote the size of the input batch, the number of image blocks into which the image is divided, and the feature vector dimension of each image block, and R denotes the real number domain;
step B2, remolding the embedded feature scale;
The embedded features F_p are reshaped from B×N×E to B×C×H×W; the mask pseudo-label set of the training set image output by the component-whole refining module is then dot-multiplied with the embedded features to obtain an embedded feature set F_m; the embedded feature set F_m is then reshaped from B×k×C×H×W to B×k×N×E; and the original channel dimension of F_m is reduced to 1/k through a 1×1 filter layer, outputting a mask feature set F_s;
wherein H, W and C respectively denote the height and width of the training set image and the feature dimension after reshaping, and k denotes the number of masks in the mask pseudo-label set;
Step B3, mask feature fusion;
The mask feature set F_s is average-pooled and normalized to obtain weighting coefficients W = {W_1, W_2, ..., W_k};
According to the whole masks and component masks indicated by the relation matrix R, the corresponding whole mask features F_w and component mask features F_c are obtained;
The relation matrix R output by the component-whole refining module is introduced: when the indicator function h(i, j) = 1 in the relation matrix, the mask Ṁ_j is a whole mask and the mask Ṁ_i is one of its component masks, where i and j are different mask indexes in the pseudo-label set Ṁ of the training set image, representing the i-th mask and the j-th mask; the pseudo-label set of the training set image and the mask feature set F_s correspond one to one, the mask feature of a whole mask Ṁ_j is recorded as a whole mask feature, the mask features satisfying h(i, j) = 1 are recorded as component mask features, and the number of component masks associated with a whole mask feature is determined by the entries with h(i, j) = 1 corresponding to that whole mask;
the method comprises the steps of updating part mask features by applying a shared 3X 3 convolution layer to each part mask feature, adding and fusing all updated part mask features with corresponding whole mask features to obtain fused whole mask features, and updating the fused whole mask features by using another shared 3X 3 convolution layer to generate new whole mask features;
All updated whole and component mask features are integrated to obtain a new mask feature set F'_s;
The updated mask feature set F'_s is weighted by the weighting coefficients W and then concatenated to generate the weighted mask feature F_cat = cat(W_1 · F'_1, W_2 · F'_2, ..., W_k · F'_k);
wherein cat is the feature concatenation operation, W_1, W_2, ..., W_k denote the weighting coefficients, and F'_1, ..., F'_k are the mask features in the updated mask feature set;
Step B4, outputting image characteristics;
The weighted mask feature F_cat is processed with a 1×1 convolutional layer to obtain the feature F_out, and the output feature F_out is then concatenated with the previously removed class token to obtain the final output feature of the module.
The mask pseudo tag set of the training set image output by the component-whole refining module and the relation matrix between the target component and the whole in the mask are utilized, and through the synergistic effect of the component perception learning module units, the model can efficiently extract and fuse the mask characteristics of the training set image, and meanwhile, the relation matrix between the component and the whole strengthens the learning of the relation between the target component and the whole by the model, so that the performance of the model is further improved.
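For illustration, a simplified PyTorch-style sketch of steps B1–B4; the tensor shapes, the sigmoid normalization of the pooled weights and all names are assumptions made to keep the sketch self-contained, and the actual module follows the description above:

    import torch
    import torch.nn as nn

    class PartAwareLearning(nn.Module):
        # Illustrative sketch: fuses ViT patch embeddings with SAM-derived mask pseudo labels.
        def __init__(self, embed_dim, k, feat_hw):
            super().__init__()
            self.k, self.hw = k, feat_hw
            red = embed_dim // k
            self.reduce = nn.Conv2d(embed_dim, red, 1)            # B2: channel reduction to 1/k
            self.part_conv = nn.Conv2d(red, red, 3, padding=1)    # shared 3x3 for component mask features
            self.whole_conv = nn.Conv2d(red, red, 3, padding=1)   # shared 3x3 for whole mask features
            self.out_conv = nn.Conv2d(red * k, embed_dim, 1)      # B4: project back to embed_dim

        def forward(self, tokens, masks, relation):
            # tokens: (B, N+1, E) from the penultimate encoder layer, class token first;
            # masks: (B, k, H, W) binary pseudo-label masks; relation: (k, k) part-whole matrix.
            cls_tok, patch = tokens[:, :1], tokens[:, 1:]          # B1: split off the class token
            B, N, E = patch.shape
            fmap = patch.transpose(1, 2).reshape(B, E, self.hw, self.hw)        # B2: B x E x H x W
            feats = [self.reduce(fmap * masks[:, i:i + 1]) for i in range(self.k)]

            fused = list(feats)
            for j in range(self.k):                                # B3: fuse component features into wholes
                parts = [self.part_conv(feats[i]) for i in range(self.k) if relation[i][j] == 1]
                if parts:
                    fused[j] = self.whole_conv(feats[j] + sum(parts))

            weights = [torch.sigmoid(f.mean(dim=(1, 2, 3), keepdim=True)) for f in fused]
            cat = torch.cat([w * f for w, f in zip(weights, fused)], dim=1)     # weighted concatenation
            out = self.out_conv(cat).flatten(2).transpose(1, 2)    # B4: back to (B, N, E)
            return torch.cat([cls_tok, out], dim=1)                # re-attach the class token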
Further, the training set image is input into the SAM segmentation model, and a SAM mask set of the training set image is obtained as follows:
The method comprises the steps of taking a known training set image as input, preprocessing the training image, including image size adjustment, image graying and normalization, and color space conversion, inputting the preprocessed training set image into a pre-trained SAM segmentation model, automatically segmenting each training image by the pre-trained SAM segmentation model, and generating segmentation masks of a plurality of target areas to obtain a SAM mask set of the training set image.
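For illustration, a hedged sketch of this step using the public segment-anything package; the checkpoint path, the resize target and the preprocessing choices are placeholders:

    import cv2
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    # Load a pre-trained SAM model; the checkpoint path is a placeholder.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
    mask_generator = SamAutomaticMaskGenerator(sam)

    def sam_mask_set(image_path, size=(512, 512)):
        # Preprocessing: size adjustment and color-space conversion (BGR -> RGB).
        image = cv2.imread(image_path)
        image = cv2.resize(image, size)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        # Automatic segmentation; each record carries a binary mask under "segmentation".
        records = mask_generator.generate(image)
        return [r["segmentation"] for r in records]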
Further, the training component perceives the learning segmentation network by taking the known training set image and the image-level label, the mask pseudo-label set of the training set image and the relation matrix between the target component and the whole in the mask as the input of the component perceives the learning segmentation network, and the process of acquiring the initial component perceives the learning segmentation network model is as follows:
the method comprises the steps of taking an input training set image as input, taking an image-level label as supervision information, taking a mask pseudo-label set of the training set image and a relation matrix between a target part and the whole in a mask as auxiliary information, training a part perception learning segmentation network comprising a part perception learning module by utilizing a classifier by utilizing a joint loss function, and obtaining a trained initial part perception learning segmentation network model.
Further, the joint loss function is composed of a classification loss function, an auxiliary classification loss function and a contrast loss function, and is formulated as:
L = λ_1 · L_cls + λ_2 · L_cls_aux + λ_3 · L_reg
wherein λ_1, λ_2 and λ_3 are regularization parameters used to balance the classification loss L_cls, the auxiliary classification loss L_cls_aux and the contrast loss L_reg;
L_cls denotes the cross entropy loss function:
L_cls = −Σ_{c=1}^{C} y_c · log(p_c)
wherein p_c is the prediction probability of category c, y_c is the true label, and C denotes the number of categories;
L_cls_aux denotes the auxiliary classification loss function, which takes the same cross-entropy form but is computed on the prediction obtained by applying the softmax activation function to the features of the component branch network;
L_reg denotes the contrast loss function, which uses cosine similarity CosSim(·) over the elements of the mask feature set F_s to pull together positive sample pairs (samples having a part-whole relationship) and push apart negative sample pairs (samples other than the positive ones), where N+ and N− respectively denote the numbers of positive and negative sample pairs.
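For illustration, a hedged PyTorch sketch of the joint loss; the exact forms of the auxiliary classification term and the contrast term are assumptions consistent with the description above (integer class targets are assumed for the cross-entropy calls), and λ1, λ2, λ3 are the balancing parameters:

    import torch
    import torch.nn.functional as F

    def joint_loss(cls_logits, aux_logits, labels, feats, pos_pairs, neg_pairs,
                   lam1=1.0, lam2=1.0, lam3=1.0):
        # Classification loss on the classification branch.
        l_cls = F.cross_entropy(cls_logits, labels)
        # Auxiliary classification loss on the component branch (same form, softmax prediction).
        l_aux = F.cross_entropy(aux_logits, labels)
        # Contrast loss: pull together mask features with a part-whole relation (positive pairs),
        # push apart the remaining pairs (negative pairs), using cosine similarity as in the text.
        pos = [1 - F.cosine_similarity(feats[i], feats[j], dim=-1).mean() for i, j in pos_pairs]
        neg = [F.cosine_similarity(feats[i], feats[j], dim=-1).mean().clamp(min=0) for i, j in neg_pairs]
        if pos and neg:
            l_reg = torch.stack(pos).mean() + torch.stack(neg).mean()
        else:
            l_reg = torch.tensor(0.0)
        return lam1 * l_cls + lam2 * l_aux + lam3 * l_reg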
Further, the final pseudo tag set is used as supervision information, a segmentation loss function is calculated, the segmentation loss function is added based on the joint loss function, an initial part perception learning segmentation network and an additional segmentation decoder connected in series at the rear end of a part branch network are trained to obtain a complete part perception learning segmentation network model, an image to be segmented is finally input into the complete part perception learning segmentation network model, and a semantic segmentation result in the image to be segmented is inferred and obtained.
In a second aspect, a weak supervision semantic segmentation system based on a component aware learning segmentation model includes:
The network construction module is used for constructing a part perception learning segmentation network;
the component perception learning segmentation network comprises a component-whole refining module and a main network which are connected in series, and a classification branch network and a component branch network which are connected in parallel and then connected in series at the rear end of the main network;
the component branch network comprises a component perception learning module, a Transformer Encoder module, a full-connection layer and a classifier which are sequentially connected, wherein the component perception learning module comprises an embedded feature extraction unit, a remodelling embedded feature scale unit, a mask feature fusion unit and an output image feature unit which are sequentially connected in series;
The SAM mask set acquisition module is used for inputting the training set image into the SAM segmentation model to obtain a SAM mask set of the training set image;
The component-whole refining module comprises a small target mask removal unit, a high similarity mask removal unit and a unit for recording the relation matrix between the target component and the whole in the mask, which are sequentially connected in series; it takes the SAM mask set as input to obtain a mask pseudo-label set of the training set image and a relation matrix between the target component and the whole in the mask;
the segmentation network training module is used for training the part perception learning segmentation network by utilizing the known training set image, the image-level label, the mask pseudo-label set and the relation matrix between the target part and the whole in the mask, and obtaining an initial part perception learning segmentation network model;
The segmentation result optimization module is used for acquiring a CAM set of a target mask in a training set image by using an initial component perception learning segmentation network model, obtaining a final pseudo tag set through threshold segmentation, and simultaneously training an initial component perception learning segmentation network and a segmentation decoder connected in series at the rear end of a component branch network by taking the final pseudo tag set as supervision information to obtain a complete component perception learning segmentation network model and obtain a semantic segmentation result in the image to be segmented.
In a third aspect, a computer storage medium stores a computer program that is invoked by a processor to implement:
the steps of the above weakly supervised semantic segmentation method based on the component-aware learning segmentation model.
Advantageous effects
Compared with the prior art, the invention has the advantages that:
1. Unlike previous approaches, the present invention utilizes SAM mask sets to guide the generation of more accurate CAMs, resulting in final pseudo labels with clearer boundaries. The invention proposes a component-aware learning segmentation network to overcome the limitations of traditional CAM-based approaches and to solve two challenges in SAM-based WSSS: (1) SAM generates a large number of low-quality masks, and it is difficult to distinguish foreground from background under unsupervised conditions; this limitation reduces segmentation accuracy and prevents overlapping targets from being separated effectively. (2) Since SAM relies on pixel similarity to make distinctions, it often partitions a single object into multiple parts, thereby generating complex and inaccurate pseudo labels. Compared with previous SAM-based WSSS methods, the invention relies on the part-whole relationship between target masks to effectively use the SAM mask set to generate high-quality pseudo labels.
2. The component-whole refining module provided by the invention refines the SAM mask set and captures part-whole relation in the refined mask to obtain a relation matrix between a target component and whole in the mask. The module can provide a high quality set of masking pseudo tags and more reliable information for training a segmented network model.
3. The partial perception learning module provided by the invention can effectively capture the relation between the whole object and each part of the object, thereby avoiding the loss of object information caused by the fact that the model only pays attention to the most significant area, and further improving the precision of weak supervision semantic segmentation.
4. The joint loss function preferably provided by the invention comprises classification loss, auxiliary classification loss and comparison loss, can help a model to better learn the relation between the whole object and the part thereof, and effectively utilizes the pseudo tag to distinguish the foreground and the background of the object.
Drawings
FIG. 1 is a schematic diagram of a component aware learning segmentation network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a component-overall refining module as referred to in an embodiment of the invention;
FIG. 3 is a schematic diagram of a component aware learning module according to an embodiment of the present invention;
Fig. 4 is a schematic diagram comparing semantic segmentation results on the VOC and COCO datasets under the evaluation index mIoU (mean intersection over union); PLS is the English abbreviation of the weakly supervised semantic segmentation method based on the part-aware learning segmentation model.
Fig. 5 is a graphical representation of the visual results of the method of the present invention on a VOC data set.
Detailed Description
The technical scheme of the invention will be further described with reference to the accompanying drawings and examples.
Example 1
The embodiment of the invention provides a weak supervision semantic segmentation method based on a part perception learning segmentation network, which is shown in fig. 1, and comprises the following specific processes:
Step 1, constructing a part perception learning segmentation network;
The component perception learning segmentation network comprises a component-whole refining module and a main network connected in series, and two parallel branches, a classification branch network and a component branch network, connected in series behind the main network, as shown in FIG. 1, wherein ViT is the main network used for extracting image features, SAM denotes obtaining the SAM mask set of the training set image with the SAM model, N+ represents the positive samples, comprising all mask features having a whole-component relationship, N- represents the negative samples, and "3 x 3" represents a convolution layer of size 3 x 3, wherein all component mask features share one convolution layer and all whole mask features share another 3 x 3 convolution layer;
the backbone network is formed by ViT (Vision Transformer Encoder);
the classification branch network comprises a full connection layer and a Classifier which are sequentially connected, wherein as shown in figure 1, "FC" represents the full connection layer;
The component branch network comprises a component perception learning module, a Transformer Encoder module, a full-connection layer and a classifier which are connected in sequence. As shown in FIG. 1, "TEB" represents Transformer Encoder modules, "FC" represents fully connected layers, and "Classifier" represents a Classifier;
Because the classifier is only related to the class token of the last Transformer Encoder module, the invention adds an extra Transformer Encoder module behind the component perception learning module, so that the output embedding of the component perception learning module can influence the classifier.
The component-whole refining module sequentially performs small target mask removal, high similarity mask removal and relation matrix record processing between target components and whole in the mask, as shown in fig. 2;
Recording the relation matrix between the target component and the whole in the mask is the core of the component-whole refining module: the relation matrix between the target component and the whole is constructed and extracted, and this matrix plays a key role in subsequent model training, helping the model better learn the relevance between the target component and the whole;
a1, removing a small target mask;
The main task is to remove small target masks that do not meet the size condition from the SAM mask set of the training set image. The input is the SAM mask set M = {M_i}_{i=1}^{K} of a training set image, wherein M_i represents the i-th mask in the SAM mask set and K is the total number of masks. The specific operation is to calculate the foreground pixel proportion U_i of each mask M_i, remove the mask if the proportion is lower than the set threshold τ = 0.05, and obtain after processing the output mask set of step A1, M' = {M'_i}_{i=1}^{K'} = { M_i ∈ M | U_i ≥ τ }, wherein K' represents the number of masks in the output mask set.
A2, removing the high similarity mask;
Masks with too high similarity are removed from the output mask set screened in step A1 to reduce redundancy. The input is the mask set M' = {M'_i}_{i=1}^{K'}, wherein M'_i represents the i-th mask in the output mask set and K' is the number of masks. The specific operation is to calculate the pairwise overlap (i.e. the overlap ratio) between masks, overlap(i, j) = |M'_i ∩ M'_j| / |M'_i ∪ M'_j|, judge the masks whose overlap exceeds a set threshold β as high-similarity masks and eliminate them; the retained, relatively independent effective masks are used for subsequent model training as the pseudo-label set of the training set image, recorded as Ṁ = {Ṁ_i}_{i=1}^{K̇};
wherein i and j are different mask indexes, representing the i-th mask and the j-th mask; Ṁ_i denotes a mask in the pseudo-label set of the training set image and K̇ is the total number of such masks; |M'_i ∩ M'_j| and |M'_i ∪ M'_j| respectively represent the number of pixels in the overlapping area of the two masks M'_i and M'_j and in the total area they cover;
f(i, j) is used to determine whether to retain a mask: if there is no other mask M'_j whose overlap with M'_i exceeds β, then f(i, j) = 1 and the mask M'_i is retained; otherwise f(i, j) = 0 and the mask M'_i is removed; the threshold β is set to 0.95;
a3, recording a relation matrix between the target part and the whole in the mask;
The pseudo-label set Ṁ of the training set image is taken as input, and a relation matrix between the target component and the whole in the masks of the pseudo-label set is constructed. The specific operation is to calculate the overlap between masks and set a threshold to judge which masks represent a component-whole relationship: when, for masks Ṁ_i and Ṁ_j in the pseudo-label set of the training set image, the overlapping portion Ṁ_i ∩ Ṁ_j occupies a proportion of the area of Ṁ_i greater than or equal to the threshold λ, Ṁ_j is regarded as the whole mask and Ṁ_i as its component. An indicator function h(i, j) is used to build the relationship matrix R between the target and its components, defined as h(i, j) = 1 if |Ṁ_i ∩ Ṁ_j| / |Ṁ_i| ≥ λ and h(i, j) = 0 otherwise;
where λ is a threshold set to 0.9; if Ṁ_i is a component mask of Ṁ_j then h(i, j) = 1, otherwise h(i, j) = 0, and the relation matrix R between the target component and the whole in the masks is obtained through h(i, j).
Through the close cooperation of the three steps, the part-whole refining module not only effectively optimizes the SAM mask set in the training set image to obtain a pseudo tag set of a training set image with higher quality, but also accurately constructs the relation between the target part and the whole, and provides more reliable additional information for model training, and enhances the capability of the model in the aspect of associating the integral characteristics of the target with the characteristics of the part by utilizing the information.
As shown in fig. 3, the component aware learning module works as follows:
step B1, extracting embedded features;
The input is the training set image embedded features F extracted from the penultimate layer of the Transformer encoder, wherein B, N and E respectively denote the size of the input batch, the number of image blocks into which the image is divided, and the feature vector dimension of each image block, and R denotes the real number domain; by removing the classification-token dimension of the embedded features F, the remaining embedded features F_p ∈ R^{B×N×E} are output;
Step B2, remolding the embedded feature scale
The embedded features F_p are reshaped from B×N×E to B×C×H×W, wherein H, W and C respectively denote the height and width of the training set image and the feature dimension after reshaping; the mask pseudo-label set of the training set image output by the component-whole refining module is then dot-multiplied with the embedded features to obtain an embedded feature set F_m, wherein k denotes the number of masks (set to 7 in the embodiment of the invention); the embedded feature set F_m is then reshaped from B×k×C×H×W to B×k×N×E; and the original channel dimension of F_m is reduced to 1/k through a 1×1 filter layer, outputting a mask feature set F_s;
Step B3 mask feature fusion
The mask feature set F_s is average-pooled and normalized to obtain weighting coefficients W = {W_1, W_2, ..., W_k}; according to the whole masks and component masks indicated by the relation matrix R, the corresponding whole mask features F_w and component mask features F_c are obtained;
Based on the relation matrix R output by the component-whole refining module: when the indicator function h(i, j) = 1 in the relation matrix, the mask Ṁ_j is a whole mask and the mask Ṁ_i is one of its component masks, where i and j are different mask indexes in the pseudo-label set Ṁ of the training set image, representing the i-th mask and the j-th mask; the pseudo-label set of the training set image and the mask feature set F_s correspond one to one, the mask feature of a whole mask Ṁ_j is recorded as a whole mask feature, the mask features satisfying h(i, j) = 1 are recorded as component mask features, and the number of component masks associated with a whole mask feature is determined by the entries with h(i, j) = 1 corresponding to that whole mask;
the method comprises the steps of updating part mask features by applying a shared 3X 3 convolution layer to each part mask feature, adding and fusing all updated part mask features with corresponding whole mask features to obtain fused whole mask features, and updating the fused whole mask features by using another shared 3X 3 convolution layer to generate new whole mask features;
All updated whole and component mask features are integrated to obtain a new mask feature set F'_s;
The updated mask feature set F'_s is weighted by the weighting coefficients W and then concatenated to generate the weighted mask feature F_cat = cat(W_1 · F'_1, W_2 · F'_2, ..., W_k · F'_k);
wherein cat is the feature concatenation operation, W_1, W_2, ..., W_k denote the weighting coefficients, and F'_1, ..., F'_k are the mask features in the updated mask feature set;
Step B4, outputting image characteristics;
The weighted mask feature F_cat is processed with a 1×1 convolutional layer to obtain the feature F_out; finally, the unit concatenates the output feature F_out with the previously removed class token to obtain the final output feature of the module.
The mask pseudo tag set of the training set image output by the component-whole refining module and the relation matrix between the target component and the whole in the mask are utilized, and through the synergistic effect of the steps of the component perception learning module, the model can efficiently extract and fuse the mask characteristics of the training set image, and meanwhile, the relation matrix between the component and the whole strengthens the learning of the relation between the target component and the whole by the model, so that the performance of the model is further improved.
Step 2, inputting the training set image into a SAM segmentation model to obtain a SAM mask set of the training set image;
The method comprises the steps of taking known training set images as input, preprocessing the images, including image size adjustment, image graying and normalization and color space conversion, inputting the preprocessed training set images into a pre-trained SAM segmentation model, automatically segmenting each training image by the model, generating segmentation masks of a plurality of target areas, and obtaining a SAM mask set of the training set images.
And 3, using a part-whole refining module of the part perception learning segmentation network, taking the SAM mask set as input to obtain a mask pseudo-label set of the training set image and a relation matrix between the target part and the whole in the mask, wherein the method comprises three steps of removing a small target mask, removing a high similarity mask and recording the relation matrix between the target part and the whole in the mask.
In the weakly supervised semantic segmentation task, a pseudo label is an approximate pixel-level label, because it is not the true classification information of each pixel. The target regions in an image are usually predicted by the model obtained from the current training, and these prediction results are taken as pseudo pixel-level labels of the targets. Thus, pseudo labels have a higher degree of granularity than image-level labels but slightly lower accuracy, and are typically generated by a Class Activation Map (CAM) or another feature-based strategy. The method uses an unsupervised SAM segmentation model for prediction and preprocesses its output to generate pseudo labels, which are used to guide the model to learn the region boundaries of different targets in an image.
And 4, utilizing a component perception learning segmentation network comprising a component perception learning module, taking a known training set image and an image-level label, and taking a mask pseudo-label set of the training set image and a relation matrix between a target component and the whole in the mask as inputs, wherein the process of acquiring an initial component perception learning segmentation network model is as follows:
Taking an input training set image as input, taking an image-level label as supervision information, taking a mask pseudo-label set of the training set image and a relation matrix between a target component and the whole in a mask as auxiliary information, training a component perception learning segmentation network comprising a component perception learning module by utilizing a classifier by utilizing a joint loss function, and obtaining a trained initial component perception learning segmentation network model;
The joint loss function consists of a classification loss function, an auxiliary classification loss function and a contrast loss function, and is formulated as:
L = λ_1 · L_cls + λ_2 · L_cls_aux + λ_3 · L_reg
wherein λ_1, λ_2 and λ_3 are regularization parameters used to balance the classification loss L_cls, the auxiliary classification loss L_cls_aux and the contrast loss L_reg;
L_cls denotes the cross entropy loss function:
L_cls = −Σ_{c=1}^{C} y_c · log(p_c)
wherein p_c is the prediction probability of category c, y_c is the true label, and C denotes the number of categories;
L_cls_aux denotes the auxiliary classification loss function, which takes the same cross-entropy form but is computed on the prediction obtained by applying the softmax activation function to the features of the component branch network;
L_reg denotes the contrast loss function, which uses cosine similarity CosSim(·) over the elements of the mask feature set F_s to pull together positive sample pairs and push apart negative sample pairs, where N+ and N− respectively denote the numbers of positive and negative sample pairs.
And 5, acquiring a CAM set of a target mask in the training set image by using the initial part perception learning segmentation network model, obtaining a final pseudo tag set through threshold segmentation, calculating a segmentation loss function by taking the final pseudo tag set as supervision information, adding the segmentation loss function based on the joint loss function, simultaneously training the initial part perception learning segmentation network model and a segmentation decoder connected in series at the rear end of the part branch network to obtain a complete part perception learning segmentation network model, and finally inputting an image to be segmented into the complete part perception learning segmentation network model to obtain a semantic segmentation result in the image to be segmented.
The CAM process for obtaining the target mask in the image to be segmented by using the initial part perception learning segmentation network model comprises the steps of inputting the image to be segmented into the trained initial part perception learning segmentation network model, obtaining the class weight of the image to be segmented through model prediction, multiplying the class weight with the last layer of characteristics and adding the class weight to obtain the CAM of the target mask.
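For illustration, a minimal sketch of this CAM computation; the feature and weight names, and the threshold value used for the final pseudo label, are assumptions:

    import torch
    import torch.nn.functional as F

    def class_activation_map(feature_map, classifier_weight, class_idx, thr=0.4):
        # feature_map: (C, H, W) last-layer features; classifier_weight: (num_classes, C).
        w = classifier_weight[class_idx]                           # class weight vector (C,)
        cam = torch.einsum("c,chw->hw", w, feature_map)            # multiply weights with features and sum
        cam = F.relu(cam)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
        return (cam >= thr).float()                                # threshold segmentation -> pseudo label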
As shown in FIG. 4, the method of the embodiment of the invention achieves the best results on the VOC and COCO datasets under the semantic segmentation evaluation index mIoU (mean intersection over union), and its segmentation accuracy is superior to existing methods; in addition, as shown in FIG. 5, the visual results of the method of the embodiment of the invention on the VOC dataset are superior to those of existing methods. These results verify the effectiveness of the technical scheme of the invention on the weakly supervised semantic segmentation task.
Example 2
A weakly supervised semantic segmentation system based on a component-aware learning segmentation network, comprising:
The network construction module is used for constructing a part perception learning segmentation network;
the component perception learning segmentation network comprises a component-whole refining module and a main network which are connected in series, and a classification branch network and a component branch network which are connected in parallel and then connected in series at the rear end of the main network;
the component branch network comprises a component perception learning module, a Transformer Encoder module, a full-connection layer and a classifier which are sequentially connected, wherein the component perception learning module comprises an embedded feature extraction unit, a remodelling embedded feature scale unit, a mask feature fusion unit and an output image feature unit which are sequentially connected in series;
The SAM mask set acquisition module is used for inputting the training set image into the SAM segmentation model to obtain a SAM mask set of the training set image;
The component-whole refining module comprises a small target mask removal unit, a high similarity mask removal unit and a unit for recording the relation matrix between the target component and the whole in the mask, which are sequentially connected in series; it takes the SAM mask set as input to obtain a mask pseudo-label set of the training set image and a relation matrix between the target component and the whole in the mask;
the segmentation network training module is used for training the part perception learning segmentation network by utilizing the known training set image, the image-level label, the mask pseudo-label set and the relation matrix between the target part and the whole in the mask, and obtaining an initial part perception learning segmentation network model;
The segmentation result optimization module is used for acquiring a CAM set of a target mask in a training set image by using an initial component perception learning segmentation network model, obtaining a final pseudo tag set through threshold segmentation, and simultaneously training an initial component perception learning segmentation network and a segmentation decoder connected in series at the rear end of a component branch network by taking the final pseudo tag set as supervision information to obtain a complete component perception learning segmentation network model and obtain a semantic segmentation result in the image to be segmented.
It should be understood that the implementation of the respective modules may be stated with reference to the foregoing method, and the above-described division of the functional modules is merely a division of logic functions, and there may be another division manner when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. Meanwhile, the integrated units can be realized in a hardware form or a software functional unit form.
Example 3
A computer readable storage medium storing a computer program, the computer program being invoked by a processor to implement:
the steps of the weakly supervised semantic segmentation method based on the component-aware learning segmentation network described above.
For a specific implementation of each step, please refer to the description of the foregoing method.
It should be appreciated that in embodiments of the present invention, the processor may be a Central Processing Unit (CPU), or may be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The memory may include read-only memory and random access memory and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store information on the device type.
The readable storage medium is a computer readable storage medium, which may be an internal storage unit of the software and hardware device according to any one of the foregoing embodiments, for example a hard disk or a memory of the controller. The readable storage medium may also be an external storage device of the controller, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the controller. Further, the readable storage medium may also include both an internal storage unit and an external storage device of the controller. The readable storage medium is used to store the computer program and other programs and data required by the controller. The readable storage medium may also be used to temporarily store data that has been output or is to be output.
Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The readable storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM and optical storage) having computer-usable program code embodied therein. The present application is described with reference to flowcharts and/or block diagrams of the methods, apparatus (systems) and computer program products according to embodiments of the present application; each flow and/or block of the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions, and instructions executed by a processor of a computer or other programmable data processing apparatus produce means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be emphasized that the examples described herein are illustrative rather than limiting. The invention is not limited to the examples given in the detailed description; other embodiments obtained by those skilled in the art according to the technical solutions of the invention, whether by modification or substitution, likewise fall within the scope of protection of the invention as long as they do not depart from its spirit and scope.

Claims (10)

1. The weak supervision semantic segmentation method based on the component perception learning segmentation network is characterized by comprising the following steps of:
Step 1, constructing a part perception learning segmentation network;
the component perception learning segmentation network comprises a component-whole refining module and a backbone network which are connected in series, and a classification branch network and a component branch network which are connected in parallel and then connected in series at the rear end of the backbone network;
the component-whole refining module sequentially performs small target mask removal, high similarity mask removal, and recording of the relation matrix between the target components and the whole in the masks;
the component branch network comprises a component perception learning module, a Transformer Encoder module, a fully connected layer and a classifier which are sequentially connected, wherein the component perception learning module sequentially performs embedded feature extraction, embedded feature scale remodeling, mask feature fusion and image feature output;
Step 2, inputting the training set image into a SAM segmentation model to obtain a SAM mask set of the training set image;
Step 3, taking the SAM mask set as the input of the component-whole refining module of the component perception learning segmentation network to obtain the mask pseudo-label set of the training set image and the relation matrix between the target components and the whole in the masks;
Step 4, taking the known training set images, the image-level labels, the mask pseudo-label set of the training set images and the relation matrix between the target components and the whole in the masks as inputs of the component perception learning segmentation network, and training the component perception learning segmentation network to obtain an initial component perception learning segmentation network model;
Step 5, acquiring the CAM set of the target masks in the training set images with the initial component perception learning segmentation network model, obtaining a final pseudo-label set through threshold segmentation, taking the final pseudo-label set as supervision information, simultaneously training the initial component perception learning segmentation network model and a segmentation decoder connected in series at the rear end of the component branch network to obtain a complete component perception learning segmentation network model, and finally inputting the image to be segmented into the complete component perception learning segmentation network model to obtain the semantic segmentation result of the image to be segmented.
2. The method of claim 1, wherein the backbone network consists of a ViT, and the classification branch network comprises a fully connected layer and a classifier connected in sequence.
3. The method of claim 1, wherein the component-whole refining module operates as follows:
Step A1, small target mask removal;
the input is the SAM mask set M of the training set image; the ratio U of foreground pixels in each mask M_i is calculated, the mask is removed if U is lower than a set threshold τ, and an output mask set is obtained;
wherein M_i represents the i-th mask in the SAM mask set M of the training set image, i is the index of the masks in the set, and K is the total number of masks; the retained masks form the output mask set, indexed in the same way, with a correspondingly reduced number of masks;
Step A2, high similarity mask removal;
the input is the output mask set of step A1; the overlap between masks is calculated, masks whose overlap exceeds a set threshold β are judged to be high-similarity masks and are eliminated, and the mask pseudo-label set of the training set image is obtained;
wherein i and j are different mask indexes in the output mask set, denoting the i-th and j-th masks; the retained masks form the pseudo-label set of the training set image, whose size is the total number of retained masks; the intersection and union of two masks are measured by the numbers of pixels in their overlapping region and in the total region covered by the two masks, respectively; an indicator function f(i, j) determines whether a mask is retained: if no other mask exists whose overlap with the i-th mask exceeds the threshold β, f(i, j) = 1 and the mask is retained, otherwise f(i, j) = 0 and the mask is removed;
Step A3, recording the relation matrix between the target components and the whole in the masks;
the pseudo-label set of the training set image is taken as input, and the relation matrix between the target components and the whole in the masks of the pseudo-label set is constructed;
which masks stand in a component-whole relationship is determined by calculating the overlap between masks and setting a threshold: for two masks in the pseudo-label set of the training set image, when the ratio of their overlapping area to the area of the second mask is greater than or equal to the threshold λ, the first mask is recognized as the whole mask and the second mask as its component; an indicator function h(i, j) is used to build the relation matrix between the target and its components;
wherein λ is the threshold; if the j-th mask is a component mask of the i-th mask, h(i, j) = 1, otherwise h(i, j) = 0, and the relation matrix between the target components and the whole in the masks is obtained from h(i, j).
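For orientation only, the following sketch traces the three refining steps of claim 3 in NumPy; the thresholds τ, β and λ follow the claim, while the IoU formula used for the overlap in step A2, the greedy keep-first order and all variable names are assumptions rather than the claimed implementation.

```python
import numpy as np

def refine_masks(masks, tau=0.01, beta=0.9, lam=0.9):
    """masks: list of boolean (H, W) arrays produced by SAM.

    Returns (kept, relation), where relation[i, j] == 1 means the j-th kept mask
    is treated as a component of the whole object represented by the i-th kept mask.
    """
    # Step A1: drop masks whose foreground-pixel ratio U is below tau.
    masks = [m for m in masks if m.mean() >= tau]

    # Step A2: drop near-duplicate masks whose IoU with an already kept mask exceeds beta.
    kept = []
    for m in masks:
        duplicate = any(
            np.logical_and(m, k).sum() / max(np.logical_or(m, k).sum(), 1) > beta
            for k in kept
        )
        if not duplicate:
            kept.append(m)

    # Step A3: mask j counts as a component of mask i when most of its area lies inside mask i.
    n = len(kept)
    relation = np.zeros((n, n), dtype=np.int64)
    for i in range(n):
        for j in range(n):
            if i != j:
                overlap = np.logical_and(kept[i], kept[j]).sum()
                if overlap / max(kept[j].sum(), 1) >= lam:
                    relation[i, j] = 1
    return kept, relation

# Toy example with three masks on an 8x8 grid
grid = np.zeros((8, 8), dtype=bool)
whole, part, tiny = grid.copy(), grid.copy(), grid.copy()
whole[1:7, 1:7] = True   # whole object
part[2:5, 2:5] = True    # component fully inside the whole
tiny[0, 0] = True        # small mask removed in step A1 (ratio 1/64 below tau=0.05)
kept, rel = refine_masks([whole, part, tiny], tau=0.05)
print(len(kept), rel)    # 2 masks kept; rel[0, 1] == 1
```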
4. A method according to claim 3, wherein the component aware learning module operates as follows:
Step B1, embedded feature extraction;
the embedding features of the training set image are extracted from the penultimate layer of the Transformer encoder; by removing the class-token dimension from the embedding features, the remaining embedding features are obtained;
wherein B, N and E respectively represent the size of the input batch, the number of image blocks into which the image is divided, and the feature vector dimension of each image block, and ℝ represents the real number domain;
Step B2, embedded feature scale remodeling;
the shape of the embedding features is rearranged from B×N×E to B×C×H×W; the mask pseudo-label set of the training set image output by the component-whole refining module is then dot-multiplied with the embedding features to obtain an embedded feature set, whose shape is rearranged from B×k×C×H×W to B×k×N×E; the original channel dimension is then reduced to 1/k through a 1×1 filter layer, and a mask feature set is output;
wherein H, W and C respectively represent the height and width of the training set image and the feature dimension after reshaping, and k represents the number of masks in the mask pseudo-label set;
Step B3, mask feature fusion;
average pooling and normalization are performed on the mask feature set to obtain the weighting coefficients;
according to the relation matrix, the corresponding whole masks and component masks are identified, giving the corresponding whole mask features and component mask features;
each component mask feature is updated by applying a shared 3×3 convolution layer; all updated component mask features are added to and fused with the corresponding whole mask feature to obtain a fused whole mask feature, and the fused whole mask feature is updated by another shared 3×3 convolution layer to generate a new whole mask feature;
all updated whole and component mask features are integrated to obtain a new mask feature set;
the updated mask feature set is weighted by the weighting coefficients and the weighted features are concatenated to generate the weighted mask features;
wherein cat denotes the feature concatenation operation, W_1, W_2, ..., W_k are the weighting coefficients, and the weighted terms are the mask features in the updated mask feature set;
Step B4, image feature output;
the weighted mask features are processed by a 1×1 convolution layer to obtain the output features, which are concatenated with the previously removed class token to obtain the final output features.
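A highly simplified PyTorch sketch of steps B1–B4 follows, intended only to make the data flow concrete; the layer sizes, the weighted sum used in place of the weighted concatenation of step B3 and the single 1×1 output projection are simplifications and assumptions, not the claimed module.

```python
import torch
import torch.nn as nn

class PartAwareFusion(nn.Module):
    """Illustrative sketch of the component perception learning module (steps B1-B4)."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.part_conv = nn.Conv2d(embed_dim, embed_dim, 3, padding=1)   # shared 3x3 update for component features
        self.whole_conv = nn.Conv2d(embed_dim, embed_dim, 3, padding=1)  # shared 3x3 update for fused whole features
        self.out_conv = nn.Conv2d(embed_dim, embed_dim, 1)               # final 1x1 projection (step B4)

    def forward(self, tokens, masks, relation):
        # tokens: (B, 1 + N, E) ViT output incl. class token; masks: (B, k, H, W); relation: (k, k)
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]                  # step B1: split off the class token
        B, N, E = patches.shape
        H = W = int(N ** 0.5)                                            # assumes a square patch grid
        fmap = patches.transpose(1, 2).reshape(B, E, H, W)               # step B2: B x N x E -> B x E x H x W

        mask_feats = fmap.unsqueeze(1) * masks.unsqueeze(2)              # per-mask features, (B, k, E, H, W)
        weights = mask_feats.mean(dim=(2, 3, 4)).softmax(dim=1)          # step B3: pooled, normalized weights

        k = masks.shape[1]
        updated = []
        for i in range(k):
            feat = mask_feats[:, i]
            parts = [self.part_conv(mask_feats[:, j]) for j in range(k) if relation[i, j] == 1]
            if parts:                                                    # fuse component features into their whole
                feat = self.whole_conv(feat + torch.stack(parts).sum(dim=0))
            updated.append(feat)
        fused = sum(w.view(B, 1, 1, 1) * f for w, f in zip(weights.unbind(dim=1), updated))

        out = self.out_conv(fused)                                       # step B4
        out_tokens = out.flatten(2).transpose(1, 2)                      # back to (B, N, E)
        return torch.cat([cls_tok, out_tokens], dim=1)                   # re-attach the class token

# Toy usage: batch of 2, a 4x4 patch grid, 32-dim embeddings, 3 masks
tokens = torch.randn(2, 17, 32)
masks = torch.rand(2, 3, 4, 4)
relation = torch.tensor([[0, 1, 0], [0, 0, 0], [0, 0, 0]])
print(PartAwareFusion(embed_dim=32)(tokens, masks, relation).shape)  # torch.Size([2, 17, 32])
```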
5. The method of claim 1, wherein inputting the training set image into the SAM segmentation model yields a SAM mask set of the training set image as follows:
The known training set images are taken as input and preprocessed, including image resizing, image graying and normalization, and color space conversion; the preprocessed training set images are input into a pre-trained SAM segmentation model, which automatically segments each training image and generates segmentation masks of a plurality of target areas, yielding the SAM mask set of the training set images.
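For context, a hedged sketch of how the SAM mask set could be produced with the public segment-anything package is shown below; the checkpoint file name, the image path and the 1024×1024 resize are placeholders, and per-image normalization is handled inside SAM rather than shown explicitly, so this is an illustration under assumptions rather than the claimed preprocessing pipeline.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Hypothetical checkpoint path; any official ViT-B/L/H SAM checkpoint works the same way.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.imread("train_0001.jpg")               # placeholder training image
image = cv2.resize(image, (1024, 1024))            # image size adjustment
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)     # color space conversion expected by SAM

records = mask_generator.generate(image)           # one record per automatically segmented region
sam_masks = [r["segmentation"] for r in records]   # boolean (H, W) masks forming the SAM mask set
print(f"{len(sam_masks)} candidate masks for this image")
```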
6. The method according to claim 1, wherein the process of taking the known training set images, the image-level labels, the mask pseudo-label set of the training set images and the relation matrix between the target components and the whole in the masks as inputs of the component perception learning segmentation network, and training the component perception learning segmentation network to obtain an initial component perception learning segmentation network model, is as follows:
the training set images are taken as input, the image-level labels as supervision information, and the mask pseudo-label set of the training set images together with the relation matrix between the target components and the whole in the masks as auxiliary information; the component perception learning segmentation network containing the component perception learning module is trained through the classifier with a joint loss function, and the trained initial component perception learning segmentation network model is obtained.
7. The method of claim 6, wherein the joint loss function consists of a classification cross-entropy loss function, an auxiliary classification loss function and a contrastive loss function, i.e. L_all = λ1·L_cls + λ2·L̂_cls + λ3·L_reg:
wherein L_all represents the joint loss function, and λ1, λ2 and λ3 are regularization parameters used to balance L_cls, L̂_cls and L_reg;
L_cls denotes the classification cross-entropy loss function:
wherein the prediction probability of category c is compared with the true label y_c, and C represents the number of categories;
L̂_cls represents the auxiliary classification loss function:
wherein the predicted labels are obtained from the features of the component branch network, and softmax is the activation function;
L_reg represents the contrastive loss function:
wherein CosSim(·) denotes the cosine similarity calculation, N+ and N- represent the numbers of positive sample pairs (samples having a component relationship) and negative sample pairs (samples other than the positive samples), and the paired features are elements of the mask feature set.
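To make the three terms of the joint loss concrete, a hedged sketch is given below; whether the classification terms use a softmax or a multi-label sigmoid formulation is not recoverable from the translated formulas, so binary cross-entropy with logits is assumed, and the exact sign convention of the contrastive term (pulling component pairs together, pushing the other pairs apart) is likewise an assumption.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, aux_logits, labels, part_feats, pos_pairs, neg_pairs,
               lambdas=(1.0, 1.0, 0.1)):
    """Sketch of L_all = lambda1 * L_cls + lambda2 * auxiliary loss + lambda3 * L_reg.

    logits, aux_logits: (B, C) scores from the classification and component branches.
    labels: (B, C) binary image-level labels (float).
    part_feats: (M, D) pooled mask features; pos_pairs / neg_pairs: lists of (i, j) index pairs.
    """
    l1, l2, l3 = lambdas
    cls = F.binary_cross_entropy_with_logits(logits, labels)        # L_cls
    aux = F.binary_cross_entropy_with_logits(aux_logits, labels)    # auxiliary classification loss

    def mean_cos(pairs):
        if not pairs:
            return part_feats.new_zeros(())
        sims = [F.cosine_similarity(part_feats[i], part_feats[j], dim=0) for i, j in pairs]
        return torch.stack(sims).mean()

    # L_reg: raise similarity of component pairs, lower similarity of the remaining pairs.
    reg = mean_cos(neg_pairs) - mean_cos(pos_pairs)
    return l1 * cls + l2 * aux + l3 * reg

# Toy usage
logits = torch.randn(4, 20)
labels = torch.randint(0, 2, (4, 20)).float()
feats = torch.randn(6, 64)
print(joint_loss(logits, logits.clone(), labels, feats, [(0, 1)], [(0, 2), (3, 4)]))
```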
8. The method of claim 7, wherein the final pseudo-label set is used as supervision information, a segmentation loss function is calculated and added to the joint loss function, the initial component perception learning segmentation network and an additional segmentation decoder connected in series at the rear end of the component branch network are trained simultaneously to obtain the complete component perception learning segmentation network model, and the image to be segmented is finally input into the complete component perception learning segmentation network model to infer the semantic segmentation result of the image to be segmented.
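A minimal sketch of how the segmentation loss of this claim might be combined with the joint loss is shown below; the loss weight μ, the pixel-wise cross-entropy formulation and the ignore index marking unreliable pseudo-label pixels are assumptions for illustration.

```python
import torch.nn.functional as F

def total_loss(joint, seg_logits, pseudo_labels, mu=1.0, ignore_index=255):
    """joint: the L_all value from claim 7 (a scalar tensor);
    seg_logits: (B, C+1, H, W) output of the segmentation decoder;
    pseudo_labels: (B, H, W) final pseudo labels, with ignore_index for unreliable pixels."""
    seg = F.cross_entropy(seg_logits, pseudo_labels, ignore_index=ignore_index)
    return joint + mu * seg
```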
9. A weak supervision semantic segmentation system based on a component perception learning segmentation network, comprising:
The network construction module is used for constructing a part perception learning segmentation network;
the component perception learning segmentation network comprises a component-whole refining module and a backbone network which are connected in series, and a classification branch network and a component branch network which are connected in parallel and then connected in series at the rear end of the backbone network;
the component branch network comprises a component perception learning module, a Transformer Encoder module, a fully connected layer and a classifier which are sequentially connected, wherein the component perception learning module comprises an embedded feature extraction unit, an embedded feature scale remodeling unit, a mask feature fusion unit and an image feature output unit which are sequentially connected in series;
The SAM mask set acquisition module is used for inputting the training set image into the SAM segmentation model to obtain a SAM mask set of the training set image;
The component-whole refining module comprises a small target mask removing unit, a high similarity mask removing unit and a unit for recording the relation matrix between the target components and the whole in the masks, which are sequentially connected in series; it takes the SAM mask set as input to obtain the mask pseudo-label set of the training set images and the relation matrix between the target components and the whole in the masks;
The segmentation network training module is used for training the component perception learning segmentation network with the known training set images, the image-level labels, the mask pseudo-label set and the relation matrix between the target components and the whole in the masks, so as to obtain an initial component perception learning segmentation network model;
The segmentation result optimization module is used for acquiring the CAM set of the target masks in the training set images with the initial component perception learning segmentation network model, obtaining a final pseudo-label set through threshold segmentation, and, taking the final pseudo-label set as supervision information, simultaneously training the initial component perception learning segmentation network and a segmentation decoder connected in series at the rear end of the component branch network to obtain a complete component perception learning segmentation network model; the image to be segmented is input into the complete component perception learning segmentation network model, and the semantic segmentation result of the image to be segmented is obtained.
10. A computer storage medium storing a computer program, the computer program being invoked by a processor to implement:
the method of any one of claims 1-8.
CN202411985702.XA 2024-12-31 2024-12-31 Weak supervision semantic segmentation method and system based on part perception learning segmentation network Active CN119904638B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411985702.XA CN119904638B (en) 2024-12-31 2024-12-31 Weak supervision semantic segmentation method and system based on part perception learning segmentation network

Publications (2)

Publication Number Publication Date
CN119904638A true CN119904638A (en) 2025-04-29
CN119904638B CN119904638B (en) 2025-09-19

Family

ID=95469396

Country Status (1)

Country Link
CN (1) CN119904638B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115240028A (en) * 2022-06-29 2022-10-25 上海艺冉医疗科技股份有限公司 Small intestinal stromal tumor target self-training detection method utilizing CAM and SAM in parallel
US20230186100A1 (en) * 2020-05-27 2023-06-15 Tomtom Global Content B.V. Neural Network Model for Image Segmentation
KR20230155870A (en) * 2022-05-04 2023-11-13 네이버 주식회사 Method and system for training image segmentation model
CN117437423A (en) * 2023-11-29 2024-01-23 南京理工大学 Weakly supervised medical image segmentation method and device based on SAM collaborative learning and cross-layer feature aggregation enhancement
CN118411522A (en) * 2024-04-25 2024-07-30 南京邮电大学 Transformer weak supervision semantic segmentation method combining context attention

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHIGANG LING et al.: "RSMNet: A Regional Similar Module Network for Weakly Supervised Object Localization", NEURAL PROCESSING LETTERS, vol. 54, 10 May 2022 (2022-05-10) *
汝理想 (Ru Lixiang): "Research on Weakly Supervised Semantic Segmentation Based on Pseudo-Labels", China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 06, 15 June 2024 (2024-06-15) *

Similar Documents

Publication Publication Date Title
CN113920370B (en) Model training method, target detection method, device, equipment and storage medium
Niloy et al. CFL-Net: Image forgery localization using contrastive learning
CN114926835B (en) Text generation, model training method and device
Al-Huda et al. Weakly supervised pavement crack semantic segmentation based on multi-scale object localization and incremental annotation refinement
CN112651417B (en) License plate recognition method, device, equipment and storage medium
CN115565071A (en) Hyperspectral Image Transformer Network Training and Classification Method
CN114241231B (en) Recognition method and device based on hierarchical label attention
CN117152414A (en) A target detection method and system based on scale attention-assisted learning method
CN113221633A (en) Weak supervision time sequence behavior positioning method based on hierarchical category model
CN118628846A (en) Target detection method, device, electronic device and medium based on multimodal prompting
CN117975002A (en) Weak supervision image segmentation method based on multi-scale pseudo tag fusion
CN114155524B (en) Single-stage 3D point cloud target detection method and device, computer equipment, and medium
CN118552722A (en) Hierarchical attention enhanced activation-based weak supervision semantic segmentation method
CN119151964B (en) A method, device and computer equipment for detecting floating objects on water surface
CN119904638B (en) Weak supervision semantic segmentation method and system based on part perception learning segmentation network
CN119540709B (en) A barcode detection method and system based on improved YOLOv8
CN119540971A (en) Method, system, device and medium for financial form recognition
CN114067221B (en) Remote sensing image woodland extraction method, system, device and medium
CN117593514A (en) Image target detection method and system based on deep principal component analysis assistance
CN117894314A (en) Speech clustering method, device and computer readable medium
CN117274745A (en) Methods and devices for data set generation, semantic matching and semantic matching model training
CN116342980A (en) Spliced image recognition method, computer readable storage medium and electronic equipment
CN114444597A (en) Visual tracking method and device based on progressive fusion network
CN119963993B (en) Mask2Former landslide identification method and device based on terrain characteristics
CN114220067B (en) Multi-scale succinct attention pedestrian re-identification method, system, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant