
CN116703950B - A camouflage target image segmentation method and system based on multi-level feature fusion - Google Patents

A camouflage target image segmentation method and system based on multi-level feature fusion

Info

Publication number
CN116703950B
Authority
CN
China
Prior art keywords
features
feature
fusion
boundary
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310982262.1A
Other languages
Chinese (zh)
Other versions
CN116703950A (en)
Inventor
任胜兵
梁义
周佳蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202310982262.1A
Publication of CN116703950A
Application granted
Publication of CN116703950B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a camouflage target image segmentation method and system based on multi-level feature fusion. The method performs global feature enhancement on the first feature map of each level and local feature enhancement on the second feature map of each level; fuses the enhanced local features of each level with the enhanced global features of the same level to obtain fusion features of multiple levels; performs boundary guidance on the fusion features of the two shallow network layers among the multi-level fusion features to obtain a boundary map; performs feature interaction on the fusion features of adjacent network layers among the multi-level fusion features to obtain multiple interaction features; fuses the boundary map with each of the interaction features to obtain multiple boundary fusion features; and, based on the plurality of boundary fusion features, segments the camouflage target image in the camouflage target image to be segmented corresponding to each boundary fusion feature. The invention can improve the accuracy of camouflage target image segmentation.

Description

Camouflage target image segmentation method and system based on multi-level feature fusion
Technical Field
The invention relates to the technical field of camouflage target image segmentation, in particular to a camouflage target image segmentation method and system based on multi-level feature fusion.
Background
Camouflage Object Segmentation (COS) aims to segment camouflaged objects that are highly similar to the background; it uses computer vision models to assist the human visual and perceptual system in camouflage target image segmentation.
However, in the prior art, models with a CNN backbone (such as ResNet) have strong local feature extraction capability, but the limited receptive field of a CNN restricts its ability to capture long-range feature dependencies; Transformer-based models (e.g., Vision Transformer) benefit from the attention mechanism in the Transformer, which has strong modeling capability for global feature relationships, but they have limitations in capturing fine-grained details, resulting in weakened expression of local features. Camouflage target image segmentation requires not only segmenting the target as a whole based on global features, but also handling detailed information such as boundaries based on local features; with a single backbone network, fusing local and global features requires complex methods and is inefficient. Most methods fuse multi-level features with simple operations such as concatenation and summation, where high-level and low-level features interact by first fusing the two features through an addition operation. The fused features are then fed into a Sigmoid activation function to obtain a normalized feature map, which is treated as a feature-level attention map to enhance the feature representation. In this case, achieving cross-level feature enhancement with a fused feature map obtained by a simple addition operation cannot capture the valuable information that is highly relevant to segmenting the camouflaged target. Some models are dedicated to extracting global texture features of camouflaged targets while neglecting the influence of boundaries on the model's expressive power, and these models do not perform well when the target object shares the same texture features as the background. Since the texture of most camouflaged objects is similar to the background, distinguishing subtle differences in local boundary information is particularly important for improving model performance. Some models do consider boundary features, but they usually only supervise the predicted boundary map as a separate branch without further processing, so the boundary map information is not fully exploited.
In summary, in the prior art it is difficult to capture useful feature information, and the predicted boundary map information cannot be fully utilized, so accurate camouflage target segmentation is difficult to achieve.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, the invention provides a camouflage target image segmentation method and a camouflage target image segmentation system based on multi-level feature fusion, which can improve the accuracy of camouflage target image segmentation.
In a first aspect, an embodiment of the present invention provides a method for dividing a camouflage target image based on multi-level feature fusion, where the method for dividing the camouflage target image based on multi-level feature fusion includes:
obtaining a camouflage target image to be segmented;
performing multi-level feature extraction on the camouflage target image to be segmented by adopting different network layers through a first branch network and a second branch network to obtain a first feature image of multiple levels output by the first branch network and a second feature image of multiple levels output by the second branch network;
performing global feature enhancement on the first feature map of each level to obtain global features after multiple levels of enhancement; carrying out local feature enhancement on the second feature map of each level to obtain local features after multiple levels of enhancement; carrying out feature fusion on the enhanced local features of each level and the enhanced global features of the same level to obtain fusion features of multiple levels;
performing boundary guidance on the fusion features of two shallow network layers among the fusion features of the multiple levels to obtain a boundary map;
performing feature interaction on the fusion features of adjacent network layers in the fusion features of the multiple layers to obtain multiple interaction features;
respectively carrying out boundary fusion on the boundary map and each interactive feature in the plurality of interactive features to obtain a plurality of boundary fusion features;
and segmenting the camouflage target image in the camouflage target image to be segmented corresponding to each boundary fusion feature based on the plurality of boundary fusion features.
Compared with the prior art, the first aspect of the invention has the following beneficial effects:
according to the method, multi-level feature extraction is performed on the camouflage target image to be segmented through the first branch network and the second branch network using different network layers, obtaining first feature maps of multiple levels output by the first branch network and second feature maps of multiple levels output by the second branch network, so that the features in the target image can be better extracted. Global feature enhancement is performed on the first feature map of each level to obtain enhanced global features of multiple levels, local feature enhancement is performed on the second feature map of each level to obtain enhanced local features of multiple levels, and the enhanced local features of each level are fused with the enhanced global features of the same level to obtain fusion features of multiple levels; the enhanced local and global features complement each other and provide comprehensive feature information for accurately segmenting the camouflage target image. Boundary guidance is performed on the fusion features of two shallow network layers among the multi-level fusion features to obtain a boundary map; because the shallow layers retain more of the object's edge and spatial information, using the shallow fusion features for boundary guidance can generate a high-quality boundary map. Feature interaction is performed on the fusion features of adjacent network layers among the multi-level fusion features to obtain multiple interaction features, so that the multi-level fusion features complement each other and yield a comprehensive feature expression. Boundary fusion is performed between the boundary map and each of the interaction features to obtain boundary fusion features, and the camouflage target image in the camouflage target image to be segmented corresponding to each boundary fusion feature is segmented based on the boundary fusion features; with the boundary information of the boundary map as guidance, the boundary map features are integrated with the interaction features of different levels, refining the boundary features and ensuring clear and complete boundaries, which helps distinguish the fine foreground-background boundaries of camouflaged targets, provides better expressive power for camouflage target segmentation, and improves the accuracy of camouflage target image segmentation.
According to some embodiments of the present invention, the extracting, by using different network layers, the multi-level feature of the camouflage target image to be segmented through the first branch network and the second branch network, to obtain a first feature map of multiple levels output by the first branch network and a second feature map of multiple levels output by the second branch network, including:
extracting features of global context information of the camouflage target image to be segmented by adopting different network layers through a first branch network to obtain a first feature map of multiple layers output by the first branch network;
and extracting the characteristics of the local detail information of the camouflage target image to be segmented by adopting different network layers through a second branch network, and obtaining a plurality of layers of second characteristic images output by the second branch network.
According to some embodiments of the present invention, global feature enhancement is performed on the first feature map of each level, so as to obtain global features after multiple levels of enhancement; carrying out local feature enhancement on the second feature map of each level to obtain local features after multiple levels of enhancement; and performing feature fusion on the enhanced local feature of each level and the enhanced global feature of the same level to obtain fusion features of multiple levels, wherein the feature fusion comprises the following steps:
Carrying out global feature enhancement on the first feature map of each level by adopting a residual channel attention mechanism to obtain global features after multiple levels of enhancement;
local feature enhancement is carried out on the second feature map of each level by adopting a spatial attention mechanism, so that local features after multiple levels of enhancement are obtained;
splicing the enhanced local features of each level with the enhanced global features of the same level to obtain splicing features of multiple levels, and adopting a convolution layer to promote feature fusion of the splicing features of the multiple levels to obtain fusion features of multiple levels.
According to some embodiments of the present invention, the performing boundary guiding on the fusion features of the two shallow network layers in the multiple layers of fusion features to obtain a boundary map includes:
convolving the fusion characteristics of two shallow network layers in the fusion characteristics of the multiple layers to obtain a first convolution characteristic and a second convolution characteristic;
and performing addition operation on the first convolution characteristic and the second convolution characteristic to obtain an addition characteristic, and performing boundary guiding on the addition characteristic by adopting a plurality of convolution layers to obtain a boundary map.
According to some embodiments of the present invention, the feature interaction is performed on the fusion features of adjacent network layers in the multiple layers of fusion features, to obtain multiple interaction features, including:
introducing a multi-scale channel attention mechanism into a feature interaction module, and adding fusion features of adjacent network layers in the multi-level fusion features to obtain a plurality of addition features;
inputting each added feature into the multi-scale channel attention mechanism to obtain a plurality of multi-scale channel features;
obtaining a plurality of normalized features by applying an activation function to each multi-scale channel feature, and obtaining a plurality of normalized difference features by subtracting each normalized feature from one;
performing feature enhancement on the plurality of normalized features and the plurality of normalized difference features to obtain a plurality of enhanced normalized features and a plurality of enhanced normalized difference features;
residual connection is carried out on each enhanced normalized feature and the corresponding fusion feature to obtain a plurality of first residual features, and each first residual feature is convolved to obtain a plurality of first convolution features;
residual connection is carried out on each enhanced normalized difference feature and the corresponding fusion feature to obtain a plurality of second residual features, and convolution is carried out on each second residual feature to obtain a plurality of second convolution features;
And adding each first convolution feature and the corresponding second convolution feature to obtain a plurality of added convolution features, and adopting a convolution layer to promote fusion of the added convolution features to obtain a plurality of interaction features.
According to some embodiments of the invention, the performing boundary fusion on the boundary map and each of the plurality of interaction features to obtain a plurality of boundary fusion features includes:
based on each interaction characteristic, learning a target overall characteristic by adopting a target attention head branch; wherein the target attention head branch is used for separating a target and a background on the whole based on the interaction characteristics;
based on the boundary map and each interaction characteristic, adopting a boundary attention head branch to learn boundary detail characteristics; the boundary attention head branches are used for capturing sparse local boundary information of the target based on the boundary map and the interaction characteristics;
splicing the output of each target attention head branch and the output of each boundary attention head branch corresponding to the target attention head branch to obtain a plurality of output splicing features, and adopting a convolution layer to promote feature fusion of the plurality of output splicing features to obtain a plurality of convolution fusion features;
And carrying out residual connection on each convolution fusion feature and each interaction feature corresponding to each convolution fusion feature to obtain a plurality of boundary fusion features.
According to some embodiments of the invention, the segmenting the camouflage target image in the camouflage target image to be segmented corresponding to each boundary fusion feature based on the boundary fusion features includes:
inputting the boundary fusion features into a convolution layer with a Sigmoid activation function to generate a plurality of prediction graphs;
and dividing a camouflage target image in the camouflage target images to be divided based on each prediction graph.
In a second aspect, the embodiment of the present invention further provides a camouflage target image segmentation system based on multi-level feature fusion, where the camouflage target image segmentation system based on multi-level feature fusion includes:
the data acquisition unit is used for acquiring a camouflage target image to be segmented;
the feature extraction unit is used for extracting multi-level features of the camouflage target image to be segmented by adopting different network layers through a first branch network and a second branch network to obtain a first feature image of multiple levels output by the first branch network and a second feature image of multiple levels output by the second branch network;
The feature fusion unit is used for carrying out global feature enhancement on the first feature map of each level to obtain global features after multiple levels of enhancement; carrying out local feature enhancement on the second feature map of each level to obtain local features after multiple levels of enhancement; carrying out feature fusion on the enhanced local features of each level and the enhanced global features of the same level to obtain fusion features of multiple levels;
the boundary guiding unit is used for conducting boundary guiding on the fusion characteristics of the two shallow network layers in the fusion characteristics of the multiple layers to obtain a boundary diagram;
the feature interaction unit is used for carrying out feature interaction on the fusion features of the adjacent network layers in the fusion features of the multiple layers to obtain multiple interaction features;
the boundary fusion unit is used for carrying out boundary fusion on the boundary graph and each interaction feature in the interaction features respectively to obtain a plurality of boundary fusion features;
the image segmentation unit is used for segmenting a camouflage target image in the camouflage target images to be segmented, which correspond to each boundary fusion feature, based on the boundary fusion features.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one memory;
at least one processor;
at least one computer program;
the at least one computer program is stored in the at least one memory, and the at least one processor executes the at least one computer program to implement a camouflage target image segmentation method based on multi-level feature fusion as described in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a storage medium, where the storage medium is a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program is configured to make a computer execute a method for dividing a camouflage target image based on multi-level feature fusion according to the first aspect.
It is to be understood that the advantages of the second to fourth aspects compared with the related art are the same as those of the first aspect compared with the related art, and reference may be made to the related description in the first aspect, which is not repeated herein.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of a camouflage target image segmentation method based on multi-level feature fusion according to an embodiment of the invention;
FIG. 2 is a flowchart of a method for camouflage target image segmentation based on multi-level feature fusion according to another embodiment of the present invention;
FIG. 3 is a schematic view of the overall structure of a model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a Res2Net module and a basic convolution module according to one embodiment of the present invention;
FIG. 5 is a block diagram of a residual channel attention mechanism of an embodiment of the present invention;
FIG. 6 is a block diagram of a spatial attention mechanism of an embodiment of the present invention;
FIG. 7 is a block diagram of an LGA module according to an embodiment of the present invention;
FIG. 8 is a block diagram of an MS-CA according to an embodiment of the invention;
FIG. 9 is a block diagram of a CFT module in accordance with an embodiment of the present invention;
fig. 10 is a block diagram of an MTA according to an embodiment of the invention;
FIG. 11 is a block diagram of a BMTA according to an embodiment of the invention;
fig. 12 is a block diagram of a BAH according to an embodiment of the present invention;
FIG. 13 is a block diagram of a camouflage target image segmentation system based on multi-level feature fusion according to an embodiment of the invention;
fig. 14 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In the description of the present invention, the description of first, second, etc. is for the purpose of distinguishing between technical features only and should not be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated or implicitly indicating the precedence of the technical features indicated.
In the description of the present invention, it should be understood that the direction or positional relationship indicated with respect to the description of the orientation, such as up, down, etc., is based on the direction or positional relationship shown in the drawings, is merely for convenience of describing the present invention and simplifying the description, and does not indicate or imply that the apparatus or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be determined reasonably by a person skilled in the art in combination with the specific content of the technical solution.
In the prior art, models with a CNN backbone (such as ResNet) have strong local feature extraction capability, but the limited receptive field of a CNN restricts its ability to capture long-range feature dependencies; Transformer-based models (e.g., Vision Transformer) benefit from the attention mechanism in the Transformer, which has strong modeling capability for global feature relationships, but they have limitations in capturing fine-grained details, resulting in weakened expression of local features. Most methods fuse multi-level features with simple operations such as concatenation and summation, where high-level and low-level features interact by first fusing the two features through an addition operation. The fused features are then fed into a Sigmoid activation function to obtain a normalized feature map, which is treated as a feature-level attention map to enhance the feature representation. In this case, achieving cross-level feature enhancement with a fused feature map obtained by a simple addition operation cannot capture the valuable information that is highly relevant to segmenting the camouflaged target. Some models are dedicated to extracting global texture features of camouflaged targets while neglecting the influence of boundaries on the model's expressive power, and these models do not perform well when the target object shares the same texture features as the background. Since the texture of most camouflaged objects is similar to the background, distinguishing subtle differences in local boundary information is particularly important for improving model performance. Some models do consider boundary features, but they usually only supervise the predicted boundary map as a separate branch without further processing, so the boundary map information is not fully exploited.
In summary, in the prior art it is difficult to capture useful feature information, and the predicted boundary map information cannot be fully utilized, so accurate camouflage target segmentation is difficult to achieve.
To solve these problems, the invention performs multi-level feature extraction on the camouflage target image to be segmented through a first branch network and a second branch network using different network layers, obtaining first feature maps of multiple levels output by the first branch network and second feature maps of multiple levels output by the second branch network, so that the features in the target image can be better extracted. Global feature enhancement is performed on the first feature map of each level to obtain enhanced global features of multiple levels, local feature enhancement is performed on the second feature map of each level to obtain enhanced local features of multiple levels, and the enhanced local features of each level are fused with the enhanced global features of the same level to obtain fusion features of multiple levels; the enhanced local and global features complement each other and provide comprehensive feature information for accurately segmenting the camouflage target image. Boundary guidance is performed on the fusion features of two shallow network layers among the multi-level fusion features to obtain a boundary map; because the shallow layers retain more of the object's edge and spatial information, using the shallow fusion features for boundary guidance can generate a high-quality boundary map. Feature interaction is performed on the fusion features of adjacent network layers among the multi-level fusion features to obtain multiple interaction features, so that the multi-level fusion features complement each other and yield a comprehensive feature expression. Boundary fusion is performed between the boundary map and each of the interaction features to obtain boundary fusion features, and the camouflage target image in the camouflage target image to be segmented corresponding to each boundary fusion feature is segmented based on the boundary fusion features; with the boundary information of the boundary map as guidance, the boundary map features are integrated with the interaction features of different levels, refining the boundary features and ensuring clear and complete boundaries, which helps distinguish the fine foreground-background boundaries of camouflaged targets, provides better expressive power for camouflage target segmentation, and improves the accuracy of camouflage target image segmentation.
Referring to fig. 1, an embodiment of the present invention provides a method for dividing a camouflage target image based on multi-level feature fusion, where the method includes, but is not limited to, steps S100 to S700, where:
step S100, obtaining a camouflage target image to be segmented;
step 200, performing multi-level feature extraction on a camouflage target image to be segmented by adopting different network layers through a first branch network and a second branch network to obtain a first feature image of multiple layers output by the first branch network and a second feature image of multiple layers output by the second branch network;
step S300, carrying out global feature enhancement on the first feature map of each level to obtain global features after multiple levels of enhancement; carrying out local feature enhancement on the second feature map of each level to obtain local features after multiple levels of enhancement; carrying out feature fusion on the reinforced local features of each level and the reinforced global features of the same level to obtain fusion features of multiple levels;
step S400, conducting boundary guiding on the fusion characteristics of two shallow network layers in the fusion characteristics of multiple layers to obtain a boundary diagram;
Step S500, carrying out feature interaction on fusion features of adjacent network layers in the fusion features of multiple layers to obtain multiple interaction features;
step S600, respectively carrying out boundary fusion on the boundary map and each interactive feature in the plurality of interactive features to obtain a plurality of boundary fusion features;
step S700, based on a plurality of boundary fusion features, segmenting a camouflage target image in the camouflage target images to be segmented, wherein the camouflage target image corresponds to each boundary fusion feature.
In steps S100 to S700 of some embodiments, in order to better extract features in the target image, in this embodiment, multiple levels of feature extraction are performed on the camouflage target image to be segmented by adopting different network layers through the first branch network and the second branch network, so as to obtain multiple levels of first feature images output by the first branch network and multiple levels of second feature images output by the second branch network; in order to provide comprehensive feature information for accurately dividing a camouflage target image, the embodiment obtains global features after multiple layers of enhancement by carrying out global feature enhancement on the first feature map of each layer; carrying out local feature enhancement on the second feature map of each level to obtain local features after multiple levels of enhancement; carrying out feature fusion on the reinforced local features of each level and the reinforced global features of the same level to obtain fusion features of multiple levels; in order to generate a high-quality boundary map, the boundary map is obtained by conducting boundary guidance on the fusion characteristics of two shallow network layers in the fusion characteristics of multiple layers; in order to obtain comprehensive feature expression, in the embodiment, feature interaction is performed on the fusion features of adjacent network layers in the fusion features of multiple layers to obtain multiple interaction features; in order to improve the accuracy of the camouflage target image segmentation, in the embodiment, boundary fusion is performed on the boundary map and each interactive feature in the plurality of interactive features respectively to obtain a plurality of boundary fusion features, and the camouflage target image in the camouflage target image to be segmented corresponding to each boundary fusion feature is segmented based on the plurality of boundary fusion features.
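To make the flow of steps S100 to S700 easier to follow, the sketch below lays out the pipeline in PyTorch-style pseudocode. All module and variable names (lga, bgm, cft, bmta, pred_head) are hypothetical placeholders standing in for the modules described later in this document; this is a structural sketch under those assumptions, not the patented implementation.

```python
# Structural sketch of steps S100-S700; the module objects are assumed to be callables
# matching the descriptions in this document (names are placeholders).
def segment_camouflaged_target(image, swin_branch, res2net_branch, lga, bgm, cft, bmta, pred_head):
    T = swin_branch(image)     # S200: multi-level global features T1..T4 (first branch)
    R = res2net_branch(image)  # S200: multi-level local features R1..R4 (second branch)
    # S300: per-level enhancement and fusion (LGA modules), giving fusion features a1..a4
    a = [lga[i](T[i], R[i]) for i in range(4)]
    # S400: boundary guidance from the two shallow levels
    boundary_map = bgm(a[0], a[1])
    # S500: feature interaction between adjacent levels (CFT modules)
    f = [cft[i](a[i], a[i + 1]) for i in range(3)]
    # S600: boundary fusion of the boundary map with each interaction feature (BMTA modules)
    b = [bmta[i](f[i], boundary_map) for i in range(3)]
    # S700: prediction maps P1..P3 from coarse to fine; the finest is the final segmentation
    preds = [pred_head[i](b[i]) for i in range(3)]
    return preds[-1], boundary_map
```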
In some embodiments, the method includes performing multi-level feature extraction on a camouflage target image to be segmented by using different network layers through a first branch network and a second branch network to obtain a first feature map of multiple levels output by the first branch network and a second feature map of multiple levels output by the second branch network, including:
extracting features of global context information of a camouflage target image to be segmented by adopting different network layers through a first branch network to obtain a first feature map of multiple layers output by the first branch network;
and extracting the characteristics of the local detail information of the camouflage target image to be segmented by adopting different network layers through the second branch network, and obtaining a plurality of layers of second characteristic diagrams output by the second branch network.
Specifically, feature extraction is performed on the global context information of the camouflage target image to be segmented using different network layers of Swin Transformer V2 (i.e., the first branch network), obtaining first feature maps of multiple levels output by the first branch network. The multi-head self-attention mechanism in Swin Transformer V2 can break through the receptive-field limitation of CNNs, model the contextual relationships pixel by pixel over the global scope, and assign larger weights to important features so that the feature expression is richer. Feature extraction is performed on the local detail information of the camouflage target image to be segmented using different network layers of Res2Net (i.e., the second branch network), obtaining second feature maps of multiple levels output by the second branch network. Res2Net has a stronger and more effective multi-level feature extraction capability, refines features at a finer granularity, and highlights the difference between foreground and background.
In this embodiment, since most of the current models extract features based on a single backbone network, a complex method is required to fuse local features and global features, which is inefficient. Therefore, in the embodiment, the first branch network and the second branch network adopt different network layers to extract the multi-level characteristics of the camouflage target image to be segmented, so that the characteristics in the target image can be better extracted.
Note that, in this embodiment, Swin Transformer V2 and Res2Net are used for feature extraction, but the embodiment is not limited to Swin Transformer V2 and Res2Net; the backbones may be changed according to the actual situation and are not particularly limited here.
In some embodiments, global feature enhancement is performed on the first feature map of each level, so as to obtain global features after multiple levels of enhancement; carrying out local feature enhancement on the second feature map of each level to obtain local features after multiple levels of enhancement; and carrying out feature fusion on the reinforced local features of each level and the reinforced global features of the same level to obtain fusion features of multiple levels, wherein the method comprises the following steps:
carrying out global feature enhancement on the first feature map of each level by adopting a residual channel attention mechanism to obtain global features after multiple levels of enhancement;
Carrying out local feature enhancement on the second feature map of each level by adopting a spatial attention mechanism to obtain local features after the enhancement of multiple levels;
splicing the reinforced local features of each level with the reinforced global features of the same level to obtain splicing features of multiple levels, and adopting a convolution layer to promote feature fusion of the splicing features of multiple levels to obtain fusion features of multiple levels.
Specifically, a local space detail and global context information fusion module is adopted to enhance and fuse the characteristics. The local space detail and global context information fusion module simultaneously applies a channel attention mechanism and a space attention mechanism, and adopts a residual channel attention mechanism to carry out global feature enhancement on the first feature map of each level so as to obtain global features after multiple levels of enhancement; carrying out local feature enhancement on the second feature map of each level by adopting a spatial attention mechanism to obtain local features after the enhancement of multiple levels; splicing the reinforced local features of each level with the reinforced global features of the same level to obtain splicing features of multiple levels, and adopting a convolution layer to promote feature fusion of the splicing features of multiple levels to obtain fusion features of multiple levels.
In this embodiment, the local spatial detail and global context information fusion module considers both the global context and the local details of the image to identify the overall trend of the image, effectively supplementing the feature extraction capability of the two backbone branch networks. The global information is used to extract important global features for roughly estimating the position of the target object, the local information is used to extract fine-grained features of the target object, and the local and global features complement each other to provide comprehensive feature information for accurate camouflage target segmentation.
In some embodiments, performing boundary guiding on the fusion features of two shallow network layers in the fusion features of multiple layers to obtain a boundary map, including:
convolving the fusion characteristics of two shallow network layers in the fusion characteristics of multiple layers to obtain a first convolution characteristic and a second convolution characteristic;
and performing addition operation on the first convolution feature and the second convolution feature to obtain an addition feature, and performing boundary guiding on the addition feature by adopting a plurality of convolution layers to obtain a boundary map.
Specifically, a boundary guiding module is adopted to convolve the fusion characteristics of two shallow network layers in the fusion characteristics of multiple layers, so as to obtain a first convolution characteristic and a second convolution characteristic; and performing addition operation on the first convolution feature and the second convolution feature to obtain an addition feature, and performing boundary guiding on the addition feature by adopting a plurality of convolution layers to obtain a boundary map.
In this embodiment, since more semantic information is retained in the shallow layer, boundary guidance is performed by using the fusion features of the shallow layer, so that a high-quality boundary map can be generated.
In some embodiments, feature interaction is performed on fusion features of adjacent network layers in the multiple layers of fusion features to obtain multiple interaction features, including:
introducing a multi-scale channel attention mechanism into the feature interaction module, and adding fusion features of adjacent network layers in the fusion features of multiple layers to obtain multiple addition features;
inputting each added feature into a multi-scale channel attention mechanism to obtain a plurality of multi-scale channel features;
obtaining a plurality of normalized features by applying an activation function to each multi-scale channel feature, and obtaining a plurality of normalized difference features by subtracting each normalized feature from one;
performing feature enhancement on the plurality of normalized features and the plurality of normalized difference features to obtain a plurality of enhanced normalized features and a plurality of enhanced normalized difference features;
residual connection is carried out on each enhanced normalized feature and the corresponding fusion feature to obtain a plurality of first residual features, and each first residual feature is convolved to obtain a plurality of first convolution features;
Residual connection is carried out on each enhanced normalized difference feature and the corresponding fusion feature to obtain a plurality of second residual features, and convolution is carried out on each second residual feature to obtain a plurality of second convolution features;
and adding each first convolution feature and the corresponding second convolution feature to obtain a plurality of added convolution features, and adopting a convolution layer to promote fusion of the added convolution features to obtain a plurality of interaction features.
In this embodiment, the feature interaction module introduces a multi-scale channel attention mechanism to realize efficient interaction of cross-layer features and to cope with changes in target size in camouflage target segmentation. The multi-scale channel attention mechanism adapts well to targets of different scales. It is based on a dual-branch structure: one branch uses global average pooling to obtain global features and allocates more attention to large-scale objects, while the other branch uses point convolution to obtain fine-grained local details, which helps capture the features of small objects. Unlike other multi-scale attention mechanisms, the multi-scale channel attention mechanism uses point convolution in both branches to compress and restore the channel dimension, thereby aggregating multi-scale channel information of different layers and effectively characterizing the multi-scale information of the convolution layers. The fusion features of multiple levels can complement each other, yielding a more comprehensive feature expression.
In some embodiments, boundary fusion is performed on the boundary map and each interactive feature in the plurality of interactive features, so as to obtain a plurality of boundary fusion features, including:
based on each interaction characteristic, a target attention head branch is adopted to learn target overall characteristics; wherein the target attention head branch is used for separating the target and the background on the whole based on the interaction characteristics;
based on the boundary graph and each interaction characteristic, boundary detail characteristics are learned by adopting a boundary attention head branch; the boundary attention head branches are used for capturing sparse local boundary information of the target based on the boundary map and the interaction characteristics;
splicing the output of each target attention head branch and the output of each boundary attention head branch corresponding to the target attention head branch to obtain a plurality of output splicing features, and adopting a convolution layer to promote feature fusion of the plurality of output splicing features to obtain a plurality of convolution fusion features;
and carrying out residual connection on each convolution fusion feature and each interaction feature corresponding to each convolution fusion feature to obtain a plurality of boundary fusion features.
In the embodiment, the boundary information in the boundary map is used as a guide, the characteristics of the boundary map are integrated with the interactive characteristics of different layers, the boundary characteristics are refined, the definition and the integrity of the boundary are ensured, and the discrimination of the fine foreground and the background boundary of the camouflage target is facilitated.
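As a rough illustration of this boundary-fusion step, the sketch below implements the described structure: a target head branch that works on the interaction feature alone, a boundary head branch that also receives the boundary map, concatenation of the two outputs, fusion by a 3×3 convolution, and a residual connection back to the interaction feature. The two heads are simplified convolutional stand-ins assumed for illustration; the patent's actual target and boundary attention heads (FIGS. 11-12) are more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BMTASketch(nn.Module):
    """Hedged sketch of the boundary-fusion step: a target head and a boundary head,
    concatenated, fused by a 3x3 conv, then residually added to the interaction feature.
    The heads below are simplified stand-ins, not the patented MTA/BAH internals."""
    def __init__(self, channels):
        super().__init__()
        self.target_head = nn.Conv2d(channels, channels, 3, padding=1)        # whole-target cues
        self.boundary_head = nn.Conv2d(channels + 1, channels, 3, padding=1)  # boundary map as extra channel
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, interaction_feat, boundary_map):
        bm = F.interpolate(boundary_map, size=interaction_feat.shape[2:],
                           mode="bilinear", align_corners=False)
        t = self.target_head(interaction_feat)                            # target/background separation branch
        b = self.boundary_head(torch.cat([interaction_feat, bm], dim=1))  # boundary-detail branch
        fused = self.fuse(torch.cat([t, b], dim=1))                       # splice outputs and fuse
        return fused + interaction_feat                                    # residual connection
```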
In some embodiments, segmenting a camouflage target image in a camouflage target image to be segmented corresponding to each boundary fusion feature based on a plurality of boundary fusion features, includes:
inputting a plurality of boundary fusion features into a convolution layer with a Sigmoid activation function to generate a plurality of prediction graphs;
based on each prediction graph, a camouflage target image in the camouflage target images to be segmented is segmented.
In this embodiment, based on the multiple boundary fusion features, the method has better expressive force on the segmentation of the camouflage target, and can improve the accuracy of the image segmentation of the camouflage target.
For ease of understanding by those skilled in the art, a set of preferred embodiments are provided below:
referring to fig. 2 to 3, the overall model of this embodiment comprises feature extraction, a local spatial detail and global context information fusion module (LGA module) for fusing local and global features, a feature interaction module (CFT module) for cross-layer feature interaction, a boundary guidance module (BGM module) for predicting the boundary map, a boundary-guided multi-convolution-head transposed attention module (BMTA module) for fusing boundary features, and a prediction layer for generating the final segmentation map. First, the camouflage target image to be segmented is fed into the parallel Swin Transformer V2 and Res2Net structure to extract multi-level features; features of the same resolution from the two branches are then sent into the LGA module to aggregate global and local information; the features output by the LGA modules of adjacent network layers are interactively fused and enhanced by the CFT module; the features a1 and a2 output by the LGA modules corresponding to the two shallow network layers are used as the input of the BGM module to generate a predicted Boundary Map (BM); the features output by the CFT modules and the boundary map are sent into the BMTA module to fuse the boundary information with the global information; finally, the fused features are sent into the prediction layer to generate prediction maps, and the camouflage target image in the camouflage target image to be segmented is segmented based on each prediction map. The prediction maps P1, P2 and P3 in fig. 2 to 3 go from coarse to fine: P1 is the coarsest, P3 is the clearest and most complete, and P3 is the final result in this embodiment. P1, P2, P3 and BM are supervised by loss functions to guide the model in optimizing its parameters and to improve the segmentation accuracy. The specific steps are as follows:
1. Feature extraction.
Multi-level features are extracted from the camouflage target image to be segmented using the Swin Transformer V2 and Res2Net dual-branch structure. The multi-head self-attention mechanism in Swin Transformer V2 can break through the receptive-field limitation of CNNs, model the contextual relationships pixel by pixel over the global scope, and assign larger weights to important features so that the feature expression is richer. Res2Net has a stronger and more effective multi-level feature extraction capability, refines features at a finer granularity, and highlights the difference between foreground and background. Instead of representing multi-level features in a layer-wise fashion, Res2Net replaces the 3×3 convolution layer with a series of smaller convolution groups. A comparison of the Res2Net module and the basic convolution module is shown in fig. 4, where (a) is the basic convolution module and (b) is the Res2Net module. After being processed by a 3×3 convolution layer, each group's output has a larger receptive field than the previous group's; this grouping strategy handles the feature map better, and the larger the scale dimension of the split, the richer the learned feature information.
For a given image I ∈ R^(3×H×W), where H and W denote the height and width of the input image and 3 denotes the RGB channels, and where Swin Transformer V2 and Res2Net each comprise four stages, the image is fed into Swin Transformer V2 and Res2Net respectively to generate multi-level feature maps from the four stages. The Swin Transformer V2 branch generates feature maps T_i (i = 1, 2, 3, 4), where T_1 has resolution H/4 × W/4 and T_4 has resolution H/32 × W/32; the Res2Net branch generates feature maps R_i (i = 1, 2, 3, 4), where R_1 has resolution H/4 × W/4 and R_4 has resolution H/32 × W/32. That is, the feature maps generated at the same stage of the two branches have the same spatial size (i.e., the same resolution).
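The stated resolutions can be checked with dummy tensors. The intermediate strides (8 and 16) and the channel counts below are assumptions for illustration only, since the text only fixes the H/4 × W/4 and H/32 × W/32 endpoints.

```python
import torch

# Illustration of the stated feature-map resolutions for an H x W input.
# Strides 8 and 16 for the middle stages and the channel counts are placeholders,
# not the patent's actual values.
H, W = 384, 384
strides = [4, 8, 16, 32]
channels = [96, 192, 384, 768]   # hypothetical
T = [torch.randn(1, c, H // s, W // s) for c, s in zip(channels, strides)]  # Swin Transformer V2 branch
R = [torch.randn(1, c, H // s, W // s) for c, s in zip(channels, strides)]  # Res2Net branch
for i, (t, r) in enumerate(zip(T, R), start=1):
    assert t.shape[2:] == r.shape[2:]   # same-stage maps share the same spatial size
    print(f"stage {i}: T{i} {tuple(t.shape)}  R{i} {tuple(r.shape)}")
```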
2. The local spatial detail is fused with global context information.
The feature maps obtained from Swin Transformer V2 and Res2Net are input into the LGA module, which considers both the global context and the local details of the image to identify the overall trend of the image, effectively supplementing the feature extraction capability of the two backbone branch networks. The global information is used to extract important global features for roughly estimating the position of the target object, the local information is used to extract fine-grained features of the object, and the local and global features complement each other to provide comprehensive feature information for accurate camouflage object segmentation (COS).
The LGA module applies both a residual channel attention mechanism and a spatial attention mechanism. The structure of the residual channel attention mechanism is shown in fig. 5. For an input feature map X ∈ R^(C×H×W), where C, H and W denote the number of channels, the height and the width respectively, the residual channel attention mechanism first obtains a 1×1×C feature map through global average pooling to capture globally important feature information, then compresses the number of channels through downsampling (implemented by a 1×1 convolution) and restores it to the original channel number C through upsampling (implemented by a 1×1 convolution), obtaining a weight coefficient for each channel; the weights are multiplied with the original feature X to obtain a more discriminative feature map. Residual channel attention assigns larger weights to important channels, enhancing the global features along the channel dimension.
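A minimal sketch of the residual channel attention described above (global average pooling, 1×1 channel squeeze and restore, per-channel reweighting) might look as follows; the reduction ratio and the ReLU between the two 1×1 convolutions are assumptions, and the residual connection with Ft is applied later in the LGA module.

```python
import torch
import torch.nn as nn

class ResidualChannelAttention(nn.Module):
    """Sketch of the described residual channel attention: global average pooling,
    channel squeeze/restore via 1x1 convolutions, sigmoid weights, reweighting.
    The reduction ratio is an assumption; the residual with the input is added in the LGA module."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                        # -> 1 x 1 x C global descriptor
        self.down = nn.Conv2d(channels, channels // reduction, 1)  # compress the channel number
        self.up = nn.Conv2d(channels // reduction, channels, 1)    # restore to C channels
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        w = torch.sigmoid(self.up(self.act(self.down(self.pool(x)))))  # per-channel weights
        return x * w                                                    # reweight the input feature
```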
The structure of the spatial attention mechanism is shown in fig. 6. For an input feature map X ∈ R^(C×H×W), the spatial attention mechanism compresses the channel dimension by applying max pooling and average pooling along the channel dimension, obtaining feature maps F_max ∈ R^(1×H×W) and F_avg ∈ R^(1×H×W); F_max and F_avg are then concatenated, and a convolution operation followed by a Sigmoid activation function is applied to the concatenated feature map to generate a spatial attention map F_s ∈ R^(1×H×W). Multiplying the spatial attention map F_s with the input feature map X assigns larger weights to important spatial locations, thereby enhancing the local detail information in the spatial domain.
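The spatial attention branch can be sketched the same way: channel-wise max and average pooling, concatenation, a convolution plus Sigmoid to produce F_s, and multiplication with the input. The 7×7 kernel size is an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the described spatial attention: channel-wise max and average pooling,
    concatenation, a convolution plus sigmoid producing a 1 x H x W attention map,
    which is multiplied with the input. The kernel size is an assumption."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        f_max, _ = x.max(dim=1, keepdim=True)   # F_max: 1 x H x W
        f_avg = x.mean(dim=1, keepdim=True)     # F_avg: 1 x H x W
        f_s = torch.sigmoid(self.conv(torch.cat([f_max, f_avg], dim=1)))  # spatial attention map
        return x * f_s                          # emphasize informative spatial locations
```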
The structure of the LGA module is shown in fig. 7. Features with the same resolution extracted from the CNN branch (Res2Net is a CNN) and the Transformer branch (Swin Transformer V2 is a Transformer variant), e.g., T_1 and R_1, are sent into the LGA module. The feature Fc from the CNN is sent into the spatial attention (SA) branch to further enhance the local features extracted by the CNN and suppress irrelevant regions; the feature Ft from the Transformer is sent into the residual channel attention (RCA) branch to enhance the global context features extracted by the Transformer.
To make the RCA branch focus on learning globally important features, Ft is residually connected with the features passing through the RCA. For the SA branch, in order to reduce the computation of the model, a convolution operation is first applied to reduce the channel dimension, and the convolved result is then residually connected with the output of the SA so that the SA branch concentrates on learning spatial features. The outputs of the two branches are then concatenated to integrate the global position information and local detail information, and feature fusion is promoted by a 3×3 convolution layer, thereby adaptively integrating local features and global dependencies.
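Putting the two branches together, a hedged sketch of the LGA fusion, reusing the RCA and SA sketches above, could read as follows: Ft passes through RCA with a residual connection, Fc is channel-reduced by a 1×1 convolution and passes through SA with a residual connection, and the two outputs are concatenated and fused by a 3×3 convolution. The channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class LGASketch(nn.Module):
    """Sketch of the described LGA fusion, reusing ResidualChannelAttention and
    SpatialAttention from the preceding sketches. Channel sizes are assumptions."""
    def __init__(self, t_channels, c_channels, out_channels):
        super().__init__()
        self.rca = ResidualChannelAttention(t_channels)
        self.reduce = nn.Conv2d(c_channels, out_channels, 1)  # reduce CNN channels to cut computation
        self.sa = SpatialAttention()
        self.fuse = nn.Conv2d(t_channels + out_channels, out_channels, 3, padding=1)

    def forward(self, ft, fc):
        g = ft + self.rca(ft)            # global branch (RCA) with residual connection
        l = self.reduce(fc)
        l = l + self.sa(l)               # local branch (SA) with residual connection
        return self.fuse(torch.cat([g, l], dim=1))   # splice and fuse global + local features
```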
3. Boundary map generation.
Shallow feature layers (e.g. T_1 and T_2, R_1 and R_2) preserve more of the object's edge and spatial information, while deep convolutional layers (e.g. T_3 and T_4, R_3 and R_4) retain more semantic information. Therefore, the Boundary Map (BM) is generated by using the shallow features (a1 and a2 in fig. 3) as the input of the BGM module; the BM is upsampled to the same spatial size as the input image, and the generated boundary map is measured by the following binary cross entropy loss function.
$$L_{BM}=-\frac{1}{N}\sum_{j=1}^{N}\Big[G^{bm}_{i}(j)\log B_{i}(j)+\big(1-G^{bm}_{i}(j)\big)\log\big(1-B_{i}(j)\big)\Big]$$

wherein the sum runs over the N pixels, $G^{bm}_{i}$ denotes the boundary map ground truth of the i-th image, and $B_{i}$ denotes the predicted boundary map of the i-th image.
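Following claim 4 (convolve the two shallow fusion features, add them, then refine with further convolutions), a boundary-guidance sketch and its BCE supervision might look as follows; all module names, intermediate channel counts and the bilinear upsampling mode are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryGuidanceModule(nn.Module):
    """Produce a boundary map from the two shallow fusion features a1 and a2 (illustrative sketch)."""
    def __init__(self, c1: int, c2: int, mid: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(c1, mid, 3, padding=1)
        self.conv2 = nn.Conv2d(c2, mid, 3, padding=1)
        self.head = nn.Sequential(
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 3, padding=1),
        )

    def forward(self, a1, a2, image_size):
        f1 = self.conv1(a1)
        f2 = F.interpolate(self.conv2(a2), size=f1.shape[-2:], mode="bilinear", align_corners=False)
        bm = self.head(f1 + f2)                                    # add, then refine with convolutions
        bm = F.interpolate(bm, size=image_size, mode="bilinear", align_corners=False)
        return torch.sigmoid(bm)                                   # boundary map in [0, 1]

# supervision of the boundary map with binary cross entropy:
# loss_bm = F.binary_cross_entropy(pred_bm, gt_bm)
```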
4. Cross-layer feature interactions.
The CFT module introduces a multi-scale channel attention (MS-CA) mechanism to realize efficient interaction of cross-layer features and cope with the variation of target size in COS. The MS-CA mechanism adapts well to targets of different scales; its structure is shown in figure 8. MS-CA is based on a two-branch structure: one branch uses global average pooling to obtain global features and allocates more attention to large-scale objects, while the other branch uses point convolutions to obtain fine-grained local details, making it easier to capture the features of small objects. Unlike other multi-scale attention mechanisms, MS-CA uses point convolutions in both branches to compress and recover the channel dimension, thereby aggregating multi-scale channel information from different layers and effectively characterizing the multi-scale information of the convolutional layers.
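A compact sketch of such a two-branch MS-CA block is shown below; the class name, reduction ratio and ReLU placement are assumptions, and only the global-pooled branch plus point-wise local branch structure follows the description above.

```python
import torch
import torch.nn as nn

class MSCA(nn.Module):
    """Multi-scale channel attention: a global pooled branch plus a point-wise local branch (sketch)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.global_branch = nn.Sequential(           # attends more to large objects
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
        )
        self.local_branch = nn.Sequential(             # keeps fine-grained local detail
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.global_branch(x) + self.local_branch(x)   # broadcast add over H x W
```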
The overall structure of the feature interaction module is shown in FIG. 9. The features a_h and a_l of adjacent layers are first added and then sent into MS-CA to obtain multi-scale channel information, after which a Sigmoid activation yields the normalized feature map F_s. F_s and 1-F_s (corresponding to the dashed arrows in FIG. 9) are multiplied with a_l and a_h respectively to enhance the feature representation. To preserve the original information of each feature, the original feature and the enhanced feature are connected by a residual; the two branches are then combined by addition, and fusion is promoted with a 3×3 convolution layer, giving the output F_e of the CFT module.
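Putting the steps above together, a CFT sketch could read as follows; it reuses the MSCA sketch and assumes the adjacent-layer features have already been aligned to the same shape.

```python
import torch
import torch.nn as nn

class CFTModule(nn.Module):
    """Cross-layer feature interaction guided by MS-CA (sketch; channels and sizes assumed aligned)."""
    def __init__(self, channels: int):
        super().__init__()
        self.msca = MSCA(channels)                        # MS-CA sketch above
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, a_h: torch.Tensor, a_l: torch.Tensor) -> torch.Tensor:
        f_s = torch.sigmoid(self.msca(a_h + a_l))         # normalized attention map F_s
        enh_l = a_l + a_l * f_s                           # enhance a_l, keep original via residual
        enh_h = a_h + a_h * (1.0 - f_s)                   # enhance a_h, keep original via residual
        return self.fuse(enh_l + enh_h)                   # combine branches, fuse with 3x3 conv -> F_e
```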
The feature interaction module aggregates cross-layer image features with different levels and receptive fields by introducing an MS-CA mechanism to provide rich multi-level context information, and the multi-level features interact to generate more effective and different image information, so that the model can adaptively divide targets with different sizes.
5. Boundary-guided multi-convolution head transposed attention.
The BMTA module effectively combines the local details of the predicted boundary map with the global information of the target through multi-head attention. The BMTA module realizes the interaction of local and non-local pixels based on multi-convolution head transposed attention (MTA); the structure of MTA is shown in fig. 10.
The input feature map is first normalized (layer normalization) and then passed through three parallel 1×1 convolutions followed by 3×3 depth-wise convolutions to generate the query (Q), key (K) and value (V). The transpose of Q is matrix-multiplied with K to generate an Attention Map (AM), and the transpose of AM is then matrix-multiplied with V to generate a new feature map:

$$\mathrm{AM}=\mathrm{Softmax}\big(Q^{\top}K\big),\qquad X_{out}=V\cdot \mathrm{AM}^{\top}$$

where Q, K and V are reshaped so that the attention is computed across the channel dimension.
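A single-head PyTorch sketch of this transposed (channel-wise) attention is given below. The GroupNorm stand-in for layer normalization, the feature normalization of Q and K, and the single-head simplification are our assumptions; only the 1×1 + depth-wise convolutions and the C×C attention follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTA(nn.Module):
    """Single-head transposed attention over channels (an illustrative sketch of MTA)."""
    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)              # stands in for layer normalization
        self.qkv = nn.Conv2d(channels, channels * 3, 1)    # the three 1x1 convolutions, fused
        self.dw = nn.Conv2d(channels * 3, channels * 3, 3, padding=1, groups=channels * 3)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.dw(self.qkv(self.norm(x))).chunk(3, dim=1)
        q, k, v = q.flatten(2), k.flatten(2), v.flatten(2)       # B x C x (H*W)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)    # scale stabilization (assumption)
        am = torch.softmax(q @ k.transpose(-2, -1), dim=-1)      # B x C x C attention map
        out = (am @ v).reshape(b, c, h, w)                       # apply channel attention to V
        return self.proj(out)
```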
the structure of the BMTA module is shown in fig. 11. The input feature map is fed into a Boundary Attention Head (BAH) branch and a target attention head (OAH) branch, respectively, for learning of different features. BAH branches are merged into a boundary map, boundary priori is provided, so that the model can learn edge detail characteristics better, and OAH learns overall characteristics of the target.
The structure and calculation of OAH are identical to MTA, realizing cross-channel global feature extraction and separating the target from the background as a whole.
BAH introduces a predicted binary Boundary Map (BM) into the MTA, learning a representation of boundary enhancement to effectively capture important sparse local boundary information of the object. The structure of the BAH is shown in FIG. 12.
Specifically, the Q and K obtained by the convolution operations are multiplied with BM to obtain Q_b and K_b; Q_b and K_b are then multiplied to produce an attention matrix that carries boundary information. The calculation of BAH can be written as

$$Q_{b}=Q\odot BM,\quad K_{b}=K\odot BM,\quad \mathrm{AM}_{b}=\mathrm{Softmax}\big(Q_{b}^{\top}K_{b}\big),\quad X_{out}=V\cdot \mathrm{AM}_{b}^{\top}$$

V is a matrix without boundary information, so features are refined by establishing pairwise relationships at the boundaries, ensuring the sharpness and integrity of the boundaries.
The BMTA module finally concatenates the outputs of OAH and BAH and feeds them into a 3×3 convolution to fuse the boundary and overall features; to avoid feature degradation, a residual connection adds the original feature to the fused feature. By taking boundary information as guidance, BMTA integrates the features of the boundary map with feature representations at different levels, helping the model distinguish the fine foreground and background boundaries of camouflaged objects.
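Combining the two heads, the module-level fusion could be sketched as follows, reusing the MTA and BoundaryAttentionHead sketches above.

```python
import torch
import torch.nn as nn

class BMTAModule(nn.Module):
    """Boundary-guided multi-head transposed attention: OAH + BAH, fused by a 3x3 conv (sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.oah = MTA(channels)                        # target attention head (MTA sketch above)
        self.bah = BoundaryAttentionHead(channels)      # boundary attention head sketch above
        self.fuse = nn.Conv2d(channels * 2, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, bm: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([self.oah(x), self.bah(x, bm)], dim=1))
        return x + fused                                 # residual connection against feature degradation
```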
6. Prediction map generation.
The prediction maps are generated by feeding the features output by BMTA into a 3×3 convolutional layer followed by a Sigmoid. The prediction map P_i (i = 1, 2, 3) of each sub-layer is supervised by both BCE and IoU loss functions to optimize the parameters of the whole model. The model of this embodiment adopts multiple supervision, which gives it better expressive power for camouflaged target segmentation; the overall loss function of the model is as follows:
$$L_{total}=\sum_{i=1}^{3}\Big[L_{BCE}\big(P_{i},G\big)+L_{IoU}\big(P_{i},G\big)\Big]$$

wherein $P$ denotes a prediction map, $G$ denotes the ground truth of the image, and $P_{i}$ denotes the i-th prediction map.
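A sketch of this supervision is shown below; the soft-IoU formulation, the smoothing constant and the equal weighting of the BCE and IoU terms are assumptions, and the prediction maps are assumed to already lie in [0, 1] after the Sigmoid.

```python
import torch
import torch.nn.functional as F

def bce_iou_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Per-map supervision with BCE plus a soft IoU term (illustrative sketch)."""
    bce = F.binary_cross_entropy(pred, gt)
    inter = (pred * gt).sum(dim=(2, 3))
    union = (pred + gt - pred * gt).sum(dim=(2, 3))
    iou = 1.0 - (inter + 1.0) / (union + 1.0)            # soft IoU loss with smoothing
    return bce + iou.mean()

def total_loss(preds, gt):
    """Sum the BCE + IoU supervision over the three side-output prediction maps P_1..P_3."""
    return sum(bce_iou_loss(p, gt) for p in preds)
```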
Referring to fig. 13, the embodiment of the present invention further provides a camouflage target image segmentation system based on multi-level feature fusion, where the camouflage target image segmentation system based on multi-level feature fusion includes a data acquisition unit 100, a feature extraction unit 200, a feature fusion unit 300, a boundary guiding unit 400, a feature interaction unit 500, a boundary fusion unit 600, and an image segmentation unit 700, where:
a data acquisition unit 100 for acquiring a camouflage target image to be segmented;
the feature extraction unit 200 is configured to perform multi-level feature extraction on a camouflage target image to be segmented by adopting different network layers through a first branch network and a second branch network, so as to obtain a first feature map of multiple layers output by the first branch network and a second feature map of multiple layers output by the second branch network;
the feature fusion unit 300 is configured to perform global feature enhancement on the first feature map of each level, and obtain global features after multiple levels of enhancement; carrying out local feature enhancement on the second feature map of each level to obtain local features after multiple levels of enhancement; carrying out feature fusion on the reinforced local features of each level and the reinforced global features of the same level to obtain fusion features of multiple levels;
The boundary guiding unit 400 is configured to perform boundary guiding on the fusion features of two shallow network layers in the fusion features of multiple layers, so as to obtain a boundary map;
the feature interaction unit 500 is configured to perform feature interaction on the fusion features of the adjacent network layers in the multiple layers of fusion features to obtain multiple interaction features;
the boundary fusion unit 600 is configured to perform boundary fusion on the boundary map and each of the plurality of interaction features, so as to obtain a plurality of boundary fusion features;
the image segmentation unit 700 is configured to segment a camouflage target image in the camouflage target images to be segmented corresponding to each boundary fusion feature based on the plurality of boundary fusion features.
It should be noted that, since the camouflage target image segmentation system based on the multi-level feature fusion in the present embodiment and the above camouflage target image segmentation method based on the multi-level feature fusion are based on the same inventive concept, the corresponding content in the method embodiment is also applicable to the present system embodiment, and will not be described in detail here.
The embodiment of the application also provides electronic equipment, which comprises: at least one memory, at least one processor, at least one computer program stored in the at least one memory, the at least one processor executing the at least one computer program to implement any of the above embodiments of a camouflage target image segmentation method based on multi-level feature fusion. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 14, fig. 14 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 810 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, etc., and is used for executing related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 820 may be implemented in the form of Read-Only Memory (ROM), static storage, dynamic storage, or Random Access Memory (RAM). The memory 820 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present disclosure are implemented by software or firmware, the relevant program codes are stored in the memory 820, and the processor 810 invokes the camouflage target image segmentation method based on multi-level feature fusion to execute the embodiments of the present disclosure;
an input/output interface 830 for implementing information input and output;
the communication interface 840 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);
Bus 850 transfers information between the various components of the device (e.g., processor 810, memory 820, input/output interface 830, and communication interface 840);
wherein processor 810, memory 820, input/output interface 830, and communication interface 840 enable communication connections among each other within the device via bus 850.
The embodiment of the application also provides a storage medium which is a computer readable storage medium, wherein the computer readable storage medium is stored with a computer program, and the computer program is used for enabling a computer to execute the camouflage target image segmentation method based on multi-level feature fusion.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage, flash memory, or other non-transitory solid state storage. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solution shown in fig. 1 is not limiting of the embodiments of the application and may include more or fewer steps than shown, or certain steps may be combined, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
The foregoing description of the preferred embodiments of the present application has been presented with reference to the drawings and is not intended to limit the scope of the claims. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (9)

1. A camouflage target image segmentation method based on multi-level feature fusion, characterized in that the camouflage target image segmentation based on multi-level feature fusion comprises:
acquiring a camouflage target image to be segmented;
performing multi-level feature extraction on the camouflage target image to be segmented through a first branch network and a second branch network using different network layers, to obtain first feature maps of multiple levels output by the first branch network and second feature maps of multiple levels output by the second branch network;
performing global feature enhancement on the first feature map of each level to obtain enhanced global features of multiple levels; performing local feature enhancement on the second feature map of each level to obtain enhanced local features of multiple levels; and performing feature fusion on the enhanced local features of each level and the enhanced global features of the same level to obtain fusion features of multiple levels;
performing boundary guidance on the fusion features of two shallow network layers among the fusion features of multiple levels to obtain a boundary map;
performing feature interaction on the fusion features of adjacent network layers among the fusion features of multiple levels to obtain multiple interaction features, specifically:
introducing a multi-scale channel attention mechanism into the feature interaction module, and adding the fusion features of adjacent network layers among the fusion features of multiple levels to obtain multiple added features;
inputting each of the added features into the multi-scale channel attention mechanism to obtain multiple multi-scale channel features;
applying an activation function to each of the multi-scale channel features to obtain multiple normalized features, and subtracting each of the normalized features from 1 to obtain multiple normalized difference features;
performing feature enhancement on the multiple normalized features and the multiple normalized difference features to obtain multiple enhanced normalized features and multiple enhanced normalized difference features;
performing residual connection on each of the enhanced normalized features and its corresponding fusion feature to obtain multiple first residual features, and convolving each of the first residual features to obtain multiple first convolution features;
performing residual connection on each of the enhanced normalized difference features and its corresponding fusion feature to obtain multiple second residual features, and convolving each of the second residual features to obtain multiple second convolution features;
adding each of the first convolution features to its corresponding second convolution feature to obtain multiple added convolution features, and applying a convolution layer to the multiple added convolution features to promote fusion, to obtain multiple interaction features;
performing boundary fusion on the boundary map with each of the multiple interaction features to obtain multiple boundary fusion features;
based on the multiple boundary fusion features, segmenting the camouflage target image in the camouflage target image to be segmented corresponding to each of the boundary fusion features.
2. The camouflage target image segmentation method based on multi-level feature fusion according to claim 1, characterized in that the performing multi-level feature extraction on the camouflage target image to be segmented through a first branch network and a second branch network using different network layers, to obtain first feature maps of multiple levels output by the first branch network and second feature maps of multiple levels output by the second branch network, comprises:
using different network layers of the first branch network to perform feature extraction on the global context information of the camouflage target image to be segmented, to obtain the first feature maps of multiple levels output by the first branch network;
using different network layers of the second branch network to perform feature extraction on the local detail information of the camouflage target image to be segmented, to obtain the second feature maps of multiple levels output by the second branch network.
3. The camouflage target image segmentation method based on multi-level feature fusion according to claim 1, characterized in that the performing global feature enhancement on the first feature map of each level, performing local feature enhancement on the second feature map of each level, and performing feature fusion on the enhanced local features of each level and the enhanced global features of the same level to obtain fusion features of multiple levels, comprises:
using a residual channel attention mechanism to perform global feature enhancement on the first feature map of each level, to obtain enhanced global features of multiple levels;
using a spatial attention mechanism to perform local feature enhancement on the second feature map of each level, to obtain enhanced local features of multiple levels;
concatenating the enhanced local features of each level with the enhanced global features of the same level to obtain concatenated features of multiple levels, and applying a convolution layer to the concatenated features of multiple levels to promote feature fusion, to obtain fusion features of multiple levels.
4. The camouflage target image segmentation method based on multi-level feature fusion according to claim 1, characterized in that the performing boundary guidance on the fusion features of two shallow network layers among the fusion features of multiple levels to obtain a boundary map comprises:
convolving the fusion features of two shallow network layers among the fusion features of multiple levels to obtain a first convolution feature and a second convolution feature;
adding the first convolution feature and the second convolution feature to obtain an added feature, and applying multiple convolution layers to the added feature for boundary guidance to obtain the boundary map.
5. The camouflage target image segmentation method based on multi-level feature fusion according to claim 1, characterized in that the performing boundary fusion on the boundary map with each of the multiple interaction features to obtain multiple boundary fusion features comprises:
based on each of the interaction features, using a target attention head branch to learn the overall features of the target, wherein the target attention head branch is used to separate the target from the background as a whole based on the interaction features;
based on the boundary map and each of the interaction features, using a boundary attention head branch to learn boundary detail features, wherein the boundary attention head branch is used to capture the sparse local boundary information of the target based on the boundary map and the interaction features;
concatenating the output of each target attention head branch with the output of its corresponding boundary attention head branch to obtain multiple output concatenated features, and applying a convolution layer to the multiple output concatenated features to promote feature fusion, to obtain multiple convolution fusion features;
performing residual connection on each convolution fusion feature and its corresponding interaction feature to obtain multiple boundary fusion features.
6. The camouflage target image segmentation method based on multi-level feature fusion according to claim 1, characterized in that the segmenting, based on the multiple boundary fusion features, the camouflage target image in the camouflage target image to be segmented corresponding to each of the boundary fusion features comprises:
inputting the multiple boundary fusion features into a convolutional layer with a Sigmoid activation function to generate multiple prediction maps;
based on each of the prediction maps, segmenting the camouflage target image in the camouflage target image to be segmented.
7. A camouflage target image segmentation system based on multi-level feature fusion, characterized in that the camouflage target image segmentation system based on multi-level feature fusion comprises:
a data acquisition unit, configured to acquire a camouflage target image to be segmented;
a feature extraction unit, configured to perform multi-level feature extraction on the camouflage target image to be segmented through a first branch network and a second branch network using different network layers, to obtain first feature maps of multiple levels output by the first branch network and second feature maps of multiple levels output by the second branch network;
a feature fusion unit, configured to perform global feature enhancement on the first feature map of each level to obtain enhanced global features of multiple levels, perform local feature enhancement on the second feature map of each level to obtain enhanced local features of multiple levels, and perform feature fusion on the enhanced local features of each level and the enhanced global features of the same level to obtain fusion features of multiple levels;
a boundary guidance unit, configured to perform boundary guidance on the fusion features of two shallow network layers among the fusion features of multiple levels to obtain a boundary map;
a feature interaction unit, configured to perform feature interaction on the fusion features of adjacent network layers among the fusion features of multiple levels to obtain multiple interaction features, specifically: introducing a multi-scale channel attention mechanism into the feature interaction module, and adding the fusion features of adjacent network layers among the fusion features of multiple levels to obtain multiple added features; inputting each of the added features into the multi-scale channel attention mechanism to obtain multiple multi-scale channel features; applying an activation function to each of the multi-scale channel features to obtain multiple normalized features, and subtracting each of the normalized features from 1 to obtain multiple normalized difference features; performing feature enhancement on the multiple normalized features and the multiple normalized difference features to obtain multiple enhanced normalized features and multiple enhanced normalized difference features; performing residual connection on each of the enhanced normalized features and its corresponding fusion feature to obtain multiple first residual features, and convolving each of the first residual features to obtain multiple first convolution features; performing residual connection on each of the enhanced normalized difference features and its corresponding fusion feature to obtain multiple second residual features, and convolving each of the second residual features to obtain multiple second convolution features; adding each of the first convolution features to its corresponding second convolution feature to obtain multiple added convolution features, and applying a convolution layer to the multiple added convolution features to promote fusion, to obtain multiple interaction features;
a boundary fusion unit, configured to perform boundary fusion on the boundary map with each of the multiple interaction features to obtain multiple boundary fusion features;
an image segmentation unit, configured to segment, based on the multiple boundary fusion features, the camouflage target image in the camouflage target image to be segmented corresponding to each of the boundary fusion features.
8. An electronic device, characterized in that it comprises:
at least one memory;
at least one processor;
at least one computer program;
the at least one computer program being stored in the at least one memory, and the at least one processor executing the at least one computer program to implement:
the camouflage target image segmentation method based on multi-level feature fusion according to any one of claims 1 to 6.
9. A storage medium, the storage medium being a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for causing a computer to execute:
the camouflage target image segmentation method based on multi-level feature fusion according to any one of claims 1 to 6.
CN202310982262.1A 2023-08-07 2023-08-07 A camouflage target image segmentation method and system based on multi-level feature fusion Active CN116703950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310982262.1A CN116703950B (en) 2023-08-07 2023-08-07 A camouflage target image segmentation method and system based on multi-level feature fusion

Publications (2)

Publication Number Publication Date
CN116703950A CN116703950A (en) 2023-09-05
CN116703950B true CN116703950B (en) 2023-10-20

Family

ID=87843689

Country Status (1)

Country Link
CN (1) CN116703950B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119206218B (en) * 2024-09-09 2025-07-15 沈阳工业大学 EdgeAttenNet glomerulus image accurate segmentation system and method based on camouflaged target detection
CN119992274A (en) * 2025-04-11 2025-05-13 南开大学 Method and device for detecting camouflaged objects
CN120374477B (en) * 2025-04-11 2025-11-18 安庆师范大学 Dark light field image enhancement system and method for drama scene

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008047369A (en) * 2006-08-11 2008-02-28 Furukawa Battery Co Ltd:The Method of manufacturing lead storage battery
US11222217B1 (en) * 2020-08-14 2022-01-11 Tsinghua University Detection method using fusion network based on attention mechanism, and terminal device
CN114565770A (en) * 2022-03-23 2022-05-31 中南大学 Image segmentation method and system based on edge-assisted computation and mask attention
CN114581752A (en) * 2022-05-09 2022-06-03 华北理工大学 A camouflaged target detection method based on context awareness and boundary refinement
CN114943963A (en) * 2022-04-29 2022-08-26 南京信息工程大学 Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN115471774A (en) * 2022-09-19 2022-12-13 中南大学 Video time domain action segmentation method based on audio and video bimodal feature fusion
WO2023024577A1 (en) * 2021-08-27 2023-03-02 之江实验室 Edge computing-oriented reparameterization neural network architecture search method
CN116228702A (en) * 2023-02-23 2023-06-06 南京邮电大学 A Camouflaged Object Detection Method Based on Attention Mechanism and Convolutional Neural Network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750140B (en) * 2021-01-21 2022-10-14 大连理工大学 Image segmentation method of camouflage target based on information mining


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DeepCut: Object Segmentation from Bounding Box Annotations using Convolutional Neural Networks;Martin Rajchl, et al.;arXiv:1605.07866v2;1-10 *
Mask-and-Edge Co-Guided Separable Network for Camouflaged Object Detection;Jiesheng Wu,et al.;IEEE Signal Processing Letters;748 - 752 *
Research Progress on Camouflaged Object Detection; Zhang Dongdong et al.; Laser Journal; 1-18 *
Digital Camouflage Generation Method Based on Cycle-Consistency Adversarial Networks; Teng Xu; Zhang Hui; Yang Chunming; Zhao Xujian; Li Bo; Journal of Computer Applications (02); 179-190 *
Building Segmentation in Remote Sensing Images with Multi-Scale Feature Fusion Dilated-Convolution ResNet; Xu Shengjun; Ouyang Puyan; Guo Xueyuan; Taha Muthar Khan; Duan Zhongxing; Optics and Precision Engineering (07); 262-266 *

Also Published As

Publication number Publication date
CN116703950A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN116703950B (en) A camouflage target image segmentation method and system based on multi-level feature fusion
Ullah et al. A brief survey of visual saliency detection
CN111444826B (en) Video detection method, device, storage medium and computer equipment
WO2022247147A1 (en) Methods and systems for posture prediction
CN114445633B (en) Image processing method, device and computer readable storage medium
CN108491848B (en) Image saliency detection method and device based on depth information
CN111583173B (en) A saliency target detection method in RGB-D images
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN112200818B (en) Dressing region segmentation and dressing replacement method, device and equipment based on image
CN112381828A (en) Positioning method, device, medium and equipment based on semantic and depth information
CN114693624A (en) Image detection method, device and equipment and readable storage medium
CN115984093A (en) Depth estimation method based on infrared image, electronic device and storage medium
CN116189281A (en) End-to-end human behavior classification method and system based on spatio-temporal adaptive fusion
CN111382654B (en) Image processing method and device and storage medium
WO2024199155A1 (en) Three-dimensional semantic scene completion method, device, and medium
CN116740479B (en) Three-dimensional attention-based camouflage target detection method for multi-scale context and multi-level feature interaction
CN114792315B (en) Medical image visual model training method and device, electronic equipment and storage medium
CN114842466B (en) Object detection method, computer program product and electronic device
CN115880350B (en) Image processing methods, devices, systems and computer-readable storage media
CN118196738A (en) Lane line detection method and device, electronic equipment and storage medium
CN117315791B (en) Skeleton action recognition method, equipment and storage medium
CN116957999A (en) Depth map optimization method, device, equipment and storage medium
US12322123B2 (en) Visual object detection in a sequence of images
CN114283290A (en) Training of image processing model, image processing method, device, equipment and medium
CN115222981A (en) Dishes identification method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant