WO2025103079A1 - Method for fusing infrared light and visible light images - Google Patents
Method for fusing infrared light and visible light images
- Publication number
- WO2025103079A1 (PCT/CN2024/126006)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- spatial
- fusion
- network
- information
- visible light
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/86—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
Definitions
- the present invention relates to the technical field of image fusion, and in particular to a method for fusing infrared light and visible light images.
- Image fusion is an important image enhancement technology that can extract meaningful information from different source images and fuse them into new images.
- the fused images are usually robust, rich in information, and can present more complex and detailed scene representations. They not only reduce data redundancy, but also support the development and decision-making of subsequent applications. Fusion has been widely used in the preprocessing modules of advanced visual tasks and has good application prospects in object detection, object tracking, semantic segmentation, etc.
- the use of spatial and channel attention mechanisms can amplify useful information in the image and suppress the interference of harmful information by assigning different weights, while enriching the semantic information and spatial information of the image to promote the application of subsequent high-level visual tasks.
- the NestFuse network is a pioneering work in the field of image fusion.
- in the spatial attention model, the deep features are processed by the l1 norm and a soft-max operator to obtain enhanced deep features.
- in the channel attention model, the channel fusion features are obtained by global pooling and a soft-max operator; finally, the spatial features and channel features are added to obtain the final fused features.
- the purpose of the present invention is to provide a method for fusing infrared light and visible light images to solve the technical problems raised in the background technology.
- the present invention provides the following technical solution: a method for fusing infrared light and visible light images, characterized in that it comprises at least the following steps:
- S1: Build a fusion network, using an end-to-end CNN fusion network as the basic framework.
- the end-to-end CNN network has powerful feature extraction and learning capabilities;
- S3: Build a spatial channel attention module to process the input deep features; it amplifies useful information, suppresses interference from useless information, and facilitates the subsequent semantic segmentation task;
- the application of the fusion network in S1 at least includes the following steps:
- the infrared and visible light images are each sent to the feature extraction module to extract their respective deep features;
- the extracted deep features are then concatenated and sent to the spatial channel attention mechanism to further extract features and suppress the interference of useless information;
- the fused image is generated through the feature reconstruction module.
- the branches in the gradient aggregated residual dense block in S2 perform a set of transformations, each transformation on a low-dimensional embedding, whose outputs are aggregated by summing.
- the gradient aggregated residual dense block in S2 sets the cardinality of the aggregated transformation to 2, and is composed of at least one residual block and one residual dense block, and each gradient aggregated residual dense block contains three branches to improve the diversity of extracted features so that it can make full use of the deep features extracted by each convolutional layer in the block;
- the gradient aggregation residual dense block also integrates the Laplacian operator and the Sobel operator to retain more coarse textures and fine textures in the image.
- the spatial channel attention module in S3 at least includes multi-scale feature extraction, spatial channel attention mechanism and pooling fusion block.
- the application of the spatial channel attention module in S3 at least comprises the following steps:
- the channel feature map is upsampled to restore it to its original size.
- the present invention uses a convolution operation to control the number of channels of the spatial feature map;
- the generated spatial and channel attention feature maps are fused using the pooling fusion block to generate a fused feature map that satisfies both the intra-class similarity and inter-class difference requirements.
- the application of the segmentation network in S4 comprises at least the following steps:
- a real-time segmentation model Bilateral attention decoder is introduced to segment the fused image, and the segmentation network outputs the segmentation result and auxiliary segmentation result;
- the gap between the segmentation result and the semantic label reflects the richness of the semantic information contained in the fused image
- Semantic loss is used to guide the training of the fusion network through back-propagation, forcing the fused image to contain more semantic information.
- the present invention has the following beneficial effects:
- the present invention is designed based on the spatial channel attention mechanism and the gradient aggregation residual dense block, combining the advantages of ResNeXt and DenseNet while integrating Sobel and Laplacian operators to retain the strong texture and weak texture of the features, and by introducing the spatial and channel attention mechanism, the channel information and spatial information of the feature map are refined to improve the information capture ability of the feature map, and a pooling fusion block is used to fuse the refined spatial and channel feature maps to obtain high-quality fused features; the experimental results of the present invention show that the proposed method performs better than the most advanced methods on the MSRS dataset, highlighting the potential research direction in the future to improve the accuracy of the fused image and promote the development of advanced visual tasks.
- Figure 1 is a spatial attention model under the prior art
- Figure 2 is a channel attention model under the prior art
- FIG3 is a schematic diagram of the overall network framework of SCGRFuse of the present invention.
- FIG5 is a schematic diagram of a GRXDB module of the present invention.
- FIG6 is a schematic diagram of a multi-scale spatial attention module of the present invention.
- FIG7 is a schematic diagram showing a qualitative comparison of SCGRFuse of the present invention and nine most advanced methods on the MSRS dataset;
- FIG8 is a schematic diagram of the process of generating channel attention and spatial attention masks according to the present invention.
- referring to FIG. 1 and FIG. 2, the spatial attention model and the channel attention model of the prior art are shown.
- Embodiment 1:
- the design of SCGRFuse is a combination of image fusion and semantic segmentation tasks. It first uses the fusion network to obtain the fusion result of the input infrared and visible light images, and uses the content loss to guide the training of the entire fusion network. The fused image is then sent to the semantic segmentation network, and more semantic information is integrated into the fused image through the semantic loss generated by the segmentation network, so as to optimize the entire fused image.
- the overall architecture of the network is shown in Figure 3.
- the present invention uses an end-to-end fusion network based on CNN as the basic framework.
- the end-to-end CNN network has powerful feature extraction and learning capabilities.
- the infrared and visible light images are respectively sent to the feature extraction module to extract their respective deep features, and then the extracted deep features are spliced and sent to the spatial channel attention mechanism to further extract features and suppress the interference of useless information.
- the fused image is generated through the feature reconstruction module. Its network structure is shown in Figure 4.
- the Gradient Aggregation Residual Dense Block combines the advantages of the ResNeXt and DenseNet networks to retain shallow and deep features through split-transform-merge and dense connections.
- the branches in the GRXDB module perform a set of transformations, each on a low-dimensional embedding, and its output is aggregated by summing.
- the present invention sets the cardinality of the aggregation transformation to 2, where the module of the present invention consists of a residual block and a residual dense block. More precisely, each GRXDB contains three branches to improve the diversity of extracted features, so that it can make full use of the deep features extracted by each convolutional layer in the block. Since the source image contains rich texture details, the Laplacian operator and the Sobel operator are also integrated in GRXDB to retain more coarse and fine textures in the image. GRXDB greatly enhances the feature extraction capability of the network, and its structure is shown in Figure 5.
- SCAM Spatial Channel Attention Module
- in order to improve the information capture ability of the network and obtain more accurate spatial and semantic information, the present invention generates spatial and channel attention masks through self-attention functions. After refinement by the attention branch, the channel feature map is upsampled to restore its original size, and a convolution operation is used to control the number of channels of the spatial feature map. To generate a feature map that satisfies both intra-class similarity and inter-class difference requirements, a pooling fusion block (PFB) fuses the generated spatial and channel attention feature maps.
- PFB pooling fusion block
- the present invention introduces a real-time segmentation model, Bilateral Attention Decoder, to segment the fused image.
- the segmentation network outputs the segmentation result and the auxiliary segmentation result.
- the gap between the segmentation result and the semantic label can reflect the richness of the semantic information contained in the fused image. Therefore, this gap can be used to construct the semantic loss.
- the semantic loss is used to guide the training of the fusion network through back propagation, forcing the fused image to contain more semantic information.
- Embodiment 2:
- This embodiment is used to further disclose an application of performance evaluation based on the above embodiment 1;
- the present invention conducts extensive quantitative and qualitative evaluations on the MSRS dataset.
- the present invention is compared with nine state-of-the-art methods, including a traditional method (GTF), an AE-based method (DenseFuse), two GAN-based methods (FusionGAN and GANMcC), four CNN-based methods (IFCNN, SDNet, U2Fusion, and SeAFusion), and an image decomposition-based method (DeFusion).
- the implementations of these nine methods are all public, and the parameters set by the present invention are also consistent with the original paper.
- DenseFuse and IFCNN use element-wise addition and element-wise maximum fusion strategies to fuse deep features, respectively.
- the present invention selects seven indicators, EN, MI, VIF, SF, SD, SCD, and Qabf, to objectively evaluate the fusion effect.
- the present invention also uses IoU to quantify the segmentation performance. The larger the value of these indicators, the better the fusion performance.
- the quantitative results of 7 statistical indicators for 361 pairs of images are shown in Table 1.
- the segmentation results are shown in Table 2.
- the present invention shows significant advantages in SD, SF, and MI.
- the higher the SD, the higher the contrast of the fused image.
- the higher the SF value, the clearer the fused image and the better its quality.
- the higher the MI value, the more information the fused image conveys.
- the present invention presents the best VIF, which shows that the fused image of the present invention is more in line with the human visual system.
- the method of the present invention also obtains the best Qabf, which means that the fused image retains more edge information.
- the present invention also presents the highest EN, which shows that the fused image contains the most information.
- the present invention trails SeAFusion by only a small margin on the SCD indicator.
- an example of the qualitative results of SCGRFuse on the MSRS dataset is shown in FIG. 7.
- the present invention shows the fusion results for a daytime scene and a nighttime scene, respectively, and uses a red frame to enlarge a region to illustrate that the texture details are contaminated to varying degrees by spectral pollution.
- the present invention uses a green frame to highlight the problem of useless information weakening the prominent target.
- in the daytime scene, neither GTF nor FusionGAN preserves the texture details of the visible light image well, and the other methods also inevitably suffer interference from useless information.
- the segmentation results are shown in Table 2. It can be seen that the present invention is basically in a leading position across the per-class IoU scores and ranks first in mIoU. This is due to two advantages of the present invention. On the one hand, the present invention effectively integrates the complementary information of infrared and visible light images, which helps the segmentation model fully understand the imaging scene. On the other hand, SCGRFuse improves its information capture capability through the spatial and channel attention mechanisms, and enhances spatial and semantic information under the guidance of the semantic loss, so that the segmentation network can describe the imaging scene more accurately.
- This embodiment is used to disclose the specific operation process under the above embodiment.
- GRXDB gradient aggregation residual dense block
- the spatial and channel attention mechanisms are introduced to refine the channel and spatial information of the feature map and improve its information capture ability, and a pooling fusion block is used to fuse the refined spatial and channel feature maps to obtain high-quality fused features;
- the experimental results of this invention show that the proposed method performs better than the most advanced methods on the MSRS dataset, highlighting the potential research direction in the future to improve the accuracy of fused images while promoting the development of advanced visual tasks.
- the gradient aggregation residual dense block (GRXDB) and the spatial channel attention module (SCAM) are the key technologies of the present invention.
- the GRXDB of the present invention contains three branches.
- the main branch deploys three 3x3 convolution blocks and two 1x1 convolution blocks.
- the present invention introduces dense connections into the main branch, and uses 1x1 convolution blocks to deal with the differences between channel dimensions.
- the main branch also introduces the Laplacian operator to further extract the weak texture of the feature.
- the residual branch also integrates the Sobel operator to retain the strong texture of the feature.
- the third branch performs no processing; we call it the remaining branch, and it retains the information of the input features.
- the outputs of the main branch, the residual branch, and the remaining branch are added element-wise to integrate the deep features. It is worth noting that the ReLU activation function directly discards negative activations; this may be effective for classification tasks, but it is not suitable for image fusion. To better meet the needs of image fusion, the present invention sets the activation function of GRXDB to Leaky ReLU, which retains negative activation information.
- the operation flow of the spatial channel attention module (as shown in Figure 6):
- the present invention uses four convolution blocks with different kernel sizes to capture multi-scale and multi-receptive field depth features.
- the sizes of the convolution blocks are 1x1, 3x3, 5x5, and 7x7 respectively.
- the activation function of the convolution block is still Leaky ReLU.
- the obtained deep features are spliced in the channel direction.
- the 1x1 convolution block is used to change the number of channels and send them to the spatial and channel attention layers.
- in the attention mechanism, generating the attention masks Ms and Mc is the most important step; it improves the information capture ability of the network so that more accurate spatial and semantic information can be obtained.
- the present invention generates spatial and channel attention masks through self-attention functions, and its main process is shown in Figure 8.
- the present invention uses global average pooling to average all pixels of each channel map, obtaining a new 1x1 channel map.
- the Tanh activation function and the 1x1 convolution block are used to adjust the value of the feature map.
- the present invention divides the output of the tanh activation function by 2 and adds 0.5 so that it maps to the range [0, 1].
- the present invention multiplies the calculated channel map with the input feature map to generate the output of the channel attention branch.
- the present invention first uses a 1x1 convolution block to reduce the number of channels of the input feature, and then uses maximum pooling and average pooling to obtain two feature maps, both of which have only one channel and are the same size as the input feature.
- the two feature maps are spliced together, and the number of channels is changed to 1 by using a 1x1 convolution block.
- the activation function of the convolution block is still tanh, which is consistent with the channel attention mechanism.
- the generated spatial attention mask is multiplied with the input feature map to obtain the output of the final spatial attention branch.
- the present invention uses an upsampling operation on the channel feature map to restore it to its original size, and uses a convolution operation to control the number of channels of the spatial feature map.
- the present invention uses a Pooling fusion block (PFB) to fuse the generated spatial and channel attention feature maps.
- PFB Pooling fusion block
- the structure of PFB is shown in Figure 9.
- the present invention uses 3x3 average pooling to smooth the upsampled channel feature map, and uses reflection padding to construct the true boundary to reduce boundary artifacts. Finally, it is concatenated with the spatial feature map, and the boundary is corrected using a 1x1 convolution block and a Leaky ReLU activation function to generate a fused feature map that meets the requirements.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
Description
The present invention relates to the technical field of image fusion, and in particular to a method for fusing infrared light and visible light images.
[Corrected 07.11.2024 in accordance with Rule 26]
Image fusion is an important image enhancement technology that can extract meaningful information from different source images and fuse it into a new image. The fused images are usually robust, rich in information, and can present more complex and detailed scene representations. They not only reduce data redundancy, but also support the development and decision-making of subsequent applications. Fusion has been widely used in the preprocessing modules of advanced visual tasks and has good application prospects in object detection, object tracking, semantic segmentation, etc.
Due to the practicality of infrared and visible light images, many image fusion techniques have emerged in recent years, and they can be roughly divided into two categories: traditional methods and deep learning-based methods. According to the mathematical transformation used, traditional methods can be further divided into multi-scale transformation-based methods, such as the discrete wavelet transform (DWT); representation learning-based methods, such as sparse representation (SR) and joint sparse representation (JSR); subspace-based methods; saliency-based methods; and hybrid models. Deep learning-based methods can be divided into three categories according to the network architecture: models based on autoencoders (AE), models based on convolutional neural networks (CNN), and models based on generative adversarial networks (GAN).
Existing deep learning models are basically built on CNN or GAN networks. Although these methods have achieved good results in the field of image fusion, they emphasize image fusion quality while ignoring the needs of high-level visual tasks. In 2022, Tang et al. first proposed SeAFusion, an image fusion framework combined with high-level visual tasks. They did not greatly innovate in network architecture or learning paradigm, but they examined the image fusion task from a new perspective, namely using high-level visual tasks to drive image fusion. The emergence of SeAFusion opened up new possibilities for image fusion.
To better improve the performance of infrared and visible light image fusion, spatial and channel attention mechanisms can amplify useful information in the image and suppress interference from harmful information by assigning different weights, while enriching the semantic and spatial information of the image to support subsequent high-level visual tasks. The NestFuse network is a pioneering work in the field of image fusion; for multi-scale deep feature fusion, it proposed a fusion strategy based on spatial and channel attention models. In the spatial attention model, the deep features are processed by the l1 norm and a soft-max operator to obtain enhanced deep features. In the channel attention model, the channel fusion features are obtained by global pooling and a soft-max operator; finally, the spatial features and channel features are added to obtain the final fused features.
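As a concrete illustration of this kind of prior-art attention-weighted fusion strategy, the following is a minimal sketch assuming PyTorch tensors of shape (B, C, H, W); the function names are illustrative and are not taken from the NestFuse code.

```python
import torch

def spatial_attention_fuse(feat_ir, feat_vis):
    # l1 norm across channels gives one activity map per source: (B, 1, H, W)
    a_ir = feat_ir.abs().sum(dim=1, keepdim=True)
    a_vis = feat_vis.abs().sum(dim=1, keepdim=True)
    # soft-max over the two sources yields per-pixel fusion weights
    w = torch.softmax(torch.cat([a_ir, a_vis], dim=1), dim=1)
    return w[:, 0:1] * feat_ir + w[:, 1:2] * feat_vis

def channel_attention_fuse(feat_ir, feat_vis):
    # global pooling gives one descriptor per channel: (B, C, 1, 1)
    p_ir = feat_ir.mean(dim=(2, 3), keepdim=True)
    p_vis = feat_vis.mean(dim=(2, 3), keepdim=True)
    # soft-max over the two sources yields per-channel fusion weights
    w = torch.softmax(torch.stack([p_ir, p_vis], dim=0), dim=0)
    return w[0] * feat_ir + w[1] * feat_vis

def attention_fuse(feat_ir, feat_vis):
    # the spatially fused and channel-wise fused features are added
    return spatial_attention_fuse(feat_ir, feat_vis) + channel_attention_fuse(feat_ir, feat_vis)
```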
Shortcomings of the prior art:
1) Traditional methods apply the same transformation to different source images to extract features. However, this operation does not take into account the differences between the source images' characteristics, which may lead to poorly expressive extracted features. On the other hand, the choice of fusion strategies and the complexity of manual design in traditional methods limit performance improvements; these hand-crafted fusion strategies cannot be learned and also introduce certain artifacts into the fusion results.
2) Before fusing infrared and visible light images, it is necessary to extract useful feature information from them, including textures, edges, etc. However, existing network architectures cannot effectively extract coarse- and fine-grained detail features and are easily disturbed by thermal radiation, which degrades the quality of the fused image. Effectively integrating coarse and fine texture features while suppressing interference from useless information still requires further breakthroughs.
3) Existing fusion methods tend to pursue better visual quality and higher evaluation indicators, and rarely systematically consider whether the fused image can support high-level visual tasks. Although some studies introduce a perceptual loss at the feature level to constrain the fused image against the source images, the perceptual loss cannot effectively enhance the semantic information in the fused image. In addition, other researchers guide the image fusion process with segmentation masks, but the masks only segment some prominent objects, which is of limited help for enhancing semantic information.
Therefore, a new solution to the above problems is needed.
The purpose of the present invention is to provide a method for fusing infrared light and visible light images to solve the technical problems raised in the background technology.
To achieve the above object, the present invention provides the following technical solution: a method for fusing infrared light and visible light images, characterized in that it comprises at least the following steps:
S1: Build a fusion network, using an end-to-end CNN fusion network as the basic framework; the end-to-end CNN network has powerful feature extraction and learning capabilities;
S2: Build a gradient aggregation residual dense block, which combines the advantages of the ResNeXt and DenseNet networks and retains shallow and deep features through split-transform-merge and dense connections, strengthening the feature extraction capability of the network;
S3: Build a spatial channel attention module to process the input deep features; it amplifies useful information, suppresses interference from useless information, and facilitates the subsequent semantic segmentation task;
S4: Build a segmentation network to fully enhance the semantic information of the fused image.
Preferably, the application of the fusion network in S1 comprises at least the following steps:
the infrared and visible light images are each sent to the feature extraction module to extract their respective deep features;
the extracted deep features are then concatenated and sent to the spatial channel attention mechanism to further extract features and suppress interference from useless information;
finally, the fused image is generated through the feature reconstruction module.
Preferably, the branches in the gradient aggregation residual dense block in S2 perform a set of transformations, each on a low-dimensional embedding, whose outputs are aggregated by summation.
Preferably, the gradient aggregation residual dense block in S2 sets the cardinality of the aggregated transformation to 2 and is composed of at least one residual block and one residual dense block; each gradient aggregation residual dense block contains three branches to improve the diversity of extracted features, so that it can make full use of the deep features extracted by each convolutional layer in the block;
since the source images contain rich texture details, the gradient aggregation residual dense block also integrates the Laplacian operator and the Sobel operator to retain more of the coarse and fine textures in the image.
Preferably, the spatial channel attention module in S3 comprises at least multi-scale feature extraction, a spatial-channel attention mechanism, and a pooling fusion block.
Preferably, the application of the spatial channel attention module in S3 comprises at least the following steps:
four convolution blocks with different kernel sizes are used to capture multi-scale, multi-receptive-field deep features, which are then fed into the spatial and channel attention mechanisms respectively;
spatial and channel attention masks are generated through self-attention functions to improve the network's information capture ability, thereby obtaining more accurate spatial and semantic information;
after refinement by the attention branch, the channel feature map is upsampled to restore it to its original size; in addition, a convolution operation is used to control the number of channels of the spatial feature map;
the generated spatial and channel attention feature maps are fused using the pooling fusion block to generate a fused feature map that satisfies both the intra-class similarity and inter-class difference requirements.
Preferably, the application of the segmentation network in S4 comprises at least the following steps:
a real-time segmentation model, the Bilateral attention decoder, is introduced to segment the fused image, and the segmentation network outputs a segmentation result and an auxiliary segmentation result;
the gap between the segmentation result and the semantic label reflects the richness of the semantic information contained in the fused image;
this gap is used to construct the semantic loss;
the semantic loss is used to guide the training of the fusion network through back-propagation, forcing the fused image to contain more semantic information.
Compared with the prior art, the present invention has the following beneficial effects:
the present invention is designed based on the spatial channel attention mechanism and the gradient aggregation residual dense block. It combines the advantages of ResNeXt and DenseNet while integrating the Sobel and Laplacian operators to retain the strong and weak textures of the features. By introducing the spatial and channel attention mechanisms, the channel and spatial information of the feature map are refined to improve its information capture ability, and a pooling fusion block is used to fuse the refined spatial and channel feature maps to obtain high-quality fused features. The experimental results of the present invention show that the proposed method outperforms state-of-the-art methods on the MSRS dataset, highlighting a promising future research direction for improving the accuracy of fused images while advancing high-level visual tasks.
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a spatial attention model under the prior art;
Figure 2 is a channel attention model under the prior art;
Figure 3 is a schematic diagram of the overall network framework of SCGRFuse of the present invention;
Figure 4 is a schematic diagram of the fusion network of the present invention;
Figure 5 is a schematic diagram of the GRXDB module of the present invention;
Figure 6 is a schematic diagram of the multi-scale spatial attention module of the present invention;
Figure 7 is a schematic diagram of a qualitative comparison of SCGRFuse of the present invention with nine state-of-the-art methods on the MSRS dataset;
Figure 8 is a schematic diagram of the process of generating channel attention and spatial attention masks according to the present invention;
Figure 9 is a schematic diagram of the overall structure of the pooling fusion block of the present invention.
The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them.
Referring to Figure 1 and Figure 2, the spatial attention model and the channel attention model of the prior art are shown.
Embodiment 1:
This embodiment discloses a method for fusing infrared and visible light images based on a spatial channel attention mechanism and a gradient aggregation residual dense block;
The design of SCGRFuse combines the image fusion and semantic segmentation tasks. The input infrared and visible light images are first passed through the fusion network to obtain the fusion result, with a content loss guiding the training of the entire fusion network. The fused image is then sent to the semantic segmentation network, and the semantic loss produced by the segmentation network integrates more semantic information into the fused image, thereby optimizing the whole fused image. The overall architecture of the network is shown in Figure 3.
1) Fusion Network
The present invention uses an end-to-end CNN-based fusion network as the basic framework; the end-to-end CNN network has powerful feature extraction and learning capabilities. The infrared and visible light images are each sent to the feature extraction module to extract their respective deep features; the extracted deep features are then concatenated and sent to the spatial channel attention mechanism to further extract features and suppress interference from useless information. Finally, the fused image is generated through the feature reconstruction module. The network structure is shown in Figure 4.
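For orientation, the following is a minimal skeleton of this extract-concatenate-attend-reconstruct pipeline, written as a hedged sketch in PyTorch; the sub-module names and interfaces are placeholders rather than the actual implementation of the present invention.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Illustrative end-to-end fusion skeleton: extract -> concatenate -> attend -> reconstruct."""
    def __init__(self, extractor_ir, extractor_vis, scam, reconstructor):
        super().__init__()
        self.extractor_ir = extractor_ir    # e.g. a stack of GRXDB blocks (hypothetical wiring)
        self.extractor_vis = extractor_vis
        self.scam = scam                    # spatial channel attention module
        self.reconstructor = reconstructor  # convolutional feature-reconstruction head

    def forward(self, ir, vis):
        f_ir = self.extractor_ir(ir)          # deep features of the infrared image
        f_vis = self.extractor_vis(vis)       # deep features of the visible image
        f = torch.cat([f_ir, f_vis], dim=1)   # concatenate along the channel dimension
        f = self.scam(f)                      # refine features, suppress useless information
        return self.reconstructor(f)          # reconstruct the fused image
```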
2) Gradient Aggregation Residual Dense Block (GRXDB)
The Gradient Aggregation Residual Dense Block (GRXDB) combines the advantages of the ResNeXt and DenseNet networks, retaining shallow and deep features through split-transform-merge and dense connections. The branches in the GRXDB module perform a set of transformations, each on a low-dimensional embedding, and their outputs are aggregated by summation. The present invention sets the cardinality of the aggregation transformation to 2, with the module consisting of one residual block and one residual dense block. More precisely, each GRXDB contains three branches to improve the diversity of the extracted features, so that it can make full use of the deep features extracted by each convolutional layer in the block. Since the source images contain rich texture details, the Laplacian and Sobel operators are also integrated into GRXDB to retain more of the coarse and fine textures in the image. GRXDB greatly strengthens the feature extraction capability of the network; its structure is shown in Figure 5.
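Because GRXDB integrates the Laplacian and Sobel operators as fixed convolution kernels, a small helper for applying them channel-wise is sketched below; the 3x3 kernels are the standard ones, but how exactly the two Sobel directions are combined inside GRXDB is an assumption made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedGradientOp(nn.Module):
    """Applies a fixed 3x3 gradient kernel (Laplacian or Sobel) channel-wise."""
    def __init__(self, channels, kind="laplacian"):
        super().__init__()
        if kind == "laplacian":
            k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
            kernels = [k]
        else:  # Sobel: horizontal and vertical responses, summed below (an assumption)
            kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
            kernels = [kx, kx.t()]
        # one depthwise kernel set per direction, shape (n_dirs, channels, 1, 3, 3)
        self.register_buffer(
            "weight", torch.stack([k.expand(channels, 1, 3, 3) for k in kernels]))
        self.channels = channels

    def forward(self, x):
        # depthwise convolution with the fixed kernel(s); directional outputs are summed
        out = 0
        for w in self.weight:
            out = out + F.conv2d(x, w, padding=1, groups=self.channels)
        return out
```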
3) Spatial Channel Attention Module (SCAM)
To amplify useful information, suppress interference from useless information, and facilitate the semantic segmentation task, the present invention proposes a multi-scale spatial channel attention mechanism to process the input deep features. SCAM consists of three main parts: multi-scale feature extraction, the spatial-channel attention mechanism, and the pooling fusion block. Since single-scale convolution hinders the extraction of multi-scale information from different features, before computing the attention weights the present invention uses four convolution blocks with different kernel sizes to capture multi-scale, multi-receptive-field deep features and thereby obtain a more comprehensive representation of the source images; the captured deep features are then fed into the spatial and channel attention mechanisms respectively. To improve the network's information capture ability and obtain more accurate spatial and semantic information, the present invention generates spatial and channel attention masks through self-attention functions. After refinement by the attention branch, the channel feature map is upsampled to restore it to its original size, and a convolution operation is used to control the number of channels of the spatial feature map. To generate a feature map that satisfies both the intra-class similarity and inter-class difference requirements, the present invention uses a pooling fusion block (PFB) to fuse the generated spatial and channel attention feature maps, thereby producing a fused feature map that meets the requirements. The structure of the spatial channel attention module is shown in Figure 6.
4) Segmentation Network
To fully enhance the semantic information of the fused image, the present invention introduces a real-time segmentation model, the Bilateral attention decoder, to segment the fused image; the segmentation network outputs a segmentation result and an auxiliary segmentation result. The gap between the segmentation result and the semantic label reflects the richness of the semantic information contained in the fused image, so this gap can be used to construct the semantic loss. Finally, the semantic loss guides the training of the fusion network through back-propagation, forcing the fused image to contain more semantic information.
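A minimal sketch of one joint training step under this scheme is given below, assuming PyTorch; the loss weights (lambda_sem, lambda_aux) and the signature of the content loss are placeholders, since the text does not specify them here.

```python
import torch.nn.functional as F

def train_step(fusion_net, seg_net, ir, vis, label, content_loss_fn,
               optimizer, lambda_sem=0.1, lambda_aux=0.4):
    """One illustrative joint optimisation step; weights and interfaces are assumptions."""
    fused = fusion_net(ir, vis)
    # content loss constrains the fused image against the two source images
    loss_content = content_loss_fn(fused, ir, vis)
    # the segmentation network returns a main and an auxiliary prediction
    seg_out, aux_out = seg_net(fused)
    loss_sem = F.cross_entropy(seg_out, label) + lambda_aux * F.cross_entropy(aux_out, label)
    loss = loss_content + lambda_sem * loss_sem
    optimizer.zero_grad()
    loss.backward()           # the semantic loss back-propagates into the fusion network
    optimizer.step()
    return loss.item()
```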
Embodiment 2:
This embodiment further discloses an application of performance evaluation based on Embodiment 1;
To comprehensively evaluate the proposed method, the present invention conducts extensive quantitative and qualitative evaluations on the MSRS dataset. In addition, the present invention is compared with nine state-of-the-art methods, including a traditional method (GTF), an AE-based method (DenseFuse), two GAN-based methods (FusionGAN and GANMcC), four CNN-based methods (IFCNN, SDNet, U2Fusion, and SeAFusion), and an image decomposition-based method (DeFusion). The implementations of these nine methods are all public, and the parameters set by the present invention are consistent with the original papers. Among them, DenseFuse and IFCNN use element-wise addition and element-wise maximum fusion strategies, respectively, to fuse deep features.
For quantitative evaluation, the present invention selects seven indicators, EN, MI, VIF, SF, SD, SCD, and Qabf, to objectively evaluate the fusion effect. In addition, the present invention uses IoU to quantify segmentation performance. The larger the values of these indicators, the better the fusion performance. The quantitative results of the seven statistical indicators over 361 image pairs are shown in Table 1, and the segmentation results are shown in Table 2.
Table 1: Quantitative comparison of 7 indicators over 361 image pairs from the MSRS dataset
As shown in Table 1, the present invention shows significant advantages in SD, SF, and MI. The higher the SD, the higher the contrast of the fused image. The higher the SF value, the clearer the fused image and the better its quality. The higher the MI value, the more information the fused image conveys. In addition, the present invention presents the best VIF, which shows that its fused images better match the human visual system. The method of the present invention also obtains the best Qabf, which means that the fused image retains more edge information. Furthermore, the present invention presents the highest EN, which shows that the fused image contains the most information. The present invention trails SeAFusion by only a small margin on the SCD indicator. An example of the qualitative results of SCGRFuse on the MSRS dataset is shown in Figure 7. The present invention shows the fusion results for a daytime scene and a nighttime scene, and uses a red frame to enlarge a region to illustrate that texture details are contaminated to varying degrees by spectral pollution. In addition, the present invention uses a green frame to highlight the problem of useless information weakening salient targets. In the daytime scene, neither GTF nor FusionGAN preserves the texture details of the visible light image well, and the other methods also inevitably suffer interference from useless information. For the nighttime scene, all methods fuse the complementary information of the infrared and visible light images to some extent, but most of them introduce useless information into the fused image, manifested as weakened salient targets and contaminated texture and background details. Only the present invention and SeAFusion retain rich texture details and salient targets, and compared with SeAFusion, the fused image of the present invention has higher contrast.
Table 2: Segmentation performance (mIoU) of visible light, infrared, and fused images on the MSRS dataset.
The segmentation results are shown in Table 2. It can be seen that the present invention is basically in a leading position across the per-class IoU scores and ranks first in mIoU. This is due to two advantages of the present invention. On the one hand, the present invention effectively fuses the complementary information of infrared and visible light images, which helps the segmentation model fully understand the imaging scene. On the other hand, SCGRFuse improves its information capture capability through the spatial and channel attention mechanisms, and enhances spatial and semantic information under the guidance of the semantic loss, so that the segmentation network can describe the imaging scene more accurately.
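For reference, the simpler reference-free indicators used above (EN, SD, SF) can be computed as sketched below; these follow the definitions commonly used in the image fusion literature, which the text does not spell out, so they should be read as illustrative rather than the exact evaluation code.

```python
import numpy as np

def entropy(img):
    """EN: Shannon entropy of the grey-level histogram (8-bit image assumed)."""
    hist, _ = np.histogram(img, bins=256, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img):
    """SD: higher values indicate higher contrast."""
    return float(np.std(img.astype(np.float64)))

def spatial_frequency(img):
    """SF: combines row and column gradient energy; higher means a sharper image."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))  # row frequency
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))  # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))
```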
This embodiment discloses the specific operation flow under the above embodiments.
The purpose of image fusion is to make the fused image retain the advantages of the different source images; however, existing fusion methods are complex in design and ignore the influence of attention mechanisms on deep features. To address these problems, the present invention proposes an infrared and visible light image fusion method named SCGRFuse. First, a gradient aggregation residual dense block (GRXDB) is constructed, which combines the advantages of ResNeXt and DenseNet while integrating the Sobel and Laplacian operators to retain the strong and weak textures of the features. Then, spatial and channel attention mechanisms are introduced to refine the channel and spatial information of the feature map and improve its information capture ability, and a pooling fusion block is used to fuse the refined spatial and channel feature maps to obtain high-quality fused features. The experimental results of the present invention show that the proposed method outperforms state-of-the-art methods on the MSRS dataset, highlighting a promising future research direction for improving the accuracy of fused images while advancing high-level visual tasks.
Among these, the gradient aggregation residual dense block (GRXDB) and the spatial channel attention module (SCAM) are the key technologies of the present invention.
1) Gradient Aggregation Residual Dense Block
Overall operation flow of the gradient aggregation residual dense block (shown in Figure 5): The GRXDB of the present invention contains three branches. The main branch deploys three 3x3 convolution blocks and two 1x1 convolution blocks; to make fuller use of the various convolutional layers for feature extraction, the present invention introduces dense connections into the main branch and uses the 1x1 convolution blocks to reconcile differences in channel dimensions. In addition, the main branch introduces the Laplacian operator to further extract the weak textures of the features. The residual branch, besides two 3x3 convolution blocks and one 1x1 convolution block, integrates the Sobel operator to retain the strong textures of the features. The third branch performs no processing; we call it the remaining branch, and it preserves the information of the input features. Finally, the outputs of the main branch, the residual branch, and the remaining branch are added element-wise to integrate the deep features. It is worth noting that the ReLU activation function directly discards negative activations; this may be effective for classification tasks, but it is not suitable for image fusion. To better meet the needs of image fusion, the present invention sets the activation function of GRXDB to Leaky ReLU, which retains negative activation information.
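The three-branch structure just described can be sketched as follows, reusing the FixedGradientOp helper shown earlier; the channel widths, the exact placement of the dense 1x1 blocks, and where the fixed operators are added to the branches are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class GRXDB(nn.Module):
    """Illustrative three-branch gradient aggregation residual dense block."""
    def __init__(self, channels):
        super().__init__()
        act = nn.LeakyReLU(0.2, inplace=True)   # negative activations are retained
        # main branch: densely connected 3x3 blocks; 1x1 blocks reconcile channel widths
        self.m1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), act)
        self.m2 = nn.Sequential(nn.Conv2d(2 * channels, channels, 1),
                                nn.Conv2d(channels, channels, 3, padding=1), act)
        self.m3 = nn.Sequential(nn.Conv2d(3 * channels, channels, 1),
                                nn.Conv2d(channels, channels, 3, padding=1), act)
        self.laplacian = FixedGradientOp(channels, kind="laplacian")  # from the earlier sketch
        # residual branch: two 3x3 blocks, one 1x1 block, plus the Sobel operator
        self.r = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), act,
                               nn.Conv2d(channels, channels, 3, padding=1), act,
                               nn.Conv2d(channels, channels, 1), act)
        self.sobel = FixedGradientOp(channels, kind="sobel")

    def forward(self, x):
        d1 = self.m1(x)
        d2 = self.m2(torch.cat([x, d1], dim=1))      # dense connections
        d3 = self.m3(torch.cat([x, d1, d2], dim=1))
        main = d3 + self.laplacian(x)                # Laplacian keeps weak textures
        res = self.r(x) + self.sobel(x)              # Sobel keeps strong textures
        return main + res + x                        # remaining branch = identity
```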
2) Spatial Channel Attention Module
Operation flow of the spatial channel attention module (shown in Figure 6): The present invention uses four convolution blocks with different kernel sizes, 1x1, 3x3, 5x5, and 7x7, to capture multi-scale, multi-receptive-field deep features; the activation function of these convolution blocks is again Leaky ReLU. The resulting deep features are concatenated along the channel dimension, and a 1x1 convolution block then adjusts the number of channels before the features are sent to the spatial and channel attention layers. In the attention mechanism, generating the attention masks Ms and Mc is the most important step; it improves the network's information capture ability so that more accurate spatial and semantic information can be obtained. The present invention generates the spatial and channel attention masks through self-attention functions, the main process of which is shown in Figure 8. To generate the channel attention mask, the present invention first uses global average pooling to average all pixels of each channel map, obtaining a new 1x1 channel map. The Tanh activation function and a 1x1 convolution block are then used to adjust the values of the feature map; notably, the output of the tanh activation is divided by 2 and 0.5 is added so that it maps to the range [0, 1]. Finally, the computed channel map is multiplied with the input feature map to produce the output of the channel attention branch. Similarly, to obtain the spatial attention mask, the present invention first uses a 1x1 convolution block to reduce the number of channels of the input features, and then uses maximum pooling and average pooling to obtain two feature maps, each with a single channel and the same spatial size as the input features. The two feature maps are then concatenated, and a 1x1 convolution block reduces the number of channels to 1; the activation function of this convolution block is again tanh, consistent with the channel attention mechanism. Finally, the generated spatial attention mask is multiplied with the input feature map to obtain the output of the spatial attention branch. After refinement by the attention branches, the present invention upsamples the channel feature map to restore it to its original size and uses a convolution operation to control the number of channels of the spatial feature map.
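A compact sketch of the two mask-generating branches is given below, assuming PyTorch; the reduced channel width in the spatial branch and the order of the 1x1 convolution and the tanh in the channel branch are assumptions, since the text leaves them open.

```python
import torch
import torch.nn as nn

def scaled_tanh(x):
    # map tanh output from [-1, 1] to [0, 1]: tanh(x) / 2 + 0.5
    return torch.tanh(x) / 2 + 0.5

class ChannelAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        m = x.mean(dim=(2, 3), keepdim=True)   # global average pooling -> (B, C, 1, 1)
        mc = scaled_tanh(self.conv(m))         # channel attention mask Mc in [0, 1]
        return x * mc

class SpatialAttention(nn.Module):
    def __init__(self, channels, reduced=8):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, 1)   # reduced width is illustrative
        self.conv = nn.Conv2d(2, 1, 1)

    def forward(self, x):
        r = self.reduce(x)
        mx, _ = r.max(dim=1, keepdim=True)     # channel-wise max pooling -> (B, 1, H, W)
        av = r.mean(dim=1, keepdim=True)       # channel-wise average pooling
        ms = scaled_tanh(self.conv(torch.cat([mx, av], dim=1)))  # spatial mask Ms
        return x * ms
```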
To generate a feature map that satisfies both the intra-class similarity and the inter-class difference requirements, the present invention uses a pooling fusion block (PFB) to fuse the generated spatial and channel attention feature maps; the structure of the PFB is shown in Figure 9. The present invention applies 3x3 average pooling to smooth the upsampled channel feature map and uses reflection padding to construct true boundaries, reducing boundary artifacts. Finally, the result is concatenated with the spatial feature map, and the boundary is corrected with a 1x1 convolution block and a Leaky ReLU activation function, producing a fused feature map that meets the requirements.
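A minimal sketch of such a pooling fusion block is given below, assuming both inputs already have the same number of channels and spatial size; the class name PoolingFusionBlock, the channel parameter, and the Leaky ReLU slope are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoolingFusionBlock(nn.Module):
    """Sketch of the pooling fusion block (PFB); channel count is an assumption."""
    def __init__(self, channels):
        super().__init__()
        # Reflection padding preserves true boundaries before the 3x3 average pooling,
        # which is intended to reduce boundary artifacts.
        self.smooth = nn.Sequential(nn.ReflectionPad2d(1), nn.AvgPool2d(kernel_size=3, stride=1))
        self.correct = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, channel_feat, spatial_feat):
        smoothed = self.smooth(channel_feat)                # smooth the upsampled channel feature map
        fused = torch.cat([smoothed, spatial_feat], dim=1)  # concatenate with the spatial feature map
        return self.correct(fused)                          # 1x1 conv + Leaky ReLU boundary correction
```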
It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above and that it can be implemented in other specific forms without departing from its spirit or essential features. The embodiments should therefore be regarded in all respects as exemplary and non-limiting; the scope of the invention is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced therein. No reference sign in a claim should be construed as limiting the claim concerned.
Claims (7)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311537704.8 | 2023-11-17 | ||
| CN202311537704.8A CN117576543B (en) | 2023-11-17 | 2023-11-17 | A method for fusion of infrared light and visible light images |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025103079A1 true WO2025103079A1 (en) | 2025-05-22 |
Family
ID=89887588
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/126006 Pending WO2025103079A1 (en) | 2023-11-17 | 2024-10-21 | Method for fusing infrared light and visible light images |
Country Status (3)
| Country | Link |
|---|---|
| CN (1) | CN117576543B (en) |
| LU (1) | LU508856B1 (en) |
| WO (1) | WO2025103079A1 (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117576543B (en) * | 2023-11-17 | 2025-01-21 | 重庆理工大学 | A method for fusion of infrared light and visible light images |
| CN118429771B (en) * | 2024-05-16 | 2025-03-18 | 昆明理工大学 | Infrared and visible light image fusion method based on fusion-semantic feature classifier |
| CN119205526B (en) * | 2024-09-23 | 2025-11-18 | 南京信息工程大学 | An Infrared and Visible Light Fusion Method Based on Multiple Feature Extraction |
| CN119809948A (en) * | 2024-12-16 | 2025-04-11 | 天津大学 | Infrared and visible light image fusion method based on meta-learning and semantic perception |
Family Cites Families (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| AU2020100178A4 (en) * | 2020-02-04 | 2020-03-19 | Huang, Shuying DR | Multiple decision maps based infrared and visible image fusion |
| CN111950467B (en) * | 2020-08-14 | 2021-06-25 | 清华大学 | Fusion network lane line detection method and terminal device based on attention mechanism |
| CN113592018B (en) * | 2021-08-10 | 2024-05-10 | 大连大学 | Infrared light and visible light image fusion method based on residual dense network and gradient loss |
| KR102553934B1 (en) * | 2021-09-07 | 2023-07-10 | 인천대학교 산학협력단 | Apparatus for Image Fusion High Quality |
| CN115170915B (en) * | 2022-08-10 | 2025-08-01 | 上海理工大学 | Infrared and visible light image fusion method based on end-to-end attention network |
| CN116363034A (en) * | 2023-03-31 | 2023-06-30 | 徐州鑫达房地产土地评估有限公司 | Lightweight infrared and visible light image fusion method, system, device and medium |
| CN116664462B (en) * | 2023-05-19 | 2024-01-19 | 兰州交通大学 | Infrared and visible light image fusion method based on MS-DSC and I_CBAM |
| CN116757986B (en) * | 2023-07-05 | 2025-12-16 | 南京信息工程大学 | Infrared and visible light image fusion method and device |
| CN116704274A (en) * | 2023-07-06 | 2023-09-05 | 杭州电子科技大学 | Infrared and Visible Image Fusion Method Based on Spatial Correlation Attention |
| CN117036875B (en) * | 2023-07-11 | 2024-04-26 | 南京航空航天大学 | An infrared dim moving target generation algorithm based on fusion attention GAN |
| CN116681636B (en) * | 2023-07-26 | 2023-12-12 | 南京大学 | Light infrared and visible light image fusion method based on convolutional neural network |
| CN116757988B (en) * | 2023-08-17 | 2023-12-22 | 齐鲁工业大学(山东省科学院) | Infrared and visible light image fusion method based on semantic enrichment and segmentation tasks |
- 2023
  - 2023-11-17 CN CN202311537704.8A patent/CN117576543B/en active Active
- 2024
  - 2024-10-21 WO PCT/CN2024/126006 patent/WO2025103079A1/en active Pending
  - 2024-10-21 LU LU508856A patent/LU508856B1/en active IP Right Grant
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11113802B1 (en) * | 2019-09-09 | 2021-09-07 | Apple Inc. | Progressive image fusion |
| CN115063329A (en) * | 2022-06-10 | 2022-09-16 | 中国人民解放军国防科技大学 | Visible light and infrared image fusion enhancement method and system in low light environment |
| CN115471723A (en) * | 2022-09-23 | 2022-12-13 | 安徽优航遥感信息技术有限公司 | Substation unmanned aerial vehicle inspection method based on infrared and visible light image fusion |
| CN117576543A (en) * | 2023-11-17 | 2024-02-20 | 重庆理工大学 | A method of fusion of infrared and visible light images |
Non-Patent Citations (2)
| Title |
|---|
| LONG, YONGZHI ET AL.: "RXDNFuse: A Aggregated Residual Dense Network for Infrared and Visible Image Fusion", INFORMATION FUSION, 27 November 2020 (2020-11-27), pages 128 - 141, XP086444575, ISSN: 1566-2535, DOI: 10.1016/j.inffus.2020.11.009 * |
| WANG YONG, PU JIANFEI, MIAO DUOQIAN, ZHANG L., ZHANG LULU, DU XIN: "SCGRFuse: An infrared and visible image fusion network based on spatial/channel attention mechanism and gradient aggregation residual dense blocks", ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE., PINERIDGE PRESS, SWANSEA., GB, vol. 132, 1 June 2024 (2024-06-01), GB , pages 107898, XP093316010, ISSN: 0952-1976, DOI: 10.1016/j.engappai.2024.107898 * |
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120259100A (en) * | 2025-06-04 | 2025-07-04 | 南京信息工程大学 | A method and device for multi-scale fusion of hyperspectral image and multispectral image |
| CN120259100B (en) * | 2025-06-04 | 2025-08-15 | 南京信息工程大学 | A method and device for multi-scale fusion of hyperspectral images and multispectral images |
| CN120339092A (en) * | 2025-06-11 | 2025-07-18 | 河北工业大学 | Task-driven equivariant consistency image fusion method and system |
| CN120689218A (en) * | 2025-06-13 | 2025-09-23 | 中国矿业大学(北京) | An infrared and visible light image fusion system based on multimodal feature differences |
| CN120765479A (en) * | 2025-06-30 | 2025-10-10 | 大连大学 | Infrared and visible light image fusion method and system based on text-guided semantic perception |
| CN120472333A (en) * | 2025-07-09 | 2025-08-12 | 上海辰山植物园 | Multispectral and visible light remote sensing image fusion segmentation method, system and medium |
| CN120495302A (en) * | 2025-07-17 | 2025-08-15 | 湖南大学 | Photovoltaic hot spot detection method and system based on fusion of visible light and thermal infrared images |
| CN120510403A (en) * | 2025-07-18 | 2025-08-19 | 北京交通大学 | Railway scene image enhancement method based on image feature aggregation model |
| CN120599667A (en) * | 2025-08-05 | 2025-09-05 | 江南大学 | A method and system for infrared-visible light person re-identification based on wavelet representation |
| CN120598800A (en) * | 2025-08-05 | 2025-09-05 | 山东理工大学 | An image fusion method based on cross-modal difference and dual-axis attention network |
| CN120708219A (en) * | 2025-08-27 | 2025-09-26 | 南京大学 | A method and system for precise positioning of single polymer growth based on hybrid network |
| CN120808885A (en) * | 2025-09-11 | 2025-10-17 | 徐州市中心医院 | HE (HE-staining) image-based tumor HER2 expression grading method |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117576543B (en) | 2025-01-21 |
| CN117576543A (en) | 2024-02-20 |
| LU508856B1 (en) | 2025-05-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2025103079A1 (en) | Method for fusing infrared light and visible light images | |
| Zhu et al. | Eemefn: Low-light image enhancement via edge-enhanced multi-exposure fusion network | |
| CN112818862B (en) | Face tampering detection method and system based on multi-source clues and mixed attention | |
| CN115565035A (en) | A Fusion Method of Infrared and Visible Light Images for Target Enhancement at Night | |
| Balchandani et al. | A deep learning framework for smart street cleaning | |
| CN118397458A (en) | Infrared dim target detection method and device | |
| CN118967479A (en) | An adaptive feature extraction method for infrared and visible light image fusion | |
| CN118411313B (en) | A SAR optical image declouding method based on superposition attention feature fusion | |
| Wang et al. | PMSNet: Parallel multi-scale network for accurate low-light light-field image enhancement | |
| Zhang et al. | An ultra-lightweight network combining Mamba and frequency-domain feature extraction for pavement tiny-crack segmentation | |
| Bhattacharya et al. | D2bgan: A dark to bright image conversion model for quality enhancement and analysis tasks without paired supervision | |
| CN118711038A (en) | Automatic quality inspection method and system for inverter assembly based on improved YOLOv8 detection algorithm | |
| Xue et al. | MPE-DETR: A multiscale pyramid enhancement network for object detection in low-light images | |
| Wang et al. | WaveFusion: A Novel Wavelet Vision Transformer with Saliency-Guided Enhancement for Multimodal Image Fusion | |
| CN116309504A (en) | A visual inspection image acquisition and analysis method | |
| Lim et al. | LAU-Net: A low light image enhancer with attention and resizing mechanisms | |
| Lu et al. | Context-constrained accurate contour extraction for occlusion edge detection | |
| Han et al. | Unsupervised learning based dual-branch fusion low-light image enhancement | |
| Xue et al. | LAE-GAN: a novel cloud-based Low-light Attention Enhancement Generative Adversarial Network for unpaired text images | |
| CN118799181A (en) | Image super-resolution method and device | |
| EP4492321A1 (en) | Image processing method and apparatus, and storage medium, electronic device and product | |
| CN118506318A (en) | A traffic sign detection method based on LSKA-DAT fusion | |
| Xie et al. | A joint learning method for low-light facial expression recognition | |
| CN116485807A (en) | A digital image segmentation method and device | |
| Li et al. | MFMENet: multi-scale features mutual enhancement network for change detection in remote sensing images |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WWE | Wipo information: entry into national phase | Ref document number: LU508856; Country of ref document: LU |
| | WWG | Wipo information: grant in national office | Ref document number: LU508856; Country of ref document: LU |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24890437; Country of ref document: EP; Kind code of ref document: A1 |