CN116757957A - Two-stage decoupled image dehazing method based on zero-shot learning - Google Patents

Info

Publication number: CN116757957A
Application number: CN202310744185.6A
Authority: CN
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 赵丽, 林盛, 张笑钦
Current and original assignee: Wenzhou University
Application filed by Wenzhou University; priority to CN202310744185.6A

Classifications

    • G06T5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06N3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]


Abstract

The invention discloses a two-stage decoupled image dehazing method based on zero-shot learning. The method focuses on recovering the haze-free image and the transmission map. Briefly, the first stage uses an embedded dark channel prior (DCP) to obtain rough estimates of the haze-free image and the transmission map. In the second stage, two sub-networks refine the first-stage results to obtain more accurate haze-free images and transmission maps, while a third sub-network directly estimates the atmospheric light. In particular, a new multi-scale Transformer block is proposed and applied in the sub-network that refines the haze-free image; it performs multi-scale token aggregation in the self-attention part, enabling it to capture features at different scales and effectively recover the latent scene information in the hazy image.

Description

A two-stage decoupled image dehazing method based on zero-shot learning

Technical Field

The present invention relates to the technical field of image dehazing, and in particular to a two-stage decoupled image dehazing method based on zero-shot learning.

Background Art

Haze is a phenomenon caused by the accumulation and suspension of large amounts of particulate matter in the air. Owing to the absorption and scattering of atmospheric light, light is affected by these particles as it propagates, which lowers the clarity and contrast of images captured by various acquisition devices, degrades image quality severely, and creates considerable difficulty for subsequent image processing. In severe cases, these factors also distort the color information in the image, making computer vision techniques even harder to apply.

In recent years, with the spread of computer vision systems, such systems have come to play an important role in road traffic, aviation, and other fields, but foggy, low-visibility weather severely hampers the vision equipment used in these settings. Image dehazing has therefore long been a focus of computer vision research.

The goal of image dehazing is to estimate the latent haze-free image from an observed hazy image. For the single-image dehazing problem, a widely used model describes the degradation process of a hazy image:

I(x) = J(x)t(x) + A(1 - t(x)),

where I is the hazy image, J is the latent haze-free image, A is the global atmospheric light, t is the transmission map, and x indexes pixel position. The transmission map can be expressed as:

t(x) = e^{-βd(x)},

where β is the scattering coefficient of the atmosphere and d is the scene depth.

As the above shows, image dehazing is a typical ill-posed problem. Many image dehazing methods have therefore been proposed; they can be roughly divided into prior-based methods and learning-based methods. Prior-based methods, i.e., traditional dehazing methods, tend to exploit prior knowledge carried by the image itself: for example, the dark channel prior (DCP) detects the haze distribution of an image, and the color attenuation prior (CAP) estimates the transmission map. Although these prior-based methods achieve reasonable results, the priors they rely on are easily violated in practice and are not robust in complex real-world scenes.

Summary of the Invention

In view of the shortcomings of the prior art, the purpose of the present invention is to provide a two-stage decoupled image dehazing method based on zero-shot learning. Based on zero-shot learning, the method feeds the hazy image directly into the network for dehazing, with no need for hazy-clean image pairs, thereby avoiding large-scale training data. At the same time, when restoring the haze-free image, the method guarantees not only the extraction of multi-scale features but also the extraction of original-scale and neighborhood information.

To achieve the above object, the present invention provides the following technical solution: a two-stage decoupled image dehazing method based on zero-shot learning, comprising the following steps:

(1) Obtain the two test subsets, the Synthetic Objective Testing Set (SOTS) and the Hybrid Subjective Testing Set (HSTS), of the REalistic Single Image DEhazing (RESIDE) dataset, and preprocess the original data to form the test set;

(2) Construct the dehazing network model according to the atmospheric scattering physical model I(x) = J(x)t(x) + A(1 - t(x)),

where x denotes the input hazy image, I(x) the reconstructed hazy image, J(x) the original scene information, i.e., the clear haze-free image, t(x) the transmittance, and A the atmospheric scattered light.

Following the idea of decoupling, the network model is divided into three layers: the latent haze-free image layer, the transmission map layer, and the global atmospheric light layer. The latent haze-free image and the transmission map are estimated in two stages: the first stage uses an embedded dark channel prior for a rough estimate, and the second stage uses a haze-free image refinement network and a transmission map refinement network, respectively, for refined estimates. The global atmospheric light is estimated with an encoder-decoder network (the atmospheric light estimation network). At the top of the network, the images estimated by the three sub-networks (the haze-free image refinement network, the transmission map refinement network, and the atmospheric light estimation network) are recombined according to the atmospheric scattering physical model to obtain the reconstructed hazy image;

(3) Input the hazy images of the preprocessed dataset into the constructed dehazing network model, compute the loss with the designed loss function, iteratively update the parameters, and dehaze the hazy images directly.

Preferably, in step (2), the first-stage rough estimation specifically includes the following steps:

(2.1) Compute the dark channel of the input image x: first take the per-pixel minimum over the three RGB channels to build a grayscale map, then invert it, apply max pooling, and invert the pooled result to obtain I_dark. The specific formula is $I_{dark}(x) = \min_{y \in \Omega(x)} \min_{c \in \{r,g,b\}} I^{c}(y)$, where I^c is one of the R, G, B channels of I and Ω(x) is the local pooling window centered at x;

(2.2) Compute the rough transmission estimate T_DCP(x): substituting I_dark into the atmospheric scattering physical model gives I_dark(x) = J_dark(x)t(x) + A(1 - t(x)), where I_dark(x) and J_dark(x) are the dark channels of images I and J, respectively. For the atmospheric light A, the brightest 0.1% of pixels in I_dark are selected and their average is taken as the value of A. For J_dark, under the dark channel prior assumption, the dark channel of the non-sky regions of an outdoor haze-free image J usually tends to zero, J_dark(x) → 0, so the preliminary transmission map follows as T_DCP(x) = 1 - I_dark(x)/A. Because edge transitions in the dark channel map are not smooth, guided filtering is applied to make the edges of T_DCP(x) smoother; the guided filtering is implemented with an average pooling operation with kernel size 19*19 and stride 1;

(2.3) Compute the rough haze-free estimate J_DCP(x): according to the atmospheric scattering physical model, with the atmospheric light A and the transmission map T_DCP(x) known, J_DCP(x) is obtained from the equation J_DCP(x) = (I(x) - A)/T_DCP(x) + A.

Preferably, in step (2), the three sub-networks of the dehazing network model are constructed as follows:

(A) The haze-free image refinement network is an improved 5-level U-Net architecture in which the convolutional blocks are replaced by the introduced multi-scale Transformer blocks; SK fusion modules connect and fuse the different levels, and a soft reconstruction layer produces the global residual;

(B) The transmission map refinement network is a non-degenerate structure composed of five convolutional layers: each of the first four layers contains only a convolutional layer, a batch normalization layer, and a LeakyReLU activation, while the last layer contains a convolutional layer and a sigmoid function that normalizes the output to [0, 1];

(C) The atmospheric light estimation network is a symmetric encoder-decoder network. The encoder and decoder each have four levels: every encoder level contains a convolutional layer, a ReLU activation, and a max pooling layer, and every decoder level contains upsampling, a convolutional layer, a batch normalization layer, and a ReLU activation. Between the encoder and decoder, a reparameterization trick module converts the encoder output into a latent Gaussian distribution, resamples it, and feeds it to the decoder.

Preferably, in step (A), constructing the multi-scale Transformer block specifically includes the following steps:

(A1) A RescaleNorm layer serves as the normalization layer; its normalization process is expressed as

$Y = F\!\left(\frac{X - \mu}{\sigma}\right) \cdot \gamma + \beta,$

where F(·) denotes the main part of the multi-scale Transformer block, the multi-scale self-attention; μ and σ denote the mean and standard deviation, respectively; γ and β are the learned scale factor and bias; and W_γ, B_γ and W_β, B_β are the weights and bias terms of the two linear layers that transform σ and μ, the transformation being {γ, β} = {σW_γ + B_γ, μW_β + B_β}. To accelerate convergence, B_γ and B_β are initialized to 1 and 0.

(A2) Self-attention is computed with multi-scale self-attention (MSSA), which has two branches; it aggregates multi-scale feature information within the window while preserving neighborhood information to a certain extent.

Preferably, in step (A2), the multi-scale self-attention (MSSA) is computed with two branches:

(A21) In the main branch, the query Q, key K, and value V are treated differently. Q keeps the original scale and undergoes an ordinary fully connected operation, while for K and V, within the same self-attention layer, a scale transform (ST) produces K and V of different sizes, expressed as:

Q = XW^Q,

K_i, V_i = ST_i(X)W_i^K, ST_i(X)W_i^V,

V_i = V_i + LE_i(V_i),

where X is the input feature sequence, W^Q, W_i^K, W_i^V are the linear projection parameters of the i-th scale layer in the same head of the self-attention layer, and LE_i(·) is the local enhancement component of a depthwise convolution; ST_i(·) is the scale transform of the i-th scale layer, which downsamples X and can be expressed as:

ST_i(X) = LN(Conv_i(X)),

where LN(·) denotes layer normalization and Conv_i(·) denotes the convolution used by the i-th scale layer; different scale layers use different kernel sizes and strides, producing different scale transforms, so within the same self-attention layer the keys and values capture features at different scales. The self-attention head is then computed as:

$\text{head}_i = \text{Softmax}\!\left(\frac{Q K_i^{T}}{\sqrt{d}}\right) V_i,$

where Softmax(·) is the activation function, T denotes transposition, and d is the dimension. The heads are concatenated, and the multi-head self-attention (MH) of the main branch is computed as:

MH(Q, K, V) = Concat(head_0, ..., head_j)W^O,

where Concat(·) is the concatenation operation, head_i denotes the i-th self-attention head, and W^O is a linear projection parameter;

(A22) The other branch, which preserves neighborhood information, simply applies a linear projection followed by a convolution to the input features. The multi-scale self-attention MSSA can then be expressed as:

MSSA = Concat(MH{Q, K_i, V_i}_{i=1,...,m}) + Conv_0(XW_0),

where Concat(·) is the concatenation operation, MH denotes multi-head self-attention, Q, K_i, V_i denote the query vector and the key and value vectors of the i-th scale layer, Conv_0(·) denotes a convolution, X is the input feature sequence, and W_0 is a linear projection parameter.

Preferably, in step (3), the loss function is composed of five loss functions:

(3.1) The reconstruction loss L_Rec computes the error between the input image and the reconstructed image, thereby constraining the whole network. L_Rec is defined as:

L_Rec = ||I(x) - x||_F,

where ||·||_F denotes the Frobenius norm of a matrix, x is the input hazy image, and I(x) is the hazy image reconstructed from the outputs of the three sub-networks;

(3.2) The loss L_J computes the difference between the value (brightness) and the saturation of the haze-free estimate J_R(x), expressed as:

L_J = ||V(J_R(x)) - S(J_R(x))||_F,

where V(·) denotes value (brightness) and S(·) denotes saturation;

(3.3) The atmospheric light loss L_H computes the loss between the estimated atmospheric light A(x) and A_init(x), where A_init(x) is an initial atmospheric light estimated automatically from the input image data. L_H is expressed as:

L_H = ||A(x) - A_init(x)||_F;

(3.4) L_KL is the variational inference loss of the reparameterization module in the atmospheric light estimation network; it minimizes the difference between the resampled latent code ẑ in the network and the pre-sampling z. Mathematically,

$L_{KL} = \sum_i \mathrm{KL}\!\left(\mathcal{N}(\mu_{z_i}, \sigma_{z_i}^2) \,\|\, \mathcal{N}(0, 1)\right) = -\frac{1}{2}\sum_i \left(1 + \log \sigma_{z_i}^2 - \mu_{z_i}^2 - \sigma_{z_i}^2\right),$

where KL(·) denotes the Kullback-Leibler divergence between two distributions, z_i denotes the i-th dimension of z, μ_z, σ_z denote the mean and standard deviation of z, and μ_{z_i}, σ_{z_i} denote the mean and standard deviation of z_i;

(3.5) A regularization loss L_Reg is set to avoid overfitting of the network, expressed as:

$L_{Reg} = \frac{1}{n}\sum_{x}\frac{1}{|N(x)|}\sum_{y \in N(x)} \left(A(x) - A(y)\right)^2,$

where N(·) denotes the second-order neighborhood, |N(·)| the size of the second-order neighborhood, n the number of pixels of A(x), and A(x) the atmospheric light;

The total loss is defined by combining the above five loss functions:

L = L_Rec + L_J + L_H + L_KL + λL_Reg,

where the parameter λ in front of L_Reg is a non-negative parameter that balances the regularization; in practice it is set to 0.1.

With the above technical solution, the present invention has the following beneficial effects:

The present invention constructs an image dehazing network model: following the atmospheric scattering physical model and the idea of decoupling, the network is divided into three layers, namely the latent haze-free image layer, the transmission map layer, and the global atmospheric light layer. The latent haze-free image and the transmission map are estimated in two stages: the first stage uses an embedded dark channel prior for a rough estimate, and the second stage uses the haze-free image refinement network and the transmission map refinement network for refined estimates. The global atmospheric light is estimated with an encoder-decoder network. At the top of the network, the images estimated by the three sub-networks are recombined according to the atmospheric scattering physical model to obtain the reconstructed hazy image. A new multi-scale Transformer block is introduced in the haze-free image refinement sub-network of the dehazing model; it performs multi-scale token aggregation in the self-attention part, enabling it to capture features at different scales and effectively recover the latent scene information in the hazy image.

Based on zero-shot learning, the present invention feeds the hazy image directly into the network for dehazing, with no need for hazy-clean image pairs, thereby avoiding large-scale training data. At the same time, when restoring the haze-free image, the method guarantees not only the extraction of multi-scale features but also the extraction of original-scale and neighborhood information.

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the dehazing network model according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the multi-scale Transformer block according to an embodiment of the present invention.

DETAILED DESCRIPTION

Referring to FIG. 1 and FIG. 2, the present invention discloses a two-stage decoupled image dehazing method based on zero-shot learning, comprising the following steps:

(1) Obtain the two test subsets, the Synthetic Objective Testing Set (SOTS) and the Hybrid Subjective Testing Set (HSTS), of the REalistic Single Image DEhazing (RESIDE) dataset, and preprocess the original data to form the test set;

(2) Construct the dehazing network model according to the atmospheric scattering physical model I(x) = J(x)t(x) + A(1 - t(x)),

where x denotes the input hazy image, I(x) the reconstructed hazy image, J(x) the original scene information, i.e., the clear haze-free image, t(x) the transmittance, and A the atmospheric scattered light.

Following the idea of decoupling, the network model is divided into three layers: the latent haze-free image layer, the transmission map layer, and the global atmospheric light layer. The latent haze-free image and the transmission map are estimated in two stages: the first stage uses an embedded dark channel prior for a rough estimate, and the second stage uses a haze-free image refinement network and a transmission map refinement network, respectively, for refined estimates. The global atmospheric light is estimated with an encoder-decoder network (the atmospheric light estimation network). At the top of the network, the images estimated by the three sub-networks are recombined according to the atmospheric scattering physical model to obtain the reconstructed hazy image;

(3) Input the hazy images of the preprocessed dataset into the constructed dehazing network model, compute the loss with the designed loss function, iteratively update the parameters, and dehaze the hazy images directly.


In practical application:

The present invention provides a two-stage decoupled image dehazing method based on zero-shot learning that performs dehazing end to end and consists mainly of two stages. In the first stage, an embedded dark channel prior yields rough estimates of the haze-free image and the transmission map. In the second stage, two sub-networks refine the first-stage results into more accurate haze-free images and transmission maps, while a third sub-network directly estimates the atmospheric light. In particular, the invention introduces a new multi-scale Transformer block in the haze-free image refinement sub-network of the dehazing model. Once the dehazing network model is constructed, the loss is computed with the designed loss function, the parameters are updated iteratively, and the hazy image is dehazed directly.

1. Two-Stage Decoupled Dehazing Network Model

The first stage of the two-stage decoupled dehazing network model is constructed in the following steps:

Step 1: Compute the dark channel of the input image x. First take the per-pixel minimum over the three RGB channels to build a grayscale map, then invert it, apply max pooling, and invert the pooled result to obtain I_dark. The specific formula is:

$I_{dark}(x) = \min_{y \in \Omega(x)} \min_{c \in \{r,g,b\}} I^{c}(y), \qquad (1)$

where I^c is one of the R, G, B channels of I and Ω(x) is the local pooling window centered at x.
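The invert/max-pool/invert procedure of step 1 maps directly onto pooling primitives. Below is a minimal sketch, assuming a PyTorch tensor of shape (B, 3, H, W) with values in [0, 1]; the 19x19 window size is borrowed from the pooling kernel quoted later for T_DCP(x) and is an assumption here, since step 1 itself does not fix it.

```python
# A minimal sketch of the step-1 dark channel: channel-wise minimum, invert,
# max-pool, invert back (equivalent to a min-pool over the window Omega(x)).
import torch
import torch.nn.functional as F

def dark_channel(img: torch.Tensor, window: int = 19) -> torch.Tensor:
    gray, _ = img.min(dim=1, keepdim=True)        # per-pixel min over R, G, B
    inv = 1.0 - gray                              # invert the grayscale map
    pooled = F.max_pool2d(inv, window, stride=1, padding=window // 2)
    return 1.0 - pooled                           # invert back: I_dark
```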

Step 2: Compute the rough transmission estimate T_DCP(x). Substituting I_dark into the atmospheric scattering physical model yields:

I_dark(x) = J_dark(x)t(x) + A[1 - t(x)], (2)

where I_dark(x) and J_dark(x) are the dark channels of images I and J, respectively. For the atmospheric light A, the brightest 0.1% of pixels in I_dark are selected and their average is taken as A. For J_dark, under the dark channel prior assumption, the dark channel of the non-sky regions of an outdoor haze-free image J usually tends to zero, that is,

J_dark(x) → 0. (3)

Combining this with equation (2), the preliminary transmission map is obtained as:

T_DCP(x) = 1 - I_dark(x)/A. (4)

Because edge transitions in the dark channel map are not smooth, guided filtering is applied to make the edges of T_DCP(x) smoother. The guided filtering is implemented with an average pooling operation with kernel size 19*19 and stride 1.
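Equations (2) to (4) and the average-pooling smoothing can be sketched as follows, reusing the dark_channel helper above; the top-0.1% averaging for A follows the text.

```python
# A sketch of the rough transmission estimate of equation (4), with the 19x19
# average pooling standing in for guided filtering as described.
import torch
import torch.nn.functional as F

def rough_transmission(img: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    dark = dark_channel(img)                      # I_dark from step 1
    flat = dark.flatten(1)
    k = max(1, int(flat.shape[1] * 0.001))        # brightest 0.1% of I_dark
    A = flat.topk(k, dim=1).values.mean(dim=1).view(-1, 1, 1, 1)
    t = 1.0 - dark / A                            # equation (4)
    t = F.avg_pool2d(t, 19, stride=1, padding=9)  # kernel 19*19, stride 1
    return t, A
```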

Step 3: Compute the rough haze-free estimate J_DCP(x). According to the atmospheric scattering physical model, with the atmospheric light A and the transmission map T_DCP(x) known, J_DCP(x) can be calculated by the following equation:

J_DCP(x) = (I(x) - A)/T_DCP(x) + A. (5)
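A one-line recovery per equation (5); the eps clamp is an assumed numerical guard, not part of the stated model:

```python
# Recovering the rough haze-free estimate of equation (5).
def rough_dehaze(img, A, t, eps: float = 1e-6):
    return (img - A) / t.clamp(min=eps) + A
```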

The second stage of the two-stage decoupled dehazing network model consists of the following three sub-networks:

(1) Haze-free image refinement network: an improved 5-level U-Net architecture in which the convolutional blocks are replaced by the introduced multi-scale Transformer blocks; SK fusion modules connect and fuse the different levels, and a soft reconstruction layer produces the global residual.
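The exact SK fusion module is not detailed in the text; the following rough sketch shows one selective-kernel-style fusion consistent with the description, where per-channel softmax weights blend the skip and upsampled features (the global-pooling bottleneck and reduction ratio 8 are assumptions):

```python
# A rough, assumed sketch of SK-style fusion for merging two feature maps.
import torch
import torch.nn as nn

class SKFusion(nn.Module):
    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, 2 * dim, 1))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        w = self.mlp(a + b).view(a.shape[0], 2, -1, 1, 1).softmax(dim=1)
        return a * w[:, 0] + b * w[:, 1]          # weighted blend of the inputs
```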

(2) Transmission map refinement network: a non-degenerate structure composed of five convolutional layers. Specifically, each of the first four layers of this sub-network contains only a convolutional layer, a batch normalization layer, and a LeakyReLU activation, while the last layer contains a convolutional layer and a sigmoid function that normalizes the output to [0, 1].
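This sub-network translates almost line by line into PyTorch; a sketch follows, where the channel width 32 and the 3x3 kernels are assumptions (the text fixes neither):

```python
# A sketch of the five-layer transmission refinement sub-network as described:
# BatchNorm + LeakyReLU on the first four layers, sigmoid on the last.
import torch.nn as nn

def transmission_refiner(width: int = 32) -> nn.Sequential:
    layers, in_ch = [], 1                          # single-channel map in/out
    for _ in range(4):                             # first four layers
        layers += [nn.Conv2d(in_ch, width, 3, padding=1),
                   nn.BatchNorm2d(width),
                   nn.LeakyReLU(inplace=True)]
        in_ch = width
    layers += [nn.Conv2d(width, 1, 3, padding=1),  # fifth layer ...
               nn.Sigmoid()]                       # ... output normalized to [0, 1]
    return nn.Sequential(*layers)
```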

(3) Atmospheric light estimation network: a symmetric encoder-decoder network. The encoder and decoder each have four levels: every encoder level contains a convolutional layer, a ReLU activation, and a max pooling layer, and every decoder level contains upsampling, a convolutional layer, a batch normalization layer, and a ReLU activation. Between the encoder and decoder, a reparameterization trick module converts the encoder output into a latent Gaussian distribution, resamples it, and feeds it to the decoder.
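The reparameterization trick between encoder and decoder can be sketched as follows, with illustrative 1x1 projection heads; mu and log_var also feed the KL loss of equation (15):

```python
# A sketch of the reparameterization module: the encoder output is mapped to a
# Gaussian posterior (mu, log_var) and resampled before entering the decoder.
import torch
import torch.nn as nn

class Reparameterize(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.to_mu = nn.Conv2d(channels, channels, 1)      # assumed 1x1 heads
        self.to_logvar = nn.Conv2d(channels, channels, 1)

    def forward(self, h: torch.Tensor):
        mu, log_var = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # resample
        return z, mu, log_var
```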

2. Multi-Scale Transformer Block

The haze-free image refinement network of the two-stage decoupled dehazing model is built from RescaleNorm layers, multi-scale self-attention, and MLP layers. The RescaleNorm layer and the multi-scale self-attention introduced by this invention are detailed here:

(1) RescaleNorm layer. As the normalization layer, the normalization process of RescaleNorm can be expressed as:

$Y = F\!\left(\frac{X - \mu}{\sigma}\right) \cdot \gamma + \beta, \qquad (6)$

where F(·) denotes the main part of the multi-scale Transformer block, the multi-scale self-attention; μ and σ denote the mean and standard deviation, respectively; γ and β are the learned scale factor and bias; and W_γ, B_γ and W_β, B_β are the weights and bias terms of the two linear layers that transform σ and μ, the transformation being {γ, β} = {σW_γ + B_γ, μW_β + B_β}. To accelerate convergence, B_γ and B_β are initialized to 1 and 0.
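A sketch of RescaleNorm wrapping a sub-block F (here, the attention module), per equation (6); the zero initialization of the linear weights is an assumption, while the bias initialization follows the text:

```python
# A sketch of RescaleNorm: standardize the input, run F on the normalized
# tensor, then rescale with gamma/beta predicted from the input's sigma and mu.
import torch
import torch.nn as nn

class RescaleNorm(nn.Module):
    def __init__(self, dim: int, fn: nn.Module):
        super().__init__()
        self.fn = fn                                # F(.), e.g. the MSSA block
        self.w_gamma = nn.Linear(1, dim)            # gamma = sigma*W_g + B_g
        self.w_beta = nn.Linear(1, dim)             # beta  = mu*W_b + B_b
        nn.init.zeros_(self.w_gamma.weight); nn.init.ones_(self.w_gamma.bias)
        nn.init.zeros_(self.w_beta.weight); nn.init.zeros_(self.w_beta.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, dim) tokens
        mu = x.mean(dim=(1, 2), keepdim=True)
        sigma = x.std(dim=(1, 2), keepdim=True)
        gamma = self.w_gamma(sigma.view(-1, 1)).unsqueeze(1)
        beta = self.w_beta(mu.view(-1, 1)).unsqueeze(1)
        return self.fn((x - mu) / sigma) * gamma + beta    # equation (6)
```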

(2) Multi-scale self-attention. Multi-scale self-attention (MSSA) has two branches; it aggregates multi-scale feature information within the window while preserving neighborhood information to a certain extent.

In the main branch, the query Q, key K, and value V are treated differently. Q keeps the original scale and undergoes an ordinary fully connected operation, while for K and V, within the same self-attention layer, a scale transform (ST) produces K and V of different sizes, expressed as:

$Q = XW^{Q}, \qquad K_i, V_i = ST_i(X)W_i^{K},\ ST_i(X)W_i^{V}, \qquad V_i = V_i + LE_i(V_i), \qquad (7)$

where X is the input feature sequence, W^Q, W_i^K, W_i^V are the linear projection parameters of the i-th scale layer in the same head of the self-attention layer, and LE_i(·) is the local enhancement component of a depthwise convolution. ST_i(·) is the scale transform of the i-th scale layer, which downsamples X and can be expressed as:

ST_i(X) = LN(Conv_i(X)), (8)

where LN(·) denotes layer normalization and Conv_i(·) denotes the convolution used by the i-th scale layer; different scale layers use different kernel sizes and strides, producing different scale transforms, so within the same self-attention layer the keys and values can capture features at different scales. The self-attention head can then be computed as:

$\text{head}_i = \text{Softmax}\!\left(\frac{Q K_i^{T}}{\sqrt{d}}\right) V_i, \qquad (9)$

where Softmax(·) is the activation function, T denotes transposition, and d is the dimension. The heads are concatenated, and the multi-head self-attention of the main branch is computed as:

MH(Q, K, V) = Concat(head_0, ..., head_j)W^O, (10)

where Concat(·) is the concatenation operation, head_i denotes the i-th self-attention head, and W^O is a linear projection parameter.

The other branch, which preserves neighborhood information, simply applies a linear projection followed by a convolution to the input features. In summary, the multi-scale self-attention can be expressed as:

MSSA = Concat(MH{Q, K_i, V_i}_{i=1,...,m}) + Conv_0(XW_0), (11)

where Concat(·) is the concatenation operation, MH denotes multi-head self-attention, Q, K_i, V_i denote the query vector and the key and value vectors of the i-th scale layer, Conv_0(·) denotes a convolution, X is the input feature sequence, and W_0 is a linear projection parameter.
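A condensed sketch of MSSA per equations (7) to (11): Q stays at the original scale, each scale layer downsamples K and V with a strided convolution (ST_i) followed by LayerNorm, a depthwise convolution LE_i enhances V locally, and a parallel projection-plus-convolution branch preserves neighborhood detail. The head count, width, and scale strides (1, 2) are illustrative choices, not values fixed by the text:

```python
import torch
import torch.nn as nn

class MSSA(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4, strides=(1, 2)):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.st = nn.ModuleList(nn.Conv2d(dim, dim, s, stride=s) for s in strides)
        self.ln = nn.ModuleList(nn.LayerNorm(dim) for _ in strides)
        self.to_k = nn.ModuleList(nn.Linear(dim, dim) for _ in strides)
        self.to_v = nn.ModuleList(nn.Linear(dim, dim) for _ in strides)
        self.le = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim) for _ in strides)
        self.w0 = nn.Linear(dim, dim)                   # local branch: X W_0 ...
        self.conv0 = nn.Conv2d(dim, dim, 3, padding=1)  # ... then Conv_0
        self.out = nn.Linear(len(strides) * dim, dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = x.shape                               # x: (B, H*W, C) tokens
        d = C // self.heads
        q = self.to_q(x).view(B, N, self.heads, d).transpose(1, 2)
        x_sp = x.transpose(1, 2).reshape(B, C, H, W)
        scale_outs = []
        for st, ln, to_k, to_v, le in zip(self.st, self.ln, self.to_k,
                                          self.to_v, self.le):
            f = st(x_sp)                                # ST_i: strided conv ...
            h, w = f.shape[-2:]
            f = ln(f.flatten(2).transpose(1, 2))        # ... then LayerNorm, eq (8)
            k, v = to_k(f), to_v(f)
            v = v + le(v.transpose(1, 2).reshape(B, C, h, w)
                       ).flatten(2).transpose(1, 2)     # V_i += LE_i(V_i), eq (7)
            k = k.view(B, -1, self.heads, d).transpose(1, 2)
            v = v.view(B, -1, self.heads, d).transpose(1, 2)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            o = attn.softmax(dim=-1) @ v                # equation (9)
            scale_outs.append(o.transpose(1, 2).reshape(B, N, C))
        multi = self.out(torch.cat(scale_outs, dim=-1)) # concat over scale layers
        local = self.conv0(self.w0(x).transpose(1, 2).reshape(B, C, H, W))
        return multi + local.flatten(2).transpose(1, 2) # equation (11)
```

For a feature map flattened to tokens, e.g. x = fmap.flatten(2).transpose(1, 2), call MSSA(dim)(x, H, W); H and W are assumed divisible by the largest stride.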

3. Loss Functions

The loss function is composed of the following five loss functions:

(1) The reconstruction loss L_Rec computes the error between the input image and the reconstructed image, thereby constraining the whole network. L_Rec is defined as:

L_Rec = ||I(x) - x||_F, (12)

where ||·||_F denotes the Frobenius norm of a matrix, x is the input hazy image, and I(x) is the hazy image reconstructed from the outputs of the three sub-networks.

(2) The loss L_J computes the difference between the value (brightness) and the saturation of the haze-free estimate J_R(x), expressed as:

L_J = ||V(J_R(x)) - S(J_R(x))||_F, (13)

where V(·) denotes value (brightness) and S(·) denotes saturation.

(3) The atmospheric light loss L_H computes the loss between the estimated atmospheric light A(x) and A_init(x), where A_init(x) is an initial atmospheric light estimated automatically from the input image data. L_H is expressed as:

L_H = ||A(x) - A_init(x)||_F. (14)

(4) L_KL is the variational inference loss of the reparameterization module in the atmospheric light estimation network; its purpose is to minimize the difference between the resampled latent code ẑ in the network and the pre-sampling z. Mathematically,

$L_{KL} = \sum_i \mathrm{KL}\!\left(\mathcal{N}(\mu_{z_i}, \sigma_{z_i}^2) \,\|\, \mathcal{N}(0, 1)\right) = -\frac{1}{2}\sum_i \left(1 + \log \sigma_{z_i}^2 - \mu_{z_i}^2 - \sigma_{z_i}^2\right), \qquad (15)$

where KL(·) denotes the Kullback-Leibler divergence between two distributions, z_i denotes the i-th dimension of z, μ_z, σ_z denote the mean and standard deviation of z, and μ_{z_i}, σ_{z_i} denote the mean and standard deviation of z_i.

(5) A regularization loss L_Reg is set to avoid overfitting of the network, which can be expressed as:

$L_{Reg} = \frac{1}{n}\sum_{x}\frac{1}{|N(x)|}\sum_{y \in N(x)} \left(A(x) - A(y)\right)^2, \qquad (16)$

where N(·) denotes the second-order neighborhood, |N(·)| the size of the second-order neighborhood, n the number of pixels of A(x), and A(x) the atmospheric light.

In summary, the total loss is defined by combining the above five loss functions:

L = L_Rec + L_J + L_H + L_KL + λL_Reg, (17)

where the parameter λ in front of L_Reg is a non-negative parameter that balances the regularization; in practice it is set to 0.1.
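Under the stated definitions, the full objective can be sketched as follows; V and S are computed as the HSV value and saturation of the estimate, and the axial squared-difference form of L_Reg and the eps guard are assumptions:

```python
# A sketch of the total loss of equations (12)-(17), assuming tensors produced
# by the three sub-networks and the reparameterization module.
import torch

def total_loss(x, I_rec, J_r, A, A_init, mu, log_var, lam: float = 0.1):
    fro = lambda t: torch.norm(t, p="fro")              # Frobenius norm
    l_rec = fro(I_rec - x)                              # (12) reconstruction
    v = J_r.max(dim=1).values                           # HSV value (brightness)
    s = 1.0 - J_r.min(dim=1).values / v.clamp(min=1e-6) # HSV saturation
    l_j = fro(v - s)                                    # (13) value vs. saturation
    l_h = fro(A - A_init)                               # (14) atmospheric light
    l_kl = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp())  # (15) KL
    l_reg = ((A[..., 1:, :] - A[..., :-1, :]) ** 2).mean() + \
            ((A[..., :, 1:] - A[..., :, :-1]) ** 2).mean()  # (16) smoothness of A
    return l_rec + l_j + l_h + l_kl + lam * l_reg       # (17), lambda = 0.1
```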

Based on zero-shot learning, the present invention feeds the hazy image directly into the network for dehazing, with no need for hazy-clean image pairs, thereby avoiding large-scale training data. At the same time, when restoring the haze-free image, the method guarantees not only the extraction of multi-scale features but also the extraction of original-scale and neighborhood information.

Following the atmospheric scattering physical model and the idea of decoupling, the network model is divided into three layers, namely the latent haze-free image layer, the transmission map layer, and the global atmospheric light layer. The latent haze-free image and the transmission map are estimated in two stages: the first stage uses an embedded dark channel prior for a rough estimate, and the second stage uses the haze-free image refinement network and the transmission map refinement network for refined estimates. The global atmospheric light is estimated with an encoder-decoder network. At the top of the network, the images estimated by the three sub-networks are recombined according to the atmospheric scattering physical model to obtain the reconstructed hazy image. A new multi-scale Transformer block is introduced in the haze-free image refinement sub-network; it performs multi-scale token aggregation in the self-attention part, enabling it to capture features at different scales and effectively recover the latent scene information in the hazy image.

The specific description of the present invention in the above embodiments serves only to further illustrate the invention and shall not be construed as limiting its scope of protection; non-essential improvements and adjustments made by engineers skilled in the art in light of the above content all fall within the scope of protection of the present invention.

Claims (6)

1.一种基于零样本学习的两阶段解耦图像去雾方法,其特征在于:包括以下步骤:1. A two-stage decoupled image defogging method based on zero-shot learning, characterized in that it includes the following steps: (1)获取真实的单图像去雾数据集中的合成目标测试集和混合主观测试集的两个测试子集,对原始数据集进行预处理,作为测试集;(1) Obtain two test subsets of the synthetic target test set and the mixed subjective test set from the real single image dehazing dataset, and preprocess the original dataset as the test set; (2)构建去雾网络模型:根据大气散射物理模型I(x)=J(x)t(x)+A[1-t(x)],(2) Constructing a dehazing network model: According to the atmospheric scattering physical model I(x) = J(x)t(x) + A[1-t(x)], 其中x表示输入的雾霾图像,I(x)表示重构后的有雾图像,J(x)表示原始的场景信息,t(x)表示透射率,A表示大气散射光;Where x represents the input haze image, I(x) represents the reconstructed haze image, J(x) represents the original scene information, t(x) represents the transmittance, and A represents the atmospheric scattered light; 基于解耦的思想将网络模型分为潜在的无雾图像层、传输图层和全局大气光层三层,对潜在的无雾图像以及传输图层使用两个阶段进行估计,第一阶段使用嵌入的暗通道先验进行粗略估计,第二阶段则分别使用无雾图像细化网络和传输图层细化网络进行精细化的估计,对于全局大气光,使用一个编码器-解码器网络来估计,在网络的顶端,将无雾图像细化网络、传输图层细化网络以及大气光估计网络三个子网络估计的图像再根据大气散射物理模型重构,获得重构后的雾霾图像;Based on the idea of decoupling, the network model is divided into three layers: potential fog-free image layer, transmission layer layer and global atmospheric light layer. The potential fog-free image and transmission layer are estimated in two stages. The first stage uses the embedded dark channel prior for rough estimation, and the second stage uses the fog-free image refinement network and the transmission layer refinement network for refined estimation. For the global atmospheric light, an encoder-decoder network is used for estimation. At the top of the network, the images estimated by the three sub-networks of the fog-free image refinement network, the transmission layer refinement network and the atmospheric light estimation network are reconstructed according to the atmospheric scattering physical model to obtain the reconstructed haze image. (3)将预处理后的数据集中的雾霾图像输入到构建好的去雾网络模型中,根据设计好的损失函数计算损失,不断迭代更新参数,对雾霾图像直接进行图像去雾。(3) The haze images in the preprocessed data set are input into the constructed dehazing network model, the loss is calculated according to the designed loss function, the parameters are continuously updated iteratively, and the haze images are directly dehazed. 2.根据权利要求1所述的一种基于零样本学习的两阶段解耦图像去雾方法,其特征在于:步骤(2),第一阶段的粗略估计具体包括以下步骤:2. According to the two-stage decoupled image defogging method based on zero-shot learning in claim 1, it is characterized in that: step (2), the rough estimation in the first stage specifically comprises the following steps: (2.1)计算输入图像x的暗通道:先取RGB三通道中的最小值来构建灰度图,然后再对灰度图取逆,进行最大池化操作,对最大池化后的结果取逆即得到Idark,具体公式为其中Ic指的是I的R,G,B通道之一;(2.1) Calculate the dark channel of the input image x: First, take the minimum value of the three RGB channels to construct a grayscale image, then invert the grayscale image, perform a maximum pooling operation, and invert the result after maximum pooling to obtain I dark . 
3. The two-stage decoupled image dehazing method based on zero-shot learning according to claim 2, characterized in that constructing the three sub-networks of the dehazing network model in step (2) specifically comprises the following steps:

(A) The haze-free image refinement network adopts an improved 5-level U-Net architecture whose convolution blocks are replaced by the introduced multi-scale Transformer blocks; SK fusion modules are used for the connection and fusion between different levels, and a soft reconstruction layer is used to obtain the global residual;

(B) The transmission map refinement network is a network with a non-degenerate structure composed of five convolutional layers; each of the first four layers contains only a convolutional layer, a batch normalization layer and a LeakyReLU activation function, while the last layer contains a convolutional layer and a sigmoid function that normalizes the output to [0, 1];

(C) The atmospheric light estimation network is a symmetric encoder-decoder network; the encoder and the decoder each have four layers, where every encoder layer contains a convolutional layer, a ReLU activation function and a max-pooling layer, and every decoder layer contains upsampling, a convolutional layer, a batch normalization layer and a ReLU activation function; between the encoder and the decoder, a reparameterization trick module is used, which converts the encoder output into a latent Gaussian distribution, resamples from it, and feeds the sample into the decoder.
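Illustrative sketch (not part of the claims): the reparameterization trick module of step (C), assuming two hypothetical 1×1 convolutions produce the mean and log-variance of the latent Gaussian from the encoder features.

```python
import torch
import torch.nn as nn

class Reparameterize(nn.Module):
    """Sketch of the module between encoder and decoder in step (C):
    map encoder features to a latent Gaussian, resample, and pass the
    sample on to the decoder."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_mu = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_logvar = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        mu = self.to_mu(feat)
        logvar = self.to_logvar(feat)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # resampled latent code
        return z, mu, logvar                  # mu/logvar reused by L_KL
```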
4. The two-stage decoupled image dehazing method based on zero-shot learning according to claim 3, characterized in that constructing the multi-scale Transformer block in step (A) specifically comprises the following steps:

(A1) A RescaleNorm layer is adopted as the normalization layer; the normalization process of RescaleNorm is expressed as

$$\hat{X} = \gamma \odot F\!\Bigl(\frac{X - \mu}{\sigma}\Bigr) + \beta,$$

where $F(\cdot)$ denotes the main part of the multi-scale Transformer block, the multi-scale self-attention; $\mu$ and $\sigma$ denote the mean and the standard deviation, respectively; $\gamma$ and $\beta$ are the learned scale factor and bias; $W_{\gamma}, B_{\gamma}$ and $W_{\beta}, B_{\beta}$ are the weights and bias terms of the two linear layers used to transform $\sigma$ and $\mu$, the transformation being expressed as $\{\gamma, \beta\} = \{\sigma W_{\gamma} + B_{\gamma},\ \mu W_{\beta} + B_{\beta}\}$; to accelerate convergence, $B_{\gamma}$ and $B_{\beta}$ are initialized to 1 and 0;

(A2) Multi-scale self-attention (MSSA) is adopted to compute self-attention; MSSA has two branches and aggregates multi-scale feature information within a window while preserving neighborhood information to a certain extent.
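Illustrative sketch (not part of the claims): a RescaleNorm wrapper following the formula of step (A1). The axes over which μ and σ are computed (here, the channel dimension of a token sequence), the epsilon, and the zero initialization of the linear weights are assumptions not fixed by the claim.

```python
import torch
import torch.nn as nn

class RescaleNorm(nn.Module):
    """X_hat = gamma * F((X - mu) / sigma) + beta, with
    {gamma, beta} = {sigma*W_gamma + B_gamma, mu*W_beta + B_beta}."""

    def __init__(self, dim: int, fn: nn.Module):
        super().__init__()
        self.fn = fn                      # e.g. the multi-scale self-attention
        self.w_gamma = nn.Linear(dim, dim)
        self.w_beta = nn.Linear(dim, dim)
        nn.init.zeros_(self.w_gamma.weight)
        nn.init.ones_(self.w_gamma.bias)  # B_gamma initialized to 1
        nn.init.zeros_(self.w_beta.weight)
        nn.init.zeros_(self.w_beta.bias)  # B_beta initialized to 0

    def forward(self, x: torch.Tensor):   # x: (B, N, dim) token sequence
        mu = x.mean(dim=-1, keepdim=True)
        sigma = x.std(dim=-1, keepdim=True) + 1e-5   # assumed epsilon
        gamma = self.w_gamma(sigma.expand_as(x))     # sigma*W_gamma + B_gamma
        beta = self.w_beta(mu.expand_as(x))          # mu*W_beta + B_beta
        return gamma * self.fn((x - mu) / sigma) + beta
```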
5. The two-stage decoupled image dehazing method based on zero-shot learning according to claim 4, characterized in that the multi-scale self-attention (MSSA) computed in step (A2) comprises two branches:

(A21) In the main branch, the query vector Q, the key vector K and the value vector V are processed differently: Q keeps the original scale and is obtained by an ordinary fully connected operation, while for K and V a scale transformation (ST) is used within the same self-attention layer to obtain K and V of different sizes, specifically expressed as

$$Q = X W^{Q},$$
$$K_i, V_i = ST_i(X)\,W_i^{K},\ ST_i(X)\,W_i^{V},$$
$$V_i = V_i + LE_i(V_i),$$

where X denotes the input feature sequence, $W^{Q}, W_i^{K}, W_i^{V}$ are the linear projection parameters of the i-th scale layer within the same head of the self-attention layer, and $LE_i(\cdot)$ is the local enhancement component implemented by depthwise convolution; $ST_i(\cdot)$ is the scale transformation operation of the i-th scale layer, which downsamples X and is specifically expressed as

$$ST_i(X) = LN(\operatorname{Conv}_i(X)),$$

where $LN(\cdot)$ denotes layer normalization and $\operatorname{Conv}_i(\cdot)$ denotes the convolution used by the i-th scale layer; different scale layers use different kernel sizes and strides, which produces different scale transformations, so within the same self-attention layer the keys and values capture features of different scales; the self-attention head is then computed as

$$\operatorname{head}_i = \operatorname{Softmax}\!\Bigl(\frac{Q K_i^{T}}{\sqrt{d}}\Bigr) V_i,$$

where $\operatorname{Softmax}(\cdot)$ is the activation function, T denotes the transpose operation and d is the dimension; the different heads are concatenated, and the multi-head self-attention (MH) of the main branch is computed as

$$MH(Q, K, V) = \operatorname{Concat}(\operatorname{head}_0, \ldots, \operatorname{head}_j)\,W^{O},$$

where $\operatorname{Concat}(\cdot)$ is the concatenation operation, $\operatorname{head}_i$ denotes the i-th self-attention head and $W^{O}$ is the linear projection parameter;

(A22) The other branch, which preserves neighborhood information, applies only a linear projection followed by a convolution to the input features; the multi-scale self-attention MSSA is then expressed as

$$MSSA = \operatorname{Concat}\bigl(MH\{Q, K_i, V_i\}_{i=1,\ldots,m}\bigr) + \operatorname{Conv}_0(X W_0),$$

where $\operatorname{Concat}(\cdot)$ is the concatenation operation, MH denotes multi-head self-attention, $Q, K_i, V_i$ denote the query vector, the key vector of the i-th scale layer and the value vector of the i-th scale layer, respectively, $\operatorname{Conv}_0(\cdot)$ denotes a convolution operation, X is the input feature sequence, and $W_0$ is the linear projection parameter.
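Illustrative sketch (not part of the claims): a simplified, single-head version of the MSSA of claim 5. Window partitioning and multi-head splitting are omitted, the scale set (1, 2) is an assumed example, and the spatial size is assumed divisible by each scale stride.

```python
import math
import torch
import torch.nn as nn

class MSSA(nn.Module):
    """Simplified multi-scale self-attention: full-resolution queries
    attend to keys/values from scale-transformed (strided conv +
    LayerNorm) copies of the input; a parallel linear + conv branch
    preserves neighborhood information."""

    def __init__(self, dim: int, scales=(1, 2)):
        super().__init__()
        self.scales = scales
        self.w_q = nn.Linear(dim, dim)
        self.conv_st = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=s, stride=s) for s in scales)
        self.ln_st = nn.ModuleList(nn.LayerNorm(dim) for _ in scales)
        self.w_k = nn.ModuleList(nn.Linear(dim, dim) for _ in scales)
        self.w_v = nn.ModuleList(nn.Linear(dim, dim) for _ in scales)
        self.le = nn.ModuleList(                     # LE_i: depthwise conv
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim) for _ in scales)
        self.merge = nn.Linear(len(scales) * dim, dim)
        self.w_0 = nn.Linear(dim, dim)               # neighborhood branch
        self.conv_0 = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x: torch.Tensor):              # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C)
        q = self.w_q(tokens)
        heads = []
        for i in range(len(self.scales)):
            y = self.conv_st[i](x)                   # Conv_i: downsample
            hi, wi = y.shape[2], y.shape[3]
            st = self.ln_st[i](y.flatten(2).transpose(1, 2))   # ST_i(X)
            k_i = self.w_k[i](st)
            v_i = self.w_v[i](st)
            v_sp = v_i.transpose(1, 2).reshape(b, c, hi, wi)
            v_i = v_i + self.le[i](v_sp).flatten(2).transpose(1, 2)
            attn = torch.softmax(q @ k_i.transpose(1, 2) / math.sqrt(c), dim=-1)
            heads.append(attn @ v_i)                 # head_i: (B, H*W, C)
        main = self.merge(torch.cat(heads, dim=-1))  # concat over scales
        local = self.conv_0(
            self.w_0(tokens).transpose(1, 2).reshape(b, c, h, w))
        return main + local.flatten(2).transpose(1, 2)
```

The sketch mirrors the formulas head_i = Softmax(QK_i^T/√d)V_i and MSSA = Concat(·) + Conv_0(XW_0); in the claimed block the attention would additionally run inside windows with several heads per scale.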
6. The two-stage decoupled image dehazing method based on zero-shot learning according to claim 5, characterized in that in step (3) the loss function is composed of five loss functions:

(3.1) The reconstruction loss $L_{Rec}$ computes the error between the input image and the reconstructed image, thereby constraining the whole network; $L_{Rec}$ is defined as

$$L_{Rec} = \|I(x) - x\|_F,$$

where $\|\cdot\|_F$ denotes the Frobenius norm of the given matrix, x is the input hazy image, and I(x) is the hazy image reconstructed from the outputs of the three sub-networks;

(3.2) The loss function $L_J$ computes the difference between the brightness and the saturation of the haze-free image estimate $J_R(x)$, specifically expressed as

$$L_J = \|V(J_R(x)) - S(J_R(x))\|_F,$$

where $V(\cdot)$ denotes brightness, $S(\cdot)$ denotes saturation, and $\|\cdot\|_F$ denotes the Frobenius norm of the given matrix;

(3.3) The atmospheric light loss $L_H$ computes the loss between the estimated atmospheric light A(x) and $A_{init}(x)$, where $A_{init}(x)$ is the initial atmospheric light automatically estimated from the input image data; $L_H$ is specifically expressed as

$$L_H = \|A(x) - A_{init}(x)\|_F;$$

(3.4) $L_{KL}$ is the variational inference loss of the reparameterization trick module in the atmospheric light estimation network, which minimizes the difference between the resampled latent code $\hat{z}$ in the network and the latent code z before sampling; mathematically,

$$L_{KL} = KL\bigl(\mathcal{N}(\mu_z, \sigma_z^2)\,\big\|\,\mathcal{N}(0, I)\bigr) = -\frac{1}{2}\sum_i \bigl(1 + \log \sigma_{z_i}^2 - \mu_{z_i}^2 - \sigma_{z_i}^2\bigr),$$

where $KL(\cdot)$ denotes the Kullback-Leibler divergence between two distributions, $z_i$ denotes the i-th dimension of z, $\mu_z$ and $\sigma_z$ denote the mean and standard deviation of z, respectively, and $\mu_{z_i}$ and $\sigma_{z_i}$ denote the mean and standard deviation of $z_i$, respectively;

(3.5) The regularization loss $L_{Reg}$ is set to avoid overfitting of the network, specifically expressed as

$$L_{Reg} = \frac{1}{n}\sum_{p}\Bigl\|A(p) - \frac{1}{|N(p)|}\sum_{q \in N(p)} A(q)\Bigr\|^2,$$

where $N(\cdot)$ denotes the second-order neighborhood, $|N(\cdot)|$ denotes the size of the second-order neighborhood, n denotes the number of pixels of A(x), and A(x) denotes the atmospheric light;

the total loss is defined by combining the above five loss functions:

$$L = L_{Rec} + L_J + L_H + L_{KL} + \lambda L_{Reg},$$

where the parameter λ in front of $L_{Reg}$ is a non-negative parameter used to balance the regularization, set to 0.1 in practice.
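Illustrative sketch (not part of the claims): the composite objective of claim 6 in PyTorch. The HSV-style definitions of brightness V and saturation S, and the 3×3 average-pooling surrogate for the second-order-neighborhood regularizer, are assumptions; A is assumed to be a per-pixel map of shape (B, 1, H, W).

```python
import torch
import torch.nn.functional as F

def total_loss(x, I_rec, J_R, A, A_init, mu_z, logvar_z, lam: float = 0.1):
    # L_Rec: Frobenius-norm error between input and reconstruction.
    l_rec = torch.norm(I_rec - x)
    # L_J: brightness-minus-saturation gap of the haze-free estimate,
    # using HSV-style V = max_c and S = 1 - min_c / max_c (assumed).
    v = J_R.max(dim=1).values
    s = 1.0 - J_R.min(dim=1).values / v.clamp(min=1e-6)
    l_j = torch.norm(v - s)
    # L_H: distance to the initially estimated atmospheric light.
    l_h = torch.norm(A - A_init)
    # L_KL: closed-form KL divergence to the standard normal prior.
    l_kl = -0.5 * torch.sum(1 + logvar_z - mu_z.pow(2) - logvar_z.exp())
    # L_Reg: smoothness of A over a local neighborhood, approximated
    # here by the deviation from a 3x3 local average.
    l_reg = (A - F.avg_pool2d(A, 3, stride=1, padding=1)).pow(2).mean()
    return l_rec + l_j + l_h + l_kl + lam * l_reg
```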
CN202310744185.6A 2023-06-25 2023-06-25 Two-stage decoupling image defogging method based on zero sample learning Pending CN116757957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310744185.6A CN116757957A (en) 2023-06-25 2023-06-25 Two-stage decoupling image defogging method based on zero sample learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310744185.6A CN116757957A (en) 2023-06-25 2023-06-25 Two-stage decoupling image defogging method based on zero sample learning

Publications (1)

Publication Number Publication Date
CN116757957A true CN116757957A (en) 2023-09-15

Family

ID=87952960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310744185.6A Pending CN116757957A (en) 2023-06-25 2023-06-25 Two-stage decoupling image defogging method based on zero sample learning

Country Status (1)

Country Link
CN (1) CN116757957A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457265A (en) * 2022-08-25 2022-12-09 暨南大学 Image defogging method and system based on generation countermeasure network and multi-scale fusion
CN116012253A (en) * 2023-01-18 2023-04-25 重庆邮电大学 Image defogging method of convolutional neural network based on fusion transducer

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457265A (en) * 2022-08-25 2022-12-09 暨南大学 Image defogging method and system based on generation countermeasure network and multi-scale fusion
CN116012253A (en) * 2023-01-18 2023-04-25 重庆邮电大学 Image defogging method of convolutional neural network based on fusion transducer

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LONGE*: "Vision Transformers for Single Image Dehazing (brief summary)", pages 1 - 3, Retrieved from the Internet <URL:https://blog.csdn.net/qq_68734642/article/details/129672426?fromshare=blogdetail&sharetype=blogdetail&sharerId=129672426&sharerefer=PC&sharesource=Dzqingirl&sharefrom=from_link> *
QIPENG GUO et al.: "Multi-Scale Self-Attention for Text Classification", ARXIV:1912.00544V1, 2 December 2019 (2019-12-02), pages 1 - 8 *
LIN SHENG; LI MIN; HU JIE; ZHAO LI: "Research on a two-stage decoupled image dehazing algorithm based on unsupervised learning", Computer Science and Application, no. 004, 31 December 2024 (2024-12-31) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118710547A (en) * 2024-08-30 2024-09-27 温州大学元宇宙与人工智能研究院 A convolutional neural network dehazing method based on prior information assistance
CN120298269A (en) * 2025-06-13 2025-07-11 泉州装备制造研究所 Zero-shot learning dehazing image enhancement method and device based on image decomposition

Similar Documents

Publication Publication Date Title
CN111062880B Underwater image real-time enhancement method based on conditional generative adversarial network
CN110310241B (en) Method for defogging traffic image with large air-light value by fusing depth region segmentation
CN109584170B (en) Underwater image restoration method based on convolutional neural network
CN116012253B Image defogging method of convolutional neural network based on fusion Transformer
Yang et al. Single image haze removal via region detection network
CN116311254B (en) Image target detection method, system and equipment under severe weather conditions
CN105719247A Feature learning-based single image defogging method
CN111179202B Single image defogging enhancement method and system based on generative adversarial network
CN111724400B (en) Automatic video matting method and system
CN109993804A (en) A road scene dehazing method based on conditional generative adversarial network
CN110363727B (en) Image dehazing method based on multi-scale dark channel prior cascaded deep neural network
CN114387195A (en) Infrared image and visible light image fusion method based on non-global pre-enhancement
Huang et al. Towards unsupervised single image dehazing with deep learning
CN116757957A (en) Two-stage decoupling image defogging method based on zero sample learning
CN106815826A (en) Night vision image Color Fusion based on scene Recognition
CN116343144B (en) A real-time object detection method based on fusion of visual perception and adaptive defogging
CN114140361B (en) Image dehazing method based on generative adversarial network integrating multi-level features
CN111582074A (en) Monitoring video leaf occlusion detection method based on scene depth information perception
CN110223251B Convolutional neural network underwater image restoration method suitable for artificial lamp light
CN115601657A (en) A method for ship target detection and recognition in bad weather
CN115471414A (en) Image rain and snow removing method based on exposure imaging model and modular depth network
CN115953311A (en) Image defogging method based on multi-scale feature representation of Transformer
Babu et al. An efficient image dahazing using Googlenet based convolution neural networks
Babu et al. Development and performance evaluation of enhanced image dehazing method using deep learning networks
CN119887582A (en) Double-branch defogging method based on Laplacian pyramid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination