Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an underwater image enhancement method based on frequency domain analysis and vision Mamba.
The aim of the invention can be achieved by the following technical scheme:
According to one aspect of the present invention, there is provided a method of underwater image enhancement based on frequency domain analysis and vision Mamba, the method comprising:
step S1, acquiring underwater image data;
S2, constructing a multi-level wavelet transformation network based on frequency domain analysis and vision Mamba, wherein the network is in an encoder-decoder structure, and the encoder comprises a space-frequency characteristic fusion module and discrete wavelet transformation;
the discrete wavelet transform in the encoder decomposes the underwater image into a low-frequency part and a high-frequency part, and the corresponding space-frequency feature fusion modules extract spatial features and frequency features from the two parts using Mamba and frequency-domain convolution, respectively; in the decoder, the data processed by the space-frequency feature fusion modules are spliced along the channel dimension in the order of low frequency first and high frequency second, and then reconstructed through the inverse discrete wavelet transform;
and step S3, inputting the underwater image data into the multi-stage wavelet transformation network constructed in the step S2 to obtain the enhanced underwater image data.
Preferably, the space-frequency feature fusion module comprises a spatial feature extraction part and a frequency feature extraction part arranged in parallel;
The data processed by the discrete wavelet transform first has its channels expanded by a linear layer and is then divided into two equal parts along the channel dimension; spatial features and frequency features are extracted from the two parts by the spatial feature extraction part and the frequency feature extraction part, respectively;
The spatial feature extraction part comprises a layer normalization and Mamba-based spatial feature extraction sub-module;
The frequency feature extraction part comprises a fast Fourier transform, two convolution-based frequency feature extraction sub-modules and an inverse fast Fourier transform, wherein the two convolution-based frequency feature extraction sub-modules respectively extract features from the amplitude and the phase of the spectrogram.
More preferably, the Mamba-based spatial feature extraction part includes a vision Mamba sub-module and a Mamba-based channel attention sub-module;
The vision Mamba sub-module selectively scans the unfolded input data along four different directions to obtain global dependence;
The Mamba-based channel attention sub-module comprises an adaptive average pooling function and a Sigmoid activation function; the adaptive average pooling result is selectively scanned in two directions by Mamba, and the Sigmoid activation function then generates a channel attention map.
More preferably, the space-frequency feature fusion module further comprises a feedforward neural network, which is used to splice the outputs of the spatial feature extraction part and the frequency feature extraction part and screen the data;
the feedforward neural network comprises a layer normalization and a gated attention sub-module.
More preferably, the gated attention sub-module comprises a depth-separable convolution, a SiLU activation function and a linear layer for channel adjustment; the input data passes through the depth-separable convolution to enlarge the channels and is divided into two equal parts along the channel dimension, one part passes through the SiLU activation function to obtain an attention map, and the other part takes the inner product with the attention map to produce the output.
Preferably, the multi-level wavelet transform network is an n-layer encoder-decoder structure, where n is 3 or more;
The encoder comprises layers 1 to n, wherein layer 1 of the encoder comprises a convolution module and a discrete wavelet transform;
Layers 1 to n of the decoder comprise a plurality of space-frequency feature fusion modules, convolution modules and inverse discrete wavelet transforms; the last layer of the decoder is a refinement layer, which comprises two convolution modules and a plurality of space-frequency feature fusion modules.
More preferably, layers 1 to n of the encoder are connected to layers 1 to n of the decoder, respectively, by skip connections.
More preferably, the process of reconstructing the enhanced underwater image by the decoder includes:
The nth layer of the decoder processes the low-frequency part and the high-frequency part with the corresponding space-frequency feature fusion modules, splices the processed results along the channel dimension in the order of low frequency first and high frequency second, and reconstructs the spliced result using the inverse discrete wavelet transform;
the convolution result of the nth layer of the encoder and the output of the nth layer of the decoder are spliced along the channel dimension to serve as the input of the (n-1)th layer of the decoder;
The (n-1)th layer of the decoder first uses convolution to reduce the number of channels of the input data, then processes the channel-reduced result with space-frequency feature fusion modules, splices the processed result with the high-frequency processing result of the (n-1)th layer of the encoder along the channel dimension, and feeds the spliced result into the inverse discrete wavelet transform for reconstruction; the reconstruction result is skip-connected with the convolution output of the (n-1)th layer of the encoder and serves as the input of the (n-2)th layer of the decoder;
Layers n-2 down to 1 of the decoder follow the same processing flow as the (n-1)th layer of the decoder;
The refinement layer of the decoder first uses convolution to reduce the number of channels of the input data, then refines the convolved result with space-frequency feature fusion modules, and finally uses convolution to output the enhanced underwater image.
Preferably, the number of space-frequency feature fusion modules in each layer of the encoder and decoder differs.
Preferably, the loss function L_total of the multi-level wavelet transform network is calculated as follows:
L_total = λ1·L_L1 + λ2·L_edge + λ3·L_fft + λ4·L_adv
wherein L_L1 is the L1 loss, i.e. the absolute value of the difference between the predicted value and the true value; L_edge is the boundary loss, which calculates the difference between the boundaries of two images; L_fft is the frequency loss, which calculates the difference between the phases and amplitudes of the two images; L_adv is the adversarial loss, which measures the difference between the data generated by the generator and the real data as well as the ability of the discriminator to distinguish real data from generated data; and λ1, λ2, λ3, λ4 are hyperparameters.
Compared with the prior art, the invention has the following beneficial effects:
1) The invention constructs a multi-level wavelet-transform underwater image enhancement network based on frequency domain analysis and vision Mamba. The wavelet transform separates the high-frequency and low-frequency parts, and features from both the spatial domain and the frequency domain are fused to enhance the underwater image with a small number of parameters. Mamba processes the spatial-domain data, and convolution processes the frequency-domain data, which consists of the phase and amplitude of the spectrogram, so that degraded regions of the underwater image are better enhanced according to the phase and amplitude. Mamba extracts the global dependencies of the image with linear complexity, so the model effectively improves the visual quality of the underwater image with fewer computational resources, greatly improving visual quality while reducing computational complexity and resource consumption.
2) Since the degradation of the underwater image is reflected in both the low and high frequencies, the invention processes the low-frequency and high-frequency parts separately, extracts spatial features to restore detail information and frequency-domain features to restore structural information, and fuses the two kinds of features to enhance the quality of the underwater image.
3) The invention introduces a vision Mamba sub-module and a Mamba-based channel attention sub-module. The unfolded input data is selectively scanned along different directions to obtain global dependencies, the adaptively average-pooled global information sequence is selectively scanned from two directions, and a Sigmoid then outputs a channel attention map; the attention map takes the inner product with the input, thereby emphasizing important channels and suppressing unimportant ones, alleviating the channel redundancy problem of Mamba and improving the underwater image enhancement capability.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Aiming at the poor generalization, high computational complexity and low speed of existing methods, we first note that deep-learning-based methods generalize better than traditional methods and physics-model-based methods, and that their time complexity is no greater than that of physics-model-based and traditional methods. We therefore aim to improve the generalization ability of deep-learning-based methods while reducing their time complexity, starting from both the spatial domain and the frequency domain with Mamba and convolution, so as to achieve effective and fast underwater image enhancement with fewer parameters and lower time complexity and to improve the visual quality of underwater images.
The embodiment relates to an underwater image enhancement method based on frequency domain analysis and vision Mamba, as shown in fig. 1, comprising the following steps:
Step S1, acquiring and constructing an underwater image dataset;
Step S2, establishing a multi-level wavelet transform network based on frequency domain analysis and vision Mamba;
Step S3, designing a loss function for the multi-level wavelet transform network;
Step S4, training the model using the underwater image dataset;
Step S5, testing the model using the underwater image dataset;
Step S6, using the model to enhance real underwater images.
The multi-level wavelet transform network based on frequency domain analysis and vision Mamba is a 3-layer encoder-decoder structure. Layer 1 of the encoder comprises one 3×3 convolution and a discrete wavelet transform; layer 2 of the encoder comprises a plurality of space-frequency feature fusion modules, one 3×3 convolution and a discrete wavelet transform; layer 3 of the encoder comprises a plurality of space-frequency feature fusion modules, one 3×3 convolution and a discrete wavelet transform.
Layer 1 of the decoder comprises a plurality of space-frequency feature fusion modules, one 3×3 convolution and an inverse discrete wavelet transform; layer 2 of the decoder comprises a plurality of space-frequency feature fusion modules, one 3×3 convolution and an inverse discrete wavelet transform; layer 3 of the decoder comprises a plurality of space-frequency feature fusion modules and one inverse discrete wavelet transform. The final layer of the decoder is a refinement layer comprising two 3×3 convolutions and a plurality of space-frequency feature fusion modules. The number of space-frequency feature fusion modules differs from layer to layer.
Skip connections are established between layer 1 of the encoder and layer 1 of the decoder, between layer 2 of the encoder and layer 2 of the decoder, and between layer 3 of the encoder and layer 3 of the decoder.
The discrete wavelet transform decomposes the input underwater image into low-frequency and high-frequency parts and downsamples the data; the high-frequency and low-frequency parts are processed separately by corresponding space-frequency feature fusion modules. The inverse discrete wavelet transform reconstructs and upsamples the input low-frequency and high-frequency parts.
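For illustration, the following is a minimal PyTorch sketch of a single-level 2-D Haar DWT/IDWT pair of the kind described here; the function names are ours and the exact normalization and wavelet implementation used by the invention may differ.

```python
import torch

def haar_dwt2d(x):
    """Single-level 2-D Haar DWT.

    x: (B, C, H, W) with even H and W.
    Returns the low-frequency band LL and the three high-frequency
    bands (LH, HL, HH), each of shape (B, C, H/2, W/2).
    """
    x00 = x[:, :, 0::2, 0::2]
    x01 = x[:, :, 0::2, 1::2]
    x10 = x[:, :, 1::2, 0::2]
    x11 = x[:, :, 1::2, 1::2]
    ll = (x00 + x01 + x10 + x11) / 2
    lh = (-x00 - x01 + x10 + x11) / 2
    hl = (-x00 + x01 - x10 + x11) / 2
    hh = (x00 - x01 - x10 + x11) / 2
    return ll, (lh, hl, hh)

def haar_idwt2d(ll, highs):
    """Inverse of haar_dwt2d: reconstructs the (B, C, H, W) tensor."""
    lh, hl, hh = highs
    b, c, h, w = ll.shape
    out = ll.new_zeros(b, c, h * 2, w * 2)
    out[:, :, 0::2, 0::2] = (ll - lh - hl + hh) / 2
    out[:, :, 0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[:, :, 1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[:, :, 1::2, 1::2] = (ll + lh + hl + hh) / 2
    return out
```

With this normalization the pair is exactly invertible, which matches the "downsampling with essentially no information loss" property used by the encoder.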
The space-frequency feature fusion module consists of three parts: spatial feature extraction, frequency feature extraction and a feedforward neural network. The spatial feature extraction comprises a layer normalization and a Mamba-based spatial feature extraction sub-module; the frequency feature extraction comprises a fast Fourier transform, two convolution-based frequency feature extraction sub-modules and an inverse fast Fourier transform; the feedforward neural network comprises a layer normalization and a gated attention sub-module. The data processed by the discrete wavelet transform first has its channels expanded by a linear layer and is divided into two equal parts along the channel dimension, which undergo spatial feature extraction and frequency feature extraction respectively; finally, the feedforward neural network screens the spliced result of the spatial and frequency feature extraction.
A fast Fourier transform (FFT) converts the spatial-domain data into a spectrogram, and an inverse fast Fourier transform (IFFT) converts the spectrogram back into the spatial domain.
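A short illustrative snippet (assuming PyTorch and its torch.fft module) showing the FFT/IFFT round trip and the amplitude/phase decomposition used by the frequency branch:

```python
import torch

# Illustrative only: split a feature map into amplitude and phase and
# reconstruct it, as the FFT/IFFT pair above describes.
x = torch.randn(1, 16, 64, 64)           # (B, C, H, W) spatial-domain features
spec = torch.fft.fft2(x)                 # complex spectrogram
amplitude = torch.abs(spec)              # |F|
phase = torch.angle(spec)                # arg(F)
# After the two convolution branches process amplitude and phase, the
# spectrum is reassembled and mapped back to the spatial domain:
real = amplitude * torch.cos(phase)
imag = amplitude * torch.sin(phase)
recon = torch.fft.ifft2(torch.complex(real, imag)).real
assert torch.allclose(recon, x, atol=1e-4)
```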
The Mamba-based spatial feature extraction module includes two linear layers for channel adjustment, one 3×3 depth-separable convolution, one SiLU activation function, one vision Mamba sub-module, one Mamba-based channel attention sub-module and one layer normalization. The input underwater image data first passes through a linear layer to expand the channels, then sequentially through the depth-separable convolution, the SiLU activation function, the vision Mamba sub-module, the Mamba-based channel attention sub-module and the layer normalization, and finally a linear layer restores the number of channels to the original size.
The vision Mamba sub-module selectively scans the expanded input data along 4 different directions to obtain global dependencies.
The Mamba-based channel attention sub-module includes one adaptive average pooling and one Sigmoid activation function. The adaptive average pooling result is selectively scanned in two directions by Mamba, and the Sigmoid activation function then generates a channel attention map.
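A hedged sketch of such a channel attention sub-module is given below. The Mamba block itself is passed in as a placeholder (any sequence model with a (B, L, D) -> (B, L, D) forward), treating the pooled channel descriptors as a length-C sequence, and combining the two scan directions by addition is our assumption.

```python
import torch
import torch.nn as nn

class MambaChannelAttention(nn.Module):
    """Sketch of the Mamba-based channel attention sub-module."""
    def __init__(self, mamba_block):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global information per channel
        self.mamba = mamba_block              # injected 1-D selective-scan block
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                     # x: (B, C, H, W)
        b, c, _, _ = x.shape
        seq = self.pool(x).view(b, c, 1)      # channel descriptors as a length-C sequence
        # bidirectional selective scan: front-to-back and back-to-front
        fwd = self.mamba(seq)
        bwd = self.mamba(seq.flip(1)).flip(1)
        attn = self.sigmoid(fwd + bwd).view(b, c, 1, 1)   # channel attention map
        return x * attn                       # emphasize important channels
```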
The gated attention sub-module includes one 3×3 depth-separable convolution, one SiLU activation function and one linear layer for channel adjustment. The input data passes through the depth-separable convolution to expand the channels and is divided into two equal parts along the channel dimension; one part passes through the SiLU activation function to obtain an attention map, and the other part takes the inner product with the attention map to produce the result.
The convolution-based frequency feature extraction module includes two 1×1 convolutions and two LeakyReLU activation functions. The input data passes sequentially through a 1×1 convolution, a LeakyReLU activation function, a 1×1 convolution and a LeakyReLU activation function.
Step S1, acquiring and constructing an underwater image dataset. Underwater image datasets are collected from the Internet, including paired and unpaired underwater image datasets. Commonly used underwater image datasets include EUVP, UIEB, LSUI, OceanDark, RUIE, and the like. A training set and a test set are constructed from the acquired underwater image datasets by random selection; the training set consists of paired underwater images, and the test set contains both paired and unpaired underwater images.
Step S2, establishing a multi-level wavelet transform network based on frequency domain analysis and vision Mamba.
Fig. 2 shows the overall structure of the multi-level wavelet transform network based on frequency domain analysis and vision Mamba. At layer 1 of the encoder, the input image passes through a 3×3 convolution to extract shallow features and double the number of channels, and the feature map is then decomposed into one low-frequency subband and three high-frequency subbands using a discrete Haar wavelet transform. This downsamples the data and reduces the computational resources required at the initial resolution with essentially no information loss. At layer 2 of the encoder, different numbers of space-frequency feature fusion modules are set according to the number of input channels, and the input high-frequency and low-frequency data are processed separately. Since the degradation of an underwater image is reflected in both the low and high frequencies, we consider that directly processing the data without separating the high and low frequencies would recover insufficient structural and detail information; therefore, we process the high and low frequencies separately. Afterwards, a 3×3 convolution doubles the channels of the low-frequency processing result to enhance the representation capability of the model, and the low-frequency data is further decomposed by a discrete Haar wavelet transform. The processing flow of layer 3 of the encoder is the same as that of layer 2. Layer 3 of the decoder processes the low-frequency and high-frequency data with different numbers of space-frequency feature fusion modules, splices the processed results along the channel dimension in the order of low frequency first and high frequency second, and reconstructs the spliced result using an inverse discrete Haar wavelet transform. The result of the inverse discrete Haar wavelet transform is low-frequency data, and skip connections assist the decoder in reconstruction; therefore, a skip connection is introduced at layer 3 of the decoder, and the convolution result of layer 3 of the encoder is spliced with the output of layer 3 of the decoder along the channel dimension as the input of layer 2 of the decoder. Layer 2 of the decoder first uses a 3×3 convolution to halve the number of channels of the input data, then processes the channel-reduced result with several space-frequency feature fusion modules, splices the processing result with the layer-2 high-frequency processing result of the encoder along the channel dimension, and feeds the spliced result into the inverse discrete Haar wavelet transform for reconstruction; the reconstruction result is skip-connected with the convolution output of layer 2 of the encoder and serves as the input of layer 1 of the decoder. The processing of layer 1 of the decoder is the same as that of layer 2. The last layer of the decoder (after layer 1) is a refinement layer, which first uses a 3×3 convolution to halve the number of channels of the input data, then refines the convolved result with several space-frequency feature fusion modules, and finally outputs the result through a 3×3 convolution.
Fig. 3 shows the structure of the space-frequency feature fusion module. The input data first passes through a linear layer that doubles the number of channels, so that the channels can be conveniently split. Spatial-domain processing is good at recovering detail information, while frequency-domain processing is good at recovering structural information; combining the two is considered to strengthen the representation learning ability of the model and improve the enhancement effect. Therefore, we feed the two split halves into the Mamba-based spatial feature extraction module and the convolution-based frequency feature extraction module, respectively, and then splice the outputs of the two modules along the channel dimension and input them to the gated attention sub-module. In the frequency branch, the data is converted from the spatial domain to the frequency domain using a fast Fourier transform (FFT), and the absolute value and the arctangent of the spectrogram are then computed to obtain the amplitude and the phase, respectively. Studies have shown that the amplitude can reflect the degradation of an image to some extent. Therefore, we feed the amplitude and the phase into separate convolution-based feature extraction modules, where the module corresponding to the phase omits the last LeakyReLU activation function because the phase lies in the range [-π, π]. The processed feature maps are recombined into a spectrogram, which is converted back to the spatial domain by an inverse fast Fourier transform. Finally, the spatial-domain result is fed into the layer normalization and the gated attention sub-module. To speed up training and prevent vanishing gradients, we introduce two residual connections in the module. The formulation of the space-frequency feature fusion module is as follows:
x_1, x_2 = Chunk(Linear(Input)) (1)
x_2^Freq = FFT(x_2) (2)
x_2^Amplitude = abs(x_2^Freq) (3)
x_2^Phase = arctan(x_2^Freq) (4)
x_2^Real = ConvBlock(x_2^Amplitude) × cos(x_2^Phase) (5)
x_2^Imag = ConvBlock(x_2^Amplitude) × sin(x_2^Phase) (6)
Output_mid = Linear(Cat[LN(VSSM(x_1)), iFFT(x_2^Real + i·x_2^Imag)]) + Input (7)
Output = LN(GatedAttention(Output_mid)) + Output_mid (8)
where Input denotes the input data after decomposition of the underwater image, Linear denotes a linear layer, Chunk denotes the channel split, x_1 and x_2 denote the split results, FFT denotes the fast Fourier transform, x_2^Freq denotes the spectrogram of x_2, abs denotes taking the absolute value, x_2^Amplitude denotes the amplitude, x_2^Phase denotes the phase, ConvBlock denotes the convolution-based frequency feature extraction module, x_2^Real denotes the real part of the complex number, x_2^Imag denotes the imaginary part of the complex number, VSSM denotes the Mamba-based spatial feature extraction module, LN denotes layer normalization, Cat denotes the splicing operation, iFFT denotes the inverse fast Fourier transform, Output_mid denotes the intermediate result, GatedAttention denotes the gated attention sub-module, and Output denotes the output.
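The following PyTorch sketch mirrors Eqs. (1)-(8). The sub-modules (VSSM, the two ConvBlocks and the gated attention) are injected placeholders since they are sketched separately, and, following the module description, a convolution branch is applied to the phase as well as to the amplitude.

```python
import torch
import torch.nn as nn

class SpaceFrequencyFusion(nn.Module):
    """Sketch of the space-frequency feature fusion block (Eqs. 1-8)."""
    def __init__(self, dim, vssm, conv_block_amp, conv_block_phase, gated_attention):
        super().__init__()
        self.expand = nn.Linear(dim, dim * 2)        # Eq. (1): widen, then split
        self.merge = nn.Linear(dim * 2, dim)
        self.norm_spatial = nn.LayerNorm(dim)
        self.norm_out = nn.LayerNorm(dim)
        self.vssm = vssm                             # Mamba-based spatial branch
        self.conv_amp = conv_block_amp               # frequency branch, amplitude
        self.conv_phase = conv_block_phase           # frequency branch, phase
        self.gate = gated_attention                  # gated-attention FFN

    def forward(self, x):                            # x: (B, H, W, C), channel-last
        x1, x2 = self.expand(x).chunk(2, dim=-1)     # spatial / frequency halves
        # frequency branch (Eqs. 2-6), FFT over the spatial dims
        spec = torch.fft.fft2(x2.permute(0, 3, 1, 2))
        amp, phase = torch.abs(spec), torch.angle(spec)
        amp, phase = self.conv_amp(amp), self.conv_phase(phase)
        real, imag = amp * torch.cos(phase), amp * torch.sin(phase)
        freq = torch.fft.ifft2(torch.complex(real, imag)).real.permute(0, 2, 3, 1)
        # spatial branch and fusion (Eq. 7), first residual connection
        spatial = self.norm_spatial(self.vssm(x1))
        mid = self.merge(torch.cat([spatial, freq], dim=-1)) + x
        # feedforward part (Eq. 8), second residual connection
        out = self.gate(mid.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return self.norm_out(out) + mid
```

The placeholders can be filled with the sketches of the individual sub-modules given below.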
Fig. 4 shows the structure of the Mamba-based spatial feature extraction module. The input data first has its number of channels expanded by a linear layer, and is then fed into a depth-separable convolution to extract local features and a SiLU activation function to add nonlinearity. Mamba is a selective state space model that can efficiently extract global dependencies of data with linear complexity. Leveraging the powerful global modeling capability of Mamba, we introduce the vision Mamba sub-module. As shown in Fig. 5, the input data is selectively scanned in 4 directions (upper-left to lower-right, lower-right to upper-left, upper-right to lower-left, lower-left to upper-right), and the scan results of the 4 directions are added to obtain the output. Considering that Mamba suffers from channel redundancy, we design a Mamba-based channel attention. As shown in Fig. 6, we first use adaptive average pooling to extract the global information of the data, then use Mamba to selectively scan the global information sequence in 2 directions (front-to-back, back-to-front), and finally use a Sigmoid to output the channel attention map. The attention map takes the inner product with the input, emphasizing important channels and suppressing unimportant ones. After the vision Mamba sub-module and the Mamba-based channel attention, the result is output through a layer normalization and a linear layer. The formulation of the Mamba-based spatial feature extraction module is as follows:
Output = Linear(LN(ChannelMamba(SpatialMamba(SiLU(DWConv(Linear(Input))))))) (9)
where DWConv denotes a depth-separable convolution, SiLU denotes the SiLU activation function, SpatialMamba denotes the vision Mamba sub-module, and ChannelMamba denotes the Mamba-based channel attention sub-module.
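A minimal sketch of Eq. (9); the vision Mamba sub-module and the channel attention are passed in as placeholders, and the expansion ratio of the first linear layer is an assumed parameter.

```python
import torch.nn as nn

class MambaSpatialFeatureExtraction(nn.Module):
    """Sketch of Eq. (9): Linear -> DWConv -> SiLU -> SpatialMamba
    -> ChannelMamba -> LN -> Linear."""
    def __init__(self, dim, expand, spatial_mamba, channel_mamba):
        super().__init__()
        hidden = dim * expand
        self.proj_in = nn.Linear(dim, hidden)                    # expand channels
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.SiLU()
        self.spatial_mamba = spatial_mamba                       # 4-direction scan, (B, C', H, W)
        self.channel_mamba = channel_mamba                       # Mamba-based channel attention
        self.norm = nn.LayerNorm(hidden)
        self.proj_out = nn.Linear(hidden, dim)                   # restore channel count

    def forward(self, x):                        # x: (B, H, W, C), channel-last
        x = self.proj_in(x).permute(0, 3, 1, 2)  # to (B, C', H, W)
        x = self.act(self.dwconv(x))             # local features + nonlinearity
        x = self.spatial_mamba(x)                # global dependencies in 4 directions
        x = self.channel_mamba(x)                # emphasize informative channels
        x = self.norm(x.permute(0, 2, 3, 1))     # back to channel-last for LN
        return self.proj_out(x)
```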
Fig. 7 shows the structure of the gated attention sub-module. Gated attention extracts local features through a depth-separable convolution and then splits them into 2 equal parts; one part generates an attention map through a SiLU activation function, and the attention map takes the inner product with the other part, thereby controlling the information flow. The formulation of the gated attention is as follows:
x_1, x_2 = Chunk(DWConv(Input)) (10)
Output = SiLU(x_1) × x_2 (11)
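A minimal PyTorch sketch of Eqs. (10)-(11); realizing the channel-doubling depth-separable convolution as a 1×1 pointwise expansion followed by a 3×3 depthwise convolution, and using a 1×1 convolution as the channel-adjusting linear layer, are our assumptions.

```python
import torch.nn as nn

class GatedAttention(nn.Module):
    """Sketch of the gated attention sub-module (Eqs. 10-11)."""
    def __init__(self, dim):
        super().__init__()
        # depth-separable convolution that also doubles the channels
        self.dwconv = nn.Sequential(
            nn.Conv2d(dim, dim * 2, kernel_size=1),
            nn.Conv2d(dim * 2, dim * 2, kernel_size=3, padding=1, groups=dim * 2),
        )
        self.act = nn.SiLU()
        # channel-adjusting linear layer, realized here as a 1x1 convolution
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                            # x: (B, C, H, W)
        x1, x2 = self.dwconv(x).chunk(2, dim=1)      # Eq. (10)
        return self.proj(self.act(x1) * x2)          # Eq. (11): gate one half with the other
```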
Fig. 8 shows the structure of the convolution-based frequency feature extraction module. Since each location of the spectrogram carries global information, we process the spectrogram with 1×1 convolutions and add nonlinearity through LeakyReLU activation functions. The formulation of the convolution-based frequency feature extraction module is as follows:
Output=LeakyReLU(Conv(LeakyReLU(Conv(Input)))) (12)
where Conv denotes a 1×1 convolution and LeakyReLU denotes the LeakyReLU activation function.
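A minimal sketch of Eq. (12); the keep_last_activation flag is our addition to cover the phase branch, whose final LeakyReLU is removed.

```python
import torch.nn as nn

def conv_frequency_block(channels, keep_last_activation=True):
    """Sketch of Eq. (12): two 1x1 convolutions, each followed by LeakyReLU.

    keep_last_activation=False reproduces the phase branch, whose final
    LeakyReLU is omitted because the phase lies in [-pi, pi].
    """
    layers = [
        nn.Conv2d(channels, channels, kernel_size=1),
        nn.LeakyReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=1),
    ]
    if keep_last_activation:
        layers.append(nn.LeakyReLU(inplace=True))
    return nn.Sequential(*layers)
```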
Step S3, designing the loss function. Training a deep neural network requires a loss function to constrain the network.
The loss function L_total of the multi-level wavelet transform network based on frequency domain analysis and vision Mamba is as follows:
L_total = λ1·L_L1 + λ2·L_edge + λ3·L_fft + λ4·L_adv (13)
where L_L1 is the L1 loss, L_edge is the boundary loss, L_fft is the frequency loss, L_adv is the adversarial loss, and λ1, λ2, λ3, λ4 are hyperparameters.
The L1 loss measures the absolute value of the difference between the predicted value and the true value. The use of L1 loss can reduce the effect of noise and outliers on the reconstruction results.
The boundary loss calculates the difference between the boundaries of two images. Boundary extraction based on the Sobel operator is first applied to the images, and then the L1 loss between the boundary maps of the reconstructed image and the reference image is computed. The boundary loss makes the model focus on the texture of the image, so the reconstructed result retains more texture information.
The frequency loss calculates the difference between the phases and amplitudes of two images. The images are converted to the frequency domain using the fast Fourier transform, the phase and amplitude of each spectrogram are computed, and the L1 losses between the phases and between the amplitudes of the reconstructed image and the reference image are calculated. The frequency loss optimizes the frequency representation of the image so that degraded regions are repaired more effectively.
The adversarial loss measures the difference between the data generated by the generator and the real data, as well as the ability of the discriminator to distinguish real data from generated data. PatchGAN is used as the discriminator. With the adversarial loss, the data generated by the generator becomes more realistic.
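The total loss can be sketched as follows; the λ weights shown are placeholders rather than the values used by the invention, and a least-squares formulation is assumed for the adversarial term on the generator side.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Per-channel Sobel gradient magnitude, used by the boundary (edge) loss."""
    c = img.shape[1]
    kx = img.new_tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def total_loss(pred, target, disc_fake_logits, lambdas=(1.0, 0.05, 0.1, 0.01)):
    """Sketch of L_total = λ1·L_L1 + λ2·L_edge + λ3·L_fft + λ4·L_adv.

    disc_fake_logits are the PatchGAN scores of the enhanced image;
    the lambdas defaults are placeholders, not the authors' values.
    """
    l1 = F.l1_loss(pred, target)
    edge = F.l1_loss(sobel_edges(pred), sobel_edges(target))
    ps, ts = torch.fft.fft2(pred), torch.fft.fft2(target)
    fft = F.l1_loss(torch.abs(ps), torch.abs(ts)) + F.l1_loss(torch.angle(ps), torch.angle(ts))
    adv = F.mse_loss(disc_fake_logits, torch.ones_like(disc_fake_logits))
    l1w, ew, fw, aw = lambdas
    return l1w * l1 + ew * edge + fw * fft + aw * adv
```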
Step S4, training the model using the underwater image dataset.
We take the original images in the paired underwater image dataset as input and the reference images as the supervision targets.
The number of iterations of the training model was set to 300000.
The optimizer is AdamW, with the learning rate initially set to 0.0003 and betas set to (0.9, 0.999).
The learning rate scheduler is a cosine annealing scheduler with an adjustment interval of [92000, 208000], within which the learning rate decays from 0.0003 to 0.000001.
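The optimizer and schedule described above can be set up as follows; implementing the windowed cosine decay via LambdaLR is our choice, and the helper name is illustrative.

```python
import math
import torch

def build_optimizer_and_scheduler(model, base_lr=3e-4, min_lr=1e-6,
                                  start=92_000, end=208_000):
    """AdamW with lr 3e-4 and betas (0.9, 0.999), cosine-annealed from
    base_lr to min_lr inside the iteration window [start, end] and held
    constant outside it."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                                  betas=(0.9, 0.999))

    def lr_lambda(step):
        if step < start:
            return 1.0
        if step >= end:
            return min_lr / base_lr
        t = (step - start) / (end - start)            # progress inside the window
        cos = 0.5 * (1.0 + math.cos(math.pi * t))     # 1 -> 0
        return (min_lr + (base_lr - min_lr) * cos) / base_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Typical loop over the 300000 training iterations:
#   for step in range(300_000):
#       ...forward/backward...
#       optimizer.step(); scheduler.step()
```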
Step S5, testing the network using the underwater image dataset. The enhancement effect of the model is tested on the test set of the underwater image dataset, and the enhanced images are evaluated using multiple evaluation indices.
The evaluation indices used in the present invention include reference evaluation indices and no-reference evaluation indices. The reference evaluation indices include peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and mean square error (MSE); the no-reference evaluation indices include the underwater image quality measure (UIQM), the underwater color image quality evaluation index (UCIQE) and the natural image quality evaluator (NIQE).
The peak signal-to-noise ratio (PSNR) is a reference evaluation index for measuring the quality and denoising effect of an image. It evaluates the quality of image enhancement by comparing the difference between the enhanced image and the reference image. The higher the value of PSNR, the closer the enhanced image is to the reference image, and the smaller the distortion.
Structural Similarity (SSIM) is a reference evaluation index that aims to measure the structural similarity between two images. Structural similarity takes into account the brightness, contrast and structural variations of the image, so it can better reflect the perception of image quality by the human visual system. The SSIM index ranges from 0 to 1, where 1 indicates that the two images are completely similar and 0 indicates that they are completely different. The higher the SSIM value, the better the image quality.
The Mean Square Error (MSE) is a reference evaluation index used in image enhancement to quantify the difference between an enhanced image and a reference image, and the image quality is evaluated by calculating the mean square of the difference in pixel values of the two images. The lower the value of MSE, the closer the enhanced image is to the reference image, and the better the image quality.
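For reference, MSE and the PSNR derived from it can be computed as follows (images assumed to be scaled to [0, 1]):

```python
import torch

def mse_psnr(enhanced, reference, max_val=1.0):
    """MSE between the enhanced and reference images, and the PSNR derived from it."""
    mse = torch.mean((enhanced - reference) ** 2)
    psnr = 10.0 * torch.log10(max_val ** 2 / mse)
    return mse.item(), psnr.item()
```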
The underwater image quality measure (UIQM) is a no-reference evaluation index. Targeting the degradation mechanism and imaging characteristics of underwater images, it uses a colorfulness measure (UICM), a sharpness measure (UISM) and a contrast measure (UIConM) as its basis, and UIQM is expressed as a linear combination of the three. The higher the UIQM value, the better the quality of the enhanced image.
The underwater color image quality evaluation index (UCIQE) is a no-reference evaluation index. UCIQE linearly combines chroma, saturation and contrast to quantitatively evaluate the color cast, blur and low contrast of underwater images. The higher the UCIQE value, the better the quality of the enhanced image.
The natural image quality evaluator (NIQE) is a no-reference evaluation index. NIQE evaluates poor visual quality in images, such as noise, blur and distortion, independently of the specific image content. Lower NIQE values indicate better image quality, and higher values indicate worse quality.
Step S6, enhancing real underwater images using the model.
An underwater robot or diver captures an underwater image with a camera; the image is input into the trained model, which outputs the enhanced underwater image.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.