Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the specific technical solutions of the present application will be described in further detail below with reference to the accompanying drawings in the embodiments of the present application. The following examples are illustrative of the application and are not intended to limit the scope of the application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
It should be noted that the terms "first/second/third" in the embodiments of the present application are merely used to distinguish similar or different objects and do not represent a specific ordering of the objects. It is to be understood that "first/second/third" may be interchanged in a specific order or sequence, where allowed, so that the embodiments of the present application described herein can be practiced in an order other than that illustrated or described herein.
Super-resolution processing converts a low-resolution image into a corresponding high-resolution image. For example, for an image with 540P resolution, an image with 1080P resolution can be obtained after super-resolution processing. In the super-resolution processing process, the representation value of each pixel position in the low-resolution image is first acquired; then each representation value is processed by a trained image reconstruction model, which outputs multi-channel data, i.e., a plurality of values related to each representation value; these values serve as the representation values of new pixel positions in the super-resolution image; finally, the high-resolution image is generated by arranging the representation values of the new pixel positions.
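As an illustration of the last step, the rearrangement of the multi-channel output into new pixel positions can be sketched as a depth-to-space operation; the following Python snippet is only a hedged example of this layout, and the function and array names are assumptions rather than the application's actual implementation.

```python
# A minimal sketch (not the application's exact model): rearranging r*r output
# channels per low-resolution pixel into an r-times larger image plane
# (a "depth-to-space" layout). All names here are illustrative assumptions.
import numpy as np

def depth_to_space(features: np.ndarray, r: int) -> np.ndarray:
    """features: (H, W, r*r) values predicted per low-resolution pixel position.
    Returns an (H*r, W*r) high-resolution plane."""
    h, w, c = features.shape
    assert c == r * r, "expected r*r channels per low-resolution pixel"
    # Each low-resolution position (i, j) fills an r x r block of the output.
    return features.reshape(h, w, r, r).transpose(0, 2, 1, 3).reshape(h * r, w * r)

# Example: a 270x480 map with 4 channels becomes a 540x960 plane (2x magnification).
hr_plane = depth_to_space(np.random.rand(270, 480, 4).astype(np.float32), r=2)
print(hr_plane.shape)  # (540, 960)
```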
The application scenario of the embodiment of the present application in the image reconstruction model described below is for more clearly describing the technical solution of the embodiment of the present application, and does not constitute a limitation on the technical solution provided by the embodiment of the present application. As can be known by those skilled in the art, with the appearance of a new service scenario, the technical solution provided by the embodiment of the present application is applicable to similar technical problems.
The trained image reconstruction model is deployed in an image playing application installed on a terminal with image display and/or video playback functions, such as a cell phone, tablet, personal computer (PC) or wearable device. The image playing application may be, for example, any application capable of playing video and/or displaying images (e.g., a camera application), and the like. The user opens the image playing application on the terminal and can request to play a video or display an image.
Fig. 1 is a schematic view of an application scenario of an image reconstruction model. As shown in fig. 1, when a user 101 requests to play a video through the image playing application, the terminal 102 requests the video from a server 103 providing a video playing service. After receiving the video returned by the server 103, the terminal 102 extracts the low-resolution video therein, determines each video frame 104 in the low-resolution video as a low-resolution image, and inputs the low-resolution image as an image to be processed into at least one image reconstruction model. The at least one image reconstruction model performs super-resolution processing on the image to be processed to obtain at least one frame of high-resolution image: when there is one high-resolution image, it is taken as the target image; when there are a plurality of high-resolution images, they are fused to obtain the target image. Each target image corresponds to a video frame. After the target images are obtained, the image playing application plays them sequentially according to factors such as the playing time sequence, thereby realizing playback of the high-resolution video.
Similarly, when the terminal handles a single image, the single image can be directly used as the image to be processed, and the obtained target image is then displayed to the user. In this way, on the one hand, the terminal only needs to download low-resolution video data, which saves bandwidth and terminal traffic; on the other hand, the storage space of the server is saved; furthermore, the user can conveniently watch higher-definition video.
The user can request display switching between the low resolution image and the high resolution image by clicking the switching button. Of course, the terminal can also receive user operation through an application interface, and only when the user selects the high-resolution play button on the application interface, the terminal performs super-resolution processing on the low-resolution image to obtain a high-resolution target image and plays the target image to the user.
Fig. 2 is a schematic diagram of another application scenario of an image reconstruction model, where the image reconstruction model is configured in an application server 201 that provides a video playing service or an image display service, and a corresponding video playing application is installed on a terminal 202. The application server 201 may store low resolution video or image therein, and when the user 203 requests to view the low resolution video or image through the video playing application installed on the terminal 202, the application server 201 may transmit the data of the super resolution processed video or image to the terminal 202 and display the data to the user through the application installed on the terminal 202.
While watching a low-resolution video, if the user wishes to watch a high-resolution video, the user may click a high-resolution play button set on the application interface. At this time, the terminal 202 sends a K-times high-resolution play request to the application server 201. The application server 201 determines the low-resolution video data in response to the high-resolution play request, acquires an image to be processed therefrom (for example, a single low-resolution picture or a video frame in the low-resolution video), performs super-resolution processing through at least one image reconstruction model, and finally outputs the target image 204. The target image is transmitted to the terminal 202, and the terminal 202 displays the high-resolution target image through the audio-visual playing application or plays the high-resolution target images sequentially according to the playing time sequence, thereby realizing super-resolution video playback.
Fig. 3 is a schematic diagram of an audio-visual playing interface. After the user clicks the "super-resolution" play button 301 at the lower right corner, the audio-visual player performs super-resolution processing on the video or image in the manner corresponding to fig. 1, or the related application server performs super-resolution processing in the manner corresponding to fig. 2, so as to provide the super-resolution video or image to the user. The "super-resolution" button 301 in the lower right corner may also be replaced by a plurality of buttons of other types, such as a "2× super-resolution" icon button 302 and a "4× super-resolution" icon button 303, etc.
In some embodiments, the server may also switch between the low resolution image and the high resolution image based on the network resource information between the server and the terminal, automatically convert the low resolution image to a high resolution target image to be sent to the user when the network resource satisfies a condition (e.g., when the bandwidth is sufficient), and send the low resolution image when the network resource does not satisfy the condition (e.g., when the bandwidth is small).
The image processing method of the embodiment of the present application is described in detail below in two aspects: on the one hand, the training of the model; on the other hand, the process of performing super-resolution processing on the image to be processed based on the at least one trained image reconstruction model.
Both the model training method and the image processing method provided by the embodiments of the application can be applied to an electronic device, where the electronic device can be a terminal or a server.
Fig. 4A is a schematic flow chart of an implementation of the model training method according to an embodiment of the present application, as shown in fig. 4A, the method at least may include the following steps 401 to 403:
In step 401, the original image set is subjected to downsampling processing according to each preset downsampling parameter value, so as to obtain a downsampled image set with a corresponding downsampling parameter value.
The preset downsampling parameter values may include one or more different downsampling parameter values. When one downsampling parameter value is included, one image reconstruction model is correspondingly trained; when a plurality of values are included, a plurality of image reconstruction models are correspondingly trained. For example, as shown in fig. 5, the preset downsampling parameter values are 2×2, 3×3 and 4×4: downsampling each image 501 in the original image set according to 2×2 correspondingly yields a downsampled image set 502, downsampling each image 501 according to 3×3 correspondingly yields a downsampled image set 503, and downsampling each image 501 according to 4×4 correspondingly yields a downsampled image set 504. It should be noted that the term "binning" means merging, and "2×2 binning" means merging 2×2 pixels into 1 pixel, for example taking the average of the pixel values of the 4 pixels as the value of the merged pixel, so "binning" may also be understood as downsampling.
In some embodiments, when downsampling an original image in the original image set according to the downsampling parameter value to obtain the downsampled image, the electronic device may perform Gaussian blur processing on the original image with the downsampling parameter value as the size of the Gaussian kernel to obtain a Gaussian blurred image, and then perform bicubic interpolation on the Gaussian blurred image according to the downsampling parameter value to obtain the downsampled image.
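A minimal sketch of this downsampling step, assuming OpenCV, is given below. Note that OpenCV's GaussianBlur requires an odd kernel size, so the kernel is padded to the next odd value here; that adjustment, like the function and parameter names, is an assumption for illustration.

```python
# Hedged sketch: Gaussian blur (kernel size tied to the downsampling parameter
# value) followed by bicubic resizing. Names and the odd-kernel padding are
# assumptions, not the application's exact implementation.
import cv2

def downsample(original, factor: int):
    """original: H x W (x C) image; factor: downsampling parameter value, e.g. 2 for 2x2."""
    k = factor if factor % 2 == 1 else factor + 1      # OpenCV needs an odd kernel size
    blurred = cv2.GaussianBlur(original, (k, k), sigmaX=0)
    h, w = original.shape[:2]
    return cv2.resize(blurred, (w // factor, h // factor), interpolation=cv2.INTER_CUBIC)
```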
And step 402, generating a training sample set corresponding to the downsampled image set according to each downsampled image set and a truth image (Ground Truth, GT) corresponding to the downsampled image set, wherein the truth image is obtained by fusing multiple frames of reference images, and the resolution of the truth image is larger than that of the corresponding sample image.
It will be appreciated that in supervised learning, each downsampled image corresponds to a frame of truth image, and in semi-supervised learning, each image of a portion of the downsampled images in the set corresponds to a frame of truth image.
In implementing step 402, the device may first determine a second displacement of each downsampled image with respect to the corresponding truth image, then displace each downsampled image pixel by pixel according to the corresponding second displacement so as to align it with the corresponding truth image, thereby obtaining a corresponding sample image, and generate the training sample set using each sample image and the corresponding truth image as a data pair. With this alignment, the trained image reconstruction model can output a target image free of local blurring and deformation when performing super-resolution processing on a low-resolution image.
The method of determining the second displacement is the same as the method of determining the first displacement. For example, the electronic device may detect the feature points of the two frames of images and extract feature description information for each feature point, match the feature points of the two frames according to this feature description information so as to find a plurality of feature point matching pairs between the two frames, determine a homography matrix between the two frames according to the pixel coordinates of the matched feature points, and finally determine, according to the homography matrix, the displacement of one frame relative to the other, that is, the displacement of each pixel in one frame relative to the corresponding pixel in the other frame.
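For illustration, a hedged sketch of this feature-based displacement estimation using OpenCV is shown below; the use of SIFT plus RANSAC-based homography fitting, and all function names, are assumptions consistent with (but not identical to) the DLT-based procedure described later.

```python
# Hedged sketch: estimate a homography between two frames from matched SIFT
# feature points, then derive a per-pixel displacement map. Illustrative only.
import cv2
import numpy as np

def estimate_displacement(moving_gray, reference_gray):
    """Return (mv_x, mv_y): per-pixel displacement of moving_gray relative to reference_gray."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(moving_gray, None)
    kp2, des2 = sift.detectAndCompute(reference_gray, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)   # Euclidean-distance matching
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    h, w = moving_gray.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T   # homogeneous coords
    warped = H @ pts
    warped = warped[:2] / warped[2]                        # normalize by the third coordinate
    mv_x = warped[0].reshape(h, w) - xs
    mv_y = warped[1].reshape(h, w) - ys
    return mv_x, mv_y
```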
In some embodiments, the device may post-process the downsampled image prior to aligning the downsampled image to the corresponding truth image and then align the post-processed downsampled image to the corresponding truth image. The post-processing is not described in detail here, see in particular the examples of implementation of the post-processing below.
And step 403, training the original deep learning model by using the training sample set under each downsampling parameter value to obtain an image reconstruction model under the corresponding downsampling parameter value.
The structure of the original deep learning model may vary; for example, the model may be a recurrent neural network (Recurrent Neural Network, RNN), a recursive neural network (Recursive Neural Network, RNN), a convolutional neural network (Convolutional Neural Network, CNN), or the like.
It can be appreciated that the super-resolution conversion magnification of the trained image reconstruction model differs for training sample sets under different downsampling parameter values. Still taking fig. 5 as an example, the training sample set 512 obtained at a downsampling parameter value of 2×2 binning, which includes truth images and the sample images aligned with them, trains an image reconstruction model 522 whose super-resolution conversion magnification is 2, i.e., which has 2× magnification capability. When used, the model 522 transforms each pixel position in a low-resolution image to be processed into 2×2 pixel positions. Similarly, the training sample set 513 obtained at a downsampling parameter value of 3×3 binning trains an image reconstruction model 523 with 3× magnification capability, and the training sample set 514 obtained at a downsampling parameter value of 4×4 binning trains an image reconstruction model 524 with 4× magnification capability.
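The application does not fix a particular network architecture; purely for illustration, the following PyTorch sketch shows one plausible shape of such a model, an ESPCN-style convolutional network whose pixel-shuffle output layer gives it r-times magnification capability. The architecture, layer sizes and names are assumptions.

```python
# Hedged sketch (assumed architecture): a small convolutional reconstruction
# model whose nn.PixelShuffle layer turns r*r output channels per pixel into
# an r-times larger image, matching the magnification capability described above.
import torch
import torch.nn as nn

class ReconstructionModel(nn.Module):
    def __init__(self, scale: int = 2, channels: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, channels * scale * scale, 3, padding=1),
        )
        self.upsample = nn.PixelShuffle(scale)   # rearranges channels into spatial positions

    def forward(self, x):
        return self.upsample(self.body(x))

# A 2x model maps a 1x1x270x480 input to 1x1x540x960; training would minimize,
# e.g., an L1/L2 loss between the model output and the aligned truth image.
model = ReconstructionModel(scale=2)
print(model(torch.rand(1, 1, 270, 480)).shape)   # torch.Size([1, 1, 540, 960])
```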
In the embodiment of the application, a model training method is provided, in which, in a training sample set for training an original deep learning model, a true value image corresponding to a sample image is obtained by fusion of multiple frames of reference images instead of only taking a single frame of reference image as the true value image, so that the signal to noise ratio of the obtained true value image is higher, the details are clearer, the performance of the image reconstruction model obtained by training is better, and when the model is used, the image to be processed can be reconstructed into a target image with higher quality. I.e. the signal-to-noise ratio of the output target image is higher and the details are clearer.
In research it was found that a truth image obtained by fusing multiple frames of reference images may deviate slightly from the corresponding downsampled image, and this deviation may cause local blurring or deformation in the output target image when the trained image reconstruction model processes an image to be processed. To solve this problem, in some embodiments, step 402 of generating a training sample set according to the set of downsampled images and the corresponding truth images may, as shown in fig. 4B, be implemented by a device such as a terminal or a server through the following steps 4021 to 4023:
step 4021, determining a second displacement of each of the downsampled images relative to the corresponding truth image.
When the method is implemented, the device can find a plurality of feature point matching pairs between the downsampled image and the corresponding truth image, then solve a homography matrix between them through Direct Linear Transformation (DLT) based on the feature point matching pairs, and finally determine the displacement of the downsampled image relative to the corresponding truth image according to the homography matrix.
Step 4022, shifting each downsampled image pixel by pixel according to the corresponding second displacement to align it to the corresponding truth image, thereby obtaining a corresponding sample image;
In step 4023, the training sample set is generated using each of the sample images and the corresponding truth image as a data pair.
An embodiment of the present application further provides a model training method, and fig. 6 is a schematic flow chart of another implementation of the model training method according to the embodiment of the present application, as shown in fig. 6, where the method at least may include the following steps 601 to 610:
In step 601, N frame candidate images are acquired, N being an integer greater than 1.
In general, the candidate image is a high resolution image. The N-frame candidate image may be a multi-frame image acquired for an actual scene by a camera (e.g., a single-lens reflex or a mobile phone having a high-definition photographing function, etc.). Of course, the N frame candidate image may also be a simulated high resolution image. Compared with the latter, the image reconstruction model obtained according to the former has better universality, and the reconstructed high-resolution image (namely the target image) has better image quality for the truly acquired image to be processed.
Step 602, determining the definition of each candidate image.
The definition may be characterized by a variety of image parameters. In some embodiments, the definition is characterized by sharpness: sharpness estimation is performed on each candidate image to obtain the sharpness of the corresponding image. For example, the sharpness of an image may be determined by a sharpness estimation model as shown in the following equation (1):
In formula (1), the entire image is divided into k_1 × k_2 windows, where I_max(k, l) and I_min(k, l) represent the maximum luminance value and the minimum luminance value in the (k, l)-th window, respectively.
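Since the exact form of equation (1) is not reproduced here, the following Python sketch only assumes a window-based measure built from the per-window maximum and minimum luminance described above (a normalized max-min contrast averaged over the k_1 × k_2 windows); the actual equation of the application may differ.

```python
# Hedged sketch of a window-based sharpness measure consistent with the
# description above; the specific formula is an assumption, not equation (1) itself.
import numpy as np

def sharpness(luma: np.ndarray, k1: int = 8, k2: int = 8, eps: float = 1e-6) -> float:
    """luma: H x W luminance plane (e.g. the Y channel), split into k1 x k2 windows."""
    h, w = luma.shape
    hs, ws = h // k1, w // k2
    scores = []
    for i in range(k1):
        for j in range(k2):
            win = luma[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
            i_max, i_min = float(win.max()), float(win.min())
            scores.append((i_max - i_min) / (i_max + i_min + eps))   # local contrast
    return float(np.mean(scores))
```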
And step 603, determining an image whose definition meets a specific condition in the N frames of candidate images as the reference image, thereby obtaining multiple frames of reference images.
In some embodiments, an image of the N frame candidate images having a sharpness greater than a particular threshold is determined as the reference image.
For example, the camera continuously collects 30 candidate images, 8 images with higher sharpness are selected as reference images, and a true image is obtained based on fusion. It will be appreciated that the purpose of multi-frame fusion is to be able to obtain true images with higher signal-to-noise ratios and clearer details. Then, the higher the definition of the image used for fusion, the clearer the true image obtained by fusion, and further the better the performance of the model obtained by training. In other words, when the model is used online, a high-resolution target image with clearer details and higher signal-to-noise ratio can be reconstructed.
In step 604, one frame of the multi-frame reference image is selected as the first reference image.
In implementation, any frame of the multiple frames of reference images can be selected as the first reference image, i.e., the image to which the other reference images are to be aligned.
Step 605, determining a first displacement of each other reference image in the multiple frames of reference images relative to the first reference image.
The method for determining the first displacement is the same as the method for determining the second displacement described above, and therefore will not be described here.
Step 606, shifting each of the other reference images pixel by pixel according to the corresponding first shift to align the other reference images to the first reference image, thereby obtaining a corresponding second reference image;
And step 607, fusing the first reference image and each second reference image to obtain a truth image.
It should be noted that the method of image fusion, i.e., the implementation of step 607, may vary. The device can realize image fusion in the spatial domain or in the frequency domain.
In the spatial domain implementation, for example, the sharpness of each region of the first reference image and each of the second reference images may be determined first, the weight of each pixel position in the corresponding region is determined according to the sharpness of each region, and the first reference image and each of the second reference images are weighted and averaged according to the pixel position and the weight of the pixel position to obtain the true image.
The pixel values used for fusion may be of various types. For example, the value may be the brightness value V at a pixel position, the lightness value L at a pixel position, the luminance channel data (i.e., the Y channel value) at a pixel position, any one or more of the Red (R), Green (G), Blue (B) values at a pixel position, the gray values of each channel of a multispectral camera, the gray values of each channel of a special camera (infrared camera, ultraviolet camera, depth camera), or the like.
It will be appreciated that sharpness has a certain mapping relation to weights. The greater the sharpness of an image region, the greater the weighting of pixel locations in that region. In this way, the influence of the blurred region in the reference image on the sharpness of the true image can be reduced when image fusion is performed.
An image may not be uniformly sharp across its regions: the in-focus part is sharp while the out-of-focus part is blurred. If the weights were not used and the values were simply averaged, the image information of the out-of-focus region would degrade the definition of the corresponding region of the truth image. Therefore, the blurred region is given a lower weight and the sharp region a higher weight, so that each region of the fused truth image is as clear as possible.
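A hedged sketch of this sharpness-weighted spatial-domain fusion follows; the particular per-window contrast weighting used here is an assumption, the application only requiring that sharper regions receive larger weights.

```python
# Hedged sketch: fuse registered frames with per-region weights that grow with
# local contrast, so blurred regions contribute less. Weighting rule is assumed.
import numpy as np

def fuse_weighted(aligned_frames, window: int = 16, eps: float = 1e-6):
    """aligned_frames: list of H x W luminance planes already aligned to the first reference image."""
    stack = np.stack(aligned_frames, axis=0).astype(np.float64)    # K x H x W
    k, h, w = stack.shape
    weights = np.zeros_like(stack)
    for idx in range(k):
        for y in range(0, h, window):
            for x in range(0, w, window):
                win = stack[idx, y:y + window, x:x + window]
                contrast = (win.max() - win.min()) / (win.max() + win.min() + eps)
                weights[idx, y:y + window, x:x + window] = contrast
    weights /= weights.sum(axis=0, keepdims=True) + eps            # normalize over frames
    return (stack * weights).sum(axis=0)                           # weighted average per pixel
```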
When image fusion is realized in the frequency domain, in some embodiments the device may apply a wavelet transform to perform multi-level filtering on the first reference image and each second reference image to be fused, decomposing each image into its high-frequency sub-images and low-frequency sub-image; the high-frequency sub-images of the images are fused to form high-frequency fusion coefficients, the low-frequency sub-images are fused to form low-frequency fusion coefficients, and an inverse wavelet transform is then performed on the high-frequency components corresponding to the high-frequency fusion coefficients and the low-frequency components corresponding to the low-frequency fusion coefficients to generate the truth image. In other embodiments, the device may perform a Discrete Fourier Transform (DFT) or a Discrete Cosine Transform (DCT) on each reference image to convert the image signal into the frequency domain, perform image averaging in the frequency domain, and then inverse transform back into the spatial domain to obtain the truth image.
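A rough sketch of the wavelet-domain variant is given below, assuming the PyWavelets package; the fusion rules chosen here (larger-magnitude coefficient for high-frequency sub-bands, simple averaging for the low-frequency sub-band) are assumptions for illustration.

```python
# Hedged sketch of wavelet-domain fusion of two registered frames using PyWavelets.
# Decomposition depth, wavelet and fusion rules are illustrative assumptions.
import numpy as np
import pywt

def fuse_wavelet(img_a, img_b, wavelet: str = "haar", level: int = 2):
    ca = pywt.wavedec2(img_a, wavelet, level=level)
    cb = pywt.wavedec2(img_b, wavelet, level=level)
    fused = [(ca[0] + cb[0]) / 2.0]                                  # low-frequency: average
    for (ha, va, da), (hb, vb, db) in zip(ca[1:], cb[1:]):
        fused.append(tuple(np.where(np.abs(x) >= np.abs(y), x, y)    # high-frequency: keep larger magnitude
                           for x, y in ((ha, hb), (va, vb), (da, db))))
    return pywt.waverec2(fused, wavelet)
```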
In step 608, the original image set is subjected to downsampling according to each preset downsampling parameter value, so as to obtain a downsampled image set with a corresponding downsampling parameter value.
It should be noted that the execution order of the step of acquiring the truth images (corresponding to steps 601 to 607) and the step of acquiring the sets of downsampled images (i.e., step 608) is not limited: steps 601 to 607 may be performed first and then step 608, or step 608 may be performed first to obtain the downsampled image sets under the different downsampling parameter values and the truth images then determined through steps 601 to 607. The step of determining the truth images and the step of determining the downsampled image sets may also be performed in parallel.
Step 609, generating a training sample set according to the downsampled image set and the truth image corresponding to the downsampled image;
And step 610, training the original deep learning model by using the training sample set under each downsampling parameter value to obtain an image reconstruction model under the corresponding downsampling parameter value.
An embodiment of the present application provides an image processing method, and fig. 7 is a schematic flow chart of an implementation of the image processing method according to the embodiment of the present application, as shown in fig. 7, the method may include the following steps 701 to 703:
Step 701, acquiring an image to be processed.
As described in the application scenario above, the image to be processed may be any video frame image in the low resolution video, or may be a single low resolution image, for example, an image captured by the user through the terminal.
Step 702, invoking at least one trained image reconstruction model, wherein the image reconstruction model is obtained after training based on a training sample set comprising a plurality of frame sample images and a truth image corresponding to each frame, the truth image is obtained by fusing a plurality of frame reference images, and the resolution of the truth image is larger than that of the corresponding sample image.
It will be appreciated that different image reconstruction models have different magnifications. However, in the online phase, i.e. when these models are in use, not all image reconstruction models are called but part of the models in order to reduce the computational effort. In some embodiments, the device may determine a magnification to be applied to the image to be processed, and select the at least one image reconstruction model matching the magnification to be applied from a plurality of trained image reconstruction models according to the magnification to be applied. Therefore, the visual requirement of a user on the image resolution can be met, the calculated amount can be reduced, and the real-time requirement of high-resolution image display or high-resolution video display can be met.
And step 703, performing super-resolution processing on the image to be processed through the at least one image reconstruction model to obtain a target image, wherein the resolution of the target image is greater than that of the image to be processed.
It will be appreciated that the at least one image reconstruction model may be one model, or two or more models. When it is one model, that model is directly used to convert the image to be processed, and the high-resolution image output by the model is taken as the target image. When it is two or more models, in some embodiments the device may perform super-resolution processing on the image to be processed with each of the at least one image reconstruction model to obtain the high-resolution image output by the corresponding model, and fuse the high-resolution images to obtain the target image.
In the embodiment of the application, the image reconstruction model for performing super-resolution processing on the image to be processed is obtained by training a training sample set comprising a plurality of frame sample images and a truth image corresponding to each frame, wherein the truth image is formed by fusing a plurality of frame reference images instead of directly taking a single frame reference image as the truth image, so that the signal-to-noise ratio of the truth image obtained by fusion is higher and the details are clearer, thereby improving the performance of the model after training, and reconstructing the image to be processed into a target image with higher quality when the model is used, namely the signal-to-noise ratio of the obtained target image is higher and the details are clearer.
An embodiment of the present application provides a further image processing method, and fig. 8 is a schematic flowchart of an implementation of the image processing method according to the embodiment of the present application, as shown in fig. 8, where the method may include the following steps 801 to 805:
Step 801, acquiring an image to be processed;
step 802, determining a magnification to be performed on the image to be processed.
In general, the user selects the desired magnification through the application interface; that is, the apparatus receives an operation instruction indicating the magnification and determines the to-be-amplified magnification of the image to be processed according to the operation instruction.
Step 803, selecting at least one image reconstruction model matched with the to-be-amplified magnification from a plurality of trained image reconstruction models according to the to-be-amplified magnification, wherein the image reconstruction model is obtained after training based on a training sample set comprising a plurality of frames of sample images and true images corresponding to each frame, the true images are obtained by fusion of a plurality of frames of reference images, and the resolution of the true images is larger than that of the corresponding sample images;
Step 804, performing super-resolution processing on the image to be processed by using each model in the at least one image reconstruction model to obtain a high-resolution image output by the corresponding model;
And step 805, fusing each high-resolution image to obtain the target image.
As described above, different image reconstruction models have different magnification capabilities, i.e., different corresponding magnifications. For example, the pre-trained image reconstruction models include model 1, model 2 and model 3, where model 1 has 2× magnification capability, model 2 has 4× magnification capability, and model 3 has 6× magnification capability. If the to-be-amplified magnification of the image to be processed is 2.3, then reconstructing the image to be processed with model 1 alone would yield a target image whose definition cannot meet the user requirement, and reconstructing it with model 2 alone would likewise fail to meet the user requirement. On this basis, in order to make the target image match as closely as possible the image effect actually corresponding to the to-be-amplified magnification, model 1 and model 2 can each be selected to reconstruct the image to be processed, and the high-resolution images output by model 1 and model 2 are then fused to obtain the target image.
In some embodiments, if the magnification to be amplified is different from the magnification of each of the plurality of image reconstruction models, the apparatus may select two models closest to the magnification to be amplified as the at least one image reconstruction model to reconstruct the image to be processed respectively, and then fuse the high-resolution images output by the two models, so as to obtain the target image.
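For illustration, a hedged sketch of this selection-and-fusion strategy is given below (it also covers the exact-match case described in the next paragraph); the resize-and-blend rule, which weights each model's output by its distance to the to-be-amplified magnification, is an assumption, since the application does not fix a specific fusion rule.

```python
# Hedged sketch: pick the model(s) whose magnification brackets the target,
# resize each output to the target size, and blend with distance-based weights.
# The blending rule and all names are illustrative assumptions.
import cv2
import numpy as np

def reconstruct(image, target_scale, models):
    """models: dict mapping magnification (e.g. 2, 4, 6) to a callable reconstruction model."""
    if target_scale in models:                       # exact match: a single model suffices
        return models[target_scale](image)
    scales = sorted(models)
    lo = max(s for s in scales if s < target_scale)
    hi = min(s for s in scales if s > target_scale)
    h, w = image.shape[:2]
    size = (int(w * target_scale), int(h * target_scale))
    out_lo = cv2.resize(models[lo](image), size, interpolation=cv2.INTER_CUBIC)
    out_hi = cv2.resize(models[hi](image), size, interpolation=cv2.INTER_CUBIC)
    w_hi = (target_scale - lo) / (hi - lo)           # the closer model receives the larger weight
    return ((1.0 - w_hi) * out_lo + w_hi * out_hi).astype(np.float32)
```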
In other embodiments, if the magnification to be amplified is the same as that of a certain model in the plurality of image reconstruction models, the model can be used as the at least one image reconstruction model, and accordingly, the image to be processed is reconstructed through the model, and the output high-resolution image is used as the target image.
Single image super-resolution (SISR) refers to restoring a low-resolution image into a high-resolution image by means of signal processing, without changing the shooting hardware.
The super-resolution method mainly adopts a data-driven deep learning model to reconstruct required details so as to obtain accurate super-resolution.
Therefore, the quality of the data pairs greatly affects the quality with which the deep learning model reconstructs high resolution. In conventional deep-learning-based single image super-resolution, bicubic downsampling and Gaussian blur downsampling are the two most commonly used degradation models: a single captured high-resolution image (HR image) is used as the training ground truth of the deep learning model, and the low-resolution image (LR image) obtained by downsampling the HR image with bicubic interpolation or Gaussian blur is used as the training input of the model, thereby forming the data pair.
The two degradation models perform well in processing simulation data sets with the same degradation process. However, in a complex real imaging system, these degradation models cannot accurately simulate the real degradation process, so that the performance of the super-resolution algorithm is significantly reduced on a real image, such as an image acquired by a smart phone.
Therefore, how to construct training data conforming to the degradation model of a real imaging system is a key for further improving the performance of a super-resolution deep learning model and reconstructing a more real high-resolution image.
The related methods for generating super-resolution network training data are basically single-frame simulation methods that use bicubic or Gaussian blur downsampling to obtain the low-resolution image. A super-resolution network trained with data obtained through these two degradation models often performs well on simulated data with the same degradation process, but its performance drops significantly on truly acquired images. In recent years, in order to better simulate the complex real degradation process, more degradation factors (such as noise, blurring and quantization) have also been added on top of downsampling to approximate the real degradation more closely.
Although various simulated degradation methods improve the reconstruction performance of deep-learning-based super-resolution algorithms on noisy or blurred low-resolution images, they still cannot accurately model the degradation process of a real imaging system. As shown in fig. 9, the related literature reports that current state-of-the-art (SOTA) super-resolution models (for example ESRGAN) produce, for images captured by a smartphone, high-resolution reconstructed images 902 that look very unrealistic compared with the original image 901.
Based on this, in one related art, a residual channel attention mechanism is used to effectively enhance useful features and suppress noise while ensuring that multi-scale high-frequency image features can be accurately recovered; however, the training data pairs are generated by shrinking the high-resolution image I_HR through bicubic interpolation to obtain the corresponding low-resolution image I_LR, which gives a good image effect on data simulated with the same bicubic interpolation but a greatly reduced effect on truly acquired images.
In another related art, high/low resolution image pairs are actually acquired by changing the field of view, thereby implicitly modeling the real degradation model in a data-driven manner. However, there are three problems: ① a high-resolution image acquired in a single frame still contains noise and blur, and is therefore not good enough to serve as the Ground Truth of the super-resolution neural network; ② the high-resolution and low-resolution images are acquired in two separate shots, which increases the difficulty of aligning the image content as well as the brightness and color, and local alignment failure can cause local blurring or even deformation in the super-resolution result; ③ for the zoom application of a mobile phone, images would have to be acquired at every zoom focal length, which entails a huge and impractical workload.
Based on this, an exemplary application of the embodiment of the present application in one practical application scenario will be described below.
In order to solve the above technical problems, the embodiment of the application provides a method for generating real super-resolution network training data based on multi-frame fusion. On the one hand, compared with a high-definition image acquired in a single frame, this method gives the obtained Ground Truth a higher signal-to-noise ratio and clearer details; on the other hand, using truly acquired images directly reflects the degradation effects (noise, quantization and the like) produced by the real acquisition device (smartphone, DSLR camera and the like) on the image, without simulation. Meanwhile, the combination of real-device acquisition and simulated post-processing makes it possible to obtain image pairs at multiple magnifications.
Fig. 10 shows the main flow of generating real super-resolution network training data based on multi-frame fusion. It mainly comprises the following steps one to four:
Step one, binning, i.e. downsampling: here Gaussian blur followed by bicubic interpolation is used to obtain the binning result.
And step two, multi-frame fusion, namely synthesizing a training Ground Truth with higher definition and lower noise by using continuously acquired multi-frame images.
Step three, post-processing, an optional step which simulates the image compression coding process.
And step four, alignment: the image after multi-frame fusion and the image after binning may be slightly offset from each other, so the post-processed image needs to be aligned in order to prevent the model from producing local blurring or deformation.
The details of the above steps are as follows:
For binning in step one:
Binning is a downsampling process, as shown in equation (2):
D_b{u} = (H_binning * H_blur)(u)    (2);
In equation (2), H_blur represents Gaussian blur, * represents the convolution operation, and the Gaussian kernel size is the same as the downsampling (downsample) size: if the downsample size is 2×2, the Gaussian kernel size is also 2×2. H_binning represents the neighboring-pixel merging process, that is, downsampling; here downsampling is performed by averaging neighboring pixels, simulating the hardware binning process of the sampling device. b represents the binning parameter, that is, the downsample size; if the downsample size is 2×2, the binning parameter b = 2 means that neighboring 2×2 pixels are averaged.
For multi-frame fusion in step two:
Multi-frame synthesis uses a large number of input frames to obtain a Ground Truth image with lower noise and clearer detail. As shown in fig. 11, the process includes reference frame selection (sharpness estimation), image registration, and multi-frame averaging.
Wherein for sharpness estimation, sharpness estimation of an image can be performed using a sharpness reference model based on local luminance features, as shown in the following equation (3):
The whole image is divided into k_1 × k_2 windows, where I_max(k, l) and I_min(k, l) represent the maximum luminance value and the minimum luminance value in the (k, l)-th window, respectively.
For image registration, homography matrix estimation based on SIFT feature point detection may be used, where the homography matrix H_K from the k-th image Y_K to the reference frame image Y_0 is estimated, as shown in the following formula (4):
[x', y', w']^T = H_K · [x, y, w]^T    (4);
In formula (4), (x, y) is the coordinate of any point in Y_K, the 3×3 matrix is H_K, the converted point (x', y') is the coordinate of (x, y) after registration to the reference frame image Y_0, and w' = w = 1.
Thus, the displacement [mv_xk, mv_yk] of each point in Y_K relative to Y_0 can be calculated according to the homography matrix H_K to form a two-channel offset vector map of the same size as Y_0 and Y_K. The main flow is shown in fig. 12 and comprises the following steps 121 to 124:
Step 121, performing feature point detection on images Y_0 and Y_K respectively;
Step 122, determining a feature point description of the feature point of each image;
Step 123, based on the feature point description, performing feature point matching on the two frames of images;
Step 124, calculating a homography matrix based on the feature point matching result.
The Scale-Invariant Feature Transform (SIFT) is a key point detection algorithm whose essence is to search for key points (feature points) in different scale spaces, calculate the magnitude, direction and scale information of the key points, and use this information to describe the feature points. The key points found by SIFT are prominent and stable feature points that do not change due to factors such as illumination, affine transformation and noise. After the feature points are obtained, a feature vector is formed from the gradient histogram of the points around each feature point; this feature vector is the description of the current feature point. The feature points of the two images Y_0 and Y_k are calculated respectively, obtaining two sets of feature points.
The feature point matching process is to find, from the two sets of feature points obtained above, 4 or more feature point pairs on the two images Y_0 and Y_k that are closest to each other. In implementation, the nearest feature point pairs can be searched for by the Euclidean distance shown in the following formula (5).
After the nearest feature point pairs are found, the coefficients of the homography matrix H_K can be solved by Direct Linear Transformation (DLT), thus obtaining the displacement [mv_x, mv_y] of each point in Y_K relative to Y_0. Assuming that the coordinates of the matched feature points on Y_K are (x_1, y_1), (x_2, y_2), ..., (x_t, y_t) and the coordinates of the corresponding feature points on Y_0 are (x'_1, y'_1), (x'_2, y'_2), ..., (x'_t, y'_t), applying the homography matrix to the corresponding point pairs and rewriting the equations yields the equation shown in the following formula (6):
A·H_K = 0    (6);
In equation (6), A is a matrix with twice as many rows as there are corresponding point pairs: the coefficients of the equations of these corresponding point pairs are stacked into the matrix, and the least-squares solution of H_K can be found using the Singular Value Decomposition (SVD) algorithm, so that the displacement [mv_xk, mv_yk] of each frame Y_K relative to Y_0 is obtained. After estimating the displacement, Y_K can be displaced pixel by pixel according to [mv_xk, mv_yk] to obtain the aligned result of the k-th frame, denoted Y_k'.
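A condensed sketch of this DLT solution is given below, assuming the homography coefficients are stacked as a 9-vector (the formulation written above as A·H_K = 0); the row construction and names are illustrative.

```python
# Hedged sketch of the DLT step: two rows of A per matched point pair, least-squares
# solution taken from the right singular vector of the smallest singular value.
import numpy as np

def dlt_homography(pts_k, pts_0):
    """pts_k, pts_0: (t, 2) arrays of matched (x, y) coordinates on Y_K and Y_0, with t >= 4."""
    rows = []
    for (x, y), (xp, yp) in zip(pts_k, pts_0):
        rows.append([-x, -y, -1, 0, 0, 0, x * xp, y * xp, xp])
        rows.append([0, 0, 0, -x, -y, -1, x * yp, y * yp, yp])
    A = np.asarray(rows, dtype=np.float64)        # 2t x 9
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)                      # right singular vector of the smallest singular value
    return H / H[2, 2]                            # normalize so that H[2, 2] = 1
```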
As for multi-frame averaging, the registered images are numerically averaged on the Y channel according to pixel position to obtain the final multi-frame synthesis result.
For the post-processing in step three, the following explanation is given here.
This step is optional, depending on whether the trained model performs YUV->YUV image super-resolution or processes compressed jpg->jpg images. If the model is a YUV->YUV super-resolution model, the image coding compression process does not need to be simulated; if the model is a jpg->jpg super-resolution model, the damage to reconstructed details caused by image coding compression also needs to be considered.
This is shown in the following formula (7):
X_n_compress = C_c{X_n}    (7);
In formula (7), n represents the binning parameter, and C_c{x} represents image compression with compression strength c. The image can be encoded and compressed using the JPEG2000 compression standard. To ensure the robustness of the model, for the jpg->jpg super-resolution model, the compression encoding strength, namely the quantization parameter (QP), is randomly selected from {0, 10, 20, 30, 40}, so that compression artifacts of different degrees are generated, ranging from no compression at all to strong compression.
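An illustrative sketch of this post-processing step follows. It uses standard JPEG encoding via OpenCV with a randomly chosen quality as a stand-in for the JPEG2000 quantization-parameter selection described above; the QP-to-quality mapping is purely an assumption.

```python
# Hedged sketch of the optional compression post-processing. Standard JPEG and the
# QP-to-quality mapping below are stand-in assumptions for the JPEG2000/QP scheme.
import random
import cv2

def compress(image_bgr):
    qp = random.choice([0, 10, 20, 30, 40])
    quality = 100 - 2 * qp                 # assumed mapping: QP 0 -> near-lossless, QP 40 -> strong compression
    ok, buf = cv2.imencode(".jpg", image_bgr, [cv2.IMWRITE_JPEG_QUALITY, quality])
    assert ok, "JPEG encoding failed"
    return cv2.imdecode(buf, cv2.IMREAD_COLOR), qp
```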
For the alignment in step four, the following explanation is given here:
The multi-frame fused image and the binned image may be slightly offset from each other. In order to prevent the model from producing local blurring or deformation, the image registration method of step two is used to align X_n_compress to the Ground Truth Y_gt, obtaining the training input image X_n_SR_input of the final super-resolution neural network.
In the embodiment of the application, a method for generating real training data for the super-resolution neural network based on multi-frame fusion is adopted. On the one hand, compared with a high-definition image acquired in a single frame, multi-frame fusion gives the obtained Ground Truth a higher signal-to-noise ratio and clearer details; on the other hand, using truly acquired images directly reflects the degradation effects (noise, blurring, quantization and the like) produced by the real acquisition device (smartphone, DSLR camera and the like) on the image, without simulation. Meanwhile, the combination of real-device acquisition and simulated post-processing makes it possible to obtain image pairs at multiple magnifications.
The embodiment of the application provides a method for generating real training data for the super-resolution neural network using multi-frame synthesis, which makes the deep-learning-based super-resolution model perform better on truly acquired images than with data generated by previous simulated degradation, raises the upper limit of the performance of deep learning applied to super-resolution algorithms, takes into account the artifacts produced by image compression coding when processing LR images, and extends the training data so that it applies to super-resolution models that process coded images such as jpg images.
In an embodiment of the present application, the manner of image registration includes, but is not limited to, the following:
As one possible implementation, image registration may employ Speeded-Up Robust Features (SURF), corner points, or other features for feature point detection and description.
As one possible implementation, the optical flow vector of each pixel from the current frame to the reference frame is solved based on the luminance information around each point in the adjacent frames, and the motion vector of the pixel is calculated from the optical flow vector.
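A brief sketch of this dense optical-flow alternative is shown below, assuming OpenCV's Farneback method; the returned flow field plays the role of the per-pixel motion vectors [mv_x, mv_y], and the parameter values are illustrative.

```python
# Hedged sketch: dense optical flow from the current frame to the reference frame
# as an alternative to homography-based registration. Parameter values are assumed.
import cv2

def dense_flow(current_gray, reference_gray):
    # flow[y, x] = (mv_x, mv_y) moving each pixel of the current frame toward the reference frame.
    return cv2.calcOpticalFlowFarneback(current_gray, reference_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```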
In an embodiment of the present application, the manner of multi-frame averaging includes, but is not limited to, the following:
As a possible implementation, the multi-frame averaging may also be weighted based on sharpness estimates of image portions.
As a possible implementation, multi-frame averaging may also be performed in the frequency domain, for example by performing a wavelet transform (Wavelet Transform, WT), a discrete Fourier transform or a discrete cosine transform to convert the image signal into the frequency domain, performing the image averaging there, and then converting back into the spatial domain.
In an embodiment of the present application, the modes of post-processing image compression encoding include, but are not limited to, the following modes:
As a possible implementation, the compression coding may also employ the h.265/HEVC video coding standard, for data applications of the video super resolution model.
As one possible implementation, compression encoding may also employ the JPEG-XR or MPEG series (e.g., MPEG-2) image compression standards.
Based on the foregoing embodiments, the image processing apparatus and the model training apparatus provided in the embodiments of the present application include a number of modules, and each unit included in each module may be implemented by a processor in an electronic device, or of course by a specific logic circuit. In the implementation process, the processor may be a Central Processing Unit (CPU), a Microprocessor (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 13A is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application, as shown in fig. 13A, the apparatus 130 includes an obtaining module 131, a calling module 132, and a super-resolution processing module 133, where:
An acquiring module 131, configured to acquire an image to be processed;
The invoking module 132 is configured to invoke at least one trained image reconstruction model, where the image reconstruction model is obtained after training based on a training sample set including multiple frames of sample images and true images corresponding to each frame, the true images are obtained by fusing multiple frames of reference images, and resolution of the true images is greater than resolution of the corresponding sample images;
And the super-resolution processing module 133 is configured to perform super-resolution processing on the image to be processed through the at least one image reconstruction model, so as to obtain a target image.
In some embodiments, the invoking module 132 is configured to determine a to-be-amplified magnification of the image to be processed, and select, from a plurality of trained image reconstruction models according to the to-be-amplified magnification, the at least one image reconstruction model matching the to-be-amplified magnification, where the magnifications corresponding to different image reconstruction models are different.
In some embodiments, the super-resolution processing module 133 is configured to perform super-resolution processing on the image to be processed by using each model in the at least one image reconstruction model to obtain a high-resolution image output by the corresponding model, and fuse each high-resolution image to obtain the target image.
In some embodiments, as shown in fig. 13B, the apparatus 130 further includes a downsampling module 134, a sample generating module 135 and a model training module 136, where the downsampling module 134 is configured to perform downsampling processing on an original image set according to each preset downsampling parameter value to obtain a downsampled image set with a corresponding downsampling parameter value, the sample generating module 135 is configured to generate a training sample set according to the downsampled image set and a true image corresponding to the downsampled image, and the model training module 136 is configured to train the original deep learning model by using the training sample set with each downsampling parameter value to obtain an image reconstruction model with a corresponding downsampling parameter value.
In some embodiments, the downsampling module 134 is configured to take the downsampling parameter value as a size of a gaussian kernel, perform gaussian blur processing on the original image to obtain a gaussian blurred image, and perform bicubic interpolation on the gaussian blurred image according to the downsampling parameter value to obtain the downsampled image.
In some embodiments, the obtaining module 131 is further configured to obtain N frame candidate images, where N is an integer greater than 1, determine the definition of each of the candidate images, and determine an image in the N frame candidate images whose definition satisfies a specific condition as the reference image.
In some embodiments, the definition is characterized by sharpness, and the obtaining module 131 is configured to perform sharpness estimation on each of the candidate images to obtain the sharpness of the corresponding image, and determine an image in the N frame candidate images whose sharpness is greater than a specific threshold as the reference image.
In some embodiments, the obtaining module 131 is further configured to select one frame of image of the multiple frames of reference images as a first reference image, determine a first displacement of each other reference image of the multiple frames of reference images except the first reference image relative to the first reference image, shift each other reference image pixel by pixel according to the corresponding first displacement to align the other reference images with the first reference image, thereby obtaining a corresponding second reference image, and fuse the first reference image with each second reference image, thereby obtaining the true value image.
In some embodiments, an obtaining module 131 is configured to detect a feature point of each reference image and extract feature description information of each feature point, perform feature point matching on the other reference images of an i-th frame and the first reference image according to the feature description information of each feature point of the other reference images of the i-th frame and the first reference image to obtain a feature point matching pair set, where i is an integer greater than 0 and less than or equal to the total number of the other reference images, determine a homography matrix between two frames of images according to pixel coordinates of feature points in the feature point matching pair set, and determine a first displacement of the other reference images of the i-th frame relative to the first reference image according to the homography matrix.
An embodiment of the present application provides a model training device, fig. 14A is a schematic structural diagram of the model training device of the embodiment of the present application, and as shown in fig. 14A, the device 140 includes a downsampling module 141, a sample generating module 142 and a model training module 143, where,
The downsampling module 141 is configured to downsample the original image set according to each preset downsampling parameter value, so as to obtain a downsampled image set with a corresponding downsampling parameter value;
The sample generation module 142 is configured to generate a training sample set according to the downsampled image set and a truth image corresponding to the downsampled image, where the truth image is obtained by fusing multiple frames of reference images, and a resolution of the truth image is greater than a resolution of a corresponding sample image;
the model training module 143 is configured to train the original deep learning model by using the training sample set under each of the downsampling parameter values, so as to obtain an image reconstruction model under the corresponding downsampling parameter values.
In some embodiments, the downsampling module 141 is configured to perform gaussian blur processing on the original image with the downsampling parameter value as a size of a gaussian kernel to obtain a gaussian blurred image, and perform bicubic interpolation on the gaussian blurred image according to the downsampling parameter value to obtain the downsampled image.
In some embodiments, as shown in fig. 14B, the apparatus 140 further includes a determining module 144 configured to obtain N frames of candidate images, where N is an integer greater than 1, determine the clarity of each candidate image, and determine, as the reference image, an image in the N frames of candidate images whose clarity satisfies a specific condition.
In some embodiments, the clarity is characterized by sharpness, and the determining module 144 is configured to perform sharpness estimation on each candidate image to obtain the sharpness of the corresponding image, and determine, as the reference image, an image in the N frames of candidate images whose sharpness is greater than a specific threshold.
In some embodiments, as shown in fig. 14B, the apparatus 140 further includes an image fusion module 145 configured to select one frame of the multiple frames of reference images as a first reference image; determine a first displacement, relative to the first reference image, of each reference image other than the first reference image in the multiple frames of reference images; shift each of the other reference images pixel by pixel according to the corresponding first displacement so as to align it with the first reference image, thereby obtaining a corresponding second reference image; and fuse the first reference image with each second reference image, thereby obtaining the truth image.
In some embodiments, the image fusion module 145 is configured to perform feature point detection on each reference image and extract feature description information of each feature point; perform feature point matching between the i-th frame of the other reference images and the first reference image according to the feature description information of the feature points of the two frames, so as to obtain a set of matched feature point pairs, where i is an integer greater than 0 and less than or equal to the total number of the other reference images; determine a homography matrix between the two frames of images according to the pixel coordinates of the feature points in the set of matched feature point pairs; and determine the first displacement of the i-th frame of the other reference images relative to the first reference image according to the homography matrix.
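Continuing the alignment sketch above, the fusion step might warp each other reference image onto the first with its estimated homography (yielding the second reference images) and then average the aligned stack; the warp-and-mean rule below is an assumption, since the embodiments do not fix a particular fusion formula.

```python
import cv2
import numpy as np

def fuse_references(first_ref: np.ndarray, other_refs: list, homographies: list) -> np.ndarray:
    # Warp each other reference image onto the first reference image.
    h, w = first_ref.shape[:2]
    aligned = [first_ref.astype(np.float32)]
    for img, H in zip(other_refs, homographies):
        warped = cv2.warpPerspective(img, H, (w, h))   # the "second reference image"
        aligned.append(warped.astype(np.float32))
    # Simple mean fusion of the aligned stack to form the truth image.
    return np.clip(np.mean(aligned, axis=0), 0, 255).astype(np.uint8)
```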
The description of the apparatus embodiments above is similar to the description of the method embodiments above, and the apparatus embodiments have advantageous effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present application, please refer to the description of the method embodiments of the present application.
It should be noted that, in the embodiments of the present application, if the above-mentioned image processing method is implemented in the form of a software functional module and sold or used as a separate product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, or the part thereof contributing to the related art, may essentially be embodied in the form of a software product stored in a storage medium, including several instructions for causing an electronic device to execute all or part of the methods described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the application are not limited to any specific combination of hardware and software.
Correspondingly, as shown in fig. 15, the electronic device 150 provided by the embodiment of the present application may include a memory 151 and a processor 152, where the memory 151 stores a computer program that can be run on the processor 152, and the processor 152 implements the steps in the method provided in the above embodiment when executing the program.
The memory 151 is configured to store instructions and applications executable by the processor 152, and may also cache data (e.g., image data, audio data, voice communication data, and video communication data) that is to be processed or has been processed by the modules in the processor 152 and the electronic device 150, and may be implemented by a flash memory (FLASH) or a random access memory (Random Access Memory, RAM).
Accordingly, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method provided in the above embodiments.
It should be noted here that the description of the storage medium and the device embodiments above is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and the apparatus of the present application, please refer to the description of the method embodiments of the present application.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application. The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation, such as combining multiple units or components or integrating them into another system, or omitting or not performing some features. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units, and some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may all be integrated in one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated in one unit; the integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the above method embodiments may be accomplished by hardware related to program instructions. The foregoing program may be stored in a computer readable storage medium, and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, if the above-described integrated units of the present application are implemented in the form of software functional modules and sold or used as separate products, they may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, or the part thereof contributing to the related art, may essentially be embodied in the form of a software product stored in a storage medium, including several instructions for causing an electronic device to execute all or part of the methods described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The methods disclosed in the method embodiments provided by the application may be combined arbitrarily without conflict to obtain a new method embodiment.
The features disclosed in the several product embodiments provided by the application may be combined arbitrarily without conflict to obtain new product embodiments.
The features disclosed in the method or apparatus embodiments provided by the application may be combined arbitrarily without conflict to obtain new method or apparatus embodiments.
The foregoing is merely an embodiment of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed in the present application, and such changes and substitutions are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.