
WO2025110998A1 - Multi-scale image processing network - Google Patents


Info

Publication number
WO2025110998A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
layers
scale
delta
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2023/080788
Other languages
French (fr)
Inventor
Mauricio Delbracio
Zhengzhong TU
Hossein TALEBI
Peyman Milanfar
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Priority to PCT/US2023/080788
Publication of WO2025110998A1
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/73 Deblurring; Sharpening
    • G06T 5/60 Image enhancement or restoration using machine learning, e.g. neural networks
    • G06T 5/70 Denoising; Smoothing
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20172 Image enhancement details
    • G06T 2207/20201 Motion blur correction

Definitions

  • the present disclosure relates generally to image processing. More particularly, the present disclosure relates to a multi-scale image processing network for performing various image processing tasks such as, for example, image restoration or “de-blurring”.
  • Image processing tasks involve a series of computational procedures aimed at recovering an original image from a degraded version. This could be due to various reasons such as noise, motion blur, atmospheric turbulence, etc.
  • Image restoration often involves complex techniques such as deblurring, denoising, superresolution and more. Image restoration is crucial in fields where image quality significantly impacts results, such as remote sensing, medical imaging, surveillance, and digital photography.
  • a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
  • One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
  • One general aspect includes a computing system for performing image processing.
  • the computing system also includes one or more processors.
  • the system also includes one or more non-transitory computer-readable media that store a machine-learned multi-scale image processing network
  • the machine-learned multi-scale image processing network may include a sequence of layers arranged to operate on respective versions of an input image in a sequence of increasing image scales, the sequence of layers may include one or more delta layers.
  • Each of the one or more delta layers can be configured to: obtain a respective version of the input image having a respective scale in the sequence of increasing image scales; process the respective version of the input image with a set of learned parameters to generate an intermediate image; apply a high pass image filter to the intermediate image to generate a filtered image; and generate a respective output image having the respective scale by combining the filtered image with an upscaled version of the respective output image from a preceding layer in the sequence of layers.
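  • As an illustrative, non-authoritative sketch of this delta-layer computation, the following PyTorch-style code shows one way the four operations could be composed; the names DeltaLayer, backbone, and high_pass are assumptions for illustration, not taken from the disclosure.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeltaLayer(nn.Module):
    """One delta layer: learned processing, high-pass filtering, and
    combination with the upscaled output of the preceding (coarser) layer."""

    def __init__(self, backbone: nn.Module, high_pass: nn.Module):
        super().__init__()
        self.backbone = backbone    # the set of learned parameters (e.g., a U-Net)
        self.high_pass = high_pass  # the high pass image filter

    def forward(self, x_scale: torch.Tensor, prev_out: torch.Tensor) -> torch.Tensor:
        # 1) process the respective version of the input image
        intermediate = self.backbone(x_scale)
        # 2) apply the high pass filter to obtain the filtered image
        filtered = self.high_pass(intermediate)
        # 3) upscale the preceding layer's output image to the current scale
        upscaled = F.interpolate(prev_out, size=x_scale.shape[-2:],
                                 mode="bilinear", align_corners=False)
        # 4) combine the high frequencies with the upscaled low frequencies
        return filtered + upscaled
```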
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the computing system where the high pass image filter may include a Laplacian filter.
  • the machine-learned multi-scale image processing network may include an image restoration model, and where the input image may include a degraded image and the respective output image from a final layer in the sequence of layers may include a restored image.
  • the sequence of layers may include a coarsest scale layer as an initial layer in the sequence of layers, where the one or more delta layers follow the coarsest scale layer in the sequence of layers, and where the coarsest scale layer does not apply the high pass filter.
  • the set of learned parameters is structured in a U-Net architecture.
  • the one or more delta layers may include a plurality of delta layers.
  • Two or more of the plurality of delta layers may share the same set of learned parameters.
  • the input image may have a first scale, and where the first scale is a largest scale in the sequence of increasing image scales.
  • the machine-learned multi-scale image processing network further may include a downscaling block configured to downscale the input image from the first scale to generate the respective versions of the input image at the sequence of increasing image scales.
  • Each of the one or more delta layers may be configured to concatenate and then process with the set of learned parameters the respective version of the input image and the upscaled version of the respective output image from the preceding layer in the sequence of layers to generate the intermediate image.
  • the respective output image generated by each of the one or more delta layers may include a restored version of the input image.
  • the method also includes processing, by the computing system, the input image with the multi-scale image processing network to generate a model prediction, where processing the input image with the multi-scale image processing network may include some of the following for each of the one or more delta layers.
  • the method also includes obtaining a respective version of the input image having a respective scale in the sequence of increasing image scales.
  • the method also includes processing the respective version of the input image with a set of parameters to generate an intermediate image.
  • the method also includes applying a high pass image filter to the intermediate image to generate a filtered image.
  • the method also includes generating a respective output image having the respective scale by combining the filtered image with an upscaled version of the respective output image from a preceding layer in the sequence of layers.
  • the method also includes evaluating, by the computing system, a loss function that compares the model prediction to the ground truth data.
  • the method also includes modifying, by the computing system, one or more parameter values of one or more of the sets of parameters of the one or more delta layers based on the loss function.
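  • As an illustrative sketch only, the preceding training steps could be composed into a single iteration as follows; model, loss_fn, and optimizer are assumed names, and the framework optimizer stands in for the parameter-modification step:
```python
def train_step(model, optimizer, loss_fn, input_image, ground_truth):
    prediction = model(input_image)           # multi-scale forward pass through the delta layers
    loss = loss_fn(prediction, ground_truth)  # compare the model prediction to the ground truth
    optimizer.zero_grad()
    loss.backward()                           # gradients of the loss w.r.t. delta-layer parameters
    optimizer.step()                          # modify parameter values based on the loss
    return loss.item()
```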
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Figure 1 depicts a graphical diagram of an example multi-scale image processing model according to example embodiments of the present disclosure.
  • Figure 3A depicts a block diagram of an example computing system according to example embodiments of the present disclosure.
  • Example aspects of the present disclosure are directed to a new neural network architecture and associated framework and methodology for multiscale image restoration or other forms of image processing tasks.
  • Example implementations of the proposed neural network architecture can be referred to as Delta-Net.
  • This proposed technology utilizes a machine-learned multi-scale image processing network that operates on increasing scales of an input image. The architecture begins by processing the image at the coarsest resolution, then proceeds to higher scale levels. For instance, it can start with a low-resolution image and gradually refine the details to generate a high-resolution output.
  • the delta layer can combine the filtered image with an upscaled version of the output image from the preceding layer. This combination generates a respective output image at the current scale. For example, a filtered mid-scale image could be combined with an upscaled version of a low-scale output image to produce a mid-scale output image. This process ensures that the learned component from each layer generates additional details with respect to the previous lower-resolution output.
  • the Delta-Net architecture can include a sequence of delta layers, which follow a coarsest scale layer.
  • the coarsest scale layer does not apply the high pass filter.
  • This layer serves as the initial layer in the sequence of layers, with the delta layers following in sequence. For example, an initial layer might process the image at its lowest resolution, and subsequent delta layers would then refine the image at progressively higher resolutions.
  • the delta layers within the Delta-Net architecture can include multiple layers.
  • the network might feature several delta layers, each processing the image at a different scale.
  • This multi-layer approach allows for a more detailed and nuanced image restoration process.
  • two or more of these delta layers could share the same set of learned parameters, which might streamline the processing and reduce computational demands.
  • the input image has the largest scale in the sequence of increasing image scales.
  • the network can include a downscaling block to generate the respective versions of the input image at the sequence of scales.
  • a high-resolution input image could be downscaled to create lower-resolution versions for processing by the delta layers.
  • each delta layer in the network can concatenate and process the respective version of the input image and the upscaled version of the respective output image from the preceding layer. This combination is then processed with the set of learned parameters to generate the intermediate image. This approach ensures that each layer incorporates information from both the current scale and the previous scale, potentially improving the quality of the final restored image.
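  • One plausible realization of this concatenate-then-process step is sketched below, assuming the learned backbone accepts the doubled channel count; the helper names are hypothetical:
```python
import torch
import torch.nn.functional as F

def delta_forward_concat(backbone, high_pass, x_scale, prev_out):
    upscaled = F.interpolate(prev_out, size=x_scale.shape[-2:],
                             mode="bilinear", align_corners=False)
    stacked = torch.cat([x_scale, upscaled], dim=1)  # concatenate along the channel axis
    intermediate = backbone(stacked)                 # process with the learned parameters
    return high_pass(intermediate) + upscaled        # combine as before
```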
  • the proposed network can be applied to a number of different tasks.
  • the respective output image generated by each delta layer could either be a restored version of the input image or can alternatively be a feature image comprising latent feature values expressed in a latent dimensional space. The latter could be useful for other machine learning tasks, such as object detection or image classification.
  • the system can further include a machine-learned prediction model that processes one or more of the respective output images to generate model predictions.
  • the systems and methods of the present disclosure provide a number of technical effects and benefits.
  • One technical benefit provided by the present disclosure is a novel neural network architecture, Delta-Net, which can be used for multiscale image restoration.
  • This technology can significantly enhance the performance of computing systems in several ways.
  • the Delta-Net architecture can improve the efficiency of image processing tasks by operating on increasing scales of an input image.
  • This multiscale approach allows the system to start with a low-resolution image and gradually refine the details to generate a high-resolution output. This can lead to a more efficient use of computational resources, as the system can focus on refining details instead of processing the entire image at a high resolution from the start.
  • the proposed techniques also provide a unique benefit of the Delta layers predicting higher frequency details while leveraging the lower frequency details from the preceding layer. This approach ensures that each layer contributes to the final output in a meaningful and complementary way.
  • the application of high-pass filters at each scale allows the Delta layer to focus on extracting and enhancing high-frequency details — the finer textures and edges — that are often lost in degraded images.
  • By replacing the missing low-frequency components with the output of the previous scale, the system ensures continuity and coherence across scales. This process capitalizes on the inherent hierarchical structure of images, where low-frequency components provide the basic structure and high-frequency components add the finer details. The result is a more consistent and scale-invariant image restoration output.
  • This method is particularly advantageous in handling various types of image degradation, such as motion blur or out-of-focus blur, where both the overall image structure and finer details need to be restored for a high-quality output.
  • This architecture therefore enables a more nuanced and comprehensive approach to image restoration, effectively addressing the challenges presented by multiscale and multifaceted image degradation.
  • the present disclosure can also enable new functionalities.
  • the respective output image generated by each delta layer could be a feature image comprising latent feature values expressed in a latent dimensional space. This could be useful for other machine learning tasks, such as object detection or image classification.
  • the system could include a machine-learned prediction model that processes one or more of the respective output images to generate model predictions. This could potentially enable the system to perform multiple tasks simultaneously, such as restoring an image and identifying objects within the image.
  • the present disclosure offers a versatile and efficient solution for image restoration tasks.
  • the system can increase the performance of computing systems in image processing tasks, while also enabling new functionalities.
  • Figure 1 provides a graphical diagram of an example multi-scale image processing model, referenced as 1000. This figure illustrates the process by which some example implementations of the Delta-Net system operate, from an initial input image 1002 to a final output image 1040, which, in some implementations, may be a restored version of the input image 1002.
  • the image restoration process begins with the input image, labeled as 1002 in Figure 1.
  • This image serves as the initial data for the Delta-Net system, and can be any image that requires restoration.
  • the input image 1002 may exhibit blur, compression artifacts, or other forms of image degradation.
  • the input image 1002 can be any digital image that needs to be enhanced or restored. For example, it could be a photograph taken with a digital camera that has been blurred due to motion or out-of-focus issues. Alternatively, the input image could be a frame from a video that has been degraded due to compression artifacts.
  • the input image could also be a scanned document that needs to be enhanced for better readability.
  • the input image 1002 can be in any suitable format that can be processed by the model 1000. For instance, it could be a bitmap image, a vector image, or a raw image file.
  • the input image could also be in any suitable color space, such as RGB, grayscale, or a color space designed for a specific application.
  • the system can be configured to convert the input image to a suitable format and color space for processing, if necessary.
  • the input image 1002 can also be subjected to various preprocessing steps before being processed by the Delta-Net architecture.
  • the system could apply noise reduction, color correction, or other image enhancement techniques to the input image.
  • These preprocessing steps can help to improve the quality of the final restored image.
  • the input image 1002 could have the largest scale in the sequence of increasing image scales, thus it may need to be downscaled for further processing.
  • the downscaling block, denoted as 1004, then processes the input image to generate downscaled versions. For instance, a high-resolution input image can be downscaled by this block to create lower-resolution versions, labeled as 1006 and 1008 in Figure 1. These downscaled images are then used as input by the respective layers in the Delta-Net system. Alternatively, the system could start with a low-resolution input image and use an upscaling technique to create higher-resolution versions of the image.
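  • A simple downscaling block could be realized as an image pyramid; the factor-of-two steps in this sketch are an assumption, as the disclosure does not fix a particular scale factor:
```python
import torch.nn.functional as F

def build_pyramid(image, num_scales: int):
    """Return [finest, ..., coarsest] versions of the input image."""
    scales = [image]
    for _ in range(num_scales - 1):
        image = F.interpolate(image, scale_factor=0.5,
                              mode="bilinear", align_corners=False)
        scales.append(image)
    return scales
```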
  • the illustrated Delta-Net architecture can operate across multiple different image scales.
  • Image scales refer to the different resolutions at which an image can be processed in the network.
  • the model 1000 can operate on respective versions of an input image in a sequence of increasing image scales, starting from a coarse resolution and gradually moving to higher resolutions. This multiscale approach allows the system to initially process the image at a lower resolution, which can potentially reduce computational demands and improve efficiency.
  • the number and range of image scales used in the Delta-Net architecture can vary based on the specific application. For example, for a simple image restoration task, the system might use a small number of scales, starting from a low resolution and gradually increasing to the original resolution of the input image. On the other hand, for a more complex task such as deblurring or denoising, the system might use a larger number of scales, starting from a very coarse resolution and increasing to a high resolution.
  • the specific resolutions used for the image scales can also vary.
  • the system could use standard resolutions such as 480p, 720p, 1080p, and 4K for the image scales.
  • the system could use custom resolutions based on the specific requirements of the task.
  • the system could also dynamically determine the resolutions based on the characteristics of the input image.
  • a coarsest layer, denoted as 1010, of the model 1000 can serve as an initial layer in the sequence of layers in the Delta-Net architecture.
  • This layer 1010 processes the lowest resolution version 1008 of the image, which is the coarsest version in the sequence. Notably, this layer does not apply a high pass filter to the image, as shown in Figure 1.
  • the output of this layer is denoted as output image_2 or 1012.
  • the Delta-Net system employs one or more delta layers.
  • the first of these, delta layer_1 or 1014, processes the image 1006 at a higher resolution than the coarsest layer. It utilizes a set of learned parameters, 1016, to process the respective version 1006 of the image and generate an intermediate image, 1018.
  • the intermediate image 1018 can be seen as a step in the image restoration process, representing the state of the image after it has been processed at a particular scale but before it has been filtered.
  • the intermediate image 1018 can be in any suitable format that can be processed by the high pass image filter, such as a bitmap image or a raw image file.
  • the set of learned parameters 1016 can be structured as a U-Net architecture.
  • the U-Net architecture can be implemented in various ways depending on the specific requirements of the image processing task.
  • the U-Net architecture could include a series of convolutional layers, followed by a series of deconvolutional layers.
  • the convolutional layers can be used to extract features from the respective version of the input image, while the deconvolutional layers can be used to reconstruct the output image from the extracted features.
  • the U-Net architecture can also include skip connections between the convolutional and deconvolutional layers. These skip connections can help to preserve the high-frequency details in the image, which can be particularly beneficial for tasks such as image restoration. For instance, a skip connection might directly connect a convolutional layer to a corresponding deconvolutional layer, allowing the high- frequency details extracted by the convolutional layer to be directly incorporated into the reconstructed output image.
  • the U-Net architecture could potentially be implemented with different numbers of layers.
  • the U-Net architecture could include a small number of layers for simple image processing tasks, or a large number of layers for more complex tasks.
  • the U-Net architecture could also include various types of layers, such as pooling layers, normalization layers, or activation layers.
  • Each layer in the U-Net architecture could use different types of kernels or activation functions.
  • the convolutional layers could use small kernels to extract fine-grained features, or large kernels to extract coarse-grained features.
  • the activation functions could include linear functions, sigmoid functions, or rectified linear unit (ReLU) functions, among others.
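  • A deliberately small U-Net-style backbone consistent with the description above (a convolutional encoder, a deconvolutional decoder, and a skip connection) is sketched below; the depth, widths, and kernel sizes are arbitrary illustrative choices, and even input dimensions are assumed:
```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, channels: int = 3, width: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(channels, width, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(width, width * 2, 3, stride=2, padding=1)        # encoder downsample
        self.up = nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1)  # decoder upsample
        self.out = nn.Conv2d(width, channels, 3, padding=1)

    def forward(self, x):
        e = self.enc(x)                 # extract features at full resolution
        d = self.up(torch.relu(self.down(e)))
        return self.out(d + e)          # skip connection preserves high-frequency detail
```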
  • the layer delta layer_1, or 1014, then applies a high pass filter, 1020, to the intermediate image 1018, yielding a filtered image, 1022.
  • This filter 1020 can potentially enhance the details in the image, particularly those that are of a higher frequency.
  • the high pass image filter 1020 could be any suitable type of filter that is capable of enhancing high-frequency details. For instance, it could be a Laplacian filter, a Sobel filter, or any other type of high pass filter.
  • the filter 1020 can be a Laplacian filter.
  • This filter can be designed to enhance the details in the image, particularly those that are of a higher frequency.
  • the Laplacian filter can be implemented in various ways, depending on the specific requirements of the image processing task.
  • the Laplacian filter could be a 2D filter that operates on the spatial domain of the image. This type of filter could be particularly effective for enhancing edges and other high-frequency details in the image.
  • the Laplacian filter could be a 3D filter that operates on both the spatial and temporal domains of the image. This could be beneficial for processing video frames, as it could help to enhance details that change over time.
  • the Laplacian filter could also be implemented as a convolutional filter. This involves convolving the filter with the intermediate image to generate the filtered image.
  • the convolution operation could be performed using various methods, such as direct convolution, fast Fourier transform (FFT) based convolution, or separable convolution. The choice of convolution method could depend on factors such as the size of the image and the computational resources available.
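  • For concreteness, the common fixed 3x3 discrete Laplacian applied as a depthwise 2D convolution is shown below; this is one of several possible instantiations, with predetermined rather than learned coefficients:
```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian_filter(image: torch.Tensor) -> torch.Tensor:
    c = image.shape[1]                                   # channel count
    kernel = LAPLACIAN.repeat(c, 1, 1, 1)                # one kernel per channel
    return F.conv2d(image, kernel, padding=1, groups=c)  # depthwise direct convolution
```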
  • the coefficients of the Laplacian filter could be predetermined or learned from training data.
  • the filter could use a standard set of coefficients that are known to be effective for enhancing high-frequency details.
  • the coefficients could be learned from training data, allowing the filter to adapt to the specific characteristics of the images being processed.
  • the filtered image 1022 is the result of applying the high pass image filter 1020 to the intermediate image 1018.
  • the filtered image 1022 represents the state of the image after the high-frequency details have been isolated.
  • This filtered image 1022 can be in any suitable format.
  • the filtered image could be a bitmap image, a vector image, or a raw image file.
  • delta layer_1 1014 combines the filtered image 1022 with an upscaled version of the output image from the previous layer (output image_2, 1012 in this case) to generate output image_1, 1024.
  • the output image 1024 is generated by combining the filtered image 1022 with an upscaled version of the output image 1012 from the preceding layer. This combination generates a respective output image 1024 at the current scale.
  • This process discussed with respect to delta layer_1 1014 can then be repeated in each of a number of subsequent delta layer(s), with each layer processing the image at a progressively higher scale.
  • the last layer, delta layer_0 or 1030 processes the input image 1002 at the highest resolution.
  • This layer also uses a set of learned parameters, 1032, to generate an intermediate image, 1034.
  • a high pass filter, 1036 is then applied to the intermediate image 1034 to generate a filtered image, 1038.
  • delta layer_0 combines the filtered image 1038 with an upscaled version of the output image from the previous layer (output image_1, 1024 in this case) to generate the final output image, labeled as output image_0 or 1040 in Figure 1.
  • this final output image 1040 represents the restored version of the original input image 1002, and can in some cases be the end product of the Delta-Net image restoration process.
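  • Tying the walkthrough together, a hedged sketch of the full coarse-to-fine pass of Figure 1 follows; function and variable names are illustrative, and the pyramid list is assumed ordered from finest to coarsest:
```python
def delta_net_forward(coarsest_layer, delta_layers, pyramid):
    # pyramid[-1] is the coarsest version; the coarsest layer applies no high pass filter
    out = coarsest_layer(pyramid[-1])              # e.g., output image_2 (1012)
    # each delta layer refines at the next finer scale, reusing the previous output
    for layer, x_scale in zip(delta_layers, reversed(pyramid[:-1])):
        out = layer(x_scale, out)                  # e.g., output image_1 (1024), ...
    return out                                     # final output image (1040)
```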
  • the output image (e.g., output images 1024, 1040, etc.) can be a restored version of the input image.
  • the output image at each layer can be a feature image comprising latent feature values expressed in a latent dimensional space.
  • a feature image represents a potential output of each delta layer in the Delta-Net architecture.
  • This image can include latent feature values expressed in a latent dimensional space.
  • the feature image can be a useful output of the image processing system, especially for other machine learning tasks. For instance, the feature image could be employed in tasks such as object detection, image classification, or even image segmentation.
  • the latent feature values in the feature image could represent various characteristics of the input image. For instance, these values could represent edges, textures, colors, or other visual features in the image. Alternatively, the latent feature values could represent higher-level features, such as the presence of specific objects or patterns in the image. The specific types of features represented could depend on the learned parameters used by the delta layer.
  • the feature image could be in any suitable format that can be processed by subsequent layers in the Delta-Net architecture or by other machine learning models.
  • the feature image could be a bitmap image, a vector image, or a raw image file.
  • the feature image could also be in any suitable color space, such as RGB, grayscale, or a color space designed for a specific application.
  • one or more of the output images can be processed further by a machine-learned prediction model (not illustrated).
  • This prediction model could generate one or more model predictions based on the output image(s) (e.g., output feature image(s)). For instance, the prediction model could identify objects in the output image(s), classify the input image based on its features, or predict future states of the image based on its current features.
  • the prediction model is an optional component of the Delta-Net architecture.
  • the prediction model can process one or more of the respective output images generated by one or more of the delta layers to generate model predictions.
  • the prediction model can be a machine-learned model that has been trained to generate predictions based on the output image(s).
  • the prediction model could be a convolutional neural network (CNN), a recurrent neural network (RNN), a fully connected network, or any other type of machine learning model.
  • the prediction model can generate various types of model predictions.
  • the model predictions could include class labels, probabilities, scores, or other types of predictions.
  • the specific type of model prediction could depend on the task at hand. For instance, if the task is image classification, the model prediction could be a class label that indicates the category of the input image. If the task is object detection, the model prediction could be a set of bounding boxes that indicate the locations of objects in the input image.
  • the prediction model can be implemented in various ways. For instance, the prediction model could be implemented as a standalone component that operates independently of the delta layers. Alternatively, the prediction model could be integrated into the Delta-Net architecture, operating in conjunction with the delta layers (e.g., trained jointly therewith). For example, the output of a delta layer could be used as an input to the prediction model, allowing the prediction model to generate model predictions based on the intermediate result(s) of the image processing.
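  • As a hypothetical example of such integration, a simple classification head could consume a delta layer's feature image; the pooling-plus-linear design here is an assumption, since the disclosure leaves the prediction model's form open:
```python
import torch
import torch.nn as nn

class FeatureClassifier(nn.Module):
    def __init__(self, feature_channels: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(feature_channels, num_classes)

    def forward(self, feature_image: torch.Tensor) -> torch.Tensor:
        pooled = feature_image.mean(dim=(-2, -1))  # global average pool over spatial dims
        return self.head(pooled)                   # per-class scores
```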
  • Figure 2 provides a graphical diagram of an exemplary training scheme, referred to as 2000, for training a multi-scale image processing model (e.g., a Delta-Net model) using a set of training data 2001.
  • the training scheme 2000 begins by obtaining a training tuple 2002 from the set of training data 2001.
  • the training tuple 2002 can include an input image 2004 and corresponding ground truth data 2006.
  • the input image 2004 can be any image that requires restoration or enhancement, similar to the input image 1002 discussed in relation to Figure 1.
  • the ground truth data 2006 represents the ideal output for the given input image, and this data could be a perfectly restored version of the degraded input image or any other data that represents the desired output (e.g., a prediction such as a classification or detection prediction).
  • the ground truth data 2006 can be obtained in various ways. For instance, it could be manually created by a human operator. Alternatively, it could be automatically generated by a computer program or algorithm. The ground truth data 2006 could also be obtained from a database or other data source. In some instances, the ground truth data 2006 could be obtained through a combination of these methods. For example, an initial set of ground truth data could be manually created by a human operator, and then refined or supplemented by a computer program or algorithm.
  • the training tuple 2002 serves as the basis for training the multi-scale image processing model 2008 of the Delta-Net system. Specifically, the input image 2004 is processed by the multi-scale image processing model 2008 to generate a prediction 2010, and the resulting prediction 2010 is compared to the ground truth data 2006. This comparison allows the system to evaluate its performance and adjust its parameters to improve future outputs.
  • the computing system accesses a machine-learned multi-scale image processing network 2008, which can be the Delta-Net architecture discussed in relation to Figure 1.
  • This network operates on respective versions of the input image 2004 in a sequence of increasing image scales, processing each version with a sequence of layers that include one or more delta layers.
  • the processing of the input image 2004 by the network 2008 generates a model prediction 2010.
  • This prediction represents the system's output for the given input image.
  • the model prediction could be an image that has been restored or enhanced by the system, a set of features extracted from the image, or any other output generated by the system.
  • the specific type of model prediction will depend on the task at hand and the configuration of the Delta- Net system.
  • the processing of the input image 2004 by the network 2008 can include several steps. For example, for each of the delta layers in the network, the system obtains a respective version of the input image 2004 that corresponds to a respective scale in the sequence of increasing image scales. This version of the input image is then processed with a set of parameters to generate an intermediate image.
  • the system evaluates one or more loss functions 2012.
  • One or more of these loss function(s) can measure the difference between the model prediction 2010 and the ground truth data 2006.
  • the loss functions 2012 can provide a measure of how well the system's prediction 2010 matches the desired output, as represented by the ground truth data 2006.
  • the loss functions 2012 can be designed to measure various types of discrepancies between the model prediction 2010 and the ground truth data 2006. For instance, the loss functions could measure the mean squared error, the cross-entropy, the Kullback-Leibler divergence, or any other suitable discrepancy measure. The specific types of loss functions used can depend on the task at hand and the configuration of the Delta-Net system.
  • the system modifies one or more parameter values of one or more of the sets of parameters of the delta layers of the network 2008. This modification can involve adjusting the values of the parameters to reduce the discrepancy between the model prediction 2010 and the ground truth data 2006.
  • the specific method of modification can depend on the optimization algorithm used by the system.
  • the optimization algorithm used to modify the parameters of the machine- learned multi-scale image processing network 2008 can be any suitable algorithm designed, for example, to minimize the loss functions 2012.
  • the optimization algorithm could be a gradient descent algorithm, a stochastic gradient descent algorithm, or any other suitable optimization algorithm.
  • the specific optimization algorithm used can depend on the configuration of the network 2008 and the nature of the task at hand.
  • the modification of the parameters of the network 2008 could involve updating the parameters (e.g., weights and biases) of the neural network. This could involve, for example, applying a learning rate to the gradient of the loss function(s) 2012 with respect to the parameters and then subtracting this value from the current parameter values. This process could be repeated iteratively until the loss function(s) 2012 reaches a minimum value, or until some other stopping criterion is met.
  • the learning rate used to update the parameters (e.g., weights and biases) could be a fixed value.
  • the learning rate could be dynamically adjusted based on the progress of the training process. For example, the learning rate could be increased if the loss function is decreasing too slowly, or decreased if the loss begins to oscillate or plateau.
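  • Written out explicitly, the basic update described above is the standard gradient step; in practice a framework optimizer such as torch.optim.SGD performs this, and model and learning_rate are assumed to be in scope:
```python
import torch

def gradient_step(model, learning_rate: float):
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= learning_rate * p.grad  # subtract the scaled gradient from each parameter
```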
  • the modification of the parameters could involve other types of updates, such as updates to the filter coefficients of the high pass image filter, updates to the structure of the delta layers or the U-Net architecture, or updates to any other parameters or components of the Delta-Net system.
  • the corresponding ground truth data 2006 can consist of a ground truth restored image, which serves as the ideal or desired output for the degraded input image.
  • the system processes the degraded input image, it generates a model prediction 2010, which represents the system's output for the given input image.
  • the model prediction 2010 can comprise a predicted restored image output by a final delta layer of one or more delta layers in the multi-scale image processing network 2008.
  • the training scheme 2000 can apply a multi-teacher model distillation training approach.
  • This approach can scale up the process of real data curation while also leveraging the expertise of multiple teacher models.
  • the multi-teacher model distillation training approach can include two primary steps, which can either be performed in sequence or iteratively until a desired level of performance is achieved.
  • a second step can include human curation, where a human operator can curate a high-quality real blurry image dataset with a reasonably good reference image to supplement the synthetic dataset.
  • the aim is to distill multiple teachers (e.g., candidate models) having different deblurring capabilities into a single model by training on the curated dataset.
  • the process of curating a high-quality real blurry image dataset can involve various steps. For instance, the human operator could select a set of blurry images that exhibit a wide range of blur types, levels, and patterns. The operator could then manually enhance or restore these images to create the reference images. These reference images can serve as the ground truth data 2006 for the training process.
  • the multi-teacher model distillation training approach can be implemented in various ways.
  • the system could use a weighted combination of the teacher models' outputs as the target for the training process.
  • the system could use a voting scheme, where each teacher model's output is considered as a vote and the most common output is chosen as the target.
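  • A sketch of forming a distillation target as a weighted combination of teacher outputs follows; the weights are assumptions, and the voting scheme mentioned above would replace the weighted sum with a selection rule:
```python
def distillation_target(teachers, weights, blurry_image):
    """Blend the outputs of multiple teacher deblurring models into one target."""
    outputs = [teacher(blurry_image) for teacher in teachers]
    return sum(w * out for w, out in zip(weights, outputs))
```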
  • the training scheme 2000 utilizes a loss function 2012 that incorporates multiple terms. This loss function is designed to guide the Delta-Net system in generating more accurate and nuanced image restorations or enhancements.
  • the training scheme 2000 of the Delta-Net system can leverage a unique combination of loss functions 2012, including L1 loss, projected distribution loss (e.g., 1D-Wasserstein distances between CNN activations), adversarial loss, and/or other loss(es). By optimizing these loss functions 2012, the system can effectively learn to generate accurate and detailed image restorations or enhancements. This makes the Delta-Net system a versatile and powerful tool for a wide range of image processing tasks.
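  • A hedged sketch of such a multi-term objective is given below, with illustrative weights and with the projected distribution term and discriminator supplied as externally defined callables (pdl_fn and disc are hypothetical names):
```python
import torch

def total_loss(pred, target, disc, pdl_fn, w_l1=1.0, w_pdl=0.1, w_adv=0.01):
    l1 = (pred - target).abs().mean()                          # L1 reconstruction loss
    pdl = pdl_fn(pred, target)                                 # e.g., 1D-Wasserstein over CNN activations
    adv = -torch.log(torch.sigmoid(disc(pred)) + 1e-8).mean()  # non-saturating adversarial term
    return w_l1 * l1 + w_pdl * pdl + w_adv * adv
```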
  • the training scheme 2000 for the Delta-Net architecture can incorporate a mix of various sources of data for a comprehensive and robust training process. In some example implementations, two specific types of data sources are utilized: synthetically degraded data and real blur-sharp image pairs captured with different camera configurations.
  • Synthetically degraded data refers to images that have been artificially altered or degraded in order to simulate various forms of image degradation. These alterations allow for the creation of a wide range of degraded images that cover a comprehensive array of potential degradation scenarios. The corresponding ground truth data 2006 for these synthetically degraded images would be the original, unaltered image.
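  • One simple way to synthesize such a degraded/clean training pair is sketched below, using a horizontal motion-blur kernel plus Gaussian noise; the kernel size and noise level are arbitrary illustrative values:
```python
import torch
import torch.nn.functional as F

def synthesize_pair(clean: torch.Tensor, ksize: int = 9, sigma: float = 0.01):
    kernel = torch.zeros(1, 1, ksize, ksize)
    kernel[0, 0, ksize // 2, :] = 1.0 / ksize               # horizontal motion-blur kernel
    c = clean.shape[1]
    blurred = F.conv2d(clean, kernel.repeat(c, 1, 1, 1),
                       padding=ksize // 2, groups=c)        # depthwise blur per channel
    degraded = blurred + sigma * torch.randn_like(blurred)  # add sensor-like noise
    return degraded, clean                                  # (input image, ground truth)
```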
  • Real blur-sharp image pairs captured with different camera configurations are actual captured images that have not been artificially altered. These images are captured under various camera settings, introducing real-world scenarios into the training process. These real-world scenarios can involve a multitude of factors such as motion blur, defocus blur, and other types of degradation that can occur naturally in photography.
  • a final training data set 2001 can encompass training data from either or both of the classes of data described above. Additionally or alternatively, the training data set 2001 can include training examples derived from a distillation process utilizing multiple models. The mix of synthetically degraded data, real blur-sharp image pairs, and the multi-teacher model distillation process provides a comprehensive and robust training scheme 2000 for the Delta-Net architecture. This training scheme allows the Delta-Net architecture to learn and adapt to a wide range of image degradation scenarios, while also leveraging the expertise of multiple models, effectively enhancing its image restoration capabilities.
  • Figure 3A depicts a block diagram of an example computing system 100 according to example embodiments of the present disclosure.
  • the system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 that are communicatively coupled over a network 180.
  • the user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
  • the user computing device 102 includes one or more processors 112 and a memory 114.
  • the one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 114 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations.
  • the user computing device 102 can store or include one or more machine-learned models 120.
  • the machine-learned models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models.
  • Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks.
  • Some example machine-learned models can leverage an attention mechanism such as self-attention.
  • some example machine-learned models can include multi-headed self-attention models (e.g., transformer models).
  • Example machine-learned models 120 are discussed with reference to Figures 1 and 2.
  • the one or more machine-learned models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112.
  • the user computing device 102 can implement multiple parallel instances of a single machine-learned model 120 (e.g., to perform parallel image processing across multiple instances of images).
  • one or more machine-learned models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship.
  • the machine-learned models 140 can be implemented by the server computing system 130 as a portion of a web service (e.g., an image processing or image restoration service).
  • one or more models 120 can be stored and implemented at the user computing device 102 and/or one or more models 140 can be stored and implemented at the server computing system 130.
  • the user computing device 102 can also include one or more user input components 122 that receives user input.
  • the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus).
  • the touch-sensitive component can serve to implement a virtual keyboard.
  • Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.
  • the server computing system 130 includes one or more processors 132 and a memory 134.
  • the one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 134 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.
  • the server computing system 130 includes or is otherwise implemented by one or more server computing devices.
  • where the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
  • the server computing system 130 can store or otherwise include one or more machine-learned models 140.
  • the models 140 can be or can otherwise include various machine-learned models.
  • Example machine-learned models include neural networks or other multi-layer non-linear models.
  • Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.
  • Some example machine-learned models can leverage an attention mechanism such as self-attention.
  • some example machine-learned models can include multi-headed self-attention models (e.g., transformer models).
  • Example models 140 are discussed with reference to Figures 1 and 2.
  • the user computing device 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180.
  • the training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.
  • the training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors.
  • a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function).
  • Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions.
  • Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations.
  • performing backwards propagation of errors can include performing truncated backpropagation through time.
  • the model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
  • the model trainer 160 can train the machine-learned models 120 and/or 140 based on a set of training data 162.
  • the training examples can be provided by the user computing device 102.
  • the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model.
  • the model trainer 160 includes computer logic utilized to provide desired functionality.
  • the model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor.
  • the model trainer 160 includes program files stored on a storage device, loaded into a memory and executed by one or more processors.
  • the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.
  • the network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links.
  • communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
  • the input to the machine-learned model(s) of the present disclosure can be image data.
  • the machine-learned model(s) can process the image data to generate an output.
  • the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an image segmentation output.
  • the machine- learned model(s) can process the image data to generate an image classification output.
  • the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an upscaled image data output.
  • the machine-learned model(s) can process the image data to generate a prediction output.
  • the input includes visual data and the task is a computer vision task.
  • the input includes pixel data for one or more images and the task is an image processing task.
  • the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class.
  • the image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest.
  • the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories.
  • the set of categories can be foreground and background.
  • the set of categories can be object classes.
  • the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value.
  • the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
  • Figure 3A illustrates one example computing system that can be used to implement the present disclosure.
  • the user computing device 102 can include the model trainer 160 and the training dataset 162.
  • the models 120 can be both trained and used locally at the user computing device 102.
  • the user computing device 102 can implement the model trainer 160 to personalize the models 120 based on user-specific data.
  • FIG. 3B depicts a block diagram of an example computing device 10 that performs according to example embodiments of the present disclosure.
  • the computing device 10 can be a user computing device or a server computing device.
  • the computing device 10 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components.
  • each application can communicate with each device component using an API (e.g., a public API).
  • the API used by each application is specific to that application.
  • FIG. 3C depicts a block diagram of an example computing device 50 that performs according to example embodiments of the present disclosure.
  • the computing device 50 can be a user computing device or a server computing device.
  • the computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
  • the central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 3C, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50.
  • the central intelligence layer can communicate with a central device data layer.
  • the central device data layer can be a centralized repository of data for the computing device 50. As illustrated in Figure 3C, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).
  • An example computing system can serve as a platform for implementing the Delta-Net, a new neural network architecture for multiscale image restoration.
  • This system comprises one or more processors and one or more non-transitory computer-readable media.
  • the processors can include, for example, central processing units (CPUs), graphics processing units (GPUs), or specialized hardware accelerators for machine learning tasks.
  • the non-transitory computer-readable media can store the machine-learned multi-scale image processing network and other necessary software components.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure provides a new neural network architecture and associated framework and methodology for multiscale image restoration or other forms of image processing tasks. Example implementations of the proposed neural network architecture can be referred to as Delta-Net. This proposed technology utilizes a machine-learned multi-scale image processing network that operates on increasing scales of an input image. The architecture begins by processing the image at the coarsest resolution, then proceeds to higher scale levels. For instance, it can start with a low-resolution image and gradually refine the details to generate a high-resolution output.

Description

MULTI-SCALE IMAGE PROCESSING NETWORK
FIELD
[0001] The present disclosure relates generally to image processing. More particularly, the present disclosure relates to a multi-scale image processing network for performing various image processing tasks such as, for example, image restoration or "de-blurring".
BACKGROUND
[0002] Image processing tasks, particularly image restoration, involve a series of computational procedures aimed at recovering an original image from a degraded version. Degradation can arise from various causes such as noise, motion blur, atmospheric turbulence, etc. Image restoration often involves complex techniques such as deblurring, denoising, super-resolution, and more. Image restoration is crucial in fields where image quality significantly impacts results, such as remote sensing, medical imaging, surveillance, and digital photography.
[0003] Existing neural networks that are used for image restoration face challenges when processing high-resolution images, where the need to accurately restore details at various scales is critical. Traditional neural networks often struggle with the efficient processing of high-resolution images as this typically requires large amounts of computational resources, making the task time-consuming and computationally expensive.
[0004] Furthermore, existing neural networks typically operate on a single scale of an image, which may result in loss of information or details when transitioning between different scales. This can be particularly problematic for tasks that require a comprehensive understanding of the image context, such as image segmentation.
[0005] Another related issue is the restoration of images that have undergone various types of degradation, such as motion blur or out-of-focus blur. The restoration of such images requires the system to accurately identify and restore both the overall image structure and finer details, a task that many current systems struggle to perform satisfactorily.
SUMMARY
[0006] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.
[0007] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a computing system for performing image processing. The computing system also includes one or more processors. The system also includes one or more non-transitory computer-readable media that store a machine-learned multi-scale image processing network, the machine-learned multi-scale image processing network may include a sequence of layers arranged to operate on respective versions of an input image in a sequence of increasing image scales, the sequence of layers may include one or more delta layers. Each of the one or more delta layers can be configured to: obtain a respective version of the input image having a respective scale in the sequence of increasing image scales; process the respective version of the input image with a set of learned parameters to generate an intermediate image; apply a high pass image filter to the intermediate image to generate a filtered image; and generate a respective output image having the respective scale by combining the filtered image with an upscaled version of the respective output image from a preceding layer in the sequence of layers. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0008] Implementations may include one or more of the following features. The computing system where the high pass image filter may include a Laplacian filter. The machine-learned multi-scale image processing network may include an image restoration model, and where the input image may include a degraded image and the respective output image from a final layer in the sequence of layers may include a restored image. The sequence of layers may include a coarsest scale layer as an initial layer in the sequence of layers, where the one or more delta layers follow the coarsest scale layer in the sequence of layers, and where the coarsest scale layer does not apply the high pass filter. For each of the one or more delta layers, the set of learned parameters is structured in a U-Net architecture. The one or more delta layers may include a plurality of delta layers. Two or more of the plurality of delta layers may share the same set of learned parameters. The input image may have a first scale, and where the first scale is a largest scale in the sequence of increasing image scales. The machine-learned multi-scale image processing network further may include a downscaling block configured to downscale the input image from the first scale to generate the respective versions of the input image at the sequence of increasing image scales. Each of the one or more delta layers may be configured to concatenate and then process with the set of learned parameters the respective version of the input image and the upscaled version of the respective output image from the preceding layer in the sequence of layers to generate the intermediate image. The respective output image generated by each of the one or more delta layers may include a restored version of the input image. The one or more non-transitory computer-readable media further store a machine-learned prediction model configured to process one or more of the respective output images generated by one or more of the delta layers to generate one or more model predictions. The respective output image generated by each of the one or more delta layers may include a feature image that may include latent feature values expressed in a latent dimensional space. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0009] One general aspect includes a computer-implemented method to train a multiscale image processing network. The computer-implemented method includes obtaining, by a computing system that may include one or more computing devices, a training tuple that may include an input image and ground truth data. The method also includes accessing, by the computing system, the multi-scale image processing network, the machine-learned multi-scale image processing network may include a sequence of layers arranged to operate on respective versions of the input image in a sequence of increasing image scales, the sequence of layers may include one or more delta layers. The method also includes processing, by the computing system, the input image with the multi-scale image processing network to generate a model prediction, where processing the input image with the multi-scale image processing network may include the following operations for each of the one or more delta layers. The method also includes obtaining a respective version of the input image having a respective scale in the sequence of increasing image scales. The method also includes processing the respective version of the input image with a set of parameters to generate an intermediate image. The method also includes applying a high pass image filter to the intermediate image to generate a filtered image. The method also includes generating a respective output image having the respective scale by combining the filtered image with an upscaled version of the respective output image from a preceding layer in the sequence of layers. The method also includes evaluating, by the computing system, a loss function that compares the model prediction to the ground truth data. The method also includes modifying, by the computing system, one or more parameter values of one or more of the sets of parameters of the one or more delta layers based on the loss function. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0010] Implementations may include one or more of the following features. The computer-implemented method where the input image may include a degraded image, where the ground truth data may include a ground truth restored image, and where the model prediction may include a predicted restored image output by a final delta layer of the one or more delta layers. The loss function may include a multi-teacher model distillation loss term. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0011] One general aspect includes one or more non-transitory computer-readable media that collectively store a machine-learned multi-scale image restoration network. The one or more non-transitory computer-readable media also include instructions to perform operations. The instructions cause a computing system to obtain a respective version of the input image having a respective scale in the sequence of increasing image scales. The instructions cause the computing system to process the respective version of the input image with a set of learned parameters to generate an intermediate image. The instructions cause the computing system to apply a high pass image filter to the intermediate image to generate a filtered image. The instructions cause the computing system to generate a respective output image having the respective scale by combining the filtered image with an upscaled version of the respective output image from a preceding layer in the sequence of layers. The respective output image of a final layer of the one or more delta layers may include a restored version of the input image. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0012] Implementations may include one or more of the following features. The one or more non-transitory computer-readable media where the high pass image filter may include a Laplacian filter. The one or more delta layers may include a plurality of delta layers, and where at least two of the plurality of delta layers share the same set of learned parameters. For each of the one or more delta layers, the set of learned parameters may be structured in a U-Net architecture. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0013] Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices.
[0014] These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which:
[0016] Figure 1 depicts a graphical diagram of an example multi-scale image processing model according to example embodiments of the present disclosure.
[0017] Figure 2 depicts a graphical diagram of a training scheme for training a multiscale image processing model.
[0018] Figure 3A depicts a block diagram of an example computing system according to example embodiments of the present disclosure.
[0019] Figure 3B depicts a block diagram of an example computing device according to example embodiments of the present disclosure.
[0020] Figure 3C depicts a block diagram of an example computing device according to example embodiments of the present disclosure.
DETAILED DESCRIPTION
[0021] Example aspects of the present disclosure are directed to a new neural network architecture and associated framework and methodology for multiscale image restoration or other forms of image processing tasks. Example implementations of the proposed neural network architecture can be referred to as Delta-Net. This proposed technology utilizes a machine-learned multi-scale image processing network that operates on increasing scales of an input image. The architecture begins by processing the image at the coarsest resolution, then proceeds to higher scale levels. For instance, it can start with a low-resolution image and gradually refine the details to generate a high-resolution output.
[0022] According to an aspect of the present disclosure, the Delta-Net architecture can employ delta layers within the network. Each delta layer can be configured to process a version of the input image at a particular scale, using a set of learned parameters. This could involve, for example, processing a mid-scale version of the image using parameters learned from training data. The result is an intermediate image that is then processed by further components in the delta layer.
[0023] In some implementations, the learned parameters in each delta layer can be structured in a U-Net architecture. This structure is a type of convolutional neural network that is particularly effective for tasks that require understanding the context, such as image segmentation. By using a U-Net architecture, the Delta-Net could potentially generate more precise and contextually accurate image restorations.
[0024] The respective intermediate image generated by the learned parameters in each delta layer can be subjected to a high pass filter, which could, for example, be a Laplacian filter. This filter can be applied to generate a filtered image. The high pass filter can isolate details in the image which exhibit a higher frequency.
[0025] After filtering, the delta layer can combine the filtered image with an upscaled version of the output image from the preceding layer. This combination generates a respective output image at the current scale. For example, a filtered mid-scale image could be combined with an upscaled version of a low-scale output image to produce a mid-scale output image. This process ensures that the learned component from each layer generates additional details with respect to the previous lower-resolution output.
[0026] The Delta-Net architecture can include a sequence of delta layers, which follow a coarsest scale layer. In particular, in some implementations, the coarsest scale layer does not apply the high pass filter. This layer serves as the initial layer in the sequence of layers, with the delta layers following in sequence. For example, an initial layer might process the image at its lowest resolution, and subsequent delta layers would then refine the image at progressively higher resolutions.
[0027] Thus, the delta layers within the Delta-Net architecture can include multiple layers. For instance, the network might feature several delta layers, each processing the image at a different scale. This multi-layer approach allows for a more detailed and nuanced image restoration process. Furthermore, in some implementations, two or more of these delta layers could share the same set of learned parameters, which might streamline the processing and reduce computational demands.
[0028] In some implementations, the input image has the largest scale in the sequence of increasing image scales. In this case, the network can include a downscaling block to generate the respective versions of the input image at the sequence of scales. For example, a high-resolution input image could be downscaled to create lower-resolution versions for processing by the delta layers.
[0029] In some implementations, each delta layer in the network can concatenate and process the respective version of the input image and the upscaled version of the respective output image from the preceding layer. This combination is then processed with the set of learned parameters to generate the intermediate image. This approach ensures that each layer incorporates information from both the current scale and the previous scale, potentially improving the quality of the final restored image.
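As a concrete illustration of this per-layer data flow, the following is a minimal, hypothetical PyTorch sketch of a single delta layer. All names here (DeltaLayer, backbone, high_pass) are assumptions introduced for illustration, the backbone stands in for the set of learned parameters (e.g., a U-Net), and the additive combination of the filtered image with the upscaled previous output is one plausible reading of "combining":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeltaLayer(nn.Module):
    """Hypothetical sketch of one delta layer: concatenate, process, filter, combine."""

    def __init__(self, backbone: nn.Module, high_pass: nn.Module):
        super().__init__()
        self.backbone = backbone    # set of learned parameters (e.g., a U-Net)
        self.high_pass = high_pass  # high pass image filter (e.g., a Laplacian)

    def forward(self, image_at_scale: torch.Tensor,
                prev_output: torch.Tensor) -> torch.Tensor:
        # Upscale the preceding layer's output to the current scale.
        upscaled_prev = F.interpolate(prev_output, size=image_at_scale.shape[-2:],
                                      mode="bilinear", align_corners=False)
        # Concatenate the current-scale input with the upscaled previous output,
        # then process with the learned parameters to form the intermediate image.
        intermediate = self.backbone(
            torch.cat([image_at_scale, upscaled_prev], dim=1))
        # Keep only the high-frequency detail this layer is responsible for,
        # taking the low frequencies from the coarser scale's output.
        filtered = self.high_pass(intermediate)
        return filtered + upscaled_prev
```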
[0030] The proposed network can be applied to a number of different tasks. As examples, the respective output image generated by each delta layer could either be a restored version of the input image or can alternatively be a feature image comprising latent feature values expressed in a latent dimensional space. The latter could be useful for other machine learning tasks, such as object detection or image classification. For example, the system can further include a machine-learned prediction model that processes one or more of the respective output images to generate model predictions.
[0031] The systems and methods of the present disclosure provide a number of technical effects and benefits. One technical benefit provided by the present disclosure is a novel neural network architecture, Delta-Net, which can be used for multiscale image restoration. This technology can significantly enhance the performance of computing systems in several ways. For instance, the Delta-Net architecture can improve the efficiency of image processing tasks by operating on increasing scales of an input image. This multiscale approach allows the system to start with a low-resolution image and gradually refine the details to generate a high-resolution output. This can lead to a more efficient use of computational resources, as the system can focus on refining details instead of processing the entire image at a high resolution from the start.
[0032] The proposed techniques also provide a unique benefit of the Delta layers predicting higher frequency details while leveraging the lower frequency details from the preceding layer. This approach ensures that each layer contributes to the final output in a meaningful and complementary way. The application of high-pass filters at each scale allows the Delta layer to focus on extracting and enhancing high-frequency details — the finer textures and edges — that are often lost in degraded images. By replacing the missing low- frequency components with the output of the previous scale, the system ensures continuity and coherence across scales. This process capitalizes on the inherent hierarchical structure of images, where low-frequency components provide the basic structure and high-frequency components add the finer details. The result is a more consistent and scale-invariant image restoration output. This method is particularly advantageous in handling various types of image degradation, such as motion blur or out-of-focus blur, where both the overall image structure and finer details need to be restored for a high-quality output. This architecture, therefore, enables a more nuanced and comprehensive approach to image restoration, effectively addressing the challenges presented by multiscale and multifaceted image degradation.
[0033] In addition to enhancing the performance of computing systems, the present disclosure can also enable new functionalities. For example, the respective output image generated by each delta layer could be a feature image comprising latent feature values expressed in a latent dimensional space. This could be useful for other machine learning tasks, such as object detection or image classification. Furthermore, the system could include a machine-learned prediction model that processes one or more of the respective output images to generate model predictions. This could potentially enable the system to perform multiple tasks simultaneously, such as restoring an image and identifying objects within the image.
[0034] Overall, the present disclosure offers a versatile and efficient solution for image restoration tasks. By leveraging a multiscale approach and a novel neural network architecture, the system can increase the performance of computing systems in image processing tasks, while also enabling new functionalities.
[0035] With reference now to the Figures, example embodiments of the present disclosure will be discussed in further detail.
Example Multi-Scale Model Architecture
[0036] Figure 1 provides a graphical diagram of an example multi-scale image processing model, referenced as 1000. This figure illustrates the process by which some example implementations of the Delta-Net system operate, from an initial input image 1002 to a final output image 1040, which, in some implementations, may be a restored version of the input image 1002.
[0037] The image restoration process begins with the input image, labeled as 1002 in Figure 1. This image serves as the initial data for the Delta-Net system, and can be any image that requires restoration. For example, the input image 1002 may exhibit blur, compression artifacts, or other forms of image degradation.
[0038] The input image 1002 can be any digital image that needs to be enhanced or restored. For example, it could be a photograph taken with a digital camera that has been blurred due to motion or out-of-focus issues. Alternatively, the input image could be a frame from a video that has been degraded due to compression artifacts. The input image could also be a scanned document that needs to be enhanced for better readability.
[0039] The input image 1002 can be in any suitable format that can be processed by the model 1000. For instance, it could be a bitmap image, a vector image, or a raw image file. The input image could also be in any suitable color space, such as RGB, grayscale, or a color space designed for a specific application. The system can be configured to convert the input image to a suitable format and color space for processing, if necessary.
[0040] The input image 1002 can also be subjected to various preprocessing steps before being processed by the Delta-Net architecture. For instance, the system could apply noise reduction, color correction, or other image enhancement techniques to the input image. These preprocessing steps can help to improve the quality of the final restored image.
[0041] In some implementations, the input image 1002 could have the largest scale in the sequence of increasing image scales, thus it may need to be downscaled for further processing. The downscaling block, denoted as 1004, then processes the input image to generate downscaled versions. For instance, a high-resolution input image can be downscaled by this block to create lower-resolution versions, labeled as 1006 and 1008 in Figure 1. These downscaled images are then used as input by the respective layers in the Delta-Net system. Alternatively, the system could start with a low-resolution input image and use an upscaling technique to create higher-resolution versions of the image.
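For instance, a downscaling block like 1004 could be realized as a simple image pyramid built by repeated bilinear downsampling, as in this hypothetical sketch (the function name and the factor-of-two schedule are assumptions, not details from this disclosure):

```python
import torch
import torch.nn.functional as F

def build_pyramid(image: torch.Tensor, num_scales: int) -> list[torch.Tensor]:
    """Return versions of the image at decreasing scales, full resolution first."""
    pyramid = [image]
    for _ in range(num_scales - 1):
        image = F.interpolate(image, scale_factor=0.5,
                              mode="bilinear", align_corners=False)
        pyramid.append(image)
    return pyramid  # e.g., versions 1002, 1006, and 1008 for three scales
```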
[0042] More particularly, the illustrated Delta-Net architecture can operate across multiple different image scales. Image scales refer to the different resolutions at which an image can be processed in the network. As illustrated, the model 1000 can operate on respective versions of an input image in a sequence of increasing image scales, starting from a coarse resolution and gradually moving to higher resolutions. This multiscale approach allows the system to initially process the image at a lower resolution, which can potentially reduce computational demands and improve efficiency.
[0043] The number and range of image scales used in the Delta-Net architecture can vary based on the specific application. For example, for a simple image restoration task, the system might use a small number of scales, starting from a low resolution and gradually increasing to the original resolution of the input image. On the other hand, for a more complex task such as deblurring or denoising, the system might use a larger number of scales, starting from a very coarse resolution and increasing to a high resolution.
[0044] The specific resolutions used for the image scales can also vary. For instance, the system could use standard resolutions such as 480p, 720p, 1080p, and 4K for the image scales. Alternatively, the system could use custom resolutions based on the specific requirements of the task. The system could also dynamically determine the resolutions based on the characteristics of the input image.
[0045] The transition between different image scales can be handled in various ways. For instance, the system could use a smooth transition, gradually increasing the resolution from one scale to the next. Alternatively, the system could use a stepwise transition, abruptly changing the resolution at each scale. The choice between these methods could depend on factors such as the nature of the image degradation and the computational resources available.
[0046] Referring again to Figure 1, a coarsest layer, denoted as 1010, of the model 1000 can serve as an initial layer in the sequence of layers in the Delta-Net architecture. This layer 1010 processes the lowest resolution version 1008 of the image, which is the coarsest version in the sequence. Notably, this layer does not apply a high pass filter to the image, as shown in Figure 1. The output of this layer is denoted as output image_2 or 1012.
[0047] Following the coarsest layer, the Delta-Net system employs one or more delta layers. The first of these, delta layer_1, or 1014, processes the image 1006 at a higher resolution than the coarsest layer. It utilizes a set of learned parameters, 1016, to process the respective version 1006 of the image and generate an intermediate image, 1018. The intermediate image 1018 can be seen as a step in the image restoration process, representing the state of the image after it has been processed at a particular scale but before it has been filtered. The intermediate image 1018 can be in any suitable format that can be processed by the high pass image filter, such as a bitmap image or a raw image file.
[0048] In some implementations, the set of learned parameters 1016 can be structured as a U-Net architecture. The U-Net architecture can be implemented in various ways depending on the specific requirements of the image processing task. For example, the U-Net architecture could include a series of convolutional layers, followed by a series of deconvolutional layers. The convolutional layers can be used to extract features from the respective version of the input image, while the deconvolutional layers can be used to reconstruct the output image from the extracted features.
[0049] In some implementations, the U-Net architecture can also include skip connections between the convolutional and deconvolutional layers. These skip connections can help to preserve the high-frequency details in the image, which can be particularly beneficial for tasks such as image restoration. For instance, a skip connection might directly connect a convolutional layer to a corresponding deconvolutional layer, allowing the high-frequency details extracted by the convolutional layer to be directly incorporated into the reconstructed output image.
[0050] In some implementations, the U-Net architecture could potentially be implemented with different numbers of layers. For example, the U-Net architecture could include a small number of layers for simple image processing tasks, or a large number of layers for more complex tasks. The U-Net architecture could also include various types of layers, such as pooling layers, normalization layers, or activation layers. Each layer in the U-Net architecture could use different types of kernels or activation functions. For instance, the convolutional layers could use small kernels to extract fine-grained features, or large kernels to extract coarse-grained features. The activation functions could include linear functions, sigmoid functions, or rectified linear unit (ReLU) functions, among others.
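To make this concrete, the sketch below shows a deliberately small two-level U-Net with a single skip connection. The channel widths, depth, and ReLU activations are illustrative assumptions rather than the architecture prescribed by this disclosure:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A minimal two-level U-Net: encode, decode, one skip connection."""

    def __init__(self, in_ch: int, out_ch: int, width: int = 32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(
            nn.Conv2d(width, 2 * width, 3, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(
            nn.ConvTranspose2d(2 * width, width, 2, stride=2), nn.ReLU())
        self.out = nn.Conv2d(2 * width, out_ch, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)   # extract features at the layer's full resolution
        e2 = self.enc2(e1)  # downsample and extract coarser features
        d1 = self.dec1(e2)  # upsample back to the full resolution
        # Skip connection: concatenate encoder features so that
        # high-frequency details survive the bottleneck.
        return self.out(torch.cat([d1, e1], dim=1))
```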
[0051] Referring still to Figure 1, the layer delta layer_1, or 1014, then applies a high pass filter, 1020, to the intermediate image 1018, yielding a filtered image, 1022. This filter 1020 can potentially enhance the details in the image, particularly those that are of a higher frequency. The high pass image filter 1020 could be any suitable type of filter that is capable of enhancing high-frequency details. For instance, it could be a Laplacian filter, a Sobel filter, or any other type of high pass filter.
[0052] In particular, as one example, the filter 1020 can be a Laplacian filter. This filter can be designed to enhance the details in the image, particularly those that are of a higher frequency. The Laplacian filter can be implemented in various ways, depending on the specific requirements of the image processing task.
[0053] For instance, the Laplacian filter could be a 2D filter that operates on the spatial domain of the image. This type of filter could be particularly effective for enhancing edges and other high-frequency details in the image. Alternatively, the Laplacian filter could be a 3D filter that operates on both the spatial and temporal domains of the image. This could be beneficial for processing video frames, as it could help to enhance details that change over time.
[0054] The Laplacian filter could also be implemented as a convolutional filter. This involves convolving the filter with the intermediate image to generate the filtered image. The convolution operation could be performed using various methods, such as direct convolution, fast Fourier transform (FFT) based convolution, or separable convolution. The choice of convolution method could depend on factors such as the size of the image and the computational resources available.
[0055] The coefficients of the Laplacian filter could be predetermined or learned from training data. For instance, the filter could use a standard set of coefficients that are known to be effective for enhancing high-frequency details. Alternatively, the coefficients could be learned from training data, allowing the filter to adapt to the specific characteristics of the images being processed.
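As one example with fixed, predetermined coefficients, the standard 3x3 Laplacian kernel can be applied as a depthwise (per-channel) convolution. The helper below is a hypothetical sketch, not an implementation specified by this disclosure:

```python
import torch
import torch.nn.functional as F

def laplacian_filter(image: torch.Tensor) -> torch.Tensor:
    """Apply a fixed 3x3 Laplacian kernel to each channel independently."""
    kernel = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]],
                          device=image.device, dtype=image.dtype)
    channels = image.shape[1]
    # One copy of the kernel per channel (grouped / depthwise convolution).
    weight = kernel.view(1, 1, 3, 3).repeat(channels, 1, 1, 1)
    return F.conv2d(image, weight, padding=1, groups=channels)
```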
[0056] Referring still to Figure 1, the filtered image 1022 is the result of applying the high pass image filter 1020 to the intermediate image 1018. The filtered image 1022 represents the state of the image after the high-frequency details have been isolated. This filtered image 1022 can be in any suitable format. For instance, the filtered image could be a bitmap image, a vector image, or a raw image file.
[0057] After the high pass filter 1020 is applied, delta layer_1 1014 combines the filtered image 1022 with an upscaled version of the output image from the previous layer (output image_2, 1012 in this case) to generate output image_1, 1024. Thus, the output image 1024 is generated by combining the filtered image 1022 with an upscaled version of the output image 1012 from the preceding layer. This combination generates a respective output image 1024 at the current scale.
[0058] This process discussed with respect to delta layer_1 1014 can then be repeated in each of a number of subsequent delta layer(s), with each layer processing the image at a progressively higher scale.
[0059] The last layer, delta layer_0 or 1030, processes the input image 1002 at the highest resolution. This layer also uses a set of learned parameters, 1032, to generate an intermediate image, 1034. A high pass filter, 1036, is then applied to the intermediate image 1034 to generate a filtered image, 1038.
[0060] Finally, delta layer_0 combines the filtered image 1038 with an upscaled version of the output image from the previous layer (output image_1, 1024 in this case) to generate the final output image, labeled as output image_0 or 1040 in Figure 1. In some implementations, this final output image 1040 represents the restored version of the original input image 1002, and can in some cases be the end product of the Delta-Net image restoration process.
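Putting these pieces together, an end-to-end forward pass over the model 1000 might look like the hypothetical sketch below. It reuses the illustrative build_pyramid and DeltaLayer helpers sketched earlier, and models the coarsest scale layer 1010 as a plain backbone with no high pass filter:

```python
import torch
import torch.nn as nn

def delta_net_forward(input_image: torch.Tensor,
                      coarsest_backbone: nn.Module,
                      delta_layers: list) -> torch.Tensor:
    """Sketch of the multi-scale pass: coarsest scale first, finest scale last.

    delta_layers is assumed ordered finest-to-coarsest so that, as in
    Figure 1, delta layer_0 operates at the full input resolution.
    """
    num_scales = len(delta_layers) + 1
    pyramid = build_pyramid(input_image, num_scales)  # full-res first, coarsest last
    output = coarsest_backbone(pyramid[-1])           # coarsest layer 1010: no filter
    # Each delta layer refines the result at a progressively higher scale.
    for image_at_scale, layer in zip(reversed(pyramid[:-1]),
                                     reversed(delta_layers)):
        output = layer(image_at_scale, output)
    return output  # e.g., output image_0 (1040), the restored image
```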
[0061] More particularly, at each layer the output image (e.g., output images 1024, 1040, etc.) can be a restored version of the input image. However, in some implementations, in addition or alternatively to a restored image, the output image at each layer can be a feature image comprising latent feature values expressed in a latent dimensional space. A feature image, as referenced herein, represents a potential output of each delta layer in the Delta-Net architecture. This image can include latent feature values expressed in a latent dimensional space. The feature image can be a useful output of the image processing system, especially for other machine learning tasks. For instance, the feature image could be employed in tasks such as object detection, image classification, or even image segmentation.
[0062] The latent feature values in the feature image could represent various characteristics of the input image. For instance, these values could represent edges, textures, colors, or other visual features in the image. Alternatively, the latent feature values could represent higher-level features, such as the presence of specific objects or patterns in the image. The specific types of features represented could depend on the learned parameters used by the delta layer.
[0063] The feature image could be in any suitable format that can be processed by subsequent layers in the Delta-Net architecture or by other machine learning models. For instance, the feature image could be a bitmap image, a vector image, or a raw image file. The feature image could also be in any suitable color space, such as RGB, grayscale, or a color space designed for a specific application.
[0064] In some embodiments, one or more of the output images (e.g., output feature image(s)) can be processed further by a machine-learned prediction model (not illustrated). This prediction model could generate one or more model predictions based on the output image(s) (e.g., output feature image(s)). For instance, the prediction model could identify objects in the output image(s), classify the input image based on its features, or predict future states of the image based on its current features.
[0065] Thus, the prediction model is an optional component of the Delta-Net architecture. The prediction model can process one or more of the respective output images generated by one or more of the delta layers to generate model predictions. The prediction model can be a machine-learned model that has been trained to generate predictions based on the output image(s). For instance, the prediction model could be a convolutional neural network (CNN), a recurrent neural network (RNN), a fully connected network, or any other type of machine learning model.
[0066] The prediction model can generate various types of model predictions. For example, the model predictions could include class labels, probabilities, scores, or other types of predictions. The specific type of model prediction could depend on the task at hand. For instance, if the task is image classification, the model prediction could be a class label that indicates the category of the input image. If the task is object detection, the model prediction could be a set of bounding boxes that indicate the locations of objects in the input image.
[0067] The prediction model can be implemented in various ways. For instance, the prediction model could be implemented as a standalone component that operates independently of the delta layers. Alternatively, the prediction model could be integrated into the Delta-Net architecture, operating in conjunction with the delta layers (e.g., trained jointly therewith). For example, the output of a delta layer could be used as an input to the prediction model, allowing the prediction model to generate model predictions based on the intermediate result(s) of the image processing.
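As a sketch of this optional component, a small classification head consuming a delta layer's feature image might look as follows. The pooling-plus-linear design and all names here are assumptions for illustration, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Illustrative prediction model: feature image -> per-class scores."""

    def __init__(self, feature_channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapse spatial dimensions
        self.classifier = nn.Linear(feature_channels, num_classes)

    def forward(self, feature_image: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(feature_image).flatten(1)  # (batch, channels)
        return self.classifier(pooled)                # class logits
```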
Example Training Scheme
[0068] Figure 2 provides a graphical diagram of an exemplary training scheme, referred to as 2000, for training a multi-scale image processing model (e.g., a Delta-Net model) using a set of training data 2001.
[0069] The training scheme 2000 begins by obtaining a training tuple 2002 from the set of training data 2001. The training tuple 2002 can include an input image 2004 and corresponding ground truth data 2006. The input image 2004 can be any image that requires restoration or enhancement, similar to the input image 1002 discussed in relation to Figure 1. The ground truth data 2006 represents the ideal output for the given input image, and this data could be a perfectly restored version of the degraded input image or any other data that represents the desired output (e.g., a prediction such as a classification or detection prediction).
[0070] The ground truth data 2006 can be obtained in various ways. For instance, it could be manually created by a human operator. Alternatively, it could be automatically generated by a computer program or algorithm. The ground truth data 2006 could also be obtained from a database or other data source. In some instances, the ground truth data 2006 could be obtained through a combination of these methods. For example, an initial set of ground truth data could be manually created by a human operator, and then refined or supplemented by a computer program or algorithm.
[0071] The training tuple 2002 serves as the basis for training the multi-scale image processing model 2008 of the Delta-Net system. Specifically, the input image 2004 is processed by the multi-scale image processing model 2008 to generate a prediction 2010, and the resulting prediction 2010 is compared to the ground truth data 2006. This comparison allows the system to evaluate its performance and adjust its parameters to improve future outputs.
[0072] Specifically, the computing system accesses a machine-learned multi-scale image processing network 2008, which can be the Delta-Net architecture discussed in relation to Figure 1. This network operates on respective versions of the input image 2004 in a sequence of increasing image scales, processing each version with a sequence of layers that include one or more delta layers.
[0073] The processing of the input image 2004 by the network 2008 generates a model prediction 2010. This prediction represents the system's output for the given input image. The model prediction could be an image that has been restored or enhanced by the system, a set of features extracted from the image, or any other output generated by the system. The specific type of model prediction will depend on the task at hand and the configuration of the Delta-Net system.
[0074] The processing of the input image 2004 by the network 2008 can include several steps. For example, for each of the delta layers in the network, the system obtains a respective version of the input image 2004 that corresponds to a respective scale in the sequence of increasing image scales. This version of the input image is then processed with a set of parameters to generate an intermediate image.
[0075] A high-pass image filter is then applied to the intermediate image, yielding a filtered image. The filtered image is then combined with an upscaled version of the respective output image from the preceding layer in the sequence of layers. This combination generates a respective output image at the current scale.
[0076] The sequence of steps outlined above can be performed for each of the delta layers in the network. This iterative process allows the system to gradually refine the details of the input image at increasing scales, effectively restoring or enhancing the image.
[0077] Following the processing of the input image 2004 by the network 2008, the system evaluates one or more loss functions 2012. One or more of these loss function(s) can measure the difference between the model prediction 2010 and the ground truth data 2006. The loss functions 2012 can provide a measure of how well the system's prediction 2010 matches the desired output, as represented by the ground truth data 2006.
[0078] The loss functions 2012 can be designed to measure various types of discrepancies between the model prediction 2010 and the ground truth data 2006. For instance, the loss functions could measure the mean squared error, the cross-entropy, the Kullback-Leibler divergence, or any other suitable discrepancy measure. The specific types of loss functions used can depend on the task at hand and the configuration of the Delta-Net system.
[0079] Based on the evaluation of the loss functions 2012, the system modifies one or more parameter values of one or more of the sets of parameters of the delta layers of the network 2008. This modification can involve adjusting the values of the parameters to reduce the discrepancy between the model prediction 2010 and the ground truth data 2006. The specific method of modification can depend on the optimization algorithm used by the system.
[0080] The optimization algorithm used to modify the parameters of the machine-learned multi-scale image processing network 2008 can be any suitable algorithm designed, for example, to minimize the loss functions 2012. For instance, the optimization algorithm could be a gradient descent algorithm, a stochastic gradient descent algorithm, or any other suitable optimization algorithm. The specific optimization algorithm used can depend on the configuration of the network 2008 and the nature of the task at hand.
[0081] In some embodiments of the present disclosure, the modification of the parameters of the network 2008 could involve updating the parameters (e.g., weights and biases) of the neural network. This could involve, for example, applying a learning rate to the gradient of the loss function(s) 2012 with respect to the parameters and then subtracting this value from the current parameter values. This process could be repeated iteratively until the loss function(s) 2012 reaches a minimum value, or until some other stopping criterion is met.
[0082] In some instances, the learning rate used to update the parameters (e.g., weights and biases) could be a fixed value. Alternatively, the learning rate could be dynamically adjusted based on the progress of the training process. For example, the learning rate could be decreased as the loss begins to plateau or oscillate, or increased if the loss function is decreasing too slowly.
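A minimal training step consistent with this description might look like the sketch below. The L1 reconstruction loss and the Adam optimizer are illustrative choices only, since the disclosure leaves the loss functions 2012 and the optimization algorithm open:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, input_image, ground_truth):
    """One illustrative iteration of the training scheme 2000."""
    optimizer.zero_grad()
    prediction = model(input_image)             # model prediction 2010
    loss = F.l1_loss(prediction, ground_truth)  # compare to ground truth 2006
    loss.backward()                             # gradients of the loss function 2012
    optimizer.step()                            # modify the delta layers' parameters
    return loss.item()

# Usage sketch:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# for input_image, ground_truth in data_loader:
#     train_step(model, optimizer, input_image, ground_truth)
```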
[0083] In other embodiments, the modification of the parameters could involve other types of updates, such as updates to the filter coefficients of the high pass image filter, updates to the structure of the delta layers or the U-Net architecture, or updates to any other parameters or components of the Delta-Net system.
[0084] In some implementations, when the input image 2004 comprises a degraded image that requires restoration or enhancement, the corresponding ground truth data 2006 can consist of a ground truth restored image, which serves as the ideal or desired output for the degraded input image. As the system processes the degraded input image, it generates a model prediction 2010, which represents the system's output for the given input image. The model prediction 2010 can comprise a predicted restored image output by a final delta layer of one or more delta layers in the multi-scale image processing network 2008.
[0085] According to another aspect of the present disclosure, in some implementations, the training scheme 2000 can apply a multi-teacher model distillation training approach. This approach can scale up the process of real data curation while also leveraging the expertise of multiple teacher models. The multi-teacher model distillation training approach can include two primary steps, which can either be performed in sequence or iteratively until a desired level of performance is achieved.
[0086] A first step can include automatic quality evaluation using metrics. A computing system, comprising one or more computing devices, can use one or more quality metrics to evaluate the quality of the input images 2004 and the corresponding model predictions 2010. The specific quality metrics used can depend on the task at hand and the configuration of the network 2008.
[0087] A second step can include human curation, where a human operator can curate a high-quality real blurry image dataset with a reasonably good reference image to supplement the synthetic dataset. The aim is to distill multiple teachers (e.g., candidate models) having different deblurring capabilities into a single model by training on the curated dataset.
[0088] The process of curating a high-quality real blurry image dataset can involve various steps. For instance, the human operator could select a set of blurry images that exhibit a wide range of blur types, levels, and patterns. The operator could then manually enhance or restore these images to create the reference images. These reference images can serve as the ground truth data 2006 for the training process.
[0089] The distillation of multiple teachers into a single model during the training process allows the system to leverage the expertise of multiple candidate models. Each of these models can have different deblurring capabilities, potentially complementing each other. By training on the curated dataset using the multi-teacher model distillation approach, the system can effectively learn to combine the strengths of the teacher models, resulting in a more capable image restoration model.
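As a hypothetical illustration, a distillation target could be formed by blending the teachers' restored outputs, for example with fixed per-teacher weights (the weighted-combination variant discussed in the next paragraph); the weights here are assumed hyperparameters:

```python
import torch

def distillation_target(teacher_outputs: list[torch.Tensor],
                        weights: list[float]) -> torch.Tensor:
    """Blend several teachers' restorations into one training target.

    teacher_outputs: restored images from the candidate deblurring models.
    weights: assumed per-teacher weights, e.g., summing to one.
    """
    target = torch.zeros_like(teacher_outputs[0])
    for output, weight in zip(teacher_outputs, weights):
        target = target + weight * output
    return target
```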
[0090] The multi-teacher model distillation training approach can be implemented in various ways. For instance, the system could use a weighted combination of the teacher models' outputs as the target for the training process. Alternatively, the system could use a voting scheme, where each teacher model's output is considered as a vote and the most common output is chosen as the target.
[0091] According to another aspect of the present disclosure, in some implementations, the training scheme 2000 utilizes a loss function 2012 that incorporates multiple terms. This loss function is designed to guide the Delta-Net system in generating more accurate and nuanced image restorations or enhancements.
[0092] In some examples, the training scheme 2000 of the Delta-Net system can leverage a unique combination of loss functions 2012, including L1 loss, projected distribution loss (e.g., 1D-Wasserstein distances between CNN activations), adversarial loss, and/or other loss(es). By optimizing these loss functions 2012, the system can effectively learn to generate accurate and detailed image restorations or enhancements. This makes the Delta-Net system a versatile and powerful tool for a wide range of image processing tasks.
[0093] Thus, the training scheme 2000 for the Delta-Net architecture can incorporate a mix of various sources of data for a comprehensive and robust training process. In some example implementations, two specific types of data sources are utilized: synthetically degraded data and real blur-sharp image pairs captured with different camera configurations.
[0094] Synthetically degraded data refers to images that have been artificially altered or degraded in order to simulate various forms of image degradation. These alterations allow for the creation of a wide range of degraded images that cover a comprehensive array of potential degradation scenarios. The corresponding ground truth data 2006 for these synthetically degraded images would be the original, unaltered image.
[0095] Real blur-sharp image pairs captured with different camera configurations are actual captured images that have not been artificially altered. These images are captured under various camera settings, introducing real-world scenarios into the training process. These real-world scenarios can involve a multitude of factors such as motion blur, defocus blur, and other types of degradation that can occur naturally in photography.
[0096] A final training data set 2001 can encompass training data from either or both of the classes of data described above. Additionally or alternatively, the training data set 2001 can include training examples derived from a distillation process utilizing multiple models. The mix of synthetically degraded data, real blur-sharp image pairs, and the multi-teacher model distillation process provides a comprehensive and robust training scheme 2000 for the Delta-Net architecture. This training scheme allows the Delta-Net architecture to learn and adapt to a wide range of image degradation scenarios, while also leveraging the expertise of multiple models, effectively enhancing its image restoration capabilities.
Example Devices and Systems
[0097] Figure 3A depicts a block diagram of an example computing system 100 according to example embodiments of the present disclosure. The system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 that are communicatively coupled over a network 180.
[0098] The user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
[0099] The user computing device 102 includes one or more processors 112 and a memory 114. The one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 114 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations.
[0100] In some implementations, the user computing device 102 can store or include one or more machine-learned models 120. For example, the machine-learned models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models. Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models (e.g., transformer models). Example machine-learned models 120 are discussed with reference to Figures 1 and 2.
[0101] In some implementations, the one or more machine-learned models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112. In some implementations, the user computing device 102 can implement multiple parallel instances of a single machine-learned model 120 (e.g., to perform parallel image processing across multiple instances of images).
[0102] Additionally or alternatively, one or more machine-learned models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship. For example, the machine-learned models 140 can be implemented by the server computing system 130 as a portion of a web service (e.g., an image processing or image restoration service). Thus, one or more models 120 can be stored and implemented at the user computing device 102 and/or one or more models 140 can be stored and implemented at the server computing system 130.
[0103] The user computing device 102 can also include one or more user input components 122 that receives user input. For example, the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.
[0104] The server computing system 130 includes one or more processors 132 and a memory 134. The one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 134 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.
[0105] In some implementations, the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
[0106] As described above, the server computing system 130 can store or otherwise include one or more machine-learned models 140. For example, the models 140 can be or can otherwise include various machine-learned models. Example machine-learned models include neural networks or other multi-layer non-linear models. Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models (e.g., transformer models). Example models 140 are discussed with reference to Figures 1 and 2.
[0107] The user computing device 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180. The training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.
[0108] The training computing system 150 includes one or more processors 152 and a memory 154. The one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 154 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the training computing system 150 to perform operations. In some implementations, the training computing system 150 includes or is otherwise implemented by one or more server computing devices.
[0109] The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function). Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations.
[0110] In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. The model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
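To make the preceding description concrete, the following is a minimal sketch of a single training step of the kind the model trainer 160 can perform, written in PyTorch. The function and variable names are illustrative assumptions, not part of the disclosed system, and a mean squared error loss stands in for whichever loss function a given implementation selects.

```python
# A minimal sketch of one training step (hypothetical names; not the
# reference trainer described above).
import torch


def training_step(model, optimizer, degraded, ground_truth):
    optimizer.zero_grad()
    prediction = model(degraded)  # forward pass through the model
    # One possible loss function: mean squared error between prediction
    # and ground truth.
    loss = torch.nn.functional.mse_loss(prediction, ground_truth)
    loss.backward()   # backwards propagation of errors
    optimizer.step()  # gradient descent update of the parameters
    return loss.item()


# Example optimizer: stochastic gradient descent with weight decay as one
# of the generalization techniques mentioned above.
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)
```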
[0111] In particular, the model trainer 160 can train the machine-learned models 120 and/or 140 based on a set of training data 162. In some implementations, if the user has provided consent, the training examples can be provided by the user computing device 102. Thus, in such implementations, the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model.
[0112] The model trainer 160 includes computer logic utilized to provide desired functionality. The model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor. For example, in some implementations, the model trainer 160 includes program files stored on a storage device, loaded into memory and executed by one or more processors. In other implementations, the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.

[0113] The network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
[0114] In some implementations, the input to the machine-learned model(s) of the present disclosure can be image data. The machine-learned model(s) can process the image data to generate an output. As an example, the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an image segmentation output. As another example, the machine-learned model(s) can process the image data to generate an image classification output. As another example, the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an upscaled image data output. As another example, the machine-learned model(s) can process the image data to generate a prediction output.

[0115] In some cases, the input includes visual data and the task is a computer vision task. In some cases, the input includes pixel data for one or more images and the task is an image processing task. For example, the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class. The image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that the region depicts an object of interest. As another example, the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories. For example, the set of categories can be foreground and background. As another example, the set of categories can be object classes. As another example, the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value. As another example, the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
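As a purely illustrative sketch of how such task outputs can be produced, the following hypothetical module attaches classification, segmentation, and depth heads to a shared feature map. None of these module names, shapes, or layer choices are prescribed by the present disclosure; they only show the general pattern of deriving per-image and per-pixel outputs from one set of features.

```python
# Hypothetical task heads over a shared backbone feature map (sketch only).
import torch
from torch import nn


class TaskHeads(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.classify = nn.Linear(channels, num_classes)    # per-image class scores
        self.segment = nn.Conv2d(channels, num_classes, 1)  # per-pixel category scores
        self.depth = nn.Conv2d(channels, 1, 1)              # per-pixel depth value

    def forward(self, features: torch.Tensor) -> dict:
        # features: (batch, channels, height, width)
        pooled = features.mean(dim=(2, 3))  # global average pool for image-level tasks
        return {
            "classification": self.classify(pooled),
            "segmentation": self.segment(features),
            "depth": self.depth(features),
        }
```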
[0116] Figure 3A illustrates one example computing system that can be used to implement the present disclosure. Other computing systems can be used as well. For example, in some implementations, the user computing device 102 can include the model trainer 160 and the training dataset 162. In such implementations, the models 120 can be both trained and used locally at the user computing device 102. In some of such implementations, the user computing device 102 can implement the model trainer 160 to personalize the models 120 based on user-specific data.
[0117] Figure 3B depicts a block diagram of an example computing device 10 that performs according to example embodiments of the present disclosure. The computing device 10 can be a user computing device or a server computing device.
[0118] The computing device 10 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
[0119] As illustrated in Figure 3B, each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application.
[0120] Figure 3C depicts a block diagram of an example computing device 50 that performs according to example embodiments of the present disclosure. The computing device 50 can be a user computing device or a server computing device.
[0121] The computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
[0122] The central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 3C, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50.
[0123] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing device 50. As illustrated in Figure 3C, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).
Additional Disclosure
[0124] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.
[0125] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.
[0126] An example computing system can serve as a platform for implementing the Delta-Net, a new neural network architecture for multi-scale image restoration. This system comprises one or more processors and one or more non-transitory computer-readable media. The processors can include, for example, central processing units (CPUs), graphics processing units (GPUs), or specialized hardware accelerators for machine learning tasks. The non-transitory computer-readable media can store the machine-learned multi-scale image processing network and other necessary software components.
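For illustration only, the following is a minimal sketch of one delta layer consistent with the description above and the claims below, assuming bilinear upscaling, a Laplacian filter as the high pass image filter, and a small convolutional stack standing in for the U-Net-structured learned parameters. All names are hypothetical; this is not the reference implementation.

```python
# Minimal sketch of one delta layer (hypothetical names; a small CNN stands
# in for a U-Net).
import torch
from torch import nn
import torch.nn.functional as F

# 3x3 Laplacian kernel, one possible high pass image filter.
LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)


def high_pass(image: torch.Tensor) -> torch.Tensor:
    # Apply the Laplacian filter channel-wise via a grouped convolution.
    kernel = LAPLACIAN.to(image.dtype).repeat(image.shape[1], 1, 1, 1)
    return F.conv2d(image, kernel, padding=1, groups=image.shape[1])


class DeltaLayer(nn.Module):
    def __init__(self, channels: int = 3, hidden: int = 32):
        super().__init__()
        self.body = nn.Sequential(  # stand-in for U-Net-structured parameters
            nn.Conv2d(2 * channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, scaled_input: torch.Tensor,
                prev_output: torch.Tensor) -> torch.Tensor:
        # Upscale the preceding layer's output to this layer's scale.
        upscaled = F.interpolate(prev_output, size=scaled_input.shape[-2:],
                                 mode="bilinear", align_corners=False)
        # Concatenate, then process with the learned parameters to produce
        # the intermediate image.
        intermediate = self.body(torch.cat([scaled_input, upscaled], dim=1))
        filtered = high_pass(intermediate)  # keep only high-frequency detail
        # Combine the filtered image with the upscaled coarse result.
        return upscaled + filtered
```

In a full network of this kind, a coarsest scale layer would produce the initial output without the high pass step, and each subsequent delta layer would refine that output at the next larger scale in the sequence.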

Claims

WHAT IS CLAIMED IS:
1. A computing system for performing image processing, the computing system comprising:
    one or more processors; and
    one or more non-transitory computer-readable media that store a machine-learned multi-scale image processing network, the machine-learned multi-scale image processing network comprising a sequence of layers arranged to operate on respective versions of an input image in a sequence of increasing image scales, the sequence of layers comprising one or more delta layers, each of the one or more delta layers configured to:
        obtain a respective version of the input image having a respective scale in the sequence of increasing image scales;
        process the respective version of the input image with a set of learned parameters to generate an intermediate image;
        apply a high pass image filter to the intermediate image to generate a filtered image; and
        generate a respective output image having the respective scale by combining the filtered image with an upscaled version of the respective output image from a preceding layer in the sequence of layers.
2. The computing system of claim 1, wherein the high pass image filter comprises a Laplacian filter.
3. The computing system of any preceding claim, wherein the machine-learned multi-scale image processing network comprises an image restoration model, and wherein the input image comprises a degraded image and the respective output image from a final layer in the sequence of layers comprises a restored image.
4. The computing system of any preceding claim, wherein the sequence of layers comprises a coarsest scale layer as an initial layer in the sequence of layers, wherein the one or more delta layers follow the coarsest scale layer in the sequence of layers, and wherein the coarsest scale layer does not apply the high pass filter.
5. The computing system of any preceding claim, wherein, for each of the one or more delta layers, the set of learned parameters is structured in a U-Net architecture.
6. The computing system of any preceding claim, wherein the one or more delta layers comprise a plurality of delta layers.
7. The computing system of claim 6, wherein two or more of the plurality of delta layers share the same set of learned parameters.
8. The computing system of any preceding claim, wherein the input image has a first scale, and wherein the first scale is a largest scale in the sequence of increasing image scales.
9. The computing system of claim 8, wherein the machine-learned multi-scale image processing network further comprises a downscaling block configured to downscale the input image from the first scale to generate the respective versions of the input image at the sequence of increasing image scales.
10. The computing system of any preceding claim, wherein each of the one or more delta layers is configured to concatenate and then process with the set of learned parameters the respective version of the input image and the upscaled version of the respective output image from the preceding layer in the sequence of layers to generate the intermediate image.
11. The computing system of any preceding claim, wherein the respective output image generated by each of the one or more delta layers comprises a restored version of the input image.
12. The computing system of any of claims 1-10, wherein the respective output image generated by each of the one or more delta layers comprises a feature image comprising latent feature values expressed in a latent dimensional space.
13. The computing system of any preceding claim, wherein the one or more non-transitory computer-readable media further store a machine-learned prediction model configured to process one or more of the respective output images generated by one or more of the delta layers to generate one or more model predictions.
14. A computer-implemented method to train a multi-scale image processing network, the method comprising:
    obtaining, by a computing system comprising one or more computing devices, a training tuple comprising an input image and ground truth data;
    accessing, by the computing system, the multi-scale image processing network, the multi-scale image processing network comprising a sequence of layers arranged to operate on respective versions of the input image in a sequence of increasing image scales, the sequence of layers comprising one or more delta layers;
    processing, by the computing system, the input image with the multi-scale image processing network to generate a model prediction, wherein processing the input image with the multi-scale image processing network comprises, for each of the one or more delta layers:
        obtaining a respective version of the input image having a respective scale in the sequence of increasing image scales;
        processing the respective version of the input image with a set of parameters to generate an intermediate image;
        applying a high pass image filter to the intermediate image to generate a filtered image; and
        generating a respective output image having the respective scale by combining the filtered image with an upscaled version of the respective output image from a preceding layer in the sequence of layers;
    evaluating, by the computing system, a loss function that compares the model prediction to the ground truth data; and
    modifying, by the computing system, one or more parameter values of one or more of the sets of parameters of the one or more delta layers based on the loss function.
15. The computer-implemented method of claim 14, wherein the input image comprises a degraded image, wherein the ground truth data comprises a ground truth restored image, and wherein the model prediction comprises a predicted restored image output by a final delta layer of the one or more delta layers.
16. The computer-implemented method of claim 14, wherein the loss function comprises a multi-teacher model distillation loss term.
17. One or more non-transitory computer-readable media that collectively store a machine-learned multi-scale image restoration network, the machine-learned multi-scale image restoration network comprising a sequence of layers arranged to operate on respective versions of an input image in a sequence of increasing image scales, the sequence of layers comprising one or more delta layers, each of the one or more delta layers configured to:
    obtain a respective version of the input image having a respective scale in the sequence of increasing image scales;
    process the respective version of the input image with a set of learned parameters to generate an intermediate image;
    apply a high pass image filter to the intermediate image to generate a filtered image; and
    generate a respective output image having the respective scale by combining the filtered image with an upscaled version of the respective output image from a preceding layer in the sequence of layers;
    wherein the respective output image of a final layer of the one or more delta layers comprises a restored version of the input image.
18. The one or more non-transitory computer readable media of claim 17, wherein the high pass image filter comprises a Laplacian filter.
19. The one or more non-transitory computer readable media of any of claims 17-18, wherein the one or more delta layers comprise a plurality of delta layers, and wherein at least two of the plurality of delta layers share the same set of learned parameters.
20. The one or more non-transitory computer readable media of any of claims 17-19, wherein, for each of the one or more delta layers, the set of learned parameters is structured in a U-Net architecture.