WO2025015279A1 - Iterative denoising diffusion models for image restoration
- Publication number
- WO2025015279A1 (PCT/US2024/037821)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- images
- denoising
- models
- diffusion models
- diffusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/60—Image enhancement or restoration using machine learning, e.g. neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/73—Deblurring; Sharpening
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- the present disclosure relates generally to machine learning. More particularly, the present disclosure relates to the use of multiple denoising diffusion models (DDMs) to iteratively train a system capable of achieving authentic image restoration.
- DDMs: denoising diffusion models
- Image restoration is a significant task in machine learning and computer vision that aims to reconstruct or recover an original image that has been degraded by various factors.
- Facial restoration refers to image restoration in which the image at issue depicts a face.
- Facial restoration has been a growing field in the last several years, with numerous different models being used to recover high-quality facial images from low-quality source material. Facial restoration has been used in several areas, including but not limited to live streaming, where facial features degraded by internet connection or camera quality are improved from their original state, and photography, where blurry or low-quality images can be upscaled to improve their quality.
- many such models fail to restore the original image authentically and faithfully, often failing to preserve delicate identity features, creating uncanny artifacts nonexistent in the source material, and/or erasing high-frequency details such as facial hair, wrinkles, and freckles.
- One example aspect of the present disclosure is directed to a computer-implemented method to perform image restoration.
- the method includes training one or more initial denoising diffusion models on a database of images.
- the method includes processing one or more images through the trained one or more initial denoising diffusion models to generate one or more restored images.
- the method includes using the one or more restored images generated by the one or more initial denoising diffusion models to train one or more secondary denoising diffusion models.
- the method includes processing one or more subsequent images through the trained one or more secondary denoising diffusion models to generate one or more subsequent restored images.
- Another example aspect of the present disclosure is directed to one or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations.
- the operations include training one or more initial denoising diffusion models on a database of images.
- the operations include processing one or more images through the one or more initial denoising diffusion models to generate one or more restored images.
- the operations include using the one or more restored images generated by the one or more initial denoising diffusion models to train one or more secondary denoising diffusion models.
- the operations include processing one or more subsequent images through the one or more secondary denoising diffusion models to generate one or more subsequent restored images.
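The four operations above can be sketched end to end with placeholder components. This is an illustrative toy sketch only: the `train_ddm` stand-in, the mean-pulling "restoration," and the list-of-floats image representation are all assumptions made for demonstration, not the disclosed denoising diffusion models.

```python
def train_ddm(training_images):
    """Stand-in for training a denoising diffusion model: this toy 'model'
    learns the mean pixel value of its training set and pulls inputs toward
    it, loosely mimicking restoration toward the clean distribution."""
    mean = sum(sum(img) / len(img) for img in training_images) / len(training_images)

    def restore(image):
        # Pull each degraded pixel halfway toward the learned clean statistic.
        return [0.5 * p + 0.5 * mean for p in image]

    return restore


def iterative_restoration_pipeline(initial_training_set, images, subsequent_images):
    # 1. Train one or more initial DDMs on a database of images.
    initial_model = train_ddm(initial_training_set)
    # 2. Process images through the trained initial model to generate restored images.
    restored = [initial_model(img) for img in images]
    # 3. Use the restored images to train one or more secondary DDMs.
    secondary_model = train_ddm(restored)
    # 4. Process subsequent images through the trained secondary model.
    return [secondary_model(img) for img in subsequent_images]
```

The key structural point the sketch captures is that the secondary model never sees the original training database directly; it is trained only on the initial model's restored outputs.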
- Another example aspect of the present disclosure is directed to a computer-implemented method to enable improved performance of downstream models based on refined training data images.
- the method includes training one or more initial denoising diffusion models on a set of training images.
- the method includes processing a separate set of images through the trained one or more initial denoising diffusion models to generate restored images.
- the method includes using the restored images generated by the one or more initial denoising diffusion models to train one or more downstream machine learning models.
- Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices.
- FIGs.1A-C depict block diagrams of an example approach in which two models interact with one another and various selections of training and processing data to perform image restoration, according to example embodiments of the present disclosure.
- FIG.2 depicts a block diagram of an example of a computing device according to example embodiments of the present disclosure.
- FIG.3 depicts a block diagram of an example computing environment including multiple computing systems, according to example embodiments of the present disclosure.
- FIG.4 depicts a flowchart diagram of an example method to perform image restoration, according to example embodiments of the present disclosure.
- FIG.5 depicts a graphical diagram showing an example process of restoring an image, according to example embodiments of the present disclosure.
- Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations.
- DETAILED DESCRIPTION Overview [0021] Generally, the present disclosure is directed to the use of multiple denoising diffusion models to train a system capable of achieving higher-quality image restoration when compared to other models.
- some example implementations of the proposed methodology include the usage of multiple denoising diffusion models, wherein at least one denoising diffusion model is trained using a dataset where some images contain some type of image degradations (e.g., compression loss, blur, or Gaussian noise).
- This model is then used to process and restore a database, where the restored set of images is in turn used to train at least one more denoising diffusion model.
- These one or more secondary denoising diffusion models then serve as the primary models for processing images moving forward.
- a set of ten images is used to train an initial denoising diffusion model. The set of ten images contains five images that are clean and five images that show at least minor degradation.
- a new set of ten images is processed using this model, and ten restored images are generated. These ten restored images are then used to train a secondary denoising diffusion model, which can then take in a final set of ten images and restore them.
- the software may have a number of tools available to users through a graphical user interface, one of which may include the option to enhance, restore, deblur, or otherwise alter an image.
- the software could be an application on a client device, software hosted through a web browser, or any other form capable of allowing a user to interact with it through the interface.
- the software can accept a photo from the user, either through transmission of data over a network or through a local storage system.
- the software can enhance the inputted image by running it through the trained secondary denoising diffusion models to produce an enhanced version of the photo that may fix alterations or degradations related to the original input image.
- This restored image can then be shown to the user through the graphical interface, and users could then further edit their photo using other tools, attempt further enhancements, reverse the process, or download the restored image as-is.
- the models used within could be run on-device, and in other implementations, could be implemented within a cloud-based system.
- images used in the initial training process may show qualities of alteration or degradation, the qualities comprising blur, Gaussian noise, or compression loss.
- an image received in any part of the training process may be blurry, contain static or noise, or contain factors that distort the pixels within the image, such that the image is not high quality or does not resemble its original pixel size, colors, or other features.
- This image can then be used to train the initial model, with the goal being that the initial model learns how to restore images showing signs of degradation as opposed to only clear or normal-quality images.
- images used in the initial training process may show signs of synthetic degradation or alteration.
- an image being used to train the initial models may be purposefully manipulated to contain distorted pixels. This image can then be used to train the initial model, with the goal being that the initial model learns how to restore images showing signs of degradation as opposed to only clear or normal-quality images.
- the initial denoising diffusion model is utilized to predict the direct output, i.e., an upscaled, restored image, as opposed to the noise of the input, the latter being the approach commonly utilized throughout the rest of the industry. For example, the initial denoising diffusion model, when prompted to process and restore an image, will directly produce a restored version of the input image, as opposed to providing a noise function that could be used to alter the image.
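The distinction between predicting the direct output and predicting the noise can be illustrated with the standard DDM forward equation. This is a hedged sketch: the `gamma_t` schedule value and the toy float-list images are assumptions. It shows only that a noise-predicting model must algebraically invert the forward process to recover a restored image, whereas a direct-output model produces the restored image immediately.

```python
import math


def forward_diffuse(x0, noise, gamma_t):
    # Standard DDM forward step: y_t = sqrt(gamma_t)*x0 + sqrt(1-gamma_t)*noise.
    return [math.sqrt(gamma_t) * a + math.sqrt(1.0 - gamma_t) * n
            for a, n in zip(x0, noise)]


def x0_from_epsilon(y_t, eps, gamma_t):
    # If a model predicts the noise eps (the common parameterization), the
    # restored image must be recovered indirectly by inverting the forward
    # process; a direct-output model would simply return x0 itself.
    return [(y - math.sqrt(1.0 - gamma_t) * e) / math.sqrt(gamma_t)
            for y, e in zip(y_t, eps)]
```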
- the training and subsequent restoration processes may make use of the same database of photos.
- the database of photos processed, restored, and used to train the secondary models may be different from the database used to train the initial models.
- the images used to train the initial model and the images processed and restored are facial images that in some way depict a face. In other implementations, these images can depict other subjects.
- the images used in the process described in the present disclosure may all consist of human faces.
- the images used may consist of animal faces.
- the images may consist of landscapes or environmental photos.
- the models are pre-trained.
- a photo editing software may have access to just the secondary model after it has been trained using the output of the initial model.
- the editing software may have access to both the initial and secondary model, both trained using the complete process.
- the initial model and secondary model may be the same model, trained twice, with the photo editing software having access to the single, double-trained model.
- the models are not pre-trained, and must be trained before being used to restore images. For instance, a set of models may not be pre-trained such that they can be trained on a specific set of images, with one possible intention being to enhance their ability to restore a specific set or genre of photos.
- a photo editing software may possess the ability to store, access, or otherwise utilize the models, even though they have not been trained.
- the software may choose to train the models on a set of photos containing a subject, accessing or requesting to access a set of photos of the subject.
- a user may input a restoration request for an image, and the software may request access to a set of photos similarly containing the subject of the input photo, using these photos to train the models and subsequently restore the input image.
- the same software, on-device or functioning over a network, may have access to multiple sets of potential training data, and may attempt to locate a set of training data wherein the subjects of the training data most closely resemble the input subject with regard to a plurality of physical characteristics.
- the software could then train the model on just this training data, before attempting to restore the input image.
- the implementation features a method of “authentic restoration,” in which factors of quality, such as sharpness and the level of realism, are treated the same, regardless of the actual quality of the image, such that details are not lost over the diffusion and subsequent reverse denoising processes. This is to ensure that a processed image is truthful and accurately representative of its source image.
- an authentic restoration method should preserve all features of the original image, and not sacrifice one feature for the sake of image quality alone. For example, if an image of an individual with freckles is processed for restoration, the restoration process will preserve the freckles as they appear in the original image, without erasing them for the sake of pixel quality alone.
- example implementations of the proposed methodology include the usage of multiple independent denoising diffusion models, where at least one denoising diffusion model is trained using a dataset, and the trained model is used to process and restore a set of images. Using the restored images, a second set of one or more denoising diffusion models is trained, with the second set of models being used to process and create a subsequent set of restored images.
- a system may make use of multiple computing systems connected to one another, in which each part of the system may encompass one of the necessary elements (e.g., image storage, two or more denoising diffusion models, and other such elements as embodied in the present disclosure).
- one computing system may send images to a separate computing system, which processes the image using a stored, local model, before sending the generated restored image to a third system.
- the secondary model can be any downstream machine learning model capable of accepting image-related data as an input. For instance, the initial set of one or more denoising diffusion models may be trained.
- the trained model then later processes and restores a set of images. These images can then be used in a downstream model or machine-learning technique.
- the initial denoising diffusion model could be used within the context of autonomous vehicles, where the model restores a set of images captured by an external camera.
- the restored images could then be sent to a convolutional neural network engaging in some form of predictive modeling.
- the network is able to make more accurate predictions utilizing the restored images, using these more accurate predictions to improve the functionality or operation of the autonomous vehicle.
- One of the benefits of the proposed implementation is an overall reduction in the number of training cycles necessary to produce a functioning system.
- Decreased processing cycles can provide for more efficient energy use and lower power consumption, prolonging operation in energy-constrained environments (e.g., battery-powered client devices). [0039] In this manner, for instance, the improved energy efficiency of example implementations of the present disclosure can reduce an amount of pollution or other waste, thereby advancing the field of network-connected computing systems as a whole.
- the amount of pollution can be reduced in total (e.g., an absolute magnitude thereof) or on a normalized basis (e.g., energy per task, per model size, etc.).
- an amount of CO2 released (e.g., by a power source) in association with training and execution of machine-learned models can be reduced by implementing more energy-efficient training or inference operations.
- the amount of heat pollution in an environment can be reduced by implementing more energy-efficient training or inference operations.
- Another benefit of the proposed implementation is the increase in authenticity and quality of images restored using the proposed process. Images derived from the proposed process, whether the input images were original or cleaned, were of higher quality and retained more of the original features, such as facial hair, freckles, and wrinkles. A more in-depth discussion of these results is offered in a later section.
- Another benefit of the proposed implementation is the improvement in precision, recall, and Fréchet inception distance (FID) that is gained when downstream models utilize the generated restored images from the one or more initial denoising diffusion models.
- FID: Fréchet inception distance
- Example Techniques for Iterative Diffusion This section starts with a problem formulation for an example application of iterative diffusion: authentic face restoration. Next, this section introduces the core iterative restoration method from two aspects: (1) intrinsic iterative learning benefits authentic restoration via iterative refinement; (2) extrinsic iterative learning further improves restoration by automatically enhancing the training data. [0044]
- Example Problem Formulation [0045] Let Y denote the high-quality image domain and p_Y denote the distribution over Y.
- D(·,·) is a distribution distance, e.g., KL (Kullback-Leibler) divergence (corresponding to maximum likelihood estimation) or Jensen-Shannon divergence (corresponding to adversarial loss);
- d(·,·) is an instance-wise distance between two images, e.g., perceptual loss or an l_p distance (p ∈ {1, 2}).
- an authentic restoration model f is expected to treat a clean image y and its restoration f(y) the same regardless of the actual quality.
- a mild relaxation is that f(f(y)) is not worse than y after two iterations.
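The fixed-point property and its two-iteration relaxation can be expressed as a toy check. The `restore` callables, the float-list image representation, and the `l1_distance` metric are hypothetical stand-ins for illustration, not the disclosed models or losses.

```python
def l1_distance(a, b):
    # Instance-wise l_1 distance between two equally sized toy images.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def is_authentic(restore, clean_image, tol=1e-3):
    # Authenticity check: f(y) should stay within tol of a clean y
    # (fixed-point behavior), and a second application f(f(y)) should be
    # no worse than the first (the two-iteration relaxation).
    once = restore(clean_image)
    twice = restore(once)
    return (l1_distance(once, clean_image) <= tol and
            l1_distance(twice, clean_image) <= l1_distance(once, clean_image) + tol)
```

Under this check, a restorer that preserves clean inputs passes, while one that smooths away detail (e.g., averaging all pixels) fails, mirroring the freckle-preservation example above.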
- DDMs have a Markov chain structure residing in both the forward diffusion process and the reverse denoising chain. To recover the high-quality image, this structure naturally benefits authentic restoration via iterative refinement of the low-quality input, termed herein intrinsic iterative learning.
- the chain starts from a standard Gaussian distribution y_T ~ N(0, I) and only the final sample y_0 is stored.
- Conditional DDMs: For restoration, we are more interested in conditional DDMs that map a low-quality source input x to a high-quality target image y.
- the conditional DDM iteratively edits the noisy intermediate image y_t by learning a conditional distribution p_θ(y_{t-1} | y_t, x), where y_0 ~ p_Y, x is the degraded observation of y_0, and t ∈ {1, ..., T}.
- Learning: The forward diffusion process can be simplified into a specification of the true posterior distribution q(y_{t-1} | y_0, y_t), where γ_t defines the noise schedule.
- Some example implementations of the present disclosure learn the reverse chain with a "denoiser" model f_θ which takes both the source image x and an intermediate noisy target image y_t as input, and is trained by matching the true posterior q(y_{t-1} | y_0, y_t).
- The conditional signal can also be strong enough to preserve the high-quality contents without any auxiliary losses. That is, f_θ will gradually learn to absorb the clean content from x to approximate a clean version y_0 of x as in Eq. (7).
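The conditional reverse chain described above can be sketched as follows. This is a simplified stand-in, not the disclosed equations: the denoiser is a placeholder callable, and the linear interpolation toward the predicted clean image under a toy `gamma` schedule merely illustrates how the chain starts from Gaussian noise y_T and iteratively refines toward y_0 conditioned on the source x.

```python
import math
import random


def reverse_chain(denoiser, x, steps=10, seed=0):
    """Toy conditional reverse denoising chain.

    denoiser(x, y_t) is assumed to predict a clean target y0_hat from the
    low-quality source x and the current noisy intermediate y_t."""
    rng = random.Random(seed)
    y_t = [rng.gauss(0.0, 1.0) for _ in x]  # start from y_T ~ N(0, I)
    for t in range(steps, 0, -1):
        y0_hat = denoiser(x, y_t)           # predict the clean target
        gamma = (t - 1) / steps             # toy noise schedule, -> 0 at t=1
        # Move the intermediate image toward the prediction; at the final
        # step (gamma == 0) the sample equals the denoiser's clean estimate.
        y_t = [math.sqrt(1.0 - gamma) * a + math.sqrt(gamma) * b
               for a, b in zip(y0_hat, y_t)]
    return y_t
```

Only the final sample y_0 (the returned list) is kept, matching the intrinsic iterative refinement described above.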
- Example Extrinsic Iterative Learning [0059] Most works choose to use 70K FFHQ (Flickr-Faces-HQ) data as training data. The dataset is currently the largest public high-quality high-resolution face data and is assumed to be clean and sharp. It is collected from the Internet and filtered by both automatic annotation tools and human workers. However, revisiting the dataset reveals that the training data is not always high-quality.
- example implementations of the present disclosure provide a simple yet effective solution for extrinsic iterative learning. Thanks to the capability of authentic restoration of DDMs, the solution can automatically restore the training data without damaging it, especially images whose quality is already high and whose fine facial details must be preserved. [0060] Iterative restoration: DDMs are proven to satisfy Eq. (2).
- example implementations of the present disclosure can apply it to all the training data to produce a new high-quality image domain Y'.
- This method is referred to herein as extrinsic iterative learning because Eq. (10) produces higher-quality data for another iteration of restoration-model training, which is different from the internal chained process in DDMs.
- Note that some example implementations of the present disclosure still draw low-quality image samples x via the degradation mapping, and the target distribution becomes p_{Y'} instead of p_Y.
- some example implementations learn a conditional distribution p_{θ'}(y_{t-1} | y_t, x), where y_0 ~ p_{Y'}, x is the degraded observation of y_0, and t ∈ {1, ..., T}.
- Coupling two DDMs: Note that the only difference between learning f_θ and f_{θ'} is the target distribution, p_Y versus p_{Y'}. With proper tuning, the pool of Y can be gradually replaced and filled by new data from Y'.
- some example implementations can unify the two DDMs f_θ and f_{θ'} into only one learning process.
- Other example implementations can train two separate models, which works well and saves parameter tuning cost.
- the decoupled DDMs can be viewed as a single model experimentally since f_{θ'} is initialized from f_θ.
- some example implementations only keep the latest model f_{θ'} for inference.
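The gradual replacement of the training pool Y by restored data from Y' can be sketched with a toy helper. The `restorer` callable, the list-of-images pool, and the `fraction` parameter are illustrative assumptions, not the disclosed tuning procedure.

```python
def refresh_pool(pool, restorer, fraction=0.5):
    """Replace the first `fraction` of the training pool with restored
    versions, gradually migrating the target distribution from p_Y toward
    p_Y'. In a full loop, the model trained on the refreshed pool would
    play the role of f_theta', initialized from f_theta."""
    k = int(len(pool) * fraction)
    return [restorer(img) for img in pool[:k]] + pool[k:]
```

Repeatedly training on the refreshed pool and keeping only the latest model corresponds to the extrinsic iterative learning loop described above.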
- the method 100 includes a set of initial training data, 110, an initial one or more models 120, the first training output data 125, the first model trainer 130, the first input data 140, the generated training data 150, the second one or more models 160, the second training output data 165, the second model trainer 170, the second input data 180, and the final output data 190.
- the training data 110 represents a set of images, wherein at least one image in the dataset possesses qualities related to image degradation, including but not limited to compression loss, Gaussian noise, and blur. Images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files.
- the model trainer 130 can do this based upon a flat number of iterations (e.g., a set number x of iterations to be performed), a series of functions that evaluates the training output data 125 to determine whether it meets a certain level of accuracy, a series of functions that evaluates the training output data 125 to determine whether improvements can be made to the internal denoising/diffusion functions, manual operations, or other methods and techniques typically found within the training of machine learning models. If the trainer 130 determines additional iterations are needed, it modifies the parameters of model 120 and generates a new set of the training output data 125. [0070] Referring now to Figure 1B, the first input data 140 represents the data input into the first one or more models 120.
- the input data 140 is the data that is meant to be restored and converted into the generated training data 150 by the one or more models 120.
- the first input data 140 is a set of images, wherein the images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files.
- the generated training data 150 represents the final output of the one or more first models 120, wherein the generated training data is a set of restored images based on the original first input data 140. Images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files.
- the second model 160 represents a set of one or more additional machine learning models capable of taking image data as an input.
- the second one or more models are trained on the generated training data 150 from the one or more models 120.
- the second training output data 165 represents the output data from model 160, wherein output data 165 is a set of restored images based on the generated training data 150. In some implementations, this may be image data, wherein images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files. In some implementations, this data could be numerical or text-based predictions.
- the second model trainer 170 represents a model trainer, which evaluates the second training output data 165 and determines whether further iterations are needed.
- the input data 180 is the data that is meant to be restored by the one or more models 160.
- the input data 180 is a set of images, wherein the images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files.
- the final output data 190 represents the final output of the one or more second models 160, wherein the final output data is a set of restored images based on the second input data 180. Images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files.
- Figure 2 depicts a block diagram of an example computing system 200 that performs the method embodied in Figure 1 according to example embodiments of the present disclosure.
- the user computing device 202 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
- the user computing device 202 includes one or more processors 212 and a memory 214.
- the one or more processors 212 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
- the memory 214 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
- the memory 214 can store data 216 and instructions 218 which are executed by the processor 212 to cause the user computing device 202 to perform operations.
- One or more machine-learned models 220/222 can be included in, stored, and implemented by the user computing device 202.
- the models 220/222 can be or can otherwise include various layers of denoising diffusion models.
- Models 220 and 222 can be trained on local image and video data from the photo storage system 224.
- FIG. 3 depicts a block diagram of an example computing system 300 that performs the method embodied in Figure 1 according to example embodiments of the present disclosure.
- the system 300 includes a user computing device 302, a server computing system 330, and a training computing system 350 that are communicatively coupled over a network 380.
- the user computing device 302 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
- the user computing device 302 includes one or more processors 312 and a memory 314.
- the one or more processors 312 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
- the memory 314 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
- the memory 314 can store data 316 and instructions 318 which are executed by the processor 312 to cause the user computing device 302 to perform operations.
- the one or more machine-learned models 320 can be received from the server computing system 330 over network 380, stored in the user computing device memory 314, and then used or otherwise implemented by the one or more processors 312.
- the user computing device 302 can implement multiple parallel instances of a single machine-learned model 320 (e.g., to perform parallel image restoration).
- one or more machine-learned models 340 can be included in or otherwise stored and implemented by the server computing system 330 that communicates with the user computing device 302 according to a client-server relationship.
- the machine-learned models 340 can be implemented by the server computing system 330 as a portion of a web service.
- one or more models can be stored and implemented at the user computing device 302 and/or one or more models 340 can be stored and implemented at the server computing system 330.
- the user computing device 202 can also include one or more user input components, drawing from the user storage 224, that receive user input in the form of images or videos.
- the server computing system 330 includes one or more processors 332 and a memory 334.
- the one or more processors 332 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
- the memory 334 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
- the memory 334 can store data 336 and instructions 338 which are executed by the processor 332 to cause the server computing system 330 to perform operations.
- the server computing system 330 includes or is otherwise implemented by one or more server computing devices.
- the server computing system 330 can store or otherwise include one or more machine-learned models 340.
- the models 340 can be or can otherwise include various layers of denoising diffusion models.
- the server computing system 330 can train the models 340 via interaction with the training computing system 350 that is communicatively coupled over the network 380.
- the training computing system 350 can be separate from the server computing system 330 or can be a portion of the server computing system 330.
- the training computing system 350 includes one or more processors 352 and a memory 354.
- the one or more processors 352 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
- the memory 354 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
- the memory 354 can store data 356 and instructions 358 which are executed by the processor 352 to cause the training computing system 350 to perform operations.
- the training computing system 350 includes or is otherwise implemented by one or more server computing devices.
- the training computing system 350 can include a model trainer 360 that trains the machine-learned models 320 and/or 340 stored at the user computing device 302 and/or the server computing system 330 using various training or learning techniques, such as, for example, backwards propagation of errors.
- a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function).
- Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions.
- Gradient descent techniques can be used to iteratively update the parameters over several training iterations.
- performing backwards propagation of errors can include performing truncated backpropagation through time.
- the model trainer 360 can perform several generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
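The training loop described above — backpropagating a loss and applying gradient descent updates over several iterations — can be sketched as follows. This is an illustrative NumPy example using a linear model and a mean squared error loss as stand-ins; it is not the disclosure's actual model or trainer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))          # toy training inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                        # toy training targets

w = np.zeros(3)                       # model parameters
lr = 0.1                              # learning rate

for _ in range(200):                  # several training iterations
    pred = X @ w
    loss = np.mean((pred - y) ** 2)               # mean squared error loss
    grad = 2.0 * X.T @ (pred - y) / len(y)        # gradient of the loss
    w -= lr * grad                                # gradient descent update

print(np.round(w, 2))                 # converges toward true_w = [1.5, -2.0, 0.5]
```

The same update pattern generalizes to the other losses named above (likelihood, cross entropy, hinge); only the loss and its gradient change.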
- the model trainer 360 can train the machine-learned model 340 based on a set of training data 362 in the form of images.
- the training examples can be provided by the user computing device 302.
- the model 320 provided to the user computing device 302 can be trained by the training computing system 350 on user-specific data received from the user computing device 302.
- the model trainer 360 includes computer logic utilized to provide desired functionality.
- the model trainer 360 can be implemented in hardware, firmware, and/or software controlling a general-purpose processor.
- the model trainer 360 includes program files stored on a storage device, loaded into a memory, and executed by one or more processors.
- the model trainer 360 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.
- the network 380 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links.
- communication over the network 380 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
- Processed image data representing the end output of the model 340 can be saved to the server computing system or the user computing device using the network 380, with both computing systems maintaining localized photo storage alongside their respective models 320/340.
- Figure 3 illustrates one example computing system that can be used to implement the present disclosure. Other computing systems can be used as well.
- the user computing device 302 can include the model trainer 360 and the training dataset 362.
- the models 340 can be both trained and used locally at the user computing device 302.
- the user computing device 302 can implement the model trainer 360 to personalize the models 320 based on user-specific data.
- the input to the machine-learned model(s) of the present disclosure can be image data.
- the machine-learned model(s) can process the image data to generate an output.
- the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.).
- the machine-learned model(s) can process the image data to generate an image segmentation output.
- the machine-learned model(s) can process the image data to generate an image classification output.
- the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.).
- the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.).
- the machine-learned model(s) can process the image data to generate an upscaled image data output.
- the machine-learned model(s) can process the image data to generate a prediction output.
- FIG. 4 depicts a block diagram of an example of the two-part denoising diffusion model training and usage process 400 according to example embodiments of the present disclosure.
- the method 400 involves a denoising diffusion model that is trained to receive a set of input data 402; the trained model 402 processes one or more photos and restores them in the process 404.
- the output data of the process 404 is then used to train the secondary denoising diffusion model in the process 406; the secondary model takes a set of data similar to the input data 402 and outputs image data similar to that of 404, producing further restored photos in the process 408.
- the initial input image 402 may possess image degradation qualities, including but not limited to compression loss, Gaussian noise, and blur.
- the set of initial input images 402 may contain at least one image that has been synthetically degraded.
- the trained model 402 may directly predict a restored version of each of the processed images to create the restored images in the process 404.
- images may be used as conditioning signals at intermediate layers of the initial denoising diffusion model 402 during its training process.
- the photos processed to create the restored images in the process 404 may be obtained from the original database of images.
- Figure 4 depicts steps performed in a particular order for purposes of illustration and discussion. The methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 400 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure.
- Figure 5 represents a diagram demonstrating how a photo might be restored in a production environment, including the initial input image 502, the initial model 504, the process 506 of using the output of model 504 to pass data into the secondary model 508, and the final output image 510.
- the initial input image 502 represents an image that can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files.
- the initial input image 502 may possess qualities related to image degradation, including but not limited to compression loss, Gaussian noise, and blur.
- the model 504 represents one or more denoising diffusion models.
- the initial image 502 may be used to train the model 504.
- the model 504 may already be trained, meaning that the model 504 is used to process the input data and generate a restored image. Additional Disclosure [0119] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure.
- the phrase “and/or” includes any and all combinations of one or more of the listed items; the singular forms “a,” “an,” and “the” are intended to include the plural forms as well as the singular forms, unless specifically indicated otherwise; the terms “comprises” and/or “comprising” are used to specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, and/or components. [0120] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems.
Abstract
Computer-implemented methods, systems, devices, and tangible non-transitory computer-readable media for training and utilizing multiple denoising diffusion models that restore images are provided. Specifically, the proposed method involves training one or more denoising diffusion models, which are then used as extrinsic learning systems that restore images; the restored images serve as training data for a secondary series of one or more denoising diffusion models, which are used as the primary system for processing and restoring images provided by a user or system. The proposed systems are able to implement the computer-implemented method and further include a variety of other features, including the ability to send, receive, and save images, communicate with external devices, and more.
Description
ITERATIVE DENOISING DIFFUSION MODELS FOR IMAGE RESTORATION RELATED APPLICATIONS [0001] This application claims priority to and the benefit of United States Provisional Patent Application Number 63/513,433 filed July 13, 2023. United States Provisional Patent Application Number 63/513,433 is hereby incorporated by reference in its entirety. FIELD [0002] The present disclosure relates generally to machine learning. More particularly, the present disclosure relates to the use of multiple denoising diffusion models (DDMs) to iteratively train a system capable of achieving authentic image restoration. BACKGROUND [0003] Image restoration is a significant task in machine learning and computer vision that aims to reconstruct or recover an original image that has been degraded by various factors. One particular type of image restoration is facial image restoration, in which the image at issue depicts a face. [0004] Facial restoration has been a growing field in the last several years, with numerous different models being used to recover high-quality facial images from low-quality source material. Facial restoration has been used in several areas, including but not limited to live streaming, where facial features are improved from their original state, which may be degraded by internet connection or camera quality, and photography, where blurry or low-quality images can be upscaled to improve the quality. However, many such models fail to restore the original image authentically and faithfully, often failing to preserve delicate identity features, creating uncanny artifacts nonexistent in the source material, and/or erasing high-frequency details, such as facial hair, wrinkles, and freckles. [0005] As such, users are often left with a resulting image or video that, while improved in terms of image quality (e.g., level of blurriness, haziness, coloration, saturation, etc.), is not an authentic representation of the original source material.
This raises several concerns, and additionally prevents wide-scale usage and adoption of these models, since the restored image is still of unsatisfactory quality. Therefore, current facial restoration models fail to provide users with a high-quality and faithful restoration of their input images. [0006] As such, there is a need within the field for novel models and training methodologies that can maintain the high-quality image restoration process (e.g., facial
restoration process), while also engaging in authentic restoration that accurately captures original features, including but not limited to wrinkles, freckles, and facial hair. Additionally, there is a need for such a method to operate efficiently and to be capable of quickly processing images for use in video streaming and live photography. SUMMARY [0007] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments. [0008] One example aspect of the present disclosure is directed to a computer-implemented method to perform image restoration. The method includes training one or more initial denoising diffusion models on a database of images. The method includes processing one or more images through the trained one or more initial denoising diffusion models to generate one or more restored images. The method includes using the one or more restored images generated by the one or more initial denoising diffusion models to train one or more secondary denoising diffusion models. The method includes processing one or more subsequent images through the trained one or more secondary denoising diffusion models to generate one or more subsequent restored images. [0009] Another example aspect of the present disclosure is directed to one or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations. The operations include training one or more initial denoising diffusion models on a database of images. The operations include processing one or more images through the one or more initial denoising diffusion models to generate one or more restored images.
The operations include using the one or more restored images generated by the one or more initial denoising diffusion models to train one or more secondary denoising diffusion models. The operations include processing one or more subsequent images through the one or more secondary denoising diffusion models to generate one or more subsequent restored images. [0010] Another example aspect of the present disclosure is directed to a computer-implemented method to enable improved performance of downstream models based on refined training data images. The method includes training one or more initial denoising diffusion models on a set of training images. The method includes processing a separate set of images through the trained one or more initial denoising diffusion models to generate
restored images. The method includes using the restored images generated by the one or more initial denoising diffusion models to train one or more downstream machine learning models. [0011] Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices. [0012] These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles. [0013] The attached Appendix, which is fully incorporated into and forms a portion of this disclosure, describes example implementations of the systems and methods described herein. The present disclosure is not limited to the example implementations described in the attached Appendix. BRIEF DESCRIPTION OF THE DRAWINGS [0014] Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which: [0015] FIGs.1A-C depict block diagrams of an example approach in which two models interact with one another and various selections of training and processing data to perform image restoration, according to example embodiments of the present disclosure. [0016] FIG.2 depicts a block diagram of an example of a computing device according to example embodiments of the present disclosure. [0017] FIG.3 depicts a block diagram of an example computing environment including multiple computing systems, according to example embodiments of the present disclosure. [0018] FIG.4 depicts a flowchart diagram of an example method to perform image restoration, according to example embodiments of the present disclosure. 
[0019] FIG.5 depicts a graphical diagram showing an example process of restoring an image, according to example embodiments of the present disclosure. [0020] Reference numerals that are repeated across plural figures are intended to identify the same features in various implementations.
DETAILED DESCRIPTION Overview [0021] Generally, the present disclosure is directed to the use of multiple denoising diffusion models to train a system capable of achieving higher-quality image restoration when compared to other models. Specifically, some example implementations of the proposed methodology include the usage of multiple denoising diffusion models, wherein at least one denoising diffusion model is trained using a dataset where some images contain some type of image degradations (e.g., compression loss, blur, or Gaussian noise). This model is then used to process and restore a database, where the restored set of images is in turn used to train at least one more denoising diffusion model. This secondary one or more denoising diffusion models then serve as the primary model for processing images moving forward. [0022] For example, in one potential implementation, a set of ten images are used to train an initial denoising diffusion model. The set of ten images contain five images that are clean, and five images that show at least minor degradation. After training the initial model, a new set of ten images is processed using this model, and ten restored images are generated. These ten restored images are then used to train a secondary denoising diffusion model, which can then take in a final set of ten images and restore them. [0023] In another example, there may be a photo editing software, wherein the photo editing software has access to one or more trained models. The software may have a number of tools available to users through a graphical user interface, one of which may include the option to enhance, restore, deblur, or otherwise alter an image. As examples, the software could be an application on a client device, or a software hosted through a web browser, or could be in any other form capable of allowing a user to interact with it utilizing the interface. 
The software can accept a photo from the user, either through transmission of data over a network or through a local storage system. The software can enhance the inputted image by running it through the trained secondary denoising diffusion models to produce an enhanced version of the photo that may fix alterations or degradations related to the original input image. This restored image can then be shown to the user through the graphical interface, and users could then further edit their photo using other tools, attempt further enhancements, reverse the process, or download the restored image as-is. As such, in some implementations, the models used within could be run on-device, and in other implementations, could be implemented within a cloud-based system.
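The two-stage flow described above — train an initial model on (degraded, clean) pairs, use it to restore a fresh set of images, then train the secondary model on those restored outputs — can be sketched as follows. Everything here is a toy stand-in: `train_denoiser` fits a simple gain/bias mapping rather than a denoising diffusion model, and `degrade` is a hypothetical noise function.

```python
import numpy as np

rng = np.random.default_rng(1)

def degrade(img):
    # Toy degradation: additive Gaussian noise (a stand-in for blur,
    # compression loss, etc.).
    return img + rng.normal(0.0, 0.2, img.shape)

def train_denoiser(pairs):
    # Toy "model training": least-squares fit of a gain/bias mapping from
    # degraded pixels to clean pixels. A stand-in for training a DDM.
    x = np.concatenate([d.ravel() for d, c in pairs])
    t = np.concatenate([c.ravel() for d, c in pairs])
    A = np.stack([x, np.ones_like(x)], axis=1)
    gain, bias = np.linalg.lstsq(A, t, rcond=None)[0]
    return lambda img: gain * img + bias

# Stage 1: train the initial model on (degraded, clean) training pairs.
clean = [rng.uniform(0, 1, (8, 8)) for _ in range(20)]
initial_model = train_denoiser([(degrade(c), c) for c in clean])

# Use the initial model to restore a fresh set of degraded images.
fresh = [rng.uniform(0, 1, (8, 8)) for _ in range(20)]
degraded_fresh = [degrade(c) for c in fresh]
restored = [initial_model(d) for d in degraded_fresh]

# Stage 2: the restored outputs become the training targets for the
# secondary model, which then serves as the primary restoration model.
secondary_model = train_denoiser([(degrade(r), r) for r in restored])
final_output = secondary_model(degrade(rng.uniform(0, 1, (8, 8))))
```

In the disclosure both stages are denoising diffusion models; the gain/bias fit here only illustrates the data flow between the two training stages.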
[0024] In some implementations, images used in the initial training process may show qualities of alteration or degradation, the qualities comprising blur, Gaussian noise, or compression loss. For example, an image received in any part of the training process may be blurry, possess static or noise, or contain factors that distort the pixels within the image, such that the image is not high quality, or does not resemble its original pixel size, colors, or other features. This image can then be used to train the initial model, with the goal being that the initial model learns how to restore images showing signs of degradation as opposed to only clear or normal-quality images. [0025] In some implementations, images used in the initial training process may show signs of synthetic degradation or alteration. For example, an image being used to train the initial models may be purposefully manipulated to contain distorted pixels. This image can then be used to train the initial model, with the goal being that the initial model learns how to restore images showing signs of degradation as opposed to only clear or normal-quality images. [0026] In some implementations, the initial denoising diffusion model is utilized to predict the direct output, or an upscaled, restored image, as opposed to the noise of the input, the latter being the approach utilized throughout the remainder of the industry. For example, the initial denoising diffusion model, when prompted to process and restore an image, will directly produce a restored version of the input image, as opposed to providing a noise function that could be used to alter the image. [0027] In some implementations, the training and subsequent restoration processes may make use of the same database of photos. However, in other implementations, the database of photos processed, restored, and used to train the secondary models may be different from the database used to train the initial models. 
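A synthetic degradation step of the kind described — deliberately blurring an image, adding Gaussian noise, and simulating compression loss — might look like the sketch below. The specific operations (a 3x3 box blur and coarse quantization) are illustrative choices, not the disclosure's degradation pipeline.

```python
import numpy as np

def synthetically_degrade(img, noise_std=0.05, blur_passes=1, levels=32):
    # Toy synthetic degradation: box blur + additive Gaussian noise + coarse
    # quantization, standing in for blur, Gaussian noise, and compression
    # loss respectively.
    out = img.astype(float)
    for _ in range(blur_passes):                      # simple 3x3 box blur
        padded = np.pad(out, 1, mode="edge")
        out = sum(padded[i:i + out.shape[0], j:j + out.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    out = out + np.random.default_rng(0).normal(0, noise_std, out.shape)
    out = np.round(out * (levels - 1)) / (levels - 1)  # crude "compression"
    return np.clip(out, 0.0, 1.0)

clean = np.random.default_rng(1).uniform(0, 1, (16, 16))
low_quality = synthetically_degrade(clean)
```

The resulting (low_quality, clean) pairs are exactly the kind of supervision the initial model is trained on.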
[0028] In some implementations, the images used to train the initial model and the images processed and restored are facial images that in some way depict a face. In other implementations, these images can depict other subjects. For example, in some implementations, the images used in the process described in the present disclosure may all consist of human faces. In another example, the images used may consist of animal faces. In another example, the images may consist of landscapes or environmental photos. [0029] In some implementations, the models are pre-trained. For instance, a photo editing software may have access to just the secondary model after it has been trained using the output of the initial model. In another example, the editing software may have access to both the initial and secondary model, both trained using the complete process. In another
example, the initial model and secondary model may be the same model, trained twice, with the photo editing software having access to the single, double-trained model. [0030] In other implementations, the models are not pre-trained, and must be trained before being used to restore images. For instance, a set of models may not be pre-trained such that they can be trained on a specific set of images, with one possible intention being to enhance their ability to restore a specific set or genre of photos. [0031] In one example, a photo editing software may possess the ability to store, access, or otherwise utilize the models, even though they have not been trained. In an on-device application, the software may choose to train the models on a set of photos containing a subject, accessing or requesting to access a set of photos of the subject. A user may input a restoration request for an image, and the software may request access to a set of photos similarly containing the subject of the input photo, using these photos to train the models and subsequently restore the input image. [0032] In another example, the same software, on-device or functioning over a network, may have access to multiple sets of potential training data, and attempt to locate a set of training data wherein the subjects of the training data most closely resemble the subject with regard to a plurality of physical characteristics. The software could then train the model on just this training data, before attempting to restore the input image. [0033] In some implementations, the system features a method of “authentic restoration,” in which factors of quality, such as sharpness and the level of realism, are treated the same, regardless of the actual quality of the image, such that details are not lost over the diffusion and subsequent reverse denoising processes. This is to ensure that a processed image is truthful and accurately representative of its source image. 
For example, an authentic restoration method should preserve all features of the original image, and not sacrifice one feature for the sake of image quality alone. For example, if an image of an individual with freckles is processed for restoration, the restoration process will preserve the freckles as they appear in the original image, without erasing them for the sake of pixel quality alone. [0034] Thus, example implementations of the proposed methodology include the usage of multiple independent denoising diffusion models, where at least one denoising diffusion model is trained using a dataset, and the trained model is used to process and restore a set of images. Using the restored images, a second set of one or more denoising diffusion models are trained, with the second set of models being used to process and create a subsequent set of restored images.
[0035] In some implementations, a system may make use of multiple computing systems connected to one another, in which each part of the system may encompass one of the necessary elements (e.g., image storage, two or more denoising diffusion models, and other such elements as embodied in the present disclosure). For example, in one potential implementation, one computing system may send images to a separate computing system, which processes the image using a stored, local model, before sending the generated restored image to a third system. [0036] In some implementations in which the initial denoising diffusion model is used to enhance the training data used in the next iteration, the secondary model can be any downstream machine learning model capable of accepting image-related data as an input. For instance, the initial set of one or more denoising diffusion models may be trained. The trained model then later processes and restores a set of images. These images can then be used in a downstream model or machine-learning technique. [0037] For example, the initial denoising diffusion model could be used within the context of autonomous vehicles, where the model restores a set of images captured by an external camera. The restored images could then be sent to a convolutional neural network engaging in some form of predictive modeling. Using the restored images, the network is able to make more accurate predictions, which can improve the functionality or operation of the autonomous vehicle. [0038] One of the benefits of the proposed implementation is an overall reduction in the number of training cycles necessary to produce a functioning system. Decreased processing cycles can provide for more efficient energy use, prolonging operation in energy-constrained environments (e.g., battery-powered client devices). Decreased processing cycles can provide for lower power usage. 
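The downstream-model benefit described above can be illustrated with a toy sketch: a noisy capture is restored before a simple predictor consumes it, improving prediction accuracy. The box-blur "restoration" and single-pixel "predictor" are hypothetical stand-ins, not the disclosure's models.

```python
import numpy as np

rng = np.random.default_rng(4)

def restore(img):
    # Stand-in for the trained initial denoising diffusion model: a 3x3 box
    # blur that suppresses per-pixel noise (illustrative only).
    padded = np.pad(img, 1, mode="edge")
    return sum(padded[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def downstream_predict(img, threshold=0.5):
    # Toy downstream model: predicts a binary label from one pixel, standing
    # in for e.g. a convolutional network in an autonomous vehicle.
    return int(img[0, 0] > threshold)

# Noisy captures whose true underlying intensity is 0.6 (true label: 1).
captures = [np.full((8, 8), 0.6) + rng.normal(0, 0.3, (8, 8)) for _ in range(200)]
acc_raw = np.mean([downstream_predict(c) == 1 for c in captures])
acc_restored = np.mean([downstream_predict(restore(c)) == 1 for c in captures])
```

Because the blur reduces the noise on the pixel the predictor reads, `acc_restored` exceeds `acc_raw` on these toy inputs, mirroring the claimed benefit of feeding restored images to downstream models.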
[0039] In this manner, for instance, the improved energy efficiency of example implementations of the present disclosure can reduce an amount of pollution or other waste, thereby advancing the field of network-connected computing systems as a whole. The amount of pollution can be reduced in total (e.g., an absolute magnitude thereof) or on a normalized basis (e.g., energy per task, per model size, etc.). For example, an amount of CO2 released (e.g., by a power source) in association with training and execution of machine-learned models can be reduced by implementing more energy-efficient training or inference operations. The amount of heat pollution in an environment (e.g., by the processors/storage locations) can be reduced by implementing more energy-efficient training or inference operations.
[0040] Another benefit of the proposed implementation is the increase in authenticity and quality of images restored using the proposed process. Images derived from the proposed process, in both scenarios where the input images were original and cleaned, were of higher quality, and retained more of the original features like facial hair, freckles, wrinkles, and more. A more in-depth discussion of these results is offered in a later section. [0041] Another benefit of the proposed implementation is the improvement in precision, recall, and Fréchet inception distance (FID) gained when downstream models utilize the generated restored images from the one or more initial denoising diffusion models. A more in-depth discussion of these results is described below. [0042] With reference now to the Figures, example embodiments of the present disclosure will be discussed in further detail. Example Techniques for Iterative Diffusion [0043] This section starts with a problem formulation for an example application of iterative diffusion: authentic face restoration. Next, this section introduces the core iterative restoration method from two aspects: (1) intrinsic iterative learning benefits authentic restoration via iterative refinement; (2) extrinsic iterative learning further improves restoration by automatically enhancing the training data. [0044] Example Problem Formulation [0045] Let $\mathcal{Y}$ denote the high-quality image domain and $p_\mathcal{Y}$ the distribution over $\mathcal{Y}$. Assume that there exists a non-learnable one-to-many degradation function $g$ that maps $\mathcal{Y}$ to the degraded low-quality image domain $\mathcal{X}$, $g: \mathcal{Y} \to \mathcal{X}$. The goal of face restoration is to learn an inverse function $f_\theta$ parameterized by $\theta$, $f_\theta: \mathcal{X} \to \mathcal{Y}$, that satisfies

$$\min_\theta \; D\big(p_{f_\theta(x)},\, p_\mathcal{Y}\big) + \mathbb{E}\big[d\big(f_\theta(x), y\big)\big] \tag{1}$$

where $D$ is a distribution distance, e.g., KL (Kullback-Leibler) divergence (or maximum likelihood estimation) or Jensen-Shannon divergence (or adversarial loss); $d(\cdot)$ is an instance-wise distance between two images, e.g., perceptual loss or $\ell_p$ distance ($p \in \{1,2\}$). All current face restoration works can be formulated by Eq. (1). For example, GFPGAN and CodeFormer both keep the adversarial loss for high-fidelity restoration and mix different kinds of $d(\cdot)$ for the input's content preservation. In contrast, the proposed approach does not explicitly put constraints to preserve content. Instead, content preservation is achieved via injecting conditional signals inside the model plus iterative refinement. Thus, the second term is omitted.
[0046] Authentic restoration: In real applications, it is common that the degraded input can be severely degraded, have only minor degradation, or even be as clean as $y \sim p_\mathcal{Y}$. Since there is not a reliable metric to evaluate a single image's quality, e.g., sharpness and realisticness, an authentic restoration model is expected to treat $y$ and $x$ the same regardless of the actual quality. Thus, example implementations follow the following criterion that an authentic restoration model should obey:

$$f_\theta(x) = y, \quad f_\theta(y) = y \tag{2}$$

which implies that applying the model $f_\theta$ iteratively, starting from $x$, should converge to the same high-quality image $y$:

$$f_\theta\big(f_\theta(\cdots f_\theta(x))\big) = y \tag{3}$$

[0047] A mild relaxation is that $f_\theta(f_\theta(x))$ is not worse than $f_\theta(x)$ after two iterations. [0048] By assuming the input is always of low quality, example implementations can focus on two iterative restorations to deal with almost all real restoration scenarios. Next, this section will describe how iterative diffusion models naturally couple with authentic restoration in Eq. (3) with proper training. [0049] Intrinsic Iterative Learning [0050] Briefly, DDMs have a Markov chain structure residing in both the forward diffusion process and the reverse denoising chain. To recover the high-quality image, this structure naturally benefits authentic restoration via iterative refinement of the low-quality input, termed as intrinsic iterative learning. [0051] Conditional DDMs: DDMs are designed to learn a data distribution $p(z_0)$ by defining a chain of latent variable models

$$p_\theta(z_{0:T}) = p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1} \mid z_t), \tag{4}$$
where each timestep’s example ^^^^ ^^^^ has the same dimensionality ^^^^ ^^^^ ∈ ℝ ^^^^. Usually, the chain starts from a standard Gaussian distribution ^^^^ ^^^^~ ^^^^(0, ^^^^ ^^^^) and only the final sample ^^^^0 is stored. For restoration, we are more interested in the conditional DDMs that maps a low- quality source input ^^^^ ^^^^ to a high-quality target image ^^^^. The conditional DDMs iteratively edits the noisy intermediate image ^^^^ ^^^^ by learning a conditional distribution ^^^^ ^^^^( ^^^^ ^^^^−1| ^^^^ ^^^^, ^^^^ ^^^^) such that ^^^^0~ ^^^^( ^^^^| ^^^^ ^^^^) where ^^^^0 ≡ ^^^^, ^^^^ ^^^^ = ^^^^( ^^^^), ^^^^~ ^^^^ ^^^^. [0052] Learning: The forward diffusion process can be simplified into a specification of the true posterior distribution
$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\gamma_t}\, x_0,\ (1 - \gamma_t)\, I\right)$, where $\gamma_t$ defines the noise schedule.
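The forward diffusion step under the noise schedule $\gamma_t$ can be sketched as follows; the linear beta range is an assumed, illustrative choice not specified by the text:

```python
import numpy as np

def make_gamma_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """gamma_t = prod_{i<=t} (1 - beta_i), decreasing from ~1 toward 0.

    The linear beta range is illustrative, not part of the disclosure.
    """
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def sample_x_t(x0, t, gamma, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(gamma_t) * x0, (1 - gamma_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(gamma[t]) * x0 + np.sqrt(1.0 - gamma[t]) * eps, eps
```

At $t$ near $T$, $\gamma_t$ is small and $x_t$ is dominated by noise; at small $t$ it stays close to $x_0$, matching the annealing behavior described above.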
[0053] Some example implementations of the present disclosure learn the reverse chain with a "denoiser" model $f_\theta$, which takes both the source image $x_s$ and an intermediate noisy target image $x_t$, comparing the learned $p_\theta(x_{t-1} \mid x_t, x_s)$ with the tractable posterior $q(x_{t-1} \mid x_t, x_0)$. Consequently, some example implementations aim to optimize the following objective that estimates the noise $\epsilon$,
$\mathbb{E}_{(x, x_s)}\, \mathbb{E}_{\epsilon, \gamma}\, \left\| f_\theta\!\left(x_s,\ \sqrt{\gamma}\, x_0 + \sqrt{1 - \gamma}\, \epsilon,\ \gamma\right) - \epsilon \right\|_p^p \quad (6)$ where $\epsilon \sim \mathcal{N}(0, I)$
and usually $p \in \{1, 2\}$. One can also directly set the prediction target of $f_\theta$ to $x_0$, a.k.a. regression, which turns (6) into: $\mathbb{E}_{(x, x_s)}\, \mathbb{E}_{\epsilon, \gamma}\, \left\| f_\theta\!\left(x_s,\ \sqrt{\gamma}\, x_0 + \sqrt{1 - \gamma}\, \epsilon,\ \gamma\right) - x_0 \right\|_p^p. \quad (7)$
[0054] This formulation is more efficient in both training and inference since the network approximates $x_0$ at each time step, starting from different amounts of noise. [0055] Iterative restoration: For the reverse diffusion process during the inference stage, some example implementations of the present disclosure follow the Langevin dynamics: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \gamma_t}}\, f_\theta(x_s, x_t, \gamma_t) \right) + \sqrt{1 - \alpha_t}\, \epsilon_t, \quad (8)$
where $\alpha_t$ is a hyper-parameter related to $\gamma_t = \prod_{i=1}^{t} \alpha_i$. To help understand why DDMs can provide authentic restoration, consider the following two aspects: [0056] Iterative refinement: the above process iteratively refines the input towards the target high-quality image. Different time steps are guided to learn restoration with an annealing noise schedule; that is, it coincides with the example authentic restoration motivation in Eq. 3. Theoretically, as $\alpha_t$ is annealed to 1, a sufficiently long denoising chain always converges to the same data point, that is, $x_t \approx x_{t-1}$. (9) [0057] Dense architecture: an example design of the predictor $f_\theta$ is a very dense structure with hierarchical links between encoder and decoder. The conditional signal can also be strong enough to preserve the high-quality contents without any auxiliary losses. That is, $x_t$ will gradually learn to absorb the clean content from $x_s$ to approximate a clean version $x_0$ of $x_s$, as in Eq. (7). [0058] Example Extrinsic Iterative Learning [0059] Most works choose to use the 70K-image FFHQ (Flickr-Faces-HQ) dataset as training data. The dataset is currently the largest public high-quality, high-resolution face dataset and is
assumed to be clean and sharp. It is collected from the Internet and filtered by both automatic annotation tools and human workers. However, revisiting the dataset reveals that the training data is not always high-quality. For example, JPEG degradation, blur, and Gaussian noise are found in around 15% of the first 2000 images examined. Dataset quality has proven to be a key factor in learning a restoration model. To avoid being trapped in such a dilemma, example implementations of the present disclosure provide a simple yet effective solution for extrinsic iterative learning. Thanks to the authentic restoration capability of DDMs, the solution can automatically restore the training data without damaging it, especially images whose quality is already high and whose facial details must be preserved. [0060] Iterative restoration: DDMs are proven to satisfy Eq. 2. After learning the restoration model $f_\theta$, example implementations of the present disclosure can apply it to all the training data to produce a new high-quality image domain $q_x^*$: $x^* = f_\theta(x), \quad \forall x \sim q_x. \quad (10)$
[0061] This method is referred to herein as extrinsic iterative learning because Eq. 10 produces higher-quality data for another iteration of restoration model training, which is different from the internal chained process in DDMs. [0062] Note that some example implementations of the present disclosure still draw low-quality image samples $x_s$ with the mapping $x \to x_s$, and the target distribution becomes $q_x^*$ instead of $q_x$. Therefore, some example implementations learn a conditional distribution $p_{\theta^*}(x_{t-1} \mid x_t, x_s)$ such that $x_0 \sim q(x^* \mid x_s)$, where $x_0 \equiv x^*$, $x_s = d(x)$, $x \sim q_x$. [0063] Coupling two DDMs: Note that the only difference between learning $f_\theta$ and $f_{\theta^*}$ is the target distribution, $q_x$ versus $q_x^*$. With proper tuning, the pool of $q_x$ can be gradually replaced and filled by new data from $q_x^*$. Given an ideal "stop sign" for learning $f_\theta$, some example implementations can unify the two DDMs $f_\theta$ and $f_{\theta^*}$ into only one learning process. Other example implementations can train two separate models, which works well and saves parameter-tuning cost. Regardless, the decoupled DDMs can be viewed as a single model experimentally since $f_{\theta^*}$ is initialized from $f_\theta$. Through intrinsic and extrinsic iterative learning, some example implementations only keep the latest model $f_{\theta^*}$ for inference. Example Devices and Systems [0064] Figures 1A-C depict block diagrams of an example computer-implemented method 100 that performs the process of training one initial model, outputting restored
images, and using the images to train a secondary, primary model, according to example embodiments of the present disclosure. Specifically, Figure 1A shows an initial training process, Figure 1B shows an intermediate training process, and Figure 1C shows a runtime or inference process. [0065] Referring collectively to Figures 1A-C, the method 100 includes a set of initial training data 110, an initial one or more models 120, the first training output data 125, the first model trainer 130, the first input data 140, the generated training data 150, the second one or more models 160, the second training output data 165, the second model trainer 170, the second input data 180, and the final output data 190. [0066] Referring first to Figure 1A, the training data 110 represents a set of images, wherein at least one image in the dataset possesses qualities related to image degradation, including but not limited to compression loss, Gaussian noise, and blur. Images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files. [0067] The model 120 represents one or more denoising diffusion models, wherein said denoising diffusion models are trained upon the training data set 110. [0068] The first training output data 125 represents the output data from the model 120, wherein the output data 125 is a set of restored images based on the original training data 110. Images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files. [0069] The first model trainer 130 represents a model trainer, which evaluates the first training output data 125 and determines whether further iterations are needed.
The model trainer 130 can do this based upon a flat number of iterations (e.g., a set number x of iterations to be performed), a series of functions that evaluate the training output data 125 to determine whether it meets a certain level of accuracy, a series of functions that evaluate the training output data 125 to determine whether improvements can be made to the internal denoising/diffusion functions, manual operations, or other methods and techniques typically found within the training of machine learning models. If the trainer 130 determines additional iterations are needed, it modifies the parameters of the model 120 and generates a new set of the training output data 125. [0070] Referring now to Figure 1B, the first input data 140 represents the data input into the first one or more models 120. The input data 140 is the data that is meant to be restored and converted into the generated training data 150 by the one or more models 120. The first
input data 140 is a set of images, wherein the images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files. [0071] The generated training data 150 represents the final output of the first one or more models 120, wherein the generated training data is a set of restored images based on the original first input data 140. Images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files. [0072] The second model 160 represents a set of one or more additional machine learning models capable of taking image data as an input. The second one or more models are trained on the generated training data 150 from the one or more models 120. [0073] The second training output data 165 represents the output data from the model 160, wherein the output data 165 is a set of restored images based on the generated training data 150. In some implementations, this may be image data, wherein images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files. In some implementations, this data could be numerical or text-based predictions. [0074] The second model trainer 170 represents a model trainer, which evaluates the second training output data 165 and determines whether further iterations are needed. The model trainer 170 can do this based upon a flat number of iterations (e.g., a set number x of iterations to be performed), a series of functions that evaluate the training output data 165 to determine whether it meets a certain level of accuracy, a series of functions that evaluate the training output data 165 to determine whether improvements can be made to the internal denoising/diffusion functions, manual operations, or other methods and techniques typically found within the training of machine learning models.
If the trainer 170 determines additional iterations are needed, it modifies the parameters of the model 160 and generates a new set of the training output data 165. [0075] Referring now to Figure 1C, the second input data 180 represents the data input into the second one or more models 160. The input data 180 is the data that is meant to be restored by the one or more models 160. The input data 180 is a set of images, wherein the images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files. [0076] The final output data 190 represents the final output of the second one or more models 160, wherein the final output data is a set of restored images based on the second input data 180. Images can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files.
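The two-stage flow of Figures 1A-B — train an initial model, restore a corpus with it, then train a secondary model on the restored output — can be sketched as follows. `train_ddm` is a hypothetical stand-in (faked here as a simple clipping "restorer" so the sketch runs) rather than a real diffusion trainer:

```python
def train_ddm(images):
    """Hypothetical trainer stand-in: returns a 'restorer' fit to `images`.

    A real implementation would train a denoising diffusion model; here the
    learned model is faked as a clipping denoiser so the sketch executes.
    """
    def restore(batch):
        # "Restore" each image by clamping pixel values into [0, 1].
        return [[min(max(p, 0.0), 1.0) for p in img] for img in batch]
    return restore

def two_stage_training(training_data_110, input_data_140):
    model_120 = train_ddm(training_data_110)   # Figure 1A: initial model
    generated_150 = model_120(input_data_140)  # restored images
    model_160 = train_ddm(generated_150)       # Figure 1B: secondary model
    return model_160

model = two_stage_training([[0.5, 1.2]], [[-0.1, 0.7]])
restored = model([[2.0, 0.3]])
```

The key design point is that the secondary model 160 never sees the raw corpus directly; it is trained only on images already restored by the initial model 120.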
[0077] Figure 2 depicts a block diagram of an example computing system 200 that performs the method embodied in Figure 1 according to example embodiments of the present disclosure. [0078] The user computing device 202 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device. [0079] The user computing device 202 includes one or more processors 212 and a memory 214. The one or more processors 212 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 214 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 214 can store data 216 and instructions 218 which are executed by the processor 212 to cause the user computing device 202 to perform operations. [0080] One or more machine-learned models 220/222 can be included in, stored, and implemented by the user computing device 202. For example, the models 220/222 can be or can otherwise include various layers of denoising diffusion models. [0081] Models 220 and 222 can be trained on local image and video data from the photo storage system 222/224. [0082] Processed image data, representing the end output of model 222, can be saved to the user device utilizing a localized photo storage system 222. [0083] Figure 3 depicts a block diagram of an example computing system 300 that performs the method embodied in Figure 1 according to example embodiments of the present disclosure.
The system 300 includes a user computing device 302, a server computing system 330, and a training computing system 350 that are communicatively coupled over a network 380. [0084] The user computing device 302 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device. [0085] The user computing device 302 includes one or more processors 312 and a memory 314. The one or more processors 312 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.)
and can be one processor or a plurality of processors that are operatively connected. The memory 314 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 314 can store data 316 and instructions 318 which are executed by the processor 312 to cause the user computing device 302 to perform operations. [0086] In some implementations, the one or more machine-learned models 320 can be received from the server computing system 330 over the network 380, stored in the user computing device memory 314, and then used or otherwise implemented by the one or more processors 312. In some implementations, the user computing device 302 can implement multiple parallel instances of a single machine-learned model 320 (e.g., to perform parallel image restoration). [0087] Additionally, or alternatively, one or more machine-learned models 340 can be included in or otherwise stored and implemented by the server computing system 330 that communicates with the user computing device 302 according to a client-server relationship. For example, the machine-learned models 340 can be implemented by the server computing system 330 as a portion of a web service. Thus, one or more models can be stored and implemented at the user computing device 302 and/or one or more models 340 can be stored and implemented at the server computing system 330. [0088] The user computing device 302 can also include one or more user input components that receive user input in the form of images or videos. [0089] The server computing system 330 includes one or more processors 332 and a memory 334. The one or more processors 332 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
The memory 334 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 334 can store data 336 and instructions 338 which are executed by the processor 332 to cause the server computing system 330 to perform operations. [0090] In some implementations, the server computing system 330 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 330 includes plural server computing devices, such server
computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof. [0091] As described above, the server computing system 330 can store or otherwise include one or more machine-learned models 340. For example, the models 340 can be or can otherwise include various layers of denoising diffusion models. [0092] The server computing system 330 can train the models 340 via interaction with the training computing system 350 that is communicatively coupled over the network 380. The training computing system 350 can be separate from the server computing system 330 or can be a portion of the server computing system 330. [0093] The training computing system 350 includes one or more processors 352 and a memory 354. The one or more processors 352 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 354 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 354 can store data 356 and instructions 358 which are executed by the processor 352 to cause the training computing system 350 to perform operations. In some implementations, the training computing system 350 includes or is otherwise implemented by one or more server computing devices. [0094] The training computing system 350 can include a model trainer 360 that trains the machine-learned models 320 and/or 340 stored at the user computing device 302 and/or the server computing system 330 using various training or learning techniques, such as, for example, backwards propagation of errors. 
For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function). Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions. Gradient descent techniques can be used to iteratively update the parameters over several training iterations. [0095] In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. The model trainer 360 can perform several generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained. [0096] In particular, the model trainer 360 can train the machine-learned model 340 based on a set of training data 362 in the form of images.
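The trainer loop described above (compute a loss, backpropagate its gradient, and update parameters by gradient descent) can be illustrated numerically. The one-parameter linear "model" and learning rate below are purely illustrative stand-ins, not part of the disclosure:

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the linear model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad_mse(w, xs, ys):
    """Analytic d/dw of the mean squared error (the 'backpropagated' gradient)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, w=0.0, lr=0.1, steps=200):
    """Iteratively update the parameter over several training iterations."""
    for _ in range(steps):
        w -= lr * grad_mse(w, xs, ys)  # gradient-descent parameter update
    return w

w = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # data generated by w = 2.0
```

The same loop shape applies when `w` is the weight tensor of a diffusion denoiser and the gradient comes from automatic differentiation rather than a hand-derived formula.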
[0097] In some implementations, if the user has provided consent, the training examples can be provided by the user computing device 302. Thus, in such implementations, the model 320 provided to the user computing device 302 can be trained by the training computing system 350 on user-specific data received from the user computing device 302. In some instances, this process can be referred to as personalizing the model. [0098] The model trainer 360 includes computer logic utilized to provide desired functionality. The model trainer 360 can be implemented in hardware, firmware, and/or software controlling a general-purpose processor. For example, in some implementations, the model trainer 360 includes program files stored on a storage device, loaded into a memory, and executed by one or more processors. In other implementations, the model trainer 360 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media. [0099] The network 380 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over the network 380 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL). [0100] Processed image data, representing the end output of model 340, can be saved to the server computing device or the user device using the network 380, with both computing systems maintaining a localized photo storage system 320/340. [0101] Figure 3 illustrates one example computing system that can be used to implement the present disclosure. Other computing systems can be used as well.
For example, in some implementations, the user computing device 302 can include the model trainer 360 and the training dataset 362. In such implementations, the models 340 can be both trained and used locally at the user computing device 302. In some of such implementations, the user computing device 302 can implement the model trainer 360 to personalize the models 320 based on user-specific data. [0102] The machine-learned models described in this specification may be used in a variety of tasks, applications, and/or use cases. [0103] In some implementations, the input to the machine-learned model(s) of the present disclosure can be image data. The machine-learned model(s) can process the image data to generate an output. As an example, the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a
latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an image segmentation output. As another example, the machine-learned model(s) can process the image data to generate an image classification output. As another example, the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an upscaled image data output. As another example, the machine-learned model(s) can process the image data to generate a prediction output. [0104] Figure 4 depicts a block diagram of an example of the two-part denoising diffusion model training and usage process 400 according to example embodiments of the present disclosure. In some implementations, the method 400 involves a denoising diffusion model that is trained on a set of input data 402. The trained model processes one or more photos and restores them in the process 404. The output data of the process 404 is then used to train the secondary denoising diffusion model in the process 406, which takes a similar set of data to the input data 402 and outputs further restored photos in the process 408. [0105] In some implementations, the initial input image 402 may possess image degradation qualities, including but not limited to compression loss, Gaussian noise, and blur. [0106] In some implementations, the set of initial input images 402 may contain at least one image that has been synthetically degraded.
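Synthetic degradation of the kind described above could be sketched as a blur followed by Gaussian noise; JPEG compression loss is omitted since it needs an image codec, and all function names and parameters here are illustrative:

```python
import numpy as np

def gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return img + rng.standard_normal(img.shape) * sigma

def box_blur(img, k=3):
    """Naive k x k mean filter; edges handled by clipping the window."""
    h, w = img.shape
    out = np.empty_like(img)
    r = k // 2
    for i in range(h):
        for j in range(w):
            out[i, j] = img[max(0, i - r):i + r + 1,
                            max(0, j - r):j + r + 1].mean()
    return out

def degrade(img, sigma=0.05, rng=None):
    """Compose blur then noise to synthesize a low-quality training image."""
    rng = rng or np.random.default_rng(0)
    return gaussian_noise(box_blur(img), sigma, rng)
```

Pairs of `(degrade(x), x)` then serve as the low-quality/high-quality training pairs for a restoration model.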
[0107] In some implementations, the trained model 402 may directly predict a restored version of the processed images from 402 to create the restored images in the process 404. [0108] In some implementations, images may be used as conditioning signals at interceding layers of the initial denoising diffusion model 402's training processes. [0109] In some implementations, the one or more photos processed in the process 404 to create the restored images may be obtained from the original database of images. [0110] In some implementations, the one or more images processed by the one or more initial denoising diffusion models in the process 404 to generate the one or more restored images are different from and not included in the original database of images.
[0111] In some implementations, the initial one or more denoising diffusion models from the processes 402 and 404 are the same models trained and utilized in the subsequent processes 406 and 408. [0112] In some implementations, the initial one or more denoising diffusion models from the processes 402 and 404 are separate models from the secondary one or more denoising diffusion models trained and utilized in the subsequent processes 406 and 408. [0113] In some implementations, the images in the processes 402, 404, 406, and 408 comprise facial images that depict a face. [0114] Although Figure 4 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the method 400 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure. [0115] Figure 5 represents a diagram demonstrating how a photo might be restored in a production environment, including the initial input image 502, the initial model 504, the process 506 of using the output of model 504 to pass data into the secondary model 508, and the final output image 510. [0116] The initial input image 502 represents an image that can be in any form of image encapsulation and storage types, including but not limited to JPEG, PNG, GIF files. [0117] In some implementations, the initial input image 502 may possess qualities related to image degradation, including but not limited to compression loss, Gaussian noise, and blur. [0118] The model 504 represents one or more denoising diffusion models. In some implementations, the initial image 502 may be used to train the model 504. In other implementations, the model 504 may already be trained, in which case the model 504 processes the input data to generate a restored image.
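Assuming a noise-predicting denoiser and a standard DDPM-style reverse update, the Figure 5 cascade — run the full denoising chain with the initial model 504, then feed its output to the secondary model 508 — might be sketched as follows. The `denoiser(x_s, x_t, gamma_t)` signature and the dummy stand-in model are assumptions for illustration:

```python
import numpy as np

def reverse_step(denoiser, x_s, x_t, alpha_t, gamma_t, rng):
    """One Langevin-style reverse step using a noise-predicting denoiser."""
    eps_hat = denoiser(x_s, x_t, gamma_t)
    mean = (x_t - (1 - alpha_t) / np.sqrt(1 - gamma_t) * eps_hat) / np.sqrt(alpha_t)
    return mean + np.sqrt(1 - alpha_t) * rng.standard_normal(x_t.shape)

def restore(denoiser, x_s, alphas, rng):
    """Run the full reverse chain conditioned on the low-quality source x_s."""
    gammas = np.cumprod(alphas)
    x_t = rng.standard_normal(x_s.shape)  # chain starts from N(0, I)
    for t in range(len(alphas) - 1, -1, -1):
        x_t = reverse_step(denoiser, x_s, x_t, alphas[t], gammas[t], rng)
    return x_t

def cascade(model_504, model_508, image_502, alphas, rng):
    """Figure 5: model 504's restoration is passed on to model 508."""
    intermediate_506 = restore(model_504, image_502, alphas, rng)
    return restore(model_508, intermediate_506, alphas, rng)  # output 510

rng = np.random.default_rng(0)
dummy_denoiser = lambda x_s, x_t, gamma: np.zeros_like(x_t)  # stand-in model
alphas = np.full(5, 0.9)
output_510 = cascade(dummy_denoiser, dummy_denoiser, np.zeros((4, 4)), alphas, rng)
```

In a real deployment the two denoisers would be trained networks (with the secondary one trained on the initial model's restorations), not the zero-predicting stand-in used here to keep the sketch executable.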
Additional Disclosure [0119] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the phrase “and/or” includes any and all combinations of one or more of the listed items; the singular forms “a,” “an,” and “the” are intended to include the plural forms as well as the singular forms, unless specifically indicated otherwise; the terms “comprises” and/or “comprising” specify the presence of stated features, steps, operations, elements,
and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, and/or components. [0120] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel. [0121] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one implementation can be used with another implementation to yield a still further implementation. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.
Claims
WHAT IS CLAIMED IS: 1. A computer-implemented method to perform image restoration, the method comprising: training one or more initial denoising diffusion models on a database of images; processing one or more images through the trained one or more initial denoising diffusion models to generate one or more restored images; using the one or more restored images generated by the one or more initial denoising diffusion models to train one or more secondary denoising diffusion models; and processing one or more subsequent images through the trained one or more secondary denoising diffusion models to generate one or more subsequent restored images.
2. The computer-implemented method of claim 1, wherein at least one of the images in the database of images shows degradation qualities of compression loss, blur, or Gaussian noise.
3. The computer-implemented method of claim 1, wherein at least one of the images in the database of images has been synthetically degraded.
4. The computer-implemented method of claim 1, wherein the one or more initial denoising diffusion models directly predict a restored image version of the one or more processed images.
5. The computer-implemented method of claim 1, wherein images are utilized as conditioning signals at interceding layers of the one or more initial denoising diffusion models' training processes.
6. The computer-implemented method of claim 1, wherein the one or more images processed by the one or more initial denoising diffusion models to generate the one or more restored images are obtained from the database of images.
7. The computer-implemented method of claim 1, wherein the one or more images processed by the one or more initial denoising diffusion models to generate the one or more restored images are different from and not included in the database of images.
8. The computer-implemented method of claim 1, wherein the one or more secondary denoising diffusion models are the one or more initial denoising diffusion models.
9. The computer-implemented method of claim 1, wherein the one or more secondary denoising diffusion models are different from the one or more initial denoising diffusion models.
10. The computer-implemented method of claim 1, wherein the one or more images comprise facial images that depict a face.
11. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations, the operations comprising: training one or more initial denoising diffusion models on a database of images; processing one or more images through the one or more initial denoising diffusion models to generate one or more restored images; using the one or more restored images generated by the one or more initial denoising diffusion models to train one or more secondary denoising diffusion models; and processing one or more subsequent images through the one or more secondary denoising diffusion models to generate one or more subsequent restored images.
12. The one or more non-transitory computer-readable media of claim 11, wherein at least one of the images in the database shows compression loss, blur, or Gaussian noise.
13. The one or more non-transitory computer-readable media of claim 11, wherein the one or more initial denoising diffusion models directly predict a restored image version of the one or more processed images.
14. A computer-implemented method to enable improved performance of downstream models based on refined training data images, the method comprising: training one or more initial denoising diffusion models on a set of training images; processing a separate set of images through the trained one or more initial denoising diffusion models to generate restored images; and using the restored images generated by the one or more initial denoising diffusion models to train one or more downstream machine learning models.
15. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations, the operations comprising:
    training one or more initial denoising diffusion models on a set of training images;
    processing a separate set of images through the trained one or more initial denoising diffusion models to generate restored images; and
    using the restored images generated by the one or more initial denoising diffusion models to train one or more downstream machine learning models.
16. A computing system configured to perform the method and/or operations of any preceding claim.
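The claims above leave the model architecture, training objective, and degradation process open, as claims do. Purely to illustrate the claimed data flow — not the disclosed method — the sketch below stands in for each denoising diffusion model with a trivial per-pixel affine restorer fit by least squares (`LinearRestorer` is a hypothetical name introduced here): an initial model is trained on a database of paired images, its restored outputs then serve as pseudo-targets for a secondary model (claims 1 and 11), and subsequent images are processed through that secondary model.

```python
import numpy as np

rng = np.random.default_rng(0)

def degrade(clean, noise_std=0.5):
    # Synthetic degradation: additive Gaussian noise, one of the
    # degradations named in claim 12 (alongside blur and compression loss).
    return clean + rng.normal(0.0, noise_std, clean.shape)

class LinearRestorer:
    """Stand-in for a denoising diffusion model: a single affine map
    a*x + b fit by least squares. Only the training/inference data
    flow of the claims is illustrated, not diffusion itself."""
    def fit(self, degraded, targets):
        self.a, self.b = np.polyfit(degraded.ravel(), targets.ravel(), 1)
        return self

    def restore(self, images):
        return self.a * images + self.b

# Step 1 (claims 1/11): train an initial model on a database of images.
clean_db = rng.normal(0.0, 1.0, (64, 8, 8))
degraded_db = degrade(clean_db)
initial_model = LinearRestorer().fit(degraded_db, clean_db)

# Step 2: process a separate set of images to generate restored images.
new_images = degrade(rng.normal(0.0, 1.0, (32, 8, 8)))
restored = initial_model.restore(new_images)

# Step 3: use the restored images as training targets for a secondary
# model (per claims 8 and 9 it may be the same model or a different one).
secondary_model = LinearRestorer().fit(new_images, restored)

# Step 4: process subsequent images through the secondary model.
subsequent = degrade(rng.normal(0.0, 1.0, (16, 8, 8)))
subsequent_restored = secondary_model.restore(subsequent)
```

Claims 14 and 15 follow the same pattern but feed the restored images to an arbitrary downstream machine-learning model instead of a second restorer.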
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363513433P | 2023-07-13 | 2023-07-13 | |
| US63/513,433 | 2023-07-13 | 2023-07-13 | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025015279A1 true WO2025015279A1 (en) | 2025-01-16 |
Family
ID=92214253
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/037821 Pending WO2025015279A1 (en) | 2023-07-13 | 2024-07-12 | Iterative denoising diffusion models for image restoration |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025015279A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120147192A (en) * | 2025-02-27 | 2025-06-13 | 北京邮电大学 | Image restoration method, device, equipment and medium for multiple interferences |
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20230067841A1 (en) * | 2021-08-02 | 2023-03-02 | Google Llc | Image Enhancement via Iterative Refinement based on Machine Learning Models |
Non-Patent Citations (4)
| Title |
|---|
| BAHJAT KAWAR ET AL: "Denoising Diffusion Restoration Models", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 October 2022 (2022-10-12), XP091341242 * |
| MATTHEW A CHAN ET AL: "SUD^2: Supervision by Denoising Diffusion Models for Image Reconstruction", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 March 2023 (2023-03-16), XP091460348 * |
| TAO YANG ET AL: "Synthesizing Realistic Image Restoration Training Pairs: A Diffusion Approach", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 13 March 2023 (2023-03-13), XP091459687 * |
| YUEQIN YIN ET AL: "DiffGAR: Model-Agnostic Restoration from Generative Artifacts Using Image-to-Image Diffusion Models", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 October 2022 (2022-10-16), XP091345611 * |
Similar Documents
| Publication | Title |
|---|---|
| CN111079532B | A video content description method based on text autoencoder |
| CN115239593B | Image restoration method, device, electronic device and storage medium |
| US10740865B2 | Image processing apparatus and method using multi-channel feature map |
| CN116664450A | Image enhancement method, device, equipment and storage medium based on diffusion model |
| CN115812206B | Machine Learning for High-Quality Image Processing |
| US20240355017A1 | Text-Based Real Image Editing with Diffusion Models |
| CN114418030A | Image classification method, and training method and device of image classification model |
| CN110675329B | Image deblurring method based on visual semantic guidance |
| CN117218013B | Event camera image processing method, training method, system, device and medium |
| CN111357018A | Image segmentation using neural networks |
| CN117789751A | Speaking face video generation method, computer equipment and storage medium |
| CN115239591A | Image processing method, image processing apparatus, electronic device, storage medium, and program product |
| US20240386529A1 | Generating domain-specific videos using diffusion models |
| WO2024163624A1 | Video editing using diffusion models |
| CN117911997A | Scene text recognition method and device based on diffusion model and readable medium |
| CN117974843A | Generating Hybrid Concept Objects Using a Text-to-Image Diffusion Model |
| WO2025039697A1 | Image enhancement method and apparatus, electronic device, computer-readable storage medium and computer program product |
| WO2025015279A1 | Iterative denoising diffusion models for image restoration |
| US20250166133A1 | Modifying video content |
| Soniya et al. | Edge and texture aware image denoising using median noise residue U-net with hand-crafted features |
| CN120223907A | Image transmission method and system based on meta-learning |
| Li et al. | Real-world defocus deblurring via score-based diffusion models |
| CN119540387B | A rapid iterative method and system for multimodal product design based on diffusion model |
| CN116010858B | Channel attention MLP-Mixer network model device based on self-supervision learning and application thereof |
| Zhou et al. | Task-Driven Semantic-Aware Image Compression |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24751889; Country of ref document: EP; Kind code of ref document: A1 |