WO2025217001A1 - Video artifact reduction by diffusion models - Google Patents
Video artifact reduction by diffusion models
- Publication number
- WO2025217001A1 (PCT/US2025/023309)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frames
- noise
- sampling
- video
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
- H04N19/86—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/60—Image enhancement or restoration using machine learning, e.g. neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20172—Image enhancement details
- G06T2207/20182—Noise reduction or smoothing in the temporal domain; Spatio-temporal filtering
Definitions
- JCT-VC Joint Collaborative Team on Video Coding
- AVC advanced video coding
- HEVC high efficiency video coding
- various compression artifacts e.g., blocking, blurring and ringing artifacts
- the artifacts mainly result from the block-wise prediction and quantization with limited precision. In addition to impairing the perceptual quality of the video, these artifacts can negatively impact subsequent video enhancement steps.
- An embodiment includes a method to enhance video frames, said method comprising: receiving a set of one or more source video frames; creating one or more consequent frames from the one or more source video frames; producing a noise frame according to a noise schedule; performing conditional diffusion on the noise frame concatenated with a set of one or more consequent frames, producing one or more enhanced video frames, the conditional diffusion being performed with a trained model.
- the trained model is trained by: preprocessing training data, training data comprising video; compressing the training data to produce compressed frames; performing forward diffusion based on the training data to produce noise frames; performing a U-Net model on a noise frame concatenated with corresponding compressed frames of the training data.
- An embodiment includes a system to enhance video frames, said system comprising: an input for compressed video frames; a memory bank of noise frames; a conditional diffusion model trained to convert the compressed video frames into enhanced video frames using the noise frames, the enhanced video frames being artifact reduced versions of the compressed video frames.
- the system can be a part of a video decoder.
- FIG. 1 illustrates an example flowchart of the diffusion process.
- FIG. 2 illustrates an example of a U-Net architecture as used herein.
- FIG. 3 illustrates an example training methodology for models used herein.
- FIG. 4 illustrates an example hierarchy of different sampling types.
- FIG. 5 illustrates an example of n-steps.
- FIG. 6 illustrates an example testing methodology for models used herein.
- FIG. 7 illustrates an example of auxiliary enhancement for improved conditioning of models used herein.
- FIG. 8 illustrates an example use of the artifact reduction in a decoder.
- Systems and methods herein describe the use of diffusion models for artifact correction in video.
- video refers to digital data representing consecutive image frames.
- “Compression” refers to methods that reduce the size (e.g., number of bytes) of the data used for the video.
- model refers to artificial intelligence/machine learning algorithms.
- Diffusion models belong to the category of generative models (such as generative adversarial networks, GAN, and variational auto-encoders, VAE) that can be used to generate new images.
- GAN generative adversarial networks
- VAE variational auto-encoders
- a DM utilizes forward diffusion steps to slowly convert image data into isotropic Gaussian noise. Thereafter, these models are trained to learn the reverse diffusion process to map noise into novel images (belonging to a distribution of training data).
- the forward diffusion process adds Gaussian noise to the sample in steps, producing a sequence of noisy samples $x_1, x_2, \ldots, x_T$, where the step sizes are controlled by a variance schedule
- the $x_t$ approaches the noise at each time instant such that $x_T$ approximates an isotropic Gaussian distribution
- the $x_t$ is obtained from $x_0$ through the forward process; reversing the forward process and sampling from $q(x_{t-1} \mid x_t)$ gives new samples
- $x_0$ is a new sample from the training distribution. From the above equations, a network is trained to predict $\epsilon_\theta(x_t, t)$ as the other variables are known. For more clarity, the variables with subscript $\theta$ are predicted by the network. For example, $\epsilon_\theta$ is the predicted noise.
- FIG. 1 shows an example of the reverse diffusion process.
- the forward process would be X0 (130) being an image that is converted to gaussian noise at XT (105), so the reverse takes gaussian noise (105) and generates an image (130).
- X0 130
- XT gaussian noise
- a sample $x_0$ is chosen from the training distribution $q(x)$; $t$ is sampled from the uniform distribution $\mathcal{U}(0, T)$
- Gaussian noise $\epsilon$ is sampled from the normal distribution $\mathcal{N}(0, I)$
- a sample $x_T$ is taken from Gaussian noise and is refined over $T$ steps with an intention to obtain a novel sample $x_0$
- a relatively refined $x_{t-1}$ is calculated by utilizing the predicted mean $\mu_\theta(x_t, t)$
- Noise schedule defines the Markov chain and quantifies the noise added at each step.
- Two commonly used noise schedules are 1) linear, and 2) cosine. As compared to linear schedule, cosine schedule adds noise at a slower pace, resulting in slow degradations of image. In contrast, linear schedule destroys image quickly such that the samples in the last quarter are approximately pure noise. In some embodiments a linear noise schedule is used. In other embodiments a cosine noise schedule is utilized. The choice of noise schedule can depend on the types of images involved.
- conditional diffusion model is utilized for image enhancement.
- CDM conditional diffusion models
- the reverse diffusion process is conditioned on the source image.
- the CDM $p(x \mid y)$ is a stochastic refinement process that maps a source image $y \in \mathbb{R}^M$ to the target image $x \in \mathbb{R}^M$
- to condition the model on the input $y$, it is concatenated with $x_t$ along the channel dimension
- conditioning means that along with the noisy target image, the source image is also passed to the network during the training and reverse process (sampling); the loss function and the sampling step are modified accordingly. It should be observed that the source image $y$ is the same throughout the process. Therefore, during sampling $x_t$ is refined at each step, but the conditioning is provided through the original source image.
- Training Framework for Artifact Reduction An example framework for training is shown in FIG. 3. In some embodiments, the system can be described in three modules: a data processing module (305), a high-efficiency video coding (HEVC) compression module (360), and a diffusion model module (355).
- HEVC high-efficiency video coding
- the data processing module (305) takes a video input (310) and converts it (315) to an enhanced video (325) in order to highlight the high frequency components of the images.
- This can be done, for example by taking an input video stream, and using a hybrid shifted sigmoid linear (HSSL) regression model to produce enhanced video (see, e.g., PCT publication WO2023/205548 “Generating HDR Image from Corresponding Camera Raw and SDR Images” by Guan-Ming Su et al., incorporated by reference herein).
- HSSL focuses on highlighting the high frequency components in the image. As a result, the generated image has comparatively better details.
- a frame (330) of the enhanced frames (325) is fed into the diffusion model (355).
- the enhanced frames (325) are fed into the HEVC compression module (360).
- the diffusion model (355) performs a forward diffusion process (335) on the enhanced frame (330) to produce a Gaussian noise frame (340). This is sent through the reverse diffusion U-Net (350) with the compressed frames (345) to produce the predicted noise $\epsilon_\theta$ (390) for determination of the loss function.
- compressed data training set of video samples
- an HEVC codec can be utilized. The raw videos are passed to the HEVC codec to generate the compressed videos.
- the artifact reduction framework has QP agnostic performance. QP varies based on the bitrate requirement.
- Multi-frame Conditioning is utilized instead of a single frame. In contrast to a single frame, multiple (e.g., three) consecutive frames are concatenated to the target image for conditioning.
- $x_t$ is (340)
- $y_i$ is (330)
- $\{y_{i-2}, y_{i-1}, y_i\}$ is (345).
- Deformable Convolutions [0041]
- the U-Net can utilize a deformable convolution.
- deformable convolution introduces a learnable offset and modulation factor for the $k$-th location. In contrast to conventional convolution, the deformable convolution allows non-uniform sampling that improves the performance on vision tasks such as object detection. As discussed above, a U-Net with non-local attention block is deployed for learning the reverse process.
- a U-Net consists of two parts: 1) U-Net encoder, and 2) U-Net decoder.
- the features extracted by the U-Net encoder are utilized by the U-Net decoder for the required tasks.
- the deformable convolutions are introduced between the U-Net encoder and U-Net decoder such that the U-Net decoder can utilize the extracted features more effectively.
- FIG. 2 shows an example U-Net architecture.
- the diffusion frame $x_t$ (205) is concatenated to a group of one or more consequent frames (210) and put into the U-Net encoder portion (215A, 215B, 215C) of the U-Net for decoding by the U-Net decoder portion (215D, 215E, 215F) to produce the predicted previous diffusion frame $x_{t-1}$ (290).
- the figure is simplified – the number of steps in the U-Net encoder and U-Net decoder can be more than shown.
- a deformable convolution (220) is also performed.
- the noise schedule has three hyperparameters (see FIG. 5)
- $\beta_{start}$ 505
- $\beta_{end}$ 515
- $t_{steps}$ 510
- a linear schedule, for instance, will take $t_{steps}$ linearly spaced values between $\beta_{start}$ and $\beta_{end}$.
- the noise schedule used during the training might not be optimal for use during the inference.
- varying only the $t_{steps}$ while keeping the $\beta_{start}$ and $\beta_{end}$ fixed to the training phase values will lead to poor reconstruction, especially if $t_{steps}$ is reduced significantly. For example, reducing the $t_{steps}$ to 100 during inference in contrast to 2000 during training will provide a very noisy reconstructed image. Similarly, keeping $t_{steps}$ fixed (to the training value) will lead to a very low throughput. For instance, utilizing 2000 steps for the enhancement of each frame will be very computationally expensive. Therefore, to obtain the optimal performance and time complexity, it is optimal to fine-tune all three values.
- Table 1 shows example results of structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) for these example values.
- the performance can be significantly improved by providing the marginally improved source (condition) image.
- a module $q: r \rightarrow q(r)$ can be used to preprocess the source (condition) image (see e.g., 305 of FIG. 3).
- $q: r \rightarrow q(r)$ provides marginally improved source images from significantly degraded images.
- This step can also be considered as an auxiliary enhancement. It provides a better initial point to the diffusion model and significantly reduces the error in the final enhanced images. This process of providing the improved images to the diffusion model for better conditioning is termed as Enhanced Sampling.
- a GAN-based architecture is used.
- FIG. 7 shows an example of auxiliary enhancement.
- the source images (705) are preprocessed (710) to produce the marginally improved images (715) that are then used with the noise (720) for diffusion processing (725) to produce the target image (790).
- a dual discriminator set up is utilized, wherein the output of the generator is input to two different discriminators.
- the first discriminator takes input from the generator without any modifications.
- a 2D wavelet transform is applied to the output of the generator output.
- resultant LH, HL, and HH are fed to the discriminator.
- the whole set up is trained with a weighted combination of a conventional GAN loss function, feature loss, and VGG perceptual loss.
- the trajectory followed by the z in the first frame is also applied to subsequent frames (fixed sampling). This can be done for a subset of frames for a video stream, e.g., sampling z for a key frame and using that z until the next key frame.
- a “weighted temporal sampling” is used which is in-between the two extremes; if the sampled z is visualized as a random process, then the random processes of two consecutive frames are not entirely different. Instead, z for frame i+1 (the subsequent frame) deviates only slightly from that of frame i, i.e., the z of the previous frame is not completely discarded. This is done utilizing an exponential moving average (EMA) to maintain the correspondence of the z from the previous frame.
- EMA exponential moving average
- the EMA will change the variance of the $z_t$, as merging two Gaussians $\mathcal{N}(0, \sigma_1^2)$ and $\mathcal{N}(0, \sigma_2^2)$ leads to a new distribution $\mathcal{N}(0, \sigma_1^2 + \sigma_2^2)$, where $\mathcal{N}(0, \sigma_1^2)$ is a Gaussian sample with zero mean and $\sigma_1^2$ variance. Similarly, $\mathcal{N}(0, \sigma_2^2)$ represents a Gaussian sample with zero mean and $\sigma_2^2$ variance. In general, $\mathcal{N}(\mu, \sigma^2)$ is a representation for a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ (or standard deviation $\sigma$). Finally, normalization leads to a Gaussian sample with zero mean and unit variance.
- adaptive temporal-spatial sampling is used to consider the relative motion between the frames to modulate the z.
- Optical flow is a popular technique to measure the inter-frames pixel motion. It captures the motion in terms of magnitude and direction. The subset of pixels with zero relative motion are assigned zero values in the optical flow. Similarly, optical flow has the nonzero values for pixels with nonzero relative motion.
- Adaptive Temporal-Spatial Sampling (405) is a general case of weighted temporal sampling (415), Fixed Sampling (425), and Conventional Sampling (435).
- Adaptive Temporal-Spatial Sampling (405) with non-zero optical flow (410) reduces to weighted temporal sampling (415).
- Adaptive Temporal-Spatial Sampling (405) with non-zero optical flow (410) and $\beta = 1$ (420) reduces to Fixed Sampling (425).
- Optical Flow Mask [0068] In some embodiments, an optical flow mask can be used to determine which pixels have non-zero optical flow.
- as the original optical flow is non-binary (RGB), it needs to be converted to a binary mask before utilization in the Adaptive Temporal-Spatial Sampling.
- Given an RGB optical flow, the following steps are utilized to obtain the binary mask: 1. Convert the RGB optical flow to a grey image. 2. Define the threshold: in a frame, there could be motion of different magnitudes. For example, the foreground may move at a higher rate than the background. For such a case, choosing a higher threshold will lead to an inaccurate binary mask. The threshold should be chosen such that moving pixels are not set to zero. Hence, a lower value of the threshold should be preferred. Accordingly, a threshold of five is utilized for binary conversion. 3. Convert the grey image to a binary image using the threshold.
- a control module is a component of the methodology. It can be used to control the enhancement quality.
- the reconstruction quality in the proposed architecture can be controlled through two components: 1) Auxiliary Enhancement, and 2) the number of inference steps in the diffusion model. It may not always be necessary to apply the same enhancement process for all the videos. For example, a specific video may have minor artifacts while another may suffer from significant artifacts. For such varying video quality, the control module can be used to bypass the auxiliary enhancement, and to control the inference steps for the diffusion model.
- Example Methodology [0071] FIG. 6 shows an example methodology for artifact removal.
- the raw frames (605) are compressed (610) to produce compressed frames (620) (e.g., for lower bandwidth distribution of the video).
- Motion estimation (615) is performed to produce an optical flow estimation (625).
- the compressed frames are passed to a non-diffusion model (630) (e.g., GAN) for auxiliary enhancement (625) into the diffusion process (655) (see e.g., 355 of FIG. 3) with adaptive temporal-spatial sampling (645).
- Optical flow from the compressed frames (625) is calculated for the relative motion between the frames to produce an optical flow map (650) for the diffusion process (655), which samples noise from a noise memory bank (660) for subsequent frames in fixed or weighted schedules.
- Video Decoder This system can be included as part of a video decoder.
- video decoder is a part of a video receiving device that decodes video data that has been encoded by a video encoder.
- FIG. 8 shows an example decoder flowchart.
- the compressed frames (805) are sent to the video system with compression codecs for decompression (810).
- the frames then undergo the diffusion process (815) to remove artifacts either for further processing (825) or viewing (890) depending on the system requirements (820).
- a method to enhance video frames comprising: receiving a set of one or more source video frames; creating one or more consequent frames from the one or more source video frames; producing a noise frame according to a noise schedule; performing conditional diffusion on the noise frame concatenated with a set of one or more consequent frames, producing one or more enhanced video frames, the conditional diffusion being performed with a trained model.
- EEE2 The method of EEE1, further comprising: performing auxiliary enhancement on the one or more source video frames prior to the performing conditional diffusion.
- EEE3 The method of EEE2, wherein the performing auxiliary enhancement comprises using a generative adversarial network to enhance the one or more source video frames.
- EEE4 The method of any one of EEEs 1-3, wherein the noise schedule includes sampling for each frame.
- EEE5. The method of any one of EEEs 1-4, wherein the noise schedule includes fixed sampling where the sampling of a frame uses stored z values from a previous frame.
- EEE6 The method of any one of EEEs 1-5, wherein the noise schedule includes weighted sampling that combines fixed sampling and sampling for each frame.
- EEE7 The method of EEE6, further comprising performing optical flow estimation of inter-frame pixel motion to create an optical flow map; wherein the weighted sampling is used for pixels where zero relative motion is mapped and another type of sampling is used for pixels where non-zero relative motion is mapped.
- EEE8 The method of EEE7, where the another type of sampling is sampling for each frame or fixed sampling.
- EEE9. The method of any one of EEEs 1-7, further comprising using a control module to adjust enhancement intensity of the trained model.
- EEE10. The method of EEE 2 or 3, further comprising using a control module to adjust enhancement intensity of the auxiliary enhancement.
- EEE11 The method of any one of EEEs 1-9, wherein the set of one or more consequent frames comprises three consequent frames of the one or more source frames.
- EEE12 The method of any of EEEs 1-10, wherein a number of total steps used in the performing conditional diffusion is different than a number of total steps used in training the trained model.
- EEE14 The method of EEE13, wherein the u-net model comprises deformable convolution.
- EEE15 The method of any one of EEEs 1-14, wherein the trained model is trained by: preprocessing training data, training data comprising video; compressing the training data to produce compressed frames; performing forward diffusion based on the training data to produce noise frames; performing a U-Net model on a noise frame concatenated with corresponding compressed frames of the training data.
- EEE16 The method of EEE15, wherein the preprocessing comprises a hybrid shifted sigmoid linear regression model.
- EEE17 The method of EEE 15 or 16, wherein the compressing is performed at multiple quantization parameters (QPs).
- QPs quantization parameters
- EEE18 The method of any one of EEEs 15-17, further comprising calculating a loss function based on a difference between a sample noise at point t and a predicted noise at point t.
- EEE19 The method of any one of EEEs 15-18, wherein the U-Net model comprises a deformation kernel with learnable offsets and modulated factors.
- EEE20 The method of EEE3, wherein training the generative adversarial network (GAN) model comprises: accessing a set of training video frames; applying the set of training frames to a generator; applying the output of the generator directly to a first discriminator; applying the output of the generator to a 2D wavelet transform to generate output HH, HL, LH, and LL components; applying the output components of the wavelet transform to a second discriminator; training the generator, the first discriminator, and the second discriminator with a weighted combination of loss functions generated by the first discriminator and the second discriminator; and discarding the first discriminator and the second discriminator after training.
- GAN generative adversarial network
- EEE21 The method of EEE20, wherein the loss functions comprise one or more of a GAN loss function, a feature loss function, or a VGG perceptual loss function.
- EEE22 The method of EEE20, wherein the output LL components of the 2D wavelet transform are excluded from input to the second discriminator.
- EEE23 A system to enhance video frames, said system comprising: an input for compressed video frames; a memory bank of noise frames; a conditional diffusion model trained to convert the compressed video frames into enhanced video frames using the noise frames, the enhanced video frames being artifact reduced versions of the compressed video frames.
- EEE24 The system of EEE23, further comprising: a motion estimation module configured to produce an optical flow map based on the compressed frames.
- EEE25 The system of EEE 23 or 24, further comprising an auxiliary enhancement module for enhancing the compressed frames prior to processing by the conditional diffusion model.
- EEE26 The system of EEE25, wherein the auxiliary enhancement module comprises a generative adversarial network model.
- EEE27 The system of any one of EEEs 23-26, further comprising a control module configured to adjust enhancement intensity of the conditional diffusion model.
- EEE28 The system of EEE 26 or 27, further comprising a control module configured to adjust enhancement intensity of the auxiliary enhancement module.
- EEE29 A video decoder comprising the system of any one of EEEs 23-28, wherein the video decoder is configured to decode input video encoded by a video encoder.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Picture Signal Circuits (AREA)
Abstract
Novel methods and systems for using conditional diffusion models to reduce video artifacts are described. The method/system can utilize a pre-processed ground truth to improve the process, as well as multiple-frame conditioning, deformable convolutions, auxiliary enhancements, and adaptive temporal-spatial sampling techniques for improved performance. A quantization parameter agnostic framework is also possible.
Description
VIDEO ARTIFACT REDUCTION BY DIFFUSION MODELS CROSS REFERENCE TO RELATED APPLICATIONS [0001] This application claims the benefit of priority from Indian Patent Application No. 202411028682, filed on 8 April 2024, which is incorporated by reference herein in its entirety. TECHNICAL FIELD [0002] The present disclosure relates to improvements for artifact removal for video. More particularly, it relates to methods and systems for using diffusion models to improve video quality. BACKGROUND [0003] When transmitting video over the bandwidth-limited Internet, video compression has to be applied to significantly reduce the coding bitrate. To aid video compression, the Joint Collaborative Team on Video Coding (JCT-VC) has proposed the H.264/advanced video coding (AVC) and high efficiency video coding (HEVC) standards for video compression. Furthermore, as compared to the AVC standard, HEVC can save approximately 50% bitrate on average. However, various compression artifacts (e.g., blocking, blurring and ringing artifacts) are still present in compressed videos, especially at low bitrates. The artifacts mainly result from the block-wise prediction and quantization with limited precision. In addition to impairing the perceptual quality of the video, these artifacts can negatively impact subsequent video enhancement steps. [0004] This, plus other sources of video artifacts, leads to a need to remove or reduce the number and/or severity of video artifacts prior to viewing and/or further video enhancement/processing. [0005] The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.
SUMMARY [0006] An enhancement method/system is introduced using a diffusion model for artifact reduction in video. [0007] An embodiment includes a method to enhance video frames, said method comprising: receiving a set of one or more source video frames; creating one or more consequent frames from the one or more source video frames; producing a noise frame according to a noise schedule; performing conditional diffusion on the noise frame concatenated with a set of one or more consequent frames, producing one or more enhanced video frames, the conditional diffusion being performed with a trained model. In some embodiments, the trained model is trained by: preprocessing training data, training data comprising video; compressing the training data to produce compressed frames; performing forward diffusion based on the training data to produce noise frames; performing a U-Net model on a noise frame concatenated with corresponding compressed frames of the training data. [0008] An embodiment includes a system to enhance video frames, said system comprising: an input for compressed video frames; a memory bank of noise frames; a conditional diffusion model trained to convert the compressed video frames into enhanced video frames using the noise frames, the enhanced video frames being artifact reduced versions of the compressed video frames. In some embodiments, the system can be a part of a video decoder. BRIEF DESCRIPTION OF DRAWINGS [0009] FIG. 1 illustrates an example flowchart of the diffusion process. [0010] FIG. 2 illustrates an example of a U-Net architecture as used herein. [0011] FIG. 3 illustrates an example training methodology for models used herein. [0012] FIG. 4 illustrates an example hierarchy of different sampling types. [0013] FIG. 5 illustrates example of n-steps. [0014] FIG. 6 illustrates an example testing methodology for models used herein. [0015] FIG. 7 illustrates an example of auxiliary enhancement for improved conditioning of models used herein. [0016] FIG. 8 illustrates an example use of the artifact reduction in a decoder.
DETAILED DESCRIPTION [0017] Systems and methods herein describe the use of diffusion models for artifact correction in video. [0018] As used herein, “video” refers to digital data representing consecutive image frames. “Compression” refers to methods that reduce the size (e.g., number of bytes) of the data used for the video. [0019] As used herein, “model” refers to artificial intelligence/machine learning algorithms. “Training” refers to conditioning the model to perform a task using example data. Diffusion Models [0020] Diffusion models (DM) belong to the category of generative models (such as generative adversarial networks, GAN, and variational auto-encoders, VAE) that can be used to generate new images. In essence, a DM utilizes forward diffusion steps to slowly convert image data into isotropic Gaussian noise. Thereafter, these models are trained to learn the reverse diffusion process to map noise into novel images (belonging to a distribution of training data). Forward Diffusion [0021] Given a sample from the data distribution $x_0 \sim q(x)$, the forward diffusion process adds Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $x_1, x_2, \ldots, x_T$, where the step sizes are controlled by a variance schedule $\{\beta_t \in (0,1)\}_{t=1}^{T}$: $q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\right)$.
The $x_t$ approaches the noise at each time instant such that $x_T$ approximates an isotropic Gaussian distribution. The $x_t$ is obtained from $x_0$ through $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, and $\epsilon \sim \mathcal{N}(0, I)$.
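For concreteness, the forward process above can be sketched in PyTorch as follows (a minimal sketch, not the patent's implementation; the helper names are assumptions, and the linear range of 1e-6 to 1e-2 over 2000 steps mirrors the training example given later in this description):

```python
import torch

def make_linear_schedule(beta_start=1e-6, beta_end=1e-2, t_steps=2000):
    """Linear variance schedule beta_t and the cumulative products alpha-bar_t."""
    betas = torch.linspace(beta_start, beta_end, t_steps)     # beta_t
    alphas = 1.0 - betas                                      # alpha_t = 1 - beta_t
    alpha_bars = torch.cumprod(alphas, dim=0)                 # abar_t = prod_s alpha_s
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, eps=None):
    """Draw x_t ~ q(x_t | x_0) for images x0 of shape (N, C, H, W) and integer steps t of shape (N,)."""
    if eps is None:
        eps = torch.randn_like(x0)                            # eps ~ N(0, I)
    abar_t = alpha_bars[t].view(-1, 1, 1, 1)
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps, eps
```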
Reverse Diffusion
[0022] Reversing the forward process and sampling from $q(x_{t-1} \mid x_t)$ will give a new sample starting from $x_T \sim \mathcal{N}(0, I)$. However, estimating $q(x_{t-1} \mid x_t)$ requires the entire dataset, and therefore, it is approximated with $p_\theta(x_{t-1} \mid x_t)$ that is learned through a network. Hence, $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)$. [0023] Upon simplification, the mean can be expressed through the predicted noise as $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$.
[0024] Following the above process, $x_0$ is a new sample from the training distribution. From the above equations, a network is trained to predict $\epsilon_\theta(x_t, t)$ as the other variables are known. For more clarity, the variables with subscript $\theta$ are predicted by the network. For example, $\epsilon_\theta$ is the predicted noise. Furthermore, $\epsilon_\theta(x_t, t)$ implies the predicted noise $\epsilon_\theta$ with $(x_t, t)$ as the input to the network. This suggests that the predicted noise $\epsilon_\theta$ depends on $x_t$ and $t$. Accordingly, the loss function is defined as $L_t = \left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2$,
i.e., the model attempts to approximate the added noise at step $t$ (the sample noise at step $t$ being $\epsilon_t$). In the above equation, $\|\cdot\|^2$ implies the $\ell_2$ norm. A more generic form is $\|\cdot\|_p^p$, with $p = 1$ and $p = 2$ representing the $\ell_1$ and $\ell_2$ norm, respectively. In some embodiments, a U-Net with non-local attention block and provision for time embeddings can be utilized as the learnable architecture. [0025] FIG. 1 shows an example of the reverse diffusion process. The forward process would be X0 (130) being an image that is converted to Gaussian noise at XT (105), so the reverse takes Gaussian noise (105) and generates an image (130). At the intermediate steps (110, 115) a variational lower bound is used. In the reverse direction $p_\theta(x_{t-1} \mid x_t)$ (120) estimates the $q(x_{t-1} \mid x_t)$ (125) used in the forward direction.
[0026] In an example of the training algorithm, a sample $x_0$ is chosen from the training distribution $q(x)$. $t$ is sampled from the uniform distribution $\mathcal{U}(0, T)$. Gaussian noise $\epsilon$ is sampled from the normal distribution $\mathcal{N}(0, I)$. In accordance with the sampled $t$ and $\epsilon$, the noise is added to the sampled $x_0$ to obtain $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, which is then input to the network along with $t$ to obtain the predicted noise $\epsilon_\theta(x_t, t)$, and the loss is calculated against the true noise $\epsilon$. [0027] In an example of the sampling (reverse diffusion) algorithm, a sample $x_T$ is taken from Gaussian noise and is refined over $T$ steps with an intention to obtain a novel sample $x_0$. At each step, a relatively refined $x_{t-1}$ is calculated by utilizing the predicted mean $\mu_\theta(x_t, t)$. This is obtained through $x_t$ (e.g., $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$) from the learned architecture until a true sample, $x_0$, is obtained. Noise Schedule [0028] The noise schedule defines the Markov chain and quantifies the noise added at each step. Two commonly used noise schedules are 1) linear, and 2) cosine. As compared to the linear schedule, the cosine schedule adds noise at a slower pace, resulting in slow degradation of the image. In contrast, the linear schedule destroys the image quickly such that the samples in the last quarter are approximately pure noise. In some embodiments a linear noise schedule is used. In other embodiments a cosine noise schedule is utilized. The choice of noise schedule can depend on the types of images involved. For example, linear schedules are more optimal for high resolution images, whereas cosine schedules are better for lower resolution images (e.g., 32x32 and 64x64). Conditional Diffusion Model [0029] In embodiments herein, a conditional diffusion model is utilized for image enhancement. In conditional diffusion models (CDM), the reverse diffusion process is conditioned on the source image. A conditional distribution can be described as $p(x \mid y)$, where the CDM $p(x \mid y)$ is a stochastic refinement process that maps a source image $y \in \mathbb{R}^M$ to the target image $x \in \mathbb{R}^M$. Notably, to condition the model on the input $y$, it is concatenated with $x_t$ along the channel dimension.
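A minimal sketch of the training and sampling steps just described, reusing the schedule helpers sketched above (the noise-prediction network `eps_model` stands in for the U-Net; all names are assumptions, and the noise term added in the reverse step follows the standard DDPM formulation rather than text reproduced from the patent):

```python
import torch
import torch.nn.functional as F

def training_step(eps_model, x0, alpha_bars, t_steps=2000):
    """One training step: predict the noise added at a randomly sampled step t."""
    t = torch.randint(0, t_steps, (x0.shape[0],), device=x0.device)   # t ~ U(0, T)
    x_t, eps = q_sample(x0, t, alpha_bars)                            # forward diffusion
    eps_pred = eps_model(x_t, t)                                      # eps_theta(x_t, t)
    return F.mse_loss(eps_pred, eps)                                  # || eps - eps_theta ||^2

@torch.no_grad()
def reverse_step(eps_model, x_t, t, betas, alphas, alpha_bars):
    """One reverse-diffusion step x_t -> x_{t-1} using the predicted mean mu_theta."""
    t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long, device=x_t.device)
    eps_pred = eps_model(x_t, t_batch)
    mean = (x_t - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
    if t == 0:
        return mean
    z = torch.randn_like(x_t)                                         # fresh Gaussian sample z
    return mean + betas[t].sqrt() * z
```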
[0030] Conditioning means that along with the noisy target image, the source image is also passed to the network during the training and reverse process (sampling). Accordingly, the modified loss function is $L_t = \left\| \epsilon_t - \epsilon_\theta(x_t, y, t) \right\|^2$. [0031] Similarly, the sampling step is modified as $\mu_\theta(x_t, y, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, y, t)\right)$.
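The conditioning by channel-wise concatenation can be sketched as follows (assumed helper names; the multi-frame variant described later simply stacks several source frames into `y`):

```python
import torch

def conditional_eps(eps_model, x_t, y, t):
    """Condition the noise predictor on the source frame(s) y by channel-wise concatenation.

    x_t : (N, C, H, W) noisy target frame
    y   : (N, C * k, H, W) one or more source frames stacked along channels
    """
    net_in = torch.cat([x_t, y], dim=1)      # [x_t; y] along the channel dimension
    return eps_model(net_in, t)              # eps_theta(x_t, y, t)
```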
[0032] It should be observed that the source image $y$ is the same throughout the process. Therefore, during sampling $x_t$ is refined at each step, but the conditioning is provided through the original source image. Training Framework for Artifact Reduction [0033] An example framework for training is shown in FIG. 3. In some embodiments, the system can be described in three modules: a data processing module (305), a high-efficiency video coding (HEVC) compression module (360), and a diffusion model module (355). [0034] In some embodiments, the data processing module (305) takes a video input (310) and converts it (315) to an enhanced video (325) in order to highlight the high frequency components of the images. This can be done, for example, by taking an input video stream, and using a hybrid shifted sigmoid linear (HSSL) regression model to produce enhanced video (see, e.g., PCT publication WO2023/205548 “Generating HDR Image from Corresponding Camera Raw and SDR Images” by Guan-Ming Su et al., incorporated by reference herein). HSSL focuses on highlighting the high frequency components in the image. As a result, the generated image has comparatively better details. In video compression, at higher QPs there is a significant loss of the high frequency details (artifacts). Accordingly, the diffusion model network should be able to recover these missing details and reduce the artifacts. As HSSL images are relatively rich in details, using these enhanced images as the ground truth instead of the normal images will guide the network toward a better generation of the details. [0035] A frame (330) of the enhanced frames (325) is fed into the diffusion model (355). The enhanced frames (325) are fed into the HEVC compression module (360).
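A compact sketch of the training loop in FIG. 3 under stated assumptions: `hssl_enhance` and `hevc_compress` are hypothetical stand-ins for the data-processing and HEVC compression modules (the patent does not provide code for them), the q_sample/conditional_eps helpers are the sketches above, and the multi-QP augmentation described next is folded in by compressing each clip at one of two QP values:

```python
import random
import torch
import torch.nn.functional as F

def train_artifact_reduction(eps_model, optimizer, raw_clips, alpha_bars,
                             hssl_enhance, hevc_compress, qps=(37, 42), t_steps=2000):
    """Sketch of FIG. 3: enhanced ground truth plus compressed condition frames -> noise-prediction loss."""
    for clip in raw_clips:                          # clip: (F, C, H, W) raw frames
        gt = hssl_enhance(clip)                     # enhanced ground truth (325)
        qp = random.choice(qps)                     # multi-QP augmentation
        comp = hevc_compress(gt, qp)                # compressed condition frames (345)
        for i in range(2, gt.shape[0]):
            x0 = gt[i:i + 1]                        # target frame (330)
            # three consecutive compressed frames stacked along the channel dimension
            y = comp[i - 2:i + 1].reshape(1, -1, *comp.shape[-2:])
            t = torch.randint(0, t_steps, (1,), device=x0.device)
            x_t, eps = q_sample(x0, t, alpha_bars)  # forward diffusion (335 -> 340)
            loss = F.mse_loss(conditional_eps(eps_model, x_t, y, t), eps)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```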
[0036] The HEVC compression module (360) compresses the frames at two or more QPs (quantization parameters) (370A, 370B). For example, QP=37 (370A) and QP=42 (370B). These compressed frames are also fed into the diffusion model (355). [0037] The diffusion model (355) performs a forward diffusion process (335) on the enhanced frame (330) to produce a Gaussian noise frame (340). This is sent through the reverse diffusion U-Net (350) with the compressed frames (345) to produce the predicted noise $\epsilon_\theta$ (390) for determination of the loss function. [0038] To train the network, compressed data (a training set of video samples) containing various artifacts is required. To generate the required compressed data, an HEVC codec can be utilized. The raw videos are passed to the HEVC codec to generate the compressed videos. [0039] In some embodiments, the artifact reduction framework has QP agnostic performance. QP varies based on the bitrate requirement. To infuse this capability in the network, the data is augmented with multiple QP versions. For example, each video is compressed with QP=42 and 37, and both of the versions are utilized for the training. In this way, the model can work optimally on these two QPs during the inference. Furthermore, due to the generative capability of the diffusion models, and because they are trained to recover the details from the severely degraded images at higher QPs, they can satisfactorily also operate at other QPs (e.g., QP=27). Multi-frame Conditioning [0040] In some embodiments, multi-frame conditioning is utilized instead of a single frame. In contrast to a single frame, multiple (e.g., three) consecutive frames are concatenated to the target image for conditioning. Consider a target frame $x$ and source frame $y_i$. According to the forward process, $x$ is converted to $x_t$, and concatenated with the three consecutive frames $\{y_{i-2}, y_{i-1}, y_i\}$ for conditioning. Hence, the input to the U-Net is $\{x_t; y_{i-2}, y_{i-1}, y_i\}$. This multi-frame conditioning presents more spatial and temporal information, which results in improved reconstruction and is also helpful for temporal consistency. In FIG. 3, $x_t$ is (340), $y_i$ is (330), and $\{y_{i-2}, y_{i-1}, y_i\}$ is (345). Deformable Convolutions [0041] In some embodiments, the U-Net can utilize a deformable convolution.
For a $3 \times 3$ convolutional kernel, the output feature $y(p)$ for the input $x(p)$ is defined as: $y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k)$, where $p_k \in \{(-1,-1), (-1,0), \ldots, (1,1)\}$ and $K = 9$ defines the kernel. Deformable convolution introduces a learnable offset $\Delta p_k$ and modulation factor $\Delta m_k$ for the $k$-th location such that: $y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$.
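A sketch of such a modulated deformable convolution block using torchvision's deform_conv2d, a common way to implement the operation above (the layer sizes and initialization are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConvBlock(nn.Module):
    """3x3 modulated deformable convolution: offsets and modulation are predicted from the input."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, kernel_size, kernel_size))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        # 2*K offsets (x and y per sampling location) and K modulation factors, K = kernel_size^2
        k = kernel_size * kernel_size
        self.offset_mask = nn.Conv2d(in_ch, 3 * k, kernel_size, padding=padding)
        self.padding = padding

    def forward(self, x):
        o1, o2, mask = torch.chunk(self.offset_mask(x), 3, dim=1)
        offset = torch.cat([o1, o2], dim=1)        # learnable offsets, delta p_k
        mask = torch.sigmoid(mask)                 # modulation factors, delta m_k in (0, 1)
        return deform_conv2d(x, offset, self.weight, padding=self.padding, mask=mask)
```

In a U-Net, such a block could sit between encoder and decoder features, as the next paragraphs describe.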
[0042] In contrast to conventional convolution, the deformable convolution allows non-uniform sampling that improves the performance on vision tasks such as object detection. [0043] As discussed above, a U-Net with non-local attention block is deployed for learning the reverse process. A U-Net consists of two parts: 1) U-Net encoder, and 2) U-Net decoder. The features extracted by the U-Net encoder are utilized by the U-Net decoder for the required tasks. The deformable convolutions are introduced between the U-Net encoder and U-Net decoder such that the U-Net decoder can utilize the extracted features more effectively. [0044] FIG. 2 shows an example U-Net architecture. The diffusion frame $x_t$ (205) is concatenated to a group of one or more consequent frames (210) and put into the U-Net encoder portion (215A, 215B, 215C) of the U-Net for decoding by the U-Net decoder portion (215D, 215E, 215F) to produce the predicted previous diffusion frame $x_{t-1}$ (290). The figure is simplified – the number of steps in the U-Net encoder and U-Net decoder can be more than shown. In some embodiments, a deformable convolution (220) is also performed. Optimal Inference Process [0045] The noise schedule has three hyperparameters (see FIG. 5): $\beta_{start}$ (505), $\beta_{end}$ (515), and $t_{steps}$ (510). A linear schedule, for instance, will take $t_{steps}$ linearly spaced values between $\beta_{start}$ and $\beta_{end}$. However, it has been observed that the noise schedule used during the training might not be optimal for use during the inference. For example, a linear schedule with $\beta_{start}$ = 1e-6, $\beta_{end}$ = 1e-2, and $t_{steps}$ = 2000 is utilized for training. But the same values during the inference give, in this example, unsatisfactory results. Additionally, $t_{steps}$ = 2000 will lead to a significantly slow inference process. Therefore, it is optimal to finetune
these hyperparameters during the inference. Notably, explicit training for each set of hyperparameters is not required. A fixed set of hyperparameters can be used for training, and afterwards, these values can be grid searched during the inference. A critical question, however, is how the fine tuning (grid search) of these parameters should be performed. There are three possibilities for the fine tuning: 1. Varying $t_{steps}$ with fixed $\beta_{start}$ and $\beta_{end}$ 2. Varying $\beta_{start}$ and $\beta_{end}$ with $t_{steps}$ fixed 3. Varying all three hyperparameters
[0046] Varying only the $t_{steps}$ while keeping the $\beta_{start}$ and $\beta_{end}$ fixed to the training phase values will lead to poor reconstruction, especially if $t_{steps}$ is reduced significantly. For example, reducing the $t_{steps}$ to 100 during inference in contrast to 2000 during training will provide a very noisy reconstructed image. [0047] Similarly, keeping $t_{steps}$ fixed (to the training value) will lead to a very low throughput. For instance, utilizing 2000 steps for the enhancement of each frame will be very computationally expensive. [0048] Therefore, to obtain the optimal performance and time complexity, it is optimal to fine-tune all three values. Note that, during the inference, these values can go beyond the training phase values. For example, if $\beta_{end}$ = 1e-2 during the training, then values such as 0.1, 0.5, or 0.95 can also be utilized during the testing. [0049] The following approach can be used for the hyperparameter search: 1. Fix the maximum number of the $t_{steps}$ (e.g., set to 100) 2. Keep the $\beta_{end}$ very close to 1 (e.g., 0.95) 3. For the fixed $t_{steps}$ and $\beta_{end}$, finetune $\beta_{start}$ such that $\beta_{start} < \beta_{end}$. A larger difference between $\beta_{start}$ and $\beta_{end}$ leads to more detail generation by the model. For example, sequentially decrease $\beta_{start}$ by a factor of 10, i.e., after setting $t_{steps}$ = 100 and $\beta_{end}$ = 0.95, test 0.1, 0.01, and 0.001 for $\beta_{start}$. 4. After obtaining the optimal values of the $\beta_{start}$ and $\beta_{end}$, the $t_{steps}$ can be increased
or decreased to check any variation in the performance. In an embodiment, without limitation, example values include: $\beta_{start}$ = 0.001, $\beta_{end}$ = 0.95, $t_{steps}$ = 100. [0050] Table 1 shows example results of structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) for these example values.

$\beta_{start}$ | $\beta_{end}$ | $t_{steps}$ | PSNR | SSIM
1e-6 | 1e-4 | 100 | 7.616 | 0.0092
1e-3 | 1e-2 | 100 | 12.569 | 0.1998
1e-3 | 1e-1 | 100 | 28.228 | 0.8328
1e-3 | 0.5 | 100 | 28.602 | 0.8454
1e-3 | 0.95 | 100 | 28.995 | 0.8550
1e-3 | 0.95 | 50 | 29.822 | 0.8707
1e-3 | 0.95 | 10 | 30.774 | 0.8878
1e-2 | 0.95 | 10 | 30.629 | 0.8853
Table 1

Enhanced Sampling [0051] The sampling process (mapping the source image to the target image) is described in the sampling algorithm (above). This is utilized to remove the artifacts in the compressed images. In some embodiments, to deal with severe degradations (higher QPs), the performance can be significantly improved by providing a marginally improved source (condition) image. To achieve this, a module $q: r \rightarrow q(r)$ can be used to preprocess the source (condition) image (see e.g., 305 of FIG. 3). Hence, $q: r \rightarrow q(r)$ provides marginally improved source images from significantly degraded images. This step can also be considered as an auxiliary enhancement. It provides a better initial point to the diffusion model and significantly reduces the error in the final enhanced images. This process of providing the improved images to the diffusion model for better conditioning is termed as Enhanced Sampling. In some embodiments, a GAN-based architecture is used.
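Referring back to the inference-time hyperparameter search above, a minimal sketch of the grid search that produces metrics like those in Table 1 (the `sample_video`, `psnr`, and `ssim` callables are assumed placeholders, not part of the patent):

```python
import itertools

def grid_search_inference(sample_video, reference, psnr, ssim,
                          beta_starts=(1e-3, 1e-2), beta_ends=(0.5, 0.95),
                          t_steps_list=(100, 50, 10)):
    """Grid-search the inference noise-schedule hyperparameters (cf. Table 1).

    sample_video(beta_start, beta_end, t_steps) -> enhanced frames   (assumed callable)
    psnr(reference, enhanced), ssim(reference, enhanced) -> float    (assumed metric callables)
    """
    results = []
    for b0, b1, n in itertools.product(beta_starts, beta_ends, t_steps_list):
        if b0 >= b1:
            continue                       # require beta_start < beta_end
        out = sample_video(b0, b1, n)
        results.append((b0, b1, n, psnr(reference, out), ssim(reference, out)))
    # pick the setting with the best PSNR (could equally rank by SSIM or a weighted score)
    return max(results, key=lambda r: r[3])
```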
2D wavelet transform is applied to the output of the generator output. Thereafter, resultant LH, HL, and HH (not LL) are fed to the discriminator. The whole set up is trained with a weighted combination of a conventional GAN loss function, feature loss, and VGG perceptual loss. All these three loss components are derived from both the discriminators. After the testing phase, both discriminators are discarded, and only the generator is utilized for the inference. Fixed Sampling [0054] The sampling process involves sampling z from Gaussian noise for T steps. Therefore, sampling in diffusion models is a stochastic process. Due to this stochasticity, same set of pixels in the consecutive frames will map to slightly different values. Alternately, if one considers the sampling of z as a random process (trajectory), then each frame will follow a different trajectory. This in effect can lead to a flickering effect in the output video. [0055] In some embodiments, to handle the flickering, the system uses a memory bank to save the value of z of first frame for all the time steps. Thereafter, these values are utilized for the remaining frames. In another words, the trajectory followed by the z in the first frame is also applied to subsequent frames (fixed sampling). This can be done for a subset of frames for a video stream, e.g., sampling z for a key frame and using that z until the next key frame. [0056] This means that the error is same for a subset of pixels in all the frames. As expected, it will remove flickering but may create a ‘constant foreground’ across the frames. Notably, it is not observed when viewed in the independent frames. Weighted Sampling [0057] Above describes embodiments with two extremes in which sampling of s is either 1) different for each frame (conventional sampling), and 2) fixed for each frame (fixed sampling). The first leads to flickering whereas later leads to appearance of a constant foreground pattern. In some other embodiments, a “weighted temporal sampling” is used which is in-between the two extremes, and if the sampled s is visualized as a random process, then random process of the two consecutive frames are not entirely different. Instead, s for tu (frame i) is only slightly deviated from that of tuv^ (frame i+1, i.e., the subsequent frame), i.e., s of the previous frame is not completely discarded. This is done utilizing an exponential moving average (EMA) to maintain the correspondence of the s from the previous frame. If
for the $i$-th frame, the value of z at time step $t$ is $z_t^i$, then $z_t^i = \beta\, z_t^{i-1} + (1-\beta)\, \hat{z}$,
where $\hat{z} \sim \mathcal{N}(0, I)$ is the newly sampled noise, and $\beta$ is the weighting factor. Hence, the previous z is maintained by a factor of $\beta$, which is added to the new $(1-\beta)$-scaled z. For example, $\beta = 0.5$ provides equal weightage to the previous and newly sampled values. [0058] In effect, $\beta$ provides a balance between flickering and a constant background pattern. Also, conventional sampling and fixed sampling can be viewed as special cases of weighted temporal sampling. For $\beta = 0$, weighted temporal sampling reduces to conventional sampling, and for $\beta = 1$ it behaves as fixed sampling. For $0 < \beta < 1$, it provides balancing between these two extremities, with flickering for $\beta = 0$, moving towards less flickering and a more observable constant background pattern for higher values of $\beta$, and with no flickering and a consistent foreground pattern at $\beta = 1$. [0059] The impact of the weighted temporal sampling removes the consistent foreground pattern of fixed sampling. A noted step is the normalization of the $z_t^i$ to zero mean and unit variance after the EMA, as it is a strict requirement of the sampling. As observed, the EMA will change the variance of the $z_t$, as merging two Gaussians $\mathcal{N}(0, \sigma_1^2)$ and $\mathcal{N}(0, \sigma_2^2)$ leads to a new distribution $\mathcal{N}(0, \sigma_1^2 + \sigma_2^2)$, where $\mathcal{N}(0, \sigma_1^2)$ is a Gaussian sample with zero mean and $\sigma_1^2$ variance. Similarly, $\mathcal{N}(0, \sigma_2^2)$ represents a Gaussian sample with zero mean and $\sigma_2^2$ variance. In general, $\mathcal{N}(\mu, \sigma^2)$ is a representation for a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ (or standard deviation $\sigma$). [0060] Finally, normalization leads to a Gaussian sample with zero mean and unit variance. If a Gaussian sample $z$ has mean $\mu$ and $\sigma^2$ variance, it can be normalized as $z' = (z - \mu)/\sigma$; the resultant $z'$ will have zero mean and unit variance. Adaptive Temporal-Spatial Sampling [0061] In some embodiments, adaptive temporal-spatial sampling is used to consider the relative motion between the frames to modulate the z. Optical flow is a popular technique to measure the inter-frame pixel motion. It captures the motion in terms of magnitude and direction. The subset of pixels with zero relative motion are assigned zero values in the optical flow. Similarly, the optical flow has nonzero values for pixels with nonzero relative motion.
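A sketch of the weighted temporal sampling just described, with a small memory bank keyed by time step holding the previous frame's z (names and the bank layout are assumptions, not the patent's implementation):

```python
import torch

class TemporalNoiseBank:
    """Weighted temporal sampling: EMA of the per-step noise z across frames, then renormalized."""
    def __init__(self, beta=0.5):
        self.beta = beta          # beta = 0 -> conventional sampling, beta = 1 -> fixed sampling
        self.bank = {}            # time step t -> z used for the previous frame

    def sample(self, t, shape, device):
        z_new = torch.randn(shape, device=device)                # fresh z ~ N(0, I)
        z_prev = self.bank.get(t)
        if z_prev is None:
            z = z_new
        else:
            z = self.beta * z_prev + (1.0 - self.beta) * z_new   # EMA with the previous frame's z
            z = (z - z.mean()) / z.std()                         # renormalize to zero mean, unit variance
        self.bank[t] = z
        return z
```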
[0062] For a set of pixels with nonzero relative motion, modifying the sampling process is not required (even with weighted temporal sampling) and the sampling process for these pixels in the previous frame can be directly utilized in the current frame. For a set of pixels with zero relative motion (or for pixels with relative motion below a threshold value), weighted temporal sampling is applied. [0063] Furthermore, as discussed in the previous section, fixed sampling is a special case of weighted temporal sampling. Hence, the complete hierarchy can be represented as in FIG. 4 with the following observations: [0064] Adaptive Temporal-Spatial Sampling (405) is a general case of weighted temporal sampling (415), Fixed Sampling (425), and Conventional Sampling (435). [0065] Adaptive Temporal-Spatial Sampling (405) with non-zero optical flow (410) reduces to weighted temporal sampling (415). [0066] Adaptive Temporal-Spatial Sampling (405) with non-zero optical flow (410) and $\beta = 1$ (420) reduces to Fixed Sampling (425). [0067] Adaptive Temporal-Spatial Sampling (405) with non-zero optical flow (410) and $\beta = 0$ (435) reduces to Conventional Sampling (435). Optical Flow Mask [0068] In some embodiments, an optical flow mask can be used to determine which pixels have non-zero optical flow. [0069] As the original optical flow is non-binary (RGB), it needs to be converted to a binary mask before utilization in the Adaptive Temporal-Spatial Sampling. Given an RGB optical flow, the following steps are utilized to obtain the binary mask: 1. Convert the RGB optical flow to a grey image. 2. Define the threshold: in a frame, there could be motion of different magnitudes. For example, the foreground may move at a higher rate than the background. For such a case, choosing a higher threshold will lead to an inaccurate binary mask. The threshold should be chosen such that moving pixels are not set to zero. Hence, a lower value of the threshold should be preferred. Accordingly, a threshold of five is utilized for binary conversion. 3. Given the threshold in step 2, convert the grey image to a binary image such that pixels below the threshold are set to 0, and higher ones are set to 1.
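A sketch of the binary motion mask and the per-pixel choice of sampling strategy (OpenCV's cvtColor/threshold handle the conversion; the threshold of five follows the text, while the per-pixel rule follows paragraph [0062] above and the rest of the wiring is assumed):

```python
import cv2
import numpy as np
import torch

def flow_to_binary_mask(flow_rgb, threshold=5):
    """RGB optical-flow visualization -> binary motion mask (1 = moving pixel)."""
    grey = cv2.cvtColor(flow_rgb, cv2.COLOR_RGB2GRAY)
    _, mask = cv2.threshold(grey, threshold, 1, cv2.THRESH_BINARY)
    return torch.from_numpy(mask.astype(np.float32))     # H x W tensor with values in {0, 1}

def adaptive_spatial_sample(z_prev, z_weighted, motion_mask):
    """Per-pixel choice per paragraph [0062]: moving pixels reuse the previous frame's z,
    static pixels use the weighted (EMA) z. z tensors are (N, C, H, W); the mask broadcasts."""
    m = motion_mask[None, None]                           # 1 x 1 x H x W
    return m * z_prev + (1.0 - m) * z_weighted
```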
Control Module [0070] In some embodiments, a control module is a component of the methodology. It can be used to control the enhancement quality. The reconstruction quality in the proposed architecture can be controlled through two components: 1) Auxiliary Enhancement, and 2) the number of inference steps in the diffusion model. It may not always be necessary to apply the same enhancement process for all the videos. For example, a specific video may have minor artifacts while another may suffer from significant artifacts. For such varying video quality, the control module can be used to bypass the auxiliary enhancement, and to control the inference steps for the diffusion model. Example Methodology [0071] FIG. 6 shows an example methodology for artifact removal. The raw frames (605) (one or more contiguous frames) are compressed (610) to produce compressed frames (620) (e.g., for lower bandwidth distribution of the video). Motion estimation (615) is performed to produce an optical flow estimation (625). The compressed frames are passed to a non-diffusion model (630) (e.g., GAN) for auxiliary enhancement (625) into the diffusion process (655) (see e.g., 355 of FIG. 3) with adaptive temporal-spatial sampling (645). Optical flow from the compressed frames (625) is calculated for the relative motion between the frames to produce an optical flow map (650) for the diffusion process (655), which samples noise from a noise memory bank (660) for subsequent frames in fixed or weighted schedules. The
Conclusion
[0073] In this document, a diffusion-model-based video enhancement system is described.
[0074] A number of embodiments of the disclosure have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the present disclosure. Accordingly, other embodiments are within the scope of the following claims.
[0075] As described herein, an embodiment of the present invention may thus relate to one or more of the example embodiments, which are enumerated below. Accordingly, the invention may be embodied in any of the forms described herein, including, but not limited to, the following Enumerated Example Embodiments (EEEs), which describe structure, features, and functionality of some portions of the present invention:
[0076] EEE1. A method to enhance video frames, said method comprising: receiving a set of one or more source video frames; creating one or more consequent frames from the one or more source video frames; producing a noise frame according to a noise schedule; performing conditional diffusion on the noise frame concatenated with a set of one or more consequent frames, producing one or more enhanced video frames, the conditional diffusion being performed with a trained model.
[0077] EEE2. The method of EEE1, further comprising: performing auxiliary enhancement on the one or more source video frames prior to the performing conditional diffusion.
[0078] EEE3. The method of EEE2, wherein the performing auxiliary enhancement comprises using a generative adversarial network to enhance the one or more source video frames.
[0079] EEE4. The method of any one of EEEs 1-3, wherein the noise schedule includes sampling for each frame.
[0080] EEE5. The method of any one of EEEs 1-4, wherein the noise schedule includes fixed sampling where the sampling of a frame uses stored z values from a previous frame.
[0081] EEE6. The method of any one of EEEs 1-5, wherein the noise schedule includes
weighted sampling that combines fixed sampling and sampling for each frame.
[0082] EEE7. The method of EEE6, further comprising performing optical flow estimation of inter-frame pixel motion to create an optical flow map; wherein the weighted sampling is used for pixels where zero relative motion is mapped and another type of sampling is used for pixels where non-zero relative motion is mapped.
[0083] EEE8. The method of EEE7, where the another type of sampling is sampling for each frame or fixed sampling.
[0084] EEE9. The method of any one of EEEs 1-7, further comprising using a control module to adjust enhancement intensity of the trained model.
[0085] EEE10. The method of EEE 2 or 3, further comprising using a control module to adjust enhancement intensity of the auxiliary enhancement.
[0086] EEE11. The method of any one of EEEs 1-9, wherein the set of one or more consequent frames comprises three consequent frames of the one or more source frames.
[0087] EEE12. The method of any of EEEs 1-10, wherein a number of total steps used in the performing conditional diffusion is different than a number of total steps used in training the trained model.
[0088] EEE13. The method of any of EEEs 1-12, wherein the performing conditional diffusion further comprises using a u-net model.
[0089] EEE14. The method of EEE13, wherein the u-net model comprises deformable convolution.
[0090] EEE15. The method of any one of EEEs 1-14, wherein the trained model is trained by: preprocessing training data, training data comprising video; compressing the training data to produce compressed frames; performing forward diffusion based on the training data to produce noise frames; performing a U-Net model on a noise frame concatenated with corresponding compressed frames of the training data.
[0091] EEE16. The method of EEE15, wherein the preprocessing comprises a hybrid shifted sigmoid linear regression model.
[0092] EEE17. The method of EEE 15 or 16, wherein the compressing is performed at
multiple quantization parameters (QPs).
[0093] EEE18. The method of any one of EEEs 15-17, further comprising calculating a loss function based on a difference between a sample noise at point t and a predicted noise at point t.
[0094] EEE19. The method of any one of EEEs 15-18, wherein the U-Net model comprises a deformation kernel with learnable offsets and modulated factors.
[0095] EEE20. The method of EEE3, wherein training the generative adversarial network (GAN) model comprises: accessing a set of training video frames; applying the set of training frames to a generator; applying the output of the generator directly to a first discriminator; applying the output of the generator to a 2D wavelet transform to generate output HH, HL, LH, and LL components; applying the output components of the wavelet transform to a second discriminator; training the generator, the first discriminator, and the second discriminator with a weighted combination of loss functions generated by the first discriminator and the second discriminator; and discarding the first discriminator and the second discriminator after training.
[0096] EEE21. The method of EEE20, wherein the loss functions comprise one or more of a GAN loss function, a feature loss function, or a VGG perceptual loss function.
[0097] EEE22. The method of EEE20, wherein the output LL components of the 2D wavelet transform are excluded from input to the second discriminator.
[0098] EEE23. A system to enhance video frames, said system comprising: an input for compressed video frames; a memory bank of noise frames; a conditional diffusion model trained to convert the compressed video frames into enhanced video frames using the noise frames, the enhanced video frames being artifact reduced versions of the compressed video frames.
[0099] EEE24. The system of EEE23, further comprising: a motion estimation module configured to produce an optical flow map based on the compressed frames.
[0100] EEE25. The system of EEE 23 or 24, further comprising an auxiliary enhancement module for enhancing the compressed frames prior to processing by the conditional diffusion model.
[0101] EEE26. The system of EEE25, wherein the auxiliary enhancement module comprises a generative adversarial network model.
[0102] EEE27. The system of any one of EEEs 23-26, further comprising a control module configured to adjust enhancement intensity of the conditional diffusion model.
[0103] EEE28. The system of EEE 26 or 27, further comprising a control module configured to adjust enhancement intensity of the auxiliary enhancement module.
[0104] EEE29. A video decoder comprising the system of any one of EEEs 23-28, wherein the video decoder is configured to decode input video encoded from a video encoder.
[0105] The examples set forth above are provided to those of ordinary skill in the art as a complete disclosure and description of how to make and use the embodiments of the disclosure, and are not intended to limit the scope of what the inventor/inventors regard as their disclosure.
[0106] Modifications of the above-described modes for carrying out the methods and systems herein disclosed that are obvious to persons of skill in the art are intended to be within the scope of the following claims. All patents and publications mentioned in the specification are indicative of the levels of skill of those skilled in the art to which the disclosure pertains. All references cited in this disclosure are incorporated by reference to the same extent as if each reference had been incorporated by reference in its entirety individually.
[0107] It is to be understood that the disclosure is not limited to particular methods or systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in this specification and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the content clearly dictates otherwise. The term “plurality” includes two or more referents unless the content clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosure pertains.
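As an informal illustration of EEE18 and EEEs 20-22 (and the corresponding claims), the following PyTorch sketch shows a single-level 2D Haar wavelet split, a dual-discriminator GAN training step in which the second discriminator receives only the HH, HL, and LH detail sub-bands (the LL sub-band is excluded, per EEE22), and a minimal noise-prediction loss. The hinge-style adversarial loss, the loss weights, and all function and argument names are assumptions; the feature loss and VGG perceptual loss of EEE21 are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def haar_dwt_level1(x: torch.Tensor):
    """Single-level 2D Haar wavelet transform of an NCHW tensor.
    Returns (LL, LH, HL, HH) sub-bands, each half the spatial size."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def diffusion_noise_loss(eps_sampled: torch.Tensor, eps_pred: torch.Tensor) -> torch.Tensor:
    """Minimal loss in the spirit of EEE18: difference between the sampled
    noise at step t and the noise predicted by the model at step t."""
    return F.mse_loss(eps_pred, eps_sampled)

def gan_training_step(generator, disc_pixel, disc_wavelet,
                      compressed, target, opt_g, opt_d,
                      w_pixel=1.0, w_wavelet=1.0):
    """One hypothetical training step for the auxiliary GAN enhancer with a
    pixel-domain discriminator and a wavelet-domain discriminator that sees
    only the HH, HL, and LH detail sub-bands (cf. EEE20-EEE22)."""
    fake = generator(compressed)

    def detail_bands(img):
        _, lh, hl, hh = haar_dwt_level1(img)
        return torch.cat([hh, hl, lh], dim=1)  # LL intentionally excluded

    # --- discriminator update (hinge loss, an illustrative choice) ---
    opt_d.zero_grad()
    d_loss = w_pixel * (F.relu(1.0 - disc_pixel(target)).mean()
                        + F.relu(1.0 + disc_pixel(fake.detach())).mean())
    d_loss = d_loss + w_wavelet * (
        F.relu(1.0 - disc_wavelet(detail_bands(target))).mean()
        + F.relu(1.0 + disc_wavelet(detail_bands(fake.detach()))).mean())
    d_loss.backward()
    opt_d.step()

    # --- generator update (weighted combination of the two adversarial losses) ---
    opt_g.zero_grad()
    g_loss = (-w_pixel * disc_pixel(fake).mean()
              - w_wavelet * disc_wavelet(detail_bands(fake)).mean())
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

After training, both discriminators would be discarded, as EEE20 notes, leaving only the generator for inference.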
Claims
CLAIMS What is claimed is: 1. A method to enhance video frames, said method comprising: receiving a set of one or more source video frames; creating one or more consequent frames from the one or more source video frames; producing a noise frame according to a noise schedule; performing conditional diffusion on the noise frame concatenated with a set of one or more consequent frames, producing one or more enhanced video frames, the conditional diffusion being performed with a trained model.
2. The method of claim 1, further comprising: performing auxiliary enhancement on the one or more source video frames prior to the performing conditional diffusion.
3. The method of claim 2, wherein the performing auxiliary enhancement comprises using a generative adversarial network to enhance the one or more source video frames.
4. The method of any one of claims 1-3, wherein the noise schedule includes sampling for each frame.
5. The method of any one of claims 1-4, wherein the noise schedule includes fixed sampling where the sampling of a frame uses stored z values from a previous frame.
6. The method of any one of claims 1-5, wherein the noise schedule includes weighted sampling that combines fixed sampling and sampling for each frame.
7. The method of claim 6, further comprising performing optical flow estimation of inter-frame pixel motion to create an optical flow map; wherein the weighted sampling is used for pixels where zero relative motion is mapped and another type of sampling is used for pixels where non-zero relative motion is mapped.
8. The method of claim 7, where the another type of sampling is sampling for each frame or fixed sampling.
9. The method of any one of claims 1-7, further comprising using a control module to adjust enhancement intensity of the trained model.
10. The method of claim 2 or 3, further comprising using a control module to adjust enhancement intensity of the auxiliary enhancement.
11. The method of any one of claims 1-9, wherein the set of one or more consequent frames comprises three consequent frames of the one or more source frames.
12. The method of any one of claims 1-10, wherein a number of total steps used in the performing conditional diffusion is different than a number of total steps used in training the trained model.
13. The method of any one of claims 1-12, wherein the performing conditional diffusion further comprises using a u-net model.
14. The method of claim 13, wherein the u-net model comprises deformable convolution.
15. The method of any one of claims 1-14, wherein the trained model is trained by: preprocessing training data, training data comprising video; compressing the training data to produce compressed frames; performing forward diffusion based on the training data to produce noise frames; performing a U-Net model on a noise frame concatenated with corresponding compressed frames of the training data.
16. The method of claim 15, wherein the preprocessing comprises a hybrid shifted sigmoid linear regression model.
17. The method of claim 15 or 16, wherein the compressing is performed at multiple quantization parameters (QPs).
18. The method of any one of claims 15-17, further comprising calculating a loss function based on a difference between a sample noise at point t and a predicted noise at point t.
19. The method of any one of claims 15-18, wherein the U-Net model comprises a deformation kernel with learnable offsets and modulated factors.
20. The method of claim 3, wherein training the generative adversarial network (GAN) model comprises: accessing a set of training video frames; applying the set of training frames to a generator; applying the output of the generator directly to a first discriminator; applying the output of the generator to a 2D wavelet transform to generate output HH, HL, LH, and LL components; applying the output components of the wavelet transform to a second discriminator; training the generator, the first discriminator, and the second discriminator with a weighted combination of loss functions generated by the first discriminator and the second discriminator; and discarding the first discriminator and the second discriminator after training.
21. The method of claim 20, wherein the loss functions comprise one or more of a GAN loss function, a feature loss function, or a VGG perceptual loss function.
22. The method of claim 20, wherein the output LL components of the 2D wavelet transform are excluded from input to the second discriminator.
23. A system to enhance video frames, said system comprising: an input for compressed video frames; a memory bank of noise frames; a conditional diffusion model trained to convert the compressed video frames into enhanced video frames using the noise frames, the enhanced video frames being artifact reduced versions of the compressed video frames.
24. The system of claim 23, further comprising: a motion estimation module configured to produce an optical flow map based on the compressed frames.
25. The system of claim 23 or 24, further comprising an auxiliary enhancement module for enhancing the compressed frames prior to processing by the conditional diffusion model.
26. The system of claim 25, wherein the auxiliary enhancement module comprises a generative adversarial network model.
27. The system of any one of claims 23-26, further comprising a control module configured to adjust enhancement intensity of the conditional diffusion model.
28. The system of claim 26 or 27, further comprising a control module configured to adjust enhancement intensity of the auxiliary enhancement module.
29. A video decoder comprising the system of any one of claims 23-28, wherein the video decoder is configured to decode input video encoded from a video encoder.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| IN202411028682 | 2024-04-08 | ||
| IN202411028682 | 2024-04-08 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025217001A1 (en) | 2025-10-16 |
Family
ID=95583362
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/023309 (WO2025217001A1, pending) | Video artifact reduction by diffusion models | 2024-04-08 | 2025-04-04 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025217001A1 (en) |
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023205548A1 (en) | 2022-04-21 | 2023-10-26 | Dolby Laboratories Licensing Corporation | Generating hdr image from corresponding camera raw and sdr images |
| WO2024073213A1 (en) * | 2022-09-27 | 2024-04-04 | Qualcomm Incorporated | Diffusion-based data compression |
Non-Patent Citations (7)
| Title |
|---|
| ESKANDAR GEORGE ET AL: "Wavelet-Based Unsupervised Label-to-Image Translation", ICASSP 2022 - 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 23 May 2022 (2022-05-23), pages 1760 - 1764, XP034158648, [retrieved on 20220427], DOI: 10.1109/ICASSP43922.2022.9746759 * |
| GAL RINON ET AL: "SWAGAN", ACM TRANSACTIONS ON GRAPHICS, ACM, NY, US, vol. 40, no. 4, 19 July 2021 (2021-07-19), pages 1 - 11, XP058683546, ISSN: 0730-0301, DOI: 10.1145/3450626.3459836 * |
| GYEONGMAN KIM ET AL: "Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 27 March 2023 (2023-03-27), XP091468725 * |
| MINGZHEN SUN ET AL: "GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 September 2023 (2023-09-23), XP091621986 * |
| XIN LI ET AL: "Diffusion Models for Image Restoration and Enhancement -- A Comprehensive Survey", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 August 2023 (2023-08-18), XP091595232 * |
| YAN QINGSEN ET AL: "Toward High-Quality HDR Deghosting With Conditional Diffusion Models", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, IEEE, USA, vol. 34, no. 5, 19 October 2023 (2023-10-19), pages 4011 - 4026, XP011969717, ISSN: 1051-8215, [retrieved on 20231020], DOI: 10.1109/TCSVT.2023.3326293 * |
| ZHENGXIONG LUO ET AL: "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 13 October 2023 (2023-10-13), XP091634758 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: The EPO has been informed by WIPO that EP was designated in this application | Ref document number: 25722697; Country of ref document: EP; Kind code of ref document: A1 |