
WO2023229589A1 - Real-time video super-resolution for mobile devices - Google Patents

Real-time video super-resolution for mobile devices

Info

Publication number
WO2023229589A1
WO2023229589A1 PCT/US2022/030918 US2022030918W WO2023229589A1 WO 2023229589 A1 WO2023229589 A1 WO 2023229589A1 US 2022030918 W US2022030918 W US 2022030918W WO 2023229589 A1 WO2023229589 A1 WO 2023229589A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
output
recursive
resolution
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2022/030918
Other languages
English (en)
Inventor
Jie Cai
Zibo MENG
Chiu Man HO
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innopeak Technology Inc
Original Assignee
Innopeak Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology Inc filed Critical Innopeak Technology Inc
Priority to PCT/US2022/030918 priority Critical patent/WO2023229589A1/fr
Priority to PCT/US2022/053987 priority patent/WO2023229644A1/fr
Priority to PCT/US2022/053986 priority patent/WO2023229643A1/fr
Priority to PCT/US2022/053989 priority patent/WO2023229645A1/fr
Publication of WO2023229589A1 publication Critical patent/WO2023229589A1/fr
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4046 Scaling of whole images or parts thereof using neural networks
    • G06T 3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/09 Supervised learning

Definitions

  • This application relates generally to image processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for generating high resolution visual data from low resolution visual data to restore visual details during image super-resolution (ISR) or video super-resolution (VSR).
  • a super-resolution convolutional neural network includes three layers and is used in ISR.
  • Another super-resolution generative adversarial network uses an adversarial loss to implement ISR.
  • Residual-in-residual dense blocks (RRDB) and a relativistic generative adversarial network (GAN) are also applied in some super-resolution approaches.
  • a pyramid, cascading, and deformable (PCD) alignment module aligns frames at a feature level using deformable convolutions in a coarse-to-fine manner.
  • a temporal and spatial attention (TSA) fusion module is applied to emphasize important features for subsequent restoration in both temporal and spatial domains.
  • Spatial and temporal contexts are optionally integrated from continuous video frames using a recurrent encoder-decoder module.
  • an end-to-end trainable frame-recurrent framework may be applied to warp a previously inferred high resolution frame to estimate and super-resolve a subsequent frame.
  • Various embodiments of this application are directed to generating high resolution visual data from low resolution visual data during ISR and VSR efficiently and accurately.
  • An image feature map is extracted from raw image data or a color component of an image frame or video having a low resolution.
  • the image feature map is processed using a sequence of successive recursive blocks to generate an output feature map, and the sequence includes one or more recursive blocks each of which has a plurality of residual units and a skip connection coupling an input of the recursive block to an output of the recursive block.
  • the output feature map is converted to an output image having a higher resolution than the image frame.
  • the sequence of recursive blocks overcomes constraints of mobile devices (e.g., power consumption, restricted memory, and compatibility with CNN operations), and can be deployed to mobile devices.
  • the plurality of recursive blocks have fewer than 100 thousand parameters, are implemented with approximately 20 GFLOPs on a neural processing unit of a mobile device, and take about 4 milliseconds to infer a high resolution frame.
  • some implementations of this application provide efficient and effective deep learning solutions that are based on recursive blocks and can be implemented directly on mobile devices.
  • an image processing method is implemented at an electronic device.
  • the method includes obtaining an input image including a plurality of components and separating an image component from one or more remaining components of the input image.
  • the image component has a first resolution.
  • the method further includes extracting an image feature map from the image component and processing the image feature map using a plurality of successive recursive blocks to generate an output feature map.
  • the image feature map has a second resolution that is equal to or less than the first resolution.
  • Each recursive block includes a plurality of residual units and a skip connection coupling an input of the recursive block to an output of the recursive block.
  • an image processing method is implemented at an electronic device. The method includes obtaining a first image having a first resolution and converting the first image to a plurality of distinct test images each having the first resolution. The first image corresponds to a plurality of predefined noise types.
  • Conversion of the first image includes, for each distinct test image and in accordance with a noise shuffling scheme, shuffling the plurality of predefined noise types to an ordered sequence of noise types, for each noise type in the ordered sequence of noise types successively selecting a respective noise creation operation from one or more predefined noise creation operations of the respective noise type, and applying a plurality of noise creation operations corresponding to the ordered sequence of noise types to the first image to generate the respective distinct test image (a sketch of such a pipeline is given below).
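  • As an illustration only, the following Python sketch shows one way such a shuffled degradation pipeline could be organized; the specific noise-creation operations, parameters, and function names are assumptions and are not taken from the disclosure.

```python
import random
import numpy as np

# Hypothetical noise-creation operations; each takes and returns an HxW float image in [0, 1].
def gaussian_noise(img, sigma=0.05):
    return np.clip(img + np.random.normal(0.0, sigma, img.shape), 0.0, 1.0)

def box_blur(img, k=3):
    # Separable box blur as a stand-in for a generic blurring degradation.
    kernel = np.ones(k) / k
    blurred = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, blurred)

def down_up_sample(img, factor=2):
    # Downsample by striding, then upsample by pixel repetition (nearest neighbor).
    small = img[::factor, ::factor]
    up = np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
    return up[: img.shape[0], : img.shape[1]]

# One or more candidate operations per predefined noise type.
NOISE_TYPES = {
    "statistical": [gaussian_noise],
    "blurring": [box_blur],
    "downsampling": [down_up_sample],
    # JPEG compression noise would be added here via an actual encoder/decoder round trip.
}

def make_test_image(first_image):
    """Shuffle the noise types into an ordered sequence, pick one operation per type,
    and apply them in order to produce one distinct test image."""
    order = list(NOISE_TYPES)
    random.shuffle(order)                            # noise shuffling scheme
    img = first_image.copy()
    for noise_type in order:
        op = random.choice(NOISE_TYPES[noise_type])  # select a noise creation operation
        img = op(img)
    return img

# Example: build a small training set of distinct test images from one clean image.
clean = np.random.rand(64, 64)
training_set = [make_test_image(clean) for _ in range(4)]
```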
  • the method further includes providing the plurality of distinct test images in a training data set for training an image processing model.
  • an image processing method is implemented at an electronic device.
  • the method includes obtaining raw image data captured by image sensors of a camera, and the raw image data includes an input image having a first resolution.
  • the method further includes extracting an image feature map from the raw image data, and the image feature map has a second resolution that is equal to or less than the first resolution.
  • the method includes processing the image feature map using a sequence of successive recursive blocks to generate an output feature map.
  • the sequence includes one or more recursive blocks, and each recursive block includes a plurality of residual units and a skip connection coupling an input of the recursive block to an output of the recursive block.
  • the method further includes converting the output feature map to an output image having a third resolution that is greater than the first resolution and generating, from the output image, a color image having a color mode.
  • some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • FIG. 2 is a block diagram illustrating an electronic system configured to process content data (e.g., image data), in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • FIG. 5 is a flow diagram of an example image processing method for increasing an image resolution (i.e., for image or video super-resolution), in accordance with some embodiments.
  • FIGS 6A and 6B are two flow diagrams of example image processing methods for increasing an image resolution (i.e., image or video super-resolution), in accordance with some embodiments.
  • Figure 7A is a block diagram of an example image degradation scheme for generating test images, in accordance with some embodiments, and Figures 7B and 7C are two additional example image degradation schemes, in accordance with some embodiments.
  • Figures 8-10 are flow diagrams of example image processing methods for improving image quality, in accordance with some embodiments.
  • Image or video super-resolution aims at recovering a high resolution (HR) image or video from a corresponding low-resolution (LR) image or video.
  • super-resolution has attracted attention and has become a critical function of many media-related user applications.
  • super-resolution takes RGB or YUV color images as inputs.
  • Y component represents a luminance component
  • U and V components represent two chrominance components.
  • R, G, and B components correspond to three additive primary colors, red, green, and blue, respectively.
  • super-resolution is applied on raw image data captured by image sensors of a camera before the raw image data are processed to color images by an image signal processor (ISP) of the camera.
  • a sequence of recursive blocks each including a plurality of residual units is applied and optimized for implementing super-resolution on mobile devices.
  • the recursive blocks are configured to overcome constraints of mobile devices (e.g., power consumption, restricted memory, and compatibility with CNN operations).
  • some implementations of this application provide efficient and effective deep learning solutions that are based on recursive blocks, and can enable VSR in real time (e.g., at a rate of 30 frames per second (FPS)) on mobile devices having limited power, computational, and storage resources.
  • a data degradation pipeline is applied to provide image data to train the recursive blocks.
  • the image data takes into account and randomly shuffles a wide range of degradations in real-world super-resolution.
  • these degradations include, but are not limited to, blurring noises, Joint Photographic Experts Group (JPEG) compression noise, statistical noise, and downsampling degradations.
  • a real-time mobile super-resolution model is trained with image data generated by the data degradation pipeline to provide desirable visual performance in a reliable manner.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network- connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C.
  • the networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables.
  • Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • These data processing models are trained with training data before they are applied to process the content data.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D).
  • the client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A and HMD 104D).
  • the server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102 A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102 A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • a pair of AR glasses 104D are communicatively coupled in the data processing environment 100.
  • the AR glasses 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses.
  • the device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D.
  • the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
  • deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by a client device 104.
  • 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model.
  • Visual content is optionally generated using a second data processing model.
  • the client device 104 has a limited computational capability
  • training of the first or second data processing models is optionally implemented by the server 102, while inference of the device poses and visual content is implemented by the client device 104.
  • the second data processing model includes an image processing model for ISR or VSR, and is implemented in a user application (e.g., a social networking application, a social media application, a short video application, and a media play application).
  • FIG. 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104 (e.g., a mobile phone 104C in Figure 1), a storage 106, or a combination thereof.
  • the electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard.
  • the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera for gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more optical cameras (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for obtaining training data and establishing a data processing model 240 for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Image degradation module 228 for synthesizing a plurality of distinct test images that incorporate noise of a plurality of predefined noise types into an input image based on a noise shuffling scheme and providing the plurality of distinct test images to the model training module 226 to train a data processing model 240 (e.g., an image processing model 515 for ISR or VSR in Figure 5);
  • Data processing module 230 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 230 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224, and in an example, the data processing module 230 is applied to increase an image resolution using an image processing model (e.g., models 515, 615, and 615’ in Figures 5 and 6A-6B); and
  • the one or more databases 250 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the one or more databases 250 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, and various subsets of these modules may be combined or otherwise rearranged in some embodiments.
  • memory 206, optionally, stores a subset of the modules and data structures identified above.
  • memory 206, optionally, stores additional modules and data structures not described above.
  • FIG. 3 is an example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 230 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 230 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 230 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 230 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 that is applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criteria (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 230 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 230 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments
  • Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • a weight w associated with each link 412 is applied to the node output.
  • the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
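  • As an illustrative formula only (the embodiment may define the propagation function differently), a common choice combines the node inputs x_i with the link weights w_i, adds the network bias b described further below, and applies a non-linear activation φ:

```latex
% Illustrative node propagation function: weighted sum of inputs, plus bias, then activation
y = \varphi\!\left(\sum_{i} w_i \, x_i + b\right)
```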
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406.
  • each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • a recurrent neural network is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
  • the training process is a process for calibrating all of the weights for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • During forward propagation, the set of weights for different layers is applied to the input data and the intermediate results from the previous layers.
  • During backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error (see the training-loop sketch below).
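  • A minimal sketch of this forward/backward loop, assuming a PyTorch-style model, data loader, and optimizer (all names here are illustrative, not from the disclosure):

```python
import torch

def train(model, loader, loss_fn, epochs=10, lr=1e-4, loss_threshold=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                        # repeated until a convergence condition is met
        for inputs, ground_truth in loader:
            outputs = model(inputs)                # forward propagation through all layers
            loss = loss_fn(outputs, ground_truth)  # margin of error of the output
            optimizer.zero_grad()
            loss.backward()                        # backward propagation of the error
            optimizer.step()                       # adjust weights (and biases) to decrease the error
        if loss.item() < loss_threshold:           # simple loss criterion
            break
    return model
```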
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
  • FIG. 5 is a flow diagram of an example image processing method 500 for increasing an image resolution (i.e., for image or video super-resolution), in accordance with some embodiments.
  • the image processing method 500 is implemented by an electronic device (e.g., a data processing module 230 of a mobile phone 104C).
  • the electronic device obtains an input image 502 having a first resolution (e.g., H x W) and generates an output image 504 having a third resolution (e.g., nH x nW, where n is a positive integer) that is greater than the first resolution.
  • the input image 502 is optionally a static image or an image frame of a video clip.
  • the input image 502 is received via one or more communication networks 108 and in a user application 224, e.g., a social networking application, a social media application, a short video application, and a media play application.
  • these user applications 224 include, but are not limited to, Tiktok, Kuaishuo, WeChat, Tencent video, iQiyi, and Youku.
  • a server 102 associated with the user application 224 streams low-resolution visual data including the input image 502 to electronic devices distributed at different client nodes. If displayed without ISR, the input image 502 would result in poor user experience for users of the user application 224.
  • the input image 502 is part of a low-resolution video stream provided to unpaid users of a media play application.
  • VSR aims to improve video quality and the users’ watching experience by utilizing artificial intelligence.
  • the image processing method 500 uses low-resolution information of the input image 502 and associated adjacent temporal information to predict missing information of the input image 502, which leads to a high-resolution video sequence including the output image 504.
  • the image processing method 500 also enhances a quality of the input image 502, e.g., by reducing noise, blurriness, and artifacts therein.
  • the input image 502 includes a plurality of image components (e.g., three components 502A, 502B, and 502C).
  • the three image components 502A, 502B, and 502C correspond to a luminance component (Y) and two chrominance components (U and V) of the input components, respectively.
  • the electronic device separates an image component 502A (e.g., the luminance component (Y)) from one or more remaining components 502B and 502C of the input image 502.
  • the image component 502A has the first resolution (e.g., H x W).
  • the image component 502A has a single channel.
  • the electronic device extracts an image feature map 506 from the image component 502A.
  • the image feature map 506 has a second resolution (e.g., H/m x W/m, where m is a positive integer) that is equal to or less than the first resolution.
  • the image feature map 506 is expanded to a plurality of channels, e.g., 9 or 32 channels.
  • the image feature map 506 is further processed by a plurality of successive recursive blocks 508 (e.g., two successive recursive blocks 508A and 508B that are coupled in series) to generate an output feature map 510.
  • Each recursive block 508 includes a plurality of residual units 512 and a skip connection 514 coupling an input of the recursive block 508 to an output of the recursive block 508.
  • the electronic device converts the output feature map 510 to an output component 504A having a third resolution (e.g., 3H x 3W) that is greater than the first resolution.
  • The output component 504A and the one or more remaining components 502B and 502C of the input image 502 are combined to generate an output image 504 having the third resolution.
  • Each of the one or more remaining components 502B and 502C has the first resolution.
  • the third resolution is equal to a multiplication of the first resolution by a scale number.
  • each pixel in the one or more remaining components 502B and 502C corresponds to a pixel group having the scale number of pixels and including the respective pixel itself.
  • a component value corresponding to each pixel is spread to cover the entire pixel group.
  • the first and third resolutions are H x W and 3H x 3W, respectively.
  • Each pixel in the component 502B or 502C corresponds to 9 respective immediately adjacent pixels in a counterpart component 504B or 504C, respectively.
  • the component value of each pixel in the component 502B or 502C is therefore used as component values of the 9 pixels in a counterpart component 504B or 504C, respectively.
  • the 9 pixels in each counterpart component 504B or 504C are organized in a 3x3 pixel array.
  • the component value of each pixel in the component 502B or 502C is therefore used as a component value of a center pixel of the 9 pixels in the counterpart component 504B or 504C, respectively.
  • Other pixels in the 9 pixels are interpolated from two closest center pixels of two pixel groups in the counterpart components 504B or 504C based on relative distances to the two closest center pixels.
  • the output component 504A and the counterpart components 504B and 504C are combined to generate the output image 504 having the third resolution.
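  • A minimal NumPy sketch of the pixel-group spreading option for the remaining components, assuming a 3x3 pixel group (scale factor 3 per axis); the array sizes are arbitrary examples, and the interpolation variant described above would replace the simple repetition:

```python
import numpy as np

def spread_component(component, scale=3):
    """Nearest-neighbor expansion: the component value of each low-resolution pixel
    is reused for the whole scale x scale pixel group in the high-resolution component."""
    return np.repeat(np.repeat(component, scale, axis=0), scale, axis=1)

u_lr = np.random.rand(4, 4)        # e.g., a U component at the first resolution H x W
u_hr = spread_component(u_lr, 3)   # counterpart component at 3H x 3W
assert u_hr.shape == (12, 12)
```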
  • the image processing process 500 is implemented based on an image processing model 515 that includes a feature extraction model 516 and an output conversion module 518 in addition to the plurality of recursive blocks 508.
  • the feature extraction model 516 is configured to extract the image feature map 506 from the image component 502A. In an example, a 3x3 convolution layer is followed by one ReLU layer to extract shallow features represented in a 9-channel input feature map 520. Another 3x3 convolution layer followed by one ReLU layer is optionally applied to extract additional features represented in the image feature map 506, e.g., having 32 channels.
  • the output conversion module 518 is coupled to an output of the plurality of recursive blocks 508 and configured to convert the output feature map 510 to the output component 504A.
  • a 3x3 convolution layer is followed by one ReLU layer to convert the output feature map 510 having 32 channels to a 9-channel intermediate feature map 522.
  • the input feature map 520 and intermediate feature map 522 are combined on an element-by-element basis and processed by a depth-to-space module 524 (also called a pixel shuffle layer) to generate the output component 504A of the output image 504 (see the illustration below).
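  • For reference, a depth-to-space (pixel shuffle) layer rearranges an s²-channel feature map of size H x W into a single-channel map of size sH x sW. A minimal PyTorch illustration with s = 3 and a 9-channel combined feature map (the spatial size is an arbitrary example):

```python
import torch
import torch.nn as nn

depth_to_space = nn.PixelShuffle(upscale_factor=3)   # rearranges (N, 9, H, W) -> (N, 1, 3H, 3W)
combined = torch.randn(1, 9, 180, 320)               # e.g., input feature map + intermediate feature map
output_component = depth_to_space(combined)
print(output_component.shape)                        # torch.Size([1, 1, 540, 960])
```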
  • the plurality of recursive blocks 508 include a first number of successive recursive blocks 508A and 508B.
  • Each recursive block 508 includes a second number of residual units 512 that are successively coupled to each other and in series.
  • Each residual unit 512 optionally includes a CNN (e.g., having two 3x3 convolutional layers) and a rectified linear unit (ReLU) layer.
  • the first number is less than a recursive block threshold
  • the second number is less than a residual unit threshold.
  • the recursive block threshold and residual unit threshold are associated with a computational capability of the electronic device.
  • the image processing model 515 includes the first and second numbers.
  • the server 102 obtains information of the computational capability of the electronic device and determines the recursive block threshold and residual unit threshold based on the information of the computational capability of the electronic device. The server 102 further determines the first and second numbers for the image processing model 515 based on the recursive block threshold and residual unit threshold. The image processing model 515 is provided to the electronic device for ISR and VSR.
  • the plurality of recursive blocks 508 include two successive recursive blocks 508 A and 508B, and each recursive block 508 includes 2 residual units 512.
  • each residual unit 512 includes two 3x3 convolution layers (pad 1, stride 1, and 32 channels), and the first convolution layer is followed by a ReLU layer.
  • a first recursive block 508A receives the image feature map 506 and generates a first block output feature map 526.
  • a second recursive block 508B receives the first block output feature map 526 and generates the output feature map 510.
  • the first block output feature map 526 and the output feature map 510 correspond to mid-level and high-level features of the image component 502A of the input image 502.
  • a block input feature and a unit output feature of an output residual unit are combined to generate a block output feature at the output of the recursive block 508.
  • the image feature map 506 and a unit output feature of an output residual unit 512OA are combined to generate the first block output feature map 526.
  • the first block output feature map 526 and a unit output feature of an output residual unit 512OB are combined to generate the output feature map 510 (a combined architecture sketch is given below).
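  • The following PyTorch sketch, provided for illustration only, shows one plausible way to assemble the described architecture (two recursive blocks, each with two residual units of two 3x3, 32-channel convolutions, plus the feature extraction and depth-to-space stages); the class names, exact channel widths, and head/tail layer choices are assumptions rather than the claimed model.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    # Two 3x3 convolutions (pad 1, stride 1, 32 channels); ReLU after the first.
    def __init__(self, channels=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))    # local residual connection

class RecursiveBlock(nn.Module):
    # Residual units in series plus a skip connection from block input to block output.
    def __init__(self, num_units=2, channels=32):
        super().__init__()
        self.units = nn.Sequential(*[ResidualUnit(channels) for _ in range(num_units)])

    def forward(self, x):
        return x + self.units(x)                            # block input combined with unit output

class SuperResolutionModel(nn.Module):
    # Feature extraction -> recursive blocks -> output conversion -> depth-to-space (x3 upscaling).
    def __init__(self, scale=3, channels=32, num_blocks=2):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(1, scale * scale, 3, padding=1), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(nn.Conv2d(scale * scale, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[RecursiveBlock(2, channels) for _ in range(num_blocks)])
        self.tail = nn.Sequential(nn.Conv2d(channels, scale * scale, 3, padding=1), nn.ReLU(inplace=True))
        self.depth_to_space = nn.PixelShuffle(scale)

    def forward(self, y):                                   # y: luminance component, shape (N, 1, H, W)
        shallow = self.head(y)                              # 9-channel input feature map
        features = self.expand(shallow)                     # 32-channel image feature map
        out = self.tail(self.blocks(features))              # 9-channel intermediate feature map
        return self.depth_to_space(shallow + out)           # (N, 1, 3H, 3W) output component

model = SuperResolutionModel()
print(sum(p.numel() for p in model.parameters()))           # roughly 8e4 parameters with these settings
```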
  • the image processing model 515 applied in the image processing process 500 is trained using a predefined loss function L.
  • the predefined loss function L is a weighted combination of a pixel loss L_L1, a structural similarity loss L_SSIM, and a perceptual loss L_VGG as follows: L = λ_1 · L_L1 + λ_SSIM · L_SSIM + λ_VGG · L_VGG, where λ_1, λ_SSIM, and λ_VGG are weights for combining the losses.
  • the pixel loss indicates a pixel-wise difference between a test output image and a ground truth image.
  • the pixel-wise difference is optionally measured in an L1 loss (i.e., a mean absolute error) or an L2 loss (i.e., a mean square error).
  • the L1 loss shows improved performance and convergence over the L2 loss.
  • PSNR: peak signal-to-noise ratio.
  • the structural similarity loss L_SSIM indicates a structural similarity between the test output image and ground truth images based on comparisons of luminance, contrast, and structures. That is, the structural similarity loss evaluates a reconstruction quality from the perspective of a human visual system.
  • the perceptual loss L_VGG indicates a semantic difference between the test output image and ground truth images using a pre-trained Visual Geometry Group (VGG) image classification network, thereby reflecting how high frequency content is restored for perceptual satisfaction (a sketch of the combined loss is given below).
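  • A hedged PyTorch sketch of such a combined loss; the weight values, the simplified single-window SSIM term, and the choice of VGG-19 feature layer are illustrative assumptions, not values taken from the disclosure:

```python
import torch
import torch.nn.functional as F
from torchvision import models

class CombinedLoss(torch.nn.Module):
    """Weighted combination of pixel (L1), structural similarity, and perceptual (VGG) losses.
    Inputs are assumed to be single-channel images with values in [0, 1]."""
    def __init__(self, w_l1=1.0, w_ssim=0.1, w_vgg=0.01):
        super().__init__()
        self.w_l1, self.w_ssim, self.w_vgg = w_l1, w_ssim, w_vgg
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg

    def ssim(self, x, y, c1=0.01 ** 2, c2=0.03 ** 2):
        # Simplified single-scale SSIM over non-overlapping 8x8 windows.
        mu_x, mu_y = F.avg_pool2d(x, 8), F.avg_pool2d(y, 8)
        var_x = F.avg_pool2d(x * x, 8) - mu_x ** 2
        var_y = F.avg_pool2d(y * y, 8) - mu_y ** 2
        cov = F.avg_pool2d(x * y, 8) - mu_x * mu_y
        score = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
            (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
        return score.mean()

    def forward(self, output, target):
        l_pixel = F.l1_loss(output, target)                      # pixel loss (L1)
        l_ssim = 1.0 - self.ssim(output, target)                 # structural similarity loss
        out3, tgt3 = output.repeat(1, 3, 1, 1), target.repeat(1, 3, 1, 1)  # VGG expects 3 channels
        l_vgg = F.l1_loss(self.vgg(out3), self.vgg(tgt3))        # perceptual (VGG feature) loss
        return self.w_l1 * l_pixel + self.w_ssim * l_ssim + self.w_vgg * l_vgg
```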
  • Quantization is applied to perform computation and store weights and biases at lower bitwidths than a floating point precision.
  • a quantized model executes some or all of the operations on the weights and biases with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms.
  • the image processing model 515 is quantized according to a precision setting of the electronic device where the image processing model 515 will be loaded.
  • the electronic device is a mobile device having limited computational resources and has a lower precision than a floating point data format. Weights and biases of the image processing model 515 are quantized based on the lower precision.
  • the quantized image processing model 515 can result in a significant accuracy drop and make image processing a lossy process.
  • the image processing model 515 is re-trained with the quantized weights and biases to minimize the loss function L.
  • Such quantization-aware training simulates low precision behavior in a forward pass, while a backward pass remains the same. This induces a quantization error that is accumulated in the loss function L, and an optimizer module is applied to reduce the quantization error (see the sketch below).
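  • A minimal sketch of quantization-aware training using PyTorch's eager-mode quantization tooling, assuming the qnnpack backend for ARM mobile SoCs; the wrapper class, backend choice, and training hook are illustrative and not prescribed by the disclosure:

```python
import copy
import torch
import torch.ao.quantization as tq

class QuantWrapper(torch.nn.Module):
    """Wraps a float model with quant/dequant stubs so eager-mode QAT can insert fake-quant ops."""
    def __init__(self, float_model):
        super().__init__()
        self.quant = tq.QuantStub()
        self.model = float_model
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.model(self.quant(x)))

def quantization_aware_training(float_model, train_fn, backend="qnnpack"):
    # qnnpack targets ARM mobile SoCs; fbgemm targets x86 (the choice here is an assumption).
    model = QuantWrapper(copy.deepcopy(float_model)).train()
    model.qconfig = tq.get_default_qat_qconfig(backend)
    tq.prepare_qat(model, inplace=True)   # insert fake quantization: low-precision forward, float backward
    train_fn(model)                       # fine-tune so the optimizer reduces the quantization error
    model.eval()
    return tq.convert(model)              # fold observers into true int8 weights and activations
```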
  • weights and biases associated with filters of the image processing model 515 maintain a float32 format, and are quantized based on a precision setting of the electronic device.
  • the weights and biases are quantized from the float32 format to an int8, uint8, int16, or uint16 format based on the precision setting of the electronic device.
  • the electronic device uses a CPU to run the image processing model 515, and the CPU of the electronic device processes 32 bit data.
  • the weights and biases of the image processing model 515 are not quantized, and the image processing model 515 is provided to the electronic device directly.
  • the electronic device uses one or more GPUs to run the image processing model 515, and the GPU(s) process 16 bit data.
  • the weights and biases of the image processing model 515 are quantized to an int16 format.
  • the electronic device uses a digital signal processor (DSP) to run the image processing model 515, and the DSP processes 8 bit data.
  • the weights and biases of the image processing model 515 are quantized to an int8 format. After quantization of the weights and biases, e.g., to a fixed 8-bit format, the image processing model 515 has fewer MACs (multiply-accumulate operations) and a smaller size, and is hardware-friendly during deployment on the electronic device.
  • weights and biases of an image processing model 515 have a float32 format and are quantized to an uint8 format.
  • the quantized image processing model 515 only causes a loss of 0.2 dB on image information that is contained in the output image 504 created by super-resolution.
  • the quantized image processing model 515 is executed within a duration of 20 milliseconds by a neural processing unit (NPU), and can be applied to process image frames of a video stream at a frame rate of 50 frames per second (FPS).
  • the image processing model 515 applied in the image processing process 500 is limited by capabilities of the electronic device (e.g., a size of a random-access memory (RAM), computation resources, power consumption requirements, and FLOPS of a system on chip (SoC) of a mobile phone).
  • Architecture of the image processing model 515 is designed according to the capabilities of the electronic device.
  • the image processing (i.e., VSR) method 500 is designed based on hardware friendly operations, e.g., using 8-bit quantization aware training (QAT) in a YUV domain.
  • VSR is applied to one or more color components in an RGB domain.
  • the R, G, and B components correspond to red, green, and blue colors of a given pixel size.
  • VSR is applied to one or more color components in a YUV domain.
  • a YUV color model defines a color space in terms of one luma component (Y) and two chrominance components including U (blue projection) and V (red projection).
  • YUV encodes a color image or video taking human perception into account, allowing reduced bandwidth for chrominance components.
  • a plurality of video devices, therefore, render directly using YUV or luminance/chrominance images.
  • the most important component for YUV capture is the luminance or Y component.
  • the Y component has a sampling rate greater than a distinct sampling rate of the U or V component.
  • VSR is applied to only the Y channel in the image processing process 500.
  • the VSR process operates with 1/3 FLOPS, a third of the FLOPS applied to process the RGB color format.
  • VSR in the YUV domain achieves a greater super-resolution PSNR score, conserves mobile computing resources, and enhances a deployment efficiency of the image processing model 515.
  • VSR is enabled efficiently on the electronic device in terms of runtime, model parameters, FLOPs, and power consumption.
  • the image processing method 500 is executed on many mobile devices with high performance, e.g., at a rate of 30 FPS, and particularly, outperforms state-of-the-art methods in most of the public datasets in terms of signal quality (e.g., measured in PSNR).
  • the image processing model 515 applied in the image processing process 500 is robust to uint8 quantization and corresponds to only a 0.2 dB PSNR drop when compared with a float32 model on the DIV2K validation dataset.
  • VSR is implemented in the YUV domain, improving the signal quality, structural similarity, and visual perception of the input image 502 as well as the model inference abilities of the image processing model 515.
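As a rough illustration of the Y-channel-only processing described above, the sketch below super-resolves only the luminance plane and upscales the chrominance planes with plain bicubic interpolation before recombining them. The `super_resolve_yuv` helper, the placeholder `sr_model`, and the 3x scale factor are assumptions for illustration rather than the exact image processing model 515.

```python
import torch
import torch.nn.functional as F

def super_resolve_yuv(y, u, v, sr_model, scale: int = 3):
    """Super-resolve the Y plane with a learned model; upscale U and V bicubically.

    y, u, v: tensors of shape (N, 1, H, W) with values in [0, 1]. `sr_model` maps a
    (N, 1, H, W) luminance plane to (N, 1, scale*H, scale*W).
    """
    y_hr = sr_model(y)                                      # learned SR on luminance only
    u_hr = F.interpolate(u, scale_factor=scale, mode="bicubic", align_corners=False)
    v_hr = F.interpolate(v, scale_factor=scale, mode="bicubic", align_corners=False)
    return torch.cat([y_hr, u_hr, v_hr], dim=1)             # (N, 3, scale*H, scale*W) YUV image

# Usage with a trivial stand-in model (bicubic on Y) just to show the data flow.
dummy_sr = lambda t: F.interpolate(t, scale_factor=3, mode="bicubic", align_corners=False)
y, u, v = (torch.rand(1, 1, 180, 320) for _ in range(3))
print(super_resolve_yuv(y, u, v, dummy_sr).shape)           # torch.Size([1, 3, 540, 960])
```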
  • Figures 6A and 6B are two flow diagrams of example image processing methods 600 and 650 for increasing an image resolution (i.e., image or video super-resolution), in accordance with some embodiments.
  • each pixel is a meta sample of an original image, and more samples provide a more detailed representation.
  • the number of pixels in an input image is sometimes called a resolution.
  • while a long-focus lens can be applied to provide a high-resolution input image, a range of a scene captured by the lens is usually limited by a size of a sensor array at an image plane.
  • a wide-range scene is captured at a lower resolution with a short-focus camera (e.g., a wide-angle lens), and the image processing method 600 or 650 is applied to recover high-resolution raw data from the low-resolution version.
  • the image processing methods 600 and 650 are implemented on mobile devices based on real-time raw super-resolution models (e.g., models 615 and 615’), and such raw super-resolution models are established based on raw data degradation pipelines to recover high-resolution raw data with a high image fidelity.
  • Each of the image processing methods 600 and 650 is implemented by an electronic device (e.g., a mobile phone 104D).
  • the electronic device obtains raw image data captured by image sensors of a camera.
  • the raw image data includes an input image 602 having a first resolution
  • the electronic device generates an output image 604 having a third resolution (Figures 6A and 6B) that is greater than the first resolution.
  • the input image 602 is optionally a static image or an image frame of a video clip. In some situations, the input image 602 is captured by a camera of the electronic device.
  • the input image 602 is received via one or more communication networks 108 and in a user application, e.g., a social networking application, a social media application, a short video application, and a media play application.
  • examples of the user application include, but are not limited to, Tiktok, Kuaishou, WeChat, Tencent video, iQiyi, and Youku.
  • a server 102 associated with the user application streams low-resolution visual data including the input image 602 to electronic devices distributed at different client nodes. If displayed without ISR or VSR, the input image 602 would result in poor user experience for users of the user application.
  • the input image 602 is part of a low-resolution video stream provided to unpaid users in a media play application.
  • VSR aims to improve video quality and the users’ watching experience by utilizing artificial intelligence.
  • Each of the image processing methods 600 and 650 uses low-resolution information of the input image 602 and associated adjacent temporal information to predict missing information of the input image 602, which leads to a high-resolution video sequence including the output image 604.
  • each of the image processing methods 600 and 650 enhances a quality of the input image 602, e.g., by reducing noise, blurriness, and artifacts therein.
  • the electronic device extracts an image feature map 606 from the input image 602.
  • the image feature map 606 has a second resolution (e.g., H/m*W/m, where m is a positive integer) that is equal to or less than the first resolution.
  • the image feature map 606 is expanded to a plurality of channels, e.g., 9 or 32 channels.
  • the image feature map 606 is further processed by a sequence of successive recursive blocks 608 to generate an output feature map 610.
  • Each recursive block 608 includes a plurality of residual units 612 and a skip connection 614 coupling an input of the recursive block 608 to an output of the recursive block 608.
  • the electronic device converts the output feature map 610 to the output image 604 having the third resolution (e.g., 3H*3W) that is greater than the first resolution.
  • a color image 640 is further generated from the output image 604.
  • the color image 640 has a color mode that is one of: PMS, RGB, CMYK, HEX, YUV, YCbCr, LAB, Index, Greyscale, and Bitmap.
  • the sequence of successive recursive blocks 608 include one or more recursive blocks 608.
  • the sequence of recursive blocks 608 includes two recursive blocks 608A and 608B coupled to each other and in series. Each recursive block 608 further includes two residual units 612 coupled to each other and in series. Feature maps processed in the successive recursive blocks 608A and 608B have the second resolution and 32 channels.
  • the sequence of recursive blocks 608 includes a single recursive block 608C, and the recursive block has four or more residual units 612 that are coupled to each other and in series. Feature maps processed in the successive recursive block 608C have the second resolution and 32 channels.
  • Each of the image processing methods 600 and 650 is implemented based on a respective image processing model 615 or 615’ that includes a feature extraction model 616 and an output conversion module 618 in addition to the sequence of recursive blocks 608.
  • the feature extraction model 616 is configured to extract the image feature map 606 from the input image 602.
  • a 3x3 convolution layer is followed by one ReLU layer to extract shallow features represented in a 9-channel input feature map 620.
  • Another 3x3 convolution layer followed by one ReLU layer is optionally applied to extract additional features represented in the image feature map 606, e.g., having 32 channels.
  • a 3x3 convolution layer is followed by one ReLU layer to extract the image feature map 606 having the second resolution and 32 channels.
  • the output conversion module 618 is coupled to an output of the sequence of recursive blocks 608 and configured to convert the output feature map 610 to the output image 604.
  • a 3x3 convolution layer is followed by one ReLU layer to convert the output feature map 610 having 32 channels to a 9-channel intermediate feature map 622A in Figure 6A or a 16-channel intermediate feature map 622B in Figure 6B.
  • the input feature map 620 and intermediate feature map 622A are combined on an element-by-element basis and processed by a depth space model 624 (also called a pixel shuffle layer) to generate the output image 604 having the third resolution (e.g., 3H*3W).
  • the intermediate feature map 622B is processed by a depth space model 624 (also called a pixel shuffle layer) to generate the output image 604 having the third resolution (e.g., 3H*3W).
  • the sequence of recursive blocks 608 includes a first number of successive recursive blocks 608A and 608B.
  • Each recursive block 608 includes a second number of residual units 612 that are successively coupled to each other and in series.
  • Each residual unit 612 optionally includes a CNN (e.g., having two 3x3 convolutional layers) and a rectified linear unit (ReLU) layer.
  • the first number is less than a recursive block threshold (e.g., 3), and the second number is less than a residual unit threshold (e.g., 6).
  • the recursive block threshold and residual unit threshold are associated with a computational capability of the electronic device.
  • the image processing model 615 includes the first and second numbers.
  • the server 102 obtains information of the computational capability of the electronic device and determines the recursive block threshold and residual unit threshold based on the information of the computational capability of the electronic device.
  • the server 102 further determines the first and second numbers for the image processing model 615 based on the recursive block threshold and residual unit threshold.
  • the image processing model 615 is provided to the electronic device.
  • the sequence of recursive blocks 608 include two successive recursive blocks 608A and 608B, and each recursive block 608 includes 2 residual units 612.
  • each residual block 608 includes two 3x3 convolution layers (pad 1, stride 1, and channel 32), and the first convolution layer is followed by a ReLU layer.
  • a first recursive block 608A receives the image feature map 606 and generates a first block output feature map 626.
  • a second recursive block 608B receives the first block output feature map 626 and generates the output feature map 610.
  • the first block output feature map 626 and the output feature map 610 correspond to mid-level and high-level features of the input image 602.
  • a block input feature and a unit output feature of an output residual unit are combined to generate a block output feature at the output of the recursive block 608.
  • the image feature map 606 and a unit output feature of an output residual unit 612OA are combined to generate the first block output feature map 626.
  • the first block output feature map 626 and a unit output feature of an output residual unit 612OB are combined to generate the output feature map 610.
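A minimal PyTorch sketch of this recursive-block topology is shown below. The two-blocks-by-two-units configuration, 32 channels, 3x3 convolutions, and the 3x pixel-shuffle output follow the description above, but the module names and the simplified output stage are assumptions; the sketch approximates, rather than reproduces, the image processing model 615.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two 3x3 convolutions (pad 1, stride 1), ReLU after the first, plus a local skip."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class RecursiveBlock(nn.Module):
    """Residual units in series, with a skip connection from block input to block output."""
    def __init__(self, channels: int = 32, num_units: int = 2):
        super().__init__()
        self.units = nn.Sequential(*[ResidualUnit(channels) for _ in range(num_units)])

    def forward(self, x):
        # block input combined with the output residual unit's feature
        return x + self.units(x)

class TinySRNet(nn.Module):
    """Feature extraction -> recursive blocks -> output conversion with 3x pixel shuffle."""
    def __init__(self, in_ch: int = 1, channels: int = 32, blocks: int = 2, units: int = 2, scale: int = 3):
        super().__init__()
        self.extract = nn.Sequential(nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[RecursiveBlock(channels, units) for _ in range(blocks)])
        self.convert = nn.Conv2d(channels, in_ch * scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)   # depth-to-space layer

    def forward(self, x):
        feat = self.extract(x)
        out = self.blocks(feat)
        return self.shuffle(self.convert(out))

print(TinySRNet()(torch.rand(1, 1, 64, 64)).shape)   # torch.Size([1, 1, 192, 192])
```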
  • the image processing models 615 and 615’ are trained using a predefined loss function L.
  • the predefined loss function L is a weighted combination of a pixel loss L1, a structural similarity loss LSSIM, and a perceptual loss LVGG based on equation (1) as follows: L = λ1·L1 + λ2·LSSIM + λ3·LVGG (1), where λ1, λ2, and λ3 are weights for combining the losses.
  • the pixel loss indicates a pixel-wise difference between a test output image and a ground truth image.
  • the pixel-wise difference is optionally measured as an L1 loss (i.e., a mean absolute error) or an L2 loss (i.e., a mean square error).
  • the L1 loss shows improved performance and convergence over the L2 loss.
  • the pixel loss is highly correlated with pixel-wise difference, and minimizing pixel loss directly maximizes a PSNR.
  • the structural similarity loss indicates a structural similarity between the test output image and ground truth images based on comparisons of luminance, contrast, and structures. That is, the structural similarity loss evaluates a reconstruction quality from the perspective of a human visual system.
  • the perceptual loss LVGG indicates a semantic difference between the test output image and ground truth images using a pre-trained VGG image classification network, thereby reflecting how a high frequency content is restored for perceptual satisfaction.
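A sketch of such a weighted loss is given below. The weight values, the simplified global SSIM term, and the choice of VGG-19 feature layer are assumptions made for illustration; the description above only specifies that pixel, structural-similarity, and perceptual terms are combined.

```python
import torch
import torch.nn.functional as F
import torchvision

class CombinedLoss(torch.nn.Module):
    """L = w1 * L1(pixel) + w2 * (1 - SSIM) + w3 * LVGG; the weights are assumptions."""
    def __init__(self, w_pixel=1.0, w_ssim=0.1, w_vgg=0.01):
        super().__init__()
        self.w_pixel, self.w_ssim, self.w_vgg = w_pixel, w_ssim, w_vgg
        # Pre-trained VGG-19 features (torchvision >= 0.13); ImageNet normalization omitted for brevity.
        vgg = torchvision.models.vgg19(weights="DEFAULT").features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg

    @staticmethod
    def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
        # Simplified global SSIM (no sliding window), enough to illustrate the structural term.
        mx, my = x.mean(), y.mean()
        vx, vy = x.var(), y.var()
        cov = ((x - mx) * (y - my)).mean()
        return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

    def forward(self, sr, hr):
        pixel = F.l1_loss(sr, hr)
        structural = 1.0 - self.ssim(sr, hr)
        perceptual = F.l1_loss(self.vgg(sr), self.vgg(hr))   # assumes 3-channel inputs in [0, 1]
        return self.w_pixel * pixel + self.w_ssim * structural + self.w_vgg * perceptual

loss_fn = CombinedLoss()
sr, hr = torch.rand(1, 3, 96, 96), torch.rand(1, 3, 96, 96)
print(loss_fn(sr, hr).item())
```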
  • Quantization is applied to perform computations and store weights and biases at lower bitwidths than a floating point precision.
  • a quantized model applied in the method 600 or 650 executes some or all of the operations on the weights and biases with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms.
  • the image processing model 615 or 615’ is quantized according to a precision setting of the electronic device where the image processing model 615 or 615’ will be loaded.
  • the electronic device is a mobile device having limited computational resources and has a lower precision than a floating point data format. Weights and biases of the image processing model 615 or 615’ are quantized based on the lower precision.
  • the quantized image processing model 615 or 615’ results in a significant accuracy drop, making image processing a lossy process.
  • the image processing model 615 or 615’ is re-trained with the quantized weights and biases to minimize the loss function L.
  • Such quantization-aware training simulates low-precision behavior in the forward pass while the backward pass remains unchanged; the resulting quantization error accumulates in the loss function L, and an optimizer module is applied to reduce it.
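The forward-pass simulation of low precision can be sketched with a straight-through fake-quantization function: values are rounded to an integer grid in the forward pass while gradients pass through unchanged in the backward pass, so the quantization error surfaces in the loss and the optimizer can compensate for it. The per-tensor range and the 8-bit setting below are illustrative assumptions.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Simulate uint8 quantization in the forward pass; identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, num_bits: int = 8):
        qmax = 2 ** num_bits - 1
        lo, hi = x.min(), x.max()
        scale = (hi - lo).clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round((x - lo) / scale), 0, qmax)
        return q * scale + lo          # dequantized value used by the rest of the network

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None       # straight-through estimator: pass gradients unchanged

def fake_quantize(x, num_bits: int = 8):
    return FakeQuant.apply(x, num_bits)

# The quantization error shows up in the loss, so the optimizer learns to compensate for it.
w = torch.randn(32, 32, 3, 3, requires_grad=True)
loss = (fake_quantize(w) ** 2).mean()
loss.backward()
print(w.grad.shape)                    # gradients exist despite the rounding in the forward pass
```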
  • weights and biases associated with filters of the image processing model 615 or 615’ maintain a float32 format, and are quantized based on a precision setting of the electronic device.
  • the weights and biases are quantized from the float32 format to an int8, uint8, int16, or uint16 format based on the precision setting of the electronic device.
  • the electronic device uses a CPU to run the image processing model 615 or 615’, and the CPU of the electronic device processes 32 bit data.
  • the weights and biases of the image processing model 615 or 615’ are not quantized, and the image processing model 615 or 615’ is provided to the electronic device directly.
  • the electronic device uses one or more GPUs to run the image processing model 615 or 615’, and the GPU(s) process 16 bit data.
  • the weights and biases of the image processing model 615 or 615’ are quantized to an int16 format.
  • the electronic device uses a digital signal processor (DSP) to run the image processing model 615 or 615’, and the DSP processes 8 bit data.
  • the weights and biases of the image processing model 615 or 615’ are quantized to an int8 format. After quantization of the weights and biases, e.g., to a fixed 8-bit format, the image processing model 615 or 615’ has fewer MACs and a smaller size, and is hardware-friendly during deployment on the electronic device.
  • weights and biases of an image processing model 615 or 615’ applied in the method 600 or 650 have a float32 format and are quantized to a uint8 format.
  • the quantized image processing model 615 or 615’ causes a loss of only 0.2 dB in the image information that is contained in the output image 604 created by super-resolution.
  • the quantized image processing model 615 or 615’ is executed within a duration of 20 milliseconds by a neural processing unit (NPU), and can be applied to process image frames of a video stream at a frame rate of 50 frames per second (FPS).
  • the image processing model 615 or 615’ is limited by capabilities of the electronic device (e.g., a size of a random-access memory (RAM), computation resources, power consumption requirements, FLOPS of a system on chip (SoC) of a mobile phone).
  • Architecture of the image processing model 615 or 615’ is designed according to the capabilities of the electronic device.
  • the image processing (i.e., VSR) method 600 or 650 is designed based on hardware friendly operations, 8-bit quantization aware training (QAT), and raw image data.
  • VSR achieves a greater super-resolution PSNR score, conserves mobile computing resources, and enhances a deployment efficiency of the image processing model 615 or 615’.
  • real-time VSR is enabled efficiently on the electronic device in terms of runtime, model parameters, FLOPs, and power consumption.
  • the image processing method 600 or 650 is executed on many mobile devices with high performance, e.g., at a rate of 30 FPS, and particularly, outperforms state-of-the-art methods in most of the public datasets in terms of signal quality (e.g., measured in PSNR).
  • the image processing model 615 or 615’ is robust to uint8 quantization.
  • raw image data are directly applied to restore high-resolution clear images 604. More information could be exploited in a raw image domain because the raw image data are arranged in 10 or 12 bits.
  • RGB or YUV images produced by an ISP are represented in 8 bits.
  • the ISP contains nonlinear degradations, such as tone mapping and Gamma correction. Degradations that are linear in the raw domain (e.g., blurriness and noise) become nonlinear in the RGB or YUV domain, making image restoration difficult.
  • VSR in the raw image domain effectively avoids image restoration based on nonlinear degradations and generates the output image 604 with better image qualities compared with those restored in the RGB or YUV domain.
  • Figure 7A is a block diagram of an example image degradation scheme 700 for generating test images, in accordance with some embodiments, and Figures 7B and 7C are two additional example image degradation schemes 740 and 760, in accordance with some embodiments.
  • the example image degradation scheme 700 is applied to generate a plurality of distinct test images 704 incorporating different noises in a first image 702.
  • the first image 702 is provided with the plurality of distinct test images 704 in a training data set for training an image processing model.
  • the first image 702 is a ground truth image of the plurality of distinct test images 704, and during a corresponding training process, the image processing model (which is configured to reduce a noise level of an input image) receives the distinct test images 704 as input images and reduces a noise level of each of the plurality of distinct test images 704 with reference to the first image 702.
  • the first image 702 has a first resolution and is generated from a second image 706 having a second resolution. The second resolution is greater than the first resolution.
  • the second image 706 is provided with the plurality of distinct test images 704 in the training data set for training the image processing model 515, 615, or 615’ associated with an ISR or VSR operation.
  • the second image 706 is a ground truth image of the plurality of distinct test images 704, and the image processing model 515, 615, or 615’ is configured to increase a resolution of each input image.
  • the image processing model 515, 615, or 615’ receives the distinct test images 704 as input images and increases corresponding resolutions with reference to the second image 706.
  • this image processing model 515, 615, or 615’ includes one or more recursive blocks, and each recursive block includes a plurality of residual units and a skip connection coupling an input of the residual block to an output of the residual block.
  • the image degradation scheme 700 includes bicubic interpolation, Gaussian blur, and Gaussian noise and is represented as follows: y = (x ⊗ k)↓s + n, where x represents the first image 702 having a first resolution, and y represents a test image 704 having a second resolution.
  • the test image 704 is generated by convolving the first image 702 with a Gaussian kernel k to generate a blurry image that is processed successively by a downsampling operation associated with a down scale factor s and an addition of Gaussian noise n.
  • the image degradation scheme 700 is implemented at an electronic device (e.g., a server 102) having one or more processors and memory storing one or more programs to be executed by the one or more processors.
  • the electronic device obtains the first image 702, and the first image 702 has a first resolution.
  • the electronic device converts the first image 702 to a plurality of distinct test images 704. Each test image 704 has the first resolution.
  • the first image 702 corresponds to a plurality of predefined noise types.
  • Each distinct test image 704 is converted from the first image 702 based on the plurality of predefined noise types and in accordance with a noise shuffling scheme.
  • the plurality of predefined noise types are shuffled to an ordered sequence of noise types 712 (e.g., 712A).
  • a respective noise creation operation is successively selected from one or more predefined noise creation operations 710 (e.g., 710A) of the respective noise type.
  • a plurality of noise creation operations 710 corresponding to the ordered sequence of noise types 712 of the first image 702 are applied to generate the respective distinct test image 704.
  • the plurality of distinct test images 704 are provided in a training data set for training an image processing model 515, 615, or 615’ applied in an image processing method 500, 600, or 650 for ISR or VSR.
  • each distinct test image 704 corresponds to a distinct order of the plurality of distinct noise types 708, a distinct combination of noise creation operations 710, or both of the distinct order and combination.
  • four different types of noise are involved in the image degradation scheme 700, and include blurring noise, statistical noise, downsampling, and compression noise.
  • a first test image 704A corresponds to a first ordered sequence of noise types 712A, i.e., blurring, statistical, downsampling, and compression noises.
  • a second test image 704B corresponds to a second ordered sequence of noise types 712B, i.e., statistical, blurring, compression and downsampling noises.
  • a third test image 704C corresponds to a third ordered sequence of noise types 712C, i.e., downsampling, blurring, compression, and statistical noises.
  • a fourth test image 704D corresponds to a fourth ordered sequence of noise types 712D, i.e., statistical, compression, downsampling, and blurring noises. As such, the plurality of test images 704 have distinct orders of these four different types of noise.
  • the combination of noise creation operations 710 is the same for the test images 704; however, the same operations 710 are organized according to different orders of the noise types 708 in the test images 704A-704D.
  • the noise creation operations 710 include a 2D sinc blurring noise, a bilinear downsampling noise, a statistical Gaussian noise, and a JPEG compression noise. That is, the same noise creation operations 710 are randomly shuffled and applied to generate the four test images 704A-704D.
  • the combinations of noise creation operations 710 are the same or distinct for every two of the test images 704.
  • the noise creation operations 710 are the same for the test images 704A and 704B, and include a 2D sinc blurring noise, a bilinear downsampling noise, a statistical Gaussian noise, and a JPEG compression noise. However, the noise creation operations 710 are organized in two different orders for the test images 704A and 704B. Conversely, the noise creation operations 710 are different for the test images 704C and 704D, and yet are organized in the same order of noise types 708.
  • the noise creation operations 710C of the test image 704C include a 2D sinc blurring noise, a bilinear downsampling noise, a statistical Poisson noise, and a JPEG compression noise.
  • the plurality of noise types 708 include a blurring noise that is optionally approximated by one or more convolutions with isotropic and anisotropic Gaussian kernels, a linear blurring filter kernel, a 2D sinc filter blur, or blind kernel estimation.
  • the 2D sinc filter is applied to synthesize ringing and overshoot artifacts such that the image processing model can be trained to suppress ringing and overshoot artifacts.
  • a kernel estimation algorithm is optionally applied to explicitly estimate SR kernels from real images.
  • kernel-based blurring includes a convolution operation with a linear blur filter kernel, and the convolution operation includes estimating one or more SR kernels from the first image and degrading the first image 702 using the SR kernels.
  • the plurality of noise types 708 include a statistical noise that is optionally synthesized by adding Gaussian noise, Poisson noise, or user-specific noise with different noise levels. Under some circumstances, target noise is predicted and used to degrade the first image 702 into a noisy test image 704. In some embodiments, the plurality of noise types 708 include a compression noise that is optionally achieved by adjusting JPEG compression with different quality factors. In an example, a quality factor is an integer in a range of 1-100. A quality factor equal to 100 means lower compression and higher image quality. In some embodiments, the plurality of noise types 708 include a downsampling noise that is randomly chosen from nearest, bilinear, and bicubic interpolations, and real-world pairs of high-quality and low-quality images.
  • a degradation shuffle strategy is applied to expand degradation space for the test images 704.
  • a training technique is applied to visually improve image sharpness, while not introducing visible artifacts.
  • the training technique includes a post-process sharpening algorithm, such as Gaussian sharpening or bilateral sharpening, which is applied during training to maintain a balance of image sharpness and overshoot artifact suppression.
  • a first ordered sequence of noise types 712A include blurring noise 708AA, statistical noise 708AB, compression noise 708AC, and downsampling noise 708AD.
  • a second ordered sequence of noise types 708B include statistical noise 708BA, compression noise 708BB, downsampling noise 708BC, and blurring noise 708BD.
  • four noise creation operations 710A are applied on the first image 702 for adding blurring noise 708AA, statistical noise 708AB, compression noise 708AC, and downsampling noise 708AD successively to generate a first test image 704A.
  • the same four noise creation operations 710A are applied on the first image 702 for adding statistical noise 708BA, compression noise 708BB, downsampling noise 708BC, and blurring noise 708BD successively to generate a second test image 704B.
  • a third ordered sequence of noise types 712C include statistical noise 708CA, blurring noise 708CB, compression noise 708CC, and downsampling noise 708CD.
  • a third set of four noise creation operations 710C are applied on the first image 702 to generate a third test image 704C.
  • a fourth set of four noise creation operations 710D are applied on the first image 702 to generate a fourth test image 704D.
  • At least one of the third set of four noise creation operations (e.g., adding Gaussian noise) 710C is distinct from a corresponding one of the fourth set of four noise creation operations 710D (e.g., adding Poisson noise).
  • the image degradation scheme 700 offers a complex but practical solution for generating test images applied to train image processing models for image/video super-resolution. These test images are synthetic data, rather than being directly captured by cameras. Specifically, a plurality of degradation factors (e.g., blurring, statistical noises, downsampling, JPEG compression) are considered and applied with a random shuffle strategy. The image degradation scheme 700 further takes into account ringing and overshoot artifacts and image sharpness.
  • the plurality of test images 704 are applied to train the image processing model of the image processing method 500 that processes an image component 502A of an input image 502.
  • the plurality of test images 704 are applied to train the image processing model of the image processing method 600 or 650 that processes raw image data 602.
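The shuffled degradation pipeline of scheme 700 can be approximated with the short script below. The kernel size, noise levels, downscale factor, and JPEG quality range are illustrative assumptions; only the overall structure (blurring, statistical noise, downsampling, and compression applied in a random order to a clean first image) follows the description above.

```python
import random
import numpy as np
import cv2

def blur(img):         # blurring noise: isotropic Gaussian kernel (sinc blur omitted for brevity)
    return cv2.GaussianBlur(img, (7, 7), 1.5)

def statistical(img):  # statistical noise: Gaussian or Poisson, chosen at random
    if random.random() < 0.5:
        noisy = img + np.random.normal(0, 5, img.shape)
    else:
        noisy = np.random.poisson(np.clip(img, 0, 255)).astype(np.float64)
    return np.clip(noisy, 0, 255)

def downsample(img):   # downsampling noise: random interpolation, fixed 2x scale here
    interp = random.choice([cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC])
    h, w = img.shape[:2]
    return cv2.resize(img, (w // 2, h // 2), interpolation=interp).astype(np.float64)

def compress(img):     # compression noise: JPEG with a random quality factor
    quality = random.randint(30, 95)
    ok, buf = cv2.imencode(".jpg", np.clip(img, 0, 255).astype(np.uint8),
                           [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR).astype(np.float64)

def degrade(first_image: np.ndarray) -> np.ndarray:
    """Apply the four noise types in a random order to generate one distinct test image."""
    ops = [blur, statistical, downsample, compress]
    random.shuffle(ops)                  # degradation shuffle strategy
    img = first_image.astype(np.float64)
    for op in ops:
        img = op(img)
    return np.clip(img, 0, 255).astype(np.uint8)

clean = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)  # stand-in for the first image 702
test_images = [degrade(clean) for _ in range(4)]               # four distinct test images
```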
  • Figures 8-10 are flow diagrams of example image processing methods 800, 900, and 1000 for improving image quality (e.g., enhancing an image resolution), in accordance with some embodiments.
  • each of the image processing methods 800, 900, and 1000 is described as being implemented by an electronic system 200 (e.g., a mobile phone 104C for Figures 8 and 10, a server 102 for Figure 9).
  • Each of the image processing methods 800, 900, and 1000 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figures 8-10 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 800 may be combined and/or the order of some operations may be changed.
  • the electronic system 200 includes an electronic device (e.g., a mobile device) having one or more processors and memory.
  • the electronic device obtains (802) an input image 502 including a plurality of components (e.g., 502A-502C) and separates (804) an image component 502A from one or more remaining components 502B and 502C of the input image 502.
  • the image component 502A has a first resolution.
  • An image feature map 506 is extracted (806) from the image component 502A, and has a second resolution that is equal to or less than the first resolution.
  • the electronic device processes (808) the image feature map 506 using a plurality of successive recursive blocks 508 to generate an output feature map 510.
  • Each recursive block 508 includes (810) a plurality of residual units 512 and a skip connection 514 coupling an input of the recursive block 508 to an output of the recursive block 508.
  • the output feature map 510 is converted (812) to an output component 504A having a third resolution that is greater than the first resolution.
  • the output component 504A is combined (814) with the one or more remaining components 502B and 502C of the input image 502 to generate an output image 504 having the third resolution.
  • each recursive block 508 includes (816) an output residual unit 512 coupled to the output of the recursive block 508.
  • the image feature map 506 is processed by, for each recursive block 508: combining (818) a block input feature and a unit output feature of the output residual unit 512 to generate a block output feature at the output of the recursive block 508.
  • the first resolution is H*W and the third resolution is nH*nW, where n is an integer equal to or greater than 1.
  • the second resolution is H/m*W/m, where m is an integer equal to or greater than 1.
  • in each recursive block 508, the plurality of residual units 512 are successively coupled (820) to each other and in series, and each residual unit 512 includes a convolutional neural network (CNN) and a rectified linear unit (ReLU) coupled to the CNN.
  • each CNN includes two 3x3 convolution layers.
  • the image component includes (822) a luminance component (Y) of the input image 502, and the one or more remaining components include two chrominance components (U and V) of the input image 502.
  • the plurality of successive recursive blocks 508 includes at least a first recursive block 508A and a second recursive block 508B coupled to the first recursive block 508A.
  • the electronic device receives the image feature map 506 by the first recursive block 508A and generates a first block output feature map 526 from the image feature map 506 by the first recursive block 508A.
  • the electronic device receives the first block output feature by the second recursive block 508B and generates the output feature map 510 from the first block output feature by the second recursive block 508B.
  • the plurality of successive recursive blocks 508 includes a first number of recursive blocks 508, and each recursive block 508 includes a second number of residual units 512.
  • the first number is less than a recursive block threshold.
  • the second number is less than a residual unit threshold.
  • the recursive block threshold and residual unit threshold are associated with a computational capability of the electronic device.
  • an image processing model 515 includes a feature extraction model 516, the sequence of recursive blocks 508, and an output conversion model 518.
  • the feature extraction model 516 is configured to extract the image feature map 506 from the image component 502A.
  • the output conversion model 518 is configured to convert the output feature map 510 to the output component 504A.
  • a server 102 trains the image processing model 515 using a predefined loss function L.
  • the predefined loss function is a weighted combination of a pixel loss L1, a structural similarity loss LSSIM, and a perceptual loss LVGG.
  • the pixel loss L1 indicates a pixel-wise difference between a test output image and a ground truth image.
  • the structural similarity loss LSSIM indicates a structural similarity between the test output image and ground truth images based on comparisons of luminance, contrast, and structures.
  • the perceptual loss LVGG indicates a semantic difference between the test output image and ground truth images using a pre-trained VGG image classification network.
  • the plurality of successive recursive blocks 508 includes a first number of recursive blocks 508, and each recursive block 508 includes a second number of residual units 512
  • the server 102 obtains information of a computational capability of the electronic device and determines a recursive block threshold and a residual unit threshold based on the information of the computational capability of the electronic device.
  • the server 102 further determines the first and second numbers for the image processing model 515 based on the recursive block threshold and residual unit threshold and provides the image processing model 515 to the electronic device.
  • the image processing model 515 includes a plurality of layers having a plurality of filters defined by a plurality of weights.
  • a server 102 selects a data format of the plurality of weights based on a precision setting of the electronic device and quantizes the plurality of weights based on the data format. Specifically, the server 102 maintains the data format for the plurality of weights while re-training the image processing model 515.
  • the data format is selected from float32, int8, uint8, int16, and uint16.
  • the electronic system includes an electronic device, e.g., a server.
  • the electronic device obtains (902) a first image 702 having a first resolution and converts (904) the first image 702 to a plurality of distinct test images 704, each having the first resolution.
  • the first image 702 corresponds to a plurality of predefined noise types 708.
  • the electronic device shuffles (908) the plurality of predefined noise types 708 to an ordered sequence of noise types 712; for each noise type in the ordered sequence of noise types 712, successively selects (910) a respective noise creation operation 710 from one or more predefined noise creation operations of the respective noise type 708; and applies (912) a plurality of noise creation operations 710 corresponding to the ordered sequence of noise types 712 on the first image 702 to generate the respective distinct test image 704.
  • each distinct test image 704 corresponds (916) to a distinct order of the plurality of distinct noise types, a distinct combination of noise creation operations, or both of the distinct order and combination.
  • the electronic device provides (918) the first image 702 with the plurality of distinct test images 704 in the training data set for training the image processing model 515, 615, or 615’.
  • the first image 702 is a ground truth image of the plurality of distinct test images 704, and the image processing model 515, 615, or 615’ is configured to reduce a noise level of input images.
  • the electronic device generates the first image 702 having the first resolution from a second image 706 having a second resolution, and the second resolution is greater than the first resolution.
  • the electronic device provides the second image 706 with the plurality of distinct test images 704 in the training data set for training the image processing model 515, 615, or 615’.
  • the second image 706 is a ground truth image of the plurality of distinct test images 704, and the image processing model 515, 615, or 615’ is configured to increase a resolution of each input image.
  • the image processing model 515, 615, or 615’ includes one or more recursive blocks, each recursive block including a plurality of residual units and a skip connection coupling an input of the residual block to an output of the residual block.
  • the plurality of noise creation operations 710 includes (920) four noise creation operations configured to generate four different types of noise.
  • the four different types of noise include (922) blurring noise, statistical noise, compression noise, and downsampling noise.
  • the four different types of noise are (924) organized to the ordered sequence of noise types.
  • the one or more predefined noise creation operations configured to generate the blurring noise include at least isotropic Gaussian blurring, anisotropic Gaussian blurring, 2D sinc blurring, or kernel-based blurring.
  • kernel-based blurring includes a convolution operation with a linear blur filter kernel, and the convolution operation includes estimating one or more SR kernels from the first image 702 and degrading the first image 702 using the SR kernels.
  • the statistical noise includes at least Gaussian noise, Poisson noise, and user-specific noise.
  • the one or more predefined noise creation operations configured to generate the statistical noise includes at least: adding noise having a Gaussian distribution to the first image 702, adding noise having a Poisson distribution to the first image 702, and predicting the user-specific noise in the first image 702 after the first image 702 is received by a user and adding the user-specific noise to the first image 702.
  • the one or more predefined noise creation operations configured to generate the downsampling noise include at least area-based downsampling, bilinear interpolation, and bicubic interpolation.
  • the one or more predefined noise creation operations configured to generate the compression noise include at least Joint Photographic Experts Group (JPEG) compression having a quality factor, and the quality factor is an integer in a range of [1, 100].
  • a first ordered sequence of noise types 708A include blurring noise, statistical noise, compression noise, and downsampling noise.
  • a second ordered sequence of noise types 708B include statistical noise, compression noise, downsampling noise, and blurring noise.
  • four noise creation operations 710A are applied on the first image 702 for adding blurring noise, statistical noise, compression noise, and downsampling noise successively to generate a first test image 704A.
  • the same four noise creation operations 710A are applied on the first image 702 for adding statistical noise, compression noise, downsampling noise, and blurring noise successively to generate a second test image 704B.
  • the electronic device applies a third set of four noise creation operations 710C on the first image 702 to generate a third test image 704C.
  • the electronic device applies a fourth set of four noise creation operations 710D on the first image 702 to generate a fourth test image 704D. At least one of the third set of four noise creation operations 710C is distinct from a corresponding one of the fourth set of four noise creation operations 710D.
  • the electronic system includes an electronic device (e.g., a mobile device).
  • the electronic device obtains (1002) raw image data captured by image sensors of a camera, and the raw image data includes an input image 602 having a first resolution.
  • the electronic device extracts (1004) an image feature map 606 from the raw image data.
  • the image feature map 606 has a second resolution that is equal to or less than the first resolution.
  • the electronic device processes (1006) the image feature map 606 using a sequence of successive recursive blocks 608 to generate an output feature map 610.
  • the sequence includes (1008) one or more recursive blocks 608, and each recursive block 608 includes a plurality of residual units 612 and a skip connection 614 coupling an input of the recursive block 608 to an output of the recursive block 608.
  • the electronic device converts (1010) the output feature map 610 to an output image 640 having a third resolution that is greater than the first resolution and generates (1012), from the output image 640, a color image having a color mode.
  • the color mode is (1014) one of: PMS, RGB, CMYK, HEX, YUV, YCbCr, LAB, Index, Greyscale, and Bitmap.
  • each recursive block 608 includes (1016) an output residual unit 612 coupled to the output of the recursive block 608.
  • the image feature map 606 is processed by, for each recursive block 608, combining (1018) a block input feature and a unit output feature of the output residual unit 612 to generate a block output feature at the output of the recursive block 608.
  • the first resolution is H*W and the third resolution is nH*nW, where n is an integer equal to or greater than 1.
  • the second resolution is H/m*W/m, where m is an integer equal to or greater than 1.
  • the sequence of successive recursive blocks 608 includes (1020) a single recursive block 608, and the skip connection of the single recursive block 608 couples the input to the output of the single recursive block 608.
  • in each recursive block 608, the plurality of residual units 612 are successively coupled (1022) to each other and in series, and each residual unit 612 includes a convolutional neural network (CNN) and a rectified linear unit (ReLU) coupled to the CNN.
  • the sequence of successive recursive blocks 608 includes at least a first recursive block 608A and a second recursive block 608B coupled to the first recursive block 608A.
  • the electronic device receives the image feature map 606 by the first recursive block 608A and generates a first block output feature map 626 from the image feature map 606 by the first recursive block 608A.
  • the electronic device receives the first block output feature map 626 by the second recursive block 608B and generates the output feature map 610 from the first block output feature map 626 by the second recursive block 608B.
  • the sequence of successive recursive blocks 608 includes a first number of recursive blocks 608, and each recursive block 608 includes a second number of residual units 612.
  • the first number is less than a recursive block threshold.
  • the second number is less than a residual unit threshold.
  • the recursive block threshold and residual unit threshold are associated with a computational capability of the electronic device. By these means, the successive recursive blocks 608 are controlled within the computational capability of the electronic device.
  • an image processing model 615 or 615’ includes a feature extraction model 616, the sequence of recursive blocks 608, and an output conversion model 618.
  • the feature extraction model 616 is configured to extract the image feature map 606 from the raw image data
  • the output conversion model 618 is configured to convert the output feature map 610 to the output image 640.
  • a server 102 trains the image processing model 615 or 615’ using a predefined loss function L.
  • the predefined loss function L is a weighted combination of a pixel loss L1, a structural similarity loss LSSIM, and a perceptual loss LVGG.
  • the pixel loss L1 indicates a pixel-wise difference between a test output image and a ground truth image.
  • the perceptual loss L VGG indicates a semantic difference between the test output image and ground truth images using a pre-trained VGG image classification network.
  • the plurality of successive recursive blocks 608 includes a first number of recursive blocks 608, and each recursive block 608 includes a second number of residual units 612.
  • the server obtains information of a computational capability of the electronic device and determines a recursive block threshold and a residual unit threshold based on the information of the computational capability of the electronic device.
  • the server 102 determines the first and second numbers for the image processing model 615 or 615’ based on the recursive block threshold and residual unit threshold and provides the image processing model 615 or 615’ to the electronic device.
  • the image processing model 615 or 615’ includes a plurality of layers having a plurality of filters defined by a plurality of weights.
  • the server 102 selects a data format of the plurality of weights based on a precision setting of the electronic device and quantizes the plurality of weights based on the data format by maintaining the data format for the plurality of weights while re-training the image processing model 615 or 615’.
  • the data format is selected from float32, int8, uint8, int16, and uint16.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.
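To connect the pieces, the sketch below shows one way a per-frame loop might apply a Y-channel super-resolution model to a video stream on a device. The capture source, the YCrCb conversion, and the placeholder `sr_model` are assumptions for illustration; the methods above do not mandate this exact pipeline.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def upscale_frame(frame_bgr: np.ndarray, sr_model, scale: int = 3) -> np.ndarray:
    """Super-resolve only the luminance plane of one BGR frame; chroma is upscaled bicubically."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32) / 255.0
    y = torch.from_numpy(ycrcb[..., 0])[None, None]                 # (1, 1, H, W) luminance plane
    with torch.no_grad():
        y_hr = sr_model(y).clamp(0, 1)[0, 0].numpy()
    h, w = y_hr.shape
    crcb_hr = cv2.resize(ycrcb[..., 1:], (w, h), interpolation=cv2.INTER_CUBIC)
    out = np.dstack([y_hr, crcb_hr])
    return cv2.cvtColor((out * 255.0).clip(0, 255).astype(np.uint8), cv2.COLOR_YCrCb2BGR)

# Placeholder model: bicubic upscaling stands in for the trained network.
sr_model = lambda t: F.interpolate(t, scale_factor=3, mode="bicubic", align_corners=False)

cap = cv2.VideoCapture("input_low_res.mp4")                         # hypothetical input path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hr_frame = upscale_frame(frame, sr_model)                       # process one frame per loop
cap.release()
```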

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to video super-resolution. An input image includes a plurality of components, and an electronic device separates an image component from the remaining components of the input image. An image feature map is extracted from the image component and has a second resolution that is equal to or less than a first resolution of the image component. The image feature map is processed using a plurality of successive recursive blocks to generate an output feature map, and each recursive block includes a plurality of residual units and a skip connection coupling an input and an output of the recursive block. The electronic device converts the output feature map into an output component having a third resolution that is greater than the first resolution, and combines the output component and the remaining components of the input image to generate an output image having the third resolution.
PCT/US2022/030918 2022-05-25 2022-05-25 Real-time video super-resolution for mobile devices Ceased WO2023229589A1 (fr)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/US2022/030918 WO2023229589A1 (fr) 2022-05-25 2022-05-25 Super-résolution vidéo en temps réel pour dispositifs mobiles
PCT/US2022/053987 WO2023229644A1 (fr) 2022-05-25 2022-12-23 Super-résolution vidéo en temps réel pour dispositifs mobiles
PCT/US2022/053986 WO2023229643A1 (fr) 2022-05-25 2022-12-23 Réseau résiduel récursif remanié pour super-résolution d'image
PCT/US2022/053989 WO2023229645A1 (fr) 2022-05-25 2022-12-23 Super-résolution vidéo récurrente par trame pour images brutes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2022/030918 WO2023229589A1 (fr) 2022-05-25 2022-05-25 Super-résolution vidéo en temps réel pour dispositifs mobiles

Publications (1)

Publication Number Publication Date
WO2023229589A1 true WO2023229589A1 (fr) 2023-11-30

Family

ID=88919762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/030918 Ceased WO2023229589A1 (fr) 2022-05-25 2022-05-25 Super-résolution vidéo en temps réel pour dispositifs mobiles

Country Status (1)

Country Link
WO (1) WO2023229589A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119831848A (zh) * 2025-02-11 2025-04-15 Zhejiang University A multi-scale spatial optimization video super-resolution method
CN120071110A (zh) * 2025-02-07 2025-05-30 Shandong Freshwater Fisheries Research Institute (Shandong Freshwater Fisheries Monitoring Center) A deep-learning-based image enhancement method for aquaculture water quality

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200218948A1 (en) * 2019-01-03 2020-07-09 Beijing Jingdong Shangke Information Technology Co., Ltd. Thundernet: a turbo unified network for real-time semantic segmentation
US20210264568A1 (en) * 2016-09-15 2021-08-26 Twitter, Inc. Super resolution using a generative adversarial network



Similar Documents

Publication Publication Date Title
CN111741211B (zh) Image display method and device
WO2021077140A2 (fr) Systems and methods of prior knowledge transfer for image inpainting
CN112602088A (zh) Method, system, and computer-readable medium for improving the quality of low-light images
KR102628898B1 (ko) Artificial-intelligence-based image processing method and image processing apparatus performing the same
JP7543080B2 (ja) Trained model and data processing apparatus
US20230267587A1 (en) Tuning color image fusion towards original input color with adjustable details
US20230260092A1 (en) Dehazing using localized auto white balance
CN111047543A (zh) Image enhancement method, apparatus, and storage medium
CN115965559A (zh) Integrated aerial image enhancement method for forest scenes
WO2022103877A1 (fr) Realistic audio-driven 3D avatar generation
CN111079864A (zh) Short video classification method and system based on optimized video key frame extraction
WO2023229589A1 (fr) Real-time video super-resolution for mobile devices
WO2023229591A1 (fr) Real-scene super-resolution with raw images for mobile devices
WO2023023162A1 (fr) 3D semantic plane detection and reconstruction from multi-view stereo (MVS) images
US20230410830A1 (en) Audio purification method, computer system and computer-readable medium
CN115190226A (zh) Parameter adjustment method, neural network model training method, and related apparatus
WO2023229643A1 (fr) Reworked recursive residual network for image super-resolution
WO2022235785A1 (fr) Neural network architecture for image restoration in under-display cameras
US12394021B2 (en) Depth-based see-through prevention in image fusion
WO2023229590A1 (fr) Deep-learning-based video super-resolution
US20230245290A1 (en) Image fusion in radiance domain
WO2023063944A1 (fr) Two-stage hand gesture recognition
WO2023177388A1 (fr) Methods and systems for low-light video enhancement
WO2023277877A1 (fr) 3D semantic plane detection and reconstruction
US20230281839A1 (en) Image alignment with selective local refinement resolution

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22943981

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22943981

Country of ref document: EP

Kind code of ref document: A1