
WO2025227394A1 - System and method for unified reference picture synthesis - Google Patents

System and method for unified reference picture synthesis

Info

Publication number
WO2025227394A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature maps
block
processor
optical flow
flow vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/091009
Other languages
French (fr)
Inventor
Cheolkon Jung
Qipu QIN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/80 Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N 19/82 Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Definitions

  • Embodiments of the present disclosure relate to video coding.
  • Video coding techniques may be used to compress video data, such that coding on the video data can be performed using one or more video coding standards.
  • Exemplary video coding standards may include, but are not limited to, versatile video coding (H.266/VVC), high-efficiency video coding (H.265/HEVC), advanced video coding (H.264/AVC), and Moving Picture Experts Group (MPEG) coding.
  • a method of video encoding or decoding may include obtaining, by a processor, a first input picture associated with a first time and a second input picture associated with a second time.
  • the first input picture and the second input picture may form a first image-pyramid pair.
  • the method may include generating, by the processor, a second image-pyramid pair by performing a first bicubic downsampling of the first image-pyramid pair.
  • the first image-pyramid pair and the second image-pyramid pair may be associated with different resolutions.
  • the method may include generating, by the processor, a third image-pyramid pair by performing a second bicubic downsampling of the second image-pyramid pair.
  • the first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution.
  • the method may include obtaining, by the processor, a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network.
  • the first set of feature maps, the second set of feature maps, and the third set of feature maps may be associated with a same resolution.
  • the method may include computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps.
  • the method may include generating, by the processor, a reference picture associated with a third time based on the plurality of final optical flow vectors. The third time may be between the first time and the second time or after the first time and the second time.
  • an apparatus for video encoding or decoding may include a processor and memory storing instructions.
  • the memory storing instructions, which when executed by the processor, may cause the processor to obtain a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a second image-pyramid pair by performing a first bicubic downsampling of the first image-pyramid pair.
  • the first image-pyramid pair and the second image-pyramid pair may be associated with different resolutions.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a third image-pyramid pair by performing a second bicubic downsampling of the second image-pyramid pair.
  • the first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution.
  • the memory storing instructions, which when executed by the processor, may cause the processor to obtain a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network.
  • the first set of feature maps, the second set of feature maps, and the third set of feature maps may be associated with a same resolution.
  • the memory storing instructions, which when executed by the processor, may cause the processor to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a reference picture associated with a third time based on the plurality of final optical flow vectors.
  • the third time may be between the first time and the second time or after the first time and the second time.
  • a non-transitory computer-readable medium storing instructions.
  • the instructions which when executed by the processor, may cause the processor to obtain a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair.
  • the instructions which when executed by the processor, may cause the processor to generate a second image-pyramid pair by performing a first bicubic downsampling of the first image-pyramid pair.
  • the first image-pyramid pair and the second image-pyramid pair may be associated with different resolutions.
  • the instructions, which when executed by the processor, may cause the processor to generate a third image-pyramid pair by performing a second bicubic downsampling of the second image-pyramid pair.
  • the first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution.
  • the instructions, which when executed by the processor, may cause the processor to obtain a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network.
  • the first set of feature maps, the second set of feature maps, and the third set of feature maps may be associated with a same resolution.
  • the instructions, which when executed by the processor, may cause the processor to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps.
  • the instructions, which when executed by the processor may cause the processor to generate a reference picture associated with a third time based on the plurality of final optical flow vectors.
  • the third time may be between the first time and the second time or after the first time and the second time.
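The pyramid construction summarized above can be sketched compactly. The following is a minimal, hedged illustration of building three image-pyramid pairs by repeated bicubic downsampling, assuming PyTorch tensors; the function and variable names are illustrative and not taken from the patent.

```python
# Minimal sketch of building the three image-pyramid pairs described above.
# Assumes the two input pictures are PyTorch tensors of shape (1, C, H, W);
# names (build_image_pyramid_pairs, scale) are illustrative, not from the patent.
import torch
import torch.nn.functional as F

def build_image_pyramid_pairs(pic_t0, pic_t1, levels=3, scale=0.5):
    """Return a list of (picture_t0, picture_t1) pairs at decreasing resolutions."""
    pairs = [(pic_t0, pic_t1)]                      # first image-pyramid pair (full resolution)
    for _ in range(levels - 1):
        prev_t0, prev_t1 = pairs[-1]
        # Bicubic downsampling of the previous pair gives the next, lower-resolution pair.
        down_t0 = F.interpolate(prev_t0, scale_factor=scale, mode="bicubic", align_corners=False)
        down_t1 = F.interpolate(prev_t1, scale_factor=scale, mode="bicubic", align_corners=False)
        pairs.append((down_t0, down_t1))
    return pairs

# Example: a 3-level pyramid for two 1920x1080 pictures (heights 1080, 540, 270).
pic_t0 = torch.rand(1, 3, 1080, 1920)
pic_t1 = torch.rand(1, 3, 1080, 1920)
pairs = build_image_pyramid_pairs(pic_t0, pic_t1)
```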
  • FIG. 1 illustrates a block diagram of an exemplary encoding system, according to some embodiments of the present disclosure.
  • FIG. 2 illustrates a block diagram of an exemplary decoding system, according to some embodiments of the present disclosure.
  • FIG. 3 illustrates a detailed block diagram of an exemplary video-coding framework, according to some embodiments of the present disclosure.
  • FIG. 4 illustrates a detailed block diagram for training and use of a unified reference frame synthesis (URFS) model included in the video-coding framework of FIG. 3, according to some embodiments of the present disclosure.
  • URFS unified reference frame synthesis
  • FIG. 5 illustrates a detailed block diagram of the URFS model that may be included in the video-coding framework of FIG. 3, according to some embodiments of the present disclosure.
  • FIG. 6 illustrates a detailed block diagram of an exemplary image-feature pyramids network of the URFS model of FIG. 5, according to some embodiments of the present disclosure.
  • FIG. 7 illustrates a detailed block diagram of an exemplary progressive-recursive motion estimation network of the URFS model of FIG. 5, according to some embodiments of the present disclosure.
  • FIG. 8 illustrates a diagram of the optical-flow vectors and feature maps generated by the progressive-recursive motion estimation network of FIG. 7, according to some embodiments of the present disclosure.
  • FIG. 9 illustrates a detailed block diagram of an exemplary picture enhancer and picture synthesizer architecture that may be included in the URFS model of FIG. 5, according to some embodiments of the present disclosure.
  • FIG. 10 illustrates diagrams of an exemplary random access (RA) mode and an exemplary low delay B (LDB) mode of the URFS model of FIG. 5, according to some embodiments of the present disclosure.
  • FIG. 11 illustrates diagrams of the time steps of input pictures using the RA mode and the LDB mode of the URFS model, according to some embodiments of the present disclosure.
  • FIG. 12 illustrates a flow chart of an exemplary method of video encoding or decoding, according to some aspects of the present disclosure.
  • references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” “certain embodiments,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • terminology may be understood at least in part from usage in context.
  • the term “one or more” as used herein, depending at least in part upon context may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense.
  • terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context.
  • the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
  • video coding includes both encoding and decoding a video.
  • Encoding and decoding of a video can be performed by the unit of block.
  • an encoding/decoding process such as transform, quantization, prediction, in-loop filtering, reconstruction, or the like may be performed on a coding block, a transform block, or a prediction block.
  • a block to be encoded/decoded will be referred to as a “current block. ”
  • the current block may represent a coding block, a transform block, or a prediction block according to a current encoding/decoding process.
  • the term “unit” indicates a basic unit for performing a specific encoding/decoding process.
  • the term “block” indicates a sample array of a predetermined size. Unless otherwise stated, “block,” “unit,” and “component” may be used interchangeably.
  • NRS neural reference synthesis
  • an error-corrected auto-regressive network (ECAR-net) may be used.
  • ECAR-net error-corrected auto-regressive network
  • a deep reference frame (DRF) generation method may be used.
  • the optical flow estimation module uses an off-the-shelf pre-trained model (Intermediate Feature Refine Network (IFRNet)), which leads to sub-optimal results because the entire framework cannot be optimized end-to-end.
  • IFRNet may be used in the RA mode and the LDB mode.
  • the present disclosure provides a unified reference picture generation framework for VVC inter-prediction under RA and LDB configurations.
  • the present disclosure performs optical-flow estimation for motion estimation using two pyramid levels: an image-pyramid level and a feature-pyramid level. This enables the present network to recursively calculate the optical flows between features with the same resolution, and also progressively calculate the optical flows between features with different resolutions, achieving a true coarse-to-fine refinement of motion estimation.
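One compact way to write the coarse-to-fine refinement described above is given below; the notation is chosen here for illustration and is not taken from the patent. Here F^(l) denotes the feature maps at pyramid level l, f^(l)_k the optical flow after k recursive updates at that level, W the warping operator, MEB a motion-estimation block, and up_x2 bilinear upsampling.

```latex
% Recursive update at a fixed resolution: warp the features with the current flow,
% estimate a residual flow with a motion-estimation block, and add it.
f^{(l)}_{k+1} = f^{(l)}_{k} + \mathrm{MEB}\left(\mathcal{W}\left(F^{(l)}, f^{(l)}_{k}\right)\right)

% Progressive hand-off to the next, finer resolution: upsample the refined flow
% (the factor 2 on the magnitudes is an assumption tied to the x2 resolution change).
f^{(l-1)}_{0} = 2\,\mathrm{up}_{\times 2}\left(f^{(l)}_{K}\right)
```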
  • FIG. 1 illustrates a block diagram of an exemplary encoding system 100, according to some embodiments of the present disclosure.
  • FIG. 2 illustrates a block diagram of an exemplary decoding system 200, according to some embodiments of the present disclosure.
  • Each system 100 or 200 may be applied or integrated into various systems and apparatus capable of data processing, such as computers and wireless communication devices.
  • system 100 or 200 may be the entirety or part of a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, a virtual reality (VR) device, an augmented reality (AR) device, or any other suitable electronic device having data processing capability.
  • VR virtual reality
  • AR augmented reality
  • system 100 or 200 may include a processor 102, a memory 104, and an interface 106. These components are shown as connected to one another by a bus, but other connection types are also permitted. It is understood that system 100 or 200 may include any other suitable components for performing functions described here.
  • Processor 102 may include microprocessors, such as graphic processing unit (GPU), image signal processor (ISP), central processing unit (CPU), digital signal processor (DSP), tensor processing unit (TPU), vision processing unit (VPU), neural processing unit (NPU), synergistic processing unit (SPU), or physics processing unit (PPU), microcontroller units (MCUs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functions described throughout the present disclosure. Although only one processor is shown in FIGs. 1 and 2, it is understood that multiple processors can be included.
  • GPU graphic processing unit
  • ISP image signal processor
  • CPU central processing unit
  • DSP digital signal processor
  • TPU tensor processing unit
  • VPU vision processing unit
  • NPU neural processing unit
  • SPU synergistic processing unit
  • Processor 102 may be a hardware device having one or more processing cores.
  • Processor 102 may execute software.
  • Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
  • Software can include computer instructions written in an interpreted language, a compiled language, or machine code. Other techniques for instructing hardware are also permitted under the broad category of software.
  • Memory 104 can broadly include both memory (a.k.a. primary/system memory) and storage (a.k.a. secondary memory).
  • memory 104 may include random-access memory (RAM) , read-only memory (ROM) , static RAM (SRAM) , dynamic RAM (DRAM) , ferro-electric RAM (FRAM) , electrically erasable programmable ROM (EEPROM) , compact disc read-only memory (CD-ROM) or other optical disk storage, hard disk drive (HDD) , such as magnetic disk storage or other magnetic storage devices, Flash drive, solid-state drive (SSD) , or any other medium that can be used to carry or store desired program code in the form of instructions that can be accessed and executed by processor 102.
  • RAM random-access memory
  • ROM read-only memory
  • SRAM static RAM
  • DRAM dynamic RAM
  • FRAM ferro-electric RAM
  • EEPROM electrically erasable programmable ROM
  • CD-ROM compact disc read-only memory
  • Interface 106 can broadly include a data interface and a communication interface that is configured to receive and transmit a signal in a process of receiving and transmitting information with other external network elements.
  • interface 106 may include input/output (I/O) devices and wired or wireless transceivers.
  • I/O input/output
  • Although only one interface is shown in FIGs. 1 and 2, it is understood that multiple interfaces can be included.
  • Processor 102, memory 104, and interface 106 may be implemented in various forms in system 100 or 200 for performing video coding functions.
  • processor 102, memory 104, and interface 106 of system 100 or 200 are implemented (e.g., integrated) on one or more system-on-chips (SoCs) .
  • SoCs system-on-chips
  • processor 102, memory 104, and interface 106 may be integrated on an application processor (AP) SoC that handles application processing in an operating system (OS) environment, including running video encoding and decoding applications.
  • AP application processor
  • processor 102, memory 104, and interface 106 may be integrated on a specialized processor chip for video coding, such as a GPU or ISP chip dedicated to image and video processing in a real-time operating system (RTOS) .
  • RTOS real-time operating system
  • processor 102 may include one or more modules, such as an encoder 101 (also referred to herein as a “pre-processing network” ) .
  • encoder 101 also referred to herein as a “pre-processing network”
  • FIG. 1 shows that encoder 101 is within one processor 102, it is understood that encoder 101 may include one or more sub-modules that can be implemented on different processors located closely or remotely with each other.
  • Encoder 101 (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 102 designed for use with other components or software units implemented by processor 102 through executing at least part of a program, i.e., instructions.
  • the instructions of the program may be stored on a computer-readable medium, such as memory 104, and when executed by processor 102, may cause processor 102 to perform a process having one or more functions related to video encoding, such as picture partitioning, inter-prediction, intra-prediction, transformation, quantization, filtering, entropy encoding, etc., as described below in detail.
  • processor 102 may include one or more modules, such as a decoder 201 (also referred to herein as a “post-processing network” ) .
  • decoder 201 also referred to herein as a “post-processing network”
  • FIG. 2 shows that decoder 201 is within one processor 102, it is understood that decoder 201 may include one or more sub-modules that can be implemented on different processors located closely or remotely with each other.
  • Decoder 201 (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 102 designed for use with other components or software units implemented by processor 102 through executing at least part of a program, i.e., instructions.
  • the instructions of the program may be stored on a computer-readable medium, such as memory 104, and when executed by processor 102, may cause processor 102 to perform a process having one or more functions related to video decoding, such as entropy decoding, inverse quantization, inverse transformation, inter-prediction, intra-prediction, filtering, etc., as described below in detail.
  • FIG. 3 illustrates a detailed block diagram of an exemplary video-coding network 300 (referred to hereinafter as “video-coding network 300” ) , according to some embodiments of the present disclosure.
  • video-coding network 300 may include, e.g., a transformation unit 302, a quantization component 304, an entropy coding unit 306, a scaling unit 308, an inverse transform unit 310, an in-loop filter unit 312, a decoded-picture buffer 314, a URFS model 316, an inter-prediction unit 318, an intra-prediction unit 326, an adder/subtractor 328, and an adder/subtractor 330.
  • the inter-prediction unit 318 may include, e.g., a reference-picture list 320, a motion estimation unit 322, and a motion compensation unit 324.
  • video-coding network 300 feeds a current image area 301 into transformation unit 302, which may encode geometry positions and attributes associated with mesh vertices separately.
  • attribute coding depends on decoded geometry. As a consequence, mesh vertex positions may be coded first.
  • transformation unit 302 may be configured to perform a coordinate transformation and attribute transformation (e.g., based on the results from geometry analysis) .
  • Quantization component 304 may be configured to quantize the transformed coefficients of attributes to generate quantization levels of the attributes associated with each point to reduce the dynamic range.
  • Entropy coding unit 306 is configured to encode the resulting quantization levels of the attributes. After entropy coding, a bitstream 303 (e.g., encoded bitstream) may be output.
  • the quantization levels may be input to the scaling unit 308, on the decoder side.
  • the scaling unit 308 may be configured to decode positions associated with vertices of a dynamic mesh from the geometry bitstream.
  • In-loop filter unit 312 may be configured to generate decoded pictures, which are maintained in decoded-picture buffer 314.
  • decoded-picture buffer 314 may store a first input picture 305a and a second input picture 305b that are input into URFS model 316.
  • URFS model 316 may be configured to generate a reference picture 307 using optical flow vectors generated by applying progressive-recursive motion estimation to the first input picture 305a and the second input picture 305b, as described below in connection with FIGs. 4-11.
  • Reference picture 307 may be maintained by reference-picture list 320.
  • Motion estimation unit 322 may perform motion estimation using the reference pictures from reference-picture list 320.
  • Motion compensation unit 324 may perform motion compensation using motion estimation information output by motion estimation unit 322.
  • Inter-prediction unit 318 may perform inter-frame prediction for decoding, while intra-prediction unit 326 may perform intra-frame prediction for decoding.
  • inter-prediction unit 318 may derive a predicted block for the input current picture 301 based on a reference block (reference sample array) specified by a motion vector on a reference picture 307.
  • the motion information may be predicted in units of blocks, subblocks, or samples based on correlation of motion information between the neighboring block and the current block.
  • the motion information may include a motion vector and a reference picture index.
  • the motion information may further include inter-prediction direction (L0 prediction, L1 prediction, Bi prediction, etc. ) information.
  • the neighboring block may include a spatial neighboring block present in the input current picture 301 and a temporal neighboring block present in the reference picture 307.
  • the reference picture 307 including the reference block and the reference picture including the temporal neighboring block may be the same or different.
  • the temporal neighboring block may be called a collocated reference block, a co-located CU (colCU) , and the like, and the reference picture including the temporal neighboring block may be called a collocated picture (colPic) .
  • the inter-prediction unit 318 may configure a motion information candidate list based on neighboring blocks and generate information indicating which candidate is used to derive a motion vector and/or a reference picture index of the current block.
  • Inter-prediction may be performed based on various prediction modes.
  • the inter-prediction unit 318 may use motion information of the neighboring block as motion information of the current block.
  • the residual signal may not be transmitted.
  • the motion vector of the neighboring block may be used as a motion vector predictor and the motion vector of the current block may be indicated by signaling a motion vector difference.
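As a small worked illustration of the predictor-plus-difference signaling just described, the decoder side can recover the motion vector by adding the signaled difference to the predictor taken from the neighboring block. This is a hedged sketch; the names and the quarter-pel example values are illustrative, not taken from the VVC specification.

```python
# Hedged sketch: recovering a motion vector from a motion vector predictor (MVP)
# taken from a neighboring block and a signaled motion vector difference (MVD).
def reconstruct_motion_vector(mvp, mvd):
    # Component-wise sum of predictor and difference.
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])

# Neighboring block moved (+4, -2) in quarter-pel units; the signaled difference is (+1, +3).
mv = reconstruct_motion_vector((4, -2), (1, 3))   # -> (5, 1)
```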
  • the inter-prediction unit 318 may generate a prediction signal based on various prediction methods described below.
  • the predictor may not only apply intra-prediction or inter-prediction to predict one block but also simultaneously apply both intra-prediction and inter-prediction. This may be called combined inter and intra-prediction (CIIP) .
  • the predictor may be based on an intra block copy (IBC) prediction mode or a palette mode for prediction of a block.
  • the IBC prediction mode or palette mode may be used for content image/video coding of a game or the like, for example, screen content coding (SCC) .
  • SCC screen content coding
  • the IBC basically performs prediction in the current picture but may be performed similarly to inter-prediction in that a reference block is derived in the current picture.
  • the IBC may use at least one of the inter-prediction techniques described in this document.
  • the palette mode may be considered as an example of intra coding or intra-prediction. When the palette mode is applied, a sample value within a picture may be signaled based on information on the palette table and the palette index.
  • the prediction signal generated by the inter-prediction unit 318 may be used to generate a reconstructed signal or to generate a residual signal.
  • the transformation unit 302 may generate transform coefficients by applying a transform technique to the residual signal.
  • the transform technique may include at least one of a discrete cosine transform (DCT), a discrete sine transform (DST), a Karhunen-Loève transform (KLT), a graph-based transform (GBT), or a conditionally non-linear transform (CNT).
  • DCT discrete cosine transform
  • DST discrete sine transform
  • KLT Karhunen-Loève transform
  • GBT graph-based transform
  • CNT conditionally non-linear transform
  • the GBT means transform obtained from a graph when relationship information between pixels is represented by the graph.
  • the CNT refers to transform generated based on a prediction signal generated using all previously reconstructed pixels.
  • the transform process may be applied to square pixel blocks having the same size or may be applied to blocks having a variable size rather than square.
  • the quantization unit 304 may quantize the transform coefficients and transmit them to the entropy coding unit 306, which may encode the quantized signal (information on the quantized transform coefficients) and output it as the output bitstream 303.
  • the information on the quantized transform coefficients may be referred to as residual information.
  • the quantization unit 304 may rearrange the block-type quantized transform coefficients into a one-dimensional vector form based on a coefficient scanning order and generate information on the quantized transform coefficients based on the quantized transform coefficients in the one-dimensional vector form.
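A minimal sketch of the quantize-then-scan step described above follows, assuming NumPy. The uniform quantizer and the simple diagonal scan used here are simplified stand-ins for illustration, not the normative VVC quantization or coefficient scanning order.

```python
# Hedged sketch: quantize a block of transform coefficients and rearrange it into a
# one-dimensional vector following a simple diagonal scan order.
import numpy as np

def quantize(coeffs, step):
    # Uniform mid-tread quantizer: level = round(coefficient / step).
    return np.round(coeffs / step).astype(np.int32)

def diagonal_scan(block):
    # Visit anti-diagonals (constant row + col) from top-left to bottom-right.
    h, w = block.shape
    order = sorted(((r, c) for r in range(h) for c in range(w)),
                   key=lambda rc: (rc[0] + rc[1], rc[0]))
    return np.array([block[r, c] for r, c in order])

coeffs = np.array([[52.0, -10.0, 3.0, 0.0],
                   [-8.0,   4.0, 0.0, 0.0],
                   [ 2.0,   0.0, 0.0, 0.0],
                   [ 0.0,   0.0, 0.0, 0.0]])
levels = quantize(coeffs, step=4.0)    # quantization levels for the block
scanned = diagonal_scan(levels)        # 1-D vector handed to entropy coding
```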
  • the entropy coding unit 306 may perform various encoding methods such as, for example, exponential Golomb, context-adaptive variable length coding (CAVLC) , context-adaptive binary arithmetic coding (CABAC) , and the like.
  • the entropy coding unit 306 may encode information necessary for video/image reconstruction other than the quantized transform coefficients (e.g., values of syntax elements) together or separately.
  • Encoded information (e.g., encoded video/image information) may be transmitted or stored in the form of a bitstream in units of network abstraction layer (NAL) units.
  • the video/image information may further include information on various parameter sets such as an adaptation parameter set (APS) , a picture parameter set (PPS) , a sequence parameter set (SPS) , or a video parameter set (VPS) .
  • the video/image information may further include general constraint information.
  • information and/or syntax elements transmitted/signaled from the encoding apparatus to the decoding apparatus may be included in video/picture information.
  • the video/image information may be encoded through the above-described encoding procedure and included in the bitstream.
  • the bitstream may be transmitted over a network or may be stored in a digital storage medium.
  • the network may include a broadcasting network and/or a communication network
  • the digital storage medium may include various storage media such as USB, SD, CD, DVD, Blu-ray, HDD, SSD, and the like.
  • a transmitter (not shown) transmitting a signal output from the entropy coding unit 306 and/or a storage unit (not shown) storing the signal may be included as an internal/external element of the video-coding network 300; alternatively, the transmitter may be included in the entropy coding unit 306.
  • the quantized transform coefficients output from the quantization unit 304 may be used to generate a prediction signal.
  • the residual signal (residual block or residual samples) may be reconstructed by applying dequantization and inverse transform to the quantized transform coefficients through the scaling unit 308 and the inverse transformation unit 310.
  • the adder/subtractor 330 adds the reconstructed residual signal to the prediction signal output from the inter-prediction unit 318 or the intra-prediction unit 326 to generate a reconstructed signal (reconstructed picture, reconstructed block, reconstructed sample array) . If there is no residual for the block to be processed, such as a case where the skip mode is applied, the predicted block may be used as the reconstructed block.
  • the adder/subtractor 330 may be called a reconstructor or a reconstructed block generator.
  • the generated reconstructed signal may be used for intra-prediction of a next block to be processed in the current picture and may be used for inter-prediction of a next picture through filtering as described below.
  • LMCS luma mapping with chroma scaling
  • the in-loop filter 312 may improve subjective/objective image quality by applying filtering to the reconstructed signal.
  • the in-loop filter 312 may generate a modified reconstructed picture by applying various filtering methods to the reconstructed picture and store the modified reconstructed picture in the decoded picture buffer 314.
  • the various filtering methods may include, for example, deblocking filtering, a sample adaptive offset, an adaptive loop filter, a bilateral filter, and the like.
  • the in-loop filter 312 may generate various information related to the filtering and transmit the generated information to the entropy coding unit 306.
  • the information related to the filtering may be encoded by the entropy coding unit 306 and output in the output bitstream 303.
  • the modified reconstructed picture transmitted to the decoded picture buffer 314 may be sent to the reference picture list 320, which is later updated by the reference picture 307 generated by the URFS model 316.
  • when the inter-prediction is applied through the encoding apparatus, prediction mismatch between the encoding apparatus and the decoding apparatus may be avoided and encoding efficiency may be improved.
  • FIG. 4 illustrates a detailed block diagram 400 for training and use of the URFS model 316 included in video-coding network 300 (e.g., VVC encoder/decoder) of FIG. 3, according to some embodiments of the present disclosure.
  • QP distance 407 is used to train the URFS model 316.
  • the QP distance 407 training strategy uses pictures compressed by the VVC test model (VTM) with smaller artifacts as labels for training (e.g., as labels for output pictures 403a generated by the URFS model 316 from input pictures 401a), instead of using uncompressed pictures (e.g., ground truth pictures 405) as labels.
  • VTM VVC test model
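The label choice implied by the QP distance strategy can be sketched as follows: the training target is a VTM-compressed picture with smaller artifacts rather than the uncompressed ground truth. This is a hedged sketch; the L1 loss and tensor names are assumptions, as the text above does not specify the exact loss.

```python
# Hedged sketch of the QP-distance labeling idea: train the URFS network against a
# VTM-compressed picture with smaller artifacts (lower QP) instead of the
# uncompressed ground-truth picture.
import torch
import torch.nn.functional as F

def urfs_training_loss(synthesized, vtm_compressed_label):
    # The label is a reconstructed picture from VTM at a lower QP, not the raw ground truth.
    return F.l1_loss(synthesized, vtm_compressed_label)
```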
  • URFS model 316 is first integrated into the VTM software to ensure online use. For instance, two reconstructed pictures 401b from decoded picture buffer (DPB) 314, together with the QP maps, are fed into the URFS model 316 to synthesize an output picture 403b (e.g., a reference picture) . Output picture 403b is inserted into the reference picture list (RPL) as a replacement to mitigate motion-related artifacts.
  • DPB decoded picture buffer
  • RPL reference picture list
  • URFS model 316 has the same network architecture in both RA and LDB configurations (e.g., a unified model) , while JVET-AD0160 has different network architectures in the RA and LDB configurations. URFS model 316 does not need to change motion estimation in VTM, and it does not need extra information signaled into the bitstream.
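A hedged sketch of the online integration described above: two reconstructed pictures from the decoded picture buffer, together with QP maps, are fed to the URFS model, and the synthesized picture replaces an entry of the reference picture list. The names `urfs_model`, `dpb`, `qp_maps`, and `reference_picture_list` are hypothetical placeholders, not VTM APIs.

```python
# Hedged sketch of online use: synthesize a reference picture and insert it into the RPL.
import torch

def synthesize_and_insert(urfs_model, dpb, qp_maps, reference_picture_list, replace_index=0):
    rec0, rec1 = dpb[-2], dpb[-1]                      # two reconstructed pictures from the DPB
    with torch.no_grad():
        synthesized = urfs_model(rec0, rec1, qp_maps)  # output picture (synthesized reference)
    reference_picture_list[replace_index] = synthesized  # replacement entry in the RPL
    return reference_picture_list
```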
  • FIG. 5 illustrates a detailed block diagram 500 of the URFS model 316 that may be included in the video-coding network 300 of FIG. 3, according to some embodiments of the present disclosure.
  • the URFS model 316 may include an image-feature pyramids network 501, a progressive-recursive motion-estimation network 503 (referred to hereinafter as “motion-estimation network 503” ) , and a picture-generation network 505.
  • Image-feature pyramids network 501 generates image-level pyramids and feature-level pyramids, which collectively may overcome the challenges of large-scale motion while reducing computational complexity.
  • image-feature pyramids network 501 may receive a pair of input pictures associated with a first resolution, which may be used as a first image-pyramid pair 507a.
  • Bicubic downsampling of first image-pyramid pair 507a may be performed to obtain a second image-pyramid pair 507b of a second resolution lower than the first resolution.
  • Bicubic downsampling of the second image-pyramid pair 507b may be performed to obtain a third image-pyramid pair 507c, and so on.
  • in some embodiments, five image-pyramid levels (levels 5 through 1) may be implemented by image-feature pyramids network 501.
  • more or fewer than five levels may be implemented by image-feature pyramids network 501 without departing from the scope of the present disclosure.
  • each image-pyramid level may have a corresponding feature extractor configured to generate feature maps from its image-pyramids.
  • the feature maps generated at each level may have the same resolution.
  • the level-5 feature extractor 502a may generate a first set of feature maps 509a from first image-pyramid pair 507a
  • the level-4 feature extractor 502b may generate a second set of feature maps 509b from second image-pyramid pair 507b
  • the level-3 feature extractor 502c may generate a third set of feature maps 509c from third image-pyramid pair 507c.
  • First set of feature maps 509a, second set of feature maps 509b, and third set of feature maps 509c may be input into motion-estimation network 503.
  • Example network structures of the feature extractors are described below in connection with FIG. 6.
  • FIG. 6 illustrates a detailed block diagram of image-feature pyramids network 501 of the URFS model 316 of FIG. 5, according to some embodiments of the present disclosure.
  • the present techniques use feature-level pyramids and image-level pyramids. This is in contrast to existing techniques, which handle large-scale motion using only feature pyramids, which increases receptive fields.
  • the first set of feature maps 509a generated by the level-5 feature extractor 502a may be represented as a set of feature maps, one for each picture of the first image-pyramid pair 507a.
  • the second set of feature maps 509b generated by the level-4 feature extractor 502b may be represented as a set of feature maps, one for each picture of the second image-pyramid pair 507b.
  • the third set of feature maps 509c generated by the level-3 feature extractor 502c may be represented as a set of feature maps, one for each picture of the third image-pyramid pair 507c.
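One way the per-level feature extractors could yield feature maps of the same resolution is to give the higher-resolution pyramid levels more stride-2 convolutions. The sketch below is a hedged illustration assuming PyTorch; the channel counts, depths, and activation are assumptions, not the exact FIG. 6 structure.

```python
# Hedged sketch of per-level feature extractors whose outputs share one spatial resolution.
import torch
import torch.nn as nn

def make_feature_extractor(num_stride2_convs, channels=32):
    layers, in_ch = [], 3
    for _ in range(num_stride2_convs):
        layers += [nn.Conv2d(in_ch, channels, kernel_size=3, stride=2, padding=1), nn.PReLU(channels)]
        in_ch = channels
    layers += [nn.Conv2d(in_ch, channels, kernel_size=3, stride=1, padding=1)]
    return nn.Sequential(*layers)

# Level 5 sees the full-resolution pair, level 4 the 1/2-resolution pair, level 3 the 1/4-resolution pair.
extractor_l5 = make_feature_extractor(num_stride2_convs=3)
extractor_l4 = make_feature_extractor(num_stride2_convs=2)
extractor_l3 = make_feature_extractor(num_stride2_convs=1)

x5 = torch.rand(1, 3, 256, 256)   # full resolution
x4 = torch.rand(1, 3, 128, 128)   # after one bicubic downsampling
x3 = torch.rand(1, 3, 64, 64)     # after two bicubic downsamplings
print(extractor_l5(x5).shape, extractor_l4(x4).shape, extractor_l3(x3).shape)  # all (1, 32, 32, 32)
```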
  • motion-estimation network 503 may recursively predict the bidirectional optical flows at feature levels with the same resolution, and progressively estimate and upsample the optical flows at feature levels at different resolutions, as described below in connection with FIG. 7.
  • FIG. 7 illustrates a detailed block diagram of motion-estimation network 503 of the URFS model 316 of FIG. 5, according to some embodiments of the present disclosure.
  • the motion-estimation network 503 recursively predicts the bidirectional flows at feature levels with the same resolution, and progressively estimates and upsamples the flows at feature levels with different resolutions. Overall, the motion-estimation network 503 learns residual optical flow vectors and continuously adds them together. For ease of description and illustration, only the feature maps from level-5, level-4, and level-3 are depicted, but it is understood that sets of feature maps from all five levels may be used for motion estimation in determining the final set of optical flow vectors 513.
  • the third set of feature maps 509c are input into motion estimation block1 (MEB1) 702.
  • a time step T, which indicates the time step between the images in the image-pyramid pair, may also be input into MEB1 702.
  • MEB1 702 may include a plurality of 3x3 convolutional layers and a summation operation to generate a first set of optical flow vectors from the third set of feature maps 509c.
  • the first set of optical flow vectors may be input into the warping block 704, MEB1 702, and summation block 706 associated with the level-4 image-pyramids.
  • the second set of feature maps 509b may be input into the corresponding warping block 704, which may warp the second set of feature maps 509b using the first set of optical flow vectors.
  • the warped feature maps may be input into MEB1, which generates an intermediate second set of optical flow vectors based on the warped second set of feature maps and the first set of optical flow vectors.
  • Summation block 706 may perform a summation operation using the first set of optical flow vectors and the intermediate second set of optical flow vectors to generate a final second set of optical flow vectors.
  • Motion-estimation network 503 may upsample the final second set of optical flow vectors.
  • the upsampled second set of optical flow vectors may be input into the warping block 704, the motion estimation block2 (MEB2) 708, and the summation block 706 associated with the first set of feature maps 509a.
  • the first set of feature maps 509a may be input into the corresponding warping block 704, which may warp the first set of feature maps 509a using the upsampled second set of optical flow vectors.
  • the warped feature maps may be input into MEB2 708, which generates an intermediate third set of optical flow vectors based on the warped first set of feature maps and the upsampled second set of optical flow vectors.
  • Summation block 706 may perform a summation operation using the upsampled second set of optical flow vectors and the intermediate third set of optical flow vectors to generate a final third set of optical flow vectors.
  • the list 800 of inputs, outputs, and optical flow vectors used/generated at each stage of motion estimation in FIG. 7 is shown in FIG. 8.
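The FIG. 7/FIG. 8 walkthrough above can be condensed into a single coarse-to-fine loop: warp the current level's feature maps with the incoming flow, estimate a residual flow with a motion-estimation block, sum, and optionally upsample before the next level. The sketch below is a hedged illustration assuming PyTorch; `warp` and the MEB modules are placeholders for the patent's warping and motion-estimation blocks, and the x2 scaling of flow magnitudes on upsampling is an assumption.

```python
# Hedged sketch of progressive-recursive motion estimation.
import torch
import torch.nn.functional as F

def progressive_recursive_flow(stages, warp):
    """stages: coarsest-to-finest list of (feature_maps, motion_estimation_block, upsample_after);
    warp(features, flow) -> features warped by the current flow vectors."""
    flow = None
    for feats, meb, upsample_after in stages:
        if flow is None:
            flow = meb(feats)                                   # MEB1 on the coarsest feature maps
        else:
            warped = warp(feats, flow)                          # warping block
            residual = meb(torch.cat([warped, flow], dim=1))    # intermediate set of flow vectors
            flow = flow + residual                              # summation block
        if upsample_after:
            # Progressive step: move to the next feature resolution; flow magnitudes scale with size.
            flow = 2.0 * F.interpolate(flow, scale_factor=2, mode="bilinear", align_corners=False)
    return flow                                                 # final set of optical flow vectors

# Example wiring for the FIG. 7 walkthrough (all modules are placeholders):
# final_flow = progressive_recursive_flow(
#     [(feats_level3, meb1, False), (feats_level4, meb1, True), (feats_level5, meb2, False)],
#     warp=feature_warping_block)
```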
  • the final set of feature maps 511 may be input into the backward warping block 504 of picture-generation network 505.
  • the final set of optical flow vectors 513 may be input into backward warping block 504 and the feature enhancer 506 of picture-generation network 505.
  • the input pictures (first image-pyramid pair 507a) may also be input into backward warping block 504, which may warp the input pictures based on the final set of optical flow vectors 513.
  • the warped pictures may be input into feature enhancer 506, which enhances the features of the warped pictures using the final set of feature maps 511.
  • the enhanced warped pictures are input into picture synthesizer 508, which synthesizes the enhanced warped pictures into output picture 515 (e.g., a reference picture) .
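Backward warping of an input picture with the final optical flow vectors can be implemented with `torch.nn.functional.grid_sample`, as in the hedged sketch below. The normalization of pixel displacements to the [-1, 1] sampling grid follows the usual PyTorch convention and is not taken verbatim from the patent.

```python
# Hedged sketch of backward warping a picture with per-pixel optical flow vectors.
import torch
import torch.nn.functional as F

def backward_warp(picture, flow):
    """picture: (B, C, H, W); flow: (B, 2, H, W) holding per-pixel (dx, dy) displacements."""
    b, _, h, w = picture.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=picture.dtype, device=picture.device),
        torch.arange(w, dtype=picture.dtype, device=picture.device),
        indexing="ij")
    base = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)  # (B, 2, H, W) pixel grid
    coords = base + flow                                                    # sample locations per output pixel
    # Normalize sample locations to [-1, 1] as required by grid_sample (x by width, y by height).
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                            # (B, H, W, 2)
    return F.grid_sample(picture, grid, mode="bilinear", padding_mode="border", align_corners=True)
```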
  • Both feature enhancer 506 and picture synthesizer 508 include a U-net architecture as their basic structure and are equipped with depthwise-separable convolution (DSC) blocks, as shown in FIG. 9.
  • DSC depthwise-separable convolution
  • FIG. 9 illustrates an exemplary U-net architecture 900 that may be used for feature enhancer 506 and picture synthesizer 508 of FIG. 5, according to some embodiments of the present disclosure.
  • the U-net architecture 900 may include a plurality of 3x3 convolutional layers 902, DSC blocks 904, max-pooling layers 906, pixel-shuffle layers 908, and a x2 DSC block 910 that includes a plurality of 1x1 convolutional layers 912, and a 3x3 DSC layer 914.
  • the feature enhancer 506 and picture synthesizer 508, which are constructed with the U-net architecture 900 as the basic backbone, organically combine motion-compensation information to synthesize high-quality reference pictures.
  • DSC blocks 904/910 are used, as shown in FIG. 5.
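A depthwise-separable convolution block of the kind FIG. 9 describes (1x1 pointwise convolutions around a 3x3 depthwise convolution) could look like the hedged sketch below; the channel count and activation are assumptions rather than the exact FIG. 9 configuration.

```python
# Hedged sketch of a depthwise-separable convolution (DSC) block.
import torch.nn as nn

class DSCBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pointwise_in = nn.Conv2d(channels, channels, kernel_size=1)
        # groups=channels makes the 3x3 convolution depthwise (one filter per channel).
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        self.pointwise_out = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.PReLU(channels)

    def forward(self, x):
        return self.act(self.pointwise_out(self.depthwise(self.pointwise_in(x))))
```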
  • FIG. 10 illustrates a diagram 1000 of an exemplary RA mode and an exemplary LDB mode of the URFS model of FIG. 5, according to some embodiments of the present disclosure.
  • in the RA mode 1002a for URFS, the input pictures 1001a are non-contiguous in the time domain. The RA mode 1002a generates an output picture 1003a that is located between the two input pictures 1001a in the time domain.
  • in the low delay B (LDB) mode 1002b for URFS, the input pictures 1001b are contiguous in the time domain. The LDB mode 1002b generates an output picture 1003b that is located after the input pictures 1001b in the time domain.
  • LDB low delay B
  • FIG. 11 illustrates diagrams of the time steps 1100 of input pictures using the RA mode and the LDB mode of the URFS model, according to some embodiments of the present disclosure.
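The difference between the two modes' time steps can be illustrated with a small, hedged sketch: in RA mode the synthesized picture lies between the two non-contiguous inputs, while in LDB mode it lies after the two contiguous inputs. The picture-order-count arithmetic here is an illustration, not the patent's exact rule.

```python
# Hedged sketch of choosing the synthesis time step for RA vs. LDB configurations.
def synthesis_poc(poc0, poc1, mode):
    if mode == "RA":
        return (poc0 + poc1) // 2        # interpolate a picture between the inputs
    if mode == "LDB":
        return poc1 + (poc1 - poc0)      # extrapolate a picture after the inputs
    raise ValueError("mode must be 'RA' or 'LDB'")

print(synthesis_poc(16, 24, "RA"))   # 20
print(synthesis_poc(30, 31, "LDB"))  # 32
```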
  • FIG. 12 illustrates a flow chart of an exemplary method 1200 of video encoding, according to some embodiments of the present disclosure.
  • Method 1200 may be performed by an apparatus, e.g., such as encoder 101, decoder 201, video-coding network 300, URFS model 316, image-feature pyramids network 501, progressive-recursive motion-estimation network 503, and/or picture-generation network 505.
  • Method 1200 may include operations 1202-1212 as described below. It is understood that some of the operations may be optional, and some of the operations may be performed simultaneously, or in a different order other than shown in FIG. 12.
  • the apparatus may obtain a first input picture associated with a first time and a second input picture associated with a second time that form a first image-pyramid pair.
  • image-feature pyramids network 501 may receive a pair of input pictures associated with a first resolution, which may be used as a first image-pyramid pair 507a.
  • the input pictures may be obtained from a decoded picture buffer, as shown in FIG. 3.
  • the apparatus may generate a second image-pyramid pair by performing a first filtering of the first image-pyramid pair. For example, referring to FIG. 5, bicubic downsampling (or other filtering) of first image-pyramid pair 507a may be performed to obtain a second image-pyramid pair 507b of a second resolution lower than the first resolution.
  • the apparatus may generate a third image-pyramid pair by performing a second filtering of the second image-pyramid pair. For example, referring to FIG. 5, bicubic downsampling (or other filtering) of the second image-pyramid pair 507b may be performed to obtain a third image-pyramid pair 507c, and so on.
  • the apparatus may generate a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network and generate a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network.
  • the first set of feature maps 509a generated by the level-5 feature extractor 502a may be represented as a set of feature maps, one for each picture of the first image-pyramid pair 507a.
  • the second set of feature maps 509b generated by the level-4 feature extractor 502b may be represented as a set of feature maps, one for each picture of the second image-pyramid pair 507b.
  • the third set of feature maps 509c generated by the level-3 feature extractor 502c may be represented as a set of feature maps, one for each picture of the third image-pyramid pair 507c.
  • the apparatus may compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps.
  • the motion-estimation network 503 recursively predicts the bidirectional flows at feature levels with the same resolution, and progressively estimates and upsamples the flows at feature levels with different resolutions. Overall, the motion-estimation network 503 learns residual optical flow vectors and continuously adds them together. For ease of description and illustration, only the feature maps from level-5, level-4, and level-3 are depicted, but it is understood that sets of feature maps from all five levels may be used for motion estimation in determining the final set of optical flow vectors 513.
  • the third set of feature maps 509c are input into motion estimation block1 (MEB1) 702.
  • a time step T, which indicates the time step between the images in the image-pyramid pair, may also be input into MEB1 702.
  • MEB1 702 may include a plurality of 3x3 convolutional layers and a summation operation to generate a first set of optical flow vectors from the third set of feature maps 509c.
  • the first set of optical flow vectors may be input into the warping block 704, MEB1 702, and summation block 706 associated with the level-4 image-pyramids.
  • the second set of feature maps 509b may be input into the corresponding warping block 704, which may warp the second set of feature maps 509b using the first set of optical flow vectors.
  • the warped feature maps may be input into MEB1, which generates an intermediate second set of optical flow vectors based on the warped second set of feature maps and the first set of optical flow vectors.
  • Summation block 706 may perform a summation operation using the first set of optical flow vectors and the intermediate second set of optical flow vectors to generate a final second set of optical flow vectors.
  • Motion-estimation network 503 may upsample the final second set of optical flow vectors.
  • the upsampled second set of optical flow vectors may be input into the warping block 704, the motion estimation block2 (MEB2) 708, and the summation block 706 associated with the first set of feature maps 509a.
  • the first set of feature maps 509a may be input into the corresponding warping block 704, which may warp the first set of feature maps 509a using the upsampled second set of optical flow vectors.
  • the warped feature maps may be input into MEB2 708, which generates an intermediate third set of optical flow vectors based on the warped first set of feature maps and the upsampled second set of optical flow vectors.
  • Summation block 706 may perform a summation operation using the upsampled second set of optical flow vectors and the intermediate third set of optical flow vectors to generate a final third set of optical flow vectors.
  • the list 800 of inputs, outputs, and optical flow vectors used/generated at each stage of motion estimation in FIG. 7 is shown in FIG. 8.
  • the apparatus may generate a reference picture associated with a third time based on the plurality of final optical flow vectors, the third time being between the first time and the second time.
  • the final set of feature maps 511 may be input into the backward warping block 504 of picture-generation network 505.
  • the final set of optical flow vectors 513 may be input into backward warping block 504 and the feature enhancer 506 of picture-generation network 505.
  • the input pictures (first image-pyramid pair 507a) may also be input into backward warping block 504, which may warp the input pictures based on the final set of optical flow vectors 513.
  • the warped pictures may be input into feature enhancer 506, which enhances the features of the warped pictures using the final set of feature maps 511.
  • the enhanced warped pictures are input into picture synthesizer 508, which synthesizes the enhanced warped pictures into output picture 515 (e.g., a reference picture) .
  • Both feature enhancer 506 and picture synthesizer 508 include a U-net architecture as their basic structure and are equipped with depthwise-separable convolution (DSC) blocks, as shown in FIG. 9.
  • DSC depthwise-separable convolution
  • Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a processor, such as processor 102 in FIGs. 1 and 2.
  • computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, HDD, such as magnetic disk storage or other magnetic storage devices, Flash drive, SSD, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer.
  • Disk and disc include CD, laser disc, optical disc, digital video disc (DVD) , and floppy disk where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • a method of video encoding or decoding may include obtaining, by a processor, a first input picture associated with a first time and a second input picture associated with a second time.
  • the first input picture and the second input picture may form a first image-pyramid pair.
  • the method may include generating, by the processor, a second image-pyramid pair by performing a first filtering of the first image-pyramid pair.
  • the first image-pyramid pair and the second image-pyramid pair may be associated with different resolutions.
  • the method may include generating, by the processor, a third image-pyramid pair by performing a second filtering of the second image-pyramid pair.
  • the first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution.
  • the method may include obtaining, by the processor, a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network.
  • the first set of feature maps, the second set of feature maps, and the third set of feature maps may be associated with a same resolution.
  • the method may include computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps.
  • the method may include generating, by the processor, a reference picture associated with a third time based on the plurality of final optical flow vectors. The third time may be between the first time and the second time or after the first time and the second time.
  • the computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the third set of feature maps into a first motion estimation block.
  • the computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, a first set of optical flow vectors associated with the third set of feature maps using the first motion estimation block.
  • the computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the first set of optical flow vectors into a first warping block, a second motion estimation block, and a first summation block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the second set of feature maps into the first warping block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, a second set of optical flow vectors associated with the second set of feature maps using the first warping block, the second motion estimation block, and the first summation block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include upsampling, by the processor, the second set of optical flow vectors to generate an upsampled second set of optical flow vectors.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the upsampled second set of optical flow vectors into a second warping block, a third motion estimation block, and a second summation block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the first set of feature maps into the second warping block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, a third set of optical flow vectors associated with the first set of feature maps using the second warping block, the third motion estimation block, and the second summation block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the third set of optical flow vectors into a third warping block, a fourth motion estimation block, and a third summation block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include upsampling, by the processor, the first set of feature maps, the second set of feature maps, and the third set of feature maps to generate an upsampled first set of feature maps, an upsampled second set of feature maps, and an upsampled third set of feature maps.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the upsampled first set of feature maps into the third warping block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, a fourth set of optical flow vectors using the third warping block, the fourth motion estimation block, and the third summation block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the fourth set of optical flow vectors into a fourth warping block, a fifth motion estimation block, and a fourth summation block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the upsampled second set of feature maps into the fourth warping block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, a fifth set of optical flow vectors using the fourth warping block, the fifth motion estimation block, and the fourth summation block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include upsampling, by the processor, the fifth set of optical flow vectors to generate an upsampled fifth set of optical flow vectors.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the upsampled fifth set of optical flow vectors into a fifth warping block, a sixth motion estimation block, and a fifth summation block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the upsampled third set of feature maps into the fifth warping block.
  • the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, the plurality of final optical flow vectors using the fifth warping block, the sixth motion estimation block, and the fifth summation block.
  • the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include inputting, by the processor, the first input picture, the second input picture, the upsampled third set of feature maps, and the plurality of final optical flow vectors into a sixth warping block.
  • the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include performing, by the processor, a warping of the first input picture and the second input picture based on the upsampled third set of feature maps and the plurality of final optical flow vectors using the sixth warping block to generate a warped picture.
  • the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include inputting, by the processor, the upsampled third set of feature maps and the warped picture into a feature enhancer block.
  • the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include generating, by the processor, an enhanced warped picture using the feature enhancer block.
  • the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include inputting, by the processor, the enhanced warped picture into a picture synthesizer block.
  • the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include generating, by the processor, the reference picture using the picture synthesizer block.
  • an apparatus for video encoding or decoding may include a processor and memory storing instructions.
  • the memory storing instructions, which when executed by the processor, may cause the processor to obtain a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a second image-pyramid pair by performing a first filtering of the first image-pyramid pair.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a third image-pyramid pair by performing a second filtering of the second image-pyramid pair.
  • the first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution.
  • the memory storing instructions, which when executed by the processor, may cause the processor to obtain a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network.
  • the first set of feature maps, the second set of feature maps, and the third set of feature maps may be associated with a same resolution.
  • the memory storing instructions, which when executed by the processor, may cause the processor to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a reference picture associated with a third time based on the plurality of final optical flow vectors.
  • the third time may be between the first time and the second time or after the first time and the second time.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the third set of feature maps into a first motion estimation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a first set of optical flow vectors associated with the third set of feature maps using the first motion estimation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the first set of optical flow vectors into a first warping block, a second motion estimation block, and a first summation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the second set of feature maps into the first warping block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a second set of optical flow vectors associated with the second set of feature maps using the first warping block, the second motion estimation block, and the first summation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to upsample the second set of optical flow vectors to generate an upsampled second set of optical flow vectors.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled second set of optical flow vectors into a second warping block, a third motion estimation block, and a second summation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the first set of feature maps into the second warping block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a third set of optical flow vectors associated with the first set of feature maps using the second warping block, the third motion estimation block, and the second summation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the third set of optical flow vectors into a third warping block, a fourth motion estimation block, and a third summation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to upsample the first set of feature maps, the second set of feature maps, and the third set of feature maps to generate an upsampled first set of feature maps, an upsampled second set of feature maps, and an upsampled third set of feature maps.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled first set of feature maps into the third warping block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a fourth set of optical flow vectors using the third warping block, the fourth motion estimation block, and the third summation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the fourth set of optical flow vectors into a fourth warping block, a fifth motion estimation block, and a fourth summation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled second set of feature maps into the fourth warping block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate a fifth set of optical flow vectors using the fourth warping block, the fifth motion estimation block, and the fourth summation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to upsample the fifth set of optical flow vectors to generate an upsampled fifth set of optical flow vectors.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled fifth set of optical flow vectors into a fifth warping block, a sixth motion estimation block, and a fifth summation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled third set of feature maps into the fifth warping block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to generate the plurality of final optical flow vectors using the fifth warping block, the sixth motion estimation block, and the fifth summation block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the first input picture, the second input picture, the upsampled third set of feature maps, and the plurality of final optical flow vectors into a sixth warping block.
  • the memory storing instructions, which when executed by the processor, may cause the processor to perform a warping of the first input picture and the second input picture based on the upsampled third set of feature maps and the plurality of final optical flow vectors using the sixth warping block to generate a warped picture.
  • the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled third set of feature maps and the warped picture into a feature enhancer block.
  • to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the memory storing instructions, which when executed by the processor, may cause the processor to generate an enhanced warped picture using the feature enhancer block.
  • to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the memory storing instructions, which when executed by the processor, may cause the processor to input the enhanced warped picture into a picture synthesizer block.
  • to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the memory storing instructions, which when executed by the processor, may cause the processor to generate the reference picture using the picture synthesizer block.
  • a non-transitory computer-readable medium storing instructions.
  • the instructions, which when executed by the processor, may cause the processor to obtain a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair.
  • the instructions, which when executed by the processor, may cause the processor to generate a second image-pyramid pair by performing a first filtering of the first image-pyramid pair.
  • the instructions, which when executed by the processor, may cause the processor to generate a third image-pyramid pair by performing a second filtering of the second image-pyramid pair.
  • the first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution.
  • the instructions, which when executed by the processor, may cause the processor to obtain a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network.
  • the first set of feature maps, the second set of feature maps, and the third set of feature maps may be associated with a same resolution.
  • the instructions, which when executed by the processor, may cause the processor to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps.
  • the instructions, which when executed by the processor, may cause the processor to generate a reference picture associated with a third time based on the plurality of final optical flow vectors.
  • the third time may be between the first time and the second time or after the first time and the second time.
  • the instructions, which when executed by the processor, may cause the processor to input the third set of feature maps into a first motion estimation block.
  • the instructions, which when executed by the processor, may cause the processor to generate a first set of optical flow vectors associated with the third set of feature maps using the first motion estimation block.
  • the instructions, which when executed by the processor, may cause the processor to input the first set of optical flow vectors into a first warping block, a second motion estimation block, and a first summation block.
  • the instructions, which when executed by the processor, may cause the processor to input the second set of feature maps into the first warping block.
  • the instructions, which when executed by the processor, may cause the processor to generate a second set of optical flow vectors associated with the second set of feature maps using the first warping block, the second motion estimation block, and the first summation block.
  • the instructions, which when executed by the processor, may cause the processor to upsample the second set of optical flow vectors to generate an upsampled second set of optical flow vectors.
  • the instructions, which when executed by the processor, may cause the processor to input the upsampled second set of optical flow vectors into a second warping block, a third motion estimation block, and a second summation block.
  • the instructions, which when executed by the processor, may cause the processor to input the first set of feature maps into the second warping block.
  • the instructions, which when executed by the processor, may cause the processor to generate a third set of optical flow vectors associated with the first set of feature maps using the second warping block, the third motion estimation block, and the second summation block.
  • the instructions, which when executed by the processor, may cause the processor to input the third set of optical flow vectors into a third warping block, a fourth motion estimation block, and a third summation block.
  • the instructions, which when executed by the processor, may cause the processor to upsample the first set of feature maps, the second set of feature maps, and the third set of feature maps to generate an upsampled first set of feature maps, an upsampled second set of feature maps, and an upsampled third set of feature maps.
  • the instructions, which when executed by the processor, may cause the processor to input the upsampled first set of feature maps into the third warping block.
  • the instructions, which when executed by the processor, may cause the processor to generate a fourth set of optical flow vectors using the third warping block, the fourth motion estimation block, and the third summation block.
  • the instructions, which when executed by the processor, may cause the processor to input the fourth set of optical flow vectors into a fourth warping block, a fifth motion estimation block, and a fourth summation block.
  • the instructions, which when executed by the processor, may cause the processor to input the upsampled second set of feature maps into the fourth warping block.
  • the instructions, which when executed by the processor, may cause the processor to generate a fifth set of optical flow vectors using the fourth warping block, the fifth motion estimation block, and the fourth summation block.
  • the instructions, which when executed by the processor, may cause the processor to upsample the fifth set of optical flow vectors to generate an upsampled fifth set of optical flow vectors.
  • the instructions, which when executed by the processor, may cause the processor to input the upsampled fifth set of optical flow vectors into a fifth warping block, a sixth motion estimation block, and a fifth summation block.
  • the instructions, which when executed by the processor, may cause the processor to input the upsampled third set of feature maps into the fifth warping block.
  • the instructions, which when executed by the processor, may cause the processor to generate the plurality of final optical flow vectors using the fifth warping block, the sixth motion estimation block, and the fifth summation block.
  • the instructions, which when executed by the processor, may cause the processor to input the first input picture, the second input picture, the upsampled third set of feature maps, and the plurality of final optical flow vectors into a sixth warping block.
  • the instructions, which when executed by the processor, may cause the processor to perform a warping of the first input picture and the second input picture based on the upsampled third set of feature maps and the plurality of final optical flow vectors using the sixth warping block to generate a warped picture.
  • the instructions, which when executed by the processor, may cause the processor to input the upsampled third set of feature maps and the warped picture into a feature enhancer block.
  • the instructions, which when executed by the processor, may cause the processor to generate an enhanced warped picture using the feature enhancer block.
  • the instructions, which when executed by the processor, may cause the processor to input the enhanced warped picture into a picture synthesizer block.
  • the instructions, which when executed by the processor, may cause the processor to generate the reference picture using the picture synthesizer block.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

According to one aspect of the present disclosure, a method of video encoding or decoding is provided. The method may include obtaining a first set of feature maps by inputting a first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting a second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting a third image-pyramid pair into a third feature-extraction network. The first, second, and third sets of feature maps may be associated with a same resolution. The method may include computing a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first, second, and third sets of feature maps. The method may include generating a reference picture associated with a third time based on the plurality of final optical flow vectors, the third time being between the first time and the second time or after the first time and the second time.

Description

SYSTEM AND METHOD FOR UNIFIED REFERENCE PICTURE SYNTHESIS BACKGROUND
Embodiments of the present disclosure relate to video coding.
Digital video has become mainstream and is being used in a wide range of applications including digital television, video telephony, and teleconferencing. These digital video applications are feasible because of the advances in computing and communication technologies as well as efficient video coding techniques. Various video coding techniques may be used to compress video data, such that coding on the video data can be performed using one or more video coding standards. Exemplary video coding standards may include, but are not limited to, versatile video coding (H.266/VVC), high-efficiency video coding (H.265/HEVC), advanced video coding (H.264/AVC), and moving picture expert group (MPEG) coding, to name a few.
SUMMARY
According to one aspect of the present disclosure, a method of video encoding or decoding is provided. The method may include obtaining, by a processor, a first input picture associated with a first time and a second input picture associated with a second time. The first input picture and the second input picture may form a first image-pyramid pair. The method may include generating, by the processor, a second image-pyramid pair by performing a first bicubic downsampling of the first image-pyramid pair. The first image-pyramid pair and the second image-pyramid pair may be associated with different resolutions. The method may include generating, by the processor, a third image-pyramid pair by performing a second bicubic downsampling of the second image-pyramid pair. The first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution. The method may include obtaining, by the processor, a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network. The first set of feature maps, the second set of feature maps, and the third set of feature maps may be associated with a same resolution. The method may include computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps. The method may include generating, by the processor, a reference picture associated with a third time based on the plurality of final optical flow vectors. The third time may be between the first time and the second time or after the first time and the second time.
According to another aspect of the present disclosure, an apparatus for video encoding or decoding is provided. The apparatus may include a processor and memory storing instructions. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair. The memory storing instructions, which when executed by the processor, may cause the processor to generate a second image-pyramid pair by performing a first bicubic downsampling of the first image-pyramid pair. The first image-pyramid pair and the second image-pyramid pair may be associated with different resolutions. The memory storing instructions, which when executed by the processor, may cause the processor to generate a third image-pyramid pair by performing a second bicubic downsampling of the second image-pyramid pair. The first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network. The first set of feature maps, the second set of feature maps, and the third set of feature maps may be associated with a same resolution. The memory storing instructions, which when executed by the processor, may cause the processor to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps. The memory storing instructions, which when executed by the processor, may cause the processor to generate a reference picture associated with a third time based on the plurality of final optical flow vectors. The third time may be between the first time and the second time or after the first time and the second time.
According to a further aspect of the present disclosure, a non-transitory computer-readable medium storing instructions is provided. The instructions, which when executed by the processor, may cause the processor to obtain a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair. The instructions, which when executed by the processor, may cause the processor to generate a second image-pyramid pair by performing a first bicubic downsampling of the first image-pyramid pair. The first image-pyramid pair and the second image-pyramid pair may be associated with different resolutions. The instructions, which when executed by the processor, may cause the processor to generate a third image-pyramid pair by performing a second bicubic downsampling of the second image-pyramid pair. The first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution. The instructions, which when executed by the processor, may cause the processor to obtain a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network. The first set of feature maps, the second set of feature maps, and the third set of feature maps may be associated with a same resolution. The instructions, which when executed by the processor, may cause the processor to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps. The instructions, which when executed by the processor, may cause the processor to generate a reference picture associated with a third time based on the plurality of final optical flow vectors. The third time may be between the first time and the second time or after the first time and the second time.
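To make the pyramid construction and same-resolution feature extraction described above more concrete, the following Python (PyTorch-style) sketch builds three image-pyramid pairs by bicubic downsampling and feeds each pair into its own feature-extraction network. The scale factors, channel counts, strides, and the FeatureExtractor module are illustrative assumptions for this sketch, not details taken from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_image_pyramid_pairs(pic0, pic1):
    """Form three image-pyramid pairs at full, 1/2, and 1/4 resolution via
    bicubic downsampling (the scale factors are an assumption)."""
    down = lambda x: F.interpolate(x, scale_factor=0.5, mode="bicubic",
                                   align_corners=False)
    pair1 = (pic0, pic1)
    pair2 = (down(pic0), down(pic1))
    pair3 = (down(pair2[0]), down(pair2[1]))
    return pair1, pair2, pair3

class FeatureExtractor(nn.Module):
    """Hypothetical feature-extraction network; the stride is chosen so that
    all three pyramid levels produce feature maps of the same resolution."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1), nn.PReLU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        )

    def forward(self, x):
        return self.net(x)

# Usage: pictures are (N, 3, H, W); each pair is concatenated on channels.
pic0, pic1 = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
pairs = build_image_pyramid_pairs(pic0, pic1)
extractors = [FeatureExtractor(6, 64, s) for s in (4, 2, 1)]
feature_sets = [fe(torch.cat(p, dim=1)) for fe, p in zip(extractors, pairs)]
# All three feature sets come out at 1/4 of the full resolution (32x32 here).
```

With strides of 4, 2, and 1 on the full-, half-, and quarter-resolution pairs, the three feature sets land at the same spatial size, which is the precondition for the progressive-recursive motion estimation that follows.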
These illustrative embodiments are mentioned not to limit or define the present disclosure, but to provide examples to aid understanding thereof. Additional embodiments are described in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, further serve to explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure.
FIG. 1 illustrates a block diagram of an exemplary encoding system, according to some embodiments of the present disclosure.
FIG. 2 illustrates a block diagram of an exemplary decoding system, according to  some embodiments of the present disclosure.
FIG. 3 illustrates a detailed block diagram of an exemplary video-coding framework, according to some embodiments of the present disclosure.
FIG. 4 illustrates a detailed block diagram for training and use of a unified reference frame synthesis (URFS) model included in the video-coding framework of FIG. 3, according to some embodiments of the present disclosure.
FIG. 5 illustrates a detailed block diagram of the URFS model that may be included in the video-coding framework of FIG. 3, according to some embodiments of the present disclosure.
FIG. 6 illustrates a detailed block diagram of an exemplary image-feature pyramids network of the URFS model of FIG. 5, according to some embodiments of the present disclosure.
FIG. 7 illustrates a detailed block diagram of an exemplary progressive-recursive motion estimation network of the URFS model of FIG. 5, according to some embodiments of the present disclosure.
FIG. 8 illustrates a diagram of the optical-flow vectors and feature maps generated by the progressive-recursive motion estimation network of FIG. 7, according to some embodiments of the present disclosure.
FIG. 9 illustrates a detailed block diagram of an exemplary picture enhancer and picture synthesizer architecture that may be included in the URFS model of FIG. 5, according to some embodiments of the present disclosure.
FIG. 10 illustrates diagrams of an exemplary random access (RA) mode and an exemplary low delay B (LDB) mode of the URFS model of FIG. 5, according to some embodiments of the present disclosure.
FIG. 11 illustrates diagrams of the time steps of input pictures using the RA mode and the LDB mode of the URFS model, according to some embodiments of the present disclosure.
FIG. 12 illustrates a flow chart of an exemplary method of video encoding or decoding, according to some aspects of the present disclosure.
Embodiments of the present disclosure will be described with reference to the accompanying drawings.
DETAILED DESCRIPTION
Although some configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the present disclosure. It will be apparent to a person skilled in the pertinent art that the present disclosure can also be employed in a variety of other applications.
It is noted that references in the specification to “one embodiment, ” “an embodiment, ” “an example embodiment, ” “some embodiments, ” “certain embodiments, ” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
In general, terminology may be understood at least in part from usage in context. For example, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a, ” “an, ” or “the, ” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be  understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
Various aspects of video coding systems will now be described with reference to various apparatus and methods. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various modules, components, circuits, steps, operations, processes, algorithms, etc. (collectively referred to as “elements” ) . These elements may be implemented using electronic hardware, firmware, computer software, or any combination thereof. Whether such elements are implemented as hardware, firmware, or software depends upon the particular application and design constraints imposed on the overall system.
The techniques described herein may be used for various video coding applications. As described herein, video coding includes both encoding and decoding a video. Encoding and decoding of a video can be performed by the unit of block. For example, an encoding/decoding process such as transform, quantization, prediction, in-loop filtering, reconstruction, or the like may be performed on a coding block, a transform block, or a prediction block. As described herein, a block to be encoded/decoded will be referred to as a “current block. ” For example, the current block may represent a coding block, a transform block, or a prediction block according to a current encoding/decoding process. In addition, it is understood that the term “unit” used in the present disclosure indicates a basic unit for performing a specific encoding/decoding process, and the term “block” indicates a sample array of a predetermined size. Unless otherwise stated, the “block, ” “unit, ” and “component” may be used interchangeably.
Recently, deep learning-based video picture interpolation and extrapolation have received considerable attention, and several studies have been devoted to exploring deep reference picture generation strategies for inter-frame prediction tasks in video coding (such as HEVC and VVC) .
To generate a reference block for motion estimation and motion compensation in existing inter-frame coding techniques, neural reference synthesis (NRS) may be used. While NRS achieves some coding gains, it is designed for HEVC rather than for VVC, the latest video coding standard.
For inter-frame prediction, an error-corrected auto-regressive network (ECAR-net) may be used. Although this scheme enhances HEVC inter-frame prediction coding performance, it is unable to achieve the same degree of enhancement for VVC.
To optimize inter-prediction in VVC, a deep reference frame (DRF) generation method may be used. Although the DRF achieves improved coding performance, its optical flow estimation module uses an off-the-shelf pre-trained model (the Intermediate Feature Refine Network (IFRNet)), which leads to sub-optimal results because the entire framework is unable to perform end-to-end optimization. The IFRNet may be used in both the RA mode and the LDB mode.
To overcome these and other challenges, the present disclosure provides a unified reference picture generation framework for VVC inter-prediction under RA and LDB configurations. For example, the present disclosure performs optical-flow estimation for motion estimation using two pyramid levels: an image-pyramid level and a feature-pyramid level. This enables the present network to recursively calculate the optical flows between features with the same resolution, and also progressively calculate the optical flows between features with different resolutions, achieving a true coarse-to-fine refinement of motion estimation.
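The following PyTorch-style sketch illustrates the coarse-to-fine idea just described: flow is first estimated from the coarsest same-resolution features, then repeatedly refined by warping the next set of features with the current flow, estimating a residual flow, and summing the two; when the process moves to a finer resolution, the flow is upsampled and rescaled by 2 (the rescaling is a common convention and an assumption here). The MotionEstimator module, channel sizes, and loop structure are placeholders rather than the exact warping, motion estimation, and summation blocks of the disclosed network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEstimator(nn.Module):
    """Placeholder motion-estimation block: maps concatenated features
    (and optionally the current flow) to a 2-channel flow field."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.PReLU(),
            nn.Conv2d(64, 2, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def backward_warp(feat, flow):
    """Warp feature maps with a dense flow field (pixel-unit dx, dy)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                                       # (N, 2, H, W)
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0              # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

def progressive_recursive_flow(feature_sets, estimators):
    """feature_sets: same-resolution feature maps ordered coarse to fine
    (each holding the channel-concatenated features of the two pictures),
    optionally followed by upsampled versions for the finer resolution.
    estimators: one MotionEstimator per stage with matching input channels."""
    flow = estimators[0](feature_sets[0])                      # first estimate
    for feat, est in zip(feature_sets[1:], estimators[1:]):
        if feat.shape[-1] != flow.shape[-1]:                   # progressive step
            flow = 2.0 * F.interpolate(flow, size=feat.shape[-2:],
                                       mode="bilinear", align_corners=False)
        warped = backward_warp(feat, flow)                     # warping block
        flow = flow + est(torch.cat((warped, flow), dim=1))    # estimate + sum
    return flow
```

The loop makes the two mechanisms explicit: while the resolution stays the same the flow is refined recursively by residual estimates, and whenever the features change resolution the flow is first upsampled, which is the progressive part of the refinement.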
FIG. 1 illustrates a block diagram of an exemplary encoding system 100, according to some embodiments of the present disclosure. FIG. 2 illustrates a block diagram of an exemplary decoding system 200, according to some embodiments of the present disclosure. Each system 100 or 200 may be applied or integrated into various systems and apparatus capable of data processing, such as computers and wireless communication devices. For example, system 100 or 200 may be the entirety or part of a mobile phone, a desktop computer, a laptop computer, a tablet, a vehicle computer, a gaming console, a printer, a positioning device, a wearable electronic device, a smart sensor, a virtual reality (VR) device, an augmented reality (AR) device, or any other suitable electronic devices having data processing capability. As shown in FIGs. 1 and 2, system 100 or 200 may include a processor 102, a memory 104, and an interface 106. These components are shown as connected to one another by a bus, but other connection types are also permitted. It is understood that system 100 or 200 may include any other suitable components for performing functions described herein.
Processor 102 may include microprocessors, such as graphic processing unit (GPU) , image signal processor (ISP) , central processing unit (CPU) , digital signal processor (DSP) , tensor processing unit (TPU) , vision processing unit (VPU) , neural processing unit (NPU) , synergistic processing unit (SPU) , or physics processing unit (PPU) , microcontroller units (MCUs) , application-specific integrated circuits (ASICs) , field-programmable gate arrays (FPGAs) , programmable logic devices (PLDs) , state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functions described throughout the present disclosure. Although only one processor is shown in FIGs. 1 and 2, it is understood that multiple processors can be included. Processor 102 may be a hardware device having one or more processing cores. Processor 102 may execute software. Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Software can include computer instructions written in an interpreted language, a compiled language, or machine code. Other techniques for instructing hardware are also permitted under the broad category of software.
Memory 104 can broadly include both memory (a. k. a, primary/system memory) and storage (a. k. a. secondary memory) . For example, memory 104 may include random-access memory (RAM) , read-only memory (ROM) , static RAM (SRAM) , dynamic RAM (DRAM) , ferro-electric RAM (FRAM) , electrically erasable programmable ROM (EEPROM) , compact disc read-only memory (CD-ROM) or other optical disk storage, hard disk drive (HDD) , such as magnetic disk storage or other magnetic storage devices, Flash drive, solid-state drive (SSD) , or any other medium that can be used to carry or store desired program code in the form of instructions that can be accessed and executed by processor 102. Broadly, memory 104 may be embodied by any computer-readable medium, such as a non-transitory computer-readable medium. Although only one memory is shown in FIGs. 1 and 2, it is understood that multiple memories can be included.
Interface 106 can broadly include a data interface and a communication interface that is configured to receive and transmit a signal in a process of receiving and transmitting information with other external network elements. For example, interface 106 may include input/output (I/O) devices and wired or wireless transceivers. Although only one interface is shown in FIGs. 1 and 2, it is understood that multiple interfaces can be included.
Processor 102, memory 104, and interface 106 may be implemented in various forms in system 100 or 200 for performing video coding functions. In some embodiments, processor 102, memory 104, and interface 106 of system 100 or 200 are implemented (e.g., integrated) on one or more system-on-chips (SoCs) . In one example, processor 102, memory 104, and interface 106 may be integrated on an application processor (AP) SoC that handles application processing in an operating system (OS) environment, including running video encoding and decoding applications. In another example, processor 102, memory 104, and interface 106 may be integrated on a specialized processor chip for video coding, such as a GPU or ISP chip dedicated to image and video processing in a real-time operating system (RTOS) .
As shown in FIG. 1, in encoding system 100, processor 102 may include one or more modules, such as an encoder 101 (also referred to herein as a “pre-processing network” ) . Although FIG. 1 shows that encoder 101 is within one processor 102, it is understood that encoder  101 may include one or more sub-modules that can be implemented on different processors located closely or remotely with each other. Encoder 101 (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 102 designed for use with other components or software units implemented by processor 102 through executing at least part of a program, i.e., instructions. The instructions of the program may be stored on a computer-readable medium, such as memory 104, and when executed by processor 102, it may perform a process having one or more functions related to video encoding, such as picture partitioning, inter-prediction, intra-prediction, transformation, quantization, filtering, entropy encoding, etc., as described below in detail.
Similarly, as shown in FIG. 2, in decoding system 200, processor 102 may include one or more modules, such as a decoder 201 (also referred to herein as a “post-processing network” ) . Although FIG. 2 shows that decoder 201 is within one processor 102, it is understood that decoder 201 may include one or more sub-modules that can be implemented on different processors located closely or remotely with each other. Decoder 201 (and any corresponding sub-modules or sub-units) can be hardware units (e.g., portions of an integrated circuit) of processor 102 designed for use with other components or software units implemented by processor 102 through executing at least part of a program, i.e., instructions. The instructions of the program may be stored on a computer-readable medium, such as memory 104, and when executed by processor 102, it may perform a process having one or more functions related to video decoding, such as entropy decoding, inverse quantization, inverse transformation, inter-prediction, intra-prediction, filtering, as described below in detail.
FIG. 3 illustrates a detailed block diagram of an exemplary video-coding network 300 (referred to hereinafter as “video-coding network 300” ) , according to some embodiments of the present disclosure.
As shown in FIG. 3, video-coding network 300 may include, e.g., a transformation unit 302, a quantization component 304, an entropy coding unit 306, a scaling unit 308, an inverse transform unit 310, an in-loop filter unit 312, a decoded-picture buffer 314, a URFS model 316, an inter-prediction unit 318, an intra-prediction unit 326, an adder/subtractor 328, and an adder/subtractor 330. The inter-prediction unit 318 may include, e.g., a reference-picture list 320, a motion estimation unit 322, and a motion compensation unit 324.
To initiate video-coding operations, video-coding network 300 feeds a current picture 301 into transformation unit 302, which may encode geometry positions and attributes associated with mesh vertices separately. A geometry of a mesh may be a collection of vertices with positions X_k = (x_k, y_k, z_k), k = 1, …, K, where K is the number of vertices in the mesh, and attributes A_k = (A_1k, A_2k, …, A_Dk), k = 1, …, K, where D is the number of attributes for each vertex. In some embodiments, attribute coding depends on decoded geometry. As a consequence, mesh vertex positions may be coded first. Since geometry positions may be represented by floating-point numbers in an original coordinate system, transformation unit 302 may be configured to perform a coordinate transformation and attribute transformation (e.g., based on the results from geometry analysis). Quantization component 304 may be configured to quantize the transformed coefficients of attributes to generate quantization levels of the attributes associated with each point to reduce the dynamic range. Entropy coding unit 306 is configured to encode the resulting quantization levels of the attributes. After entropy coding, a bitstream 303 (e.g., an encoded bitstream) may be output.
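As a simple numerical illustration of the quantization step mentioned above, the sketch below maps transform coefficients to integer quantization levels with a uniform step size and scales them back on the decoder side. Real codecs use QP-dependent scaling and rounding offsets, so this is a hedged approximation rather than the exact behavior of quantization component 304.

```python
import numpy as np

def quantize(coeffs, step):
    """Uniform scalar quantization of transform coefficients into integer
    quantization levels (illustrative; the step size would normally be
    derived from the quantization parameter)."""
    return np.round(coeffs / step).astype(np.int32)

def dequantize(levels, step):
    """Decoder-side scaling (inverse quantization) of the received levels."""
    return levels.astype(np.float64) * step

# Example: a few transform coefficients quantized with step size 8.
coeffs = np.array([103.7, -12.4, 3.1, 0.4])
levels = quantize(coeffs, step=8.0)          # -> [13, -2, 0, 0]
reconstructed = dequantize(levels, step=8.0)  # -> [104., -16., 0., 0.]
```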
The quantization levels may be input to the scaling unit 308, on the decoder side. The scaling unit 308 may be configured to decode positions associated with vertices of a dynamic mesh from the geometry bitstream. In-loop filter unit 312 may be configured to generate decoded pictures, which are maintained in decoded-picture buffer 314. For example, decoded-picture buffer 314 may store a first input picture 305a and a second input picture 305b that are input into URFS model 316. URFS model 316 may be configured to generate a reference picture 307 using optical flow vectors generated by applying progressive-recursive motion estimation to the first input picture 305a and the second input picture 305b, as described below in connection with FIGs.  4-11.
Reference picture 307 may be maintained by reference-picture list 320. Motion estimation unit 322 may perform motion estimation using the reference pictures from reference-picture list 320. Motion compensation unit 324 may perform motion compensation using motion estimation information output by motion estimation unit 322. Inter-prediction unit 318 may perform inter-frame prediction for decoding, while intra-prediction unit 326 may perform intra-frame prediction for decoding.
In some implementations, inter-prediction unit 318 may derive a predicted block for the input current picture 301 based on a reference block (reference sample array) specified by a motion vector on a reference picture 307. Here, in order to reduce the amount of motion information transmitted in the inter-prediction mode, the motion information may be predicted in units of blocks, subblocks, or samples based on correlation of motion information between the neighboring block and the current block. The motion information may include a motion vector and a reference picture index. The motion information may further include inter-prediction direction (L0 prediction, L1 prediction, Bi prediction, etc. ) information. In the case of inter-prediction, the neighboring block may include a spatial neighboring block present in the input current picture 301 and a temporal neighboring block present in the reference picture 307. The reference picture 307 including the reference block and the reference picture including the temporal neighboring block may be the same or different. The temporal neighboring block may be called a collocated reference block, a co-located CU (colCU) , and the like, and the reference picture including the temporal neighboring block may be called a collocated picture (colPic) . For example, the inter-prediction unit 318 may configure a motion information candidate list based on neighboring blocks and generate information indicating which candidate is used to derive a motion vector and/or a reference picture index of the current block. Inter-prediction may be performed based on various prediction modes. For example, in the case of a skip mode and a merge mode, the inter-prediction unit 318 may use motion information of the neighboring block as motion information of the current block. In the skip mode, unlike the merge mode, the residual signal may not be transmitted. In the case of the motion vector prediction (MVP) mode, the motion vector of the neighboring block may be used as a motion vector predictor and the motion vector of the current block may be indicated by signaling a motion vector difference.
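To make the distinction between the merge-style modes and the MVP mode concrete, the following minimal sketch shows how decoder-side motion information might be recovered in each case. The candidate-list handling and signaling details are simplified assumptions for illustration, not the normative derivation.

```python
def reconstruct_motion_vector(mvp, mvd):
    """MVP-style reconstruction: add the signaled motion vector difference
    to the predictor derived from a neighboring block."""
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])

def merge_motion(candidate_list, merge_index):
    """Merge-style inheritance: reuse the motion information (motion vector
    and reference picture index) of the candidate selected by the index."""
    return candidate_list[merge_index]

# Example: predictor (4, -2) plus signaled difference (1, 3) gives (5, 1);
# in merge mode the same motion would simply be copied from a neighbor.
mv = reconstruct_motion_vector(mvp=(4, -2), mvd=(1, 3))
```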
The inter-prediction unit 318 may generate a prediction signal based on various prediction methods described below. For example, the predictor may not only apply intra-prediction or inter-prediction to predict one block but also simultaneously apply both intra-prediction and inter-prediction. This may be called combined inter and intra-prediction (CIIP) . In addition, the predictor may be based on an intra block copy (IBC) prediction mode or a palette mode for prediction of a block. The IBC prediction mode or palette mode may be used for content image/video coding of a game or the like, for example, screen content coding (SCC) . The IBC basically performs prediction in the current picture but may be performed similarly to inter-prediction in that a reference block is derived in the current picture. That is, the IBC may use at least one of the inter-prediction techniques described in this document. The palette mode may be considered as an example of intra coding or intra-prediction. When the palette mode is applied, a sample value within a picture may be signaled based on information on the palette table and the palette index.
The prediction signal generated by the inter-prediction unit 318 may be used to generate a reconstructed signal or to generate a residual signal. The transformation unit 302 may generate transform coefficients by applying a transform technique to the residual signal. For example, the transform technique may include at least one of a discrete cosine transform (DCT), a discrete sine transform (DST), a Karhunen-Loève transform (KLT), a graph-based transform (GBT), or a conditionally non-linear transform (CNT). Here, the GBT means a transform obtained from a graph when relationship information between pixels is represented by the graph. The CNT refers to a transform generated based on a prediction signal generated using all previously reconstructed pixels. In addition, the transform process may be applied to square pixel blocks having the same size or may be applied to blocks having a variable size rather than being square.
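As an example of one of the listed transform techniques, the sketch below applies a separable orthonormal DCT-II to a square residual block. The exact integer transform kernels used by a particular standard differ, so this is illustrative only.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n (generic example)."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def transform_residual(residual):
    """2-D separable DCT of a square residual block: C @ R @ C^T."""
    c = dct_matrix(residual.shape[0])
    return c @ residual @ c.T

# Example: transform an 8x8 residual block into 8x8 frequency coefficients.
residual = np.random.randn(8, 8)
coeffs = transform_residual(residual)
```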
The quantization unit 304 may quantize the transform coefficients and transmit them to the entropy coding unit 306, which may encode the quantized signal (information on the quantized transform coefficients) into the output bitstream 303. The information on the quantized transform coefficients may be referred to as residual information. The quantization unit 304 may rearrange block-type quantized transform coefficients into a one-dimensional vector form based on a coefficient scanning order and generate information on the quantized transform coefficients based on the quantized transform coefficients in the one-dimensional vector form. The entropy coding unit 306 may perform various encoding methods such as, for example, exponential Golomb, context-adaptive variable length coding (CAVLC), context-adaptive binary arithmetic coding (CABAC), and the like. The entropy coding unit 306 may encode information necessary for video/image reconstruction other than quantized transform coefficients (e.g., values of syntax elements) together or separately. Encoded information (e.g., encoded video/image information) may be transmitted or stored in units of network abstraction layer (NAL) units in the form of a bitstream. The video/image information may further include information on various parameter sets such as an adaptation parameter set (APS), a picture parameter set (PPS), a sequence parameter set (SPS), or a video parameter set (VPS). In addition, the video/image information may further include general constraint information. In this document, information and/or syntax elements transmitted/signaled from the encoding apparatus to the decoding apparatus may be included in video/picture information. The video/image information may be encoded through the above-described encoding procedure and included in the bitstream. The bitstream may be transmitted over a network or may be stored in a digital storage medium. The network may include a broadcasting network and/or a communication network, and the digital storage medium may include various storage media such as USB, SD, CD, DVD, Blu-ray, HDD, SSD, and the like. A transmitter (not shown) transmitting a signal output from the entropy coding unit 306 and/or a storage unit (not shown) storing the signal may be included as an internal/external element of the video-coding network 300, and alternatively, the transmitter may be included in the entropy coding unit 306.
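The coefficient-scanning step mentioned above can be pictured with the following sketch, which flattens a 2-D block of quantized levels into a 1-D vector along anti-diagonals. The actual scan order is codec-defined and may differ; this is only an illustrative ordering.

```python
import numpy as np

def diagonal_scan(block):
    """Rearrange a 2-D block of quantized coefficients into a 1-D vector by
    scanning anti-diagonals from the top-left corner (illustrative order)."""
    h, w = block.shape
    out = []
    for s in range(h + w - 1):          # each anti-diagonal has constant y + x
        for y in range(h):
            x = s - y
            if 0 <= x < w:
                out.append(block[y, x])
    return np.array(out)

# Example: a 4x4 block of quantized levels becomes a 16-element vector.
levels = np.arange(16).reshape(4, 4)
vector = diagonal_scan(levels)
```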
The quantized transform coefficients output from the quantization unit 304 may be used to generate a prediction signal. For example, the residual signal (residual block or residual samples) may be reconstructed by applying dequantization and inverse transform to the quantized transform coefficients through the scaling unit 308 and the inverse transformation unit 310. The adder/subtractor 330 adds the reconstructed residual signal to the prediction signal output from the inter-prediction unit 318 or the intra-prediction unit 326 to generate a reconstructed signal (reconstructed picture, reconstructed block, reconstructed sample array) . If there is no residual for the block to be processed, such as a case where the skip mode is applied, the predicted block may be used as the reconstructed block. The adder/subtractor 330 may be called a reconstructor or a reconstructed block generator. The generated reconstructed signal may be used for intra-prediction of a next block to be processed in the current picture and may be used for inter-prediction of a next picture through filtering as described below.
Meanwhile, luma mapping with chroma scaling (LMCS) may be applied during picture encoding and/or reconstruction.
The in-loop filter 312 may improve subjective/objective image quality by applying filtering to the reconstructed signal. For example, the in-loop filter 312 may generate a modified reconstructed picture by applying various filtering methods to the reconstructed picture and store the modified reconstructed picture in the decoded picture buffer 314. The various filtering methods may include, for example, deblocking filtering, a sample adaptive offset, an adaptive loop filter, a bilateral filter, and the like. The in-loop filter 312 may generate various information related to the filtering and transmit the generated information to the entropy coding unit 306. The information related to the filtering may be encoded by the entropy coding unit 306 and output in the output bitstream 303.
The modified reconstructed picture transmitted to the decoded picture buffer 314 may be sent to the reference picture list 320, which is later updated with the reference picture 307 generated by the URFS model 316. When inter-prediction is applied by the encoding apparatus in this manner, prediction mismatch between the encoding apparatus and the decoding apparatus may be avoided and encoding efficiency may be improved.
FIG. 4 illustrates a detailed block diagram 400 for training and use of the URFS model 316 included in video-coding network 300 (e.g., VVC encoder/decoder) of FIG. 3, according to some embodiments of the present disclosure. Referring to FIG. 4, in the training stage 420, QP distance 407 is used to train the URFS model 316. The QP distance 407 training strategy uses pictures compressed by the VVC test model (VTM) with smaller artifacts as labels for the output pictures 403a generated by the URFS model 316 from the input pictures 401a, instead of using uncompressed pictures (e.g., ground truth pictures 405) as labels for training.
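A minimal training-step sketch is shown below, assuming the label is a VTM reconstruction compressed at a smaller QP (hence with smaller artifacts) and assuming an L1 loss; the function names, tensor shapes, and loss choice are illustrative assumptions rather than the disclosed training procedure.

```python
import torch
import torch.nn.functional as F

def training_step(urfs_model, inputs, vtm_label_small_qp, optimizer):
    """One optimization step of a QP-distance-style strategy: the target is a VTM
    reconstruction compressed at a smaller QP (fewer artifacts) rather than the
    uncompressed ground truth. All tensor shapes and names are assumptions."""
    optimizer.zero_grad()
    synthesized = urfs_model(inputs)              # e.g., two pictures plus QP maps
    loss = F.l1_loss(synthesized, vtm_label_small_qp)
    loss.backward()
    optimizer.step()
    return loss.item()
```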
In the online inference stage 430, URFS model 316 is first integrated into the VTM software to ensure online use. For instance, two reconstructed pictures 401b from decoded picture buffer (DPB) 314, together with the QP maps, are fed into the URFS model 316 to synthesize an output picture 403b (e.g., a reference picture) . Output picture 403b is inserted into the reference picture list (RPL) as a replacement to mitigate motion-related artifacts.
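The following sketch illustrates one way this online inference step could be wired up, assuming the two reconstructed pictures and their QP maps are concatenated along the channel dimension and the synthesized picture overwrites one RPL slot; the input packing and variable names are assumptions.

```python
import torch

@torch.no_grad()
def synthesize_and_replace(urfs_model, dpb_picture_a, dpb_picture_b,
                           qp_map_a, qp_map_b, reference_picture_list, slot):
    """Online inference sketch (names/shapes are assumptions): two reconstructed
    pictures from the DPB and their QP maps are fed to the URFS model, and the
    synthesized picture replaces one entry of the reference picture list."""
    model_input = torch.cat([dpb_picture_a, dpb_picture_b, qp_map_a, qp_map_b], dim=1)
    synthesized = urfs_model(model_input)
    reference_picture_list[slot] = synthesized   # used for subsequent inter-prediction
    return synthesized
```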
URFS model 316 has the same network architecture in both the RA and LDB configurations (e.g., a unified model), while JVET-AD0160 uses different network architectures in the RA and LDB configurations. URFS model 316 does not require changing motion estimation in VTM, and it does not require extra information to be signaled in the bitstream.
FIG. 5 illustrates a detailed block diagram 500 of the URFS model 316 that may be included in the video-coding network 300 of FIG. 3, according to some embodiments of the present disclosure. Referring to FIG. 5, the URFS model 316 may include an image-feature pyramids network 501, a progressive-recursive motion-estimation network 503 (referred to hereinafter as “motion-estimation network 503” ) , and a picture-generation network 505.
Image-feature pyramids network 501 generates image-level pyramids and feature-level pyramids, which collectively may overcome the challenges of large-scale motion while reducing computational complexity. For instance, image-feature pyramids network 501 may receive a pair of input pictures associated with a first resolution, which may be used as a first image-pyramid pair 507a. Bicubic downsampling of first image-pyramid pair 507a may be performed to obtain a second image-pyramid pair 507b of a second resolution lower than the first resolution. Bicubic downsampling of the second image-pyramid pair 507b may be performed to obtain a third image-pyramid pair 507c, and so on. For ease of illustration and depiction, only image-pyramid pairs of levels 5, 4, and 3 are shown and described in connection with FIG. 5. However, in practice, levels 5 through 1 may be implemented by image-feature pyramids network 501. Also, more or fewer than five levels may be implemented by image-feature pyramids network 501 without departing from the scope of the present disclosure.
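As an illustrative PyTorch sketch of this pyramid construction (the per-level 2x downsampling factor and NCHW tensor layout are assumptions), the image-level pyramid pairs may be built by repeated bicubic downsampling:

```python
import torch
import torch.nn.functional as F

def build_image_pyramid(picture_a: torch.Tensor, picture_b: torch.Tensor, levels: int = 3):
    """Build image-level pyramid pairs by repeated bicubic downsampling.
    Tensors are NCHW; the first entry holds the full-resolution pair."""
    pyramid = [(picture_a, picture_b)]
    for _ in range(levels - 1):
        a, b = pyramid[-1]
        a = F.interpolate(a, scale_factor=0.5, mode='bicubic', align_corners=False)
        b = F.interpolate(b, scale_factor=0.5, mode='bicubic', align_corners=False)
        pyramid.append((a, b))
    return pyramid
```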
Still referring to FIG. 5, each image-pyramid level may have a corresponding feature extractor configured to generate feature maps from its image-pyramids. The feature maps generated at each level may have the same resolution. For instance, the level-5 feature extractor 502a may generate a first set of feature maps 509a from first image-pyramid pair 507a, the level-4 feature extractor 502b may generate a second set of feature maps 509b from second image-pyramid pair 507b, and the level-3 feature extractor 502c may generate a third set of feature maps 509c from third image-pyramid pair 507c. First set of feature maps 509a, second set of feature maps 509b, and third set of feature maps 509c may be input into motion-estimation network 503. Example network structures of the feature extractors are described below in connection with FIG. 6.
FIG. 6 illustrates a detailed block diagram of image-feature pyramids network 501 of the URFS model 316 of FIG. 5, according to some embodiments of the present disclosure.
As shown in FIG. 6, the present techniques use feature-level pyramids and image-level pyramids. This is in contrast to existing techniques, which handle large-scale motion using only feature pyramids to increase receptive fields. Firstly, the image-feature pyramids network 501 performs bicubic downsampling to construct a picture-level pyramid (L=3), and then obtains a feature-level pyramid (L=5, 4, and 3) through the corresponding feature extraction networks (e.g., level-5 feature extractor 502a, level-4 feature extractor 502b, and level-3 feature extractor 502c).
As shown in FIG. 6, level-5 feature extractor 502a includes five levels of convolutional-layer pairs (e.g., either 3x3 convolutional layers, stride 2 (s=2) 3x3 convolutional layers, or 2D convolutional layers), level-4 feature extractor 502b includes four levels of convolutional-layer pairs (e.g., either 3x3 convolutional layers, stride 2 (s=2) 3x3 convolutional layers, or 2D convolutional layers), and level-3 feature extractor 502c includes three levels of convolutional-layer pairs (e.g., either 3x3 convolutional layers, stride 2 (s=2) 3x3 convolutional layers, or 2D convolutional layers). The first set of feature maps 509a is generated by the level-5 feature extractor 502a from the first image-pyramid pair 507a, the second set of feature maps 509b is generated by the level-4 feature extractor 502b from the second image-pyramid pair 507b, and the third set of feature maps 509c is generated by the level-3 feature extractor 502c from the third image-pyramid pair 507c.
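A possible PyTorch sketch of such a level feature extractor is given below; the channel widths, activation functions, and the exact composition of each convolutional-layer pair are assumptions, but the sketch preserves the property that a deeper extractor applied to a higher-resolution pyramid level yields feature maps of the same output resolution as a shallower extractor applied to a lower-resolution level.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of a pyramid-level feature extractor: `num_stages` pairs of 3x3
    convolutions, the first of each pair with stride 2, so deeper extractors
    reach the same output resolution from larger inputs."""
    def __init__(self, in_channels: int = 3, base_channels: int = 32, num_stages: int = 5):
        super().__init__()
        layers, c_in = [], in_channels
        for _ in range(num_stages):
            layers += [nn.Conv2d(c_in, base_channels, 3, stride=2, padding=1),
                       nn.PReLU(),
                       nn.Conv2d(base_channels, base_channels, 3, stride=1, padding=1),
                       nn.PReLU()]
            c_in = base_channels
        self.body = nn.Sequential(*layers)

    def forward(self, picture: torch.Tensor) -> torch.Tensor:
        return self.body(picture)

# A 5-stage extractor on a full-resolution picture and a 3-stage extractor on a
# 4x-downsampled picture both reduce the input by 1/32 of the full resolution,
# so their output feature maps have the same spatial size.
```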
Referring again to FIG. 5, motion-estimation network 503 may recursively predict the bidirectional optical flows at feature levels with the same resolution, and progressively estimate and upsample the optical flows at feature levels at different resolutions, as described below in connection with FIG. 7.
FIG. 7 illustrates a detailed block diagram of motion-estimation network 503 of the URFS model 316 of FIG. 5, according to some embodiments of the present disclosure.
Referring to FIG. 7, the motion-estimation network 503 predicts the bidirectional flows at feature levels with the same resolution, and progressively estimates and upsamples the flows at feature levels with different resolutions. Overall, the motion-estimation network 503 learns residual optical flow vectors and continuously adds them together. For ease of description and illustration, only the feature maps from level-5, level-4, and level-3 are depicted, but it is understood that five sets of feature maps at each resolution may be used for motion estimation in determining the final set of optical flow vectors 513.
For instance, the third set of feature maps 509c are input into motion estimation block1 (MEB1) 702. A time step T, which indicates the time step between the images in the image-pyramids, may also be input into MEB1 702. MEB1 702 may include a plurality of 3x3 convolutional layers and a summation operation to generate a first set of optical flow vectors from the third set of feature maps 509c. The first set of optical flow vectors may be input into the warping block 704, MEB1 702, and summation block 706 associated with the level-4 image-pyramids.
The second set of feature maps 509b may be input into the corresponding warping block 704, which may warp the second set of feature maps 509b using the first set of optical flow vectors. The warped feature maps may be input into MEB1, which generates an intermediate second set of optical flow vectors based on the warped second set of feature maps and the first set of optical flow vectors. Summation block 706 may perform a summation operation using the first set of optical flow vectors and the intermediate second set of optical flow vectors to generate a final second set of optical flow vectors. Motion-estimation network 503 may upsample the final second set of optical flow vectors. The upsampled second set of optical flow vectors may be input into the warping block 704, the motion estimation block2 (MEB2) 708, and the summation block 706 associated with the first set of feature maps 509a.
The first set of feature maps 509a may be input into the corresponding warping block 704, which may warp the first set of feature maps 509a using the upsampled second set of optical flow vectors. The warped feature maps may be input into MEB2 708, which generates an intermediate third set of optical flow vectors based on the warped first set of feature maps and the upsampled second set of optical flow vectors. Summation block 706 may perform a summation operation using the upsampled second set of optical flow vectors and the intermediate third set of optical flow vectors to generate a final third set of optical flow vectors. The final third set of optical flow vectors may be input into the warping block 704, the motion estimation block2 (MEB2) 708, and the summation block 706 associated with the next stage, in which the feature maps have been upsampled. As shown in FIG. 7, these operations continue until a final set of optical flow vectors 513 is generated. The list 800 of inputs, outputs, and optical flow vectors used/generated at each stage of motion estimation in FIG. 7 is shown in FIG. 8.
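The sketch below illustrates one refinement step of this progressive-recursive scheme in PyTorch: features are backward-warped with the current flow, a motion estimation block predicts a residual flow, the residual is summed with the current flow, and the result is upsampled (and scaled) for the next, finer level. The internal design of the motion estimation block, the single-direction flow, and the bilinear flow upsampling are assumptions.

```python
import torch
import torch.nn.functional as F

def backward_warp(feature: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a feature map (NCHW) with a dense flow field (N2HW) via grid_sample."""
    n, _, h, w = feature.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(feature.device)
    coords = grid + flow                                   # displaced sampling positions
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0    # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)  # shape: N, H, W, 2
    return F.grid_sample(feature, grid_norm, align_corners=True)

def motion_estimation_step(meb, features, prev_flow):
    """One recursive refinement at a given level: warp the features with the
    current flow, let the motion estimation block predict a residual flow, and
    sum. `meb` is any module mapping (warped features, flow) to a residual flow;
    its internal design is an assumption here."""
    warped = backward_warp(features, prev_flow)
    residual_flow = meb(torch.cat([warped, prev_flow], dim=1))
    refined_flow = prev_flow + residual_flow
    # Moving to the next, finer level: upsample the flow and scale it by 2.
    upsampled = 2.0 * F.interpolate(refined_flow, scale_factor=2,
                                    mode='bilinear', align_corners=False)
    return refined_flow, upsampled
```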
Referring again to FIG. 5, the final set of feature maps 511 may be input into the backward warping block 504 of picture-generation network 505. The final set of optical flow vectors 513 may be input into backward warping block 504 and the feature enhancer 506 of picture-generation network 505. Moreover, the input pictures (first image-pyramid pair 507a) may be input into backward warping block 504. Backward warping block 504 may warp the input pictures based on the final set of optical flow vectors 513. The warped pictures may be input into feature enhancer 506, which enhances the features of the warped pictures using the final set of feature maps 511. The enhanced warped pictures are input into picture synthesizer 508, which synthesizes the enhanced warped pictures into output picture 515 (e.g., a reference picture). Both feature enhancer 506 and picture synthesizer 508 include a U-net architecture as their basic structure and are equipped with depthwise-separable convolution (DSC) blocks, as shown in FIG. 9.
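A data-flow sketch of this picture-generation stage is shown below; the warping routine, the enhancer/synthesizer submodules, and the channel-wise concatenation of their inputs are assumptions about how the blocks might be composed.

```python
import torch
import torch.nn as nn

class PictureGeneration(nn.Module):
    """Data-flow sketch of the picture-generation stage: the two input pictures
    are backward-warped with the final bidirectional flows (warp operator passed
    in, e.g., a grid_sample-based routine), the warped pictures are enhanced with
    the final feature maps, and the synthesizer outputs the reference picture."""
    def __init__(self, warp_fn, feature_enhancer: nn.Module, picture_synthesizer: nn.Module):
        super().__init__()
        self.warp_fn = warp_fn
        self.feature_enhancer = feature_enhancer
        self.picture_synthesizer = picture_synthesizer

    def forward(self, picture_a, picture_b, flow_a, flow_b, final_features):
        warped_a = self.warp_fn(picture_a, flow_a)
        warped_b = self.warp_fn(picture_b, flow_b)
        enhanced = self.feature_enhancer(
            torch.cat([warped_a, warped_b, final_features], dim=1))
        return self.picture_synthesizer(enhanced)
```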
FIG. 9 illustrates an exemplary U-net architecture 900 that may be used for feature enhancer 506 and picture synthesizer 508 of FIG. 5, according to some embodiments of the present disclosure. For instance, the U-net architecture 900 may include a plurality of 3x3 convolutional layers 902, DSC blocks 904, max-pooling layers 906, pixel-shuffle layers 908, and a x2 DSC block 910 that includes a plurality of 1x1 convolutional layers 912 and a 3x3 DSC layer 914. For the picture-generation network 505 of FIG. 5, the feature enhancer 506 and picture synthesizer 508, which are constructed with the U-net architecture 900 as the basic backbone, organically combine motion compensation information to synthesize high-quality reference pictures. To reduce parameter complexity, DSC blocks 904/910 are used, as shown in FIG. 9.
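For reference, a depthwise-separable convolution block of the kind described above might be sketched as follows; the surrounding 1x1 layers and the activation function are assumptions, while the parameter comparison in the trailing comment shows why DSC reduces parameter complexity relative to a dense 3x3 convolution.

```python
import torch
import torch.nn as nn

class DSCBlock(nn.Module):
    """Depthwise-separable convolution block: a 3x3 depthwise convolution
    (groups == channels) wrapped by 1x1 pointwise convolutions, which needs
    far fewer parameters than a dense 3x3 convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.pointwise_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)
        self.pointwise_out = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.pointwise_out(self.depthwise(self.pointwise_in(x))))

# Parameter comparison for 64 channels (ignoring biases):
#   dense 3x3 convolution:        64 * 64 * 9 = 36,864 weights
#   depthwise 3x3 + pointwise 1x1: 64 * 9 + 64 * 64 = 4,672 weights
```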
FIG. 10 illustrates a diagram 1000 of an exemplary RA mode and an exemplary LDB mode of the URFS model of FIG. 5, according to some embodiments of the present disclosure. In the RA mode 1002a for URFS, the input pictures 1001a are non-contiguous in the time domain. The RA mode 1002a generates an output picture 1003a that is located between the two input pictures 1001a in the time domain. In the low delay B (LDB) mode 1002b for URFS, the input pictures 1001b are contiguous in the time domain. The LDB mode 1002b generates an output picture 1003b that is located after the input pictures 1001b in the time domain.
FIG. 11 illustrates diagrams of the time steps 1100 of input pictures using the RA mode and the LDB mode of the URFS model, according to some embodiments of the present disclosure. Using the RA mode, a time step of T=1, 2, 3, or 4 may be used. Using the LDB mode, a time step of T=1 is used.
Referring to FIG. 11, in a first example RA mode 1102a, a time step of T=1 between the input pictures 1101a is used. Here, the first input picture 1101a from t=1 and the second input picture 1101a from t=3 are used to generate an output picture 1103a from t=2.
In a second RA mode 1102b, a time step of T=4 is used. Here, the first input picture 1101b from t=4 and the second input picture 1101b from t=12 are used to generate an output picture 1103b from t=8.
In the LDB mode 1104, a time step of T=1 is used. Here, the first input picture 1101c from t=1 and the second input picture 1101c from t=2 are used to generate an output picture 1103c from t=3.
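The time-step examples above can be summarized by a small helper that selects the two input picture times for a given output time; the function name and signature are assumptions, and the assertions mirror the examples of FIG. 11.

```python
def select_input_times(mode: str, t_output: int, time_step: int = 1):
    """Sketch of picking the two input picture times for a given output time.
    RA: the inputs straddle the output by +/- time_step (e.g., t=1 and t=3 -> t=2).
    LDB: two consecutive past pictures extrapolate the next one
    (e.g., t=1 and t=2 -> t=3)."""
    if mode == 'RA':
        return t_output - time_step, t_output + time_step
    if mode == 'LDB':
        return t_output - 2, t_output - 1
    raise ValueError(f'unknown mode: {mode}')

assert select_input_times('RA', t_output=2, time_step=1) == (1, 3)
assert select_input_times('RA', t_output=8, time_step=4) == (4, 12)
assert select_input_times('LDB', t_output=3) == (1, 2)
```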
FIG. 12 illustrates a flow chart of an exemplary method 1200 of video encoding, according to some embodiments of the present disclosure. Method 1200 may be performed by an apparatus, e.g., encoder 101, decoder 201, video-coding network 300, URFS model 316, image-feature pyramids network 501, progressive-recursive motion-estimation network 503, and/or picture-generation network 505. Method 1200 may include operations 1202-1212 as described below. It is understood that some of the operations may be optional, and some of the operations may be performed simultaneously or in a different order than shown in FIG. 12.
At 1202, the apparatus may obtain a first input picture associated with a first time and a second input picture associated with a second time that form a first image-pyramid pair. For example, referring to FIG. 5, image-feature pyramids network 501 may receive a pair of input pictures associated with a first resolution, which may be used as a first image-pyramid pair 507a. The input pictures may be obtained from a decoded picture buffer, as shown in FIG. 3.
At 1204, the apparatus may generate a second image-pyramid pair by performing a first filtering of the first image-pyramid pair. For example, referring to FIG. 5, bicubic downsampling (or other filtering) of first image-pyramid pair 507a may be performed to obtain a second image-pyramid pair 507b of a second resolution lower than the first resolution.
At 1206, the apparatus may generate a third image-pyramid pair by performing a second filtering of the second image-pyramid pair. For example, referring to FIG. 5, bicubic downsampling (or other filtering) of the second image-pyramid pair 507b may be performed to obtain a third image-pyramid pair 507c, and so on.
At 1208, the apparatus may generate a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network. For example, referring to FIG. 6, level-5 feature extractor 502a includes five levels of convolutional-layer pairs (e.g., either 3x3 convolutional layers, stride 2 (s=2) 3x3 convolutional layers, or 2D convolutional layers), level-4 feature extractor 502b includes four levels of convolutional-layer pairs (e.g., either 3x3 convolutional layers, stride 2 (s=2) 3x3 convolutional layers, or 2D convolutional layers), and level-3 feature extractor 502c includes three levels of convolutional-layer pairs (e.g., either 3x3 convolutional layers, stride 2 (s=2) 3x3 convolutional layers, or 2D convolutional layers). The first set of feature maps 509a is generated by the level-5 feature extractor 502a from the first image-pyramid pair 507a, the second set of feature maps 509b is generated by the level-4 feature extractor 502b from the second image-pyramid pair 507b, and the third set of feature maps 509c is generated by the level-3 feature extractor 502c from the third image-pyramid pair 507c.
At 1210, the apparatus may compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps. For example, referring to FIG. 7, the motion-estimation network 503 predicts the bidirectional flows at feature levels with the same resolution, and progressively estimates and upsamples the flows at feature levels with different resolutions. Overall, the motion-estimation network 503 learns residual optical flow vectors and continuously adds them together. For ease of description and illustration, only the feature maps from level-5, level-4, and level-3 are depicted, but it is understood that five sets of feature maps at each resolution may be used for motion estimation in determining the final set of optical flow vectors 513. For instance, the third set of feature maps 509c are input into motion estimation block1 (MEB1) 702. A time step T, which indicates the time step between the images in the image-pyramids, may also be input into MEB1 702. MEB1 702 may include a plurality of 3x3 convolutional layers and a summation operation to generate a first set of optical flow vectors from the third set of feature maps 509c. The first set of optical flow vectors may be input into the warping block 704, MEB1 702, and summation block 706 associated with the level-4 image-pyramids. The second set of feature maps 509b may be input into the corresponding warping block 704, which may warp the second set of feature maps 509b using the first set of optical flow vectors. The warped feature maps may be input into MEB1 702, which generates an intermediate second set of optical flow vectors based on the warped second set of feature maps and the first set of optical flow vectors. Summation block 706 may perform a summation operation using the first set of optical flow vectors and the intermediate second set of optical flow vectors to generate a final second set of optical flow vectors. Motion-estimation network 503 may upsample the final second set of optical flow vectors. The upsampled second set of optical flow vectors may be input into the warping block 704, the motion estimation block2 (MEB2) 708, and the summation block 706 associated with the first set of feature maps 509a. The first set of feature maps 509a may be input into the corresponding warping block 704, which may warp the first set of feature maps 509a using the upsampled second set of optical flow vectors. The warped feature maps may be input into MEB2 708, which generates an intermediate third set of optical flow vectors based on the warped first set of feature maps and the upsampled second set of optical flow vectors. Summation block 706 may perform a summation operation using the upsampled second set of optical flow vectors and the intermediate third set of optical flow vectors to generate a final third set of optical flow vectors. The final third set of optical flow vectors may be input into the warping block 704, the motion estimation block2 (MEB2) 708, and the summation block 706 associated with the next stage, in which the feature maps have been upsampled. As shown in FIG. 7, these operations continue until a final set of optical flow vectors 513 is generated. The list 800 of inputs, outputs, and optical flow vectors used/generated at each stage of motion estimation in FIG. 7 is shown in FIG. 8.
At 1212, the apparatus may generate a reference picture associated with a third time based on the plurality of final optical flow vectors, the third time being between the first time and the second time. For example, referring to FIG. 5, the final set of feature maps 511 may be input into the backward warping block 504 of picture-generation network 505. The final set of optical flow vectors 513 may be input into backward warping block 504 and the feature enhancer 506 of picture-generation network 505. Moreover, the input pictures (first image-pyramid pair 507a) may be input into backward warping block 504. Backward warping block 504 may warp the input pictures based on the final set of optical flow vectors 513. The warped pictures may be input into feature enhancer 506, which enhances the features of the warped pictures using the final set of feature maps 511. The enhanced warped pictures are input into picture synthesizer 508, which synthesizes the enhanced warped pictures into output picture 515 (e.g., a reference picture). Both feature enhancer 506 and picture synthesizer 508 include a U-net architecture as their basic structure and are equipped with depthwise-separable convolution (DSC) blocks, as shown in FIG. 9.
In various aspects of the present disclosure, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as instructions on a non-transitory computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a processor, such as processor 102 in FIGs. 1 and 2. By way of example, and not limitation, such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, HDD, such as magnetic disk storage or other magnetic storage devices, Flash drive, SSD, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a processing system, such as a mobile device or a computer. Disk and disc, as used herein, include CD, laser disc, optical disc, digital video disc (DVD) , and floppy disk where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
According to one aspect of the present disclosure, a method of video encoding or decoding is provided. The method may include obtaining, by a processor, a first input picture associated with a first time and a second input picture associated with a second time. The first  input picture and the second input picture may form a first image-pyramid pair. The method may include generating, by the processor, a second image-pyramid pair by performing a first filtering of the first image-pyramid pair. The first image-pyramid pair and the second image-pyramid pair may be associated with different resolutions. The method may include generating, by the processor, a third image-pyramid pair by performing a second filtering of the second image-pyramid pair. The first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution. The method may include obtaining, by the processor, a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network. The first set of feature maps, the second set of feature maps, and a third set of feature maps may be associated with a same resolution. The method may include computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps. The method may include generating, by the processor, a reference picture associated with a third time based on the plurality of final optical flow vectors. The third time may be between the first time and the second time or after the first time and the second time.
In some implementations, the computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the third set of feature maps into a first motion estimation block. In some implementations, the computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, a first set of optical flow vectors associated with the third set of feature maps using the first motion estimation block. In some implementations, the computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the first set of optical flow vectors into a first warping block, a second motion estimation block, and a first summation block.
In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the second set of feature maps into the first warping block. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, a second set of optical flow vectors associated with the second set of feature maps using the first warping block, the second motion estimation block, and the first summation block. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include upsampling, by the processor, the second set of optical flow vectors to generate an upsampled second set of optical flow vectors. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the upsampled second set of optical flow vectors into a second warping block, a third motion estimation block, and a second summation block.
In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include  inputting, by the processor, the first set of feature maps into the second warping block. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, a third set of optical flow vectors associated with the first set of feature maps using the second warping block, the third motion estimation block, and the second summation block. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the third set of optical flow vectors into a third warping block, a fourth motion estimation block, and a third summation block.
In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include upsampling, by the processor, the first set of feature maps, the second set of feature maps, and the third set of feature maps to generate an upsampled first set of feature maps, an upsampled second set of feature maps, and an upsampled third set of feature maps. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the upsampled first set of feature maps into the third warping block. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, a fourth set of optical flow vectors using the third warping block, the fourth motion estimation block, and the third summation block. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the fourth set of optical flow vectors into a fourth warping block, a fifth motion estimation block, and a fourth summation block.
In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the upsampled second set of feature maps into the fourth warping block. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, a fifth set of optical flow vectors using the fourth warping block, the fifth motion estimation block, and the fourth summation block. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include upsampling, by the processor, the fifth set of optical flow vectors to generate an upsampled fifth set of optical flow vectors. In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the upsampled fifth set of optical flow vectors into a fifth warping block, a sixth motion estimation block, and a fifth summation block.
In some implementations, the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include inputting, by the processor, the upsampled third set of feature maps into the fifth warping block. In some implementations, the computing, by the processor, the plurality of final optical flow  vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps may include generating, by the processor, the plurality of final optical flow vectors using the fifth warping block, the sixth motion estimation block, and the fifth summation block.
In some implementations, the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include inputting, by the processor, the first input picture, the second input picture, the upsampled third set of feature maps, and the plurality of final optical flows into a sixth warping block. In some implementations, the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include performing, by the processor, a warping of the first input picture and the second input picture based on the upsampled third set of feature maps and the plurality of final optical flow vectors using the sixth warping block to generate a warped picture. In some implementations, the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include inputting, by the processor, the upsampled third set of feature maps and the warped picture into a feature enhancer block. In some implementations, the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include generating, by the processor, an enhanced warped picture using the feature enhancer block. In some implementations, the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include inputting, by the processor, the enhanced warped picture into a picture synthesizer block. In some implementations, the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors may include generating, by the processor, the reference picture using the picture synthesizer block.
According to another aspect of the present disclosure, an apparatus for video encoding or decoding is provided. The apparatus may include a processor and memory storing instructions. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair. The memory storing instructions, which when executed by the processor, may cause the processor to generate a second image-pyramid pair by performing a first filtering of the first image-pyramid pair. The memory storing instructions, which when executed by the processor, may cause the processor to generate a third image-pyramid pair by performing a second filtering of the second image-pyramid pair. The first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution. The memory storing instructions, which when executed by the processor, may cause the processor to obtain a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network. The first set of feature maps, the second set of feature maps, and a third set of feature maps may be associated with a same resolution. The memory storing instructions, which when executed by the processor, may cause the processor to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps. The memory storing instructions, which when executed by the processor, may cause the processor to generate a reference picture associated with a third time based on the plurality of final optical flow vectors. The third time may be between the first time and the second time or after the first time and the second time.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the third set of feature maps into a  first motion estimation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to generate a first set of optical flow vectors associated with the third set of feature maps using the first motion estimation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the first set of optical flow vectors into a first warping block, a second motion estimation block, and a first summation block.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the second set of feature maps into the first warping block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to generate a second set of optical flow vectors associated with the second set of feature maps using the first warping block, the second motion estimation block, and the first summation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to upsample the second set of optical flow vectors to generate an upsampled second set of optical flow vectors. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled second set of optical flow vectors into a second warping block, a third motion estimation block, and a second summation block.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the first set of feature maps into the second warping block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to generate a third set of optical flow vectors associated with the first set of feature maps using the second warping block, the third motion estimation block, and the second summation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the third set of optical flow vectors into a third warping block, a fourth motion estimation block, and a third summation block.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to upsample the first set of feature maps, the second set of feature maps, and the third set of feature maps to generate an upsampled first set of feature maps, an upsampled second set of feature maps, and an upsampled third set of feature maps. In some implementations, to compute a plurality of final optical flow vectors by performing  progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled first set of feature maps into the third warping block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to generate a fourth set of optical flow vectors using the third warping block, the fourth motion estimation block, and the third summation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the fourth set of optical flow vectors into a fourth warping block, a fifth motion estimation block, and a fourth summation block.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled second set of feature maps into the fourth warping block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to generate a fifth set of optical flow vectors using the fourth warping block, the fifth motion estimation block, and the fourth summation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to upsample the fifth set of optical flow vectors to generate an upsampled fifth set of optical flow vectors. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled fifth set of optical flow vectors into a fifth warping block, a sixth motion estimation block, and a fifth summation block.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled third set of feature maps into the fifth warping block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, may cause the processor to generate the plurality of final optical flow vectors using the fifth warping block, the sixth motion estimation block, and the fifth summation block.
In some implementations, to generate the reference frame associated with the third time based on the plurality of final optical flow vectors, the memory storing instructions, which when executed by the processor, may cause the processor to input the first input picture, the second input picture, the upsampled third set of feature maps, and the plurality of final optical flows into a sixth warping block. In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the memory storing instructions, which when executed by the processor, may cause the processor to perform a warping of the first input picture and the second input picture based on the upsampled third set of feature maps and the plurality of final optical flow vectors using the sixth warping block to generate a warped picture.  In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the memory storing instructions, which when executed by the processor, may cause the processor to input the upsampled third set of feature maps and the warped picture into a feature enhancer block. In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the memory storing instructions, which when executed by the processor, may cause the processor to generate an enhanced warped picture using the feature enhancer block. In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the memory storing instructions, which when executed by the processor, may cause the processor to input the enhanced warped picture into a picture synthesizer block. In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the memory storing instructions, which when executed by the processor, may cause the processor to generate the reference picture using the picture synthesizer block.
According to a further aspect of the present disclosure, a non-transitory computer-readable medium storing instructions is provided. The instructions, which when executed by the processor, may cause the processor to obtain a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair. The instructions, which when executed by the processor, may cause the processor to generate a second image-pyramid pair by performing a first filtering of the first image-pyramid pair. The instructions, which when executed by the processor, may cause the processor to generate a third image-pyramid pair by performing a second filtering of the second image-pyramid pair. The first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each may be associated with a different resolution. The instructions, which when executed by the processor, may cause the processor to obtain a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network. The first set of feature maps, the second set of feature maps, and a third set of feature maps may be associated with a same resolution. The instructions, which when executed by the processor, may cause the processor to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps. The instructions, which when executed by the processor, may cause the processor to generate a reference picture associated with a third time based on the plurality of final optical flow vectors. The third time may be between the first time and the second time or after the first time and the second time.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the third set of feature maps into a first motion estimation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to generate a first set of optical flow vectors associated with the third set of feature maps using the first motion estimation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the first set of optical flow vectors into a first warping block, a second motion estimation block, and a first summation block.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second  set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the second set of feature maps into the first warping block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to generate a second set of optical flow vectors associated with the second set of feature maps using the first warping block, the second motion estimation block, and the first summation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to upsample the second set of optical flow vectors to generate an upsampled second set of optical flow vectors. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the upsampled second set of optical flow vectors into a second warping block, a third motion estimation block, and a second summation block.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the first set of feature maps into the second warping block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to generate a third set of optical flow vectors associated with the first set of feature maps using the second warping block, the third motion estimation block, and the second summation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the third set of optical flow vectors into a third warping block, a fourth motion estimation block, and a third summation block.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to upsample the first set of feature maps, the second set of feature maps, and the third set of feature maps to generate an upsampled first set of feature maps, an upsampled second set of feature maps, and an upsampled third set of feature maps. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the upsampled first set of feature maps into the third warping block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to generate a fourth set of optical flow vectors using the third warping block, the fourth motion estimation block, and the third summation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the fourth set of optical flow vectors into a fourth warping block, a fifth motion estimation block, and a fourth summation block.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the upsampled second set of feature maps into the fourth warping block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to generate a fifth set of optical flow vectors using the fourth warping block, the fifth motion estimation block, and the fourth summation block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to upsample the fifth set of optical flow vectors to generate an upsampled fifth set of optical flow vectors. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the upsampled fifth set of optical flow vectors into a fifth warping block, a sixth motion estimation block, and a fifth summation block.
In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to input the upsampled third set of feature maps into the fifth warping block. In some implementations, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, may cause the processor to generate the plurality of final optical flow vectors using the fifth warping block, the sixth motion estimation block, and the fifth summation block.
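The warp / estimate / sum pattern described above can be summarized in code. The sketch below is a simplified illustration only: the backward-warping formulation via grid_sample, the layer sizes of the MotionEstimator stand-in, and the split of each feature set into a reference part and a source part are assumptions, since the disclosure names the blocks and the order in which feature sets and flow fields pass through them but not their internals. The `upsample_flow` helper is the one sketched earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def backward_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warping block: sample `feat` at positions displaced by `flow` ([N, 2, H, W])."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)   # [2, H, W]
    coords = base.unsqueeze(0) + flow                             # [N, 2, H, W]
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    x_norm = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    y_norm = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((x_norm, y_norm), dim=-1)                  # [N, H, W, 2]
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

class MotionEstimator(nn.Module):
    """Hypothetical motion estimation block: predicts a residual flow field."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + 2, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1),
        )

    def forward(self, ref_feat, warped_feat, flow):
        return self.net(torch.cat((ref_feat, warped_feat, flow), dim=1))

def refine(flow, ref_feat, src_feat, estimator):
    """One warping block + motion estimation block + summation block step."""
    warped = backward_warp(src_feat, flow)            # warping block
    residual = estimator(ref_feat, warped, flow)      # motion estimation block
    return flow + residual                            # summation block

# Chaining order sketched from the description (feature tensors assumed given,
# flow1 produced by the first motion estimation block from the third feature set):
#   flow2 = refine(flow1, f2_ref, f2_src, me2)                 # second set of features
#   flow3 = refine(upsample_flow(flow2), f1_ref, f1_src, me3)  # first set of features
#   flow4 = refine(flow3, f1up_ref, f1up_src, me4)             # upsampled first set
#   flow5 = refine(flow4, f2up_ref, f2up_src, me5)             # upsampled second set
#   final = refine(upsample_flow(flow5), f3up_ref, f3up_src, me6)
```

The design intent conveyed by the description is coarse-to-fine refinement: each summation block adds a small residual flow to the flow carried forward from the previous level, so later motion estimation blocks only need to correct remaining misalignment.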
In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the instructions, which when executed by the processor, may cause the processor to input the first input picture, the second input picture, the upsampled third set of feature maps, and the plurality of final optical flow vectors into a sixth warping block. In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the instructions, which when executed by the processor, may cause the processor to perform a warping of the first input picture and the second input picture based on the upsampled third set of feature maps and the plurality of final optical flow vectors using the sixth warping block to generate a warped picture. In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the instructions, which when executed by the processor, may cause the processor to input the upsampled third set of feature maps and the warped picture into a feature enhancer block. In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the instructions, which when executed by the processor, may cause the processor to generate an enhanced warped picture using the feature enhancer block. In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the instructions, which when executed by the processor, may cause the processor to input the enhanced warped picture into a picture synthesizer block. In some implementations, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the instructions, which when executed by the processor, may cause the processor to generate the reference picture using the picture synthesizer block.
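As a companion to the previous sketch, the following hedged illustration covers the synthesis stage. It assumes the plurality of final optical flow vectors provides one flow field toward each input picture, blends the two warped pictures with equal weights, and uses small convolutional stand-ins for the feature enhancer block and the picture synthesizer block; none of these choices are dictated by the disclosure. The `backward_warp` helper is the one defined in the previous sketch.

```python
import torch
import torch.nn as nn

class FeatureEnhancer(nn.Module):
    """Hypothetical feature enhancer block: refines the warped picture using pyramid features."""
    def __init__(self, feat_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_channels + 3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, warped, feats):
        return warped + self.net(torch.cat((warped, feats), dim=1))

class PictureSynthesizer(nn.Module):
    """Hypothetical picture synthesizer block: maps the enhanced picture to the reference picture."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, enhanced):
        return torch.clamp(enhanced + self.net(enhanced), 0.0, 1.0)

def synthesize_reference(pic0, pic1, flow_t0, flow_t1, feats, enhancer, synthesizer):
    # Sixth warping block: warp both input pictures toward the target time and
    # blend them (the equal-weight blend is an assumption).
    warped = 0.5 * backward_warp(pic0, flow_t0) + 0.5 * backward_warp(pic1, flow_t1)
    enhanced = enhancer(warped, feats)        # feature enhancer block
    return synthesizer(enhanced)              # picture synthesizer block
```

Under these assumptions, the output of `synthesize_reference` plays the role of the reference picture associated with the third time, which the codec can then treat like any other reference picture.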
The foregoing description of the embodiments will so reveal the general nature of the present disclosure that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such embodiments, without undue experimentation, without departing from the general concept of the present disclosure. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present disclosure as contemplated by the inventor (s) , and thus, are not intended to limit the present disclosure and the appended claims in any way.
Various functional blocks, modules, and steps are disclosed above. The arrangements provided are illustrative and without limitation. Accordingly, the functional blocks, modules, and steps may be reordered or combined in different ways than in the examples provided above. Likewise, some embodiments include only a subset of the functional blocks, modules, and steps, and any such subset is permitted.
The breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (24)

  1. A method of video encoding or decoding, comprising:
    obtaining, by a processor, a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair;
    generating, by the processor, a second image-pyramid pair by performing a first filtering of the first image-pyramid pair;
    generating, by the processor, a third image-pyramid pair by performing a second filtering of the second image-pyramid pair, the first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each being associated with a different resolution;
    obtaining, by the processor, a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network, the first set of feature maps, the second set of feature maps, and the third set of feature maps being associated with a same resolution;
    computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps; and
    generating, by the processor, a reference picture associated with a third time based on the plurality of final optical flow vectors, the third time being between the first time and the second time or after the first time and the second time.
  2. The method of claim 1, wherein the computing, by the processor, a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps comprises:
    inputting, by the processor, the third set of feature maps into a first motion estimation block;
    generating, by the processor, a first set of optical flow vectors associated with the third set of feature maps using the first motion estimation block; and
    inputting, by the processor, the first set of optical flow vectors into a first warping block, a second motion estimation block, and a first summation block.
  3. The method of claim 2, wherein the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps comprises:
    inputting, by the processor, the second set of feature maps into the first warping block;
    generating, by the processor, a second set of optical flow vectors associated with the second set of feature maps using the first warping block, the second motion estimation block, and the first summation block;
    upsampling, by the processor, the second set of optical flow vectors to generate an upsampled second set of optical flow vectors; and
    inputting, by the processor, the upsampled second set of optical flow vectors into a second warping block, a third motion estimation block, and a second summation block.
  4. The method of claim 3, wherein the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps comprises:
    inputting, by the processor, the first set of feature maps into the second warping block;
    generating, by the processor, a third set of optical flow vectors associated with the first set of feature maps using the second warping block, the third motion estimation block, and the second summation block; and
    inputting, by the processor, the third set of optical flow vectors into a third warping block, a fourth motion estimation block, and a third summation block.
  5. The method of claim 4, wherein the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps comprises:
    upsampling, by the processor, the first set of feature maps, the second set of feature maps, and the third set of feature maps to generate an upsampled first set of feature maps, an upsampled second set of feature maps, and an upsampled third set of feature maps;
    inputting, by the processor, the upsampled first set of feature maps into the third warping block;
    generating, by the processor, a fourth set of optical flow vectors using the third warping block, the fourth motion estimation block, and the third summation block; and
    inputting, by the processor, the fourth set of optical flow vectors into a fourth warping block, a fifth motion estimation block, and a fourth summation block.
  6. The method of claim 5, wherein the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps comprises:
    inputting, by the processor, the upsampled second set of feature maps into the fourth warping block;
    generating, by the processor, a fifth set of optical flow vectors using the fourth warping block, the fifth motion estimation block, and the fourth summation block;
    upsampling, by the processor, the fifth set of optical flow vectors to generate an upsampled fifth set of optical flow vectors; and
    inputting, by the processor, the upsampled fifth set of optical flow vectors into a fifth warping block, a sixth motion estimation block, and a fifth summation block.
  7. The method of claim 6, wherein the computing, by the processor, the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps comprises:
    inputting, by the processor, the upsampled third set of feature maps into the fifth warping block; and
    generating, by the processor, the plurality of final optical flow vectors using the fifth warping block, the sixth motion estimation block, and the fifth summation block.
  8. The method of claim 7, wherein the generating, by the processor, the reference picture associated with the third time based on the plurality of final optical flow vectors comprises:
    inputting, by the processor, the first input picture, the second input picture, the upsampled third set of feature maps, and the plurality of final optical flow vectors into a sixth warping block;
    performing, by the processor, a warping of the first input picture and the second input picture based on the upsampled third set of feature maps and the plurality of final optical flow vectors using the sixth warping block to generate a warped picture;
    inputting, by the processor, the upsampled third set of feature maps and the warped picture into a feature enhancer block;
    generating, by the processor, an enhanced warped picture using the feature enhancer block;
    inputting, by the processor, the enhanced warped picture into a picture synthesizer block; and
    generating, by the processor, the reference picture using the picture synthesizer block.
  9. An apparatus for video encoding or decoding, comprising:
    a processor; and
    memory storing instructions, which when executed by the processor, cause the processor to:
    obtain a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair;
    generate a second image-pyramid pair by performing a first filtering of the first image-pyramid pair;
    generate a third image-pyramid pair by performing a second filtering of the second image-pyramid pair, the first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each being associated with a different resolution;
    obtain a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network, the first set of feature maps, the second set of feature maps, and the third set of feature maps being associated with a same resolution;
    compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps; and
    generate a reference picture associated with a third time based on the plurality of final optical flow vectors, the third time being between the first time and the second time or after the first time and the second time.
  10. The apparatus of claim 9, wherein, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, cause the processor to:
    input the third set of feature maps into a first motion estimation block;
    generate a first set of optical flow vectors associated with the third set of feature maps using the first motion estimation block; and
    input the first set of optical flow vectors into a first warping block, a second motion estimation block, and a first summation block.
  11. The apparatus of claim 10, wherein, to compute the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, cause the processor to:
    input the second set of feature maps into the first warping block;
    generate a second set of optical flow vectors associated with the second set of feature maps using the first warping block, the second motion estimation block, and the first summation block;
    upsample the second set of optical flow vectors to generate an upsampled second set of optical flow vectors; and
    input the upsampled second set of optical flow vectors into a second warping block, a third motion estimation block, and a second summation block.
  12. The apparatus of claim 11, wherein, to compute the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, cause the processor to:
    input the first set of feature maps into the second warping block;
    generate a third set of optical flow vectors associated with the first set of feature maps using the second warping block, the third motion estimation block, and the second summation block; and
    input the third set of optical flow vectors into a third warping block, a fourth motion estimation block, and a third summation block.
  13. The apparatus of claim 12, wherein, to compute the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, cause the processor to:
    upsample the first set of feature maps, the second set of feature maps, and the third set of feature maps to generate an upsampled first set of feature maps, an upsampled second set of feature maps, and an upsampled third set of feature maps;
    input the upsampled first set of feature maps into the third warping block;
    generate a fourth set of optical flow vectors using the third warping block, the fourth motion estimation block, and the third summation block; and
    input the fourth set of optical flow vectors into a fourth warping block, a fifth motion estimation block, and a fourth summation block.
  14. The apparatus of claim 13, wherein, to compute the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, cause the processor to:
    input the upsampled second set of feature maps into the fourth warping block;
    generate a fifth set of optical flow vectors using the fourth warping block, the fifth motion estimation block, and the fourth summation block;
    upsample the fifth set of optical flow vectors to generate an upsampled fifth set of optical flow vectors; and
    input the upsampled fifth set of optical flow vectors into a fifth warping block, a sixth motion estimation block, and a fifth summation block.
  15. The apparatus of claim 14, wherein, to compute the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the memory storing instructions, which when executed by the processor, cause the processor to:
    input the upsampled third set of feature maps into the fifth warping block; and
    generate the plurality of final optical flow vectors using the fifth warping block, the sixth motion estimation block, and the fifth summation block.
  16. The apparatus of claim 15, wherein, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the memory storing instructions, which when executed by the processor, cause the processor to:
    input the first input picture, the second input picture, the upsampled third set of feature maps, and the plurality of final optical flow vectors into a sixth warping block;
    perform a warping of the first input picture and the second input picture based on the upsampled third set of feature maps and the plurality of final optical flow vectors using the sixth warping block to generate a warped picture;
    input the upsampled third set of feature maps and the warped picture into a feature enhancer block;
    generate an enhanced warped picture using the feature enhancer block;
    input the enhanced warped picture into a picture synthesizer block; and
    generate the reference picture using the picture synthesizer block.
  17. A non-transitory computer-readable medium storing instructions, which when executed by a processor, cause the processor to:
    obtain a first input picture associated with a first time and a second input picture associated with a second time, the first input picture and the second input picture forming a first image-pyramid pair;
    generate a second image-pyramid pair by performing a first filtering of the first image-pyramid pair;
    generate a third image-pyramid pair by performing a second filtering of the second image-pyramid pair, the first image-pyramid pair, the second image-pyramid pair, and the third image-pyramid pair each being associated with a different resolution;
    obtain a first set of feature maps by inputting the first image-pyramid pair into a first feature-extraction network, a second set of feature maps by inputting the second image-pyramid pair into a second feature-extraction network, and a third set of feature maps by inputting the third image-pyramid pair into a third feature-extraction network, the first set of feature maps, the second set of feature maps, and the third set of feature maps being associated with a same resolution;
    compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps; and
    generate a reference picture associated with a third time based on the plurality of final optical flow vectors, the third time being between the first time and the second time or after the first time and the second time.
  18. The non-transitory computer-readable medium of claim 17, wherein, to compute a plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, cause the processor to:
    input the third set of feature maps into a first motion estimation block;
    generate a first set of optical flow vectors associated with the third set of feature maps using the first motion estimation block; and
    input the first set of optical flow vectors into a first warping block, a second motion estimation block, and a first summation block.
  19. The non-transitory computer-readable medium of claim 18, wherein, to compute the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, cause the processor to:
    input the second set of feature maps into the first warping block;
    generate a second set of optical flow vectors associated with the second set of feature maps using the first warping block, the second motion estimation block, and the first summation block;
    upsample the second set of optical flow vectors to generate an upsampled second set of optical flow vectors; and
    input the upsampled second set of optical flow vectors into a second warping block, a third motion estimation block, and a second summation block.
  20. The non-transitory computer-readable medium of claim 19, wherein, to compute the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, cause the processor to:
    input the first set of feature maps into the second warping block;
    generate a third set of optical flow vectors associated with the first set of feature maps using the second warping block, the third motion estimation block, and the second summation block; and
    input the third set of optical flow vectors into a third warping block, a fourth motion estimation block, and a third summation block.
  21. The non-transitory computer-readable medium of claim 20, wherein, to compute the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, cause the processor to:
    upsample the first set of feature maps, the second set of feature maps, and the third set of feature maps to generate an upsampled first set of feature maps, an upsampled second set of feature maps, and an upsampled third set of feature maps;
    input the upsampled first set of feature maps into the third warping block;
    generate a fourth set of optical flow vectors using the third warping block, the fourth motion estimation block, and the third summation block; and
    input the fourth set of optical flow vectors into a fourth warping block, a fifth motion estimation block, and a fourth summation block.
  22. The non-transitory computer-readable medium of claim 21, wherein, to compute the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, cause the processor to:
    input the upsampled second set of feature maps into the fourth warping block;
    generate a fifth set of optical flow vectors using the fourth warping block, the fifth motion estimation block, and the fourth summation block;
    upsample the fifth set of optical flow vectors to generate an upsampled fifth set of optical flow vectors; and
    input the upsampled fifth set of optical flow vectors into a fifth warping block, a sixth motion estimation block, and a fifth summation block.
  23. The non-transitory computer-readable medium of claim 22, wherein, to compute the plurality of final optical flow vectors by performing progressive-recursive motion estimation using the first set of feature maps, the second set of feature maps, and the third set of feature maps, the instructions, which when executed by the processor, cause the processor to:
    input the upsampled third set of feature maps into the fifth warping block; and
    generate the plurality of final optical flow vectors using the fifth warping block, the sixth motion estimation block, and the fifth summation block.
  24. The non-transitory computer-readable medium of claim 23, wherein, to generate the reference picture associated with the third time based on the plurality of final optical flow vectors, the instructions, which when executed by the processor, cause the processor to:
    input the first input picture, the second input picture, the upsampled third set of feature maps, and the plurality of final optical flow vectors into a sixth warping block;
    perform a warping of the first input picture and the second input picture based on the upsampled third set of feature maps and the plurality of final optical flow vectors using the sixth warping block to generate a warped picture;
    input the upsampled third set of feature maps and the warped picture into a feature enhancer block;
    generate an enhanced warped picture using the feature enhancer block;
    input the enhanced warped picture into a picture synthesizer block; and
    generate the reference picture using the picture synthesizer block.